State of the [CROW]

Wednesday, March 4


Hope all are staying dry in the wet weather. Just doing a quick (“scrum”) check-in to sum up the past week, identify some of the issues impeding our progress, and outline our next steps. I will send these weekly/biweekly, so these posts should be a good opportunity for everyone to catch up if you weren’t involved in all of the past discussions - and yes, future ones will be shorter. Here we go.

What’s new?

Data exploration
Many of you have already started exploring and testing the state of the current data. @mattalhonte did some preliminary analysis of the blurb that is the “Additional description” text using NLTK, looking at the structure and analyzing phrases that repeat themselves. @csudama has also been active and has converted the data into JSON, which might be easier to feed to your parsers.
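For anyone who wants to poke at the repeated phrases themselves, here is a minimal sketch of the idea using only the standard library (NLTK offers richer tools, but this shows the gist; the blurbs below are invented stand-ins for the real field):

```python
from collections import Counter

# Hypothetical sample blurbs, standing in for the "Additional description" field.
blurbs = [
    "public hearing will be held at 22 Reade Street",
    "public hearing will be held at 1 Centre Street",
    "notice is hereby given that a public hearing will be held",
]

def bigrams(text):
    """Lowercase, split on whitespace, and yield adjacent word pairs."""
    words = text.lower().split()
    return zip(words, words[1:])

counts = Counter(pair for blurb in blurbs for pair in bigrams(blurb))

# The most common pairs hint at boilerplate phrases worth parsing out.
for pair, n in counts.most_common(3):
    print(" ".join(pair), n)
```

Phrases that recur across many notices (e.g. “public hearing will be held at …”) are strong candidates for anchoring a parser.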

Schema standardization talk
Last week I spoke with the good people at - a government schema standardization initiative - and got some valuable feedback on our proposed Public Notice Data Standard. I have integrated some of these notes under the Schema board on

Leading the way
Representing Commune [kuh-myoon], I, Mikael, the undersigned, have since CodeAcross taken up a bigger role on the CROW project and will lead the community development going forward. There are many overlaps between CROW and Commune (standardizing schemas, structuring notice publication, making public meetings more accessible, etc.), and we will be dedicating a substantial amount of our time to taking a lead role in contributing to the realization of CROW’s objectives. The successful implementation of these will benefit us and any other person, startup, or organization interested in structured public notice information.

Timeline - starting simple
As many of you may have seen, @dclark added a proposed timeline for how we can proceed with parsing the objects. The idea is to first set up the testing suite and an MVP library that “kind of works”. With this in place, we can start adding incremental changes that bring us better results and closer to our goals.

We like the approach, and I have added it to the address card on Trello as a suggestion, but those working on other parsers are also welcome to draw inspiration from it. (Be sure to read @dclark’s original post though, as my rendition does not do it proper justice.)

Do you have issues or are you stuck? Add it to the issue tracker. There is a lot of expertise on this team - you will get an answer.

What’s next?

Important! Break your task down this week
That is, define the user story/goal, and list 3 or more steps that you will need to accomplish that goal. This will help the others know what you are thinking, and let them offer their expertise on individual points. Feel free to add steps if your card already has an approach - this will only spur discussion. See the schema or the address card for examples. I will come back to you to discuss some expected delivery dates.

The hacknights are a perfect opportunity to take some of your well-developed ideas and hack them out in real life. We won’t do weekly full-team meetings, but you are free, nay urged, to coordinate with the other person(s) on your card on Trello and let them know that you’ll be going. (E.g. commenting “Going to hacknight tonight” on a card is enough to notify the people on your team about what you are doing.) Also, you are encouraged to meet with your partner somewhere else convenient, which I know some of you have done already. All we ask is that you quickly sum up the takeaways from the meeting.

Good data card
After last week’s discussions, I’m creating a new issue on GitHub dedicated to building a pipeline to connect, and massage, the DCAS data output so it’s clean and easily accessible by parsing libraries.

Since all the libraries depend on some level of clean data, I thought we could perhaps centralize at least some of this effort. Anyone interested in taking the lead on this? All are welcome to - and should probably - contribute their thoughts on what they want the data to look like. This group will also be an important voice in the discussions with the DCAS team.

We will move a substantial part of our issue tracking and development management to GitHub. More on this, along with timelines, during the rest of next week.

And finally…Watch
Make sure you are “Watching” the City Record Workgroup on Talk. That’s the only way to make sure you don’t miss any developments, or my engaging emails. I will be following up with each of you individually in the next few days to make sure no one is in the dark.

That’s it for today! Enjoy hacknight, those of you who are going, and you will hear back from me later this week (albeit in shorter form) to discuss the delivery timetables…



Is there a list of the different categories for the entries? Because if not, then I guess a useful next step on the way to building a classifier would be to do some exploratory analysis and figure out what our different classes of entry will be.

If there isn’t already a list of categories to work with, over the course of the next week I’ll play with some cluster analyses to figure out the kinds of entries we’ve got.
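As a rough illustration of the kind of exploratory grouping Matt is describing, here is a dependency-free sketch that buckets notice titles by word overlap (Jaccard similarity). The titles and the 0.5 threshold are made up for illustration; a real pass would more likely use TF-IDF vectors and something like scikit-learn’s KMeans:

```python
def jaccard(a, b):
    """Similarity between two word sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# Invented example titles, standing in for real City Record entries.
titles = [
    "Liquor License Hearing",
    "Liquor License Renewal Hearing",
    "Rezoning Meeting",
    "Rezoning Public Meeting",
    "Information Session",
]

clusters = []  # each cluster is a list of (title, word_set) pairs
for title in titles:
    words = set(title.lower().split())
    for cluster in clusters:
        # Greedily join the first cluster whose seed title is similar enough.
        if jaccard(words, cluster[0][1]) >= 0.5:
            cluster.append((title, words))
            break
    else:
        clusters.append([(title, words)])

for cluster in clusters:
    print([t for t, _ in cluster])
```

Even this crude pass separates hearings, meetings, and sessions, which is the sort of category inventory a classifier would need.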

Hi Matt. Do you mean the types of categories that we are extracting? Like ‘addresses’, ‘subjects’, etc.? Or are you talking about the values within these?

In either case, can you give me an example?

Here is the unified header schema that shows most of the objects to be extracted, if that helps. (It will be updated, so don’t take it as the final doc.)



I was wondering about the values.

“Liquor License Hearing”, “Rezoning Meeting”, “Information Session”

Looks like the schema’s already got those, though!

Wednesday, April 22


Spring is here and CROW is more active than ever. If you have been contemplating whether/when to join, now is the perfect time to get off your stool and start hacking!

What’s new?
Since the last update, we have had two meetings with DCAS, and more are to come. In these meetings we have reiterated our shared goals and, just last week, started the technical collaboration, where we showcased our parser MVP and discussed next steps. Interested in joining the next meeting with the City? All are welcome! Join our next weekly sync-up for details on the when, where, and what.

What are we doing?

The diagram sums up the current implementation proposal of what we are working on. Our goal for this phase is to develop and facilitate the implementation of the schema and parsers above, effectively being a crucial part of the conversation on how the City publishes the City Record as open and structured data. Yes, pretty exciting, and you can join too!

What have we done in the last month?

Thanks to @cds and his introduction of Python notebooks and a new development framework, we now have a much better understanding of the data we’re dealing with and where we should start our efforts. We are approaching the parsing tasks based on Agency/NoticeType, and the Python notebooks make it easier to follow the breakdown of the task at hand. For instance, this shows a very useful breakdown of Solicitation (procurement) adverts coming from DCAS.

We also created an initial MVP of the parsers that was showcased to DCAS. Our current task is to create address and date parsers, and you can help and contribute to these tasks here.
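To give a flavor of what the date-parsing task involves, here is a minimal sketch for pulling a long-form date out of notice text. The snippet and field format are assumptions for illustration, not the actual MVP code; real notices will need more formats and edge-case handling:

```python
import re
from datetime import datetime

# Hypothetical notice snippet; the real input is the DCAS notice text.
notice = "A public hearing will be held on March 4, 2015 at 10:00 AM."

MONTHS = ("January February March April May June July "
          "August September October November December").split()
DATE_RE = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b"
)

def parse_first_date(text):
    """Return the first 'Month D, YYYY' date in `text` as a datetime, or None."""
    m = DATE_RE.search(text)
    if not m:
        return None
    month, day, year = m.groups()
    return datetime(int(year), MONTHS.index(month) + 1, int(day))

print(parse_first_date(notice))  # 2015-03-04 00:00:00
```

An address parser follows the same pattern: find a reliable textual anchor (street suffixes, borough names), match around it, and normalize the result into a structured field.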

We are also currently moving from representing the idealized schema in spreadsheet format to a JSON Schema. (Thanks to @cds again for helping us close issue #25.)
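For those unfamiliar with JSON Schema, here is a toy slice of what a spreadsheet row of field definitions becomes once expressed that way. The field names are invented for illustration and are not the actual CROL schema; for real validation you would use a dedicated library such as `jsonschema` rather than the tiny check below:

```python
# A toy JSON Schema fragment: each spreadsheet column becomes a typed property,
# and mandatory columns become "required" entries.
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "PublicNotice",
    "type": "object",
    "required": ["agency", "noticeType", "date"],
    "properties": {
        "agency": {"type": "string"},
        "noticeType": {"type": "string"},
        "date": {"type": "string", "format": "date"},
        "address": {"type": "string"},
    },
}

def missing_required(record, schema):
    """Tiny stand-in for a real validator: report absent required fields."""
    return [f for f in schema["required"] if f not in record]

record = {"agency": "DCAS", "noticeType": "Solicitation"}
print(missing_required(record, schema))  # ['date']
```

The win over the spreadsheet is that the schema is machine-checkable: any parser output can be validated against it automatically.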

Next steps

We have spent much of the last month trying to lower the barrier of entry for volunteer contribution to make it easier for everyone to add to the project.

The CROW development can now be found under two repos: CROL-Schema and CROL-Parsing. CROL-Schema aims at developing the optimized public schema through which the City Record will be made available on the open data platform, while CROL-Parsing aims to develop the crucial pipeline that connects DCAS’s internal database with the public schema. We invite you to contribute to both!

Great! I’m in - how do I get (re)started?

  • Find a task: GitHub issues and milestones (Parsing and Schema). These contain the development roadmaps, and it is here that you are most likely to find a task (called an issue) that you can contribute to. Issues are organized under milestones, which identify a broader goal that we are trying to achieve. Take a look and indicate your interest by commenting on an issue you would like to contribute to - each issue should be sufficiently documented for you to get started. We would love your contribution.

  • GitHub Contribution Guidelines. Take a quick look if you’re new to GitHub. Applying best practices makes your contribution much more effective and meaningful.

  • Environment Setup. Most of our code is in Python, and we have created separate guides for how to set up the environment: check here for Parsing and here for Schema. NB! Not all tasks require technical skills, so make sure you check out where you can help even if you’re not technical. (Please note that the technical barrier for participation is quite low on many of the tasks, and we are happy to help you get started.)

Collaboration Framework: How we work

We have updated some of our collaboration tools, and @noneck just posted a very useful post to get you up to speed on that.

While most of our work is online, offline is also a good way of getting things done. The betaNYC hacknights are a perfect opportunity to take some of your well-developed ideas and hack them out in real life. Also, if you are working on an issue with someone, you are encouraged to meet up with your partner somewhere else convenient, which we know some of you have done already. betaNYC now has a community membership at CivicHall, so get in touch with @noneck if you would like to work on the CROW project there. All we ask is that you quickly sum up the takeaways from the meeting and post them on GitHub or here on Talk.

And as always, please don’t hesitate to add questions, updates or comments below.
