Instead of using email to coordinate the group, let’s eat our own dogfood and try to conduct all CROW discussions here.
Here’s the brainstorming doc we used during the unconference session, for everyone’s reference.
Looks like I won’t be able to make it today – code on without me!
Some thoughts after trying to produce a gold standard set of testing data (I’ve annotated a decently large number of rows):
-Lots of addresses are vague. Hearings sometimes pertain to geographical regions rather than specific locations, and those regions can be defined by rectangles of points. Addresses often also have room number information inside them (‘22 Mumble Street, Floor 7, Room Fifty-Six, Borough of the Bronx’), which will need to be reconciled.
-Lots of meetings contain multiple addresses and this is not something the schema manages currently, which means that the gold standard data will need updating when this is fixed. In particular, a meeting tends to happen at place A, with documents available for advance review at place B, and will pertain to places C-M or something. I’m not sure how to address this in the schema; since we’re talking about making sense of text boxes, there may be some real work here.
-An idea that may be useful for future work/parser writers: If you can identify accurately what kind of announcement a thing is, you can almost parse it with array lookups as you would an actual log file. An example is the Notice of Intent to Issue New Solicitation entry type: these are agonizingly structured and probably won’t change soon. They also constitute about 10% of the sample, so killing these dead could be a decent-sized win for very little effort. I haven’t been able to find anything else that’s as large a fraction, but I haven’t been looking very hard. It may also be the case that many of the addresses are contained within highly specialized entry types, so even if these documents are a small fraction of the input, parsing them could be a large fraction of the value.
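To make the last idea concrete, here is a minimal Python sketch of type-based dispatch: identify the entry type first, then hand the text to a parser written for that type. The "Label: value" layout, the field labels, and the names `parse_labeled_lines`, `PARSERS`, and `parse_entry` are all assumptions made for illustration; the real notices would need to be inspected before settling on patterns.

```python
import re

# ASSUMPTION: a structured notice is a series of "Label: value" lines.
# The real Notice of Intent entries may use different labels or layout.
FIELD_PATTERN = re.compile(r"^\s*([A-Za-z /]+?)\s*:\s*(.+?)\s*$")


def parse_labeled_lines(text):
    """Collect 'Label: value' lines into a dict keyed by lowercased label."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            fields[match.group(1).lower()] = match.group(2)
    return fields


# One parser per entry type; types without a parser are left for later work.
PARSERS = {
    "Notice of Intent to Issue New Solicitation": parse_labeled_lines,
}


def parse_entry(entry_type, text):
    """Return parsed fields, or None when no parser exists for this type."""
    parser = PARSERS.get(entry_type)
    if parser is None:
        return None
    return parser(text)
```

Each new highly structured entry type would then only need one more entry in `PARSERS`, and anything unrecognized falls through untouched for a more general parser to handle.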
Keep doing great stuff,
Are you available for a Google Hangout to discuss the gold standard? I also have a few questions about the dataset – 1) is it CSV?
Last night I played around with one of the CSV files: I converted it to JSON and cleaned up the description field by rendering its HTML fragments as plain text. Where could I upload that JSON, in case anyone else finds it useful?
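For anyone who wants to try the same kind of conversion, here is a rough sketch using only the Python standard library. It is not the script described above: the column name `Description`, the set of block-level tags, and the function names are assumptions chosen for illustration.

```python
import csv
import io
import json
from html.parser import HTMLParser

# ASSUMPTION: these tags mark visual breaks worth turning into a space.
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "td"}


class TextExtractor(HTMLParser):
    """Collects the text of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Render an HTML fragment as plain text with collapsed whitespace."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def csv_to_json(csv_text, html_field="Description"):
    """Convert CSV text to a JSON array, cleaning one HTML-bearing column."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get(html_field):
            row[html_field] = html_to_text(row[html_field])
        rows.append(row)
    return json.dumps(rows, indent=2)
```

Calling `csv_to_json(open(path, newline="").read())` on one of the exports would be the obvious way to use it, with `html_field` adjusted to whatever the description column is actually called.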
I also noticed that the “Notice of Intent to Issue New Solicitation” entry type has a very interesting format.
How are the current CSVs collected? Are PDFs and/or websites scraped?
For example, how are the CSVs on GitHub generated? Is the code available?
The CSVs were given to us by DCAS, who generated them internally as an export directly from their DB. I added the unmodified files to the GitHub repo with the label [original]. In other words, no PDFs were scraped, and our work now is focused on this (and upcoming) DB exports. Let me know if you want to know more about the export process, and we can address those questions to DCAS directly.
I’ll let @dclark answer the bulk of this, but to answer the question about where to upload the JSON: I would put it on GitHub in the Sample Database folder. https://github.com/CityOfNewYork/CROL-PDF/tree/master/Sample%20Database
I’m also happy to join the gold standard discussion if I can be of any help. Either way, it would be great if you could post a summary of that conversation here so others can follow.