Looks like I won’t be able to make it today – code on without me!
Some thoughts after trying to produce a gold standard set of testing data (I’ve annotated a decently large number of rows):
-Lots of addresses are vague. Hearings sometimes pertain to geographical regions rather than specific locations, and those regions can be defined by rectangles of points. Addresses also often embed floor and room information (‘22 Mumble Street, Floor 7, Room Fifty-Six, Borough of the Bronx’), which will need to be reconciled.
-Lots of meetings contain multiple addresses, and the schema does not currently handle this, which means that the gold standard data will need updating when this is fixed. In particular, a meeting tends to happen at place A, with documents available for advance review at place B, and it pertains to places C through M. I’m not sure how to represent this in the schema; since we’re talking about making sense of text boxes, there may be some real work here.
-An idea that may be useful for future work/parser writers: If you can accurately identify what kind of announcement an entry is, you can almost parse it with array lookups, as you would an actual log file. An example is the Notice of Intent to Issue New Solicitation entry type: these are agonizingly structured and probably won’t change soon. They also constitute about 10% of the sample, so killing these dead could be a decent-sized win for very little effort. I haven’t been able to find any other entry type that makes up as large a fraction, but I haven’t been looking very hard. It may also be the case that many of the addresses are contained within highly specialized entry types, so even if these documents are a small fraction of the input, parsing them could be a large fraction of the value.
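To make the last idea concrete, here is a rough Python sketch of what I mean by "identify the type, then parse by position." Everything specific in it is an assumption for illustration: the field names, the one-field-per-line layout, and the sample text are made up, and the real Notice of Intent to Issue New Solicitation entries will need their actual layout filled in.

```python
from typing import Callable, Dict, List, Optional

# HYPOTHETICAL layout: assumes each body line holds exactly one field,
# in this order. The real notices may be laid out differently.
SOLICITATION_FIELDS = ["agency", "description", "contact_address"]


def parse_solicitation(lines: List[str]) -> Dict[str, str]:
    """Positional ("array lookup") parse: body line i maps to field i."""
    body = [ln.strip() for ln in lines[1:] if ln.strip()]
    return dict(zip(SOLICITATION_FIELDS, body))


# One specialized parser per recognized entry type, keyed by the
# lowercased first line of the entry (an assumed way to detect the type).
PARSERS: Dict[str, Callable[[List[str]], Dict[str, str]]] = {
    "notice of intent to issue new solicitation": parse_solicitation,
}


def parse_entry(text: str) -> Optional[Dict[str, str]]:
    """Dispatch on entry type; return None when no specialized parser exists."""
    lines = text.strip().splitlines()
    if not lines:
        return None
    parser = PARSERS.get(lines[0].strip().lower())
    if parser is None:
        return None  # fall back to the general-purpose parser here
    return parser(lines)


# Invented sample entry, not taken from the real data.
SAMPLE = """Notice of Intent to Issue New Solicitation
Department of Mumbling
Street resurfacing services
22 Mumble Street, Floor 7, Borough of the Bronx
"""

record = parse_entry(SAMPLE)
```

The point of the dispatch table is that each highly structured entry type gets its own tiny parser, and anything unrecognized returns None and falls through to whatever general approach we end up with.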
Keep doing great stuff,
Dennis