Timeline thoughts

From my perspective, the biggest issue in moving from where we are to a finished product is getting some kind of usable stack up and running, so that people can make incremental changes to it; changing NLP choices is pretty easy once there is something capable of being validated. This is why I would focus on:

(a) Creating a testing suite. What I’ve done so far is pretty basic and probably not all we’d want. It’d be nice to add a facility to write out the tested results, and, since a plain string comparison is unlikely to truly handle the address situation, a way to compare based on GPS coordinates or something similar would be useful.
(b) Creating some basic functionality that kind-of works. Before we can extract addresses with high accuracy, we need something that works at all. The algorithms can be improved after the fact, but building the framework is vital because it lowers the barrier for people to make meaningful contributions.
(c) Figuring out what’s easy and what’s hard. Right now it’s not clear which work has a high value-to-effort ratio, because we don’t really understand what we can and can’t do. The testing suite and gold-standard work should help there.
(d) From an infrastructure perspective, we want to expand from the ends of the pipeline: right now the output is a spreadsheet, and we’d like it to be a database, an API, or a web page; the input is a spreadsheet, and we’d like it to be a PDF, a database, or whatever. Incrementally building outwards will help too.
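On the address-comparison idea in (a), here is a minimal sketch of what a GPS-based comparator for the testing suite could look like. It assumes the extracted and gold addresses have already been geocoded to (lat, lon) pairs; the geocoding step itself, the function names, and the 100 m tolerance are all placeholders, not settled choices:

```python
# Sketch: compare addresses by great-circle distance rather than string equality.
# Assumes addresses are already geocoded to (latitude, longitude) in degrees.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

def addresses_match(extracted, gold, tolerance_m=100):
    """Treat two addresses as equal if their coordinates fall within
    tolerance_m metres of each other, instead of comparing strings."""
    return haversine_m(extracted, gold) <= tolerance_m
```

The nice property of this approach is that two differently formatted strings for the same building still compare equal, and the tolerance can be tuned per test case.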

Roughly, I think you’d want to complete (a) and (b) in the next 2-3 weeks: target having at least a few data extractors working for dates and addresses, and be able to say how good they are. Take another 3 weeks to improve them to the point where they’re not terrible. Then put a database on the end, and you should have something you can build on by the end of April.
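For the date half of the extractors mentioned above, point (b) could start as small as a single regex run through the testing suite. This is a sketch only; the function name and the ISO-dates-only pattern are assumptions, not an agreed design:

```python
# Sketch: a deliberately minimal date extractor, just enough to validate
# against the testing suite. Handles only YYYY-MM-DD; other formats come later.
import re
from datetime import date

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract_dates(text):
    """Return every valid YYYY-MM-DD date found in text, in order."""
    results = []
    for y, m, d in ISO_DATE.findall(text):
        try:
            results.append(date(int(y), int(m), int(d)))
        except ValueError:
            # e.g. 2023-13-01 matches the pattern but isn't a real date
            continue
    return results
```

Even something this crude is enough to produce a first accuracy number, which is the point: once the number exists, improving it becomes an incremental task anyone can pick up.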