A community data portal is not just about publishing datasets; it’s also about the community curating data gathered from various sources.
In the process of curation, we often have to massage data to suit our needs. This usually means writing scripts with tools like R and Python, and sometimes just editing the data by hand. The massaged data often ends up in another database, creating yet another silo disconnected from the data commons.
This leads to people doing the same task over and over again, because even if we decided to republish the massaged data, there was no simple way to document and share which data wrangling operations had been applied to the dataset.
Not anymore, thanks to the updated OpenRefine CKAN extension. OpenRefine, the “Mr. Clean for Data”, has had a CKAN extension since 2011, but it was outdated and needed updating to work with the latest CKAN v3 API.
With OpenRefine+CKAN, you can download data directly from data.beta.nyc, refine it, and then pump the refined data back into the portal, complete with the recipe used to refine the data.
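To give a feel for what the round trip looks like at the API level, here is a minimal Python sketch of the "download" half using CKAN's Action API (v3). It is not how the extension itself is implemented. The `https://` scheme on the portal address, the `speed-humps` dataset id in the usage note, and the helper names are assumptions made for illustration. Pushing refined data back would go through CKAN's `resource_create` or `resource_update` actions with an API key, which is not shown here.

```python
import json
import urllib.parse
import urllib.request

# Portal named in this post; the https scheme is an assumption.
PORTAL = "https://data.beta.nyc"


def action_url(portal, action, **params):
    """Build a CKAN Action API (v3) URL, e.g. .../api/3/action/package_show?id=..."""
    url = f"{portal.rstrip('/')}/api/3/action/{action}"
    query = urllib.parse.urlencode(params)
    return f"{url}?{query}" if query else url


def csv_resource_urls(package_show_response):
    """Return the download URLs of the CSV resources in a package_show response."""
    resources = package_show_response["result"]["resources"]
    return [r["url"] for r in resources if (r.get("format") or "").lower() == "csv"]


def fetch_package(portal, dataset_id):
    """Call package_show over HTTP and decode the JSON reply (needs network access)."""
    with urllib.request.urlopen(action_url(portal, "package_show", id=dataset_id)) as resp:
        return json.load(resp)
```

With network access, something like `csv_resource_urls(fetch_package(PORTAL, "speed-humps"))` would list the CSV files to load into OpenRefine, assuming a dataset with that id exists on the portal.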
For example, the extension was used to refine the speed-humps dataset, splitting the combined latlong field into separate latitude and longitude columns.
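For readers who have not seen this kind of split before, here is a rough Python equivalent of the operation. It is only a sketch and not the OpenRefine recipe itself. The field name `latlong`, the output column names, and the assumed value format (two decimal numbers separated by a comma, optionally wrapped in parentheses) are all guesses for illustration, since the actual layout of the speed-humps data is not shown here.

```python
import re

# Assumed value format: "(40.7128, -74.0060)"; parentheses and spaces are optional.
_LATLONG = re.compile(
    r"^\s*\(?\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)?\s*$"
)


def split_latlong(value):
    """Split a combined lat/long string into (latitude, longitude) floats.

    Returns (None, None) when the value is missing or does not match the
    assumed format, so bad rows can be reviewed by hand later.
    """
    match = _LATLONG.match(value or "")
    if match is None:
        return None, None
    return float(match.group(1)), float(match.group(2))


def add_lat_long_columns(rows, field="latlong"):
    """Return copies of the rows with separate latitude and longitude keys added."""
    refined = []
    for row in rows:
        lat, lon = split_latlong(row.get(field))
        refined.append({**row, "latitude": lat, "longitude": lon})
    return refined
```

Rows that fail to parse keep `None` in both new columns rather than raising an error, which mirrors the usual wrangling habit of flagging problem records for manual cleanup instead of dropping them.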
Here’s a more involved example with the WNYC School Vaccination Data. The dataset only had school names and no location info. It was geocoded using a combination of the Google Places API and the Geoclient API, with some manual edits for schools that neither geocoder could decipher, all fully documented in the recipe.
After the dataset was geocoded and republished in the portal, a viz was quickly created using the integrated CartoDB CKAN plugin.
Mr. Clean Nappa taken from http://budgies.deviantart.com/art/Mr-Clean-Nappa-289512796