OpenRefine +


(Joel Natividad) #1

A community data portal is not just about publishing datasets, it’s also about the community curating data gathered from various sources.

In the process of curation, we often have to massage data to make it suit our needs. This often means writing scripts with tools like R and Python, and sometimes just manually editing the data. The massaged data often ends up in another database, creating yet another silo disconnected from the data commons.

This leads to people doing the same task over and over again, because even if we decided to republish the massaged data, there was no simple way to document and share which data wrangling operations were applied to the dataset.

But not anymore with the updated OpenRefine CKAN extension. OpenRefine, the “Mr. Clean for Data”, has had a CKAN extension since 2011, but it was outdated and needed some updates to make it work with the latest CKAN v3 API.

With OpenRefine+CKAN, you can download data directly from the portal, refine it, and then pump the refined data back into the portal, complete with the recipe used to refine the data.

Check out how the extension was used to refine the speed-humps dataset, splitting the combined latlong field into separate latitude and longitude columns…
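For readers who want to see what that kind of split amounts to outside OpenRefine, here is a minimal Python sketch. It assumes the combined field is a comma-separated "lat, long" pair, optionally wrapped in parentheses; the actual speed-humps field may be formatted differently, and the `location` column name and sample values are made up for illustration.

```python
def split_latlong(value):
    """Split a combined "lat, long" string into a (lat, long) pair of floats.

    Assumption: the pair is comma-separated and may be wrapped in
    parentheses. Adjust the parsing if the real field differs.
    """
    cleaned = value.strip().strip("()")
    lat_text, lon_text = cleaned.split(",", 1)
    return float(lat_text), float(lon_text)


# Made-up sample row; "location" is a hypothetical column name.
rows = [{"location": "(40.7128, -74.0060)"}]

for row in rows:
    # Write the two halves into their own columns.
    row["latitude"], row["longitude"] = split_latlong(row["location"])
```

In OpenRefine itself the same result would come from a column-splitting operation recorded in the recipe, so the step can be replayed on a refreshed copy of the dataset.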

Here’s a more involved example with the WNYC School Vaccination Data. The dataset only had school names and no location info. It was geocoded using a combination of the Google Places API and the Geoclient API, with some manual edits for schools that neither geocoder could decipher - all fully documented in the recipe.
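The "try one geocoder, then another, then fall back to manual edits" pattern described above can be sketched roughly as follows. This is not the recipe that was actually used: the two lookup functions are stand-ins backed by made-up dictionaries rather than real calls to the Google Places API or Geoclient, and all names and coordinates are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Coordinates = Tuple[float, float]
Geocoder = Callable[[str], Optional[Coordinates]]


def geocode_with_fallback(
    name: str,
    geocoders: List[Tuple[str, Geocoder]],
    manual_edits: Dict[str, Coordinates],
) -> Tuple[Optional[Coordinates], str]:
    """Return (coordinates, source) for a school name.

    Tries each geocoder in order, then the manual edits, and records
    which source produced the answer so the step stays documented.
    """
    for label, geocoder in geocoders:
        coords = geocoder(name)
        if coords is not None:
            return coords, label
    if name in manual_edits:
        return manual_edits[name], "manual"
    return None, "unresolved"


# Stand-in lookups (hypothetical data, no network calls).
places_stub = {"School A": (40.71, -74.0)}.get
geoclient_stub = {"School B": (40.72, -73.99)}.get
manual_edits = {"School C": (40.73, -73.98)}

geocoders = [("places", places_stub), ("geoclient", geoclient_stub)]

results = {
    name: geocode_with_fallback(name, geocoders, manual_edits)
    for name in ["School A", "School B", "School C", "School D"]
}
```

Keeping the source label next to each coordinate serves the same purpose as the recipe: anyone reviewing the data can tell which rows were geocoded automatically and which were fixed by hand.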

After the dataset was geocoded and republished in the portal, a viz was quickly created using the integrated CartoDB CKAN plugin.

(Joel Natividad) #2

Here’s another example of OpenRefine+CKAN integration, using the City’s Geoclient API to enrich the dataset. Beyond lat/long, the recipe also added:

  • Normalized Address
  • Neighborhood Tabulation Area
  • Council District
  • Community District
  • Building ID Number (BIN)
  • Fire Division/Battalion/Company Info
  • Number of Street Frontages
  • Street Corner code
  • Census 2010 Block/Tract
  • the original JSON response from Geoclient

As you can see, Geoclient gives detailed NYC information that no other geocoder can provide. And in the case of the building ID number, we even created a deep-link into the City’s BIN Lookup system so users can get even more info about the gas station (e.g. inspections, violations, permits).

cc @h2oboxer, @noneck, @colin_reilly, @mlipper

(Joel Natividad) #3

And here’s a quick viz, complete with BIN Lookup, quickly pulled together with the portal’s integrated CartoDB visualizer.

(Tim) #4

This would be a perfect project/file for the education session on the 13th.
Raw data to meaningful map display. Excellent!

(Jackie Lu) #5

this is amazing! thank you!

(Marc) #6

Can we use a green nontoxic cleanser?

(Noel Hidalgo) #7

@h2oboxer & @jqnatividad Will both of you be at the May 13th Civic Hall hacknight?

If so, write me up a few lines and I’ll toss it up on the Meetup page. It would be great to advertise this discussion.

(Joel Natividad) #8

I’ll be there! And as per @h2oboxer’s suggestion, we’ll use the Gas Station Price dataset, taking it from Request to Data Scraped, Refined, and Mapped. :slight_smile:

Though OpenRefine will be the meat of the session, it will also touch on how talk+data work together, and how its community features allow the team to collaborate on data.

(Marc) #9

@joel can OpenRefine help deal with multiple datasets? I have several that share many attributes that should be normalized, combined/joined and then mapped.

(Joel Natividad) #10

Yes. OpenRefine’s cell.cross GREL function is your friend. :wink:
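As a rough mental model, `cell.cross` looks up the rows in another project whose key column matches the current cell's value. The Python sketch below imitates that lookup with plain dictionaries; it is an illustration of the idea, not OpenRefine's implementation, and the dataset contents and column names are made up.

```python
from collections import defaultdict


def build_index(rows, key):
    """Group the rows of the 'other' dataset by the value of a key column."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def cross(value, index):
    """Return every row in the other dataset whose key equals `value`."""
    return index.get(value, [])


# Two hypothetical datasets sharing a "station_id" attribute.
stations = [
    {"station_id": "S1", "address": "1 Main St"},
    {"station_id": "S2", "address": "2 Side Ave"},
]
prices = [
    {"station_id": "S1", "price": 2.99},
    {"station_id": "S3", "price": 3.09},
]

station_index = build_index(stations, "station_id")

# Enrich each price row with the matching station address, if any.
for row in prices:
    matches = cross(row["station_id"], station_index)
    row["address"] = matches[0]["address"] if matches else None
```

Rows with no match get an empty result, so it is worth checking for unmatched keys after a join; they usually point at values that still need normalizing.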

(seanluciotolentino) #11

I’d love to see a tutorial on this! I have a lot of cleaning tasks for the crash data, and it would be great if I could pump the results back into the data portal.

(Joel Natividad) #12

We’ll have one during CivicHacknight on the 13th @ Civic Hall :smile:

(Marc) #13

Google search “geoclient api open refine” ranks this page #1!