As some of you know, I’m working on a dissertation in which I compare and contrast Open Data initiatives in different cities.

To show the trajectory of Open Data adoption in these cities, I want to create timelines showing the monthly number of uploaded data sets on a data portal as well as the number of different agencies that have uploaded data sets.

I thought about using the Wayback Machine, but it only crawled the NYC portal six times in 2014, and that’s it.

Does anyone have an idea how I could get this kind of data on the data (pun intended)?

NYC has a specific “dataset of datasets”. Each row is a dataset, and the last two columns contain the date created and date updated for that given dataset.
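If you export that “dataset of datasets” as CSV, something like the rough sketch below would turn it into a monthly timeline. The column names (`Agency`, `Date Created`) and the date format are my assumptions, not the actual headers, so check them against the real export. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Assumed column names and date format; check them against the real export.
DATE_COLUMN = "Date Created"
AGENCY_COLUMN = "Agency"
DATE_FORMAT = "%m/%d/%Y"


def monthly_timeline(csv_text):
    """Return a list of (month, new_datasets, agencies_so_far) tuples."""
    new_per_month = defaultdict(int)
    first_seen = {}  # agency -> first month in which it published a dataset

    for row in csv.DictReader(io.StringIO(csv_text)):
        month = datetime.strptime(row[DATE_COLUMN], DATE_FORMAT).strftime("%Y-%m")
        new_per_month[month] += 1
        agency = row[AGENCY_COLUMN]
        if agency not in first_seen or month < first_seen[agency]:
            first_seen[agency] = month

    timeline = []
    for month in sorted(new_per_month):
        # Count agencies whose first dataset appeared in this month or earlier.
        agencies_so_far = sum(1 for m in first_seen.values() if m <= month)
        timeline.append((month, new_per_month[month], agencies_so_far))
    return timeline


# Invented sample rows, only to show the expected shape of the input.
SAMPLE = """Name,Agency,Date Created,Date Updated
Trees,Parks,01/15/2012,03/01/2013
Permits,Buildings,01/20/2012,02/02/2012
Benches,Parks,03/05/2012,03/05/2012
"""

for month, new_datasets, agencies in monthly_timeline(SAMPLE):
    print(month, new_datasets, agencies)
```

Months with no new datasets are simply missing from the result, and anything that has since been removed from the portal won’t show up at all, so treat the counts as a lower bound.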

You can access a machine-readable version of many open data catalogs at /data.json, but for Socrata portals, the field that would ideally be useful, issued, appears to always show the date of the most recent update.
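To check that for a given portal, a minimal sketch like this lists `issued` next to `modified` for every entry in a downloaded data.json. It assumes the file follows the Project Open Data layout (entries under a top-level `dataset` key, or a bare list in older catalogs). The sample catalog is invented, and `list_entries` is just a name I picked.

```python
import json


def list_entries(catalog_text):
    """Return title, issued and modified for each catalog entry."""
    catalog = json.loads(catalog_text)
    # Newer catalogs wrap entries in a "dataset" key; older ones are a bare list.
    datasets = catalog["dataset"] if isinstance(catalog, dict) else catalog
    entries = []
    for d in datasets:
        issued = d.get("issued")
        modified = d.get("modified")
        entries.append({
            "title": d.get("title"),
            "issued": issued,
            "modified": modified,
            # True when "issued" carries no information beyond "modified".
            "issued_equals_modified": issued is not None and issued == modified,
        })
    return entries


# Invented sample catalog, only to show the shape of the input.
SAMPLE = """{"dataset": [
  {"title": "Trees", "issued": "2015-06-01", "modified": "2015-06-01"},
  {"title": "Permits", "issued": "2013-02-10", "modified": "2015-05-20"}
]}"""

for entry in list_entries(SAMPLE):
    print(entry["title"], entry["issued"], entry["modified"],
          entry["issued_equals_modified"])
```

If nearly every entry on a portal comes back with `issued_equals_modified` set to true, that portal’s `issued` values probably can’t be used as creation dates.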

Thomas Levine wrote up instructions on how to do this a while back, but his site appears to be down. :frowning: Here’s a GitHub archive of some publication data he grabbed 2 years ago.

We also took a weekly snapshot of the “dataset of datasets” for about a year (2012-2013). We used it to trigger the NYCFacets Crawler which then sucked down descriptive statistics and metadata for each dataset.

We took down NYCFacets a while back, but I will see if I can dig up the DOfD archive.

You may also want to look into opendatacache.com. That’s a current project, and I think tracking this would be a great enhancement for it: not only keeping dataset counts, but characterizing the data as well. If @talos or another contributor adds that functionality, you’ll automatically get it for all Socrata portals cached by it!

Finally, I know some folks at NYU CUSP are doing some extended characterization of NYC’s data portal. I dunno if it’s publicly available, but will ping my contacts there.

Opendatacache keeps some statistics on datasets when they change. If you navigate to an individual portal and click to show extended attribute columns on the top right, there’s one called “logs” at the bottom. Click on it and it will take you to a page like:


This is a CSV for each dataset, which gets a new row every time the dataset changes. Only tabular and, I think, shapefile data are covered by this; “external” data resources would not be. The tracking only goes back to June of last year, too.
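Once you have a pile of those per-dataset CSVs, counting changes per month could look roughly like this. The real headers aren’t listed here, so the `timestamp` and `rows` columns (and the ISO date format) are purely hypothetical placeholders, as are the dataset ids in the sample.

```python
import csv
import io
from collections import Counter

# Hypothetical column name; replace it with the real header of the log CSVs.
TIMESTAMP_COLUMN = "timestamp"


def changes_per_month(logs):
    """logs maps a dataset id to the text of its change-log CSV."""
    counts = Counter()
    for dataset_id, text in logs.items():
        for row in csv.DictReader(io.StringIO(text)):
            # Assumes ISO-style timestamps, so the first 7 chars are "YYYY-MM".
            counts[row[TIMESTAMP_COLUMN][:7]] += 1
    return dict(sorted(counts.items()))


# Invented sample logs, one CSV per (made-up) dataset id.
SAMPLE_LOGS = {
    "abcd-1234": "timestamp,rows\n2015-06-03T10:00:00,100\n2015-06-20T10:00:00,120\n",
    "efgh-5678": "timestamp,rows\n2015-07-01T09:30:00,5\n",
}

print(changes_per_month(SAMPLE_LOGS))
```

Taking the earliest row of each log instead would give you a “first seen” date per dataset, which is closer to an upload timeline, though only for the period the tracking covers.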

If this looks like a data resource you’re interested in using, I can point you to the headers for those CSVs and give you some advice on gathering them all.

Thanks a lot to all of you guys. I really appreciate your help!

I will look through your suggestions over the next few days.

I just started sorting the material of my three case studies and it feels like I just emptied the box of a 1000-piece puzzle on my floor. You kind of know it will add up in the end but the beginning is hard :slightly_smiling: