What's the best way to publish this dataset?

(Chris Whong) #1

This morning, Matt Hampel tweeted to me asking if CartoDB can pull in data directly from the Socrata API. It can, but the dataset in question, DOB Permits, does not contain any geo data, so it would need an additional step before it would be “mappable”.

This prompted some more Twitter discussion on the best way to publish this data. My assertion was that mapping is probably a top use case, and that the city ought to at least provide a point geometry with each row.

@sromalewski pointed out that this is additional work that the city shouldn’t have to do when publishing data, and that the dataset included Borough, Block, Lot, and BIN, which could be used with a little SQL to join to either PLUTO or building footprints.
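For anyone who hasn't built the join key before, here is a minimal sketch of assembling a BBL from the separate Borough, Block, and Lot columns. A BBL is a 10-digit identifier (1-digit borough code, 5-digit block, 4-digit lot). The row layout and the column names (`borough`, `block`, `lot`) are assumptions for illustration; the real export may name or format them differently.

```python
# Borough codes used in NYC's BBL identifier.
BOROUGH_CODES = {
    "MANHATTAN": 1,
    "BRONX": 2,
    "BROOKLYN": 3,
    "QUEENS": 4,
    "STATEN ISLAND": 5,
}


def make_bbl(borough: str, block: str, lot: str) -> str:
    """Build a 10-digit BBL: 1-digit borough, 5-digit block, 4-digit lot."""
    code = BOROUGH_CODES[borough.strip().upper()]
    return f"{code}{int(block):05d}{int(lot):04d}"


# Hypothetical permit row; the real column names may differ.
permit = {"borough": "Brooklyn", "block": "01234", "lot": "56"}
print(make_bbl(permit["borough"], permit["block"], permit["lot"]))  # 3012340056
```

The resulting string can then be matched against the BBL column in PLUTO, assuming both sides are stored as zero-padded text.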

So, what do you think? Should the city take the extra step of adding geometries to this data, or leave it be?

Next question: if they should add geometries, should they be points or polygons? Building footprints or parcels? Should they be in the NY State Plane coordinate system that the city uses internally, or in WGS84 (lat/lon), which is more recognizable to a wider audience?

(Chris Whong) #2

Permits have a many-to-one relationship with buildings, so trying to map them out-of-the-box will result in lots of overlapping geometries. Is this a bad thing? Various mapping tools can overcome this with clustering. It all depends on what you want to show.
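One way around the overlap, sketched below, is to collapse permits to one record per building before mapping, for example a permit count per BIN. The rows and field names here (`bin`, `job_type`) are made up for illustration.

```python
from collections import Counter

# Hypothetical permit rows; only the BIN matters for this sketch.
permits = [
    {"bin": "1000001", "job_type": "A2"},
    {"bin": "1000001", "job_type": "NB"},
    {"bin": "1000002", "job_type": "A1"},
]


def permits_per_building(rows):
    """Count permits for each BIN so each building maps to one feature."""
    return Counter(row["bin"] for row in rows)


counts = permits_per_building(permits)
for building_id, n in counts.items():
    print(building_id, n)
```

The count could then drive a symbol size or color on a single geometry per building, instead of stacking one feature per permit.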

(sromalewski) #3

Another consideration is that the permit data represents information about the locations, rather than the locations themselves. There can be multiple permits on any given day for the same location. There can be multiple permits spanning multiple days for the same location. It’s up to the user how they want to disentangle all that and represent it spatially (or not). But keeping the attributes separate from geometry in a many-to-one relationship like this makes sense. I would imagine that’s what’s implied by Colin’s reply on Twitter that “permit data is non-spatial.”

(sromalewski) #4

You beat me to it :slight_smile:

(Matt Hampel) #5

I’m with Steve here. I’m happy to have the data in the format it’s published by the city. If the City would find it useful to have a copy of the data with geometries of some kind (points, polys) attached, then it should publish that; otherwise, it’s up to us.

Mapping is a common use case, and I personally would like a “value-added” version of the dataset that does have geodata attached. I feel like that may be the job of a third-party provider (eg CartoDB, Enigma, betanyc…) to provide enhanced public datasets. If they come into common use, the city should incorporate those features into their ETL workflow.

If I remember correctly, the last time I used this dataset, it wasn’t a clean join with BBL. That might be because some buildings didn’t yet exist when the permit was issued – or had been demolished. It could also have been a data quality issue. What version of the footprint data should permits be joined with? Should we get centroids or polygons? There are complexities that would diminish the audience for any one approach.
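To see how unclean the join is, a rough sketch like the one below keeps the permits that fail to match instead of silently dropping them. The function name, the `bbl` key, and the sample rows are all hypothetical; the same idea applies to a BIN join against footprints.

```python
def join_permits(permits, geometries_by_key, key="bbl"):
    """Attach a geometry to each permit; return (matched, unmatched) lists."""
    matched, unmatched = [], []
    for permit in permits:
        geometry = geometries_by_key.get(permit.get(key))
        if geometry is None:
            # No parcel/footprint for this key: demolished, not yet built,
            # or a data quality problem. Keep it for inspection.
            unmatched.append(permit)
        else:
            matched.append({**permit, "geometry": geometry})
    return matched, unmatched


# Hypothetical sample data; coordinates are placeholders, not real locations.
geometries = {"3012340056": (-73.95, 40.68)}
permits = [
    {"job": "A", "bbl": "3012340056"},
    {"job": "B", "bbl": "3099990001"},  # no matching parcel
]

matched, unmatched = join_permits(permits, geometries)
print(len(matched), "matched;", len(unmatched), "unmatched")
```

Looking at the unmatched rows by filing date would be one way to tell apart demolished or not-yet-built buildings from plain data errors.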

(Matt Hampel) #6

What does it mean that the data is “non-spatial”? Aren’t permits inherently tied to a physical location (even if geodata is not included)?

(sromalewski) #7

The list of permits itself doesn’t inherently include information about the geometry of the buildings. That’s why there’s a separate building footprint file that includes each footprint polygon, a building ID (BIN), and not much more. The building footprint file is spatial. The permit file is non-spatial, in that respect.

(Matt Hampel) #8

Got it. I was wondering if there were some / a lot of permits that didn’t actually map to a physical location, which would be strange for the concept (but anything is possible here…)

(sromalewski) #9

Not sure about any permits that would not map to a physical building, but you’re right to note that this (or anything!) could be possible. NYC DOB (or perhaps DoITT) would know for sure.

(Michael Schnuerle) #10

I’ll throw in my opinions here based on CfA brigade experiences.

I think the city should add a location to any point-based open data they publish. City governments have more accurate street and address names and locations than any third party. They can geocode addresses that private companies can’t: ones that use colloquial names, or that are new housing developments, place names, or landmark names. Real examples of ‘street addresses’ in Louisville that, say, Google could not have geocoded have included Waterfront Park Lot D, On ramp to I-64 at Oak St, 1292 Everforest Way (new development road), and Thomas Jefferson Statue.

And I think the location point data should be WGS84 lat/lon data fields, since that is going to have the widest use cases. Converting from state plane to this should not be much work for the city, but can be a larger burden for the general developer community.

For polygons, linking to an external geo data set by ID is acceptable, using WGS84. Having the data published two ways might be best in these cases: one data set in CSV with the raw data, an ID, and the lat/lon of the center point, and a second data set as a SHP file bundle with the same data columns, ready to go in free open source programs like QGIS.

(Chris Whong) #11

This is a reason why CKAN’s notion of a dataset (a collection of resources) is advantageous over Socrata’s (a single cloud table), as the city could publish multiple “builds” of the same data for various use cases.

(Chris Whong) #12

It’s also worth noting that we aren’t talking about geocoding here, as there is a good lookup field for an existing spatial dataset (BIN gets you a good geometry from building footprints). Consumers have a very reliable way to attach spatial data to this that is the same as what the city can do. However, they must be aware that the footprints dataset exists, AND know where to find it, AND know what a BIN is.

(Andrew Nicklin) #13

What about encouraging a BLDS-compatible structure, which does include recommended (but not required) spatial attributes in implicit WGS84?

(Chris Whong) #14

Are the NYC agencies involved in any capacity in BLDS planning?

(Peter) #15

I’ve been working with this data for quite some time and realized that I needed another dataset to “show” the building permit locations.

Feel free to look at my rudimentary resolution on nycdevelopmenttracker.com and for more explanation on my blog peterkowalewska.wordpress.com
I got the centroids of the buildings outline data and uploaded those coordinates to CartoDB, where I request it when I need it.
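For reference, a centroid can be computed from a footprint's outer ring with the standard shoelace-based formula. This is only a sketch, not necessarily how the centroids above were produced: the function name is made up, it ignores holes and multi-part footprints, and it treats coordinates as planar (a reasonable approximation for building-sized polygons, even in lat/lon).

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # close the ring if the last vertex is missing
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    return cx / (3 * area2), cy / (3 * area2)


# A 2 x 2 square with one corner at the origin.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(polygon_centroid(square))  # (1.0, 1.0)
```

Note that for L-shaped or otherwise concave footprints the centroid can fall outside the building, which may matter if the point is later used for a point-in-polygon lookup.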

The building permit dataset is definitely lacking a spatial element, in addition to needing better data representation (e.g. the column ‘zip’ refers to the filer’s zip, not the actual location of the building).

My long-term solution, currently in the works, is to just create a better dataset on my end that will reduce the number and size of requests.