Community norms of sharing FOILed data?


(Noel Hidalgo) #1

Through the grace of almighty Gaia, I have come to possess a large dataset liberated via NY’s Freedom of Information Law (FOIL). This is the State/City’s version of the federal Freedom of Information Act (FOIA).

Being a good steward of data, I wonder if anyone has developed a framework or guidelines for reviewing FOILed data for personally identifiable information (PII), and for the best ways to publish data that contains it.

Do you know of a guide or guidelines that help data nurds review datasets for personally identifiable information?

Depending on the dataset, what are the best practices for redacting personally identifiable information?

Note - To the “open-all-the-bits” community, I get that these are government records and that, having been handed to the public through the FOIL process, they should arguably be published as is. Frustratingly, this particular dataset might contain the names of vulnerable individuals. As a community of dataonauts, shouldn’t we have a Hippocratic oath?


(Daniel Beeby) #2

As a community of dataonauts, shouldn’t we have a Hippocratic oath?

Yes, we should. Isn’t it basically: “Do no harm”? I think that’s a good place to start!

Do you know of a guide or guidelines that help data nurds review datasets for personally identifiable information?
Depending on the dataset, what are the best practices for redacting personally identifiable information?

I don’t, but wouldn’t you think that randomizing the PII columns and then copying/pasting the resulting dataset (into a new document without any version history) would do the trick? That way you’d still be able to correlate records, just not tie them to any individuals.
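That idea can be sketched in a few lines. Here is a minimal, hypothetical sketch (the column names are made up, and it assumes a simple CSV input) that swaps each distinct PII value for a random token, so rows about the same person still line up:

```python
import csv
import secrets

# Hypothetical PII column names; adjust to whatever the dataset actually uses.
PII_COLUMNS = {"name", "address"}

def pseudonymize(in_path, out_path):
    """Replace each distinct PII value with a random token, consistently,
    so rows about the same person still correlate with one another."""
    tokens = {}  # original value -> random token; never publish this mapping!
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in PII_COLUMNS & set(row):
                row[col] = tokens.setdefault(row[col], secrets.token_hex(8))
            writer.writerow(row)
```

Writing to a brand-new file (rather than editing the original spreadsheet) covers the version-history worry, and the `tokens` mapping has to be destroyed rather than kept, since it reverses the whole exercise.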


(Mheadd) #3

I remember a similar case that came up in Philly a few years back. An individual had submitted a Right to Know request (RTK, the Commonwealth of PA’s version of FOIL) for all expenditure data from the city over several years. This turned out to be several hundred thousand rows of data, and it was impractical for lawyers in the city law office to review it all and redact everything sensitive (because humans).

The data was released and it was pretty much accepted internally (at least from what I can tell) that there was PII in it, littered throughout several different fields. The city did warn the requester that publishing the data might make “sensitive” information available. I remember hearing folks in the city attorney’s office say that if PII was disclosed and someone sued the city, they would contend that they had been compelled to release the data via RTK and could not be held liable.

All that being said, are you looking for guidelines that tell you what kind of checks to conduct? Or are you looking for tools to help you redact data?

One strategy might be to work with the public agency - telling them that you plan to release the data and asking for their help in finding the kinds of things to redact. Knowing that you plan on publishing it might help encourage them to work with you.

Alternatively, you can conduct your own checks and release the data iteratively - leaving out the bits that might contain PII until they have been more thoroughly reviewed, or actually redacting things that might be sensitive.
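For what it’s worth, that first approach can be as simple as a column whitelist. A rough sketch, assuming CSV data and hypothetical column names:

```python
import csv

# Columns already reviewed and judged safe to publish (hypothetical names);
# everything else is withheld until it has been looked at more thoroughly.
REVIEWED_SAFE = ["agency", "fiscal_year", "amount"]

def release_reviewed_columns(in_path, out_path):
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=REVIEWED_SAFE, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)  # fields not yet reviewed are silently dropped
```

As you review more columns, add them to the whitelist and re-release.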

Hope this helps.


(Noel Hidalgo) #4

YES! Exactly.


(Drew Gordon) #5

Hi Noel,

I know this is a bit late for a reply - did you already figure out a solution?

Either way, your concern for protecting individuals in data is appreciated, and I don’t know what can really be gained by having people’s names or other non-anonymous identifiers included in the data.

Just wanted to chime in because I think this is a really interesting but subtle topic and to some extent it would be great to have a neat, accessible guide to refer to when working with data that comes about via channels other than open data portals.

Though I think the problem is that PII can be highly context-dependent. While there are some hard and fast rules (addresses, names, birthdates, SSNs, license IDs, images of individuals, biometric data, etc. - see: http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-122.pdf), individuals might still be sussed out of potentially innocuous data even in the absence of these, and that’s where the case-by-case analysis has to take place.
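A crude first pass for the obvious formats can be automated. The regexes below are purely illustrative - they catch only the most conspicuous patterns and are no substitute for a human review against something like the NIST guide:

```python
import re

# Illustrative patterns only; a real review needs far more than this.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_field(text):
    """Return the names of any patterns that match a field's text."""
    return [name for name, pat in PATTERNS.items() if pat.search(text)]

print(scan_field("Reimbursed J. Doe, 212-555-0123"))  # -> ['phone']
```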

For instance, you might take perfectly open and accessible Twitter data and scrub it of all personally identifying information, but you could still re-identify individuals just by searching tweet text or analyzing patterns in geocodes, if available. Probably a silly example, but you could imagine doing the same with other data depending on its content and what one might know about particular individuals in the dataset. So you have to weigh what’s in the data, what usage the data might undergo, and what potential risks and harms follow from that.
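One rough way to quantify that risk is a k-anonymity-style check: count how many rows share each combination of quasi-identifier fields. The column names below are hypothetical; any combination that occurs only once plausibly points at a single, re-identifiable person:

```python
import csv
from collections import Counter

# Hypothetical quasi-identifiers: fields that are harmless on their own
# but can single someone out in combination.
QUASI_IDENTIFIERS = ["zip_code", "birth_year", "gender"]

def count_unique_combinations(path):
    """Count rows whose quasi-identifier combination appears exactly once."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[tuple(row[c] for c in QUASI_IDENTIFIERS)] += 1
    return sum(1 for n in counts.values() if n == 1)
```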

That said, there is some decent government documentation (not riveting bedtime reading, but thorough) covering how PII protection and disclosure risks are handled in administrative data contexts.

OMB white paper, a good overview: https://www.whitehouse.gov/sites/default/files/omb/mgmt-gpra/privacy_and_confidentiality_in_the_use_of_administrative_and_survey_data_0.pdf

OMB M-14-06: https://www.whitehouse.gov/sites/default/files/omb/memoranda/2014/m-14-06.pdf

Overall, the first NIST document I linked to should give you a good idea of what constitutes PII (i.e., what you should look for) and ways to handle that.

There are also procedures for obfuscating identifiers (such as hashing IDs with a hash function like SHA-2) if your analyses still require some way to keep track of anonymized individuals (though again, even if that hash isn’t broken, there might be ways to re-ID individuals depending on what other information could potentially be linked to records in the data).
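A minimal sketch of that, using only Python’s standard library. Note the keyed HMAC rather than a bare hash: hashing a guessable identifier (an SSN, say) directly is reversible by brute force, whereas a secret key that you destroy after release is not:

```python
import hashlib
import hmac
import secrets

# Secret key known only to the publisher. Destroying it after release means
# nobody (including you) can recompute a pseudonym from a guessed identifier.
KEY = secrets.token_bytes(32)

def pseudonym(identifier: str) -> str:
    """Stable anonymous token via HMAC-SHA-256 (part of the SHA-2 family).
    A bare sha256(identifier) lets anyone who can enumerate the identifier
    space (e.g., all nine-digit SSNs) reverse it by brute force."""
    return hmac.new(KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]
```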

Hope that helps some, but overall I think it comes down to what data you have that you want to release and anticipating the ways it might be used (or misused).