I know this reply is a bit late; did you already figure out a solution?
Either way, your concern for protecting individuals in data is appreciated and I don’t know what can really be gained by having people’s names or other non-anonymous identifiers included in the data.
Just wanted to chime in because I think this is a really interesting but subtle topic, and it would be great to have a neat, accessible guide to refer to when working with data that comes through channels other than open data portals.
Though I think the problem is that PII can be highly context dependent. While there are some hard and fast rules (addresses, names, birthdates, SSNs, license IDs, images of individuals, biometric data, etc. - see: http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-122.pdf), individuals might still be sussed out of seemingly innocuous data even in the absence of these, and that's where the case-by-case analysis has to take place.
For instance, you might take perfectly open and accessible Twitter data and scrub it of all personally identifying information, but you could still re-identify individuals just by searching tweet text or analyzing patterns in geocodes, if available. Probably a silly example, but you could imagine doing the same with other data depending on its content and what one might know about particular individuals in the dataset. So you have to bring in considerations of what's in the data, what usage the data might undergo, and what potential risks and harms follow from that.
That said, there is some decent government documentation (not riveting bedtime reading, but thorough) that covers how PII protection and disclosure risks are handled in administrative data contexts.
OMB white paper, a good overview: https://www.whitehouse.gov/sites/default/files/omb/mgmt-gpra/privacy_and_confidentiality_in_the_use_of_administrative_and_survey_data_0.pdf
OMB M-14-06: https://www.whitehouse.gov/sites/default/files/omb/memoranda/2014/m-14-06.pdf
Overall, the first NIST document I linked to should give you a good idea of what constitutes PII (i.e., what you should look for) and ways to handle that.
There are also procedures for obfuscating identifiers (such as hashing IDs with a hash function like SHA-2) if your analyses still require some way to keep track of anonymized individuals (though again, even if that hash isn't broken, there might be ways to re-ID individuals depending on what other information could potentially be linked to records in the data).
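As a rough sketch of that hashing idea (the function names and key value here are made up for illustration): one caveat worth knowing is that a bare SHA-2 digest of a low-entropy identifier like an SSN can be brute-forced by simply hashing every possible input, so a keyed hash (HMAC) with a secret key is a safer way to get stable pseudonyms.

```python
import hmac
import hashlib

# Hypothetical secret key: in practice, generate this randomly once,
# store it securely, and never release it alongside the data.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

def pseudonymize(record_id: str) -> str:
    """Map an identifier to a stable pseudonym with HMAC-SHA-256.

    A keyed hash is used instead of a plain SHA-2 digest because
    guessable IDs (SSNs, phone numbers) could otherwise be recovered
    by hashing every candidate value and comparing digests.
    """
    return hmac.new(SECRET_KEY, record_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# The same ID always maps to the same pseudonym, so joins across
# tables still work, but the pseudonym reveals nothing on its own.
a = pseudonymize("123-45-6789")
b = pseudonymize("123-45-6789")
print(a == b)                           # same input, same pseudonym
print(a == pseudonymize("987-65-4321")) # different input, different pseudonym
```

Even with this, the re-identification caveat above still applies: a consistent pseudonym preserves linkability across records, which is exactly what can enable re-identification when joined with outside information.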
Hope that helps some, but I think overall it comes down to what data you have that you want to release and anticipating ways it might be used (or misused).