Useful Resources

Thought it might be good to have a centralized repository of the resources folks find useful to have on hand. Post cheat sheets, tutorials, whatever helps you do your thing, whether it’s stuff you still use or useful guides you’ve followed in the past!

Here’s some of mine.

Python Regular Expression cheat sheet: https://www.debuggex.com/cheatsheet/regex/python

Google’s guide to Regular Expressions in Python: https://developers.google.com/edu/python/regular-expressions
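
To give a flavor of what those cover, here’s a minimal `re` example (the pattern and text are just made-up illustrations):

```python
import re

# Made-up snippet; find dollar amounts like "$1,234.56".
text = "Contract awarded for $1,234.56 on 03/15/2015."
pattern = re.compile(r"\$[\d,]+(?:\.\d{2})?")

match = pattern.search(text)
if match:
    print(match.group())  # -> $1,234.56
```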

Data Cleaning in Python: http://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/

Data Munging with Pandas: http://www.analyticsvidhya.com/blog/2014/09/data-munging-python-using-pandas-baby-steps-python/
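
For a taste of the basic munging moves that baby-steps post covers, a tiny sketch on made-up data:

```python
import pandas as pd

# Toy frame with the usual messiness: missing values, inconsistent case.
df = pd.DataFrame({
    "agency": ["DCAS", "dcas", None, "DOT"],
    "amount": [100.0, None, 250.0, 75.0],
})

df["agency"] = df["agency"].str.upper()   # normalize case (None stays missing)
df["amount"] = df["amount"].fillna(0)     # fill in missing amounts
df = df.dropna(subset=["agency"])         # drop rows with no agency at all
print(df)
```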

NLTK guide: http://www.nltk.org/book/

NLTK Cheat Sheet: https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf
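
And a minimal NLTK snippet of the kind both links walk through (assumes you’ve done the one-time download of the "punkt" tokenizer models):

```python
import nltk
# nltk.download("punkt")  # one-time download of the tokenizer models

text = "The City Record is published daily. It lists contracts and notices."
sentences = nltk.sent_tokenize(text)  # split into sentences
words = nltk.word_tokenize(text)      # split into word tokens

# Frequency distribution over lowercased alphabetic tokens.
freq = nltk.FreqDist(w.lower() for w in words if w.isalpha())
print(sentences)
print(freq.most_common(3))
```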

Fuzzy sentence-matching in Python: http://bommaritollc.com/2014/06/fuzzy-match-sentences-python/
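
That post has its own approach; the standard-library version of the same basic idea, using difflib, looks like this:

```python
import difflib

a = "Notice of public hearing on the proposed budget."
b = "Notice of a public hearing on proposed budget items."

# ratio() returns a similarity score in [0, 1]; higher = more alike.
score = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
print(round(score, 2))
```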


Thanks so much for these @mattalhonte! As someone new to Python and parsing, I find these extremely helpful. Please keep sharing any helpful tutorials (especially for beginners)!


I was looking at the US government’s new digital services agency, 18F, and their GitHub repos. Two things look interesting and may be useful to us:

  1. Regulations Parser: https://github.com/18F/regulations-parser
  2. Python library to extract text from PDF, and default to OCR when text extraction fails: https://github.com/18F/doc_processing_toolkit
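
Number 2 is basically the pattern we’d want for the City Record scans. A rough sketch of that extract-then-fall-back-to-OCR idea (my own guess at the flow, not the toolkit’s actual code; it assumes poppler’s pdftotext is installed):

```python
import subprocess

def extract_text(pdf_path, txt_path):
    """Try plain text extraction; flag the file for OCR if it comes up empty."""
    # pdftotext ships with poppler-utils.
    subprocess.check_call(["pdftotext", pdf_path, txt_path])
    with open(txt_path) as f:
        text = f.read()
    if len(text.strip()) < 100:  # arbitrary threshold: probably a scanned PDF
        # This is where the toolkit would fall back to OCR (Tesseract);
        # hook in pytesseract or similar at this point.
        return None
    return text
```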

For regex and Python, check out Pythex as well. It’s really useful to be able to experiment with patterns and view a cheat sheet at the same time.


Came across a tutorial that I think will be useful once we’ve got some human-classified data to test against. It’s about building a classifier for Twitter, which fits because our entries are formatted kind of like tweets: we’re comparing a bunch of short-ish snippets of text with each other.

Twitter sentiment analysis using Python and NLTK: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
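
The gist of the tutorial is NLTK’s NaiveBayesClassifier with bag-of-words features; a toy version with made-up labels:

```python
import nltk

# Tiny hand-labeled training set; in practice you'd want far more examples.
train = [
    ("award of contract to vendor", "contract"),
    ("contract awarded for services", "contract"),
    ("public hearing notice", "hearing"),
    ("notice of public hearing on rules", "hearing"),
]

def features(text):
    # Word-presence features, as in the linked tutorial.
    return {word: True for word in text.split()}

train_set = [(features(text), label) for text, label in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("notice of hearing")))  # -> hearing
```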

Some more scattered stuff:

Pandas cheat sheet: http://nbviewer.ipython.org/github/pybokeh/ipython_notebooks/blob/master/pandas/PandasCheatSheet.ipynb

Useful Pandas Snippets: http://www.swegler.com/becky/blog/2014/08/06/useful-pandas-snippets/

Working with Text Data (in Pandas): http://pandas.pydata.org/pandas-docs/dev/text.html
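
The `.str` accessor from that last link is the handy bit; missing values pass through instead of raising:

```python
import pandas as pd

s = pd.Series([" Award: DCAS ", "NOTICE: dot", None])

# Chained, vectorized string ops; the None stays NaN throughout.
cleaned = s.str.strip().str.lower()
print(cleaned.str.contains("notice"))
```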

Here’s a guide to using this thing called Pickle, which is useful for saving weird transitional forms of your data: https://freepythontips.wordpress.com/2013/08/02/what-is-pickle-in-python/
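
The whole API is basically dump and load:

```python
import pickle

# Stash an intermediate data structure to disk...
tokens_by_doc = {"doc1": ["notice", "of", "hearing"]}
with open("tokens.pkl", "wb") as f:
    pickle.dump(tokens_by_doc, f)

# ...and load it back in a later session.
with open("tokens.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored == tokens_by_doc)  # -> True
```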

I just saw this on the Chicago Hacknight list…

---------- Forwarded message ----------
From: Derek Eder derek.eder@gmail.com
Date: Tue, Mar 3, 2015 at 3:41 PM
Subject: [chihacknight] parsing names - it’s easy, right?
To: chihacknight chihacknight@googlegroups.com

Hey hack-nighters!

DataMade just launched two new tools for parsing names.

The first is called probablepeople, and it uses machine learning to parse unstructured name strings into their constituent parts (given, middle, surname, etc.). It functions much like usaddress, a tool for parsing addresses that we released late last year.

In creating probablepeople and usaddress, we also developed a toolkit called parserator that you can use to build your own probabilistic parser for tackling similar problems in different domains.

You can read more about probablepeople and parserator in our blog post Parse names and parse … anything, really.

As with usaddress, probablepeople gets better as more people use it and contribute more tagged examples that the parser can learn from. See probablepeople’s github repo for more information on how you can contribute.

Happy parsing!


Derek Eder
@derekeder
derek.eder@gmail.com
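
If anyone wants to try probablepeople, the basic call looks roughly like this (my quick sketch from skimming the repo; the exact output labels come from its docs):

```python
import probablepeople  # pip install probablepeople

# parse() returns (token, label) pairs; the labels here are a guess at
# the library's own tag set (GivenName, Surname, etc.).
print(probablepeople.parse("George Oscar Bluth II"))
# e.g. [('George', 'GivenName'), ('Oscar', 'MiddleName'), ...]
```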


Practical Business Python (“Taking care of business, one Python script at a time”) has some great follow-along examples, with code and data, of useful tools.

Two I would recommend starting with:

  1. Pandas Pivot Tables explained - yeah, this is what finally helped me understand what a pivot table is. Cool stuff. (Quick sketch after this list.)
  2. Web Scraping: it’s your civic duty examines the Minnesota 2014 Capital Budget using pandas and BeautifulSoup.
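
Here’s the one-liner at the heart of the pivot table post, on made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "agency": ["DCAS", "DCAS", "DOT", "DOT"],
    "year":   [2013, 2014, 2013, 2014],
    "amount": [100, 150, 80, 120],
})

# Rows become agencies, columns become years, cells sum the amounts.
pivot = pd.pivot_table(df, values="amount", index="agency",
                       columns="year", aggfunc="sum")
print(pivot)
```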

–Amal


Lovely post on Git workflow:
Git Pull Request Tutorial


Thanks, Amal. The Git workflow shared above makes a lot of sense to me. Have you created a development branch where we can be included as collaborators? I’m assuming the current master branch would be the 0.. version.

Hi Bahij,

I’m glad the link was useful. I personally think it’s a nice workflow, and we should adopt it once we run it by everyone at the next sync-up. Let’s present a case for it.

Where might I find some CROL PDFs? There was a doc with links, but it seems to be gone.

Thx, Marc

Hi Marc,

The PDFs can be found here: https://drive.google.com/drive/folders/0B98QOZfGax93eWQyOHB4dWRWczg/0B98QOZfGax93R0owdV9RVS1GMkk/0B98QOZfGax93anpRaWp1bTU2czA
The text files can be found here: https://github.com/CityOfNewYork/CROL-PDF/tree/master/Current%20DCAS%20Implementation/City%20Record%20(2008-2014)%20Text%20Files

Cheers,


Google Drive, ugh. Sorry to say this, but shared Google Docs (and I’m using Chrome) are a nightmare of suckage. Worst of all, it only feeds my longstanding Google Docs angst. Still ever so thankful, Mikael!

Prob makes sense to move the content into the GitHub repo?

Marc

Glad you found them OK. It was actually a conscious choice not to add them to the GitHub repo, as we’re working on DCAS’s database and not the PDFs at the moment; I didn’t want any confusion. Happy to transfer them over if you’d still prefer, though? Let me know.

M.

@mikael I guess that is another request we can make… While we can dump the ones we have, it would be great for the City to publish archives to GitHub.