Tools for data scraping


(Noel Hidalgo) #1

On Tuesday, I’ll be standing in front of 20 high school students to talk about NYC open data. @hayleyrichardson and I asked them what they want to talk about, and a majority of them want to talk about “scraping.”

For someone who is starting to scrape data, what tools would you recommend?

I’m going to show them Kimono and import.io.

/cc @cwhong @MaxGalka @talos @h2oboxer @fma2


(aileen) #2

Can you share more on why the students are interested in scraping?
Curious about their experience and interests.


(Noel Hidalgo) #3

@aileen I can’t. This is an introductory course on “civic tech” and “open data.”


(Joel Natividad) #4

Apart from scraping data from the web, you may also want to investigate having them generate their own data from real-world observations, for example:

  • using EpiCollect, they can do a survey of their environment (field observations - complete with GPS, media, etc.), clean it up, aggregate and analyze it
  • they can do this in connection not only with their STEM classes, but you can also loop in some social issues - e.g. surveying public housing, street furniture, public vandalism, foot traffic, commute time, garbage collection, etc.
  • maybe they can even correlate their observations with 311 data to see if these non-emergency, quality-of-life issues are being reported
  • perhaps they can even do these surveys in coordination with their local Community Boards, so they can get data behind the pressing issues their CB is dealing with

In that way, not only do they get training in STEM skills, they also become more civically engaged by connecting with their community. #ItTakesACommunity


(Chris Whong) #5

I defer to @talos, who is actually good at this.

I like to use node for scraping, specifically the module “cheerio”, which lets you use jQuery-style selectors to navigate a page after it’s been downloaded.


(John Krauss) #6

Define “scraping”. Judging from the tools you posted, you’re looking to give them an introduction to scraping that doesn’t require writing code. I don’t know what the tools in that realm are like, for better or for worse.

For people who are writing code, I would recommend Python + the requests library if what they’re pulling is straight files/HTML (for example, the tax bills scraper). To parse HTML, I would recommend BeautifulSoup4 in Python.
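The requests + BeautifulSoup4 combination can be sketched in a few lines. This is a minimal example, not from the thread; the URL and the choice of `<h2>` tags are placeholders to show the pattern:

```python
# Sketch: fetch a page with requests, then parse it with BeautifulSoup 4.
from bs4 import BeautifulSoup

def scrape_titles(html):
    """Return the text of every <h2> element on the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

def scrape_page(url):
    """Fetch a live page and scrape it (URL is a placeholder)."""
    import requests  # third-party; only needed for the live fetch
    return scrape_titles(requests.get(url).text)

# Offline demo on a canned page (no network needed):
html = "<html><body><h2>First</h2><h2>Second</h2></body></html>"
print(scrape_titles(html))  # → ['First', 'Second']
```

The same `find_all`/CSS-selector approach scales to tables, links, and attributes once you’ve inspected the page’s markup in dev tools.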

I would recommend JavaScript/node (and whatever libraries @cwhong has found useful) and a headless browser if they are trying to capture JavaScript-rendered portions of pages.

However, I’ve found headless systems are very slow, and it can be better to use Chrome dev tools to isolate the AJAX requests that fetch the necessary data, and make/read those directly. Oftentimes those will return well-formatted JSON to boot!
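Once the Network tab reveals the AJAX endpoint, you can skip the browser entirely and call it yourself. A minimal sketch, assuming a hypothetical endpoint and response shape (the URL, parameters, and `"results"`/`"type"` keys are illustrative, not a real API):

```python
# Sketch: call a JSON endpoint found via the browser's Network tab directly.
import json

def extract_types(raw):
    """Pull record types out of a payload shaped like {"results": [...]} (hypothetical shape)."""
    return [r["type"] for r in json.loads(raw)["results"]]

def fetch_records(url, params=None):
    """Hit the endpoint directly instead of rendering the page (URL is a placeholder)."""
    import requests  # third-party; only needed for the live call
    resp = requests.get(url, params=params, headers={"Accept": "application/json"})
    resp.raise_for_status()
    return extract_types(resp.text)

# Offline demo with a canned response (no network needed):
sample = '{"results": [{"id": 1, "type": "Noise"}, {"id": 2, "type": "Graffiti"}]}'
print(extract_types(sample))  # → ['Noise', 'Graffiti']
```

Because the endpoint returns structured JSON, there’s no HTML parsing at all, and one direct request replaces an entire headless-browser page load.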

pdftotext is a great tool if you’re trying to extract text from PDFs that are text-based; Tesseract will work otherwise, but it’s slow and memory-intensive (as OCR is).


(Joel Selanikio) #7

Sorry to be late to the discussion, but I thought I’d mention our Magpi mobile tools, following up on the note by @jqnatividad. Magpi is a favorite tool in a lot of international activities, particularly in health, but it’s used for lots of domestic work, too. Many, many case studies here.

Magpi has an Android-based app for deploying mobile forms (like EpiCollect, already mentioned), but it also offers:

  • iOS mobile data collection app
  • SMS data collection
  • IVR (voice) data collection
  • built-in reporting & visualization
  • built-in broadcast SMS and audio messaging

The basic Magpi account (to enable app-based data collection on iOS or Android) is completely free, with no time limit. Sign up at our site.