Build your own datasets

Modified

June 26, 2023

“Finding stories in datasets” is a misnomer in data journalism. Most open source datasets were created with a specific use case in mind. That original use case is seldom compatible with a hypothesis you’ll want to test, let alone a source that could lead to any form of accountability.

The same can be said for the academy. Analyzing existing datasets will not lead to drastically different conclusions, and it can disproportionately inflate the importance of a topic simply because the data is readily available.

Instead, you can build your own datasets by synthesizing publicly available data and records obtained from records requests.

Public data sources explained

There’s a difference between “open data” and “publicly available data”:

Open data is typically already combined into a spreadsheet or database. It is also usually documented and easily available for the public to use. See, for example, climate data from NOAA or the U.S. Census Bureau’s American Community Survey.
Publicly available data lives on the open web, but has yet to be synthesized into a cohesive dataset. It’s up to you to collect these data points responsibly. Search engines and many other technology companies (such as AI developers) depend on “crawling” these sources.

At a minimum: only collect data with intention, do not overload websites’ servers, and abstain from collecting personally identifiable information without user consent.
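To make the point about not overloading servers concrete, here is a minimal sketch of a helper that checks a site's robots.txt rules and enforces a pause between requests. It uses only Python's standard library. The class name `PoliteCrawler`, the `research-bot` user agent, the example rules, and the one-second default delay are all assumptions made for illustration; an appropriate delay depends on the site you are collecting from.

```python
import time
from urllib import robotparser


class PoliteCrawler:
    """Hypothetical helper: honors robots.txt rules and spaces out requests."""

    def __init__(self, robots_txt: str, user_agent: str, min_delay: float = 1.0):
        self.user_agent = user_agent
        self.min_delay = min_delay      # assumed minimum seconds between requests
        self._last_request = None       # time of the previous request, if any
        self._rules = robotparser.RobotFileParser()
        # Parse robots.txt text that you have already downloaded.
        self._rules.parse(robots_txt.splitlines())

    def allowed(self, url: str) -> bool:
        """Return True if robots.txt permits this user agent to fetch the URL."""
        return self._rules.can_fetch(self.user_agent, url)

    def wait(self) -> float:
        """Sleep until min_delay has elapsed since the last call; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            remaining = self.min_delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_request = time.monotonic()
        return slept


# Example rules, invented for this sketch.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

crawler = PoliteCrawler(EXAMPLE_ROBOTS, user_agent="research-bot", min_delay=1.0)
for url in ["https://example.com/public", "https://example.com/private/page"]:
    if crawler.allowed(url):
        crawler.wait()
        # Fetch the page here with the HTTP client of your choice.
```

The actual download step is left out on purpose. Call `wait()` immediately before every request so that the delay applies no matter which HTTP client you use.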

What to expect in this section?

Publicly available data is a useful tool to audit and investigate technologies and their underlying business practices.

The following sections will cover programmatic data collection and best practices.

We’ll discuss data collection techniques such as:

Finding undocumented APIs
Browser automation
App automation
Parsing HTML and JSON

Use these techniques to build datasets that allow you to test original hypotheses, design clear experiments, and understand the limitations that come along with the decisions you make.
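As a small preview of the last technique in the list, parsing HTML and JSON, here is a sketch that turns raw responses into structured records using only Python's standard library. The sample HTML, the sample JSON, and the names `LinkCollector`, `parse_html_links`, and `parse_json_rows` are invented for illustration and do not come from any real website or API.

```python
import json
from html.parser import HTMLParser

# Invented sample responses, standing in for what a server might return.
SAMPLE_HTML = (
    '<ul><li><a href="/listing/1">One</a></li>'
    '<li><a href="/listing/2">Two</a></li></ul>'
)
SAMPLE_JSON = '{"results": [{"id": 1, "price": 10.5}, {"id": 2, "price": 8.0}]}'


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_html_links(html_text: str) -> list:
    """Return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


def parse_json_rows(json_text: str) -> list:
    """Return the list stored under the (assumed) 'results' key."""
    return json.loads(json_text)["results"]


links = parse_html_links(SAMPLE_HTML)
rows = parse_json_rows(SAMPLE_JSON)
```

JSON responses usually map onto rows and columns with far less work than HTML, which is one reason undocumented APIs are worth looking for before parsing a rendered page.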

Legal precedents

Although big tech giants and data brokers often depend on web scraping for their business models, they seldom use that data in the public interest or release data that could be used to hold themselves accountable.

This guide exists to teach you how to build evidence that leads to accountability. However, know that using data to investigate powerful entities is not without risks.

If you’re in the United States, know what violates the Computer Fraud and Abuse Act (CFAA), which primarily prohibits unauthorized access to a computer network.

Recent cases such as Van Buren v. United States, hiQ Labs v. LinkedIn, and Sandvig v. Barr helped shape interpretations of the CFAA for collecting public data by automated means, such as web scraping.

Although the legal landscape is changing to favor web scraping in the public interest, we still see governments and industry titans attempt to shut down accountability efforts. Take for example:

A journalist in Missouri was called a hacker by the governor and threatened with prosecution after identifying a flaw, visible by inspecting the page source, that revealed the Social Security numbers of school employees.
Academic researchers at NYU received a cease-and-desist notice for crowdsourcing political ads from Facebook.

Even if your activity does not fall within the CFAA’s purview or violate any other law, online services can suspend your account(s) for breaking their terms of service. For that reason, be careful about involving your personal or institutional accounts in web scraping, and about involving volunteers’ accounts if you’re crowdsourcing data.

If you want more information on the topic, several of the field’s top researchers explore the legal and ethical considerations in Section 4.1 of Metaxa et al. (2021).

This is NOT legal advice. Discuss your intentions and your plan to collect data with your editor and legal counsel (if you’re a journalist), or your advisor and ethics board (if you’re a researcher).

Having institutional support is essential to make sure you are protected, and that you and your superiors are well-informed about the risks.

Aaron Swartz

Contemporary legal interpretations of CFAA and web scraping can be traced back to the late activist and engineer Aaron Swartz.

In 2008, Swartz was investigated by the F.B.I. for scraping 2.7 million public court records from PACER and sharing them with the public. Swartz redistributed information that is in the public domain, but hosted by a central entity that charges fees for access to it.

The bureau concluded that Swartz did not violate any laws, but three years later, Swartz was arrested and federally indicted for mass-downloading academic articles from JSTOR using a laptop stored in an MIT closet. Although neither JSTOR, MIT, nor state prosecutors chose to litigate, federal prosecutors sought maximal penalties: Swartz faced $1 million in fines and 35 years in prison. Lawyers and experts deeply criticized the charges.

Swartz’s prosecution and untimely passing would have a chilling effect on web scraping in the academy for years to come. But attitudes are changing slowly, with journalists, researchers, and other public interest technologists receiving more legal and institutional protections to collect publicly available data.

You can learn more about Aaron Swartz in the documentary “The Internet’s Own Boy,” directed by Brian Knappenberger, and on the website AaronSwartzDay.org.