Build your own datasets

Modified: June 26, 2023

“Finding stories in datasets” is a misnomer in data journalism. Most open datasets were created with a specific use case in mind. Seldom is that original use case compatible with a hypothesis you’ll want to test, let alone able to serve as a source that could lead to any form of accountability.

The same can be said for the academy: analyzing existing datasets rarely leads to drastically different conclusions, and it can disproportionately inflate the importance of a topic simply because the data is readily available.

Instead, you can build your own datasets by synthesizing publicly available data with records obtained through public records requests.
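As a minimal sketch of that kind of synthesis, the example below joins two sources with pandas; the file names and the shared `facility_id` column are hypothetical stand-ins for your own sources:

```python
import pandas as pd

# Hypothetical inputs: one file scraped from the web, one obtained
# through a public records request.
scraped = pd.read_csv("inspections_scraped.csv")
requested = pd.read_excel("violations_records_request.xlsx")

# Join the two sources on a shared identifier to build a dataset of your own.
combined = scraped.merge(requested, on="facility_id", how="left")
combined.to_csv("combined_dataset.csv", index=False)
```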

Public data sources explained

There’s a difference between “open data” and “publicly available data”:

  • Open data is typically already combined into a spreadsheet or database. It is also usually documented and readily available for the public to use. See, for example, climate data from NOAA, or the U.S. Census Bureau’s American Community Survey (a sketch of pulling that data follows this list).

  • Publicly available data lives on the open web but has yet to be synthesized into a cohesive dataset. It’s up to you to collect these data points responsibly. Search engines and many other technology companies (such as AI developers) depend on “crawling” these sources.
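To illustrate how ready-to-use open data can be, here is a short sketch pulling state populations from the Census Bureau’s American Community Survey API; the vintage year and variable code are assumptions you should verify against the Bureau’s documentation:

```python
import requests

# Vintage year and variable code are assumptions -- check the Census docs.
ACS_URL = "https://api.census.gov/data/2021/acs/acs5"
params = {"get": "NAME,B01003_001E", "for": "state:*"}  # B01003_001E: total population

rows = requests.get(ACS_URL, params=params, timeout=30).json()
header, *records = rows  # the first row holds the column names
for name, population, fips in records[:5]:
    print(f"{name}: {population} residents (state FIPS {fips})")
```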

At a minimum: only collect data with intention, do not overload websites’ servers, and abstain from collecting personally identifiable information without user consent.
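Here is one minimal sketch of what those ground rules can look like in practice, assuming the `requests` library and a hypothetical list of target URLs; it checks robots.txt, identifies the collector, and pauses between requests:

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "my-dataset-project (contact: you@example.org)"  # identify yourself
DELAY_SECONDS = 5  # generous pause so you never overload the server


def allowed_by_robots(url: str) -> bool:
    """Check whether the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)


def fetch_pages(urls):
    """Fetch each permitted URL politely, pausing between requests."""
    for url in urls:
        if not allowed_by_robots(url):
            continue  # skip anything the site asks crawlers to avoid
        response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        response.raise_for_status()
        yield url, response.text
        time.sleep(DELAY_SECONDS)
```

Treat robots.txt as a floor, not a ceiling: a site’s terms of service or your own ethical judgment may impose stricter limits than the file itself does.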

What to expect in this section

Publicly available data is a useful tool to audit and investigate technologies and their underlying business practices.

The following sections will cover programmatic data collection and best practices.

We’ll discuss data collection techniques you can use to build datasets that let you test original hypotheses, design clear experiments, and understand the limitations that come with the decisions you make.