Best practices for data collection

Author

Leon Yin

Modified

February 14, 2023

In the previous chapters, we covered techniques and methods of data collection.

You can apply those techniques towards building a data pipeline, which can increase the reliability and scale of your data collection significantly. The professional pursuit of this task is data engineering, and it often involves APIs, cloud computing, and databases.

Here are some helpful tips for building datasets.

Don’t repeat work

Before you collect data, check if you’ve already collected it.

Create a programmatic naming structure for a “target” (this could be a filename or a unique ID in a database), and check if it exists.

If it already exists, move on.

Below is a dummy example of a scraper for video metadata that checks if a file with the same video_id has already been saved.

import os

def collect_video_metadata(video_id):
    """
    This is an example of a data collection function
    that checks if a video_id has already been collected.
    """
    # consistently structure the target filename (fn_out)
    fn_out = f"video_metadata_{video_id}.csv"
    
    # check if the file exists, if it does: move on
    if os.path.exists(fn_out):
        print("already collected")
        return
        
    # collect the data (not actually implemented)
    print("time to do some work!")
    
    # save the file. Instead of real data, we'll save text that says, "Collected".
    with open(fn_out, 'w') as f:
        f.write("Collected")
    return

Let’s try to collect some video metadata for a video_id of our choosing.

video_id = "schfiftyfive"
collect_video_metadata(video_id = video_id)
time to do some work!

Let’s try to run the exact same function with the same input:

collect_video_metadata(video_id = video_id)
already collected

The second time you call it, the function ends early.

When collecting a large dataset, this kind of check is essential to making the best use of your time.

Make a todo list

In addition to not repeating yourself, keep tabs on what still needs to be done. That list could be a simple CSV file, or something more advanced like a queuing system such as AWS SQS. With a queuing system, you can clear tickets that have finished and retry tickets that failed.
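Below is a minimal sketch of the CSV approach. The file name (todo.csv) and its columns (video_id and status) are hypothetical, but the idea carries over: read the pending items, skip the finished ones, and mark each item off as it completes.

import csv

def load_todo(fn="todo.csv"):
    """Read the to-do list. Each row has a video_id and a status."""
    with open(fn, newline="") as f:
        return list(csv.DictReader(f))

def save_todo(rows, fn="todo.csv"):
    """Write the to-do list back to disk."""
    with open(fn, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["video_id", "status"])
        writer.writeheader()
        writer.writerows(rows)

todo = load_todo()
for row in todo:
    # skip tickets that have already been cleared
    if row["status"] == "done":
        continue
    collect_video_metadata(row["video_id"])
    # mark the ticket as done and write progress back right away,
    # so a crash doesn't lose track of what's finished
    row["status"] = "done"
    save_todo(todo)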

Save receipts

Save the output of every step, especially the earliest steps of collecting a JSON response from a server, or the HTML of a website.

You can always re-write parsers that turn that “raw” data into something neat and actionable.

Websites and API responses can change, so web parsers can break easily. It is safer to just save the data straight from the source, and process it later.

If you’re collecting a web page through browser automation, save a screenshot. It’s helpful to have reference material of what the web page looked like when you captured it.
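As a rough sketch of saving a receipt with the requests library (the URL and output filename below are placeholders), the key move is writing the raw response to disk before any parsing happens:

import requests

def save_receipt(url, fn_out):
    """Fetch a page and save the raw response before parsing anything."""
    resp = requests.get(url)
    resp.raise_for_status()
    # write the body exactly as the server sent it
    with open(fn_out, "w") as f:
        f.write(resp.text)
    return fn_out

# parsing happens later, in a separate step, against the saved file
save_receipt("https://example.com/video/schfiftyfive", "raw_video_schfiftyfive.html")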

Saving receipts is something we did at The Markup when we collected Facebook data from a national panel over several months, and again when we collected Google search results.

These receipts don’t just play a role in the underlying analysis; they can also be used as powerful exhibits in your investigation.

Break up the work, and make it as small as possible

Break scraping tasks into the smallest units of work. This makes scaling up easier, and it prevents a single point of failure from disrupting your entire workflow.

Certain components of a scraper can be slower than others. By dividing the tasks, you can better identify bottlenecks and optimize the pipeline. Use to-do lists and checks for existing files to communicate between tasks.

Remember that big problems can be broken up into smaller problems. Working in small units helps you get to the finish line faster and debug issues more quickly.
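For instance, the metadata scraper above could be split into a fetch step and a parse step (the function and file names below are just illustrative), each with its own output file, so that either step can be rerun on its own:

import os
import json

def fetch_video_page(video_id):
    """Step 1: download the raw page and save it (the slow, fragile part)."""
    fn_raw = f"raw_{video_id}.html"
    if os.path.exists(fn_raw):
        return fn_raw
    # download the page here (not actually implemented)
    with open(fn_raw, "w") as f:
        f.write("<html>raw page</html>")
    return fn_raw

def parse_video_page(video_id):
    """Step 2: parse the saved page into structured data (cheap to re-run)."""
    fn_raw = fetch_video_page(video_id)
    fn_parsed = f"parsed_{video_id}.json"
    if os.path.exists(fn_parsed):
        return fn_parsed
    with open(fn_raw) as f:
        html = f.read()
    # turn the raw HTML into something neat and actionable (not implemented)
    with open(fn_parsed, "w") as f:
        json.dump({"video_id": video_id, "html_length": len(html)}, f)
    return fn_parsed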

Bigger isn’t always better

Be smart with how you use data, rather than depending on big numbers. Data isn’t valuable in itself.

It’s better to start off small, with a trial analysis (we often call it a quick-sniff in the newsroom) to make sure you have a testable hypothesis.

This is a step I always use at my newsroom to plan longer data investigations and to see what kind of story we could write if we spent more time on data collection and honing the methodology.

Spotcheck everything

Manually check your programmatically saved results against the live results. Small errors can become systematic errors if you don’t catch them by hand. Choose a reasonable sample size (such as N=100) to ensure that what you’re analyzing is exactly what you think it is.
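A minimal sketch of drawing that sample with pandas (the file names are placeholders): pull a reproducible random sample of 100 rows and save it to its own file for a side-by-side check against the live site.

import pandas as pd

# load the full collected dataset (file name is a placeholder)
df = pd.read_csv("video_metadata.csv")

# draw a reproducible random sample of 100 rows to check by hand
spotcheck = df.sample(n=100, random_state=303)
spotcheck.to_csv("spotcheck_sample.csv", index=False)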

This kind of hand-check is something we did to bulletproof almost every investigation, even if we didn’t publish the results.

Conclusion

These tips are not definitive. If you want to share tips, please make a suggestion via email or GitHub.