Browser Automation

Author

Piotr Sapiezynski and Leon Yin

Published

June 11, 2023

Modified

July 9, 2024

Browser automation is a fundamental web scraping technique for building your own dataset.

It is essential for investigating personalization, working with rendered elements, and waiting for scripts and code to execute on a web page.

However, browser automation can be resource intensive and slow compared to other data collection approaches.

👉Click here to jump to the Playwright tutorial.

Intro

If you’ve tried to buy concert tickets to a popular act lately, you’ve probably watched in horror as the blue “available” seats evaporate before your eyes the instant tickets are released. Part of that may be pure ✨star power✨, but more than likely, bots were programmed to buy tickets to be resold at a premium.

These bots are programmed to act like an eager fan: waiting in the queue, selecting a seat, and paying for the show. These tasks can all be executed using browser automation.

Browser automation is used to programmatically interact with web applications.

The most frequent use case for browser automation is to run tests on websites by simulating user behavior (mouse clicks, scrolling, and filling out forms). This is routine and invisible work that you wouldn’t remember, unlike seeing your dream of crowd surfing with your favorite musician disappear thanks to ticket-buying bots.

But browser automation has another use, one which may make your dreams come true: web scraping.

Browser automation isn’t always the best solution for building a dataset, but it is necessary when you need to:

  1. Analyze rendered HTML: see what’s on a website as a user would.
  2. Simulate user behavior: experiment with personalization and experience a website as a user would.
  3. Trigger event execution: retrieve responses to JavaScript or network requests following an action.

These reasons are often interrelated. We will walk through case studies (below) that highlight at least one of these strengths, as well as why browser automation was a necessary choice.

Some popular browser automation tools are Puppeteer, Playwright, and Selenium.

Headless Browsing

Browser automation can be executed in a “headless” state by some tools.

This doesn’t mean that the browser is a ghost or anything like that; it just means that the user interface is not visible.

One benefit of headless browsing is that it is less resource intensive. However, there is no visibility into what the browser is doing, which makes headless scrapers difficult to debug.

Luckily, some browser automation tools (such as Playwright) allow you to toggle headless browsing on and off. Puppeteer runs headless by default, but it can also be launched with a visible browser window.

If you’re new to browser automation, we suggest not using headless browsing off the bat. Instead try headed Playwright, which is exactly what we’ll do in the tutorial below (see the same tutorial in Selenium here).
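As a rough sketch of what toggling looks like in Playwright, the helper below switches between a visible and a headless Firefox with one flag. The names launch_options and open_page are made up for this illustration, and the 250 ms slow_mo delay is an arbitrary choice.

```python
def launch_options(debug: bool) -> dict:
    """Builds keyword arguments for Playwright's launch() call (hypothetical helper)."""
    if debug:
        # Visible window, with every action slowed down by 250 ms so you can follow along
        return {"headless": False, "slow_mo": 250}
    # No visible window: lighter on resources, but harder to debug
    return {"headless": True}


async def open_page(url: str, debug: bool = True) -> str:
    """Opens `url` in Firefox and returns the page title."""
    # Imported here so launch_options() can be used even where Playwright is not installed
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        browser = await playwright.firefox.launch(**launch_options(debug))
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return title
```

In a notebook you would call this as await open_page("https://example.com"), and in a script as asyncio.run(open_page("https://example.com")). Start with debug=True and switch to debug=False once the scraper behaves.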

Using Playwright to automate browsing TikTok’s “For You” page for food videos.

Case Studies

Case Study 2: Deanonymizing Google’s Ad Network

Google ad sellers offer space on websites like virtual billboards, and are compensated by Google after an ad is shown. However, unlike physical ad sellers, almost all of the ~1.3 million ad sellers on Google are anonymous. To limit transparency further, multiple websites and apps can be monetized by the same seller, and it’s not clear which websites are part of Google’s ad network in the first place.

As a result, advertisers and the public do not know who is making money from Google ads. Fortunately, watchdog groups, industry analysts, and reporters have developed methods to hold Google accountable for this oversight.

The methods boil down to triggering a JavaScript function that sends a request to Google to show an ad on a loaded web page. Importantly, the request reveals the seller ID used to monetize the website displaying the ad, and in doing so, links the seller ID to the website.

In 2022, reporters from ProPublica used Playwright to automate this process to visit 7 million websites and deanonymize over 900,000 Google ad sellers. Their investigation found some websites were able to monetize advertisements, despite breaking Google’s policies.

ProPublica’s investigation used browser automation tools to trigger event execution to successfully load ads. Often, this required waiting for a page to fully render, scrolling down to potential ad space, and browsing multiple pages. The reporters used a combination of network requests, rendered HTML, and cross-referencing screenshots to confirm that each website monetized ads from Google’s ad network.

Browser automation can help you trawl for clues, especially when it comes to looking for specific network requests sent to a central player by many different websites.
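To make the idea of watching for specific network requests concrete, here is a minimal sketch of how a Playwright page can be asked to record every request sent to one domain. This is not ProPublica’s code: the function names are made up, and you would supply the domain you care about yourself.

```python
from urllib.parse import urlparse


def is_request_to(url: str, domain: str) -> bool:
    """True if `url` points at `domain` or one of its subdomains."""
    hostname = urlparse(url).hostname or ""
    return hostname == domain or hostname.endswith("." + domain)


def record_requests(page, domain: str) -> list:
    """Registers a listener on a Playwright page and returns the list it fills."""
    matches = []

    def on_request(request):
        # Playwright passes a Request object; request.url is the full URL
        if is_request_to(request.url, domain):
            matches.append(request.url)

    page.on("request", on_request)
    return matches
```

You would call record_requests(page, "example.com") before page.goto(), then inspect the returned list after the page has loaded and you have scrolled around.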

Case Study 3: TikTok Personalization

An investigation conducted by the Wall Street Journal, “Inside TikTok’s Algorithm,” found that even when a user does not like, share, or follow any creators, TikTok still personalizes the “For You” page based on how long they watch the recommended videos.

In particular, the WSJ investigation found that users who watch content related to depression and skip other content are soon presented with mental health content and little else. Importantly, this effect happened even when the users did not explicitly like or share any videos, nor did they follow any creators.

You can watch the WSJ’s video showing how they mimic user behavior to study the effects of personalization:

Source: WSJ

This investigation was possible only after simulating user behavior and triggering personalization from TikTok’s “For You” recommendations.

Tutorial

In the hands-on tutorial we will attempt to study personalization on TikTok with a mock experiment.

We’re going to teach you the basics of browser automation in Playwright, but the techniques we’ll discuss could be used to study any other website using any other automation tool.

We will try to replicate elements of the WSJ investigation and see if we can trigger a personalized “For You” page. Although the WSJ ran their investigation using an Android on a Raspberry Pi, we will try our luck with something you can run locally on a personal computer using browser automation.

In this tutorial we’ll use Playwright to watch TikTok videos where the description mentions keywords of our choosing, while skipping all others. In doing so, you will learn practical skills such as:

  • Setting up the automated browser in Python
  • Finding particular elements on the screen, extracting their content, and interacting with them
  • Scrolling
  • Taking screenshots

Importantly, we’ll be watching videos with lighter topics than depression (the example chosen in the WSJ investigation).

Pro tip: Minimizing harms

When developing the data collection methodology for an audit or investigation, start with low-stakes themes. This minimizes your exposure to harmful content and avoids unnecessarily boosting its popularity.

Step 1: Installing Playwright

Playwright will take care of finding and installing the browser binary that’s suitable for your operating system. Such setup is much more straightforward than Selenium, which requires the user to manage each browser version.

The first line below installs the Python library, the second line installs the browser binaries.

!pip install playwright
!playwright install
Requirement already satisfied: playwright in /Users/lyin72/miniconda3/lib/python3.11/site-packages (1.44.0)
Requirement already satisfied: greenlet==3.0.3 in /Users/lyin72/miniconda3/lib/python3.11/site-packages (from playwright) (3.0.3)
Requirement already satisfied: pyee==11.1.0 in /Users/lyin72/miniconda3/lib/python3.11/site-packages (from playwright) (11.1.0)
Requirement already satisfied: typing-extensions in /Users/lyin72/miniconda3/lib/python3.11/site-packages (from pyee==11.1.0->playwright) (4.7.1)

Let’s see if the installation worked correctly! Run the cell below to open a new Firefox window. We’re going to use Firefox in this tutorial because Playwright’s default browser (Chromium) does not support video playback in TikTok’s format.

from playwright.async_api import async_playwright

# Start the browser
playwright = await async_playwright().start()
browser = await playwright.firefox.launch(headless=False)

# Create a new browser window
page = await browser.new_page()

# Open the default tiktok For You page
await page.goto("https://www.tiktok.com/foryou")
<Response url='https://www.tiktok.com/foryou' request=<Request url='https://www.tiktok.com/foryou' method='GET'>>

What is await? We’re running Playwright asynchronously, which is the only way to be compatible with Jupyter Notebooks. You can run Playwright synchronously (aka in regular Python) as a script, but not as a notebook. In practice you’ll want to tinker and iterate, so a notebook is preferred.

We explicitly put await in front of each line of Playwright code so that each command finishes before the next one starts. Without await, the call only returns a coroutine object and the command itself never runs.
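If await is new to you, this small sketch shows the behavior without opening a browser. It uses only Python’s built-in asyncio module, and the step function is a stand-in we made up to play the role of a Playwright command.

```python
import asyncio

log = []


async def step(name: str) -> str:
    """Stands in for a Playwright command such as page.goto()."""
    await asyncio.sleep(0.01)
    log.append(name)
    return name


async def main():
    # Without await nothing runs: we only get a coroutine object back
    pending = step("never awaited")
    print(type(pending).__name__, log)  # coroutine []
    pending.close()  # discard it so Python does not warn about it

    # With await, each command finishes before the next one starts
    await step("open page")
    await step("click button")


asyncio.run(main())
print(log)  # ['open page', 'click button']
```

This version is written as a script. Inside a notebook, where an event loop is already running, you would write await main() instead of asyncio.run(main()).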

If everything works fine and you have the browser with TikTok open, our setup is complete!

Unfortunately, depending on your system this setup might not work:

  • It will not work at all in Google Colab - you need to run this on your own machine
  • It might not work on a Windows machine. If you’re using Windows, you will need to downgrade your ipykernel to a version that supports Playwright. Uncomment the next code cell and run it, then restart this notebook:

## Only uncomment and run the next line if you're using windows and the cell above did not give you an open browser window.
#!pip install ipykernel==6.28.0

Step 2: Finding elements on page and interacting with them

We will perform our mock experiment without logging in (but we will also learn how to create multiple accounts and how to log in later).

Press the arrow down button on your keyboard a few times until a dialog pops up asking you to log in:

Instead of logging in, our first interaction will be to click the “Continue as guest” button.

Playwright has built-in tools called Locators to find and interact with elements on the page. One helpful locator is based on the text of a button you want to press. We can use the get_by_text locator to find the button that says “Continue as guest” on the page and click it:

await page.get_by_text("Continue as guest").click()

If Playwright successfully finds the button with the text you specified, it will be clicked. However, if Playwright does not find the element (because the element hasn’t loaded yet or you misspelled the text), you will get a TimeoutError.

This error is thrown because Playwright waits a short period of time for an element to appear on screen. The default is 30,000 milliseconds (30 seconds). You can specify a different timeout as an argument to click(), for example 1,000 milliseconds (1 second):

await page.get_by_text("Continue as guest").click(timeout = 1000)

Did you notice a change on the page? Congratulations! You just automated the browser to click something.
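In a longer script you may not want a missing button to stop everything. One possible pattern, sketched below, is to catch Playwright’s TimeoutError and report whether the click happened. The helper name click_if_present is ours, not part of Playwright.

```python
async def click_if_present(page, text: str, timeout_ms: int = 1000) -> bool:
    """Clicks the element containing `text`; returns False if it never shows up."""
    # Imported here so this sketch can be defined even where Playwright is not installed
    from playwright.async_api import TimeoutError as PlaywrightTimeoutError

    try:
        await page.get_by_text(text).click(timeout=timeout_ms)
        return True
    except PlaywrightTimeoutError:
        # The element did not appear within `timeout_ms` milliseconds
        return False
```

For example, clicked = await click_if_present(page, "Continue as guest") would give you True or False instead of an error.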

Step 4: Scrolling

We now have a browser instance open and displaying the For You page. Let’s scroll through the videos.

If you are a real person who (for whatever reason) visits TikTok on their computer, you could press the down arrow key on the keyboard to see new videos. We will do that programmatically using a virtual keyboard instead:

await page.keyboard.press("ArrowDown")

When you run the cell above you will see that your browser scrolls down to the next video.
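To scroll several times in a row, one option is to wrap the key press in a small helper that pauses between presses. This is only a sketch: the name scroll_down and the one-second default pause are our own choices, not something TikTok or Playwright requires.

```python
async def scroll_down(page, times: int = 3, pause_ms: int = 1000) -> int:
    """Presses the down arrow `times` times, pausing after each press."""
    presses = 0
    for _ in range(times):
        await page.keyboard.press("ArrowDown")
        # Give the next video a moment to load before pressing again
        await page.wait_for_timeout(pause_ms)
        presses += 1
    return presses
```

Calling await scroll_down(page, times=5) would move five videos down the feed, assuming every key press registers.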

Step 5: Finding TikTok videos on the page

Now that we have the building blocks for swiping through the For You page, let’s view the recommended TikTok videos and parse out information (called metadata) for each video.

When we asked Playwright to search for the “Continue as guest” button (Step 2), we used a locator function based on text. Playwright has other locator functions to find what you’re looking for:

  • get_by_role() to locate by explicit and implicit accessibility attributes.
  • get_by_text() to locate by text content.
  • get_by_label() to locate by the associated label’s text.
  • get_by_placeholder() to locate an input by placeholder.
  • get_by_alt_text() to locate an element, usually image, by its text alternative.
  • get_by_title() to locate an element by its title attribute.
  • get_by_test_id() to locate an element based on its data-testid attribute (other attributes can be configured).

The developers suggest using these recommended locators. This will make your code more legible and reliable. Other browser automation tools have comparable functions.

Unfortunately for us, none of these will work for our task. If you look at the source code for TikTok videos, you won’t find any of these locators useful. However, there are fields that we can use to identify videos another way.

  1. Right click on the white space around a TikTok video and choose “Inspect”.
  2. Hover your mouse over the surrounding <div> elements and observe the highlighted elements on the page to see which ones correspond to each TikTok video.
  3. You will see that each video is in a separate <div> container but each of these containers has the same data attribute (data-e2e) with the value of recommend-list-item-container.
  4. We can now use this to find all videos on page (you can search by attribute value using square brackets):

Playwright has a generic locator function that accepts both XPath and CSS selectors.

The same <div> can be identified in XPath as //div[@data-e2e="recommend-list-item-container"] or as a CSS selector as [data-e2e="recommend-list-item-container"].

videos = await page.locator('//div[@data-e2e="recommend-list-item-container"]').all()
videos
[<Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=0'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=1'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=2'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=3'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=4'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=5'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=6'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=7'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=8'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=9'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=10'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=11'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=12'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=13'>,
 <Locator frame=<Frame name= url='https://www.tiktok.com/foryou'> selector='//div[@data-e2e="recommend-list-item-container"] >> nth=14'>]

When we searched for the “Continue as guest” button we didn’t need to use the all method because we were only expecting one element to match our locator.

Now we’re trying to find all videos on page, so we will chain the locator and all functions to return a full list of elements that match the locator.

Step 6: Parsing TikTok metadata

With all the TikTok videos on the page, let’s extract the description from each. Later, we’ll use this metadata to decide whether to watch a video, or to skip it. The process of extracting a specific field from a webpage is “parsing”.

  1. Pick any description, right click, “Inspect”.
  2. Let’s locate the <div> that contains the whole description (including any hashtags) and make note of its data attribute.
  3. Now let’s write the code that extracts the description from a single video. You can get the text of any located element by calling the inner_text function.
for video in videos:
    print(await video.locator('//div[@data-e2e="video-desc"]').inner_text())
Unbelievable fish trap technique #fish  #fishing  #fishinglife  #wild  #wildlife  #nature  #asmr  #river  #fyp   
Super winner! Whoever clears the board first wins👌Sling Puck Game #viral  #viralvideo  #2024 #satisfying 
Head on to your nearest retail store and spot the 900g promo pack! Prepare your child for all school age challenges with NIDO!*Applicable on select retail stores nationwide.
#india  #streetfood  #food  #fpy  #foryou  #longervideos  
Jajaja YO NO SOY LA QUE REACCIONA, es una nena que se muere por el juhador  #richardRios  #colombia  
Every car needs this!🤯 #lifehack  #cars  #diy  #sports  
This was insane 🫣🤣
How North Korea is Now Impossible to Escape 🇰🇵🇰🇷 #northkorea  #korea  #southkorea  #northkoreafact  #northkorealife  #border  #maps  #geography  #learn  #history  #geotok  #historytok  #funfacts  #fyp 
Geeze im tired of hurting 
#momsoftiktok  #baseketball  #nba  #tiktok  #fyp  #foryou  
should dweeb count? 🤣 #trivia  
Apple watch hidden camera

Nah fam. I’m not for this. I had to get back into line so I could record this. #ai  #artificialintelligence  #wendys  
Part 1#foryou  #viral  

Note: We previously searched for elements using page.locator(). That allowed us to search the whole page. Here we’re using a locator within a previously located element: video.locator(). This allows us to access attributes and elements within an element on the page, rather than on the whole page.
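Here is a sketch of the same scoped search written with CSS selectors instead of XPath. It checks count() before reading the text, so a video without a description does not leave us waiting for a timeout. The function name get_descriptions is ours, and returning an empty string for missing descriptions is just one possible choice.

```python
CONTAINER = '[data-e2e="recommend-list-item-container"]'
DESCRIPTION = '[data-e2e="video-desc"]'


async def get_descriptions(page) -> list:
    """Returns one description per loaded video ('' when a video has none)."""
    descriptions = []
    for video in await page.locator(CONTAINER).all():
        # Search inside this video only, not the whole page
        description = video.locator(DESCRIPTION)
        if await description.count() > 0:
            descriptions.append(await description.first.inner_text())
        else:
            # Checking count() first avoids waiting for an element that is not there
            descriptions.append("")
    return descriptions
```

With the browser from the earlier cells still open, await get_descriptions(page) should return one string per loaded video.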

Step 7: Finding the TikTok video that’s currently playing

We know how to scroll to the next video, and we know how to find all videos that are loaded. At this point we could either:

  1. Assume that at the beginning, the 0th video is playing, and then every time we press arrow down, the next video is being displayed
  2. Or, assume that the arrow down does not always work and each time verify which video is actually playing

The problem with the first approach is that even if scrolling fails just once, our experiment will be compromised (after it happens we will be watching and skipping different videos than the ones our script thinks it is handling). This is why we will go with the second approach and verify which video is actually playing. Back to our favorite tool: inspect element!

When you right click on the playing video, you will see that instead of our familiar UI we get a custom TikTok menu, so that won’t work. Try right-clicking on the description of the video instead, then hovering over different elements in the inspector and expanding the one that highlights the video in the browser. Dig deep until you get to the div that only contains the video.

Still in the inspector, try looking at the next video down. You will see that the div that contained the video is missing and there is no element with the tag name video. That’s how we can tell whether a video is currently playing: its div will contain a video element, which we can find with a locator for the video tag:

for video in videos:
    # let's get the description of each video using the method we already know
    description = await video.locator('//div[@data-e2e="video-desc"]').inner_text()

    # now let's count all the <video> elements within. If there is one, that's the one that's playing!
    if await video.locator('video').count() > 0:
        playing = 'playing'
    else:
        playing = 'not playing'
    print(playing, description)
not playing Unbelievable fish trap technique #fish  #fishing  #fishinglife  #wild  #wildlife  #nature  #asmr  #river  #fyp   
not playing Super winner! Whoever clears the board first wins👌Sling Puck Game #viral  #viralvideo  #2024 #satisfying 
playing Head on to your nearest retail store and spot the 900g promo pack! Prepare your child for all school age challenges with NIDO!*Applicable on select retail stores nationwide.
not playing #india  #streetfood  #food  #fpy  #foryou  #longervideos  
not playing Jajaja YO NO SOY LA QUE REACCIONA, es una nena que se muere por el juhador  #richardRios  #colombia  

Step 8: Taking screenshots and saving page sources

You might want to save a screenshot to help debug your scraper or provide artifacts you can present alongside your findings. Playwright allows you to take screenshots of the whole screen, or just a particular element:

# take a screenshot of the whole browser
await page.screenshot(path="screenshot.png")

# take a screenshot of just one video
screenshot = await video.screenshot(path="video_screenshot.png")

In the spirit of bringing receipts, you can also save the entire webpage as an HTML file to parse it later.

# save the source of the entire page
page_html = await page.content()
with open('webpage.html', 'w', encoding='utf-8') as output:
    output.write(page_html)
Pro tip: Keep these records to sanity check your results

Taking a screenshot and saving the page source is a useful practice for checking your work. Use the two to cross-reference what was visible in the browser and whatever data you end up extracting during the parsing step.
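As one way to do that cross-referencing, the sketch below re-extracts descriptions from a saved HTML file using only Python’s standard library, so you can compare them with what Playwright returned. It assumes the saved page still uses the data-e2e="video-desc" attribute we saw in the inspector, and the sample string is made up for illustration.

```python
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Collects the text inside every <div data-e2e="video-desc">."""

    def __init__(self):
        super().__init__()
        self.descriptions = []
        self._depth = 0  # how many <div>s deep we are inside a description

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            # A nested <div> inside a description we are already collecting
            self._depth += 1
        elif dict(attrs).get("data-e2e") == "video-desc":
            self._depth = 1
            self.descriptions.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.descriptions[-1] += data


def parse_descriptions(html: str) -> list:
    parser = DescriptionParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.descriptions]


# Made-up HTML for illustration only
sample = '<div data-e2e="video-desc"><span>Cheeseburger</span> <a>#food</a></div><div>other</div>'
print(parse_descriptions(sample))  # ['Cheeseburger #food']
```

To run it on the file saved above, read webpage.html into a string (opening it with encoding="utf-8") and pass that string to parse_descriptions.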

Let’s close the browser for now, and kick this workflow up a notch.

await browser.close()

Step 9: Putting it all together

At this point, we can read the description of TikTok videos and navigate the “For You” page.

That’s most of the setup we need to try our mock experiment:
let’s watch all TikTok videos that mention food in the description and skip videos that do not mention food.

After one hundred videos, we will see whether we are served videos from FoodTok more frequently than other topics.
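One crude way to check this, assuming we end up with a list of True/False decisions (one per video, like the decisions list collected later in this tutorial), is to compare the share of matching videos early in the session with the share late in the session. The function names and the example list below are made up for illustration, and a higher share in the second half would be a hint worth investigating, not proof of personalization.

```python
def share_of_hits(decisions: list) -> float:
    """Fraction of videos that matched a keyword (0.0 for an empty list)."""
    if not decisions:
        return 0.0
    return sum(decisions) / len(decisions)


def compare_halves(decisions: list) -> tuple:
    """Returns the share of matching videos in the first and second half."""
    middle = len(decisions) // 2
    return share_of_hits(decisions[:middle]), share_of_hits(decisions[middle:])


# Made-up decisions for illustration only, not real results
example = [False, False, True, False, True, True, False, True]
early, late = compare_halves(example)
print(f"first half: {early:.0%}, second half: {late:.0%}")
# first half: 25%, second half: 75%
```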

Pro tip: Use functions!

So far we wrote code to open the browser, close the dialog, and find videos as separate cells in the notebook. We could copy that code over here to use it, but it will be much easier to read and maintain the code if we write clean, well-documented functions with descriptive names.

Note: because we’re running asynchronous Playwright we are using async def to define functions.

from playwright.async_api import async_playwright, expect

async def open_browser():
    """
    Starts the automated browser and opens a new window
    """
    # Start the browser
    playwright = await async_playwright().start()
    browser = await playwright.firefox.launch(headless=False)

    # Create a new browser window
    page = await browser.new_page()

    return browser, page


async def close_login_dialog(page):
    """
    Checks if the login dialog is present. If so, it "Continues as guest"
    """
    # how many elements with "Continue as guest" do we see?
    if await page.get_by_text('Continue as guest').count() > 0:
        # there is one, let's click it!
        await page.get_by_text('Continue as guest').click()
    else:
        # there is none, we can continue scrolling
        return
    
async def find_videos(page):
    """
    Finds all tiktoks loaded in the browser
    """
    videos = await page.locator('//div[@data-e2e="recommend-list-item-container"]').all()
    return videos

async def get_description(video):
    """
    Extracts the video description along with any hashtags
    """
    try:
        description = await video.locator('//div[@data-e2e="video-desc"]').inner_text()
    except Exception:
        # if the description is missing, just get any text from the video
        description = await video.inner_text()
    return description

async def get_current(videos):
    """
    Given the list of videos it returns the one that's currently playing
    """
    for video in videos:
        if await video.locator('video').count() > 0:
            # this one has the video, we can return it and that ends the function.
            return video
    
    return None

def is_target_video(description, keywords):
    """
    Looks for keywords in the given description.
    NOTE: only looks for the substring, i.e. a partial match is enough.
    Returns `True` if there are any or `False` when there are none.
    """
    # check if any of the keywords is in the description
    for keyword in keywords:
        if keyword in description:
            # we have a video of interest, let's watch it 
            return True
    
    # if we're still here it means no keywords were found
    return False

async def screenshot(video, filename="screenshot.png"):
    """
    Saves a screenshot of a given video to a specified file
    """
    await video.screenshot(path=filename)
    
async def save_source(page, filename='webpage.html'):
    """
    Saves the browser HTML to a file
    """
    page_html = await page.content()
    with open(filename, 'w', encoding='utf-8') as output:
        output.write(page_html)
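One thing to keep in mind about is_target_video: Python’s in check is case-sensitive, so the keyword “pizza” would not match a description that says “PIZZA”. If that matters for your experiment, a case-insensitive variant could look like the sketch below. The name matches_keywords is ours, and this is an optional alternative, not a replacement the rest of the tutorial depends on.

```python
def matches_keywords(description: str, keywords: list) -> bool:
    """Case-insensitive version of a substring keyword check."""
    text = description.casefold()
    return any(keyword.casefold() in text for keyword in keywords)


print(matches_keywords("Best PIZZA in town", ["pizza"]))      # True
print(matches_keywords("Cheeseburger Egg Rolls", ["pizza"]))  # False
```

Like the original, this is still a partial match, so a keyword can also match inside a longer word or hashtag.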

Ok, with that out of the way, let’s set up our first data collection.

Let’s make a directory to save screenshots. We will save screenshots here whenever we find a video related to food.

import os

os.makedirs('data/screenshots/', exist_ok=True)
browser, page = await open_browser()

# Open the default tiktok For You page
await page.goto("https://www.tiktok.com/foryou")

await expect(page.locator('video').first).to_be_visible()
import time

# if the description has any one of these words, we will watch the video
keywords = ['food', 'dish', 'cook', 'pizza', 'recipe', 'mukbang', 'dinner', 'foodie', 'restaurant']

# this is where we will store the decisions we make
decisions = []

# open a browser, and go to TikTok's For You page.
browser, page = await open_browser()

# Open the default tiktok For You page
await page.goto("https://www.tiktok.com/foryou")

# Let's wait for the first video to load before we start scrolling
await expect(page.locator('video').first).to_be_visible()

for tiktok_index in range(0, 100):
    # make sure to dismiss the login window
    await close_login_dialog(page)
    
    # get all videos
    tiktoks = await find_videos(page)
    
    # the current tiktok is the one that's currently showing the video player
    current_video = await get_current(tiktoks)
    
    if current_video is None:
        print('no more videos')
        break
              
    # read the description of the video
    description = await get_description(current_video)
    
    # categorize the video as relevant to `keywords` or not.
    contains_keyword = is_target_video(description, keywords)
    decisions.append(contains_keyword)
            
    print(tiktok_index, contains_keyword, description)
    
    if contains_keyword:
        # we have a video of interest, let's take a screenshot
        ## here we declare the files we'll save. they're named according to their order.
        fn_screenshot = f"data/screenshots/screenshot_{tiktok_index:05}.png"
        fn_page_source = fn_screenshot.replace('.png', '.html')
        await screenshot(current_video, fn_screenshot)
        await save_source(page, fn_page_source)
        # and now watch it for 30 seconds
        await page.wait_for_timeout(30000)
    
    # move to the next video
    await page.keyboard.press("ArrowDown")
    await page.wait_for_timeout(1000)
    
    
await browser.close()
0 False Super winner! Whoever clears the board first wins👌Sling Puck Game #viral  #viralvideo  #2024 #satisfying 
1 True #india  #streetfood  #food  #fpy  #foryou  #longervideos  
2 False Jajaja YO NO SOY LA QUE REACCIONA, es una nena que se muere por el juhador  #richardRios  #colombia  
3 False No sean esa persona 🥺 
4 False 02.07. Happy birthday #happybirthday  #nohea  #asmr  #satisfyingvideo  
5 False 
6 False Every car needs this!🤯 #lifehack  #cars  #diy  #sports  
7 False This was insane 🫣🤣
8 False #momsoftiktok  #baseketball  #nba  #tiktok  #fyp  #foryou  
9 False This is the story of Nasim Aghdam #youtube  #nasim  #truecrime  
10 False Geeze im tired of hurting 
11 False How North Korea is Now Impossible to Escape 🇰🇵🇰🇷 #northkorea  #korea  #southkorea  #northkoreafact  #northkorealife  #border  #maps  #geography  #learn  #history  #geotok  #historytok  #funfacts  #fyp 
12 False 
13 False Apple watch hidden camera
14 False 34 years later.. someone please tell me! #nostalgia  #90skids  #90stoys  #thirties  #foryou  #nostalgic  
15 False Part 1. #fyp  #foryou  #movie  
16 False 
17 False This is sand leveling, or topdressing. It creates a more level surface for your lawn by filling in low areas. I use masonry sand since I have Bermuda grass. If you have a cool season grass type, a screened topsoil may be better. I like to put a little fertilizer down prior to, and after to ensure bermuda is growing aggressively through the sand. I water every other day for a week and within 2 weeks it should be almost fully recovered. Part 2 update coming soon #lawn  #lawncare  #lawnleveling  #topdressing  #diy  #landscaping  #howto  
18 False Try not to laugh #prank  #funny  #funnyvideos  #scare  #scareprank  
19 False #foryoupage  #foryoupageofficiall  #viralvideo  #trending  #fypシ゚viral  #vacation  #bahamas🇧🇸  #vacay  
20 False Bro done messed up ☠️‼️#donpollo  #fyp  #meme #funny  
21 False #greenscreen  maybe one day, you’ll love us. #singlemom  
22 False Hint: she looks nothing like the dad  #trending  #guessthedad  #momtok  #babytok  #daddysgirl  #fypage  
23 False 💔 95 - on our way back to Michigan 💔 - I once saw a truck driver run into a mountain to avoid hitting other cars 💔#wherethewildthingsare  #trucker  #trucktok🔥  #truckdrivers  #lukecombs  
24 False Cheeseburger Egg Rolls 
25 False Broke character at the end 😭 #fredbeyer  
26 False Part 1#foryou  #viral  
27 False Only at 511 😂😂😂 #ofcourse  #crackerbarrel  
28 False 
29 False Here’s a quick tutorial to setting up your Loop Lasso 🤠  #physics  #technology  #looplasso  
30 False #animalattacks  #truestory  #scarystories  #tiktokstoryteller  #fyp  #pitbull  #pitbullattacksurvivor  #dogattacks  #fyp  
31 False Jesus loves you 🙏 #jesus  #jesuschrist  #christian  #christianitytiktok  #God  
32 False If you’ve ever considered getting a German Shepard… let this be your sign. I’m literally shaking 
33 False Life is weird, scary, and unpredictable sometimes. Here’s to starting over. I’m free now 🤍 Couldn’t have done it without my chosen family #domesticabuseawareness  #marriage  #divorce  #startingover  
34 False 🆘 Plz Help Me! OMG! Hilarious Encounter with a Regal 🐴 | Tourist's Unexpected Adventure 😂
35 False #cute  
36 False PRAUM❤️ #prom2024  #prom  #dress  #fyp  #reddress  #windsor  #foryou  
37 False #fyp  
38 False Nah fam. I’m not for this. I had to get back into line so I could record this. #ai  #artificialintelligence  #wendys  
39 False Chinese culture inspired version of the asoka trend 👏I’m so proud of this one  Let me know how many transitions you see in here, cuz i lost track hahhaa   Song: my new swag by Vava #transitions  #transitionbts  #grwm  #asoka  
40 False on to the next project #diyqueen  #diy  #diyproject  #asmr  #asmrsounds  #asmrvideo  #asmrtiktoks  #fyp  #cuttingglass  #glasscutting  #cutglass  
41 False #copspolice  #cop  #cops  #police  #policeofficer  #copsofftiktok  #copsontiktok  #copstiktok  #foryou  #fyp  #Trending  #fyp  #sheriff  #Monment  #viralvideo  #coptiktok  #trendingvideo  #Viral  
42 False 
43 False Grandma and puppy helps police find grandson #police  #policeofficer  #policeoftiktok  
44 False How Bullet Proof Glass Works 🤔
45 False I swear he ONLY loves mom 🥰 #funny  #viral  #dogsoftiktok  
46 False When troops dont recognize their leadership 🤣😭  #fypシ  #foryoupage  #specialforces  #military  #soldier  #militaryedit  @wooh_man  
47 False Uncovered a really cool part of my family history yesterday and I feel so lucky to have seen another beautiful side to my late grandfather 🤍 #army  #airforce  #vintagevibes  #vintagestyle  #familyhistory  #historical  #ww2  #vso  #aarp  #vfw  #veterans  
48 False 
49 False Vietnamese billionaire Truong My Lan defrauded the bank and said the remaining 26,886,338.30 was at sea. #money  #bank  #billions  #truongmylan  #fyp  #foryou  
50 False #su57  #russia  
51 False I didn't even have a donut.....that day. 🍩 #copsoftiktok  #speeding  #trafficstop  #donuts  #speed  #mean  
52 False I take you to the candy shop #horse  #candyshop  
53 False No murders today. Just a little blackmail 👍 #airbnb  #truecrime  #blackmail  #memphis  
54 False I still do not understand 
55 False 14 y/o muslims homemade clock mistaken for a bomb at school #bomb  #clock  #student  #school  #funny  #islam  #texas  #trending  #viral  #foryou  #discrimination  #teen  #art  
56 False Qatar Airbus A380 #aviation  
57 False He ordered a bogo sandwich in the app but they won’t fulfill the order because they said they didn’t participate in that special offer and he got mad #customerservice  #subway  #fypage  #foryou  #fypシ゚viral🖤video  
58 False I think it was a tie tbh. Who yall got during this fight?? #ivory  #nowthatstv  #liddymechelle  #fyp  #onmysoul  #nowthatstvedit  #viral  #nttv  #fyp  #makemefamous  #blowthisup  #foryou  #vsp  #makemefamous  #liddyvsivory  #capcut  
59 False Tip  #soilph  #soiltester  #soil  #soils  #soilheath  #plant  #plantsoftiktok  #plants  #planting  #plantlife  #plantlover  #garden  #gardening  #gardenlife  #gardens  #flower  #flowers  #tips  #tip  #gardenlove  
60 False This is hilarious 😂 #trump2024  #maga  #donaldtrump  
61 False For Real V8 Experts 💚💛❤️          SHARE your result in the comment!
62 False #CapCut  #story  #storytime  #utah  #scary  #creepy  #florida  #fyp  
63 False Einstein as a kid? 🤨#einstein  #smart  #kid  #math  #hack  #meme  #cool  #school  #student  #education  #bigbraintutor  #studytok  #learnontiktok  #learn  #save  
64 False #question  What lane is for slow traffic Who knows ? #driving  #traffic  #roadrage  #truck  #foryoupage  #highway  
65 True 😂😂#viral  #funny #xybca  #foryou  #fypage  #fypシ゚viral  #fyppppppppppppppppppppppp  #manger  #bk  #latenightsnack  #food  
66 False #sad  
67 False #dementia  #funny  #fyppppppppppppppppppppppp  
68 False just another day in Chicago #chicago  #foryoupage  #fyp  
69 False There are no fingerprints, and no more scratches. When applying a mobile phone screen protector, choose Magic John screen protector that’s been tempered twice.#magicjohn  #newyork  #california  #tiktokmademebuyit  #tiktokshop  #iphone  #losangeles  #screenprotector  #usa_tiktok  
70 False i am here🥰#fyp  #foryou  #real  #relate  #relatable  #him  #stalker  #viral  #viral?  #inyourcloset  
71 False should dweeb count? 🤣 #trivia  
72 False This was just the beginning 👹 #terrelljubilee  #terrelltexas  #carnival  #scammers  #shennanigins  #yellow 
73 False NBA Best Editing Special Effects #NBA  #basketball  #sports  #fyp  @TikTok  @NBA  
74 False The Strangers just busted through the wall at the Chapter 1 Premiere🪓 #thestrangers  @Lionsgate   
75 False #walkingwithlions  #lions  #travel  #southafrica  
76 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
77 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
78 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
79 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
80 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
81 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
82 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
83 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
84 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
85 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
86 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
87 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
88 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
89 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
90 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
91 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
92 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
93 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
94 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
95 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
96 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
97 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
98 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
99 False Oranic Lions Mane from the fully fruiting body grown on mycelium. #lionsmane  #lionsmanemushroom  #memoryloss  #cognitivefunctions  #brain  #fyp  #ttshop  #tiktokmademebuyit  #tiktokshop  
Pro tip: Be careful about keywords

For experiments that use keywords, the choices we make will directly shape our results. In the field, you can mitigate your own predisposition and biases by working with domain experts to curate keyword lists.
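To make this concrete, here is a minimal sketch of how two keyword lists can disagree about the same descriptions. Both lists and the `is_match` helper are hypothetical and are not the lists used in this tutorial; the descriptions are taken from the output above.

```python
def is_match(description, keywords):
    """Return True if any keyword appears in the description (case-insensitive substring match)."""
    text = description.lower()
    return any(keyword.lower() in text for keyword in keywords)


# Hypothetical keyword lists, for illustration only.
narrow_keywords = ["food", "recipe"]
broad_keywords = narrow_keywords + ["cheeseburger", "egg roll", "snack"]

descriptions = [
    "Cheeseburger Egg Rolls",
    "😂😂#viral  #funny #xybca  #foryou  #latenightsnack  #food",
    "Qatar Airbus A380 #aviation",
]

narrow_hits = [is_match(d, narrow_keywords) for d in descriptions]
broad_hits = [is_match(d, broad_keywords) for d in descriptions]

print(narrow_hits)  # [False, True, False]
print(broad_hits)   # [True, True, False]
```

The narrow list misses "Cheeseburger Egg Rolls" because the description names a dish without using a generic food word. A broader list catches it, but substring matching can also over-match, so it is worth spot-checking both hits and misses.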

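import matplotlib.pyplot as plt

# `decisions` holds one True/False value per video, in the order the videos were shown.
# The 'steps' draw style draws flat steps between values instead of sloped lines.
plt.plot(decisions, ds='steps')
plt.xlabel('Video Number')
plt.ylabel('Watched')
plt.yticks([0, 1], ['False', 'True']);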

The figure above shows when, during our 100-video session, we were recommended a video about food (based on our keywords). The x-axis is chronological: the first video displayed is on the left, and the most recent video is on the right. The y-axis, labeled “Watched,” is True or False depending on whether the video’s description matched our food keywords.

Results

You can look back to the data/screenshots folder we created to check whether the videos we watched appear to be food-related.

If the feed were increasingly filled with food videos, we would see more steps up to “True” towards the right of the graph. At least in this session, that does not appear to be the case.
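One way to put a number on "more towards the right" is to split the session into equal windows and compare the share of matched videos in each. The sketch below assumes `decisions` is a list of True/False values like the one plotted above; the `share_by_window` helper and the `example` list are made up for illustration and are not results from our session.

```python
def share_by_window(decisions, window=25):
    """Share of True values in each consecutive window of the session."""
    shares = []
    for start in range(0, len(decisions), window):
        chunk = decisions[start:start + window]
        shares.append(sum(chunk) / len(chunk))
    return shares


# Made-up placeholder data: a feed that fills up with matches near the end.
example = [False] * 40 + [True] + [False] * 34 + [True, False, True, False, False] * 5

print(share_by_window(example))  # [0.0, 0.04, 0.0, 0.4]
```

A rising sequence of shares would be consistent with the feed adapting. A flat sequence, like the one our plot suggests, would not be.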

Does this mean that the WSJ investigation was wrong, or that TikTok stopped personalizing content?

The answer is “No,” for several reasons:

  1. We only scrolled through 100 videos, which is likely too few to observe any effects. Try re-running with a higher number!
  2. When studying personalization, you should use one account per profile and make sure you’re logged in, rather than relying on a fresh browser. So, instead of closing the login dialog, try actually logging in! You already know how to find and click buttons, and Playwright can also type text into text fields.
  3. When you’re not logged in, you will be presented with content from all over the world, in all languages. If your keywords are in just one language, you will miss plenty of target content in other languages.
  4. You should always have a baseline to compare to. In this case, you should probably run two accounts at the same time: one that watches food videos and one that doesn’t. Then you can compare the prevalence of food videos between the two.
  5. The WSJ investigation was run on the mobile app rather than on a desktop browser. Perhaps TikTok’s personalization works differently based on device or operating system.
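The baseline comparison in point 4 can be sketched as follows. This assumes you have one list of True/False decisions per account; the function names, the example lists, and the choice of a one-sided permutation test are our own illustration, not part of the original analysis.

```python
import random


def prevalence(decisions):
    """Fraction of videos in a session that matched the keywords."""
    return sum(decisions) / len(decisions) if decisions else 0.0


def permutation_p_value(treatment, control, n_iter=10_000, seed=0):
    """One-sided permutation test: how often does shuffling the account labels
    produce a difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = prevalence(treatment) - prevalence(control)
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = prevalence(pooled[:n_treatment]) - prevalence(pooled[n_treatment:])
        if diff >= observed:
            hits += 1
    return hits / n_iter


# Made-up placeholder sessions, not real results.
food_account = [True] * 10 + [False] * 90      # account that watches food videos
baseline_account = [True] * 2 + [False] * 98   # account that does not

difference = prevalence(food_account) - prevalence(baseline_account)
p_value = permutation_p_value(food_account, baseline_account)
print(f"difference in prevalence: {difference:.2f}, p-value: {p_value:.3f}")
```

A small p-value suggests the gap between the two accounts is unlikely to come from chance alone. It does not rule out other explanations, such as the two accounts running at different times of day.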

Advanced Usage

Above, we highlighted some ideas to make your investigation or study more robust. Some are methodological choices, while others are technical.

Browser automation also supports more advanced tasks, including:

  • Authenticating in the browser and storing cookies for later use.
  • Intercepting background API calls and combining browser automation with API calls. See selenium-wire for an example.
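As a rough sketch of both ideas, the script below uses Playwright's sync API to save a logged-in session to disk and to collect JSON responses the page requests in the background. It is a sketch under stated assumptions, not a tested recipe for any particular site: the `data/auth` folder, the `/api/` URL fragment, and all function names are hypothetical, and the login itself is done by hand in the browser window. The sync API is meant for a standalone script; inside a notebook you would likely need Playwright's async API instead.

```python
from pathlib import Path

STATE_DIR = Path("data/auth")  # hypothetical folder for saved sessions


def state_file(profile):
    """Path of the saved cookies and local storage for one profile."""
    return STATE_DIR / f"{profile}.json"


def is_json_api_response(url, content_type, url_fragment="/api/"):
    """Keep only JSON responses whose URL contains the (assumed) API fragment."""
    return url_fragment in url and "application/json" in content_type.lower()


def run_session(profile, start_url, url_fragment="/api/"):
    # Imported here so the helpers above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    captured = []

    def on_response(response):
        content_type = response.headers.get("content-type", "")
        if is_json_api_response(response.url, content_type, url_fragment):
            try:
                captured.append({"url": response.url, "data": response.json()})
            except Exception:
                pass  # body unavailable or not valid JSON; skip it

    path = state_file(profile)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        # Reuse a saved session if one exists, otherwise start fresh.
        context = browser.new_context(storage_state=str(path) if path.exists() else None)
        page = context.new_page()
        page.on("response", on_response)
        page.goto(start_url)
        if not path.exists():
            input("Log in in the browser window, then press Enter here...")
            path.parent.mkdir(parents=True, exist_ok=True)
            context.storage_state(path=str(path))  # save cookies for later runs
        page.wait_for_timeout(5000)  # give background requests time to finish
        browser.close()
    return captured
```

Calling `run_session("food", "https://www.tiktok.com")` would open a visible browser, ask you to log in on the first run, and reuse the saved session on later runs. Keep the saved state file private, since it contains your session cookies.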

We may cover some or all of these topics in subsequent tutorials.

Citation

To cite this chapter, please use the following BibTeX entry:

@incollection{inspect2023browser,
  author    = {Sapiezynski, Piotr and Yin, Leon},
  title     = {Browser Automation},
  booktitle = {Inspect Element},
  year      = {2023},
  editor    = {Yin, Leon and Sapiezynski, Piotr},
  note      = {\url{https://inspectelement.org}}
}

Acknowledgements

Thank you to Ruth Talbot and John West for answering questions about their two respective investigations.