Parsing the web with Xpath

Author

Leon Yin

Published

January 4, 2025

Xpath is a language used to query and navigate XML-formatted documents, such as HTML.

It is a useful tool for web scraping, as the syntax is standardized across browsers and web parsing software packages.

For this reason, xpath enables a seamless workflow between exploring live websites in a browser and writing custom software that parses out fields (from static HTML) or interacts with elements (using browser automation).

Although it’s a language of its own, xpath lets you write simple, precise, and generalizable expressions to parse web pages.

Parsing Static Websites in the Browser

Xpath can be used on live web pages that you encounter in your browser. That makes it a convenient tool that can change how you navigate the web.

We’re going to identify all the recent article titles and links from NPR. In your browser, go to our example website: https://text.npr.org/

Next, open Dev Tools by right-clicking anywhere on the page and selecting “Inspect” (or however else you prefer).

How to copy the xpath of an element in Dev Tools.

Select any element in the “Elements” (or “Inspector”) tab, right-click it, and copy its xpath from the “Copy” menu.

The element we selected is an <a> tag with a link and a title that looks like this (note the text will differ for you):

<a class="topic-title" href="/nx-s1-5035272">What is in Project 2025? </a>

The resulting xpath that we copied looks like this:

/html/body/main/div/ul/li[1]/a

What is xpath?

Xpath records hierarchy across a branch of HTML tags. The first tag denotes the starting place and the last tag denotes the destination.

It designates where an element lives in an HTML document (as if you were homing in on a street address from the center of the Earth).

The example above is long and specific to one element on the page.
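To make the hierarchy concrete, here is a minimal sketch in Python using lxml (the same package used later in this tutorial). The HTML snippet is a made-up, simplified stand-in for the NPR page, not its real markup. The sketch follows the copied xpath to one element, then asks lxml to report that element's absolute path.

```python
from lxml import etree

# A made-up, simplified stand-in for the NPR page structure (an assumption).
html = """
<html><body><main><div><ul>
  <li><a class="topic-title" href="/story-1">First headline</a></li>
  <li><a class="topic-title" href="/story-2">Second headline</a></li>
</ul></div></main></body></html>
"""

root = etree.HTML(html)

# Follow the full route from the top of the document to one element.
first_link = root.xpath("/html/body/main/div/ul/li[1]/a")[0]
print(first_link.text)

# Ask lxml for the absolute xpath of that same element.
absolute_path = root.getroottree().getpath(first_link)
print(absolute_path)
```

Changing li[1] to li[2] would lead to the second headline instead, which shows how tightly this kind of path is tied to a single element.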

At its worst, xpath provides directions to one specific destination (for example, the Shake Shack in Madison Square). At its best, xpath provides directions that lead to every Shake Shack.

With a little practice xpath can be both precise and generalizable, providing an elegant way to locate and select elements from web pages.

Here is the other extreme: short and generic.

.//a

This syntax yields the target element mixed with every other element on the page with an <a> tag. Following the Shake Shack analogy, this xpath represents directions to every restaurant on Earth.

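You’ll notice the “.//” before the <a> tag, which denotes a search anywhere beneath the starting point. When the starting point is the top of the document, that means anywhere on the page.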
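The effect of the leading “.//” is easiest to see by changing the starting point. Below is a small sketch using lxml on a made-up page (the markup and the id values are assumptions for illustration): the same expression returns different results depending on which element it is run from.

```python
from lxml import etree

# Made-up page with one menu link and two headline links (an assumption).
html = """
<html><body>
  <div id="menu"><a href="/about">About</a></div>
  <ul id="headlines">
    <li><a class="topic-title" href="/story-1">First headline</a></li>
    <li><a class="topic-title" href="/story-2">Second headline</a></li>
  </ul>
</body></html>
"""

root = etree.HTML(html)

# From the top of the document, ".//a" finds every link, wanted or not.
all_links = root.xpath(".//a")
print(len(all_links))

# From a narrower starting element, ".//a" only searches beneath it.
headlines = root.xpath('.//ul[@id="headlines"]')[0]
links_in_list = headlines.xpath(".//a")
print(len(links_in_list))

# Without the leading dot, "//a" always searches the whole document.
links_from_document = headlines.xpath("//a")
print(len(links_from_document))
```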

My favorite part about xpath is that you can identify and refine an expression in the browser, then use the same xpath in different frameworks to make web parsing a breeze.

The Goldilocks approach is not about specifying the exact route, but rather the distinguishing attributes of the destination.

Identifying the optimal xpath in the browser

Let’s jump into the live website. We’ll try to print the title of each headline of the day.

In Dev Tools, switch over to the “console” tab. This allows us to execute JavaScript on the page.

We’ll use the $x() function to select elements on the page by xpath (“x” for xpath). As a start, type an HTML tag such as an <a> tag: $x('.//a')

The results for just any <a> tag are too general, returning elements that we don’t want. Xpath offers an easy way to specify attributes and other distinguishing features.

Selecting specific attributes

You can use the “@” sign before an attribute name. This allows you to denote specific attribute values: .//a[@href="/nx-s1-5035272"] will look for an <a> tag with an href attribute of “/nx-s1-5035272”.

Better yet, you can simply check for the presence of an attribute without a specific value: .//a[@href].

As with any other attribute, you can also select elements based on class: $x('.//a[@class="topic-title"]')
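The three attribute patterns above can be compared side by side outside the browser. This is a minimal sketch using lxml on a made-up snippet: only the first headline link is taken from the example earlier, and the other elements are assumptions added for contrast.

```python
from lxml import etree

# Only the first headline comes from the article; the rest is made up.
html = """
<html><body>
  <a name="top">Anchor without a link</a>
  <a class="topic-title" href="/nx-s1-5035272">What is in Project 2025?</a>
  <a class="topic-title" href="/story-2">Another headline</a>
  <a class="footer-link" href="/contact">Contact</a>
</body></html>
"""
root = etree.HTML(html)

# An exact attribute value: matches a single link.
exact_href = root.xpath('.//a[@href="/nx-s1-5035272"]')

# Presence of an attribute, whatever its value.
has_href = root.xpath('.//a[@href]')

# A class shared by all the headline links.
by_class = root.xpath('.//a[@class="topic-title"]')

print(len(exact_href), len(has_href), len(by_class))
```

Note that @class="topic-title" compares the whole attribute value, so an element with class="topic-title featured" would not match. In that case, contains(@class, "topic-title") is a looser alternative.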

Xpath practice

If you want to get better at writing xpath, here’s a simple workflow you can use for an element you want to parse.

<a class="topic-title" href="/nx-s1-5248297">

1. Remove the angle brackets < and > and any closing tag: a class="topic-title" href="/nx-s1-5248297"
2. Add the leading path indicator .// to get: .//a class="topic-title" href="/nx-s1-5248297"
3. Wrap everything after the HTML tag in square brackets, following the pattern .//a[ ... ], for example: .//a[ class="topic-title" href="/nx-s1-5248297"]
4. Add @ before each attribute, and the word and between attributes, ending up with: .//a[@class="topic-title" and @href="/nx-s1-5248297"]
5. From here, you can remove overly specific attributes. In the case above, the class is unique enough to isolate news articles.

$x('.//a[@class="topic-title"]')
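The workflow above is mechanical enough to sketch as a few lines of Python. The build_xpath function below is a hypothetical helper written for this illustration, not part of any library. It assumes attribute values contain no double quotes.

```python
def build_xpath(tag, attributes):
    """Hypothetical helper: build './/tag[@a="x" and @b="y"]' from a dict.

    Assumes attribute values do not contain double quotes.
    """
    if not attributes:
        return f".//{tag}"
    # Add @ before each attribute and join the conditions with "and".
    conditions = " and ".join(
        f'@{name}="{value}"' for name, value in attributes.items()
    )
    return f".//{tag}[{conditions}]"


# The fully specific xpath, with every attribute of the element.
specific = build_xpath("a", {"class": "topic-title", "href": "/nx-s1-5248297"})
print(specific)

# After removing the overly specific href attribute.
general = build_xpath("a", {"class": "topic-title"})
print(general)
```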

Text Matching

xpath also allows for text-matching.

Here’s how you can match a link on the page whose text mentions “2025”:

$x('.//a[contains(text(),"2025")]')

To sanity check your results, you can expand the resulting list and click any of the elements. This will highlight the element on the page and shoot you back to the Dev Tools “Elements” tab to view the element.
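One subtlety of text matching is worth a quick sketch. In the snippet below (made-up markup apart from the first link), contains(text(), ...) only checks the first piece of text sitting directly inside the <a> tag, while contains(., ...) checks all of the text inside it, including text in nested tags.

```python
from lxml import etree

# The third link hides "2025" inside a nested <b> tag (made-up example).
html = """
<html><body>
  <a class="topic-title" href="/nx-s1-5035272">What is in Project 2025? </a>
  <a class="topic-title" href="/story-2">Another headline</a>
  <a class="topic-title" href="/story-3">Looking back at <b>2025</b></a>
</body></html>
"""
root = etree.HTML(html)

# text() here means the first text node directly inside each <a>.
own_text = root.xpath('.//a[contains(text(),"2025")]')
own_text_links = [a.get("href") for a in own_text]
print(own_text_links)

# "." means all of the text inside the element, nested tags included.
any_text = root.xpath('.//a[contains(.,"2025")]')
any_text_links = [a.get("href") for a in any_text]
print(any_text_links)
```

On a page like text.npr.org, where headline links contain plain text, both forms should behave the same way.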

Parsing HTML pages with Xpath in Python

With the correct xpath in hand, we can automate this parsing in Python using the lxml package.

!pip install lxml requests

from lxml import etree
import requests

1. Visit the page

Let’s visit the website and retrieve the static HTML from the page.

url = "https://text.npr.org/"

resp = requests.get(url)

2. Convert the HTML string into an `etree`

This allows us to parse the text using xpath.

tree = etree.HTML(resp.text)
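To see what this step produces without hitting the network, here is a small sketch that feeds etree.HTML a deliberately sloppy string in place of resp.text (the markup is made up). The parser fills in the missing html and body tags and returns the root element, which can then be queried with xpath.

```python
from lxml import etree

# Stand-in for resp.text: an unclosed <p> and no <html> or <body> (made up).
html_string = (
    '<p>Unclosed paragraph '
    '<a class="topic-title" href="/story-1">A headline</a>'
)

tree = etree.HTML(html_string)

# The result is the root <html> element of a repaired document.
print(tree.tag)
print(etree.tostring(tree, encoding="unicode"))

# The repaired tree can be searched like any other.
matches = tree.xpath('.//a[@class="topic-title"]')
print(len(matches))
```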

3. Select all the links

Use an xpath to identify all the links on the page that share the class topic-title.

elements = tree.findall('.//a[@class="topic-title"]')
len(elements)
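A note on method choice: lxml elements offer both findall() and xpath(). As far as I know, findall() understands only a subset of the xpath language, which is enough for a simple attribute match, while xpath() evaluates full expressions, including functions and attribute selection. The sketch below uses a made-up snippet to compare them.

```python
from lxml import etree

# Made-up snippet standing in for the downloaded page.
html = """
<html><body>
  <a class="topic-title" href="/story-1">First headline</a>
  <a class="topic-title" href="/story-2">Second headline</a>
  <a class="footer-link" href="/contact">Contact</a>
</body></html>
"""
tree = etree.HTML(html)

# Both methods accept a simple attribute predicate.
via_findall = tree.findall('.//a[@class="topic-title"]')
via_xpath = tree.xpath('.//a[@class="topic-title"]')
print(len(via_findall), len(via_xpath))

# xpath() can return attribute values directly ...
hrefs = tree.xpath('.//a[@class="topic-title"]/@href')
print(hrefs)

# ... and evaluate functions such as count() and contains().
story_count = tree.xpath('count(.//a[contains(@href, "story")])')
print(story_count)
```

If an expression that works in the browser's $x() uses functions like contains(), xpath() is the method to reach for in lxml.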

4. Parse each link

Now we can iterate through each headline and grab the title and link of each story:

data = []
for elm in elements:
    # get the relative link from the href attribute and make it absolute
    link = elm.get('href')
    link = f"https://npr.org{link}"
    # the headline is the text inside the <a> tag
    title = elm.text

    row = {'link' : link, 'title': title}
    data.append(row)

data[:5]

[{'link': 'https://npr.org/nx-s1-5248297',
  'title': 'A storm will bring heavy snow and dangerous ice from the Plains to the East Coast'},
 {'link': 'https://npr.org/nx-s1-5248299',
  'title': "A Pulitzer winner quits 'Washington Post' after a cartoon on Bezos is killed"},
 {'link': 'https://npr.org/g-s1-41101',
  'title': "The Golden Globes are Sunday night. Here's five things to look for"},
 {'link': 'https://npr.org/nx-s1-5248310',
  'title': 'Film director and screenwriter Jeff Baena, husband of Aubrey Plaza, dead at 47'},
 {'link': 'https://npr.org/nx-s1-5248283',
  'title': "Jurassic footprints are discovered on a 'dinosaur highway' in southern England"}]
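Real pages are not always this tidy: link text can carry stray whitespace, and an empty tag has no text at all. As an optional refinement, here is a sketch of a row-building helper. The clean_row name and its behavior are assumptions for illustration, not part of lxml. It uses urljoin from the standard library to combine the base URL and the relative link.

```python
from urllib.parse import urljoin

BASE_URL = "https://npr.org"  # the same base used in the loop above


def clean_row(href, text, base_url=BASE_URL):
    """Hypothetical helper: build one row from an href and raw link text."""
    return {
        # urljoin combines the base URL and the relative link.
        "link": urljoin(base_url, href),
        # elm.text is None for an empty tag, so fall back to "".
        "title": (text or "").strip(),
    }


row = clean_row("/nx-s1-5035272", "What is in Project 2025? ")
print(row)

empty_row = clean_row("/nx-s1-5035272", None)
print(empty_row)
```

Inside the loop, this would be called as clean_row(elm.get('href'), elm.text).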

That’s all!