Python

How to parse XML with Python ?

XML is an open standard format used by many different applications such as web browsers, email clients, and databases. In this tutorial, we'll teach you how to parse XML using Python.

What is XML?

XML stands for Extensible Markup Language. It is a computer language that allows the exchange of data between two heterogeneous environments. An XML document is a tree composed of nodes that can be elements or attributes.

It stands for eXtensible Markup Language
It is a markup language like HTML
It was designed to be self-descriptive - Anyone can understand it easily
It was developed to exchange and store data easily
It is a W3C recommendation

Let’s look at an XML file

<urlset>
      <url>
        <loc>https://primates.dev/become-an-author/</loc>
        <lastmod>2020-08-30T21:25:10.486Z</lastmod>
        <image:image>
          <image:loc>https://primates.dev/content/images/2020/02/monkey-computer.jpg</image:loc>
          <image:caption>monkey-computer.jpg</image:caption>
        </image:image>
      </url>
      <url>
        <loc>https://primates.dev/tags/</loc>
        <lastmod>2020-08-30T21:23:08.618Z</lastmod>
      </url>
    </urlset>

Let's look at how to parse this information. It is a sitemap found here: https://primates.dev/sitemap.xml. It is a file that is read by Google and other crawlers to know which pages are present on a website. It helps them identify the content that is present on your website and how often it is modified.

The root element is urlset. When you are creating a sitemap, this element defines a list of URLs that will be listed below. When Google is parsing this XML, it will look at the root element to determine what the content is. Here it knows that it will be a list of URLs.

The second element is URL which has many children such as loc and lastmod

Parsing the XML with Python

1. Installation

You'll need two modules:

Requests: it allows you to send requests such as GET/POST/PUT/DELETE. You can add many different things such as headers, form data, multipart files, and parameters with simple Python dictionaries. It also allows you to access many parameters easily from a response data. Being powered by httplib and urllib3, it does all the hard work for you.
ElementTree: The XML.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data. With this library, you’ll be able to parse XML in a few lines of codes. See that as an alternative to BeautifulSoup with the lxml parser. BS4 uses lxml to parse an HTML file and read it like you read an XML file. It looks at the root and then goes down to the different sections such as <header>, <body>, <footer>and parses all the params of those elements. Reading an HTML and XML file is done in the same way.

We first install requests with pip. Note that this tutorial uses Python 3. Make sure that you have pip3 installed on your computer

pip install requests # Python 3

If you don’t know requests, it is probably one of the most powerful and yet simple libraries to use. I strongly recommend you to check it.

2. Parsing XML

Let’s start with a single file. Call it whatever you like. Here I’ll make main.py and write:

import requests
    import xml.etree.ElementTree as ET

These are the only two libraries needed for parsing XML. Requests will be used to call the endpoint that will give us the XML while xml.etree.ElementTree will help us parse the response.

In this example, we will be parsing https://primates.dev/sitemap-posts.xml. It is a fairly simple file to start with and should introduce you to all the concepts you need to understand how to parse an XML.

r = requests.get('https://primates.dev/sitemap-posts.xml')
    print(r.content)

You should end up with something looking like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="//primates.dev/sitemap.xsl"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
      <url>
        <loc>https://primates.dev/find-all-urls-of-a-website-in-a-few-seconds-python/</loc>
        <lastmod>2020-08-30T21:27:11.645Z</lastmod>
        <image:image>
          <image:loc>https://primates.dev/content/images/2020/03/sitemap-crawler-python-primates.jpg</image:loc>
          <image:caption>sitemap-crawler-python-primates.jpg</image:caption>
        </image:image>
      </url>
      ...
    </urlset>'

Let's try to extract all the information from this XML example easily. This is were the module ElementTree comes in. Using ElementTree, we parse the data into a variable. This will use the root of the structure and we essentially create a dictionary with key/values pairs.

Now all of our data is in the root variable, we can start working with it.
We will use the method "iter"; to access data within the variable.

To view all elements (tags) we can use a wildcard such as "*".

Giving us this output:

{http://www.sitemaps.org/schemas/sitemap/0.9}url
    {http://www.sitemaps.org/schemas/sitemap/0.9}loc
    {http://www.sitemaps.org/schemas/sitemap/0.9}lastmod
    {http://www.google.com/schemas/sitemap-image/1.1}image
    {http://www.google.com/schemas/sitemap-image/1.1}loc
    {http://www.google.com/schemas/sitemap-image/1.1}caption
    {http://www.sitemaps.org/schemas/sitemap/0.9}url
    {http://www.sitemaps.org/schemas/sitemap/0.9}loc
    {http://www.sitemaps.org/schemas/sitemap/0.9}lastmod
    {http://www.google.com/schemas/sitemap-image/1.1}image
    {http://www.google.com/schemas/sitemap-image/1.1}loc
    {http://www.google.com/schemas/sitemap-image/1.1}caption

Now you tell me, this is not a very good looking result and we are also missing some values. Let’s look at how to extract the data from the keys that we got.

Now replace the above code with this one.

xmlDict = {}
    for sitemap in root:
        children = list(sitemap)
        xmlDict[children[0].text] = children[1].text
    print (xmlDict)

This code will extract the loc and lastmod values of our XML. It doesn’t matter what type of data it is, it can be a string, a DateTime, an int. You can store any values you want in an XML.

You should end up with something like that:

{
      'https://primates.dev/find-all-urls-of-a-website-in-a-few-seconds-python/': '2020-08-30T21:27:11.645Z', 
     'https://primates.dev/parsing-an-api-xml-response-data-python/': '2020-08-30T21:26:53.152Z',
     'https://primates.dev/how-to-install-ghost/': '2020-08-28T22:36:43.749Z',
     'https://primates.dev/how-to-install-multiple-ghost-instances-on-the-same-server/': '2020-08-12T11:01:49.000Z',
     'https://primates.dev/10-tips-on-how-to-choose-a-domain-name-for-your-business/': '2020-08-04T23:26:22.000Z'
    ...
    }

I hope you found this tutorial helpful. XML is not one of the best ways of transferring data between computers. It is an extremely verbose language. It takes a lot of space but is easily readable by humans. Huge XML with lots of nested elements can be very tedious to read. Fortunately for us, machines do most of the work.

Please let me know in the comment if you found a better way to parse XML and I’ll gladly add it here.

And if you want more content like this one, don’t forget to subscribe to our newsletter!