Parsing API XML Response Data - Python

A simple and effective way to parse API XML Response Data with Python in less than 10 lines of code

Stanislas Girard

I recently started parsing sitemaps from different websites for a project. Given my level in Python at the time, I inevitably hit a roadblock. A sitemap is an XML document that lists the URLs of a website, so parsing it is a quick and efficient way to retrieve all of them.

After a bit of research, I found a simple way to parse XML using Python.

You'll need two modules:

  • Requests: it allows you to send HTTP/1.1 requests. You can add headers, form data, multipart files, and parameters with simple Python dictionaries, and access the response data in the same way. It’s powered by httplib and urllib3, but it does all the hard work and crazy hacks for you.
  • ElementTree: The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data.
  • If you are unfamiliar with XML, I encourage you to check out this.
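To make the "parsing and creating" part of ElementTree concrete, here is a minimal sketch. The `catalog` and `book` elements are made-up illustration data, not part of the sitemap used later in this post.

```python
import xml.etree.ElementTree as ET

# Parse: turn an XML string into a tree of Element objects.
# The <catalog>/<book> markup is hypothetical sample data.
catalog = ET.fromstring("<catalog><book id='1'>Dune</book></catalog>")
book = catalog.find("book")
print(book.tag, book.attrib["id"], book.text)

# Create: append a new element, then serialize the tree back to text.
new_book = ET.SubElement(catalog, "book", id="2")
new_book.text = "Emma"
serialized = ET.tostring(catalog, encoding="unicode")
print(serialized)
```

Every node in the tree is an Element with a tag, a dictionary of attributes, and optional text, which is all we need for the rest of this tutorial.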

Our objective

We'd like to parse the following XML

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
         <loc>https://www.washingtonpost.com/national/ben-affleck-on-the-pain-and-catharsis-of-the-way-back/2020/02/27/40e505a2-59a6-11ea-8efd-0f904bdd8057_story.html</loc>
         <lastmod>2020-02-27T21:15:18Z</lastmod>
         <news:news>
            <news:publication>
                <news:name>Washington Post</news:name>
                <news:language>en</news:language>
            </news:publication>
            <news:publication_date>2020-02-27T21:15:18Z</news:publication_date>
            <news:title><![CDATA[Ben Affleck on the pain and catharsis of ‘The Way Back’]]></news:title>
            <news:keywords><![CDATA[US-Film-Ben Affleck, Toby Emmerich, Robert Pattinson, Ana de Armas, Jake Coyle, Ridley Scott, Nicole Holofcener, Jennifer Garner, Barbara Walters, Matt Reeves, Ben Affleck, Matt Damon, General news, Arts and entertainment, Celebrity, Entertainment, Movies, Basketball, Sports]]></news:keywords>
         </news:news>
         <changefreq>hourly</changefreq>
</url>
</urlset>

Each url entry describes a news article and contains:

  • URL of the article
  • Name of the publisher
  • Language of the article
  • Publication Date
  • Title of the article
  • Keywords
  • And the frequency at which the article could change
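As a rough sketch of where we are heading, the fields listed above can be pulled out by name using ElementTree's namespace mapping. The `NS` prefixes, the `describe` helper and the dictionary keys are my own naming choices for illustration, and the sample is a trimmed copy of the XML above (keywords shortened, XML declaration omitted).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above; keywords are shortened for brevity.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://www.washingtonpost.com/national/ben-affleck-on-the-pain-and-catharsis-of-the-way-back/2020/02/27/40e505a2-59a6-11ea-8efd-0f904bdd8057_story.html</loc>
    <lastmod>2020-02-27T21:15:18Z</lastmod>
    <news:news>
      <news:publication>
        <news:name>Washington Post</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2020-02-27T21:15:18Z</news:publication_date>
      <news:title><![CDATA[Ben Affleck on the pain and catharsis of ‘The Way Back’]]></news:title>
      <news:keywords><![CDATA[Movies, Basketball, Sports]]></news:keywords>
    </news:news>
    <changefreq>hourly</changefreq>
  </url>
</urlset>"""

# Prefix-to-URI mapping; the prefixes "sm" and "news" are arbitrary local names.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def describe(url_element):
    """Collect the fields listed above from one <url> element."""
    def text(path):
        return url_element.findtext(path, namespaces=NS)

    return {
        "url": text("sm:loc"),
        "publisher": text("news:news/news:publication/news:name"),
        "language": text("news:news/news:publication/news:language"),
        "published": text("news:news/news:publication_date"),
        "title": text("news:news/news:title"),
        "keywords": text("news:news/news:keywords"),
        "change_frequency": text("sm:changefreq"),
    }


article = describe(ET.fromstring(SAMPLE).find("sm:url", NS))
for field, value in article.items():
    print(f"{field}: {value}")
```

CDATA sections are unwrapped by the parser, so the title and keywords come back as plain strings.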

Getting Started

Installation

pip install requests

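This command installs Requests, the only third-party library used in the following example. ElementTree ships with Python's standard library, so there is nothing else to install.

In your Python 3 file, such as main.py: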

import requests
import xml.etree.ElementTree as ET

Parsing the XML


We'll start by making a request to our designated XML, in this case the sitemap of The Washington Post.

r = requests.get('https://www.washingtonpost.com/arcio/news-sitemap/')

Then print the content of the response:

print(r.content)

This is where the ElementTree module comes in. ET.fromstring parses the response body and returns the root element of the XML tree, which we store in a variable. The root is not a dictionary: it is an Element object whose children we can iterate over.

root = ET.fromstring(r.content)
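To see what ET.fromstring actually returns, here is a small offline sketch. The `content` bytes are a hypothetical stand-in for r.content, with example.com URLs instead of the real sitemap, so it runs without a network connection.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for r.content: requests exposes the body as bytes.
content = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/a</loc>"
    b"<lastmod>2020-02-27T21:15:18Z</lastmod></url>"
    b"<url><loc>https://example.com/b</loc>"
    b"<lastmod>2020-02-27T21:18:44Z</lastmod></url>"
    b"</urlset>"
)

root = ET.fromstring(content)

print(type(root).__name__)  # the class of the returned object
print(root.tag)             # tag of the root, including its namespace
print(len(root))            # number of direct children (<url> entries)

# Elements behave like sequences of their children.
first_url = root[0]
child_texts = [child.text for child in first_url]
print(child_texts)
```

Indexing and len() work on an Element much as they do on a list, which is what the loops later in this post rely on.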

Now that all of our data is in the root variable, we can start working with it.
We will use the iter method to access the data within it.

To view all elements (tags) we can use a wildcard such as "*".

for child in root.iter('*'):
    print(child.tag)

This prints every tag, each prefixed with its namespace in braces. An excerpt of the output:

{http://www.google.com/schemas/sitemap-news/0.9}news
{http://www.google.com/schemas/sitemap-news/0.9}publication
{http://www.google.com/schemas/sitemap-news/0.9}name
{http://www.google.com/schemas/sitemap-news/0.9}language
{http://www.google.com/schemas/sitemap-news/0.9}publication_date
{http://www.google.com/schemas/sitemap-news/0.9}title
{http://www.google.com/schemas/sitemap-news/0.9}keywords
{http://www.sitemaps.org/schemas/sitemap/0.9}changefreq
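The braces in those tags hold the namespace URI. If you only care about the short name, or only about one kind of element, something like the following should work. `split_tag` is a helper I made up for illustration, and the sample document uses an example.com URL.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def split_tag(tag):
    """Split '{namespace}local' into (namespace, local); namespace is '' if absent."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return "", tag


# Small hypothetical document so the example runs offline.
root = ET.fromstring(
    f'<urlset xmlns="{SITEMAP_NS}"><url><loc>https://example.com/a</loc>'
    "<changefreq>hourly</changefreq></url></urlset>"
)

# iter() with no argument walks the root and all of its descendants.
local_names = [split_tag(element.tag)[1] for element in root.iter()]
print(local_names)

# Passing a fully qualified tag restricts the walk to that element type.
locations = [element.text for element in root.iter(f"{{{SITEMAP_NS}}}loc")]
print(locations)
```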

Then we just have to create a dictionary. Here I created a dictionary where the key is the URL and the value is the last modification date.

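xmlDict = {}
for sitemap in root:
    # Element.getchildren() was removed in Python 3.9; list() returns the same children.
    children = list(sitemap)
    # In this sitemap the first child is <loc> and the second is <lastmod>.
    xmlDict[children[0].text] = children[1].text
print(xmlDict)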

Giving the output

{'https://www.washingtonpost.com/business/energy/coronavirus-sends-vix-into-backwardationheres-what-that-means/2020/02/27/813936c2-5970-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T20:55:48Z',
'https://www.washingtonpost.com/business/why-tech-firms-want-some-facial-recognition-rules/2020/02/20/f567a0da-53e7-11ea-80ce-37a8d4266c09_story.html': '2020-02-27T21:13:33Z',
'https://www.washingtonpost.com/business/the-citizenship-law-behind-indias-sectarian-violence/2020/02/26/6fea20f6-5908-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T21:14:41Z',
'https://www.washingtonpost.com/business/how-a-mega-ipo-could-help-plug-indias-budget-deficit/2020/02/27/37999cb2-59a4-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T21:18:44Z', 
'https://www.washingtonpost.com/business/economy/walmart-to-launch-paid-membership-program-to-compete-with-amazons-prime/2020/02/27/f30888e2-5962-11ea-9b35-def5a027d470_story.html': '2020-02-27T21:49:00Z'}
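Relying on the position of the children works for this sitemap, but it breaks if a site orders its elements differently or leaves lastmod out. A slightly more defensive sketch looks the elements up by name instead. `lastmod_by_url` and the `sm` prefix are names I chose for illustration, and the sample data is hypothetical.

```python
import xml.etree.ElementTree as ET

# The prefix "sm" is an arbitrary local name for the sitemap namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def lastmod_by_url(root):
    """Map each <loc> to its <lastmod> (None when the sitemap omits it)."""
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        if loc is None:
            continue  # skip malformed entries without a location
        result[loc.strip()] = url.findtext("sm:lastmod", namespaces=NS)
    return result


# Hypothetical sitemap: children in a different order, one without <lastmod>.
sample = ET.fromstring(
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><lastmod>2020-02-27T20:55:48Z</lastmod>"
    "<loc>https://example.com/a</loc></url>"
    "<url><loc>https://example.com/b</loc></url>"
    "</urlset>"
)

pages = lastmod_by_url(sample)
print(pages)
```

The trade-off is a little more code in exchange for not depending on element order.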

You should be all set for parsing XML now.

Here is the gist for the code

New and improved version here:

How to parse XML with Python ?
A simple and updated way to parse an XML with Python. Don’t let XML beat you down. Master the parsing of XML with this simple tutorial.