Parsing an API XML Response Data - Python

A simple and effective way to parse API XML Response Data with Python in less than 10 lines of code



I recently started parsing sitemaps from different websites for a project. Given my current level in Python, I inevitably hit a roadblock. Sitemaps are XML documents that list the URLs of a website, so parsing a site's sitemap lets you retrieve those URLs quickly and efficiently.

After a bit of research, I found a simple way to parse XML using Python.

You'll need two modules:

  • Requests: it allows you to send HTTP/1.1 requests. You can add headers, form data, multipart files, and parameters with simple Python dictionaries, and access the response data in the same way. It’s powered by httplib and urllib3, but it does all the hard work and crazy hacks for you.
  • ElementTree: the xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data. It is part of Python's standard library, so there is nothing extra to install.
  • If you are unfamiliar with XML, I encourage you to check out this.

Our objective

We'd like to parse the following XML

<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
    <url>
        <loc>https://www.washingtonpost.com/national/ben-affleck-on-the-pain-and-catharsis-of-the-way-back/2020/02/27/40e505a2-59a6-11ea-8efd-0f904bdd8057_story.html</loc>
        <lastmod>2020-02-27T21:15:18Z</lastmod>
        <news:news>
            <news:publication>
                <news:name>Washington Post</news:name>
                <news:language>en</news:language>
            </news:publication>

            <news:publication_date>2020-02-27T21:15:18Z</news:publication_date>
            <news:title><![CDATA[Ben Affleck on the pain and catharsis of ‘The Way Back’]]></news:title>
            <news:keywords><![CDATA[US-Film-Ben Affleck, Toby Emmerich, Robert Pattinson, Ana de Armas, Jake Coyle, Ridley Scott, Nicole Holofcener, Jennifer Garner, Barbara Walters, Matt Reeves, Ben Affleck, Matt Damon, General news, Arts and entertainment, Celebrity, Entertainment, Movies, Basketball, Sports]]></news:keywords>
        </news:news>
        <changefreq>hourly</changefreq>
    </url>
</urlset>

Each entry describes a news article and its publication, and contains:

  • URL of the article
  • Name of the publisher
  • Language of the article
  • Publication Date
  • Title of the article
  • Keywords
  • And the frequency at which the article may change

Getting Started

Installation

pip install requests

This command installs Requests, the only third-party library needed for the following example. ElementTree ships with Python, so it needs no installation.

In your Python 3 file, such as main.py, import both modules:

import requests
import xml.etree.ElementTree as ET

Parsing the XML


We'll start by requesting our target XML, in this case the news sitemap of The Washington Post.

r = requests.get('https://www.washingtonpost.com/arcio/news-sitemap/')

Then print the content of the response:

print(r.content)
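The request above assumes the server answers promptly and with a success status. A slightly more defensive sketch is shown below. The helper name fetch_xml and the 10-second timeout are assumptions made for this illustration, not something the original code uses.

```python
import requests

SITEMAP_URL = "https://www.washingtonpost.com/arcio/news-sitemap/"


def fetch_xml(url, timeout=10):
    """Return the raw response body as bytes, or raise on failure.

    fetch_xml is a hypothetical helper for this sketch.
    """
    # The timeout keeps the call from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses.
    response.raise_for_status()
    return response.content
```

Calling fetch_xml(SITEMAP_URL) should then give the same bytes as r.content, while network or HTTP problems surface as a requests.exceptions.RequestException instead of a confusing parse error later on.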

This is where the ElementTree module comes in. Using ElementTree, we parse the data into a variable that holds the root element of the XML structure. It is not a dictionary, but we can iterate over it and index into it to reach its child elements.

root = ET.fromstring(r.content)
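Strictly speaking, ET.fromstring returns an Element object. The short sketch below pokes at such a root to show what it offers. It uses a trimmed-down sample document (an assumption, with example.com URLs) so that it runs without a network connection.

```python
import xml.etree.ElementTree as ET

# A small stand-in for the real sitemap, so the example runs offline.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/a</loc>
    <lastmod>2020-02-27T21:15:18Z</lastmod>
  </url>
</urlset>"""

root = ET.fromstring(sample)

print(root.tag)    # tag of the root element, including its namespace URI
print(len(root))   # number of direct children (the <url> entries)

first_url = root[0]  # elements can be indexed like a list
print([child.tag for child in first_url])
print(first_url[0].text)  # text inside the first child, here <loc>
```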

Now that all of our data is in the root variable, we can start working with it.
We will use the "iter" method to access the data within the variable.

To view all elements (tags) we can use a wildcard such as "*".

for child in root.iter('*'):
    print(child.tag)

This will output all the tags, each prefixed with its namespace URI in curly braces. Here is an excerpt for one entry:

{http://www.google.com/schemas/sitemap-news/0.9}news
{http://www.google.com/schemas/sitemap-news/0.9}publication
{http://www.google.com/schemas/sitemap-news/0.9}name
{http://www.google.com/schemas/sitemap-news/0.9}language
{http://www.google.com/schemas/sitemap-news/0.9}publication_date
{http://www.google.com/schemas/sitemap-news/0.9}title
{http://www.google.com/schemas/sitemap-news/0.9}keywords
{http://www.sitemaps.org/schemas/sitemap/0.9}changefreq
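Because every tag carries its namespace, a lookup such as find('loc') would not match anything. One way to handle this, sketched here under the assumption of a shortened sample document, is to pass a prefix-to-URI mapping to findall and findtext. The prefixes "sm" and "news" are local aliases chosen for this example; only the URIs have to match the sitemap.

```python
import xml.etree.ElementTree as ET

# Local aliases for the namespace URIs declared in the sitemap.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

# Shortened stand-in for the real sitemap, so the example runs offline.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/story.html</loc>
    <lastmod>2020-02-27T21:15:18Z</lastmod>
    <news:news>
      <news:publication>
        <news:name>Washington Post</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:title><![CDATA[Example title]]></news:title>
    </news:news>
    <changefreq>hourly</changefreq>
  </url>
</urlset>"""

root = ET.fromstring(sample)

articles = []
for url in root.findall("sm:url", NS):
    # findtext returns the element's text, or None if the path is missing.
    articles.append({
        "loc": url.findtext("sm:loc", namespaces=NS),
        "publisher": url.findtext("news:news/news:publication/news:name", namespaces=NS),
        "language": url.findtext("news:news/news:publication/news:language", namespaces=NS),
        "title": url.findtext("news:news/news:title", namespaces=NS),
        "changefreq": url.findtext("sm:changefreq", namespaces=NS),
    })

print(articles)
```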

Then we just have to create a dictionary. Here I created one where the key is the URL and the value is the last modification date.

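xmlDict = {}
for sitemap in root:
    # getchildren() was removed in Python 3.9; list() returns the same child elements
    children = list(sitemap)
    # In this sitemap, children[0] is <loc> and children[1] is <lastmod>
    xmlDict[children[0].text] = children[1].text
print(xmlDict)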

Giving the output

{'https://www.washingtonpost.com/business/energy/coronavirus-sends-vix-into-backwardationheres-what-that-means/2020/02/27/813936c2-5970-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T20:55:48Z',
'https://www.washingtonpost.com/business/why-tech-firms-want-some-facial-recognition-rules/2020/02/20/f567a0da-53e7-11ea-80ce-37a8d4266c09_story.html': '2020-02-27T21:13:33Z',
'https://www.washingtonpost.com/business/the-citizenship-law-behind-indias-sectarian-violence/2020/02/26/6fea20f6-5908-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T21:14:41Z',
'https://www.washingtonpost.com/business/how-a-mega-ipo-could-help-plug-indias-budget-deficit/2020/02/27/37999cb2-59a4-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T21:18:44Z', 
'https://www.washingtonpost.com/business/economy/walmart-to-launch-paid-membership-program-to-compete-with-amazons-prime/2020/02/27/f30888e2-5962-11ea-9b35-def5a027d470_story.html': '2020-02-27T21:49:00Z'}
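Indexing the children by position works for this sitemap, but it depends on <loc> and <lastmod> always being the first two children. A variant that looks the elements up by name might be more tolerant of other sitemaps. The sketch below is one possible approach, not the original code: the function name url_lastmod_map and the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Offline stand-in: the second entry lists <lastmod> first, the third omits it.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/a</loc>
    <lastmod>2020-02-27T20:55:48Z</lastmod>
  </url>
  <url>
    <lastmod>2020-02-27T21:13:33Z</lastmod>
    <loc>https://example.com/b</loc>
  </url>
  <url>
    <loc>https://example.com/c</loc>
  </url>
</urlset>"""


def url_lastmod_map(root):
    """Map each <loc> to its <lastmod> text (None when <lastmod> is absent)."""
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        if loc is None:
            continue  # skip entries without a URL
        result[loc.strip()] = url.findtext("sm:lastmod", namespaces=NS)
    return result


xml_dict = url_lastmod_map(ET.fromstring(sample))
print(xml_dict)
```

Entries without a <lastmod> end up with None as their value, which the caller can filter out or keep depending on the project.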

You should be all set for parsing XML now.

Here is the gist for the code

