Parsing API XML Response Data with Python
A simple and effective way to parse API XML response data with Python in fewer than 10 lines of code
I recently started parsing sitemaps from different websites for a project, and given my current level in Python, I inevitably hit a roadblock. Sitemaps are XML documents that list all the URLs of a website, so parsing a site's sitemap lets you quickly and efficiently retrieve every URL it contains.
After a bit of research I found a simple and easy way to parse XML using Python.
You'll need two modules:
- Requests: it allows you to send HTTP/1.1 requests. You can add headers, form data, multipart files, and parameters with simple Python dictionaries, and access the response data in the same way. It's powered by urllib3, but it does all the hard work and crazy hacks for you.
- ElementTree: the xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data. If you are unfamiliar with XML, I encourage you to read up on it first.
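To get a feel for the ElementTree API before we touch the sitemap, here is a minimal sketch that parses a small, made-up XML string (the catalog/book structure is just an illustration, not part of the sitemap):

```python
import xml.etree.ElementTree as ET

# Parse an XML document directly from a string
root = ET.fromstring('<catalog><book id="1">Python 101</book></catalog>')

print(root.tag)  # the outermost element: catalog
for book in root.iter('book'):
    # .get() reads attributes, .text reads the element body
    print(book.get('id'), book.text)
```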
Our objective
We'd like to parse the following XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://www.washingtonpost.com/national/ben-affleck-on-the-pain-and-catharsis-of-the-way-back/2020/02/27/40e505a2-59a6-11ea-8efd-0f904bdd8057_story.html</loc>
    <lastmod>2020-02-27T21:15:18Z</lastmod>
    <news:news>
      <news:publication>
        <news:name>Washington Post</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2020-02-27T21:15:18Z</news:publication_date>
      <news:title><![CDATA[Ben Affleck on the pain and catharsis of ‘The Way Back’]]></news:title>
      <news:keywords><![CDATA[US-Film-Ben Affleck, Toby Emmerich, Robert Pattinson, Ana de Armas, Jake Coyle, Ridley Scott, Nicole Holofcener, Jennifer Garner, Barbara Walters, Matt Reeves, Ben Affleck, Matt Damon, General news, Arts and entertainment, Celebrity, Entertainment, Movies, Basketball, Sports]]></news:keywords>
    </news:news>
    <changefreq>hourly</changefreq>
  </url>
</urlset>
This gives us a news entry with a publication, containing:
- URL of the article
- Name of the publisher
- Language of the article
- Publication Date
- Title of the article
- Keywords
- And the frequency at which the article could change
Getting Started
Installation
pip install requests
This command installs Requests; ElementTree ships with Python's standard library, so no extra install is needed.
In your Python 3 file, such as main.py:
import requests
import xml.etree.ElementTree as ET
Parsing the XML
We'll start by making a request to our designated XML, in this case the sitemap of The Washington Post.
r = requests.get('https://www.washingtonpost.com/arcio/news-sitemap/')
Then print the content of the response:
print(r.content)
This is where the ElementTree module comes in. Using ElementTree, we parse the response content into a variable. This gives us the root element of the XML tree, from which every other element can be reached.
root = ET.fromstring(r.content)
Now that all of our data is in the root variable, we can start working with it.
We will use the iter() method to walk the elements within the variable.
To view all elements (tags), we can pass the wildcard "*".
for child in root.iter('*'):
    print(child.tag)
This will output all the tags (truncated here):
{http://www.google.com/schemas/sitemap-news/0.9}news
{http://www.google.com/schemas/sitemap-news/0.9}publication
{http://www.google.com/schemas/sitemap-news/0.9}name
{http://www.google.com/schemas/sitemap-news/0.9}language
{http://www.google.com/schemas/sitemap-news/0.9}publication_date
{http://www.google.com/schemas/sitemap-news/0.9}title
{http://www.google.com/schemas/sitemap-news/0.9}keywords
{http://www.sitemaps.org/schemas/sitemap/0.9}changefreq
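Notice that every tag is prefixed with its namespace URI in curly braces. Rather than typing those URIs out in full, ElementTree lets you pass a prefix-to-URI mapping to findall(). Here is a sketch using a trimmed-down inline sample in the same shape as the sitemap (the prefix names in the ns dictionary are my own choice):

```python
import xml.etree.ElementTree as ET

# A small sample in the same shape as the Washington Post sitemap
xml = '''<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/article</loc>
    <news:news><news:title>Sample headline</news:title></news:news>
  </url>
</urlset>'''

root = ET.fromstring(xml)

# Map short prefixes to the full namespace URIs seen in the tag output
ns = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'news': 'http://www.google.com/schemas/sitemap-news/0.9',
}

# './/news:title' finds every title anywhere under the root
for title in root.findall('.//news:title', ns):
    print(title.text)
```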
Then we just have to build a dictionary. Here I created a dictionary where the key is the URL and the value is the last-modified date.
xmlDict = {}
for sitemap in root:
    children = list(sitemap)  # list(element) replaces the deprecated getchildren()
    xmlDict[children[0].text] = children[1].text
print(xmlDict)
Giving the output
{'https://www.washingtonpost.com/business/energy/coronavirus-sends-vix-into-backwardationheres-what-that-means/2020/02/27/813936c2-5970-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T20:55:48Z',
'https://www.washingtonpost.com/business/why-tech-firms-want-some-facial-recognition-rules/2020/02/20/f567a0da-53e7-11ea-80ce-37a8d4266c09_story.html': '2020-02-27T21:13:33Z',
'https://www.washingtonpost.com/business/the-citizenship-law-behind-indias-sectarian-violence/2020/02/26/6fea20f6-5908-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T21:14:41Z',
'https://www.washingtonpost.com/business/how-a-mega-ipo-could-help-plug-indias-budget-deficit/2020/02/27/37999cb2-59a4-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T21:18:44Z',
'https://www.washingtonpost.com/business/economy/walmart-to-launch-paid-membership-program-to-compete-with-amazons-prime/2020/02/27/f30888e2-5962-11ea-9b35-def5a027d470_story.html': '2020-02-27T21:49:00Z'}
You should be all set for parsing XML now.
Here is the gist for the code
import requests
import xml.etree.ElementTree as ET

r = requests.get('https://www.washingtonpost.com/arcio/news-sitemap/')
print(r.content)

root = ET.fromstring(r.content)

for child in root.iter('*'):
    print(child.tag)

xmlDict = {}
for sitemap in root:
    children = list(sitemap)  # list(element) replaces the deprecated getchildren()
    xmlDict[children[0].text] = children[1].text
print(xmlDict)
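One caveat: the dictionary loop above assumes loc comes first and lastmod second inside every url element. If you would rather not rely on child order, you can look elements up by name instead. A sketch using a small inline sample rather than the live sitemap:

```python
import xml.etree.ElementTree as ET

# Inline sample with the children deliberately out of the usual order
xml = '''<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <lastmod>2020-02-27T21:15:18Z</lastmod>
    <loc>https://example.com/story</loc>
  </url>
</urlset>'''

root = ET.fromstring(xml)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

xmlDict = {}
for url in root.findall('sm:url', ns):
    # find() by tag name, so child order no longer matters
    loc = url.find('sm:loc', ns)
    lastmod = url.find('sm:lastmod', ns)
    if loc is not None and lastmod is not None:
        xmlDict[loc.text] = lastmod.text

print(xmlDict)
```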