Parsing API XML Response Data with Python
A simple and effective way to parse API XML response data with Python in fewer than 10 lines of code
I recently started parsing sitemaps from different websites for a project, and given my current level in Python, I inevitably hit a roadblock. Sitemaps are XML documents that list all the URLs of a website, so parsing a site's sitemap lets you quickly and efficiently retrieve every URL it contains.
After a bit of research I found a simple and easy way to parse XML using Python.
You'll need two modules:
- Requests: it allows you to send HTTP/1.1 requests. You can add headers, form data, multipart files, and parameters with simple Python dictionaries, and access the response data in the same way. It's powered by urllib3, but it does all the hard work and crazy hacks for you.
- ElementTree: the xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data. If you are unfamiliar with XML, I encourage you to read up on it first.
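To get a feel for the ElementTree API before we touch the sitemap, here is a minimal sketch that parses a small, made-up XML string (the catalog/book structure is just an illustration, not part of the sitemap):

```python
import xml.etree.ElementTree as ET

# Parse an XML document directly from a string
root = ET.fromstring('<catalog><book id="1">Python 101</book></catalog>')

print(root.tag)  # the outermost element: catalog
for book in root.iter('book'):
    # .get() reads attributes, .text reads the element body
    print(book.get('id'), book.text)
```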
Our objective
We'd like to parse the following XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://www.washingtonpost.com/national/ben-affleck-on-the-pain-and-catharsis-of-the-way-back/2020/02/27/40e505a2-59a6-11ea-8efd-0f904bdd8057_story.html</loc>
    <lastmod>2020-02-27T21:15:18Z</lastmod>
    <news:news>
      <news:publication>
        <news:name>Washington Post</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2020-02-27T21:15:18Z</news:publication_date>
      <news:title><![CDATA[Ben Affleck on the pain and catharsis of ‘The Way Back’]]></news:title>
      <news:keywords><![CDATA[US-Film-Ben Affleck, Toby Emmerich, Robert Pattinson, Ana de Armas, Jake Coyle, Ridley Scott, Nicole Holofcener, Jennifer Garner, Barbara Walters, Matt Reeves, Ben Affleck, Matt Damon, General news, Arts and entertainment, Celebrity, Entertainment, Movies, Basketball, Sports]]></news:keywords>
    </news:news>
    <changefreq>hourly</changefreq>
  </url>
</urlset>
This gives us a news entry with a publication, containing:
- URL of the article
- Name of the publisher
- Language of the article
- Publication Date
- Title of the article
- Keywords
- And the frequency at which the article could change
Getting Started
Installation
pip install requests
This command installs Requests; ElementTree ships with Python's standard library, so no extra install is needed.
In your Python 3 file, such as main.py:
import requests
import xml.etree.ElementTree as ET
Parsing the XML
We'll start by making a request to our designated XML, in this case the sitemap of The Washington Post.
r = requests.get('https://www.washingtonpost.com/arcio/news-sitemap/')
Then print the content of the response:
print(r.content)
This is where the ElementTree module comes in. Using ElementTree, we parse the response content into a variable. This gives us the root element of the XML tree, from which every other element can be reached.
root = ET.fromstring(r.content)
Now that all of our data is in the root variable, we can start working with it.
We will use the iter() method to walk the elements within the variable.
To view all elements (tags), we can pass the wildcard "*".
for child in root.iter('*'):
    print(child.tag)
This will output all the tags (truncated here):
{http://www.google.com/schemas/sitemap-news/0.9}news
{http://www.google.com/schemas/sitemap-news/0.9}publication
{http://www.google.com/schemas/sitemap-news/0.9}name
{http://www.google.com/schemas/sitemap-news/0.9}language
{http://www.google.com/schemas/sitemap-news/0.9}publication_date
{http://www.google.com/schemas/sitemap-news/0.9}title
{http://www.google.com/schemas/sitemap-news/0.9}keywords
{http://www.sitemaps.org/schemas/sitemap/0.9}changefreq
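Notice that every tag is prefixed with its namespace URI in curly braces. Rather than typing those URIs out in full, ElementTree lets you pass a prefix-to-URI mapping to findall(). Here is a sketch using a trimmed-down inline sample in the same shape as the sitemap (the prefix names in the ns dictionary are my own choice):

```python
import xml.etree.ElementTree as ET

# A small sample in the same shape as the Washington Post sitemap
xml = '''<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/article</loc>
    <news:news><news:title>Sample headline</news:title></news:news>
  </url>
</urlset>'''

root = ET.fromstring(xml)

# Map short prefixes to the full namespace URIs seen in the tag output
ns = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'news': 'http://www.google.com/schemas/sitemap-news/0.9',
}

# './/news:title' finds every title anywhere under the root
for title in root.findall('.//news:title', ns):
    print(title.text)
```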
Then we just have to build a dictionary. Here I created a dictionary where the key is the URL and the value is the last-modified date.
xmlDict = {}
for sitemap in root:
    children = list(sitemap)  # list(element) replaces the deprecated getchildren()
    xmlDict[children[0].text] = children[1].text
print(xmlDict)
Giving the output
{'https://www.washingtonpost.com/business/energy/coronavirus-sends-vix-into-backwardationheres-what-that-means/2020/02/27/813936c2-5970-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T20:55:48Z',
'https://www.washingtonpost.com/business/why-tech-firms-want-some-facial-recognition-rules/2020/02/20/f567a0da-53e7-11ea-80ce-37a8d4266c09_story.html': '2020-02-27T21:13:33Z',
'https://www.washingtonpost.com/business/the-citizenship-law-behind-indias-sectarian-violence/2020/02/26/6fea20f6-5908-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T21:14:41Z',
'https://www.washingtonpost.com/business/how-a-mega-ipo-could-help-plug-indias-budget-deficit/2020/02/27/37999cb2-59a4-11ea-8efd-0f904bdd8057_story.html': '2020-02-27T21:18:44Z',
'https://www.washingtonpost.com/business/economy/walmart-to-launch-paid-membership-program-to-compete-with-amazons-prime/2020/02/27/f30888e2-5962-11ea-9b35-def5a027d470_story.html': '2020-02-27T21:49:00Z'}
You should be all set for parsing XML now.
Here is the gist for the code
import requests
import xml.etree.ElementTree as ET

r = requests.get('https://www.washingtonpost.com/arcio/news-sitemap/')
print(r.content)

root = ET.fromstring(r.content)

for child in root.iter('*'):
    print(child.tag)

xmlDict = {}
for sitemap in root:
    children = list(sitemap)  # list(element) replaces the deprecated getchildren()
    xmlDict[children[0].text] = children[1].text
print(xmlDict)
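One caveat: the dictionary loop above assumes loc comes first and lastmod second inside every url element. If you would rather not rely on child order, you can look elements up by name instead. A sketch using a small inline sample rather than the live sitemap:

```python
import xml.etree.ElementTree as ET

# Inline sample with the children deliberately out of the usual order
xml = '''<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <lastmod>2020-02-27T21:15:18Z</lastmod>
    <loc>https://example.com/story</loc>
  </url>
</urlset>'''

root = ET.fromstring(xml)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

xmlDict = {}
for url in root.findall('sm:url', ns):
    # find() by tag name, so child order no longer matters
    loc = url.find('sm:loc', ns)
    lastmod = url.find('sm:lastmod', ns)
    if loc is not None and lastmod is not None:
        xmlDict[loc.text] = lastmod.text

print(xmlDict)
```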