Get All URLs From A Website Using Python Script
Have you ever wanted to extract all the URLs of a website quickly? We'll tell you how! Usually, crawlers visit every page of a website and index them. This method is extremely slow and often means spending hours crawling a single website.
Most websites have a file called a Sitemap that lists all their URLs. If we manage to find this file, we can find all the URLs of a website in seconds. We'll look at a simple way to extract all the URLs of a website based on its Sitemap. It is hundreds of times faster than crawling all the pages of a website to find them.
Sitemap
The Sitemap allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. Most websites, such as Primates.dev, have their Sitemap located at /sitemap.xml.
Let's look at our Sitemap. Go to https://primates.dev/sitemap.xml.
You should end up on a page that looks like the image below.
You can see that we have four sitemaps in our sitemap.xml. Usually, websites separate their content into different sitemaps.
- sitemap-pages.xml references all the pages of our website that are not posts.
- sitemap-posts.xml references all the articles on our website. Useful if you want to parse a website's article content.
- sitemap-authors.xml references all the authors of our website. It is specific to our website because we have multiple authors.
- sitemap-tags.xml references all the tags of our website.
Sitemaps are present on almost every website on the web. A sitemap is a file that allows search engines to easily find new pages on a website without having to crawl it. What if you can't find your Sitemap? You can look at your robots.txt file.
Go to <yourwebsite>/robots.txt.
As an example, we'll take a look at https://primates.dev/robots.txt.
- Sitemap: References the sitemaps of a website. The sitemaps are not always listed in the robots.txt, but most websites declare their Sitemap in this file (see the snippet after this list for a way to pull these entries out programmatically).
- Disallow: Pages we don't want to be crawled or referenced in search engines. For example, /ghost/ is the address we use to go to our admin.
- User-agent: Defines the rules that have to be followed by the specified user agent. Here we state that the following rules apply to everybody.
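If you would rather automate that check, here is a minimal sketch (not part of the original script) that fetches a site's robots.txt with requests and pulls out the Sitemap entries:
import requests

def sitemaps_from_robots(base_url):
    # Fetch <base_url>/robots.txt and return the URLs listed on "Sitemap:" lines
    resp = requests.get(base_url.rstrip("/") + "/robots.txt", timeout=10)
    if resp.status_code != 200:
        return []
    return [line.split(":", 1)[1].strip()
            for line in resp.text.splitlines()
            if line.lower().startswith("sitemap:")]

print(sitemaps_from_robots("https://primates.dev"))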
No luck with the robots.txt file? Well, then you need some help.
Method 1: Try common sitemap locations
Here is a list of the most common sitemap URI paths we have found. Note that popular paths can change with the technologies of the moment used to build websites. Still, most websites use one of the same URIs for their sitemaps, so a sitemap is usually easy to find, whether through the robots.txt or by just trying the paths below.
/sitemap.xml
/feeds/posts/default?orderby=updated
/sitemap.xml.gz
/sitemap_index.xml
/s2/sitemaps/profiles-sitemap.xml
/sitemap.php
/sitemap_index.xml.gz
/vb/sitemap_index.xml.gz
/sitemapindex.xml
/sitemap.gz
/sitemap_news.xml
/sitemap-index.xml
/sitemap-news.xml
/post-sitemap.xml
/page-sitemap.xml
/portfolio-sitemap.xml
/home_slider-sitemap.xml
/category-sitemap.xml
/author-sitemap.xml
With this extensive list of common sitemap URI paths, you should be able to find what you are looking for. If the sitemap is not at any of these paths, Method 2 should give you the answer.
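To save some manual typing, here is a small sketch (separate from the parser presented later) that probes a handful of the paths above and reports which ones answer with HTTP 200:
import requests

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml",
                "/sitemap.php", "/post-sitemap.xml", "/page-sitemap.xml"]

def find_sitemaps(base_url, paths=COMMON_PATHS):
    # Return the candidate sitemap URLs that respond with HTTP 200
    found = []
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if resp.status_code == 200:
            found.append(url)
    return found

print(find_sitemaps("https://primates.dev"))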
Method 2: Search with Google
Let's use Google to find sitemaps we could not locate otherwise. This method usually only works for big websites, such as news sites.
Go to Google.com and type site:<url_website> filetype:xml
For example: site:theguardian.com filetype:xml
If you have any luck you should find a sitemap.
As you can see, we found a sitemap that isn't on our list. It is a simple and effective way to find sitemaps for huge websites. Depending on the website, this method may or may not turn anything up.
How to read sitemaps
Let's take a look at the source code of a sitemap. Go to https://primates.dev/sitemap.xml and look at the source code.
This is what we call a sitemap index. It is a list of other sitemaps. As mentioned above, it references all the sitemaps of our website.
- loc: URL of the sitemap
- lastmod: last time the Sitemap was modified.
This sitemap index file is essential for crawlers. It is the entry point for Google's crawlers: with this one file, a crawler knows where all the other sitemaps are and whether they have changed since the last crawl.
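As a quick illustration (using the same libraries as the parser further down), this sketch lists the child sitemaps referenced by a sitemap index:
import requests
from bs4 import BeautifulSoup as Soup

def list_child_sitemaps(index_url):
    # Return the <loc> of every <sitemap> entry in a sitemap index
    resp = requests.get(index_url, timeout=10)
    soup = Soup(resp.content, "xml")
    return [tag.find("loc").string for tag in soup.findAll("sitemap")]

print(list_child_sitemaps("https://primates.dev/sitemap.xml"))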
Let's take a look at https://primates.dev/sitemap-posts.xml. Same thing, look at the source code.
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="//primates.dev/sitemap.xsl"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>https://primates.dev/brave-says-no-to-error-404/</loc>
<lastmod>2020-03-02T11:28:46.280Z</lastmod>
<image:image>
<image:loc>https://primates.dev/content/images/2020/02/404-brave-not-found-primates-dev.jpg</image:loc>
<image:caption>404-brave-not-found-primates-dev.jpg</image:caption>
</image:image>
</url>
<url>
<loc>https://primates.dev/create-games-directly-in-your-browser-babylonjs/</loc>
<lastmod>2020-02-29T23:49:11.000Z</lastmod>
<image:image>
<image:loc>https://primates.dev/content/images/2020/02/gaming-babylonjs-primates-dev.jpg</image:loc>
<image:caption>gaming-babylonjs-primates-dev.jpg</image:caption>
</image:image>
</url>
</urlset>
This is only an extract of the sitemap; every URL entry carries the same kind of information (the short sketch after the list below shows how to read these fields in Python).
- urlset: Indicates that it is a list of URLs
- url: Indicates a URL
- loc: URL of the page
- lastmod: Last time the page was changed
- image:image: Wraps the image information attached to the page
- image:loc: URL of the featured image of the article
- image:caption: Caption of the image
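As a minimal illustration (the full, recursive parser follows in the next section), this sketch reads just the loc and lastmod of each entry in our posts sitemap:
import requests
from bs4 import BeautifulSoup as Soup

resp = requests.get("https://primates.dev/sitemap-posts.xml")
soup = Soup(resp.content, "xml")
# Each <url> entry holds the page URL and its last modification date
for entry in soup.findAll("url"):
    print(entry.find("loc").string, entry.find("lastmod").string)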
Isn't that better than a standard crawler? We now have a way to find all the URLs of a website; we just have to find a way to extract all this information. Fortunately for us, here is a small piece of code that does the job :D
Code for parsing sitemaps
Here is the code.
Installation
pip install requests
pip install beautifulsoup4
pip install lxml
pip install pandas
Not many libraries are needed, and the code is fairly simple. requests handles the HTTP calls, and lxml is the parser BeautifulSoup relies on when asked to parse XML.
Crawler & Parser function
import requests
from bs4 import BeautifulSoup as Soup
import pandas as pd
import hashlib

# Pass the headers you want to retrieve from the xml such as ["loc", "lastmod"]
def parse_sitemap(url, headers):
    resp = requests.get(url)
    # we didn't get a valid response, bail
    if (200 != resp.status_code):
        return False
    # BeautifulSoup to parse the document
    soup = Soup(resp.content, "xml")
    # find all the <url> and <sitemap> tags in the document
    urls = soup.findAll('url')
    sitemaps = soup.findAll('sitemap')
    new_list = ["Source"] + headers
    panda_out_total = pd.DataFrame([], columns=new_list)
    if not urls and not sitemaps:
        return False
    # Recursive call to the function if the sitemap contains other sitemaps
    if sitemaps:
        for u in sitemaps:
            sitemap_url = u.find('loc').string
            panda_recursive = parse_sitemap(sitemap_url, headers)
            # Skip child sitemaps that could not be fetched or parsed
            if isinstance(panda_recursive, pd.DataFrame):
                panda_out_total = pd.concat([panda_out_total, panda_recursive], ignore_index=True)
    # storage for later...
    out = []
    # Creates a hash of the parent sitemap
    hash_sitemap = hashlib.md5(str(url).encode('utf-8')).hexdigest()
    # Extract the keys we want
    for u in urls:
        values = [hash_sitemap]
        for head in headers:
            loc = None
            loc = u.find(head)
            if not loc:
                loc = "None"
            else:
                loc = loc.string
            values.append(loc)
        out.append(values)
    # Creates a dataframe
    panda_out = pd.DataFrame(out, columns=new_list)
    # If recursive then merge the recursive dataframe
    if not panda_out_total.empty:
        panda_out = pd.concat([panda_out, panda_out_total], ignore_index=True)
    # returns the dataframe
    return panda_out
Explanation
First of all, we make a request to the URL passed as a function parameter.
resp = requests.get(url)
# we didn't get a valid response, bail
if (200 != resp.status_code):
    return False
Then we parse the content of the response using BeautifulSoup4.
# BeautifulSoup to parse the document
soup = Soup(resp.content, "xml")
Then we look for <url> and <sitemap> tags, which tells us whether we are in a urlset or a sitemap index.
# find all the <url> tags in the document
urls = soup.findAll('url')
sitemaps = soup.findAll('sitemap')
If we are in a sitemapindex such as https://primates.dev/sitemap.xml we recursively call the function passing the URL (loc) of the sitemap.
# Recursive call to the function if the sitemap contains other sitemaps
if sitemaps:
    for u in sitemaps:
        sitemap_url = u.find('loc').string
        panda_recursive = parse_sitemap(sitemap_url, headers)
        # Skip child sitemaps that could not be fetched or parsed
        if isinstance(panda_recursive, pd.DataFrame):
            panda_out_total = pd.concat([panda_out_total, panda_recursive], ignore_index=True)
Then we create a hash of the parent sitemap's URL so every extracted row can be traced back to the sitemap it came from.
# Creates a hash of the parent sitemap
hash_sitemap = hashlib.md5(str(url).encode('utf-8')).hexdigest()
Only one step is left to finish the extraction: parsing the information of each URL entry.
# Extract the keys we want
for u in urls:
    values = [hash_sitemap]
    for head in headers:
        loc = None
        loc = u.find(head)
        if not loc:
            loc = "None"
        else:
            loc = loc.string
        values.append(loc)
    out.append(values)
The function takes a headers parameter: a list of all the tags you want to retrieve from each URL entry of the sitemap, for example ["loc", "lastmod"].
The function returns a pandas DataFrame for easier manipulation down the line.
Example
Extract all the post URLs of https://primates.dev/sitemap-posts.xml
parse_sitemap("https://primates.dev/sitemap-posts.xml", ["loc", "lastmod" ])
Resulting in:
Source loc lastmod
0 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/brave-says-no-to-error-404/ 2020-03-02T11:28:46.280Z
1 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/create-games-directly-in-... 2020-02-29T23:49:11.000Z
2 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/ddos-with-a-crapy-computer/ 2020-02-28T12:43:34.438Z
3 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/parsing-an-api-xml-respon... 2020-02-27T22:11:12.323Z
...
18 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/optimize-your-website-in-... 2020-02-18T14:45:11.000Z
Extract all the URLs of https://primates.dev/sitemap.xml
dataframe = parse_sitemap("https://primates.dev/sitemap.xml", ["loc" ])
Resulting in:
Source loc
0 abf73adfd112dfa0235f39ac8ef9e6ec https://primates.dev/the-team/
1 abf73adfd112dfa0235f39ac8ef9e6ec https://primates.dev/become-an-author/
2 abf73adfd112dfa0235f39ac8ef9e6ec https://primates.dev/categories/
3 abf73adfd112dfa0235f39ac8ef9e6ec https://primates.dev/
4 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/brave-says-no-to-error-404/
...
44 bd98aba14cd9e52313c0ae77dc97f892 https://primates.dev/tag/seo/
45 bd98aba14cd9e52313c0ae77dc97f892 https://primates.dev/tag/ads/
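Because the result is a regular pandas DataFrame, you can persist or filter it however you like. For instance (the file name here is arbitrary):
dataframe.to_csv("primates_urls.csv", index=False)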
I hope this little piece of code will be useful. Please feel free to comment if you find new sitemaps; I'll be more than delighted to see what kind of projects you build with it. Have fun! I hope it showed you that standard crawlers are not always the answer.
Link to the Gist of the script here