Get All URLs From A Website Using Python Script
Have you ever wanted to extract all the URLs of a website quickly? We'll tell you how! Usually, crawlers visit every page of a website and index them. This method is extremely slow and often means spending hours crawling a single website.
Most websites have a file called a Sitemap that lists all their URLs. If we manage to find this file, we can find all the URLs of a website in seconds. We'll look at a simple way to extract all the URLs of a website based on its Sitemap. It is hundreds of times faster than crawling all the pages of a website to find them.
Sitemap
The Sitemap allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. Most websites, such as Primates.dev, have their Sitemap located at /sitemap.xml.
Let's look at our Sitemap. Go to https://primates.dev/sitemap.xml.
You should end up on a page that looks like the image below.
You can see that we have four sitemaps in our sitemap.xml. Usually, websites separate their content into different sitemaps.
- sitemap-pages.xml references all the pages of our website that are not posts.
- sitemap-posts.xml references all the articles on our website. Useful if you want to parse a website's article content.
- sitemap-authors.xml references all the authors of our website. It is specific to our website because we have multiple authors.
- sitemap-tags.xml references all the tags of our website.
Sitemaps are present on almost every website on the web. A sitemap is a file that allows search engines to easily find new pages on a website without having to crawl it. What if you can't find your Sitemap? You can look at your robots.txt file.
Go to <yourwebsite>/robots.txt.
As an example, we'll take a look at https://primates.dev/robots.txt.
- Sitemap: References the sitemaps of a website. The sitemaps are not always listed in the robots.txt, but most websites declare their Sitemap in this file (see the snippet after this list for a way to pull these entries out programmatically).
- Disallow: Pages we don't want to be crawled or referenced in search engines. For example, /ghost/ is the address we use to go to our admin.
- User-agent: Defines the rules that have to be followed by the specified user agent. Here we state that the following rules apply to everybody.
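If you would rather automate that check, here is a minimal sketch (not part of the original script) that fetches a site's robots.txt with requests and pulls out the Sitemap entries:
import requests

def sitemaps_from_robots(base_url):
    # Fetch <base_url>/robots.txt and return the URLs listed on "Sitemap:" lines
    resp = requests.get(base_url.rstrip("/") + "/robots.txt", timeout=10)
    if resp.status_code != 200:
        return []
    return [line.split(":", 1)[1].strip()
            for line in resp.text.splitlines()
            if line.lower().startswith("sitemap:")]

print(sitemaps_from_robots("https://primates.dev"))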
No luck with the robots.txt file? Well, then you need some help.
Method 1: Try common sitemap locations
Here is a list of the most common sitemap URI paths we have found. Note that popular paths can change with the technologies of the moment used to build websites. Still, most websites use one of the same URIs for their sitemaps, so a sitemap is usually easy to find, whether through the robots.txt or by just trying the paths below.
/sitemap.xml
/feeds/posts/default?orderby=updated
/sitemap.xml.gz
/sitemap_index.xml
/s2/sitemaps/profiles-sitemap.xml
/sitemap.php
/sitemap_index.xml.gz
/vb/sitemap_index.xml.gz
/sitemapindex.xml
/sitemap.gz
/sitemap_news.xml
/sitemap-index.xml
/sitemap-news.xml
/post-sitemap.xml
/page-sitemap.xml
/portfolio-sitemap.xml
/home_slider-sitemap.xml
/category-sitemap.xml
/author-sitemap.xml
With this extensive list of common sitemap URI paths, you should be able to find what you are looking for. If the sitemap is not at any of these paths, Method 2 should give you the answer.
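To save some manual typing, here is a small sketch (separate from the parser presented later) that probes a handful of the paths above and reports which ones answer with HTTP 200:
import requests

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml",
                "/sitemap.php", "/post-sitemap.xml", "/page-sitemap.xml"]

def find_sitemaps(base_url, paths=COMMON_PATHS):
    # Return the candidate sitemap URLs that respond with HTTP 200
    found = []
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if resp.status_code == 200:
            found.append(url)
    return found

print(find_sitemaps("https://primates.dev"))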
Method 2: Search with Google
Let's use Google to find sitemaps we could not locate otherwise. This method usually only works for big websites, such as news sites.
Go to Google.com and type site:<url_website> filetype:xml
For example: site:theguardian.com filetype:xml
If you have any luck you should find a sitemap.
As you can see, we found a sitemap that isn't on our list. It is a simple and effective way to find sitemaps for huge websites. Depending on the website, this method may or may not turn anything up.
How to read sitemaps
Let's take a look at the source code of a sitemap. Go to https://primates.dev/sitemap.xml and look at the source code.
This is what we call a sitemap index. It is a list of other sitemaps. As mentioned above, it references all the sitemaps of our website.
- loc: URL of the sitemap
- lastmod: last time the Sitemap was modified.
This sitemap index file is essential for crawlers. It is the entry point for Google's crawlers: with this one file, a crawler knows where all the other sitemaps are and whether they have changed since the last crawl.
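As a quick illustration (using the same libraries as the parser further down), this sketch lists the child sitemaps referenced by a sitemap index:
import requests
from bs4 import BeautifulSoup as Soup

def list_child_sitemaps(index_url):
    # Return the <loc> of every <sitemap> entry in a sitemap index
    resp = requests.get(index_url, timeout=10)
    soup = Soup(resp.content, "xml")
    return [tag.find("loc").string for tag in soup.findAll("sitemap")]

print(list_child_sitemaps("https://primates.dev/sitemap.xml"))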
Let's take a look at https://primates.dev/sitemap-posts.xml. Same thing, look at the source code.
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="//primates.dev/sitemap.xsl"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>https://primates.dev/brave-says-no-to-error-404/</loc>
<lastmod>2020-03-02T11:28:46.280Z</lastmod>
<image:image>
<image:loc>https://primates.dev/content/images/2020/02/404-brave-not-found-primates-dev.jpg</image:loc>
<image:caption>404-brave-not-found-primates-dev.jpg</image:caption>
</image:image>
</url>
<url>
<loc>https://primates.dev/create-games-directly-in-your-browser-babylonjs/</loc>
<lastmod>2020-02-29T23:49:11.000Z</lastmod>
<image:image>
<image:loc>https://primates.dev/content/images/2020/02/gaming-babylonjs-primates-dev.jpg</image:loc>
<image:caption>gaming-babylonjs-primates-dev.jpg</image:caption>
</image:image>
</url>
</urlset>
This is only an extract of the sitemap; every URL entry carries the same kind of information (the short sketch after the list below shows how to read these fields in Python).
- urlset: Indicates that it is a list of URLs
- url: Indicates a URL
- loc: URL of the page
- lastmod: Last time the page was changed
- image:image: Wraps the image information attached to the page
- image:loc: URL of the featured image of the article
- image:caption: Caption of the image
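As a minimal illustration (the full, recursive parser follows in the next section), this sketch reads just the loc and lastmod of each entry in our posts sitemap:
import requests
from bs4 import BeautifulSoup as Soup

resp = requests.get("https://primates.dev/sitemap-posts.xml")
soup = Soup(resp.content, "xml")
# Each <url> entry holds the page URL and its last modification date
for entry in soup.findAll("url"):
    print(entry.find("loc").string, entry.find("lastmod").string)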
Isn't that better than a standard crawler? We now have a way to find all the URLs of a website; we just have to find a way to extract all this information. Fortunately for us, here is a small piece of code that does the job :D
Code for parsing sitemaps
Here is the code.
Installation
pip install requests
pip install beautifulsoup4
pip install lxml
pip install pandas
Not many libraries are needed, and the code is fairly simple. requests handles the HTTP calls, and lxml is the parser BeautifulSoup relies on when asked to parse XML.
Crawler & Parser function
import requests
from bs4 import BeautifulSoup as Soup
import pandas as pd
import hashlib

# Pass the headers you want to retrieve from the xml such as ["loc", "lastmod"]
def parse_sitemap(url, headers):
    resp = requests.get(url)
    # we didn't get a valid response, bail
    if (200 != resp.status_code):
        return False
    # BeautifulSoup to parse the document
    soup = Soup(resp.content, "xml")
    # find all the <url> and <sitemap> tags in the document
    urls = soup.findAll('url')
    sitemaps = soup.findAll('sitemap')
    new_list = ["Source"] + headers
    panda_out_total = pd.DataFrame([], columns=new_list)
    if not urls and not sitemaps:
        return False
    # Recursive call to the function if the sitemap contains other sitemaps
    if sitemaps:
        for u in sitemaps:
            sitemap_url = u.find('loc').string
            panda_recursive = parse_sitemap(sitemap_url, headers)
            # Skip child sitemaps that could not be fetched or parsed
            if isinstance(panda_recursive, pd.DataFrame):
                panda_out_total = pd.concat([panda_out_total, panda_recursive], ignore_index=True)
    # storage for later...
    out = []
    # Creates a hash of the parent sitemap
    hash_sitemap = hashlib.md5(str(url).encode('utf-8')).hexdigest()
    # Extract the keys we want
    for u in urls:
        values = [hash_sitemap]
        for head in headers:
            loc = None
            loc = u.find(head)
            if not loc:
                loc = "None"
            else:
                loc = loc.string
            values.append(loc)
        out.append(values)
    # Creates a dataframe
    panda_out = pd.DataFrame(out, columns=new_list)
    # If recursive then merge the recursive dataframe
    if not panda_out_total.empty:
        panda_out = pd.concat([panda_out, panda_out_total], ignore_index=True)
    # returns the dataframe
    return panda_out
Explanation
First of all, we make a request to the URL passed as a function parameter.
resp = requests.get(url)
# we didn't get a valid response, bail
if (200 != resp.status_code):
    return False
Then we parse the content of the response using BeautifulSoup4.
# BeautifulSoup to parse the document
soup = Soup(resp.content, "xml")
Then we look for <url> and <sitemap> tags, which tells us whether we are in a urlset or a sitemap index.
# find all the <url> tags in the document
urls = soup.findAll('url')
sitemaps = soup.findAll('sitemap')
If we are in a sitemapindex such as https://primates.dev/sitemap.xml we recursively call the function passing the URL (loc) of the sitemap.
# Recursive call to the function if the sitemap contains other sitemaps
if sitemaps:
    for u in sitemaps:
        sitemap_url = u.find('loc').string
        panda_recursive = parse_sitemap(sitemap_url, headers)
        # Skip child sitemaps that could not be fetched or parsed
        if isinstance(panda_recursive, pd.DataFrame):
            panda_out_total = pd.concat([panda_out_total, panda_recursive], ignore_index=True)
Then we create a hash of the parent sitemap's URL so every extracted row can be traced back to the sitemap it came from.
# Creates a hash of the parent sitemap
hash_sitemap = hashlib.md5(str(url).encode('utf-8')).hexdigest()
Only one step is left to finish the extraction: parsing the information of each URL entry.
# Extract the keys we want
for u in urls:
    values = [hash_sitemap]
    for head in headers:
        loc = None
        loc = u.find(head)
        if not loc:
            loc = "None"
        else:
            loc = loc.string
        values.append(loc)
    out.append(values)
The function takes a headers parameter: a list of all the tags you want to retrieve from each URL entry of the sitemap, for example ["loc", "lastmod"].
The function returns a pandas DataFrame for easier manipulation down the line.
Example
Extract all the post URLs of https://primates.dev/sitemap-posts.xml
parse_sitemap("https://primates.dev/sitemap-posts.xml", ["loc", "lastmod" ])
Resulting in:
Source loc lastmod
0 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/brave-says-no-to-error-404/ 2020-03-02T11:28:46.280Z
1 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/create-games-directly-in-... 2020-02-29T23:49:11.000Z
2 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/ddos-with-a-crapy-computer/ 2020-02-28T12:43:34.438Z
3 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/parsing-an-api-xml-respon... 2020-02-27T22:11:12.323Z
...
18 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/optimize-your-website-in-... 2020-02-18T14:45:11.000Z
Extract all the URLs of https://primates.dev/sitemap.xml
dataframe = parse_sitemap("https://primates.dev/sitemap.xml", ["loc" ])
Resulting in:
Source loc
0 abf73adfd112dfa0235f39ac8ef9e6ec https://primates.dev/the-team/
1 abf73adfd112dfa0235f39ac8ef9e6ec https://primates.dev/become-an-author/
2 abf73adfd112dfa0235f39ac8ef9e6ec https://primates.dev/categories/
3 abf73adfd112dfa0235f39ac8ef9e6ec https://primates.dev/
4 7e3bc65a80810f933f22b0b2db05d8d6 https://primates.dev/brave-says-no-to-error-404/
...
44 bd98aba14cd9e52313c0ae77dc97f892 https://primates.dev/tag/seo/
45 bd98aba14cd9e52313c0ae77dc97f892 https://primates.dev/tag/ads/
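Because the result is a regular pandas DataFrame, you can persist or filter it however you like. For instance (the file name here is arbitrary):
dataframe.to_csv("primates_urls.csv", index=False)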
I hope this little piece of code will be useful. Please feel free to comment if you find new sitemaps; I'll be more than delighted to see what kind of projects you build with it. Have fun! I hope it showed you that standard crawlers are not always the answer.
Link to the Gist of the script here