You've successfully subscribed to Primates
Great! Next, complete checkout for full access to Primates
Welcome back! You've successfully signed in.
Success! Your account is fully activated, you now have access to all content.
Success! Your billing info is updated.
Billing info update failed.

Hack the heck out of RSA

Imagine a world with no internet. Well, that article won't explain how to achieve that but will show you a few simple things you can do to understand how fragile the web is.

Stanislas Girard
Stanislas Girard

Imagine a world with no internet. Well, that article won't explain how to achieve that but will show you a few simple things you can do to understand how fragile the web is.

TL;DR:

  • Get millions of domain names
  • Extract the certificates
  • BatchGCD the heck out of the public keys

How to get millions of domain names?

The first step is to gather millions of SSL certificates to achieve that we need to find a way to collect unique domain names. Most of the websites allowing you to download this informations are asking for a financial contribution.

However, you can download 10 million domains on this link for free, but that won't be enough.

We are going to take a look at CommonCrawl, which is a repository of billions of web pages that have been crawled, representing petabytes of data. You could download the entire catalog and then parse it, but that would require thousands of euros of investment in a new computer.

Fortunately for us, Common Crawl has an S3 Bucket that can be crawled easily using Amazon Athena

AWS Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. Point to your data in Amazon S3, define the schema and start querying using standard SQL. Most results are delivered within seconds. With Athena, there's no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to analyze large-scale datasets quickly.

  • Open the Athena query editor.
  • Select us-east-1 as your location
  • Run the Query CREATE DATABASE ccindex

This will create a database

  • Create a new table with the following Query
CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
url_surtkey                   STRING,
url                           STRING,
url_host_name                 STRING,
url_host_tld                  STRING,
url_host_2nd_last_part        STRING,
url_host_3rd_last_part        STRING,
url_host_4th_last_part        STRING,
url_host_5th_last_part        STRING,
url_host_registry_suffix      STRING,
url_host_registered_domain    STRING,
url_host_private_suffix       STRING,
url_host_private_domain       STRING,
url_protocol                  STRING,
url_port                      INT,
url_path                      STRING,
url_query                     STRING,
fetch_time                    TIMESTAMP,
fetch_status                  SMALLINT,
content_digest                STRING,
content_mime_type             STRING,
content_mime_detected         STRING,
content_charset               STRING,
content_languages             STRING,
warc_filename                 STRING,
warc_record_offset            INT,
warc_record_length            INT,
warc_segment                  STRING)
PARTITIONED BY (
crawl                         STRING,
subset                        STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

This will map the table ccindex that you created with the location of the S3 bucket where the data is stored

  • Run

MSCK REPAIR TABLE ccindex

Note that you need to rerun this command every month as the CommonCrawl foundation adds new data.

  • Run
SELECT DISTINCT(url_host_name)
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
AND subset = 'warc'
AND url_host_tld = 'no'

This will return all the hostname from Norway that were crawled in May 2018

SELECT COUNT(*) AS count,
url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
AND subset = 'warc'
AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC

This Query will return all the url_host_registered_domain count from Norway

Mother of all queries -> How to get 27M domains in 30seconds:

SELECT DISTINCT(url_host_name)
FROM "ccindex"."ccindex"
WHERE subset = 'warc'

With this request, we managed to get 27M unique domains in less than a minute.

Now let's move to part 2

Get the certificates

Github Repository: https://github.com/StanGirard/GoSSL

Javascript Hell

Requesting millions of certificates isn't an easy task. As a javascript developer, I never encountered such a job. It needs to be fast and efficient. I got to experience my first memory leak in Nodejs, out of memory. I also experienced a low request per second (40 per second).

You can check index.js in the trash code folder if you want to see the implementation on JS.

Let's Go Python

I decided to move to python3 to avoid many of the headaches that I got with Nodejs. The results were not as expected (60-70) per second. No memory issues this time!

You can check test.py if you want to see the implementation in Python.

Go for the win

Python was not good enough; Goland seemed like a good fit. Never used before but seemed like the right choice.

The program can be found in src/main.go

cat domainnames | go run main.go

You need to have one domain per line. But that should not be an issue because we managed to gather one certificate per line with AWS Athena

Batch GCD

Now it is time to hack! We have millions of public keys. It is time to find GCD with those keys.

I created a python notebook MultiplyCerts to see the basic implementation and complexity of a Batch GCD implementation. I found out that the complexity of the basic implementation of Batch GCD is X^2. You can find the results in MultiplyCerts.html

Batch GCD Implementation

It can then be used (for example) by a project like batch-GCD, which implements the algorithm from here developed by DJ Bernstein to find weak primes.

(C++)+GMP implementation of the Batch GCD algorithm by Daniel Bernstein. This algorithm, described in How To Find Smooth Parts Of Integers, allows pairwise computing GCDs of a list of integers in quasilinear time and memory. See e.g., factorable.net, which also provides source code.

Conclusion

I will not share the results publicly, but you should be able to reproduce and find some impressive results with the BatchGCD algorithm.

It is easier to find GCD from millions of public keys than to try to find the factors of a public key.

News about protecting the web

RSAHack

Stanislas Girard

Creator of Primates & Author. Devops, Blockchain, Data, Security, Machine Learning, Web ... The World is full of interesting subjects.