What is Datadome and how to bypass it when web scraping?

Datadome is a leading anti-bot protection service widely adopted across various websites to safeguard against automated access.

While some automated web access is genuinely harmful, systems like Datadome, which serve as Web Application Firewalls (WAF), sometimes also prevent benign web scraping bots from accessing public data.

In this brief introduction, we'll explore how Datadome's anti-bot technology functions and discuss legitimate methods to circumvent it for web scraping purposes.

What is Datadome WAF?

Datadome provides a Web Application Firewall (WAF) service that acts as a protective middleware between a website and its users.

This crucial middleware layer filters incoming traffic to distinguish bot-generated connections from human users.

Although most web scrapers are benign, they are still automated bots and are therefore typically blocked by Datadome's sophisticated filters.

Example of a Datadome block page encountered while web scraping

A key distinction between harmful bots and benign scrapers is their interaction with website elements.

Benign web scrapers simply harvest publicly available data without interacting with core website functionalities, making it feasible to design scrapers that can evade Datadome's anti-bot measures. To achieve this, an understanding of Datadome's detection mechanisms is essential.

How to bypass Datadome?

Bypassing Datadome when web scraping involves many different techniques and resources, most of which are already implemented by major web scraping APIs.

Here are the best web scraping APIs for Datadome anti-bot protected targets:

Service | Success % | Speed | Cost $/1000
------- | --------- | ----- | -----------
#1 | 100% (+2) | 2.8s (-0.5) | $4.9 (=)
#2 | 100% (=) | 9.3s (-0.1) | $4.19 (-0.03)
#3 | 100% (+1) | 27.8s (-4.2) | $1.9 (=)
#4 | 98% (-2) | 8.2s (-1.0) | $6.9 (=)
#5 | 98% (+2) | 22.8s (-3.3) | $2.71 (=)
#6 | 89% (+1) | 2.6s (+0.1) | $3.27 (=)
#7 | 11% (+1) | 11.0s (+4.5) | $2.2 (=)

Values in parentheses are the change since the previous benchmark run; "=" means no change.

Date range: Dec 13 - Dec 20

Benchmark target: Etsy.com product and other public pages.

However, to bypass the Datadome WAF manually, a web scraper needs to be fortified with several key features that resist Datadome's identification methods. To implement them, we need to take a deep look at how Datadome works.


How does the Datadome anti-bot work?

Datadome inspects each incoming connection and generates a trust score that estimates the likelihood of the connection coming from a real human being; this score is used to filter out suspected bots.

To generate this score, Datadome uses various fingerprinting techniques and data point metrics. Let's take a quick look at some.

IP Address

The first metric is the IP address of the connecting client, which every client necessarily exposes.

There's a limited number of IP addresses on the internet, and each has distinct features that let Datadome assign bot probabilities to it.

πŸ—ΊοΈπŸ—ΊοΈπŸ—ΊοΈ

For example, a user connecting from a home internet connection is significantly less likely to be a robot than a user connecting from a data center.

With this, IP addresses are separated into several categories:

  • Datacenter - assigned to all data centers like AWS, Google Cloud hosts and so on.
  • Residential - assigned to all home connections.
  • Mobile - assigned to mobile cell towers, satellites and so on.

To combat IP address analysis, scrapers should use high-quality residential or mobile proxies that haven't been flagged by Datadome yet.
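
For example, here is a minimal sketch of proxy rotation using Python's requests library. The proxy endpoints and credentials below are hypothetical placeholders for a real provider's pool:

```python
import random

import requests

# Hypothetical pool of residential proxy endpoints - substitute the
# credentials and hosts provided by your proxy service.
RESIDENTIAL_PROXIES = [
    "http://user:pass@residential-1.example.com:8000",
    "http://user:pass@residential-2.example.com:8000",
    "http://user:pass@residential-3.example.com:8000",
]

def get_with_rotating_proxy(url: str) -> requests.Response:
    """Route each request through a randomly chosen residential exit IP."""
    proxy = random.choice(RESIDENTIAL_PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

response = get_with_rotating_proxy("https://www.etsy.com/")
print(response.status_code)
```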

JavaScript fingerprinting and challenges

The second metric is the client's ability to execute JavaScript. As most bots don't execute JavaScript, an easy way to identify them is to serve a JavaScript challenge.

These challenges are often simple mathematical puzzles that use tokens distributed across other parts of the website. Reverse engineering this behavior can be tricky without a real web browser.

To solve JavaScript challenges, scrapers need to run real web browsers through headless browser automation, most commonly with libraries like Puppeteer, Playwright, or Selenium.
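
As a rough illustration, here is a minimal Playwright sketch where a real browser engine executes the page's JavaScript challenges; the target URL is just an example:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.etsy.com/")
    # wait until network activity settles so challenge scripts can finish
    page.wait_for_load_state("networkidle")
    html = page.content()
    browser.close()

print(len(html))
```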

πŸ‘£πŸ‘£πŸ‘£

Headless browsers can themselves be identified through JavaScript fingerprinting techniques, as they differ slightly from regular user browsers in many small ways. These tiny details add up to a full evaluation profile that can be used to identify robots very successfully.

To combat fingerprinting, headless browsers need to be patched with fingerprint resistance and randomization.
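
As a simplified example of such patching, the snippet below hides the well-known navigator.webdriver leak using Playwright's add_init_script. Real stealth patches cover many more details than this single property:

```python
from playwright.sync_api import sync_playwright

# Headless Chromium exposes navigator.webdriver=true by default - one of
# many small leaks that fingerprinting scripts check for.
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.add_init_script(STEALTH_JS)  # runs before any page script
    page.goto("https://example.com/")
    print(page.evaluate("navigator.webdriver"))  # None once patched
    browser.close()
```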

HTTP fingerprinting and analysis

Real-user web browsers connect in a few distinctive ways, and deviations from these patterns can be used to identify robot connections.

Most robots still use HTTP1.1 connections, which are not used by real browsers at all in 2024. This means scrapers should use the newer HTTP2 or HTTP3 protocols.
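
For instance, the httpx Python client can negotiate HTTP2, assuming the optional http2 extra is installed (pip install "httpx[http2]"):

```python
import httpx

# httpx negotiates HTTP2 with servers that support it when http2=True.
with httpx.Client(http2=True) as client:
    response = client.get("https://www.etsy.com/")
    print(response.http_version)  # "HTTP/2" when negotiation succeeds
```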

πŸ”ŽπŸ”ŽπŸ”Ž

While advanced HTTP clients like libcurl support HTTP2 pretty well, they are susceptible to HTTP2 fingerprinting. This type of fingerprinting measures slight differences between the client's HTTP2 implementation and that of a real browser.

Another fingerprinting technique related to HTTP is TLS fingerprinting. In this type of fingerprint, the TLS handshake is analyzed for slight differences between the client and the real browser.

To combat HTTP and TLS fingerprinting, scrapers need to use advanced HTTP2-capable clients that are resistant to TLS and HTTP fingerprinting.
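
One such option in Python is the curl_cffi library, which can impersonate the TLS and HTTP2 fingerprints of real browsers; a minimal sketch:

```python
from curl_cffi import requests

# impersonate="chrome" makes the TLS handshake and HTTP2 settings look
# like a recent Chrome release instead of stock libcurl.
response = requests.get("https://www.etsy.com/", impersonate="chrome")
print(response.status_code)
```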

Behavior and technical analysis

While scrapers usually just collect data without interacting with the website, they can still be identified through their behavior.

Most commonly this is done through scraper implementation mistakes that just don't happen with real users. Some examples:

  • Forgetting custom request headers.
  • Sending requests in different formats or encodings.
  • Missing expected cookies or browser-specific headers like User-Agent.
  • Visiting pages that real users never visit, known as honeypots.

These slight irregularities are tracked by Datadome in its trust score calculation, so scrapers need to be diligent and replicate real user requests as closely as possible.
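
As a small sketch, a scraper might mirror a real browser's headers with a persistent session. The header values below are illustrative; capture the exact set from your own browser's network inspector:

```python
import requests

# Header set copied from a real Chrome session (values are illustrative).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

session = requests.Session()  # persists cookies across requests
session.headers.update(BROWSER_HEADERS)
response = session.get("https://www.etsy.com/")
```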

πŸͺ€πŸͺ€πŸͺ€

Datadome logs every connection and uses behavior and pattern analysis to identify robots. This means that scrapers that behave in unusual patterns, connect in bursts, or have any connection irregularities can be identified through AI-based analysis.

To combat behavior analysis, scrapers need to implement a human-like behavior pattern that is indistinguishable from a real user.
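
A minimal sketch of such pacing, assuming a caller-supplied fetch function, simply adds randomized jittered delays between requests instead of hammering the server at a fixed interval:

```python
import random
import time

def polite_crawl(urls, fetch):
    """Fetch pages one by one with human-like, randomized pauses."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(random.uniform(2.0, 8.0))  # jittered pause between pages
    return results
```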

What are some websites that use Datadome?

Datadome is a widely adopted anti-bot service, and many popular websites encountered in web scraping use it. Here's a quick list:

  • Etsy.com
  • Hermes.com
  • Leboncoin.com
  • Marketwatch.com
  • Reuters.com
  • Tripadvisor.com
  • WSJ.com
  • Wellfound.com

And many others, though most big targets rotate between multiple anti-bot technologies based on their needs and performance.

Many other targets implement Datadome anti-bot protection temporarily when experiencing high traffic or bot attacks, which affects web scraping as well.

Summary

Datadome employs advanced fingerprinting and analysis methods to distinguish bots and scrapers from real users, assigning each connection a dynamic trust score. Each web scraping technique can influence this score positively or negatively.

However, as Datadome continually upgrades its anti-bot measures, the challenge for web scrapers is to adapt swiftly to these evolving defenses.

Given these challenges, we strongly recommend employing a continuously updated and managed web scraping API service. Refer to the benchmarks for the latest comparative results!
