X.com (formerly Twitter) is one of the biggest social networks out there and a popular web scraping target for tracking social signals and announcements.
X.com uses proprietary web scraping protection mechanisms that are constantly evolving. This makes it difficult to scrape Twitter data reliably, and this is where web scraping APIs come in handy.
Overall, most web scraping APIs we've tested in our benchmarks perform well for X.com, at an average of $7.81 per 1,000 scrape requests.
Twitter.com scraping API benchmarks
Scrapeway runs weekly benchmarks for X.com Tweets for the most popular web scraping APIs. Here's the table for this week:
Service | Success % | Speed | Cost $/1000
---|---|---|---
1 | 95% (+28) | 13.1s (-2.3) | $0.9 (=)
2 | 89% (+10) | 24.3s (+0.9) | $6.9 (=)
3 | 86% (+13) | 6.0s (+0.5) | $8.17 (=)
4 | 81% (+12) | 29.2s (+2.3) | $2.71 (=)
5 | 1% (-10) | 16.0s (-13.7) | $12.25 (=)
6 | 0% | - | -
7 | 0% | 17.1s (+2.5) | $23.75 (=)
How to scrape twitter.com?
X.com is relatively difficult to scrape as it's a JavaScript-heavy application, so headless browser use is required. All web scraping APIs we've tested support headless browsers, however only some of them provide full browser control.
In addition, Twitter has a lot of anti-scraping mechanisms in place, so it's recommended to use a reliable web scraping service that can bypass the constantly changing anti-scraping measures. See the benchmarks above for the most up-to-date results.
As for parsing, scraped X.com data is relatively easy to handle with traditional HTML parsing tools like XPath or CSS selectors. Twitter uses `data-testid` markup extensively throughout its application, meaning it's very easy to parse the HTML for the data you need.
from parsel import Selector
# install using `pip install scrapfly-sdk`
from scrapfly import ScrapflyClient, ScrapeConfig

# create an API client instance
client = ScrapflyClient(key="YOUR API KEY")

# create scrape function that returns an HTML parser for a given URL
def scrape(url: str, country: str = "US", render_js: bool = False, headers: dict = None) -> Selector:
    api_result = client.scrape(ScrapeConfig(
        url=url,
        asp=True,  # enable anti-scraping protection bypass
        country=country,
        render_js=render_js,  # use a headless browser to render JavaScript
        headers=headers,
        wait_for_selector="[data-testid='tweet']",  # wait until a tweet is rendered
    ))
    return api_result.selector
url = "https://twitter.com/XCreators/status/1770093017506189440"
selector = scrape(url, render_js=True, country="US")
# Twitter can be parsed using CSS selectors and data-testid attributes
views, reposts, quotes, likes, bookmarks, *_ = selector.css("[data-testid=app-text-transition-container] span::text").getall()
data = {
    # join all text nodes as the tweet text can span multiple elements
    "tweet": "".join(selector.css("[data-testid=tweetText] ::text").getall()),
    "views": views,
    "reposts": reposts,
    "quotes": quotes,
    "likes": likes,
    "bookmarks": bookmarks,
}
from pprint import pprint
pprint(data)
{'bookmarks': '44',
'likes': '725',
'quotes': '24',
'reposts': '127',
'tweet': 'X is the platform for content creators to freely express their '
'artistic and diverse perspectives without the constraints of '
'censorship. Since the introduction of our ad revenue share program, '
'X has paid out an impressive sum of more than $45 million to more '
'than 150,000 creators.',
'views': '530.8K'}
For scraping X.com above we're using a headless browser to render the Twitter web app. Then, to find elements within the HTML, the `data-testid` attributes come in handy, as they are what Twitter uses for its own headless test browsers.
Why scrape X.com Tweets?
X.com is a popular target for web scraping because it has a large amount of social signal data that can be used for various purposes like signal analysis and sentiment analysis.
With announcement monitoring we can track certain Twitter channels for new posts or changes in sentiment, which can be used for trading or marketing purposes.
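At its core, announcement monitoring is a polling loop that remembers which post IDs it has already seen and reports only the new ones. A minimal sketch of that dedup step (the IDs below are hypothetical placeholders, not real posts):

```python
def new_announcements(seen_ids: set, latest_ids: list) -> list:
    """Return IDs not seen before, preserving feed order, and mark them as seen."""
    fresh = [tweet_id for tweet_id in latest_ids if tweet_id not in seen_ids]
    seen_ids.update(fresh)
    return fresh

# two hypothetical polling rounds over a channel feed
seen = set()
first_round = new_announcements(seen, ["1770093017506189440", "1770000000000000001"])
second_round = new_announcements(seen, ["1770100000000000002", "1770093017506189440"])
print(first_round)   # both IDs are new on the first poll
print(second_round)  # only the unseen ID is reported
```

In practice the `seen` set would be persisted (e.g. in a database) between polling runs.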
Market research can also be done by scraping Twitter data to identify trends and sentiment around certain products or services.
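Sentiment around a product is often first approximated with a simple lexicon score before reaching for ML models. A toy sketch of that idea (the word lists are illustrative only, not a real sentiment lexicon):

```python
# illustrative word lists; real lexicons (or ML models) are far larger
POSITIVE = {"great", "love", "impressive", "win"}
NEGATIVE = {"bad", "hate", "broken", "fail"}

def sentiment_score(text: str) -> int:
    """Naive lexicon score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("Love the new feature, great work"))  # 2
print(sentiment_score("The app is broken again"))           # -1
```

Applied over thousands of scraped tweets, even a crude score like this can surface sentiment trends around a product or service.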
Another popular use case for X.com scraping is competition tracking by scraping competitor post performance and follower gains.
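Follower-gain tracking reduces to simple arithmetic over periodic snapshots of scraped follower counts. A sketch, with invented snapshot data for illustration:

```python
def follower_gains(snapshots: dict) -> dict:
    """Compute per-account gain between the first and last follower-count snapshot."""
    return {account: counts[-1] - counts[0] for account, counts in snapshots.items()}

# hypothetical weekly follower counts per competitor account
snapshots = {
    "@competitor_a": [10_500, 10_650, 10_900],
    "@competitor_b": [8_200, 8_150, 8_400],
}
gains = follower_gains(snapshots)
print(gains)  # {'@competitor_a': 400, '@competitor_b': 200}
```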
Finally, Twitter contains a lot of text data, which can be used for AI training.