Web Scraping Instagram.com Overview

2024-04-08

Instagram is one of the biggest social networks focused on media sharing and public announcements, making it a popular target for web scraping.

Instagram.com uses proprietary web scraping protection technology that is updated constantly. This makes it difficult to scrape Instagram reliably at scale, and that's where web scraping APIs can really come in handy.

Overall, most web scraping APIs we've tested through our benchmarks perform well for scraping Instagram.com at $2.38 per 1,000 scrape requests on average.

Instagram.com scraping API benchmarks

Scrapeway runs weekly benchmarks for Instagram Pages for the most popular web scraping APIs. Here's the table for this week:

Service   Success %   Change   Speed    Change   Cost $/1000   Change
1         100%        +5       3.4s     -0.1     $2.2          =
2         87%         -13      5.4s     +1.4     $2.71         =
3         86%         -14      4.7s     -0.2     $3.75         =
4         85%         -14      40.4s    +32.7    $4.75         =
5         83%         -7       4.7s     -0.2     $3.27         =
6         0%                   -                 -
7         0%                   -                 -
Data range Sep 27 - Oct 04

How to scrape instagram.com?

Instagram.com can be surprisingly complex to scrape as it's a giant web app with a GraphQL backend. For people unfamiliar with reverse engineering through the browser's network inspector, it's probably best to pay extra and use a headless browser to fully render pages.

In this case, look for web scraping API services that support full browser automation and can click on specific posts and scroll through comments.

That being said, Instagram can be scraped without headless browsers by using its GraphQL backend directly, as in this Python example for user profile scraping:

Instagram.com scraper
import json
from parsel import Selector
# install using `pip install scrapingbee`
from scrapingbee import ScrapingBeeClient

# create an API client instance
client = ScrapingBeeClient(api_key="YOUR API KEY")

# create scrape function that returns HTML parser for a given URL
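def scrape(url: str, country: str = "", render_js: bool = False, headers: dict = None) -> Selector:
    params = {
        "json_response": True,  # wrap the result in JSON; page content is under "body"
        "transparent_status_code": True,
        "premium_proxy": True,
        "render_js": render_js,  # headless browser rendering is off unless requested
    }
    if country:
        # optional proxy country, e.g. "us"
        params["country_code"] = country
    api_result = client.get(url, headers=headers, params=params)
    assert api_result.ok, api_result.text
    data = api_result.json()
    return Selector(data['body'])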


# this example shows how Instagram can be scraped through its backend API
username = "google"
selector = scrape(
    url=f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
    headers={"x-ig-app-id": "936619743392459"},  # this is needed to access IG backend API
)
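# this returns a giant JSON dataset with all Instagram profile details
# note: this relies on parsel detecting the JSON body (supported in recent parsel
# versions), in which case .get() returns the parsed data rather than a string
dataset = selector.get()['data']['user']

# some examples of what can be found in the dataset:
from pprint import pprint
pprint(dataset)
# example output (truncated, shown as JSON):
{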
    "biography": "Google unfiltered\u2014sometimes with filters.",
    "external_url": "https://linkin.bio/google",
    "external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fgoogle&e=AT1fNmnZFR3WyD72UwTTj1nHk-vS6oGaZH57Gwqfq6nj35T_1H3nVcQIphay4l-3qTHSrrBm0QqrKi7TWRjhQEVXLB0VTNYeLNuD_zP-FVr9BOxP",
    "edge_followed_by": {
        "count": 14995780
    },
    "fbid": "17841401778116675",
    "edge_follow": {
        "count": 34
    },
    "full_name": "Google",
    "business_address_json": "{\"city_name\": \"Mountain View, California\", \"city_id\": 108212625870265, \"latitude\": 37.4221, \"longitude\": -122.08432, \"street_address\": \"1600 Amphitheatre Pkwy\", \"zip_code\": \"94043\"}",
    "category_enum": "INTERNET_COMPANY",
    "category_name": "Internetunternehmen",
    "is_verified": true,
    "is_verified_by_mv4b": false,
    "profile_pic_url": "https://instagram.fdtm2-2.fna.fbcdn.net/v/t51.2885-19/425391724_2454717941393726_7200817596193793590_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.fdtm2-2.fna.fbcdn.net&_nc_cat=1&_nc_ohc=p9yxdQR7qOMAb6AQvYj&edm=AOQ1c0wBAAAA&ccb=7-5&oh=00_AfBkva0psR8QgUzFcA0TqYyQNcqt5qBLrxASA62nS2umiw&oe=661EA69C&_nc_sid=8b3546",
    "username": "google",
    ... recent posts and much more
}

In the example above, we call Instagram's backend GraphQL endpoint directly. This provides the entire Instagram post, comment and metadata dataset. Note that we send the x-ig-app-id header, which identifies the client app, to access this endpoint.
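The profile dataset is deeply nested, so it usually helps to flatten the fields of interest into a simple record. Below is a minimal sketch of that step, assuming the field names shown in the example output above. The summarize_profile helper and the sample dictionary are hypothetical and only for illustration. Note that business_address_json is a JSON string inside the JSON response, so it needs a second decoding step.

```python
import json
from typing import Any


def summarize_profile(user: dict[str, Any]) -> dict[str, Any]:
    """Flatten a few fields of the profile dataset into a simple record.

    Hypothetical helper: field names are taken from the example output above
    and may be missing or renamed in real responses, so every lookup is guarded.
    """
    # business_address_json is a JSON-encoded string, so decode it separately
    raw_address = user.get("business_address_json")
    address = json.loads(raw_address) if raw_address else {}
    return {
        "username": user.get("username"),
        "full_name": user.get("full_name"),
        "biography": user.get("biography"),
        "followers": (user.get("edge_followed_by") or {}).get("count"),
        "following": (user.get("edge_follow") or {}).get("count"),
        "is_verified": user.get("is_verified", False),
        "category": user.get("category_name"),
        "city": address.get("city_name"),
        "website": user.get("external_url"),
    }


# trimmed copy of the example output, used as offline sample data
sample = {
    "username": "google",
    "full_name": "Google",
    "biography": "Google unfiltered\u2014sometimes with filters.",
    "external_url": "https://linkin.bio/google",
    "edge_followed_by": {"count": 14995780},
    "edge_follow": {"count": 34},
    "is_verified": True,
    "category_name": "Internetunternehmen",
    "business_address_json": '{"city_name": "Mountain View, California", "zip_code": "94043"}',
}

summary = summarize_profile(sample)
print(summary["username"], summary["followers"], summary["city"])
```

With the scraper above, the same helper could be called as summarize_profile(dataset). Guarding every lookup with .get() is a deliberate choice here, since the response shape is not documented and can change without notice.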

Why scrape Instagram Pages?

Web scraping Instagram.com is a popular use case for following social signals and announcements, but there are many less obvious uses like tracking e-commerce movements and AI training.

With Instagram signal monitoring, we can keep track of certain channels for important announcements and other signals. This data is important in market trend estimation or even stock movement prediction, as Instagram is often one of the first announcement sources.

Increasingly, Instagram data scraping is also used in market research. As IG is becoming a major advertisement and e-commerce hub, we can track various e-commerce data points like advertisements, product sentiment and pricing.

Finally, Instagram is a vast ocean of user-generated content, from brand posts to artist creations and real-life photographs, making it a popular target for AI training.