Zillow is one of the biggest real estate listing websites in the United States and contains a vast amount of current and historical real estate data. This makes it one of the most popular real estate targets for web scraping.
Zillow.com uses its own proprietary scraping protection technology in combination with the PerimeterX anti-bot service. This makes it difficult to scrape Zillow property data reliably, and this is where web scraping APIs come in handy.
Overall, most web scraping APIs we've tested through our benchmarks perform well for Zillow.com at $1.98 per 1,000 scrape requests on average.
Zillow.com scraping API benchmarks
Scrapeway runs weekly benchmarks for Zillow Listings for the most popular web scraping APIs. Here's the table for this week:
Service | Success % | Speed | Cost $/1000
---|---|---|---
1 | 100% (=) | 4.5s (-1.7) | $6.9 (=)
2 | 100% (+1) | 4.7s (+1.4) | $0.49 (=)
3 | 97% (-3) | 3.4s (+1.8) | $3.76 (+0.01)
4 | 79% (-14) | 21.8s (+1.4) | $2.71 (=)
5 | 0% | - | -
6 | 0% (-34) | - (-1.8) | - (-2.2)
7 | 0% | - | -
How to scrape Zillow.com?
In terms of data extraction, Zillow is one of the easier targets to scrape: it's a highly efficient JavaScript application that ships all of its data as JSON inside the page HTML, which means a headless browser is not required.
That being said, Zillow.com has a lot of anti-scraping technologies in place, so it's recommended to use a reliable web scraping service that can bypass the constantly changing anti-scraping measures. See benchmarks for the most up-to-date results.
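Because even the better-performing services in the benchmark fail some requests, it can help to wrap each scrape call in a small retry loop. The following is a minimal sketch and not part of any provider's SDK: `retry_scrape`, its parameters, and the backoff values are hypothetical names chosen for illustration, and `fetch` stands for whatever function performs the actual API call.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_scrape(
    fetch: Callable[[str], T],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(url) until it succeeds or max_attempts is reached.

    Waits base_delay * 2 ** (attempt - 1) seconds between attempts
    (exponential backoff) and re-raises the last error if all attempts fail.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # in real code, catch only transport/API errors
            last_error = error
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error
```

The `sleep` argument is injectable only so the helper can be exercised in tests without actually waiting; in normal use the default `time.sleep` is enough.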
Zillow's HTML pages embed their data as JSON in Next.js framework variables such as `__NEXT_DATA__`, from which full listing datasets can be easily extracted, making it an easy scraping target overall.
```python
import json
from pprint import pprint

# webscrapingapi has a Python SDK but it's not great, use httpx instead:
# `pip install httpx parsel`
import httpx
from parsel import Selector

# NOTE: assumed WebScrapingAPI endpoint; confirm it against your provider's docs
API_ENDPOINT = "https://api.webscrapingapi.com/v1"

# create an API client instance
client = httpx.Client(timeout=180)


# create scrape function that returns HTML parser for a given URL
def scrape(url: str, country: str = "", render_js: bool = False, headers: dict = None) -> Selector:
    params = {
        "url": url,  # the page to scrape is passed as a parameter to the API
        "api_key": "YOUR API KEY",  # NOTE: add your API KEY here!
        "timeout": 60_000,
        "render_js": "1" if render_js else "0",
    }
    if country:
        params["country"] = country  # assumed parameter name for geo-targeting
    api_result = client.get(API_ENDPOINT, headers=headers, params=params)
    assert api_result.status_code == 200, api_result.reason_phrase
    return Selector(api_result.text)


url = "https://www.zillow.com/homedetails/1414-1416-20th-Ave-San-Francisco-CA-94122/332857311_zpid/"
# render_js=True mirrors the original request, which sent render_js=1
selector = scrape(url, render_js=True)

# The entire dataset can be found in a javascript variable:
data = selector.css("script#__NEXT_DATA__::text").get()
# gdpClientCache is a JSON-encoded string, so it is decoded a second time below
data = json.loads(data)["props"]["pageProps"]["componentProps"]["gdpClientCache"]
property_data = list(json.loads(data).values())[0]["property"]

# the resulting dataset is pretty big but here are some example fields:
pprint(property_data)
```

```
{
  "listingDataSource": "Phoenix",
  "zpid": 332857311,
  "city": "San Francisco",
  "state": "CA",
  "homeStatus": "FOR_SALE",
  "address": {
    "streetAddress": "1414-1416 20th Ave",
    "city": "San Francisco",
    "state": "CA",
    "zipcode": "94122",
    "neighborhood": null,
    "community": null,
    "subdivision": null
  },
  "bedrooms": 7,
  "bathrooms": 3,
  "price": 1695000,
  "yearBuilt": 1924,
  "streetAddress": "1414-1416 20th Ave",
  "zipcode": "94122",
  # ...
  # and much more
  # ...
}
```
To scrape Zillow.com, we retrieve the page HTML and extract the property dataset from a hidden JSON variable. Because Zillow.com uses Next.js, this variable is available in the `__NEXT_DATA__` script element.
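The extraction step can also be sketched with the standard library alone, which makes it easy to test offline. This is an illustration under assumptions: `NextDataParser` and `extract_property` are hypothetical helper names, the sample HTML is a hand-written stand-in rather than a real Zillow response, and the key path (`props`, `pageProps`, `componentProps`, `gdpClientCache`) is the one used in the example above and may change whenever Zillow updates its front end.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_property(html: str) -> dict:
    """Return the first property record embedded in a page's __NEXT_DATA__."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    if not raw:
        raise ValueError("no __NEXT_DATA__ script found; the page may be a block page")
    cache = json.loads(raw)["props"]["pageProps"]["componentProps"]["gdpClientCache"]
    # gdpClientCache is a JSON-encoded string, so it is decoded a second time
    if isinstance(cache, str):
        cache = json.loads(cache)
    # the cache maps query keys to results; take the first result's property
    return next(iter(cache.values()))["property"]


# hand-written sample that mimics the structure, NOT real Zillow output
inner = json.dumps({"query-1": {"property": {"zpid": 332857311, "price": 1695000}}})
outer = json.dumps({"props": {"pageProps": {"componentProps": {"gdpClientCache": inner}}}})
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    f"{outer}</script></body></html>"
)

print(extract_property(sample_html)["price"])  # prints 1695000
```

Raising an explicit error when the script element is missing helps distinguish an anti-bot block page from a genuine change in Zillow's page structure.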
Why scrape Zillow Listings?
Zillow is a popular web scraping target as it has a large amount of real estate data, from listing information to market trends and metadata.
Through lead scraping, Zillow data can be used to generate leads for real estate agents, property owners and investors.
As real estate is one of the biggest markets in the world, Zillow is an invaluable market research tool. It can be used to analyze market trends down to minute details like specific neighborhoods and property types.
Zillow.com is also often scraped by real estate agents and investors to monitor competition and adjust their product and pricing strategies.