Complete Guide to Shopify Data Extraction
Shopify powers over 4.8 million online stores. For price monitoring, market research, lead generation, and competitive analysis, extracting product data, pricing, and store information from Shopify stores is a common requirement. But Shopify sites use JavaScript rendering, anti-bot protections, and dynamic product loading that make simple HTTP scraping unreliable.
This guide covers how to extract data from Shopify stores at scale -- including products, prices, reviews, collections, and store metadata. You'll learn which approaches work, which don't, and how to build a production-ready Shopify extraction pipeline.
Key Takeaways
- Shopify's free JSON product API (/products.json) works on many stores but is increasingly disabled or rate-limited
- JavaScript rendering is required for most modern Shopify themes, so headless browsers or rendering APIs are essential
- SearchHive's ScrapeForge API handles Shopify's JS rendering, CAPTCHA challenges, and proxy rotation
- Rate limiting is critical -- Shopify stores behind Cloudflare will block aggressive scraping
- Store-level metadata is accessible via Shopify's community JSON endpoint for store lookup
Understanding Shopify's Data Structure
Every Shopify store has a predictable URL structure that makes data extraction straightforward once you understand the patterns:
Store homepage: https://store.myshopify.com/
Product listing: https://store.myshopify.com/collections/all
Product page: https://store.myshopify.com/products/slug
Product JSON (legacy): https://store.myshopify.com/products/slug.json
Collection JSON (legacy): https://store.myshopify.com/collections/all.json
Product catalog JSON (paginated): https://store.myshopify.com/products.json?limit=250&page=1
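As a rough sketch, the patterns above can be generated from a store's base URL and a product handle. The helper name `shopify_urls` and its dictionary keys are illustrative choices for this guide, not part of any Shopify API.

```python
from urllib.parse import urlencode

def shopify_urls(store_url, handle, page=1, limit=250):
    """Build the common Shopify URL patterns for one store (hypothetical helper)."""
    base = store_url.rstrip("/")  # Avoid double slashes when joining paths
    query = urlencode({"limit": limit, "page": page})
    return {
        "homepage": f"{base}/",
        "listing": f"{base}/collections/all",
        "product": f"{base}/products/{handle}",
        "product_json": f"{base}/products/{handle}.json",
        "catalog_json": f"{base}/products.json?{query}",
    }

urls = shopify_urls("https://store.myshopify.com/", "slug")
print(urls["product_json"])
```

Keeping URL construction in one place makes it easier to switch between the JSON endpoints and the rendered pages later in the pipeline.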
Approach 1: Shopify JSON API (When Available)
Some older Shopify stores expose a public JSON API at /products.json and /collections.json. This is the fastest extraction method when it works.
import requests
import time

def extract_shopify_products_json(store_url, max_pages=10, max_retries=3):
    """Extract products via Shopify JSON API (when available)."""
    all_products = []
    for page in range(1, max_pages + 1):
        # Retry the same page when rate limited instead of skipping it
        for _ in range(max_retries):
            response = requests.get(
                f"{store_url}/products.json",
                params={"limit": 250, "page": page},
                timeout=30,
            )
            if response.status_code != 429:
                break
            print("Rate limited, waiting 60s...")
            time.sleep(60)
        if response.status_code == 404:
            print("JSON API not available on this store")
            return None
        # Surface persistent 429s and other HTTP errors instead of parsing them
        response.raise_for_status()
        products = response.json().get("products", [])
        if not products:
            break  # An empty page means the catalog is exhausted
        for p in products:
            all_products.append({
                "title": p.get("title"),
                "handle": p.get("handle"),
                "vendor": p.get("vendor"),
                "product_type": p.get("product_type"),
                "tags": p.get("tags", []),
                "published_at": p.get("published_at"),
                "variants": [
                    {
                        "price": v.get("price"),
                        "compare_at_price": v.get("compare_at_price"),
                        "sku": v.get("sku"),
                        "available": v.get("available"),
                        "grams": v.get("grams"),
                    }
                    for v in p.get("variants", [])
                ],
            })
        print(f"Page {page}: extracted {len(products)} products")
        time.sleep(1)  # Pause between pages
    return all_products

# Example usage
products = extract_shopify_products_json("https://example-store.myshopify.com")
if products:
    print(f"Total products: {len(products)}")
Limitation: Many stores disable this endpoint or put it behind authentication. If you get a 404, move to Approach 2.
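The fallback rule can be written down as a small decision function. This is only a sketch: the action labels are made up for illustration, and treating 401 and 403 as "behind authentication" is an assumption rather than documented Shopify behavior.

```python
def next_step(status_code):
    """Map a /products.json status code to a next action (illustrative labels)."""
    if status_code == 200:
        return "use_json_api"
    if status_code == 429:
        return "wait_and_retry"
    # Assumption: 401/403 indicate the endpoint is locked down, 404 that it is disabled
    if status_code in (401, 403, 404):
        return "fall_back_to_scraping"
    return "inspect_manually"

for code in (200, 404, 429):
    print(code, next_step(code))
```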
Approach 2: Web Scraping with SearchHive
When the JSON API is unavailable, scrape the rendered HTML. SearchHive's ScrapeForge handles JavaScript rendering, which is required for most modern Shopify themes (Dawn, Sense, etc.).
import re
import time

import requests

API_KEY = "your-searchhive-key"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def scrape_shopify_product(store_url, product_handle):
    """Scrape a Shopify product page with JS rendering."""
    response = requests.post(
        "https://api.searchhive.dev/v1/scrapeforge",
        headers=HEADERS,
        json={
            "url": f"{store_url}/products/{product_handle}",
            "format": "markdown",
            "render_js": True
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["content"]

def discover_products(store_url):
    """Discover product URLs from a Shopify collection page."""
    response = requests.post(
        "https://api.searchhive.dev/v1/scrapeforge",
        headers=HEADERS,
        json={
            "url": f"{store_url}/collections/all",
            "format": "html",
            "render_js": True
        },
        timeout=60,
    )
    response.raise_for_status()
    # Parse product links from the HTML
    # (In production, use BeautifulSoup or similar)
    html = response.json()["content"]
    products = re.findall(r'/products/([a-zA-Z0-9-]+)', html)
    # Deduplicate and sort so repeated runs process handles in the same order
    return sorted(set(products))

# Full pipeline
store = "https://example-store.com"
handles = discover_products(store)
print(f"Found {len(handles)} products")
for handle in handles[:5]:  # Start with a small batch
    content = scrape_shopify_product(store, handle)
    print(f"\n--- {handle} ---")
    print(content[:500])
    time.sleep(2)  # Respect rate limits
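The comment in `discover_products` recommends a real HTML parser over a regex in production. The sketch below uses Python's standard-library `html.parser` rather than BeautifulSoup, so it has no extra dependency; the class and function names are illustrative. Unlike the regex, it only looks at `<a href>` attributes, so image paths and inline scripts that happen to contain `/products/` are ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ProductLinkParser(HTMLParser):
    """Collect product handles from <a href=".../products/<handle>"> links."""

    def __init__(self):
        super().__init__()
        self.handles = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # urlparse drops query strings and fragments from the path
        parts = [p for p in urlparse(href).path.split("/") if p]
        # Matches /products/<handle> and /collections/<name>/products/<handle>
        if "products" in parts[:-1]:
            self.handles.add(parts[parts.index("products") + 1])

def extract_handles(html):
    """Return the sorted, de-duplicated product handles linked from a page."""
    parser = ProductLinkParser()
    parser.feed(html)
    return sorted(parser.handles)

sample = (
    '<a href="/products/blue-shirt?variant=1">A</a>'
    '<a href="/collections/all/products/red-hat">B</a>'
    '<a href="/collections/all">C</a>'
    '<img src="/products/not-a-link.jpg">'
)
print(extract_handles(sample))
```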
Approach 3: Discovery + Bulk Extraction
For monitoring multiple Shopify stores, combine SwiftSearch (to find stores) with ScrapeForge (to extract data).
import time
from urllib.parse import urlparse

import requests

API_KEY = "your-searchhive-key"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Find Shopify stores in a niche
search = requests.get(
    "https://api.searchhive.dev/v1/swiftsearch",
    headers=HEADERS,
    params={
        "query": "site:myshopify.com organic skincare products",
        "num_results": 20
    },
    timeout=30,
)
search.raise_for_status()

stores = []
for r in search.json()["results"]:
    url = r["url"]
    # Match on the hostname: a substring test for "shopify.com" would also
    # accept Shopify's own pages, which are not stores
    host = urlparse(url).hostname or ""
    if host.endswith(".myshopify.com"):
        stores.append(url)
print(f"Found {len(stores)} Shopify stores")

# Extract data from each store
for store_url in stores[:3]:
    print(f"\nExtracting from {store_url}...")
    try:
        scrape = requests.post(
            "https://api.searchhive.dev/v1/scrapeforge",
            headers=HEADERS,
            json={
                "url": store_url,
                "format": "markdown",
                "render_js": True
            },
            timeout=60,
        )
        if scrape.status_code == 200:
            print(scrape.json()["content"][:300])
        else:
            print(f"Failed: {scrape.status_code}")
    except requests.RequestException as e:
        print(f"Error: {e}")
    time.sleep(3)  # Pause between stores
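Search results may point at individual product or collection pages, and several results can belong to the same store. One way to handle this, sketched here with a hypothetical `store_roots` helper, is to reduce each result to its `*.myshopify.com` root and drop duplicates before extraction.

```python
from urllib.parse import urlparse

def store_roots(urls):
    """Reduce result URLs to unique https://<shop>.myshopify.com roots, in order."""
    roots = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Only *.myshopify.com hosts count; shopify.com itself is not a store
        if host.endswith(".myshopify.com"):
            root = f"https://{host}"
            if root not in roots:
                roots.append(root)
    return roots

results = [
    "https://a.myshopify.com/products/x",
    "https://a.myshopify.com/collections/all",
    "https://www.shopify.com/blog",
    "http://B.myshopify.com/",
]
print(store_roots(results))
```

Stores on custom domains will not match this filter; detecting those needs a different signal and is outside this sketch.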
Handling Anti-Bot Protections
Many Shopify stores sit behind Cloudflare or Shopify's own bot protection. Key strategies:
- Rotate user agents: Use realistic browser user agent strings
- Add delays: 2-5 seconds between requests minimum
- Use residential proxies: Datacenter IPs get flagged faster
- Respect robots.txt: Check the store's robots.txt before scraping
- Start small: Begin with 5-10 pages, monitor response codes, scale gradually
SearchHive handles proxy rotation and anti-bot detection at the API level, so you don't need to manage proxies yourself.
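If you do manage request pacing yourself, the delay advice above could look roughly like this. Both helper names and the default values (a 2-5 second random pause, a 60 second backoff cap) are assumptions chosen to match the figures in this guide, not settings required by Shopify or Cloudflare.

```python
import random

def polite_delay(min_s=2.0, max_s=5.0):
    """Pick a random pause in seconds so requests are not evenly spaced."""
    return random.uniform(min_s, max_s)

def backoff_delay(attempt, base=2.0, cap=60.0):
    """Exponential backoff after a block or 429: base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))

# Typical use: time.sleep(polite_delay()) between normal requests,
# time.sleep(backoff_delay(n)) after the n-th consecutive failure.
print([backoff_delay(n) for n in range(6)])
```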
Data Schema for Shopify Products
from pydantic import BaseModel
from typing import Optional, List

class ShopifyVariant(BaseModel):
    sku: Optional[str] = None
    price: float
    compare_at_price: Optional[float] = None
    available: bool = True
    weight_grams: Optional[int] = None

class ShopifyProduct(BaseModel):
    title: str
    handle: str
    vendor: Optional[str] = None
    product_type: Optional[str] = None
    tags: List[str] = []
    price: float  # Lowest variant price
    url: str
    image_url: Optional[str] = None
    variants: List[ShopifyVariant] = []

def normalize_product(raw_data, store_url):
    """Normalize scraped data into a consistent schema."""
    variants = [
        ShopifyVariant(
            sku=v.get("sku"),
            # Prices arrive as strings; convert so min() compares numbers
            price=float(v["price"]),
            compare_at_price=v.get("compare_at_price") or None,
            available=v.get("available", True),
            weight_grams=v.get("grams"),  # Source field is named "grams"
        )
        for v in raw_data.get("variants", [])
        if v.get("price")
    ]
    prices = [v.price for v in variants]
    image = raw_data.get("image") or {}  # Tolerate a missing or null image
    return ShopifyProduct(
        title=raw_data["title"],
        handle=raw_data["handle"],
        vendor=raw_data.get("vendor"),
        product_type=raw_data.get("product_type"),
        tags=raw_data.get("tags", []),
        price=min(prices) if prices else 0.0,
        url=f"{store_url}/products/{raw_data['handle']}",
        image_url=image.get("src"),
        variants=variants,
    )
Common Use Cases
| Use Case | Data Points | Frequency | Recommended Tool |
|---|---|---|---|
| Price monitoring | Product prices, sale prices, stock status | Daily | SearchHive ScrapeForge |
| Market research | Product catalogs, categories, vendors | Weekly | SearchHive SwiftSearch + ScrapeForge |
| Lead generation | Store owners, contact info | One-time | BuiltWith + SearchHive |
| Competitive analysis | Pricing, features, reviews, positioning | Weekly | SearchHive + manual analysis |
| Dropshipping research | Products, prices, images, descriptions | Daily | SearchHive ScrapeForge |
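For the price monitoring row, the daily job comes down to comparing today's snapshot with yesterday's. A minimal sketch, assuming each snapshot is a plain `{sku: price_string}` dictionary (a storage format chosen here for illustration):

```python
from decimal import Decimal

def diff_prices(previous, current):
    """Return (sku, old_price, new_price) for new SKUs and changed prices."""
    changes = []
    for sku, new_raw in current.items():
        new_price = Decimal(new_raw)
        old_raw = previous.get(sku)
        if old_raw is None:
            changes.append((sku, None, new_price))  # SKU not seen before
        elif Decimal(old_raw) != new_price:
            changes.append((sku, Decimal(old_raw), new_price))
    return changes

yesterday = {"A": "19.99", "B": "5.00"}
today = {"A": "17.99", "B": "5.0", "C": "9.50"}
for change in diff_prices(yesterday, today):
    print(change)
```

`Decimal` is used instead of `float` so that "5.00" and "5.0" compare as equal and currency values are not subject to binary rounding.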
Legal Considerations
- Shopify store data is generally publicly accessible, but scraping at high volume can violate a store's terms of service
- Personal data (customer reviews with names, emails) may be subject to GDPR/CCPA
- Use extracted data for competitive research and market analysis, not for reproducing copyrighted content
- Always respect robots.txt directives
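A robots.txt check can be automated with Python's standard-library `urllib.robotparser`. The sketch below parses rules that have already been downloaded, so it runs without network access; the `is_allowed` helper and the sample rules are illustrative and are not taken from any real store.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Illustrative rules only; fetch the real file from <store>/robots.txt
rules = """
User-agent: *
Disallow: /checkout
Disallow: /cart
"""

print(is_allowed(rules, "https://store.myshopify.com/products/slug"))
print(is_allowed(rules, "https://store.myshopify.com/checkout"))
```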
Getting Started
Extract your first Shopify product in under 5 minutes with SearchHive's free tier. You get 500 credits with full access to ScrapeForge (with JS rendering) and SwiftSearch. No credit card required.
For ongoing price monitoring across 10+ stores, the Builder plan ($49/mo, 100K credits) handles daily scraping cycles comfortably. See the documentation for complete API references and code examples.
Related:
- /blog/complete-guide-to-automated-data-extraction for general data extraction techniques
- /compare/searchhive-vs-firecrawl for comparing scraping APIs for Shopify extraction
- /blog/best-scraping-behind-login-tools-2025 for extracting data from login-protected areas