How to Scrape a Shopify Store's Full Product Catalog
If you sell anything that competes with a Shopify-powered brand — a dropshipping store, a competitor DTC label, an agency client's space — the most valuable competitive intelligence asset is the competitor's full product catalog: what they sell, at what prices, in what variant SKUs, with what stock signals. This data is fully public on every Shopify store. There is no admin API access required, no scraping trick, no headless-login dance. The product detail is embedded as JSON-LD on every /products/<handle> page, and the collection listing on /collections/<handle> (or /collections/all) is cursor-paginated and walkable end-to-end. This guide covers the full catalog walk: identifying the collection URL, paging it, expanding each product, and feeding the result into a diffable snapshot for ongoing competitive tracking.
Why Shopify Catalogs Are the Cleanest Public Catalog Data on the Web
Most retail websites bury their product data behind a mix of inconsistent HTML, JavaScript-rendered prices, and shifting CSS classes. Scraping them requires per-site selectors that break every few months. Shopify is the opposite: because Shopify's default product template includes a schema.org Product JSON-LD block, most stores on the platform (indie brands, billion-dollar DTC labels, dropshippers, agency-built client stores) expose the same structured fields on the same canonical URL pattern. Heavily customized themes and headless builds can deviate, but they are the exception. That consistency is what makes catalog-scale Shopify scraping tractable.
The numbers are large enough to be material. Shopify reportedly powers more than four million active stores worldwide, and the proportion of mid-market DTC brands sitting on the platform is overwhelming. If a brand has been founded in the last decade and sells direct-to-consumer, the base-rate guess that they are on Shopify is right more often than wrong. Walk a competitor's catalog once and you have a ground-truth answer to merchandising questions that would otherwise require either insider access or hand-counting.
The other reason catalog-scraping is the right approach (vs paying for a third-party intelligence tool) is granularity. A subscription-based competitive-intel platform will tell you a brand has 280 SKUs and an average selling price of $74. The raw scrape tells you exactly which 280 SKUs, what each SKU costs, which sizes/colors are stocked, which variants are on sale at what discount depth, and which products were added or removed since last week. The detail is where the actionable merchandising decisions live, and the detail is what a generic dashboard tool flattens out.
What a Catalog Scrape Actually Returns
A walk of a Shopify storefront yields two layers. The collection-listing layer gives you the inventory of products in the store; the product-detail layer gives you everything else.
The collection page returns, per product:
| Field | Example |
|---|---|
| title | Cozy Knit Sweater |
| handle | cozy-knit-sweater |
| url | /products/cozy-knit-sweater |
| image | https://cdn.shopify.com/.../cozy-sweater.jpg |
| price_min | 89.00 |
| price_max | 89.00 |
| currency | USD |
| vendor | Brand Name |
The product page returns, per product (with one variants[] entry per SKU):
| Field | Example |
|---|---|
| title | Cozy Knit Sweater |
| description | Soft merino-wool knit, ribbed cuffs... |
| brand | Brand Name |
| product_type | Sweaters |
| tags | ["knit", "fall-2026", "best-seller"] |
| images | [url, url, url] |
| variants[].title | Small / Cream |
| variants[].sku | CKS-S-CRM |
| variants[].price | 89.00 |
| variants[].compare_at_price | 110.00 |
| variants[].availability | InStock |
| variants[].currency | USD |
| canonical_url | https://example-store.com/products/cozy-knit-sweater |
What it does not include: inventory counts, supplier identity, cost of goods, or any private order/customer data. None of that ever appears on a public Shopify storefront — those fields are admin-only and require merchant authentication, which is firmly off-limits.
A subtle but important detail on the variant data: Shopify exposes availability as a categorical state (typically InStock or OutOfStock), not a quantity. That means you can tell whether a SKU is currently sellable but not how many units are sitting in the warehouse. For competitive tracking this is usually the right granularity: knowing a variant went out of stock between Tuesday and Thursday is the actionable signal, and knowing it had 47 units on Tuesday and 0 on Thursday adds little for most downstream decisions. Stores that backorder or pre-order will surface those states through availability values like PreOrder and BackOrder when the merchant has wired them; treat those as in-stock for diff purposes unless your analysis explicitly cares about lead time.
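The treat-as-in-stock rule can be sketched as a small normalizer. This is a minimal sketch, assuming availability arrives either as a bare schema.org name (InStock) or as a full URL (https://schema.org/InStock). The function name and the set of sellable states are illustrative choices, not part of any API.

```python
# Hypothetical helper: collapse schema.org availability values into a
# sellable / not-sellable flag for week-over-week diffs.
# Assumption: pre-order and backorder count as sellable, per the rule above.
SELLABLE_STATES = {"instock", "preorder", "backorder"}

def is_sellable(availability):
    """Return True/False for a given state, or None when the value is missing."""
    if not availability:
        return None
    # Accept both "InStock" and "https://schema.org/InStock".
    name = availability.rstrip("/").rsplit("/", 1)[-1]
    return name.strip().lower() in SELLABLE_STATES

print(is_sellable("https://schema.org/PreOrder"))
print(is_sellable("OutOfStock"))
```

Applying this to both weeks before diffing keeps a move from InStock to PreOrder from registering as a stock-out. If lead time matters for your analysis, keep the raw value in a second column.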
Finding the Right Collection URL
Most Shopify stores expose their catalog under one of two collection URL forms:
- https://<store>/collections/all: the canonical "everything we sell" collection. Most stores have it.
- https://<store>/collections/<handle>: a specific category (/collections/sweaters, /collections/new-arrivals, /collections/sale).
For competitive intelligence, /collections/all is the starting point — it is the single URL that walks the entire catalog with one cursor. If the store has disabled the all collection (some merchants do, to push customers down curated category paths), the next step is to inspect the navigation menu and identify the largest top-level category. The collection URL is what appears in the address bar when you click that nav item.
Three example URLs to start from:
https://allbirds.com/collections/mens
https://skims.com/collections/all
https://gymshark.com/collections/womens-leggings
A quick sanity check: load the URL in a browser, view the page source, and search for "@type":"CollectionPage". If that string is in the HTML, the collection is JSON-LD enabled and the scraper will walk it. If it is not, the store is likely on Hydrogen, which the scraper handles automatically (it falls back to a browser-rendered fetch — no different from the caller's perspective).
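The view-source check can also be scripted. The sketch below assumes you already have the page HTML as a string (fetching is left out) and parses each JSON-LD block instead of searching for the exact string, so differences in whitespace or key order do not cause a false negative. The function name is hypothetical, and nested @graph structures are not handled.

```python
import json
import re

# Matches the body of every <script type="application/ld+json"> tag.
_LD_JSON = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)

def has_collection_page(html: str) -> bool:
    """True if any JSON-LD block in the page declares @type CollectionPage."""
    for raw in _LD_JSON.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the check
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            types = node.get("@type")
            types = types if isinstance(types, list) else [types]
            if "CollectionPage" in types:
                return True
    return False

sample = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "CollectionPage"}'
    "</script></head></html>"
)
print(has_collection_page(sample))
```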
A second tip on URL selection: stores that publish a /sitemap_products_1.xml (and most do) expose every product handle in plain XML. This is a useful cross-check once you have walked the catalog. If the number of unique product handles in your scrape matches the number of /products/ entries in the sitemap, the walk is complete. Count only <url> entries whose <loc> points at a product page, since the file can also list non-product URLs such as the homepage, and check /sitemap.xml for additional sitemap_products_N.xml files on larger stores. If your scrape is materially smaller, either the page cap was too low or the store is using a non-standard pagination model and the cursor walk stopped early. The sitemap is also the right starting point for any catalog larger than about 600 products, because that is where the cursor cap on a single collection walk lands. For very large catalogs, drive the product-detail endpoint directly off the sitemap rather than walking collections.
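The sitemap cross-check might look like the sketch below. It assumes the sitemap XML has already been downloaded as text and that product URLs end in /products/<handle>. The function names are hypothetical, and the sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_handles(xml_text: str) -> set[str]:
    """Extract product handles from a sitemap, ignoring non-product URLs."""
    root = ET.fromstring(xml_text)
    handles = set()
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        parts = urlparse((loc.text or "").strip()).path.strip("/").split("/")
        # Keep only .../products/<handle>; a locale prefix such as /en-ca/ is tolerated.
        if len(parts) >= 2 and parts[-2] == "products":
            handles.add(parts[-1])
    return handles

def missing_from_scrape(xml_text: str, scraped_handles) -> set[str]:
    """Handles listed in the sitemap that the collection walk did not return."""
    return sitemap_handles(xml_text) - set(scraped_handles)

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example-store.com/</loc></url>
  <url><loc>https://example-store.com/products/cozy-knit-sweater</loc></url>
  <url><loc>https://example-store.com/products/wool-beanie</loc></url>
</urlset>"""
print(missing_from_scrape(sample, ["cozy-knit-sweater"]))  # {'wool-beanie'}
```

An empty result means the walk covered everything the sitemap lists. A non-empty one gives you the exact handles to expand directly through the product endpoint.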
The API Call
Every LogPose endpoint is asynchronous: submit a job, poll until done, fetch the result. Submit with curl first to confirm the URL parses and the collection walk starts:
# 1. Submit the collection walk
curl -G "https://api.logposervices.com/api/v1/ecommerce/shopify/search" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "url=https://allbirds.com/collections/mens" \
  --data-urlencode "pages=10"
# → {"job_id": "shp_8f3a..."}

# 2. Wait for the job to finish
curl -H "X-API-Key: lp_xxxxxxx" \
  "https://api.logposervices.com/api/v1/jobs/shp_8f3a?wait=true&timeout=120"

# 3. Fetch the result
curl -H "X-API-Key: lp_xxxxxxx" \
  "https://api.logposervices.com/api/v1/jobs/shp_8f3a/result"
Shopify collection pages return roughly 12 products per page, so pages=10 covers about 120 products. The pages parameter is a cap — if the collection is smaller than the cap, the walk stops when the cursor runs out. Most 10-page jobs finish in 30-90 seconds; Hydrogen stores can stretch toward 2 minutes.
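The page arithmetic is easy to get wrong by one, so here is a minimal sketch of it. The 12-per-page figure is the approximate number quoted above, and the 50-page ceiling is an assumption inferred from the roughly 600-product limit mentioned in this guide, not a documented constant. Both names are illustrative.

```python
import math

PRODUCTS_PER_PAGE = 12  # approximate figure quoted above
MAX_PAGES = 50          # assumption: inferred from the ~600-product cap

def pages_for(expected_products: int) -> int:
    """Smallest pages value that should cover the catalog, clamped to the cap."""
    needed = math.ceil(expected_products / PRODUCTS_PER_PAGE)
    return max(1, min(needed, MAX_PAGES))

def fits_single_walk(expected_products: int) -> bool:
    """False means the catalog is too large for one collection walk."""
    return expected_products <= PRODUCTS_PER_PAGE * MAX_PAGES

print(pages_for(120))          # 10
print(pages_for(125))          # 11
print(fits_single_walk(2000))  # False
```

Because the per-page count is only approximate, padding the result by a page or two costs little: the walk stops on its own when the cursor runs out.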
For a single product's full variant breakdown, hit the product endpoint:
curl -G "https://api.logposervices.com/api/v1/ecommerce/shopify/product" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "url=https://allbirds.com/products/mens-wool-runners"
The Python Pipeline
This is the script most teams end up running on a cron. It takes one collection URL, walks the full catalog up to a page cap, expands each product to get the variant detail, and writes two CSVs — one indexed by product, one indexed by SKU. The SKU-level CSV is what unlocks every interesting downstream analysis (variant-pricing patterns, sale depth by category, in-stock breadth).
import csv
import os
import time
from urllib.parse import urljoin

import requests

API_KEY = os.environ["LOGPOSE_API_KEY"]
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}

PRODUCT_FIELDS = ["title", "handle", "url", "price_min", "price_max", "currency", "vendor"]
SKU_FIELDS = [
    "product_title", "product_handle", "variant_title", "sku",
    "price", "compare_at_price", "currency", "availability",
]


def submit_and_wait(path: str, params: dict, timeout_s: int = 180) -> dict:
    """Submit a job, poll until it completes, and return the result payload."""
    r = requests.get(f"{BASE}/{path}", params=params, headers=HEADERS, timeout=30)
    r.raise_for_status()
    job_id = r.json()["job_id"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        s = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS, timeout=15)
        s.raise_for_status()
        status = s.json()
        if status["status"] == "completed":
            break
        if status["status"] == "failed":
            raise RuntimeError(status.get("error", "unknown failure"))
        time.sleep(2)
    else:
        # The loop ran out of time without reaching `break`.
        raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")
    result = requests.get(f"{BASE}/jobs/{job_id}/result", headers=HEADERS, timeout=15)
    result.raise_for_status()
    return result.json()


def walk_catalog(collection_url: str, pages: int) -> list[dict]:
    data = submit_and_wait(
        "ecommerce/shopify/search",
        {"url": collection_url, "pages": pages},
    )
    return data["products"]


def expand_product(product_url: str) -> dict:
    return submit_and_wait("ecommerce/shopify/product", {"url": product_url})


def scrape_catalog_to_csv(collection_url: str, pages: int, out_prefix: str) -> tuple[int, int]:
    listing = walk_catalog(collection_url, pages)

    # Write the product-level snapshot first, so there is something usable
    # even if the product-expansion step gets rate-limited.
    with open(f"{out_prefix}_products.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=PRODUCT_FIELDS, extrasaction="ignore")
        w.writeheader()
        for p in listing:
            w.writerow(p)

    # Expand each product for the variant-level detail.
    sku_rows = []
    for p in listing:
        try:
            # Listing URLs can be relative (/products/<handle>); resolve them
            # against the collection URL. Absolute URLs pass through unchanged.
            detail = expand_product(urljoin(collection_url, p["url"]))
        except Exception as e:
            print(f"skipping {p.get('handle')}: {e}")
            continue
        for v in detail.get("variants", []):
            sku_rows.append({
                "product_title": detail.get("title", p.get("title")),
                "product_handle": p.get("handle"),
                "variant_title": v.get("title"),
                "sku": v.get("sku"),
                "price": v.get("price"),
                "compare_at_price": v.get("compare_at_price"),
                "currency": v.get("currency"),
                "availability": v.get("availability"),
            })

    # Fixed field names keep the header valid even when no variants came back.
    with open(f"{out_prefix}_skus.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=SKU_FIELDS)
        w.writeheader()
        w.writerows(sku_rows)

    return len(listing), len(sku_rows)


if __name__ == "__main__":
    n_products, n_skus = scrape_catalog_to_csv(
        "https://allbirds.com/collections/mens",
        pages=20,
        out_prefix="allbirds_mens",
    )
    print(f"wrote {n_products} products, {n_skus} variants")
Run that once and the two CSVs answer the obvious questions: how many SKUs are in the catalog, what is the price distribution by variant, which variants are sold out, and how many variants carry a compare_at_price above their price (i.e., are on sale).
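As one example of reading the SKU-level CSV, the sketch below computes discount depth per variant. It assumes the column names written by the pipeline (sku, price, compare_at_price) and treats a variant as on sale only when compare_at_price is strictly above price. The helper names are hypothetical and the sample rows are made up.

```python
import csv
import io

def discount_depth(price, compare_at_price):
    """Fractional discount (0.19 = 19% off), or None when the SKU is not on sale."""
    try:
        price, compare = float(price), float(compare_at_price)
    except (TypeError, ValueError):
        return None  # missing or non-numeric values
    # Also rejects NaN, since every comparison against NaN is False.
    if not (compare > price >= 0):
        return None
    return round((compare - price) / compare, 4)

def on_sale_rows(csv_file):
    """Yield (sku, depth) for every row of a SKU-level CSV that is on sale."""
    for row in csv.DictReader(csv_file):
        depth = discount_depth(row.get("price"), row.get("compare_at_price"))
        if depth is not None:
            yield row.get("sku"), depth

sample = io.StringIO(
    "sku,price,compare_at_price\n"
    "CKS-S-CRM,89.00,110.00\n"
    "CKS-M-CRM,89.00,\n"
)
print(list(on_sale_rows(sample)))  # [('CKS-S-CRM', 0.1909)]
```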
Turning the Snapshot into Competitive Intelligence
The raw scrape is a one-shot snapshot. The real value comes from running the same scrape weekly and diffing the result against the previous run. Four signals are worth wiring alerts on:
import pandas as pd

def load_skus(path):
    # Read SKUs as strings so numeric-looking codes match across weeks.
    # Rows without a SKU cannot be matched, and duplicate SKUs would
    # multiply rows in the merge below.
    df = pd.read_csv(path, dtype={"sku": str})
    return df.dropna(subset=["sku"]).drop_duplicates(subset="sku")

this_week = load_skus("allbirds_mens_skus.csv")
last_week = load_skus("allbirds_mens_skus_last_week.csv")

# 1. New SKUs this week: competitor launched a product or expanded a variant
new_skus = set(this_week["sku"]) - set(last_week["sku"])
print(f"new SKUs: {len(new_skus)}")

# 2. Discontinued SKUs: variant disappeared from the catalog entirely
gone_skus = set(last_week["sku"]) - set(this_week["sku"])
print(f"discontinued SKUs: {len(gone_skus)}")

# 3. Price changes: same SKU, different price (rows missing a price are skipped)
joined = this_week.merge(last_week, on="sku", suffixes=("_now", "_then"))
priced = joined.dropna(subset=["price_now", "price_then"])
price_changes = priced[priced["price_now"] != priced["price_then"]]
print(f"price changes: {len(price_changes)}")
print(price_changes[["sku", "price_then", "price_now"]].head(20))

# 4. Stock signals: was in-stock, now out-of-stock
went_oos = joined[
    (joined["availability_then"] == "InStock")
    & (joined["availability_now"] == "OutOfStock")
]
print(f"newly out-of-stock: {len(went_oos)}")
Of those four signals, the out-of-stock transition is the most operationally interesting one. A competitor consistently running out of a SKU is a demand signal you can act on — if your catalog covers a substitute, that is the moment to bid harder on the related search terms. Pair this with an ad-keyword tracker and you have a quietly powerful competitive-intelligence loop.
The second-most-useful signal is the new-SKU set. Tracked over six to twelve weeks, the cadence at which a competitor introduces new SKUs (and the categories those SKUs land in) reveals their merchandising strategy more clearly than any earnings call or interview ever would. A brand that adds two SKUs a week to outerwear and zero SKUs to bottoms is telling you where they think the demand is. Aggregated across an entire competitive set — five to ten brands over a quarter — these timelines tell you which subcategories the smartest operators in the space are betting on.
Price changes are usually the least interesting signal once you look at the data. DTC brands tend to hold prices constant outside of explicit sale events, which means the price-changes diff is mostly empty most weeks and then explodes during Black Friday, end-of-season clearance, or coordinated platform promotions. Treat the price-change diff as an event detector rather than a continuous signal: when it lights up, ask why.
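Treating the price diff as an event detector can be as simple as thresholding the share of SKUs that changed. The sketch below is one way to do it. The 10% threshold is an arbitrary starting point rather than a recommendation from any data, and the weekly counts are invented for illustration.

```python
def flag_price_events(weekly_counts, threshold=0.10):
    """Return the labels of weeks where the changed share meets the threshold.

    weekly_counts: iterable of (label, changed_skus, tracked_skus).
    """
    flagged = []
    for label, changed, tracked in weekly_counts:
        if tracked > 0 and changed / tracked >= threshold:
            flagged.append(label)
    return flagged

# Illustrative history: two quiet weeks, then a sale event.
history = [
    ("2026-W45", 3, 280),
    ("2026-W46", 1, 282),
    ("2026-W47", 140, 281),
]
print(flag_price_events(history))  # ['2026-W47']
```

Feed it len(price_changes) and len(joined) from each weekly diff. A flagged week is the cue to check what the brand was promoting.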
Scaling to Multi-Brand Tracking
One competitor catalog is interesting. Twenty competitor catalogs is a research dataset. To scale beyond a single store, submit the entire list of collection URLs as a bulk request and let the platform schedule them against the proxy pool:
import os

import requests

resp = requests.post(
    "https://api.logposervices.com/api/v1/ecommerce/shopify/search/bulk",
    headers={"X-API-Key": os.environ["LOGPOSE_API_KEY"]},
    json={
        "targets": [
            {"url": "https://allbirds.com/collections/mens", "pages": 20},
            {"url": "https://skims.com/collections/all", "pages": 20},
            {"url": "https://gymshark.com/collections/womens-leggings", "pages": 10},
            {"url": "https://outdoorvoices.com/collections/all", "pages": 20},
        ],
    },
    timeout=30,
)
resp.raise_for_status()
# Keep the response: it carries whatever job identifiers you need for polling.
print(resp.json())
Bulk runs in parallel up to your concurrency cap, which cuts a 20-brand walk from 30+ minutes sequential to 4-6 minutes wall-clock. For weekly competitive tracking on a fixed set of brands, schedule this bulk call on a cron and pipe the result into the diff loop above. That is the full DTC-competitive-intelligence stack in roughly 100 lines of Python.
A practical scheduling note: run the weekly walk at the same day-and-hour every week. Brands time their merchandising drops on predictable cycles — Tuesday morning new-arrivals, Friday morning sale launches, Sunday-night soft-restocks. If you scrape at 2am UTC every Monday, you will consistently catch the previous week's full activity in a single snapshot, and the week-over-week diffs become directly comparable. Scraping at variable times produces diffs cluttered with intraday noise that has nothing to do with merchandising decisions.
For agencies running this across multiple clients, the natural extension is to keep one collection-URL list per client and produce a per-client weekly digest. The output format that tends to land best with non-technical brand stakeholders is a one-page summary per competitor: new SKUs added, items currently on sale (with discount depth), items that went out of stock since the last report. Three or four bullet points per brand is usually the right density — denser than that and the brand team stops reading; sparser than that and the report feels thin.
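A per-competitor digest in that shape can be generated straight from the diff outputs. The sketch below is one possible format, not a prescribed one. The function name, the section labels, and the three-item cutoff are all illustrative choices.

```python
def brand_digest(brand, new_skus, on_sale, went_oos, max_items=3):
    """Build a short plain-text summary for one competitor."""
    lines = [f"{brand}: weekly catalog summary"]
    sections = [
        ("New SKUs", new_skus),
        ("On sale", on_sale),
        ("Newly out of stock", went_oos),
    ]
    for label, items in sections:
        items = sorted(items)  # stable order, also accepts sets
        shown = ", ".join(items[:max_items]) or "none"
        extra = f" (+{len(items) - max_items} more)" if len(items) > max_items else ""
        lines.append(f"- {label} ({len(items)}): {shown}{extra}")
    return "\n".join(lines)

print(brand_digest(
    "Example Brand",
    new_skus={"CKS-L-CRM", "CKS-XL-CRM"},
    on_sale=[],
    went_oos=["A-1", "A-2", "A-3", "A-4"],
))
```

Capping each line at a few SKUs keeps the summary to one screen. The full lists can go in an attached CSV for anyone who wants the detail.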
For the broader pattern — choosing what to track and how to surface it to a non-technical team — see the LogPose write-up on competitor price monitoring, which covers dashboarding and alert routing in more depth than this catalog-specific tutorial.
Legality and Ethics
Public Shopify storefronts are indexed by Google, Bing, comparison shopping engines, and countless aggregator sites. The data (product names, prices, stock state, images) is published for anyone to view. In the US, the Ninth Circuit's rulings in hiQ v. LinkedIn held that scraping publicly accessible data likely does not violate the CFAA, although contract claims based on a site's terms of use can still apply, as hiQ itself later found. GDPR governs personal data, which a product catalog generally does not contain. Shopify's own terms of service govern the merchant relationship and the admin/app APIs, not third-party access to public storefronts. The scrape itself is rarely the risky step.
Where you should still tread carefully: do not republish the brand's product imagery as your own, do not represent the catalog as if it were yours, and do not use the data to directly impersonate the brand in ad copy or marketplaces. Competitive intelligence — informing your own pricing, merchandising, and stock decisions — is squarely on-side. Direct copying of product copy or imagery is a copyright issue regardless of how the data was obtained.
For dropshippers in particular, there is one specific trap worth naming: the temptation to copy-paste a competitor's full product page (title, description, images) onto your own storefront. That is unambiguously a copyright violation on the description text and the imagery, regardless of whether the source page was public. The legitimate use of catalog scrape data for a dropshipping operation is identifying winning product categories and pricing bands, not reusing copy. Write your own descriptions, source your own imagery from the supplier, and the data informs your strategy without exposing you to a DMCA takedown a year in.
Common Mistakes
- Pointing at the wrong URL form. The search endpoint expects /collections/..., the product endpoint expects /products/.... Pasting a product URL into search or a homepage URL into either will return a 400. The path-fragment check is strict on purpose.
- Skipping the compare_at_price field. The most useful pricing signal in a catalog scrape is whether compare_at_price > price, which is the merchandising tell that the SKU is on sale. Many teams scrape the price field and miss the discount layer entirely.
- Using a single proxy across a 20-store walk. If you are doing this manually with requests and one residential proxy, you will get rate-limited on the third store. The managed flow distributes calls across a residential pool by default; this only matters if you are reimplementing the scrape yourself.
- Treating pages=50 as a guarantee of 600 rows. It is a cap. Small collections return fewer pages, and the cursor walk stops naturally when the next-page link disappears. Always check len(products) against your assumption.
- Ignoring the Cloudflare 100-second edge timeout. Submit the job, then poll. Do not expect a synchronous result on a 20-page catalog walk. The async pattern in the Python script above is the correct shape.
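The first mistake in the list is cheap to catch before a request is ever sent. The sketch below is a client-side pre-check, not the API's own validation logic. The function name is hypothetical, and whether the API accepts a product URL nested under a collection path is not confirmed here.

```python
from urllib.parse import urlparse

def endpoint_for(url: str) -> str:
    """Return 'search' or 'product' for a storefront URL, or raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    segments = [s for s in parsed.path.split("/") if s]
    # /collections/<c>/products/<handle> is still a product page, so test
    # for "products" first. The marker must be followed by a handle.
    if "products" in segments[:-1]:
        return "product"
    if "collections" in segments[:-1]:
        return "search"
    raise ValueError(f"expected a /collections/... or /products/... URL: {url!r}")

print(endpoint_for("https://allbirds.com/collections/mens"))              # search
print(endpoint_for("https://allbirds.com/products/mens-wool-runners"))    # product
```

Running every URL in a tracking list through a check like this at load time surfaces a pasted homepage link immediately instead of as a 400 halfway through a bulk run.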
Get Started
- Sign up at logposervices.com and generate an API key under Tool → API Keys.
- export LOGPOSE_API_KEY=lp_xxxxxxx
- Pick a competitor /collections/all URL and run the Python script above against it.
Related reading: How to monitor competitor prices across DTC stores for the dashboarding and alerting layer once you have the snapshots, Best Amazon scraper APIs in 2026 for the marketplace side of the competitive picture, and Apify alternatives for ecommerce scraping for the broader managed-vs-DIY trade-off.
External: schema.org Product specification, Shopify Dawn theme source, hiQ Labs v. LinkedIn.