
How to Scrape a Shopify Store's Full Product Catalog


If you compete with a Shopify-powered brand, whether you run a dropshipping store, a rival DTC label, or an agency client's account, the most valuable competitive-intelligence asset is that competitor's full product catalog: what they sell, at what prices, in which variant SKUs, with what stock signals. This data is public on every Shopify store. It requires no admin API access, no scraping trick, and no headless-login dance. The product detail is embedded as JSON-LD on every /products/<handle> page, and the collection listing on /collections/<handle> (or /collections/all) is cursor-paginated and walkable end-to-end. This guide covers the full catalog walk: identifying the collection URL, paging it, expanding each product, and feeding the result into a diffable snapshot for ongoing competitive tracking.

Why Shopify Catalogs Are the Cleanest Public Catalog Data on the Web

Most retail websites bury their product data behind a mix of inconsistent HTML, JavaScript-rendered prices, and shifting CSS classes. Scraping them requires per-site selectors that break every few months. Shopify is the opposite: because Shopify ships a default product template that includes a schema.org Product JSON-LD block, every store on the platform — independent indie brand, billion-dollar DTC label, dropshipper, agency-built client store — exposes the same structured fields on the same canonical URL pattern. That single fact is what makes catalog-scale Shopify scraping tractable.

The numbers are large enough to be material. Shopify reportedly powers more than four million active stores worldwide, and the proportion of mid-market DTC brands sitting on the platform is overwhelming. If a brand has been founded in the last decade and sells direct-to-consumer, the base-rate guess that they are on Shopify is right more often than wrong. Walk a competitor's catalog once and you have a ground-truth answer to merchandising questions that would otherwise require either insider access or hand-counting.

The other reason catalog-scraping is the right approach (vs paying for a third-party intelligence tool) is granularity. A subscription-based competitive-intel platform will tell you a brand has 280 SKUs and an average selling price of $74. The raw scrape tells you exactly which 280 SKUs, what each SKU costs, which sizes/colors are stocked, which variants are on sale at what discount depth, and which products were added or removed since last week. The detail is where the actionable merchandising decisions live, and the detail is what a generic dashboard tool flattens out.

What a Catalog Scrape Actually Returns

A walk of a Shopify storefront yields two layers. The collection-listing layer gives you the inventory of products in the store; the product-detail layer gives you everything else.

The collection page returns, per product:

| Field | Example |
| --- | --- |
| title | Cozy Knit Sweater |
| handle | cozy-knit-sweater |
| url | /products/cozy-knit-sweater |
| image | https://cdn.shopify.com/.../cozy-sweater.jpg |
| price_min | 89.00 |
| price_max | 89.00 |
| currency | USD |
| vendor | Brand Name |

The product page returns, per SKU:

| Field | Example |
| --- | --- |
| title | Cozy Knit Sweater |
| description | Soft merino-wool knit, ribbed cuffs... |
| brand | Brand Name |
| product_type | Sweaters |
| tags | ["knit", "fall-2026", "best-seller"] |
| images | [url, url, url] |
| variants[].title | Small / Cream |
| variants[].sku | CKS-S-CRM |
| variants[].price | 89.00 |
| variants[].compare_at_price | 110.00 |
| variants[].availability | InStock |
| variants[].currency | USD |
| canonical_url | https://example-store.com/products/cozy-knit-sweater |

What it does not include: inventory counts, supplier identity, cost of goods, or any private order/customer data. None of that ever appears on a public Shopify storefront — those fields are admin-only and require merchant authentication, which is firmly off-limits.
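For readers who want to see where the per-SKU rows come from, here is a minimal sketch of flattening one schema.org Product block into variant rows. It assumes the block has already been parsed into a dict and that it follows the usual schema.org shape (name, brand, and an offers list carrying sku, price, priceCurrency, and availability). The function name flatten_offers and the output column names are illustrative, not part of any API described in this article, and fields such as compare_at_price are not part of schema.org Offer, so they are left out.

```python
def flatten_offers(product: dict) -> list[dict]:
    """Turn one schema.org Product dict into one row per offer (variant)."""
    offers = product.get("offers", [])
    if isinstance(offers, dict):  # a single offer may be emitted as an object
        offers = [offers]

    brand = product.get("brand")
    if isinstance(brand, dict):  # brand is often {"@type": "Brand", "name": ...}
        brand = brand.get("name")

    rows = []
    for offer in offers:
        availability = offer.get("availability") or ""
        rows.append({
            "product_title": product.get("name"),
            "brand": brand,
            "sku": offer.get("sku"),
            "price": offer.get("price"),
            "currency": offer.get("priceCurrency"),
            # schema.org values are URLs such as https://schema.org/InStock;
            # keep only the trailing token.
            "availability": availability.rsplit("/", 1)[-1],
        })
    return rows


if __name__ == "__main__":
    sample = {
        "@type": "Product",
        "name": "Cozy Knit Sweater",
        "brand": {"@type": "Brand", "name": "Brand Name"},
        "offers": [
            {"sku": "CKS-S-CRM", "price": "89.00", "priceCurrency": "USD",
             "availability": "http://schema.org/InStock"},
        ],
    }
    print(flatten_offers(sample))
```

Normalizing the availability URL to its last path segment keeps the rows comparable across stores that emit http and https variants of the same schema.org value.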

A subtle but important detail on the variant data: Shopify exposes availability as a status flag (typically InStock or OutOfStock), not a quantity. That means you can tell whether a SKU is currently sellable but not how many units are sitting in the warehouse. For competitive tracking this is usually the right granularity: knowing a variant went out of stock between Tuesday and Thursday is the actionable signal, and knowing it had 47 units on Tuesday and 0 on Thursday rarely changes the downstream decision. Stores that backorder or pre-order surface those states through availability values like PreOrder and BackOrder when the merchant has wired them; treat those as in-stock for diff purposes unless your analysis explicitly cares about lead time.
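The "treat PreOrder and BackOrder as in-stock" rule can be made explicit with a small helper. This is a sketch under the assumption that availability arrives as one of the strings named above (optionally prefixed with a schema.org URL); the names SELLABLE, is_sellable, and stock_transition are hypothetical.

```python
# Availability values treated as "the customer can still buy this".
SELLABLE = {"InStock", "PreOrder", "BackOrder"}


def is_sellable(availability) -> bool:
    """True when the variant can be ordered; missing values count as not sellable."""
    if not availability:
        return False
    # Accept both "InStock" and "https://schema.org/InStock".
    return str(availability).rsplit("/", 1)[-1] in SELLABLE


def stock_transition(then, now) -> str:
    """Classify the week-over-week change for one SKU."""
    before, after = is_sellable(then), is_sellable(now)
    if before and not after:
        return "went_out_of_stock"
    if not before and after:
        return "restocked"
    return "unchanged"


if __name__ == "__main__":
    print(stock_transition("InStock", "OutOfStock"))
    print(stock_transition("InStock", "PreOrder"))
```

If lead time matters for your analysis, remove PreOrder and BackOrder from SELLABLE and track them as their own state instead.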

Finding the Right Collection URL

Every Shopify store exposes its full catalog under one of two collection URLs:

  1. https://<store>/collections/all — the canonical "everything we sell" collection. Most stores have it.
  2. https://<store>/collections/<handle> — a specific category (/collections/sweaters, /collections/new-arrivals, /collections/sale).

For competitive intelligence, /collections/all is the starting point — it is the single URL that walks the entire catalog with one cursor. If the store has disabled the all collection (some merchants do, to push customers down curated category paths), the next step is to inspect the navigation menu and identify the largest top-level category. The collection URL is what appears in the address bar when you click that nav item.

Three example URLs to start from:

https://allbirds.com/collections/mens
https://skims.com/collections/all
https://gymshark.com/collections/womens-leggings

A quick sanity check: load the URL in a browser, view the page source, and search for "@type":"CollectionPage". If that string is in the HTML, the collection is JSON-LD enabled and the scraper will walk it. If it is not, the store is likely on Hydrogen, which the scraper handles automatically (it falls back to a browser-rendered fetch — no different from the caller's perspective).
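The same sanity check can be scripted. The sketch below uses only the Python standard library to pull every JSON-LD script block out of a page's HTML and look for a given @type. It assumes you already have the page source as a string; the helper names (JsonLdExtractor, has_jsonld_type) are illustrative, and the handling of @graph wrappers and list-valued @type is a defensive guess rather than something every store needs.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects every parseable <script type="application/ld+json"> payload."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the page
            self._buffer = None


def _iter_nodes(block):
    """Yield every dict node, descending into lists and @graph wrappers."""
    if isinstance(block, list):
        for item in block:
            yield from _iter_nodes(item)
    elif isinstance(block, dict):
        yield block
        yield from _iter_nodes(block.get("@graph", []))


def has_jsonld_type(html: str, wanted: str) -> bool:
    """True if any JSON-LD node in the page declares the wanted @type."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        for node in _iter_nodes(block):
            types = node.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if wanted in types:
                return True
    return False


if __name__ == "__main__":
    page = ('<script type="application/ld+json">'
            '{"@context": "https://schema.org", "@type": "CollectionPage"}'
            "</script>")
    print(has_jsonld_type(page, "CollectionPage"))
```

Parsing the JSON rather than searching for the raw string avoids false negatives when a theme emits the block with different whitespace or key order.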

A second tip on URL selection: stores that publish a /sitemap_products_1.xml (and most do) expose every product handle in plain XML. This is a useful cross-check once you have walked the catalog — if the number of unique product handles in your scrape matches the number of <url> entries in the sitemap, the walk is complete. If it is materially smaller, either the page cap was too low or the store is using a non-standard pagination model and the cursor walk stopped early. The sitemap is also the right starting point for any catalog larger than 600 products, because that is where the cursor cap on a single collection walk lands — for very large catalogs, drive the product-detail endpoint directly off the sitemap rather than walking collections.
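The sitemap cross-check is easy to automate once you have the XML text in hand. The following sketch assumes the sitemap uses the standard sitemaps.org namespace and that product URLs contain a /products/<handle> path segment; handles_from_sitemap and coverage_report are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def handles_from_sitemap(xml_text) -> set:
    """Extract the product handle from every <url><loc> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    handles = set()
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        path = urlparse((loc.text or "").strip()).path
        parts = [p for p in path.split("/") if p]
        # Entries that are not product pages (for example the homepage) are skipped.
        if "products" in parts:
            idx = parts.index("products")
            if idx + 1 < len(parts):
                handles.add(parts[idx + 1])
    return handles


def coverage_report(scraped_handles, sitemap_handles: set) -> dict:
    """Compare the handles from a catalog walk against the sitemap."""
    scraped = set(scraped_handles)
    return {
        "sitemap_total": len(sitemap_handles),
        "scraped_total": len(scraped),
        "missing": sorted(sitemap_handles - scraped),  # in sitemap, not scraped
        "extra": sorted(scraped - sitemap_handles),    # scraped, not in sitemap
    }


if __name__ == "__main__":
    xml_text = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example-store.com/</loc></url>"
        "<url><loc>https://example-store.com/products/cozy-knit-sweater</loc></url>"
        "</urlset>"
    )
    print(coverage_report(["cozy-knit-sweater"], handles_from_sitemap(xml_text)))
```

A non-empty "missing" list is the signal that the page cap was too low or the walk stopped early. A non-empty "extra" list usually just means the sitemap is split across several numbered files.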

The API Call

Every LogPose endpoint is asynchronous: submit a job, poll until done, fetch the result. Submit with curl first to confirm the URL parses and the collection walk starts:

curl -G "https://api.logposervices.com/api/v1/ecommerce/shopify/search" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "url=https://allbirds.com/collections/mens" \
  --data-urlencode "pages=10"
# → {"job_id": "shp_8f3a..."}

curl -H "X-API-Key: lp_xxxxxxx" \
  "https://api.logposervices.com/api/v1/jobs/shp_8f3a?wait=true&timeout=120"

curl -H "X-API-Key: lp_xxxxxxx" \
  https://api.logposervices.com/api/v1/jobs/shp_8f3a/result

Shopify collection pages return roughly 12 products per page, so pages=10 covers about 120 products. The pages parameter is a cap — if the collection is smaller than the cap, the walk stops when the cursor runs out. Most 10-page jobs finish in 30-90 seconds; Hydrogen stores can stretch toward 2 minutes.
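The page arithmetic is worth wrapping in a helper so the cap is chosen deliberately rather than guessed. This sketch takes the roughly-12-per-page figure from above as its default and assumes a 50-page ceiling on a single walk, which is inferred from the 600-product limit mentioned earlier rather than from documented API behavior; both are parameters you should adjust.

```python
import math


def pages_for(product_count: int, per_page: int = 12, max_pages: int = 50) -> int:
    """Smallest pages value that covers product_count, clamped to the cap."""
    if product_count <= 0:
        return 0
    return min(math.ceil(product_count / per_page), max_pages)


def exceeds_single_walk(product_count: int, per_page: int = 12, max_pages: int = 50) -> bool:
    """True when the catalog is too large for one collection walk."""
    return product_count > per_page * max_pages


if __name__ == "__main__":
    for count in (120, 121, 1000):
        print(count, pages_for(count), exceeds_single_walk(count))
```

When exceeds_single_walk returns True, the sitemap-driven approach described earlier is the better starting point.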

For a single product's full variant breakdown, hit the product endpoint:

curl -G "https://api.logposervices.com/api/v1/ecommerce/shopify/product" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "url=https://allbirds.com/products/mens-wool-runners"

The Python Pipeline

This is the script most teams end up running on a cron. It takes one collection URL, walks the full catalog up to a page cap, expands each product to get the variant detail, and writes two CSVs — one indexed by product, one indexed by SKU. The SKU-level CSV is what unlocks every interesting downstream analysis (variant-pricing patterns, sale depth by category, in-stock breadth).

import csv
import os
import time
from urllib.parse import urljoin

import requests

API_KEY = os.environ["LOGPOSE_API_KEY"]
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}

PRODUCT_FIELDS = ["title", "handle", "url", "price_min", "price_max", "currency", "vendor"]
SKU_FIELDS = [
    "product_title", "product_handle", "variant_title", "sku",
    "price", "compare_at_price", "currency", "availability",
]


def submit_and_wait(path: str, params: dict, timeout_s: int = 180) -> dict:
    """Submit a job, poll until it completes, and return the result payload."""
    r = requests.get(f"{BASE}/{path}", params=params, headers=HEADERS, timeout=30)
    r.raise_for_status()
    job_id = r.json()["job_id"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        poll = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS, timeout=15)
        poll.raise_for_status()
        s = poll.json()
        if s["status"] == "completed":
            break
        if s["status"] == "failed":
            raise RuntimeError(s.get("error", "unknown failure"))
        time.sleep(2)
    else:
        # The loop ended without a break, so the deadline passed first.
        raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")
    result = requests.get(f"{BASE}/jobs/{job_id}/result", headers=HEADERS, timeout=15)
    result.raise_for_status()
    return result.json()


def walk_catalog(collection_url: str, pages: int) -> list[dict]:
    data = submit_and_wait(
        "ecommerce/shopify/search",
        {"url": collection_url, "pages": pages},
    )
    return data["products"]


def expand_product(product_url: str) -> dict:
    return submit_and_wait("ecommerce/shopify/product", {"url": product_url})


def scrape_catalog_to_csv(collection_url: str, pages: int, out_prefix: str) -> tuple[int, int]:
    listing = walk_catalog(collection_url, pages)

    # Write the product-level snapshot first: it gives you something usable
    # even if the product-expansion step gets rate-limited.
    with open(f"{out_prefix}_products.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=PRODUCT_FIELDS, extrasaction="ignore")
        w.writeheader()
        for p in listing:
            w.writerow(p)

    # Expand each product for the variant-level detail
    sku_rows = []
    for p in listing:
        # Listing URLs are relative (/products/<handle>); the product endpoint
        # expects a full URL, so resolve against the collection URL first.
        product_url = urljoin(collection_url, p["url"])
        try:
            detail = expand_product(product_url)
        except Exception as e:
            print(f"skipping {p.get('handle', product_url)}: {e}")
            continue
        for v in detail.get("variants", []):
            sku_rows.append({
                "product_title": detail.get("title", p.get("title")),
                "product_handle": p.get("handle"),
                "variant_title": v.get("title"),
                "sku": v.get("sku"),
                "price": v.get("price"),
                "compare_at_price": v.get("compare_at_price"),
                "currency": v.get("currency"),
                "availability": v.get("availability"),
            })

    # Fixed field names so the file is still valid (header only) when every
    # expansion failed and sku_rows is empty.
    with open(f"{out_prefix}_skus.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=SKU_FIELDS)
        w.writeheader()
        w.writerows(sku_rows)

    return len(listing), len(sku_rows)


if __name__ == "__main__":
    n_products, n_skus = scrape_catalog_to_csv(
        "https://allbirds.com/collections/mens",
        pages=20,
        out_prefix="allbirds_mens",
    )
    print(f"wrote {n_products} products, {n_skus} variants")

Run that once and the two CSVs answer the obvious questions: how many SKUs are in the catalog, what is the price distribution by variant, which variants are sold out, how many products carry a compare_at_price (i.e., are on sale).
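As one way to answer those questions without opening a spreadsheet, the sketch below summarizes SKU rows loaded with csv.DictReader. It assumes the column names written by the script above (price, compare_at_price, availability) and treats a SKU as on sale only when compare_at_price is strictly greater than price. The helper names discount_depth and summarize_skus are illustrative.

```python
import csv


def _to_float(value):
    """Parse a CSV cell into a float; blanks and junk become None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def discount_depth(price, compare_at_price):
    """Fractional discount, e.g. 0.1909 for 89.00 vs 110.00, or None if not on sale."""
    p, c = _to_float(price), _to_float(compare_at_price)
    if p is None or c is None or not (c > p >= 0):
        return None
    return round((c - p) / c, 4)


def summarize_skus(rows: list) -> dict:
    """Count SKUs, on-sale SKUs, and out-of-stock SKUs, plus average discount."""
    depths = []
    out_of_stock = 0
    for row in rows:
        depth = discount_depth(row.get("price"), row.get("compare_at_price"))
        if depth is not None:
            depths.append(depth)
        if row.get("availability") == "OutOfStock":
            out_of_stock += 1
    return {
        "skus": len(rows),
        "on_sale": len(depths),
        "out_of_stock": out_of_stock,
        "avg_discount": round(sum(depths) / len(depths), 4) if depths else None,
    }


def summarize_file(path: str) -> dict:
    """Convenience wrapper for a *_skus.csv file written by the pipeline."""
    with open(path, newline="", encoding="utf-8") as f:
        return summarize_skus(list(csv.DictReader(f)))
```

Calling summarize_file("allbirds_mens_skus.csv") after a run gives a four-number summary that is easy to log alongside each weekly snapshot.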

Turning the Snapshot into Competitive Intelligence

The raw scrape is a one-shot snapshot. The real value comes from running the same scrape weekly and diffing the result against the previous run. Four signals are worth wiring alerts on:

import pandas as pd

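# Read SKUs as strings and drop rows without one: merging on a blank SKU
# would pair every blank row with every other blank row.
this_week = pd.read_csv("allbirds_mens_skus.csv", dtype={"sku": str}).dropna(subset=["sku"])
last_week = pd.read_csv("allbirds_mens_skus_last_week.csv", dtype={"sku": str}).dropna(subset=["sku"])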

# 1. New SKUs this week — competitor launched product or expanded a variant
new_skus = set(this_week["sku"]) - set(last_week["sku"])
print(f"new SKUs: {len(new_skus)}")

# 2. Discontinued SKUs — variant disappeared from the catalog entirely
gone_skus = set(last_week["sku"]) - set(this_week["sku"])
print(f"discontinued SKUs: {len(gone_skus)}")

# 3. Price changes — same SKU, different price
joined = this_week.merge(last_week, on="sku", suffixes=("_now", "_then"))
price_changes = joined[joined["price_now"] != joined["price_then"]]
print(f"price changes: {len(price_changes)}")
print(price_changes[["sku", "price_then", "price_now"]].head(20))

# 4. Stock signals — was in-stock, now out-of-stock
went_oos = joined[
    (joined["availability_then"] == "InStock")
    & (joined["availability_now"] == "OutOfStock")
]
print(f"newly out-of-stock: {len(went_oos)}")

Of those four signals, the out-of-stock transition is the most operationally interesting one. A competitor consistently running out of a SKU is a demand signal you can act on — if your catalog covers a substitute, that is the moment to bid harder on the related search terms. Pair this with an ad-keyword tracker and you have a quietly powerful competitive-intelligence loop.

The second-most-useful signal is the new-SKU set. Tracked over six to twelve weeks, the cadence at which a competitor introduces new SKUs (and the categories those SKUs land in) reveals their merchandising strategy more clearly than any earnings call or interview ever would. A brand that adds two SKUs a week to outerwear and zero SKUs to bottoms is telling you where they think the demand is. Aggregated across an entire competitive set — five to ten brands over a quarter — these timelines tell you which subcategories the smartest operators in the space are betting on.

Price changes are usually the least interesting signal once you look at the data. DTC brands tend to hold prices constant outside of explicit sale events, which means the price-changes diff is mostly empty most weeks and then explodes during Black Friday, end-of-season clearance, or coordinated platform promotions. Treat the price-change diff as an event detector rather than a continuous signal: when it lights up, ask why.
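The weekly diff only works if last week's file is still on disk when this week's scrape finishes. A minimal rotation step, run just before each scrape, might look like the sketch below. The _last_week suffix matches the filename used in the diff code; the archive directory, the dated filename pattern, and the function name rotate_snapshot are assumptions for illustration.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def rotate_snapshot(current, archive_dir, today: Optional[date] = None) -> Optional[Path]:
    """Archive the current snapshot and rename it to <stem>_last_week<suffix>.

    Returns the path of the renamed file, or None on the first run when
    there is nothing to rotate yet.
    """
    current = Path(current)
    if not current.exists():
        return None

    today = today or date.today()
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Keep a dated copy so longer timelines can be rebuilt later.
    dated = archive_dir / f"{current.stem}_{today.isoformat()}{current.suffix}"
    shutil.copy2(current, dated)

    # Overwrite any older "last week" file with the snapshot being retired.
    previous = current.with_name(f"{current.stem}_last_week{current.suffix}")
    current.replace(previous)
    return previous


if __name__ == "__main__":
    rotated = rotate_snapshot("allbirds_mens_skus.csv", "snapshots")
    print("rotated to", rotated)
```

Keeping the dated copies is what makes the six-to-twelve-week cadence analysis possible; the two-file diff alone only ever sees one week of change.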

Scaling to Multi-Brand Tracking

One competitor catalog is interesting. Twenty competitor catalogs is a research dataset. To scale beyond a single store, submit the entire list of collection URLs as a bulk request and let the platform schedule them against the proxy pool:

import os

import requests

resp = requests.post(
    "https://api.logposervices.com/api/v1/ecommerce/shopify/search/bulk",
    headers={"X-API-Key": os.environ["LOGPOSE_API_KEY"]},
    json={
        "targets": [
            {"url": "https://allbirds.com/collections/mens", "pages": 20},
            {"url": "https://skims.com/collections/all", "pages": 20},
            {"url": "https://gymshark.com/collections/womens-leggings", "pages": 10},
            {"url": "https://outdoorvoices.com/collections/all", "pages": 20},
        ],
    },
    timeout=30,
)
# Fail loudly on a rejected submission instead of silently skipping a week.
resp.raise_for_status()

Bulk runs in parallel up to your concurrency cap, which cuts a 20-brand walk from 30+ minutes sequential to 4-6 minutes wall-clock. For weekly competitive tracking on a fixed set of brands, schedule this bulk call on a cron and pipe the result into the diff loop above. That is the full DTC-competitive-intelligence stack in roughly 100 lines of Python.

A practical scheduling note: run the weekly walk at the same day-and-hour every week. Brands time their merchandising drops on predictable cycles — Tuesday morning new-arrivals, Friday morning sale launches, Sunday-night soft-restocks. If you scrape at 2am UTC every Monday, you will consistently catch the previous week's full activity in a single snapshot, and the week-over-week diffs become directly comparable. Scraping at variable times produces diffs cluttered with intraday noise that has nothing to do with merchandising decisions.
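On a server whose clock is set to UTC, the Monday 2am slot corresponds to the cron expression `0 2 * * 1`. If the job is driven from Python instead, a sketch like the following computes the next run time. The function name next_weekly_run is hypothetical, and it expects a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now: datetime, weekday: int = 0, hour: int = 2) -> datetime:
    """Next occurrence of the given weekday (0 = Monday) and hour, in UTC.

    If `now` is exactly on the slot, the following week's slot is returned.
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    upcoming = next_weekly_run(datetime.now(timezone.utc))
    print("next scrape at", upcoming.isoformat())
```

Pinning the schedule to UTC rather than local time keeps the slot stable across daylight-saving changes, which is what makes consecutive diffs comparable.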

For agencies running this across multiple clients, the natural extension is to keep one collection-URL list per client and produce a per-client weekly digest. The output format that tends to land best with non-technical brand stakeholders is a one-page summary per competitor: new SKUs added, items currently on sale (with discount depth), items that went out of stock since the last report. Three or four bullet points per brand is usually the right density — denser than that and the brand team stops reading; sparser than that and the report feels thin.
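A per-competitor digest in that three-bullet shape can be generated directly from the diff outputs. The sketch below is one possible rendering: it assumes new and out-of-stock SKUs arrive as collections of SKU strings and on-sale items as (sku, discount_fraction) pairs. The function name brand_digest and the exact wording of each bullet are illustrative choices, not a prescribed format.

```python
def _preview(items: list, max_items: int) -> str:
    """Show the first few items and summarize the rest as '(+N more)'."""
    shown = ", ".join(items[:max_items])
    extra = len(items) - max_items
    return shown + (f" (+{extra} more)" if extra > 0 else "")


def brand_digest(brand, new_skus, on_sale, went_oos, max_items: int = 3) -> str:
    """Render a short plain-text summary for one competitor."""
    lines = [brand]
    if new_skus:
        lines.append(f"- New SKUs ({len(new_skus)}): {_preview(sorted(new_skus), max_items)}")
    if on_sale:
        # Deepest discounts first, since those are the ones stakeholders ask about.
        deepest = sorted(on_sale, key=lambda pair: pair[1], reverse=True)
        labels = [f"{sku} {depth:.0%} off" for sku, depth in deepest]
        lines.append(f"- On sale ({len(on_sale)}): {_preview(labels, max_items)}")
    if went_oos:
        lines.append(f"- Newly out of stock ({len(went_oos)}): {_preview(sorted(went_oos), max_items)}")
    if len(lines) == 1:
        lines.append("- No changes since the last report")
    return "\n".join(lines)


if __name__ == "__main__":
    print(brand_digest(
        "Brand Name",
        new_skus={"CKS-L-CRM", "CKS-XL-CRM"},
        on_sale=[("CKS-S-CRM", 0.19)],
        went_oos=["CKS-M-CRM"],
    ))
```

Capping each bullet with max_items keeps the digest at the density described above even in a week when a competitor launches dozens of SKUs.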

For the broader pattern — choosing what to track and how to surface it to a non-technical team — see the LogPose write-up on competitor price monitoring, which covers dashboarding and alert routing in more depth than this catalog-specific tutorial.

Legality and Ethics

Public Shopify storefronts are indexed by Google, Bing, every comparison shopping engine, and countless aggregator sites. The data (product names, prices, stock state, images) is published for anyone to read. In the US, the Ninth Circuit's ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible pages is unlikely to violate the CFAA; under GDPR, public commercial data can generally be processed on a legitimate-interest basis without consent; and Shopify's own terms of service govern the merchant relationship and the admin/app APIs, not third-party access to public storefronts. The scrape is not the risky step.

Where you should still tread carefully: do not republish the brand's product imagery as your own, do not represent the catalog as if it were yours, and do not use the data to directly impersonate the brand in ad copy or marketplaces. Competitive intelligence — informing your own pricing, merchandising, and stock decisions — is squarely on-side. Direct copying of product copy or imagery is a copyright issue regardless of how the data was obtained.

For dropshippers in particular, there is one specific trap worth naming: the temptation to copy-paste a competitor's full product page (title, description, images) onto your own storefront. That is unambiguously a copyright violation on the description text and the imagery, regardless of whether the source page was public. The legitimate use of catalog scrape data for a dropshipping operation is identifying winning product categories and pricing bands, not reusing copy. Write your own descriptions, source your own imagery from the supplier, and the data informs your strategy without exposing you to a DMCA takedown a year in.

Common Mistakes

  • Pointing at the wrong URL form. The search endpoint expects /collections/..., the product endpoint expects /products/.... Pasting a product URL into search or a homepage URL into either will return a 400. The path-fragment check is strict on purpose.
  • Skipping the compare_at_price field. The most useful pricing signal in a catalog scrape is whether compare_at_price > price — that is the merchandising tell that the SKU is on sale. Many teams scrape the price field and miss the discount layer entirely.
  • Using a single proxy across a 20-store walk. If you are doing this manually with requests and one residential proxy, you will get rate-limited on the third store. The managed flow distributes calls across a residential pool by default; this only matters if you are reimplementing the scrape yourself.
  • Treating pages=50 as a guarantee of 600 rows. It is a cap. Small collections return fewer pages, and the cursor walk stops naturally when the next-page link disappears. Always check len(products) against your assumption.
  • Ignoring the Cloudflare 100-second edge timeout. Submit the job, then poll — do not expect a synchronous result on a 20-page catalog walk. The async pattern in the Python script above is the correct shape.
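The first mistake in the list is cheap to catch before a request is ever sent. This sketch is a client-side pre-check that routes a URL to the search or product endpoint path based on its shape. The endpoint paths come from the calls shown earlier, but the matching rules here are an approximation of the server's check, not a copy of it, and the function name endpoint_for is hypothetical.

```python
from urllib.parse import urlparse

SEARCH_ENDPOINT = "ecommerce/shopify/search"
PRODUCT_ENDPOINT = "ecommerce/shopify/product"


def endpoint_for(url: str) -> str:
    """Return the endpoint path that matches the URL form, or raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")

    parts = [p for p in parsed.path.split("/") if p]

    def has_handle_after(segment: str) -> bool:
        return segment in parts and parts.index(segment) + 1 < len(parts)

    # Check products first: /collections/<c>/products/<p> is a product page.
    if has_handle_after("products"):
        return PRODUCT_ENDPOINT
    if has_handle_after("collections"):
        return SEARCH_ENDPOINT
    raise ValueError(f"expected a /collections/... or /products/... URL: {url!r}")


if __name__ == "__main__":
    print(endpoint_for("https://allbirds.com/collections/mens"))
    print(endpoint_for("https://allbirds.com/products/mens-wool-runners"))
```

Raising locally gives a clearer error message than a bare 400 and avoids spending a job submission on a URL that was never going to parse.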

Get Started

  1. Sign up at logposervices.com and generate an API key under Tool → API Keys.
  2. export LOGPOSE_API_KEY=lp_xxxxxxx
  3. Pick a competitor /collections/all URL and run the Python script above against it.

Related reading: How to monitor competitor prices across DTC stores for the dashboarding and alerting layer once you have the snapshots, Best Amazon scraper APIs in 2026 for the marketplace side of the competitive picture, and Apify alternatives for ecommerce scraping for the broader managed-vs-DIY trade-off.

External: schema.org Product specification, Shopify Dawn theme source, hiQ Labs v. LinkedIn.

Frequently asked questions

Is it legal to scrape another store's Shopify catalog?
Shopify product catalogs are public storefront pages indexed by Google and every comparison-shopping engine. The same legal logic that applies to scraping any retail website applies here: in the US, hiQ Labs v. LinkedIn (9th Cir. 2022) held that scraping publicly accessible data is unlikely to violate the CFAA, and in the EU, public commercial pricing data can generally be processed on a legitimate-interest basis under GDPR. What Shopify's Terms of Service forbid is unauthorized access to a merchant's admin API or app surface, neither of which is involved when you load the public product page. The scrape itself is not the risky step. Republishing the data as a competing storefront, or representing the brand falsely, is where the legal exposure actually sits.
Why do all Shopify stores expose the same JSON-LD shape?
Shopify ships a default `product` Liquid template that embeds a structured-data block following the schema.org `Product` specification. Every store — whether it uses Dawn, Debut, a third-party theme, or a custom one — inherits that template unless the merchant explicitly removes it, which almost nobody does because Google's product-result rich snippets depend on it. That is why a single parser can read titles, variants, prices, and stock status across millions of stores without per-store rules. The same logic covers the `CollectionPage` block on `/collections/<handle>` URLs, which lists every product handle in the collection — that is the cursor for the pagination walk.
What fields does a Shopify product page return?
A product-detail scrape returns the product title, full HTML description, brand or vendor, primary image and the full image gallery, product type, tags, every variant (title, SKU, price, compare-at price, currency, availability), the active variant ID, and the canonical product URL. Out-of-stock variants still appear in the variant array with `availability: OutOfStock`, which is exactly what you want for stock-signal tracking. A collection-page scrape returns a lighter view per product — title, handle, image, price range, and the relative URL needed to chain into the product endpoint for full detail.
How does cursor pagination work on /collections/all?
Classic Liquid themes usually paginate collections with a plain `?page=N` link generated by the theme's `paginate` tag; other storefronts serve the next page through a cursor-style parameter or, on Hydrogen, scroll-driven AJAX. The scraper treats all of these the same way: it follows the next-page link the server emits at the bottom of each rendered collection page, until either no next-page link exists or the configured page cap is reached. This is why the `pages` parameter behaves like a max: a small collection may return fewer pages than requested, and the result simply contains everything the walk covered. About 12 products per page is typical.
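The "follow the next-page link until it disappears or the cap is hit" logic reduces to a short loop. The sketch below is a generic illustration, not the scraper's actual implementation: fetch_page is a hypothetical callable you would supply, taking the current cursor (None for the first page) and returning the page's products plus the next cursor (None when there is no next page).

```python
from typing import Callable, Optional


def walk(fetch_page: Callable[[Optional[str]], tuple], max_pages: int) -> list:
    """Collect products page by page until the cursor runs out or the cap is hit."""
    products = []
    cursor = None
    for _ in range(max_pages):
        items, cursor = fetch_page(cursor)
        products.extend(items)
        if cursor is None:  # no next-page link: the collection is exhausted
            break
    return products


if __name__ == "__main__":
    # A three-page fake collection keyed by cursor, standing in for real fetches.
    fake_pages = {
        None: ([{"handle": "a"}], "c1"),
        "c1": ([{"handle": "b"}], "c2"),
        "c2": ([{"handle": "c"}], None),
    }
    print(len(walk(lambda c: fake_pages[c], max_pages=10)))
    print(len(walk(lambda c: fake_pages[c], max_pages=2)))
```

The second call shows the cap behaving as a maximum: with max_pages=2 the walk stops early and the third page is never requested.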
Does this work on Shopify Hydrogen / custom storefronts?
Yes. Hydrogen storefronts (Shopify's headless React framework) render the collection and product pages client-side, but the underlying data still comes from the same Storefront API and the resulting DOM still embeds the same JSON-LD blocks once the page hydrates. The scraper transparently switches to a browser-rendered fetch when it detects that the initial HTML response is missing the structured-data payload, so the call signature is identical regardless of storefront framework. The trade-off is that Hydrogen pages take longer to scrape per page than classic Liquid pages — typically 4-6 seconds vs 1-2 seconds.
How frequently should I re-scrape a competitor catalog?
Weekly is the default for competitive intelligence; daily is only worth it for time-sensitive launches or active product-launch monitoring. The signals you are tracking — new SKU additions, out-of-stock transitions, price changes, compare-at-price (sale) introductions — generally move on weekly cycles for fashion and DTC brands, and on daily cycles only during launch windows or coordinated promotions. A weekly snapshot diffed against the previous week is typically enough to catch every meaningful merchandising decision. Daily polls on a 500-SKU catalog cost roughly 7x more for marginal additional insight.
