
Extract Amazon ASIN Data in Bulk


You have a CSV of 5,000 ASINs and you need every product field for each one — title, price, rating, review count, BSR, images. Manually paging through Amazon would take a week. A naive Python loop takes eight hours if everything works and crashes halfway when you get rate-limited. This guide is about the third option: bulk scraping that actually finishes.

Why Bulk ASIN Extraction Breaks Naive Scripts

Four things hit you the moment you scale past ~100 ASINs:

Rate limits compound. Even slow pacing (5 seconds per ASIN) becomes a hard ceiling: 5,000 ASINs × 5s = 25,000s, or roughly 7 hours of wall time from a single source. Concurrent fetches help, but multiplexing on one IP just trips Amazon's rate limiting sooner.
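The pacing arithmetic can be checked with a few lines of Python. This is an idealized sketch that assumes no failures or retries; the function name and the pool size of 10 are illustrative assumptions, not measurements.

```python
def wall_time_hours(n_asins: int, seconds_per_request: float, concurrency: int = 1) -> float:
    """Idealized wall time in hours, assuming every request succeeds first time."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return n_asins * seconds_per_request / concurrency / 3600


single = wall_time_hours(5000, 5)      # one source, 5 s pacing
pooled = wall_time_hours(5000, 5, 10)  # hypothetical pool of 10 independent sources
print(f"{single:.1f} h vs {pooled:.1f} h")  # prints "6.9 h vs 0.7 h"
```

The division by concurrency only holds when each concurrent fetch comes from a separate source; ten threads on one IP do not get the same benefit.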

Failures cascade. A 10% failure rate on 5,000 ASINs is 500 retries. Without a queue, retries collide with the original run and your error rate climbs.

Progress is invisible. A for loop without checkpointing means a crash at ASIN 4,873 sends you back to ASIN 0. You need persistent state per-ASIN, not per-batch.

Memory matters. Loading 5,000 scraped pages into Python at once can blow past 8 GB. Stream to disk; never accumulate.

The DIY Approach

For small batches (~100 ASINs), ThreadPoolExecutor with a low worker count is the baseline:

import csv
import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

HEADERS = {"User-Agent": "Mozilla/5.0 ... Chrome/127.0.0.0 ..."}
PAUSE_SECONDS = 2  # pause each worker takes after a request


def scrape_asin(asin: str) -> dict:
    url = f"https://www.amazon.com/dp/{asin}"
    try:
        r = requests.get(url, headers=HEADERS, timeout=15)
    except requests.RequestException:
        # Record network failures as a row instead of crashing the whole run.
        return {"asin": asin, "status": "error"}
    finally:
        # Sleep inside the worker so the pause actually spaces out requests.
        time.sleep(PAUSE_SECONDS)
    if "validateCaptcha" in r.url:
        # CAPTCHA pages come back as HTTP 200, so check the final URL.
        return {"asin": asin, "status": "captcha"}
    soup = BeautifulSoup(r.text, "html.parser")
    title_el = soup.select_one("#productTitle")
    price_el = soup.select_one("span.a-price > span.a-offscreen")
    return {
        "asin": asin,
        "status": "ok" if (title_el and price_el) else "no_data",
        "title": title_el.get_text(strip=True) if title_el else None,
        "price": price_el.get_text(strip=True) if price_el else None,
    }


with open("asins.csv", newline="") as f:
    asins = [row["asin"] for row in csv.DictReader(f)]

results = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(scrape_asin, a): a for a in asins}
    for fut in as_completed(futures):
        results.append(fut.result())

# newline="" prevents blank rows on Windows; missing fields are written as empty cells.
with open("out.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["asin", "status", "title", "price"])
    w.writeheader()
    w.writerows(results)

max_workers=3 is deliberately small — Amazon flags higher concurrency from one source quickly.

Real Limitations

  • Single-IP ceiling. Even with threading, you will see a CAPTCHA wall around 50-200 successful requests from one IP.
  • CAPTCHA responses are HTTP 200. The status code lies; check r.url for the redirect.
  • as_completed order is non-deterministic. If you need stable output order, sort results by asin afterward.
  • No checkpointing. Crash at ASIN 4,873 = start over from zero.
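If the output should follow the order of the input CSV rather than alphabetical ASIN order, one option is to sort by each ASIN's position in the input list. This is a small sketch; `in_input_order` is a hypothetical helper name, and it assumes every result row carries an "asin" key that appears in the input.

```python
def in_input_order(asins, results):
    """Sort result rows to match the order of the original ASIN list."""
    position = {asin: i for i, asin in enumerate(asins)}
    return sorted(results, key=lambda row: position[row["asin"]])


# Results arrive in completion order, not submission order.
asins = ["B0C", "B0A", "B0B"]
results = [{"asin": "B0A"}, {"asin": "B0B"}, {"asin": "B0C"}]
ordered = in_input_order(asins, results)
```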

Sample log of "scaling failure":

[09:14] starting batch of 5000 ASINs
[09:14:02] B09V3KXJPB → ok
[09:14:08] B0BN93GFMN → ok
[09:14:21] B0BTFKR638 → ok
...
[09:18:45] B0CHWJDXYZ → captcha
[09:18:46] B0CHWQHHHJ → captcha
[09:18:47] B0CHX1KQQ1 → captcha
... <456 ASINs in a row return captcha>

You scraped 51 ASINs successfully, lost the next 456 to CAPTCHA walls, and gave up. Total useful output: about 1% of the 5,000 you set out to collect.

Scaling Beyond a Single Script

For 1,000+ ASINs you need:

Persistent state per ASIN. SQLite or Postgres with (asin, status, attempts, last_attempt_at, result_json). Skip done ones, retry pending, abandon after N attempts.
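A minimal sketch of that state table using Python's built-in sqlite3 module follows. The column names come from the tuple above; the function names, the status values ("pending", "done", "abandoned"), and the cutoff of four attempts are assumptions made for illustration.

```python
import json
import sqlite3
import time

MAX_ATTEMPTS = 4  # assumed cutoff before an ASIN is abandoned


def open_state(path=":memory:"):
    """Open (or create) the per-ASIN state table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               asin TEXT PRIMARY KEY,
               status TEXT NOT NULL DEFAULT 'pending',
               attempts INTEGER NOT NULL DEFAULT 0,
               last_attempt_at REAL,
               result_json TEXT
           )"""
    )
    return conn


def seed(conn, asins):
    # INSERT OR IGNORE keeps existing rows, so re-seeding after a crash
    # does not reset ASINs that are already done.
    conn.executemany("INSERT OR IGNORE INTO jobs (asin) VALUES (?)", [(a,) for a in asins])
    conn.commit()


def pending(conn):
    """ASINs that still need a fetch (not done, not abandoned)."""
    rows = conn.execute("SELECT asin FROM jobs WHERE status = 'pending' ORDER BY asin")
    return [row[0] for row in rows]


def record(conn, asin, result):
    """Store one attempt's outcome; the ASIN must have been seeded first."""
    ok = result.get("status") == "ok"
    attempts = conn.execute(
        "SELECT attempts FROM jobs WHERE asin = ?", (asin,)
    ).fetchone()[0] + 1
    status = "done" if ok else ("abandoned" if attempts >= MAX_ATTEMPTS else "pending")
    conn.execute(
        "UPDATE jobs SET status = ?, attempts = ?, last_attempt_at = ?, result_json = ? "
        "WHERE asin = ?",
        (status, attempts, time.time(), json.dumps(result) if ok else None, asin),
    )
    conn.commit()
```

Passing a file path instead of ":memory:" makes the state survive a crash: on restart, `seed` is a no-op for known ASINs and `pending` returns only the unfinished ones.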

Rate-limited rotating residential pool. Bright Data, Smartproxy, or Oxylabs at 5-20 concurrent connections, with sticky sessions per ASIN so a single proxy handles a single fetch start to finish.

Exponential backoff. Wait 1s, 2s, 4s, then 8s before successive retries, and give up once the fourth retry fails.
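The backoff schedule can be sketched as a small wrapper around any fetch function. This version assumes up to four retries after the first attempt and a fetch function that returns a dict with a "status" key, like scrape_asin earlier; `fetch_with_backoff` and its parameters are hypothetical names.

```python
import time


def fetch_with_backoff(fetch, asin, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(asin), retrying with doubling delays until status is 'ok'."""
    result = fetch(asin)
    for retry in range(max_retries):
        if result.get("status") == "ok":
            break
        # Delay doubles each time: 1s, 2s, 4s, 8s with the defaults.
        sleep(base_delay * 2 ** retry)
        result = fetch(asin)
    return result
```

The `sleep` parameter exists so the schedule can be tested without waiting. In production it is common to add random jitter to each delay so that parallel workers do not retry in lockstep.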

Live progress visibility. Either a terminal progress bar (tqdm) or a small dashboard endpoint that returns (total, done, failed, pending).
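The (total, done, failed, pending) tuple is cheap to compute from per-ASIN statuses. A minimal sketch, assuming statuses are kept in a dict keyed by ASIN and use the names "done", "failed", and "pending" (all assumptions for illustration):

```python
from collections import Counter


def progress_snapshot(statuses):
    """Summarize a {asin: status} mapping into the four dashboard counts."""
    counts = Counter(statuses.values())
    return {
        "total": len(statuses),
        "done": counts["done"],
        "failed": counts["failed"],
        "pending": counts["pending"],
    }


snapshot = progress_snapshot({"A": "done", "B": "failed", "C": "pending", "D": "done"})
print(snapshot)
```

A dashboard endpoint could return this dict as JSON, and a terminal script could print it every few seconds.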

Result streaming. Append each scraped result to a JSONL file or pipe to a database immediately. Do not hold results in memory.
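Appending to JSONL takes only a few lines. The sketch below writes one JSON object per line as soon as a result arrives and reads them back lazily; the helper names are hypothetical.

```python
import json


def append_result(path, result):
    """Append one scraped result as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(result, ensure_ascii=False) + "\n")


def iter_results(path):
    """Yield results one at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Because each result is on disk as soon as it is written, a crash loses at most the fetches that were in flight, and memory use stays flat regardless of batch size.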

A hosted bulk endpoint moves all of this server-side. LogPose's bulk submit takes an array of targets and returns a single bulk_id that aggregates child jobs:

import os
import csv
import time
import requests

API_KEY = os.environ["LOGPOSE_API_KEY"]
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}


with open("asins.csv") as f:
    asins = [row["asin"] for row in csv.DictReader(f)]

# Submit in chunks (the endpoint accepts 1-500 targets per call).
CHUNK = 200
bulk_ids = []

for i in range(0, len(asins), CHUNK):
    batch = asins[i:i + CHUNK]
    payload = {
        "targets": [{"url": asin} for asin in batch],  # bare ASIN auto-expanded
        "default_pages": 1,
    }
    r = requests.post(
        f"{BASE}/ecommerce/amazon/smart/bulk",
        headers=HEADERS, json=payload, timeout=60,
    )
    r.raise_for_status()
    bulk_ids.append(r.json()["bulk_id"])
    print(f"submitted batch {i}-{i + len(batch)}: {r.json()['bulk_id']}")

# Poll each bulk until done.
for bulk_id in bulk_ids:
    while True:
        s = requests.get(
            f"{BASE}/jobs/bulk/{bulk_id}", headers=HEADERS, timeout=30,
        ).json()
        print(f"{bulk_id}: {s.get('completed', 0)}/{s.get('total', 0)}")
        if s["status"] in ("completed", "failed"):
            break
        time.sleep(10)

A 5,000-ASIN run completes in tens of minutes rather than hours because the bulk endpoint dispatches child jobs across the global browser pool concurrently.

To preview what a submit would charge without placing a hold, call the bulk-estimate endpoint first:

resp = requests.post(
    f"{BASE}/ecommerce/amazon/smart/bulk/estimate",
    headers=HEADERS,
    json={"targets": [{"url": a} for a in asins[:200]]},
    timeout=60,
)
resp.raise_for_status()
estimate = resp.json()
print(f"would charge {estimate['total_credits']} credits")

Common Mistakes

  • No checkpointing. Mid-run crashes are inevitable. Persist per-ASIN state.
  • Treating CAPTCHA as a non-error. A scraper that "succeeds" on a CAPTCHA page silently corrupts your dataset.
  • Holding all results in memory. Stream to disk or DB; do not accumulate.
  • Same proxy for all retries. A burned proxy stays burned. Rotate on retry.
  • Skipping the dry run. Always submit a 10-ASIN sample first and inspect the output before launching the full 5,000.
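Rotating proxies on retry can be sketched as a schedule that assigns a different proxy to each attempt. The proxy URLs below are hypothetical placeholders, and `proxy_schedule` is an illustrative helper; each mapping it returns has the shape that requests accepts through its `proxies=` argument.

```python
from itertools import cycle

# Hypothetical proxy URLs; substitute the gateway addresses your provider issues.
PROXY_URLS = [
    "http://user:pass@proxy-a.example.com:8000",
    "http://user:pass@proxy-b.example.com:8000",
    "http://user:pass@proxy-c.example.com:8000",
]


def proxy_schedule(proxy_urls, attempts):
    """Return one requests-style proxies mapping per attempt, cycling the pool."""
    pool = cycle(proxy_urls)
    return [{"http": url, "https": url} for url, _ in zip(pool, range(attempts))]


schedule = proxy_schedule(PROXY_URLS, 4)
```

With at least two proxies in the pool, consecutive attempts never share a proxy. Attempt n would then call something like `requests.get(url, proxies=schedule[n], timeout=15)`.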

The Landscape

For bulk Amazon catalog scrapes:

  • Apify Amazon Product Scraper actor — pay-per-result; reasonable for one-off runs, less consistent for daily pipelines.
  • DataForSEO — has Amazon endpoints with a different async model; SERP-focused but covers product data.
  • Bright Data Web Scraper IDE — visual scraper builder with managed proxies; steeper learning curve.
  • DIY + Bright Data proxies — full control if your team already runs scrapers.
  • LogPose — bulk submit endpoint with parent/child progress tracking; useful when bulk is one part of a broader multi-site workflow.

For one-off catalog dumps, the managed bulk option usually wins. For sustained daily bulk runs at large scale, building your own pipeline tends to cost less over the long term, once the engineering investment is amortized.

Get Started

  1. Sign up at logposervices.com.
  2. Generate an API key under Tool → API Keys.
  3. Run the bulk-submit snippet above against a CSV of ASINs — start with 20 as a smoke test.
  4. Pipe the result JSONL into your warehouse.

Related: scrape Amazon prices in Python, Amazon search results, Amazon reviews API.

External: Python concurrent.futures, JSONL spec.

Frequently asked questions

How many Amazon ASINs can I scrape in parallel?

From a single residential IP: roughly 1 concurrent connection with 3-5s pacing. With a rotating residential pool: 5-20 concurrent. With a managed bulk API: limited by your account's queue depth (typically 50-500 concurrent jobs).

What is the cheapest way to scrape 10,000 Amazon ASINs?

If you already pay for residential proxies, DIY is cheapest as long as your engineering time is free. If you do not, a managed bulk API is usually faster end-to-end and avoids the operational debt of building queue infrastructure for a one-off run.

How do I handle failed scrapes in a bulk run?

Persist (asin, status, attempts) for every job. Retry failed ones with exponential backoff a few times, then mark them as permanently failed and move on. Trying to make 100% succeed is a trap: 95-98% is the realistic ceiling on Amazon at scale.

Should I scrape sequentially or in parallel?

Sequential is safer for small lots (<100 ASINs). Use parallel fetches through rate-limited proxies for medium runs (100-10k), and hosted bulk endpoints for large ones (10k+), where the API handles concurrency on its end and you just poll a parent bulk ID.

Can I scrape ASINs from a CSV directly?

Yes. Load the CSV, loop over the ASIN column, and send each ASIN to either your own scraper or a bulk API endpoint. The LogPose bulk endpoint accepts an array of targets and returns a single bulk_id you poll for aggregate progress, instead of one job at a time.
