Extract Amazon ASIN Data in Bulk
You have a CSV of 5,000 ASINs and you need every product field for each one: title, price, rating, review count, BSR, images. Manually paging through Amazon would take a week. A naive Python loop takes about seven hours if everything works, and crashes halfway when you get rate-limited. This guide is about the third option: bulk scraping that actually finishes.
Why Bulk ASIN Extraction Breaks Naive Scripts
Four things hit you the moment you scale past ~100 ASINs:
Rate limits compound. Even slow pacing (5 seconds per ASIN) becomes a hard ceiling: 5,000 ASINs × 5s = ~7 hours of wall time from a single source. Concurrent fetches help, but multiplexing on one IP just trips Amazon's rate limiting faster.
Failures cascade. A 10% failure rate on 5,000 ASINs is 500 retries. Without a queue, retries collide with the original run and your error rate climbs.
Progress is invisible. A for loop without checkpointing means a crash at ASIN 4,873 sends you back to ASIN 0. You need persistent state per-ASIN, not per-batch.
Memory matters. Loading 5,000 scraped pages into Python at once can blow past 8 GB. Stream to disk; never accumulate.
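The wall-time arithmetic above can be checked with a quick back-of-the-envelope helper. This is only a rough sketch: `wall_time_hours` is a hypothetical name, and it assumes perfect parallelism and zero failures, which a real run will not achieve.

```python
def wall_time_hours(n_asins: int, seconds_per_asin: float, workers: int = 1) -> float:
    """Idealized wall time in hours: perfect parallelism, no retries."""
    return n_asins * seconds_per_asin / workers / 3600


# 5,000 ASINs at 5 seconds each, first sequentially, then with 3 workers.
print(f"{wall_time_hours(5000, 5):.1f} h sequential")
print(f"{wall_time_hours(5000, 5, workers=3):.1f} h with 3 workers")
```

Treat the result as a lower bound: retries and CAPTCHA pauses only add to it.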
The DIY Approach
For small batches (~100 ASINs), ThreadPoolExecutor with a low worker count is the baseline:
import csv
import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

HEADERS = {"User-Agent": "Mozilla/5.0 ... Chrome/127.0.0.0 ..."}
FIELDS = ["asin", "status", "title", "price"]

def scrape_asin(asin: str) -> dict:
    # Pace inside the worker; sleeping in the result loop would not slow the fetches.
    time.sleep(2)
    url = f"https://www.amazon.com/dp/{asin}"
    try:
        r = requests.get(url, headers=HEADERS, timeout=15)
    except requests.RequestException:
        # Network errors become a row instead of crashing the whole batch.
        return {"asin": asin, "status": "error"}
    # CAPTCHA pages come back as HTTP 200, so check the final URL for the redirect.
    if "validateCaptcha" in r.url:
        return {"asin": asin, "status": "captcha"}
    soup = BeautifulSoup(r.text, "html.parser")
    title_el = soup.select_one("#productTitle")
    price_el = soup.select_one("span.a-price > span.a-offscreen")
    return {
        "asin": asin,
        "status": "ok" if (title_el and price_el) else "no_data",
        "title": title_el.get_text(strip=True) if title_el else None,
        "price": price_el.get_text(strip=True) if price_el else None,
    }

with open("asins.csv", newline="") as f:
    asins = [row["asin"] for row in csv.DictReader(f)]

results = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(scrape_asin, a): a for a in asins}
    for fut in as_completed(futures):
        results.append(fut.result())

with open("out.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=FIELDS)
    w.writeheader()
    w.writerows(results)
max_workers=3 is deliberately small — Amazon flags higher concurrency from one source quickly.
Real Limitations
- Single-IP ceiling. Even with threading, you will see a CAPTCHA wall around 50-200 successful requests from one IP.
- CAPTCHA responses are HTTP 200. The status code lies; check r.url for the redirect.
- as_completed order is non-deterministic. If you need stable output order, sort results by asin afterward.
- No checkpointing. Crash at ASIN 4,873 = start over from zero.
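One way to keep the status-code trap out of the dataset is to classify every response explicitly before parsing it. The following is a minimal sketch that reuses the `validateCaptcha` redirect check from the script above. `classify_response` and `restore_input_order` are hypothetical helper names, and restoring the input order is shown as an alternative to sorting by ASIN.

```python
def classify_response(status_code: int, final_url: str,
                      has_title: bool, has_price: bool) -> str:
    """Label one fetch. The CAPTCHA check comes first because CAPTCHA pages are HTTP 200."""
    if "validateCaptcha" in final_url:
        return "captcha"
    if status_code != 200:
        return f"http_{status_code}"
    return "ok" if (has_title and has_price) else "no_data"


def restore_input_order(results: list, asins: list) -> list:
    """Return results sorted back into the order the ASINs were read in."""
    position = {asin: i for i, asin in enumerate(asins)}
    return sorted(results, key=lambda r: position[r["asin"]])


# Placeholder ASINs, in the order as_completed might hand them back.
done = [{"asin": "ASIN_2"}, {"asin": "ASIN_1"}]
print(restore_input_order(done, ["ASIN_1", "ASIN_2"]))
```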
Sample log of "scaling failure":
[09:14] starting batch of 5000 ASINs
[09:14:02] B09V3KXJPB → ok
[09:14:08] B0BN93GFMN → ok
[09:14:21] B0BTFKR638 → ok
...
[09:18:45] B0CHWJDXYZ → captcha
[09:18:46] B0CHWQHHHJ → captcha
[09:18:47] B0CHX1KQQ1 → captcha
... <456 ASINs in a row return captcha>
You scraped 51 successfully, lost the next 456 to rate limiting, and gave up. Total useful output: 1% of intent.
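A run like the one in this log keeps burning requests long after the IP is flagged. A small circuit breaker that counts consecutive CAPTCHA results is one way to stop early. This is a sketch, not part of the original script: `CaptchaBreaker` is a hypothetical class, and the threshold of 3 is an arbitrary assumption to tune for your own runs.

```python
class CaptchaBreaker:
    """Trips after `threshold` CAPTCHA results in a row."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.streak = 0

    def record(self, status: str) -> bool:
        """Record one result; return True when the run should pause."""
        self.streak = self.streak + 1 if status == "captcha" else 0
        return self.streak >= self.threshold


breaker = CaptchaBreaker(threshold=3)
statuses = ["ok", "ok", "captcha", "captcha", "captcha", "captcha"]
for i, status in enumerate(statuses):
    if breaker.record(status):
        print(f"tripped after {i + 1} results; pausing run")
        break
```

In the threaded script, `record` would be called from the result loop in the main thread, so it needs no locking there.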
Scaling Beyond a Single Script
For 1,000+ ASINs you need:
Persistent state per ASIN. SQLite or Postgres with (asin, status, attempts, last_attempt_at, result_json). Skip done ones, retry pending, abandon after N attempts.
Rate-limited rotating residential pool. Bright Data, Smartproxy, or Oxylabs at 5-20 concurrent connections, with sticky sessions per ASIN so a single proxy handles a single fetch start to finish.
Exponential backoff. Wait 1s, 2s, 4s, then 8s before successive retries. Give up when the fourth retry fails.
Live progress visibility. Either a terminal progress bar (tqdm) or a small dashboard endpoint that returns (total, done, failed, pending).
Result streaming. Append each scraped result to a JSONL file or pipe to a database immediately. Do not hold results in memory.
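Several of the requirements above fit in one short sketch: per-ASIN state in SQLite, exponential backoff, progress counts, and JSONL streaming. All function, table, and file names here are hypothetical. It makes three simplifying assumptions: it runs sequentially, it reads the delays as one initial attempt plus up to four retries, and it streams results to JSONL instead of storing a `result_json` column. Proxy rotation is left to the `fetch` callable you pass in.

```python
import json
import sqlite3
import time
from typing import Callable, Iterable

# Assumption: one initial attempt, then one retry after each delay below.
BACKOFF_DELAYS = (1, 2, 4, 8)


def open_state(path: str = "state.db") -> sqlite3.Connection:
    """Open (or create) the per-ASIN state table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               asin TEXT PRIMARY KEY,
               status TEXT NOT NULL DEFAULT 'pending',
               attempts INTEGER NOT NULL DEFAULT 0,
               last_attempt_at REAL
           )"""
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, asins: Iterable[str]) -> None:
    """Add new ASINs; ones already in the table keep their state."""
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (asin) VALUES (?)", [(a,) for a in asins]
    )
    conn.commit()


def progress(conn: sqlite3.Connection) -> dict:
    """Return counts of pending, done, and failed ASINs plus the total."""
    counts = {"pending": 0, "done": 0, "failed": 0}
    query = "SELECT status, COUNT(*) FROM jobs GROUP BY status"
    for status, n in conn.execute(query):
        counts[status] = n
    counts["total"] = sum(counts.values())
    return counts


def run(conn: sqlite3.Connection, fetch: Callable[[str], dict],
        out_path: str = "results.jsonl",
        sleep: Callable[[float], None] = time.sleep) -> None:
    """Process every pending ASIN, appending each success to a JSONL file."""
    rows = conn.execute(
        "SELECT asin FROM jobs WHERE status = 'pending' ORDER BY rowid"
    ).fetchall()
    with open(out_path, "a", encoding="utf-8") as out:
        for (asin,) in rows:
            status = "failed"
            for attempt in range(len(BACKOFF_DELAYS) + 1):
                if attempt > 0:
                    sleep(BACKOFF_DELAYS[attempt - 1])  # 1s, 2s, 4s, 8s
                try:
                    result = fetch(asin)
                except Exception:
                    result = None  # treat any fetch error as a failed attempt
                conn.execute(
                    "UPDATE jobs SET attempts = attempts + 1, "
                    "last_attempt_at = ? WHERE asin = ?",
                    (time.time(), asin),
                )
                conn.commit()
                if result is not None and result.get("status") == "ok":
                    out.write(json.dumps(result) + "\n")
                    out.flush()  # the result is on disk before we move on
                    status = "done"
                    break
            conn.execute(
                "UPDATE jobs SET status = ? WHERE asin = ?", (status, asin)
            )
            conn.commit()
```

A typical call would be `conn = open_state()`, `enqueue(conn, asins)`, then `run(conn, scrape_asin)`. Rerunning after a crash picks up only the rows still marked pending. A crash between the JSONL write and the status update can leave one duplicate line, so deduplicate by ASIN when loading.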
A hosted bulk endpoint moves all of this server-side. LogPose's bulk submit takes an array of targets and returns a single bulk_id that aggregates child jobs:
import os
import csv
import time
import requests

API_KEY = os.environ["LOGPOSE_API_KEY"]
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}

with open("asins.csv", newline="") as f:
    asins = [row["asin"] for row in csv.DictReader(f)]

# Submit in chunks (the endpoint accepts 1-500 targets per call).
CHUNK = 200
bulk_ids = []
for i in range(0, len(asins), CHUNK):
    batch = asins[i:i + CHUNK]
    payload = {
        "targets": [{"url": asin} for asin in batch],  # bare ASIN auto-expanded
        "default_pages": 1,
    }
    r = requests.post(
        f"{BASE}/ecommerce/amazon/smart/bulk",
        headers=HEADERS, json=payload, timeout=60,
    )
    r.raise_for_status()
    bulk_id = r.json()["bulk_id"]
    bulk_ids.append(bulk_id)
    print(f"submitted batch {i}-{i + len(batch)}: {bulk_id}")

# Poll each bulk until done.
for bulk_id in bulk_ids:
    while True:
        resp = requests.get(
            f"{BASE}/jobs/bulk/{bulk_id}", headers=HEADERS, timeout=30,
        )
        resp.raise_for_status()  # surface auth or server errors instead of a KeyError below
        s = resp.json()
        print(f"{bulk_id}: {s.get('completed', 0)}/{s.get('total', 0)}")
        if s["status"] in ("completed", "failed"):
            break
        time.sleep(10)
A 5,000-ASIN run completes in tens of minutes rather than hours because the bulk endpoint dispatches child jobs across the global browser pool concurrently.
To preview what a submit would charge without placing a hold, the bulk-estimate endpoint shows it up-front:
estimate = requests.post(
    f"{BASE}/ecommerce/amazon/smart/bulk/estimate",
    headers=HEADERS,
    json={"targets": [{"url": a} for a in asins[:200]]},
    timeout=60,
).json()
print(f"would charge {estimate['total_credits']} credits")
Common Mistakes
- No checkpointing. Mid-run crashes are inevitable. Persist per-ASIN state.
- Treating CAPTCHA as a non-error. A scraper that "succeeds" on a CAPTCHA page silently corrupts your dataset.
- Holding all results in memory. Stream to disk or DB; do not accumulate.
- Same proxy for all retries. A burned proxy stays burned. Rotate on retry.
- Skipping the dry run. Always submit a 10-ASIN sample first and inspect the output before launching the full 5,000.
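The dry run in the last item is easier to trust if the sample output is checked by code rather than by eye. This is a rough sketch assuming result rows shaped like the DIY script's output (`asin`, `status`, `title`, `price`). `pick_sample`, `sample_report`, and `safe_to_launch` are hypothetical helpers, and the 90% success threshold is an arbitrary assumption.

```python
import random


def pick_sample(asins, k: int = 10, seed: int = 0) -> list:
    """Pick up to k ASINs reproducibly for a smoke test."""
    pool = list(asins)
    return random.Random(seed).sample(pool, min(k, len(pool)))


def sample_report(results, required=("title", "price")) -> dict:
    """Summarize a dry run: status counts and missing required fields."""
    ok = [r for r in results if r.get("status") == "ok"]
    return {
        "total": len(results),
        "ok": len(ok),
        "captcha": sum(1 for r in results if r.get("status") == "captcha"),
        "missing": {f: sum(1 for r in ok if not r.get(f)) for f in required},
    }


def safe_to_launch(report: dict, min_ok_ratio: float = 0.9) -> bool:
    """True if the sample saw no CAPTCHAs, enough successes, and no empty fields."""
    if report["total"] == 0:
        return False
    return (
        report["captcha"] == 0
        and report["ok"] / report["total"] >= min_ok_ratio
        and not any(report["missing"].values())
    )
```

Feeding the sample's results through `sample_report` before the full launch turns a CAPTCHA or an empty field into a blocked launch instead of a corrupted dataset.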
The Landscape
For bulk Amazon catalog scrapes:
- Apify Amazon Product Scraper actor — pay-per-result; reasonable for one-off runs, less consistent for daily pipelines.
- DataForSEO — has Amazon endpoints with a different async model; SERP-focused but covers product data.
- Bright Data Web Scraper IDE — visual scraper builder with managed proxies; steeper learning curve.
- DIY + Bright Data proxies — full control if your team already runs scrapers.
- LogPose — bulk submit endpoint with parent/child progress tracking; useful when bulk is one part of a broader multi-site workflow.
For one-off catalog dumps, the managed bulk option usually wins. For sustained daily bulk runs at large scale, building your own pipeline tends to cost less in the long run, once the engineering investment is amortized.
Get Started
- Sign up at logposervices.com.
- Generate an API key under Tool → API Keys.
- Run the bulk-submit snippet above against a CSV of ASINs — start with 20 as a smoke test.
- Pipe the result JSONL into your warehouse.