How to Scrape Crunchbase for Startup Funding Data

Crunchbase is where startup funding data lives in structured form — who raised, how much, at what stage, from which investors, in which sector. For anyone building a VC deal-flow list, enriching a B2B prospect list with funding signals, or mapping a competitive landscape, the profile pages are the source. The problem is that getting that data out at any scale means clicking through hundreds of organization pages by hand, and Crunchbase actively makes automated collection difficult.

This guide walks the full pipeline end to end: what fields you can actually pull, how to search organizations by sector or keyword to get a list of company slugs, how to fetch each organization's detail page for its funding rounds and investors, how the async submit-poll-result loop works, and how to flatten the nested funding-round data into flat CSV rows. It closes with an honest look at why Crunchbase is hard to scrape yourself — the Cloudflare Turnstile and headless-browser problem that defeats most naive scrapers — so you can decide whether to build the anti-bot stack or hand that part off.

What You Can Pull From a Crunchbase Profile

A Crunchbase organization page exposes two layers of data. The firmographic layer describes the company; the funding layer describes its money.

From the firmographic layer you get the company description, the sector categories (e.g. "Artificial Intelligence, SaaS, Developer Tools"), the headquarters location, the founding year, the company website and social links, and a headcount band (the bucketed employee range Crunchbase reports — 11-50, 51-100, 101-250, and so on, never an exact number).

From the funding layer you get the list of named funding rounds — each with a round type (Seed, Series A, Series B, …), the announced date, the money raised, and the set of investors that participated, with lead investors flagged where Crunchbase marks them. Aggregated across rounds you also get total funding raised and the number of funding rounds.

| Field | Source layer | Example |
| --- | --- | --- |
| name | firmographic | Acme AI |
| description | firmographic | Developer platform for building LLM agents |
| categories | firmographic | Artificial Intelligence, Developer Tools, SaaS |
| location | firmographic | San Francisco, California, United States |
| founded_year | firmographic | 2021 |
| website | firmographic | https://acme.ai |
| headcount_band | firmographic | 51-100 |
| total_funding_usd | funding (aggregate) | 64000000 |
| num_funding_rounds | funding (aggregate) | 3 |
| funding_rounds[] | funding (per round) | [{round, date, raised, investors[]}] |

What you do not get from public profile pages: exact revenue, exact headcount, private cap-table detail, or anything gated behind a Crunchbase Pro login. Those require a licensed Crunchbase subscription, not a scrape. The public profile is the funding-and-firmographics layer — which is exactly what deal-flow and enrichment workflows need.
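
As a rough picture of the record the rest of this guide works with, the sketch below models the fields listed above as Python TypedDicts. The key names follow the field list; the exact keys, types, and nesting the API returns are assumptions here, so check a real response before relying on them. The per-round figures are invented purely so the example adds up.

```python
from typing import TypedDict


class FundingRound(TypedDict, total=False):
    round: str            # round type, e.g. "Seed" or "Series A"
    date: str             # announced date (ISO format is an assumption)
    raised: int           # amount raised, assumed to be whole USD
    investors: list[str]  # investor names, assumed to be plain strings


class Organization(TypedDict, total=False):
    name: str
    description: str
    categories: list[str]
    location: str
    founded_year: int
    website: str
    headcount_band: str   # a bucket such as "51-100", never an exact count
    total_funding_usd: int
    num_funding_rounds: int
    funding_rounds: list[FundingRound]


# Illustrative record only: firmographics mirror the example values above,
# while the individual rounds and investor names are made up.
example: Organization = {
    "name": "Acme AI",
    "categories": ["Artificial Intelligence", "Developer Tools", "SaaS"],
    "founded_year": 2021,
    "headcount_band": "51-100",
    "total_funding_usd": 64_000_000,
    "num_funding_rounds": 3,
    "funding_rounds": [
        {"round": "Seed", "date": "2021-09-01", "raised": 4_000_000,
         "investors": ["Example Ventures"]},
        {"round": "Series A", "date": "2022-11-15", "raised": 15_000_000,
         "investors": ["Example Ventures", "Sample Capital"]},
        {"round": "Series B", "date": "2024-03-05", "raised": 45_000_000,
         "investors": ["Sample Capital"]},
    ],
}
```

A cheap sanity check on any real record is that the per-round amounts sum to total_funding_usd and the round count matches num_funding_rounds; when they do not, some rounds are probably undisclosed.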

Step 1 — Search Organizations by Sector or Keyword

The pipeline starts with discovery: turn a sector or keyword into a list of organization slugs you can then fetch in detail. A Crunchbase slug is the identifier in the profile URL — for https://www.crunchbase.com/organization/acme-ai, the slug is acme-ai.
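
If your starting point is a list of profile URLs rather than slugs, a small helper can normalize both forms. This is a minimal sketch: slug_from_url is a hypothetical name, and it assumes the /organization/<slug> path layout described above.

```python
from urllib.parse import urlparse


def slug_from_url(url_or_slug: str) -> str:
    """Return the organization slug from a profile URL or a bare slug."""
    value = url_or_slug.strip()
    if "://" not in value:
        # Already a bare slug; just tidy stray slashes.
        return value.strip("/")
    parts = [p for p in urlparse(value).path.split("/") if p]
    # Expect /organization/<slug>, possibly followed by a sub-page segment.
    if len(parts) >= 2 and parts[0] == "organization":
        return parts[1]
    raise ValueError(f"not a Crunchbase organization URL: {url_or_slug!r}")
```

Normalizing to slugs early gives you one stable key for de-duplicating search results and for joining snapshots later.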

The organization search endpoint takes a free-text query and returns matching organizations:

curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/orgsearch" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "query=ai developer tools" \
  --data-urlencode "pages=1"
# → {"job_id": "cb_8f3a...", "status": "pending"}

The response is an async job handle, not the results inline (more on that loop in Step 3). Once it completes, the result is a list of organizations, each carrying at minimum its name and slug — the slug is the key you carry into the detail step.

An honest nuance on pagination. Multi-page search results only unlock when a Crunchbase Pro account is connected to the request via an account_id parameter, because Crunchbase gates deep result pagination behind an authenticated Pro session. Without a connected account, a pages=N request is still accepted, but you get only the first page of results regardless of the value you pass. For sector discovery, the first page of high-relevance matches is usually enough to seed a detail crawl; if you need exhaustive coverage of a category, connect a Pro account and pass its account_id to page through the full result set.

# With a connected Crunchbase Pro account, deep pagination unlocks
curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/orgsearch" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "query=ai developer tools" \
  --data-urlencode "pages=5" \
  --data-urlencode "account_id=your_connected_account_id"
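
The pagination rule is easy to encode so a caller never assumes it received more pages than it did. A minimal sketch, assuming only the three parameters shown above (query, pages, account_id); build_search_params and effective_pages are hypothetical helper names, not part of the API.

```python
from typing import Optional


def build_search_params(query: str, pages: int = 1,
                        account_id: Optional[str] = None) -> dict:
    """Assemble query-string parameters for the orgsearch call."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if pages < 1:
        raise ValueError("pages must be at least 1")
    params = {"query": query, "pages": pages}
    if account_id:
        # Only send account_id when a Pro account is actually connected.
        params["account_id"] = account_id
    return params


def effective_pages(pages: int, account_id: Optional[str] = None) -> int:
    """Pages to expect back: without a connected account, only the first."""
    return pages if account_id else 1
```

Logging effective_pages alongside the requested value makes it obvious when a run was silently limited to the first page.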

Step 2 — Fetch Each Organization's Detail

With a list of slugs in hand, fetch each organization's profile for its funding rounds, investors, and firmographics. The organization endpoint accepts either a full Crunchbase URL or a bare slug:

# Bare slug
curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/organization" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "url=acme-ai"
# → {"job_id": "cb_9b2d...", "status": "pending"}

# Or a full URL — both resolve to the same profile
curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/organization" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "url=https://www.crunchbase.com/organization/acme-ai"

The completed result is the firmographic-plus-funding object: description, categories, location, founded year, headcount band, total funding, and the funding_rounds array with each round's type, date, raised amount, and investors.

Two sibling endpoints share the same shape if your workflow needs them. GET /api/v1/ecommerce/crunchbase/person?url=<slug-or-url> pulls a person profile (founders, operators, investors as individuals), and GET /api/v1/ecommerce/crunchbase/hub?url=<slug-or-url> pulls a hub — Crunchbase's curated category and location collections. For a funding-data pipeline, the organization endpoint is the workhorse; the others are there when you extend into people-mapping or category landscapes.
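
Because the three profile endpoints differ only in their last path segment, one small builder can cover all of them. A sketch using the paths quoted above; profile_request is a hypothetical helper name.

```python
BASE = "https://api.logposervices.com/api/v1"
PROFILE_KINDS = ("organization", "person", "hub")


def profile_request(kind: str, url_or_slug: str) -> tuple[str, dict]:
    """Return (endpoint URL, query params) for a Crunchbase profile job."""
    if kind not in PROFILE_KINDS:
        raise ValueError(f"unknown profile kind: {kind!r}")
    # All three endpoints take the same single parameter: a slug or full URL.
    return f"{BASE}/ecommerce/crunchbase/{kind}", {"url": url_or_slug}
```

The returned pair can be passed straight to an HTTP client, for example requests.get(url, params=params, headers=...).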

Step 3 — The Async Submit / Poll / Result Loop

Every Crunchbase call is asynchronous, and for a good reason. Getting through Cloudflare Turnstile with a real browser takes time, and api.logposervices.com itself sits behind Cloudflare, whose edge by default gives the origin 100 seconds to respond. A synchronous request that ran the full scrape inline would risk being cut off at the edge with a 524 before the page loaded. The async pattern sidesteps that entirely: the GET returns a job handle immediately, the scrape runs server-side, and you poll for completion.

The loop is three calls:

# 1) Submit — returns a job handle immediately
curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/organization" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "url=acme-ai"
# → {"job_id": "cb_9b2d...", "status": "pending"}

# 2) Poll the job until status flips to "completed"
curl -H "X-API-Key: lp_xxxxxxx" \
  "https://api.logposervices.com/api/v1/jobs/cb_9b2d"
# → {"job_id": "cb_9b2d...", "status": "running"}  → poll again
# → {"job_id": "cb_9b2d...", "status": "completed"} → fetch result

# 3) Fetch the parsed result
curl -H "X-API-Key: lp_xxxxxxx" \
  "https://api.logposervices.com/api/v1/jobs/cb_9b2d/result"

Poll on a short interval — every two to three seconds — and treat failed as a terminal state alongside completed. Never expect the data inline on the submit call; the job handle is the contract.
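
The polling rule can be isolated from the HTTP calls so it is testable without a network. A minimal sketch, assuming the status values shown above ("pending", "running", "completed", "failed"); poll_until_terminal is a hypothetical helper, and the sleep and clock arguments exist only so tests can inject fakes.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed"}


def poll_until_terminal(
    get_status: Callable[[], str],
    interval_s: float = 3.0,
    timeout_s: float = 150.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status until it returns a terminal state, then return it."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        # Stop before sleeping past the deadline rather than after.
        if clock() + interval_s > deadline:
            raise TimeoutError(f"no terminal status within {timeout_s}s")
        sleep(interval_s)
```

In real use, get_status would wrap the GET on /jobs/<job_id> and return the "status" field; the caller then fetches /result only when the returned state is "completed".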

Step 4 — The Python Pipeline

Here is the full pipeline as one script: search a sector for organizations, collect their slugs, fetch each one's detail through the async loop, flatten the funding rounds to rows, and write a CSV a deal-flow analyst can open and read.

import os, time, csv, requests

API_KEY = os.environ["LOGPOSE_API_KEY"]
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def submit_and_wait(path: str, params: dict, timeout_s: int = 150) -> dict:
    """Submit an async job, poll until it finishes, return the result body."""
    r = requests.get(f"{BASE}/{path}", params=params, headers=HEADERS, timeout=30)
    r.raise_for_status()
    job_id = r.json()["job_id"]

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS, timeout=15)
        resp.raise_for_status()  # fail fast on auth/server errors instead of polling to timeout
        s = resp.json()
        status = s.get("status")
        if status == "completed":
            break
        if status == "failed":
            raise RuntimeError(s.get("error", "job failed"))
        time.sleep(3)
    else:
        # while/else: this branch runs only if the loop ended without `break`
        raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")

    result = requests.get(f"{BASE}/jobs/{job_id}/result", headers=HEADERS, timeout=15)
    result.raise_for_status()
    return result.json()


def search_org_slugs(query: str, pages: int = 1) -> list[str]:
    """Search organizations by keyword, return their slugs."""
    data = submit_and_wait(
        "ecommerce/crunchbase/orgsearch",
        {"query": query, "pages": pages},
    )
    orgs = data.get("organizations", data.get("results", []))
    return [o["slug"] for o in orgs if o.get("slug")]


def fetch_org(slug: str) -> dict:
    """Fetch one organization's firmographics + funding rounds."""
    return submit_and_wait(
        "ecommerce/crunchbase/organization",
        {"url": slug},
    )


def flatten_funding(org: dict) -> list[dict]:
    """One row per funding round, firmographics repeated on each row."""
    base = {
        "name": org.get("name"),
        "slug": org.get("slug"),
        "categories": ", ".join(org.get("categories", [])),
        "location": org.get("location"),
        "founded_year": org.get("founded_year"),
        "headcount_band": org.get("headcount_band"),
        "website": org.get("website"),
        "total_funding_usd": org.get("total_funding_usd"),
        "num_funding_rounds": org.get("num_funding_rounds"),
    }
    rounds = org.get("funding_rounds", [])
    if not rounds:
        return [{**base, "round": None, "round_date": None,
                 "round_raised_usd": None, "investors": None}]
    rows = []
    for rnd in rounds:
        rows.append({
            **base,
            "round": rnd.get("round"),
            "round_date": rnd.get("date"),
            # The field table lists the key as `raised`; accept either spelling.
            "round_raised_usd": rnd.get("raised_usd", rnd.get("raised")),
            "investors": ", ".join(rnd.get("investors", [])),
        })
    return rows


def run(query: str, out_path: str, max_orgs: int = 25) -> int:
    slugs = search_org_slugs(query, pages=1)[:max_orgs]
    print(f"found {len(slugs)} organizations for '{query}'")

    all_rows = []
    for slug in slugs:
        try:
            org = fetch_org(slug)
            org.setdefault("slug", slug)
            all_rows.extend(flatten_funding(org))
            print(f"  ok: {slug}")
        except Exception as e:
            print(f"  skip {slug}: {e}")

    fields = [
        "name", "slug", "categories", "location", "founded_year",
        "headcount_band", "website", "total_funding_usd",
        "num_funding_rounds", "round", "round_date",
        "round_raised_usd", "investors",
    ]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        w.writeheader()
        w.writerows(all_rows)
    return len(all_rows)


if __name__ == "__main__":
    n = run("ai developer tools", "ai_devtools_funding.csv", max_orgs=25)
    print(f"wrote {n} funding-round rows")

Run it once and you have a CSV where every row is a single funding round, with the company's firmographics repeated alongside so the file is usable on its own without a join. A company with three rounds produces three rows; a company with no disclosed funding produces one row with empty funding fields, so nothing silently drops out of the file.

Step 5 — Flattening Funding Rounds to CSV Rows

The shape decision worth being deliberate about is how you flatten nested funding rounds into a flat table, because there are two valid layouts and they serve different questions.

One row per round (what the script above writes) is the right default for funding analysis. Each round is its own observation, so you can pivot on round type, sum raised amounts by quarter, or filter to "every Series A announced in the last six months." The cost is that firmographic fields repeat across a company's rows — which is exactly what you want for analysis, and exactly what you do not want if you are loading a normalized database.
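
The two analyses mentioned here (raised amount by quarter, and filtering to recent rounds of one type) need nothing beyond the standard library. A sketch over rows shaped like the per-round CSV; it assumes round_date is an ISO YYYY-MM-DD string and round_raised_usd is numeric, neither of which the API is confirmed to guarantee. The function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def quarter_of(iso_date: str) -> str:
    """Map '2023-03-20' to '2023-Q1'."""
    d = date.fromisoformat(iso_date)
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def raised_by_quarter(rows: list[dict]) -> dict[str, int]:
    """Sum round_raised_usd per calendar quarter, skipping incomplete rows."""
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        if row.get("round_date") and row.get("round_raised_usd"):
            # csv.DictReader yields strings, so coerce before adding.
            totals[quarter_of(row["round_date"])] += int(float(row["round_raised_usd"]))
    return dict(totals)


def rounds_of_type(rows: list[dict], round_type: str, since: date) -> list[dict]:
    """Rows of one round type announced on or after `since`."""
    return [
        r for r in rows
        if r.get("round") == round_type
        and r.get("round_date")
        and date.fromisoformat(r["round_date"]) >= since
    ]
```

Feed either function the output of csv.DictReader over the per-round file; rows for companies with no disclosed funding are skipped because their round fields are empty.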

One row per company is the right shape for a lead list or CRM import, where each company is a single record. Collapse the rounds into summary columns:

import pandas as pd

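df = pd.read_csv("ai_devtools_funding.csv")

# "last" below only means "latest" when rows are in date order, so sort explicitly.
df["round_date"] = pd.to_datetime(df["round_date"], errors="coerce")
df = df.sort_values("round_date")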

per_company = (
    df.groupby("slug")
    .agg(
        name=("name", "first"),
        categories=("categories", "first"),
        location=("location", "first"),
        founded_year=("founded_year", "first"),
        headcount_band=("headcount_band", "first"),
        total_funding_usd=("total_funding_usd", "first"),
        num_rounds=("round", lambda s: s.notna().sum()),
        latest_round=("round", "last"),
        latest_round_date=("round_date", "last"),
    )
    .reset_index()
    .sort_values("total_funding_usd", ascending=False)
)

per_company.to_csv("ai_devtools_companies.csv", index=False)

That gives one row per company, sorted by total funding, with the latest round surfaced as a single column — the shape a sales or BD team drops straight into a CRM. Keep both files: the per-round CSV for analysis, the per-company CSV for outreach.

Why Crunchbase Is Hard to Scrape Yourself

If you have tried pointing a plain requests call or a vanilla headless Chrome at a Crunchbase page, you have probably gotten back a challenge page or an empty body instead of data. That is not a bug in your code — it is Cloudflare Turnstile doing its job.

Crunchbase fronts its pages with Cloudflare Turnstile, a challenge that profiles the browser environment rather than asking a human to click puzzles. It inspects the things that distinguish a real browser from an automated one: the headless-Chrome user-agent and JavaScript fingerprint, missing or fake GPU and canvas surfaces, the navigator.webdriver automation flag, and the absence of a genuine display server. A standard headless browser fails most of those checks and gets stuck in the challenge loop — it never reaches the page HTML, so your parser has nothing to parse.
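
If you do experiment with your own fetcher, it helps to detect a blocked response explicitly rather than hand a challenge page to your parser. A heuristic sketch: Cloudflare documents the cf-mitigated: challenge response header for challenge pages, and the Turnstile script is served from challenges.cloudflare.com, but the body check is a heuristic that can also match a legitimate page that merely embeds a Turnstile widget. looks_blocked is a hypothetical helper name.

```python
def looks_blocked(headers: dict, body: str) -> bool:
    """Heuristically decide whether a response is a challenge, not content."""
    lowered = {k.lower(): str(v) for k, v in headers.items()}
    # Header Cloudflare sets on challenge-page responses.
    if lowered.get("cf-mitigated", "").lower() == "challenge":
        return True
    # An empty body means there is nothing to parse either way.
    if not body.strip():
        return True
    # Heuristic: challenge pages load scripts from this host.
    return "challenges.cloudflare.com" in body
```

Treating a blocked response as a retryable error, instead of as an empty profile, keeps silent gaps out of the resulting dataset.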

Getting through requires a meaningfully heavier setup than "fetch a URL":

  • A headful (non-headless) browser. Turnstile reliably flags headless Chrome, so the browser has to run in its normal, visible mode — which on a server means there is no screen for it to draw to.
  • A virtual display. To run a headful browser on a headless server, you run it under a virtual framebuffer such as Xvfb, which gives the browser a real display server to render into without a physical monitor attached.
  • Automation tells patched out. The navigator.webdriver flag and the other CDP-automation fingerprints have to be suppressed so the browser does not announce itself as controlled.
  • Residential IPs. Datacenter IP ranges raise the Turnstile difficulty; requests that come from residential addresses look like ordinary visitors, which keeps the challenge from escalating.

That stack — headful Chrome under Xvfb, anti-detection patches, residential proxy rotation, plus the patience to absorb the occasional challenge retry — is most of the engineering in a DIY Crunchbase scraper. None of it is about parsing the page; it is all about being allowed to load the page in the first place. This is the honest reason a managed endpoint exists: it runs and maintains that anti-bot infrastructure so your side of the integration is just the async job loop and a CSV writer, never the Turnstile fight.

Scaling Across Many Sectors

When the job grows from one sector to a recurring sweep of ten — "every Monday, refresh the funding picture across AI infra, dev tools, fintech, healthtech, and the rest" — the bottleneck is the sequential detail crawl, where each organization waits on the previous job's poll loop. Two patterns keep that manageable.

First, run the detail jobs concurrently rather than one slug at a time. The submit_and_wait helper blocks on each job, but the submit call itself returns immediately, so you can submit a batch of organization jobs, collect their job IDs, and poll them as a group. That turns a 25-company sequential crawl into one parallel wave, bounded by your account's concurrency cap.
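
The submit-then-poll-as-a-group pattern can be sketched as follows. This is a minimal illustration, not the service's client: crawl_batch is a hypothetical helper, and submit, get_status, and get_result are callables you would implement around the three HTTP calls from Step 3. It does not enforce a concurrency cap, so pass it slugs in chunks no larger than your account allows.

```python
import time
from typing import Callable


def crawl_batch(
    slugs: list[str],
    submit: Callable[[str], str],       # slug -> job_id
    get_status: Callable[[str], str],   # job_id -> status string
    get_result: Callable[[str], dict],  # job_id -> parsed result
    interval_s: float = 3.0,
    timeout_s: float = 150.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[dict, dict]:
    """Submit every slug up front, then poll all open jobs together."""
    pending = {submit(slug): slug for slug in slugs}  # job_id -> slug
    results: dict[str, dict] = {}
    errors: dict[str, str] = {}
    deadline = time.monotonic() + timeout_s
    while pending and time.monotonic() < deadline:
        for job_id in list(pending):  # copy: we mutate pending while iterating
            status = get_status(job_id)
            if status == "completed":
                results[pending.pop(job_id)] = get_result(job_id)
            elif status == "failed":
                errors[pending.pop(job_id)] = "job failed"
        if pending:
            sleep(interval_s)
    # Anything still open when the deadline passes is reported, not dropped.
    for slug in pending.values():
        errors[slug] = "timed out"
    return results, errors
```

The wall-clock cost of a wave is then roughly that of its slowest job rather than the sum of all jobs.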

Second, treat the per-company CSV as a versioned snapshot and diff week over week on slug. The diff surfaces exactly the two events a deal-flow or BD team cares about: companies that took a new round since last week (a row whose num_rounds or latest_round_date changed), and companies that appeared in the sector for the first time. That weekly diff is the difference between re-reading a 25-row file every Monday and getting a five-line "here is what moved" report. Wire the search-plus-detail pipeline into a scheduled job, write the dated CSV, and diff against the prior week — the same recurring-snapshot pattern that turns any one-off scrape into a monitor.
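
The weekly diff itself is a few lines of standard-library Python. A sketch, assuming two per-company CSVs with the slug, num_rounds, and latest_round_date columns produced in Step 5; load_snapshot and diff_snapshots are hypothetical helper names.

```python
import csv


def load_snapshot(path: str) -> dict[str, dict]:
    """Read a per-company CSV into a dict keyed by slug."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["slug"]: row for row in csv.DictReader(f)}


def diff_snapshots(prev: dict[str, dict], curr: dict[str, dict]) -> dict[str, list[str]]:
    """Report companies that are new, and companies whose round data moved."""
    watched = ("num_rounds", "latest_round_date")
    new_companies = sorted(set(curr) - set(prev))
    new_rounds = sorted(
        slug for slug in set(curr) & set(prev)
        if any(curr[slug].get(col) != prev[slug].get(col) for col in watched)
    )
    return {"new_companies": new_companies, "new_rounds": new_rounds}
```

Both snapshots are compared as strings straight from the CSV, so write them with the same code path each week to avoid spurious differences from formatting alone.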

The Honest LogPose Fit

LogPose's Crunchbase endpoints fit well when the shape is "I need structured funding and firmographic data from public Crunchbase profiles, at more volume than I want to click through, and I do not want to build and maintain a Turnstile-beating browser stack." The async job pattern is identical across the organization, person, and hub endpoints — and identical to the other platforms on the API — so your integration stays one shape as you layer Crunchbase enrichment onto an existing lead pipeline. The Turnstile, headful-browser, and residential-proxy infrastructure is handled server-side, which is the part that otherwise eats the most engineering time.

The honest constraints are real. Deep search pagination needs a connected Crunchbase Pro account via account_id — without it you get the first page of results, which is enough to seed a detail crawl but not to enumerate an entire category exhaustively. And the data ceiling is the public profile: funding rounds, investors, firmographics, headcount bands — not exact revenue, exact headcount, or anything behind a Pro login. If your need is the licensed full dataset rather than the public profile layer, that is a Crunchbase subscription question, not a scraping one.

Get Started

  1. Sign up at logposervices.com and generate an API key under Tool → API Keys.
  2. export LOGPOSE_API_KEY=lp_xxxxxxx
  3. Search a sector, fetch the detail, write the CSV:
curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/orgsearch" \
  -H "X-API-Key: $LOGPOSE_API_KEY" \
  --data-urlencode "query=ai developer tools"

Drop the Python pipeline above into a file, point it at your sector, and the funding CSV is on disk after one run.

Related reading: How to build a VC deal-flow list from Crunchbase for the deal-sourcing workflow built on top of this pipeline, Crunchbase API alternatives for funding and investor data for the managed-vs-DIY comparison, and How to enrich business leads with emails, phones, and socials for the chained workflow that turns funding rows into outreach-ready records.

Frequently asked questions

Is it legal to scrape Crunchbase?
The organization profile data Crunchbase shows to anyone without a login (company description, sector categories, headquarters location, founding year, headcount band, named funding rounds, and the investors attached to each round) is public web data. In the United States, the Ninth Circuit held in hiQ Labs v. LinkedIn (9th Cir. 2022) that scraping publicly accessible pages likely does not violate the Computer Fraud and Abuse Act. In the EU, company-level firmographic and funding facts are not personal data, so the GDPR does not govern them; the names of individual founders and investors are personal data, and collecting those needs a lawful basis such as legitimate interest. A separate question from legality is Crunchbase's own Terms of Service, which restrict automated access and forbid republishing their dataset wholesale or cloning it into a competing product. The honest distinction: pulling public profile fields into your own spreadsheet for analysis sits on much firmer ground than redistributing Crunchbase's compiled database as your own, which is a ToS and database-rights problem regardless of how you collected it. For internal deal-flow research and lead enrichment, the scrape is on reasonably solid footing; for building a public Crunchbase clone, it is not.
Why is Crunchbase hard to scrape with a headless browser?
Crunchbase fronts its pages with Cloudflare Turnstile, a challenge that profiles the browser environment — headless-Chrome fingerprints, missing GPU surfaces, automation flags, and the absence of a real display all trip it. A vanilla headless Chrome or a plain requests call gets stuck in the challenge loop and never reaches the page HTML, which is why naive scrapers return an empty body or a challenge page instead of data. Getting through requires running a real, non-headless browser under a virtual display (Xvfb) with the automation tells patched out, plus residential IPs so the request looks like an ordinary visitor rather than a datacenter bot. That stack — headful Chrome, virtual framebuffer, anti-detection patches, residential proxy rotation — is most of the work in a DIY Crunchbase scraper. A managed endpoint runs that infrastructure for you and hands back parsed JSON, so your code never has to solve the Turnstile problem.
