How to Build a VC Deal-Flow List from Crunchbase

· 10 min read

For an investor or a BD lead, the most valuable company in any sector is the one that closed its round last week. The founder is hiring, picking vendors, and — if your thesis fits — open to a conversation before a dozen other funds have filled their inbox. The problem is that "show me everything in climate hardware that raised a Series A this month" is not a button you can press. Crunchbase has the data, but turning it into a ranked, deduplicated deal-flow list that you can re-run every Monday takes a small pipeline. This guide walks that pipeline end to end: define a thesis filter, search for matching companies, pull each company's latest round, score by recency and fit, and then run a weekly diff so you only ever look at what is genuinely new.

The whole thing lives in one Python file plus a handful of curl calls, and the only moving piece that is genuinely hard — getting past the Cloudflare challenge that Crunchbase puts in front of its pages — is the part you can hand to a managed endpoint.

Why Deal Flow Is a Diff Problem, Not a Search Problem

A one-shot Crunchbase search gets you a snapshot: every company currently tagged with your sector and stage. That is useful exactly once, for the initial backfill. The day after, it is the same list, and a week later it is the same list plus a few rows you have to spot by eye. Reading the same hundred companies every Monday to find the four that are new is how good leads get missed.

What you actually want is the delta — the organizations that appeared in this week's thesis search but were not in last week's, plus the ones whose funding stage advanced (a company you saw at Seed that is now tagged Series A raised in between). That feed does not exist as a Crunchbase product, but the pattern to build it is short, and it is the same diff loop that local-lead teams use to catch new Yellow Pages businesses the week they open. The only differences are the identifier (an org slug instead of a phone number) and the change you care about (a new funding stage instead of a brand-new listing).

First-mover advantage is real here. A founder who just announced a round is in active vendor- and investor-conversation mode for a few weeks, then settles. Reaching them in week one is a categorically different conversation than reaching them in week ten.

Step 1 — Define a Thesis Filter

Before any API call, write down the filter in three parts, because everything downstream keys off it:

  • Sector keywords. The terms you would actually type into Crunchbase search — climate hardware, developer tools, vertical SaaS healthcare, RWA tokenization. Keep a short list; you will run one search per keyword.
  • Stage band. The rounds you write checks into or track — Pre-Seed and Seed for an early fund, Series A/B for a growth fund. This is a post-filter you apply after pulling org detail, since search results do not reliably carry the stage.
  • Geography. Often a hard gate (a US-only fund ignores everything else) and sometimes a soft signal (you will look at EU but rank US higher). Crunchbase profiles carry an HQ location you can filter on.

The thesis filter is the contract for the rest of the pipeline. A vague filter ("AI") returns thousands of rows and no signal; a sharp one ("AI infrastructure, Seed–Series A, US/Canada") returns a list short enough to read every name.
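
One way to keep that contract explicit is to write the filter down as data before any request goes out. The sketch below is only an illustration: `ThesisFilter`, its field names, and the 0.5 weight for a soft geography are assumptions made for this example, not part of any API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThesisFilter:
    """Hypothetical container for the three-part thesis filter."""
    keywords: tuple[str, ...]                    # one search per keyword
    stage_band: frozenset[str]                   # post-filter applied after org detail
    target_geos: frozenset[str] = frozenset()    # empty means no geography gate
    soft_geos: frozenset[str] = frozenset()      # allowed, but ranked lower

    def geo_weight(self, country: str) -> float:
        """1.0 for a target country, 0.5 for a soft match, 0.0 if gated out."""
        if not self.target_geos or country in self.target_geos:
            return 1.0
        if country in self.soft_geos:
            return 0.5
        return 0.0


# Example values only; swap in your own thesis.
AI_INFRA = ThesisFilter(
    keywords=("AI infrastructure",),
    stage_band=frozenset({"Seed", "Series A"}),
    target_geos=frozenset({"United States", "Canada"}),
    soft_geos=frozenset({"Germany", "France"}),
)
```

A weight of 0.0 works as the hard gate, while the 0.5 case lets a soft geography through at a lower rank.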

Step 2 — Search for Matching Companies

Discovery is two endpoints. Organizations come from orgsearch, and funds, accelerators, and investment hubs come from fundsearch — useful when your thesis is "who is co-investing in this space" rather than "which startups are raising."

# Organizations matching a sector keyword
curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/orgsearch" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "query=climate hardware" \
  --data-urlencode "pages=1"
# → {"job_id": "cb_4f1a...", "status": "pending"}

# Funds / accelerators / hubs in the same space
curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/fundsearch" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "query=climate tech fund" \
  --data-urlencode "pages=1"

Both calls are asynchronous. The GET returns a job envelope — {"job_id": "...", "status": "pending"} — rather than the results inline, because Crunchbase's pages sit behind a Cloudflare challenge and rendering them takes longer than a fast HTTP round trip. You poll the job and fetch the result once it completes (Step 3 shows the loop in Python).

One honest nuance on pagination: pages only walks past the first page of results when you have a connected Crunchbase Pro account passed via an account_id parameter. Without a connected account, you get the first page only, regardless of what you set pages to. For a focused thesis filter that is often fine — the first page of a sharp keyword search is usually the relevant cohort — but if your sector is broad and you need depth, connect a Pro account so multi-page pagination unlocks.
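
The pagination rule can be encoded where the request parameters are built, so a run without a connected account never asks for pages it will not receive. This is a small sketch: `search_params` is a hypothetical helper, and it assumes `account_id` is passed as an ordinary query parameter alongside `query` and `pages`.

```python
from typing import Optional


def search_params(query: str, pages: int = 1, account_id: Optional[str] = None) -> dict:
    """Build query parameters for an orgsearch or fundsearch call."""
    params = {"query": query, "pages": pages}
    if account_id:
        # With a connected Pro account, pages beyond the first are honored.
        params["account_id"] = account_id
    else:
        # Without one, only the first page comes back, so request just that.
        params["pages"] = 1
    return params
```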

The search result gives you, per company, the slug you need for the detail call (the trailing path segment of the Crunchbase URL, e.g. acme-robotics) plus enough surface — name, short description, HQ — to drop obvious mismatches before you spend a detail pull on them.
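
If you end up holding full Crunchbase URLs rather than slugs (from a spreadsheet or a bookmark list, say), the trailing path segment can be extracted with the standard library. A minimal sketch; `to_slug` is a hypothetical helper, and lowercasing the result is an assumption about how slugs are normalized.

```python
from urllib.parse import urlparse


def to_slug(url_or_slug: str) -> str:
    """Return the trailing path segment of a Crunchbase URL, or the slug itself."""
    text = url_or_slug.strip()
    path = urlparse(text).path if "://" in text else text
    # Drop empty segments so a trailing slash does not yield an empty slug.
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError(f"no slug found in {url_or_slug!r}")
    return segments[-1].lower()
```

Using the slug as the dictionary key downstream means a company found through a URL and the same company found through search collapse into one entry.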

Step 3 — Pull the Latest Round from Org Detail

Search tells you a company exists in your space; it does not reliably tell you what they last raised. For that you call the organization detail endpoint, which accepts either a full Crunchbase URL or a bare slug:

curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/organization" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "url=acme-robotics"
# → {"job_id": "cb_9b22...", "status": "pending"}

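# Poll the job status until it reports "completed"
curl "https://api.logposervices.com/api/v1/jobs/cb_9b22" \
  -H "X-API-Key: lp_xxxxxxx"

# Then fetch the result
curl "https://api.logposervices.com/api/v1/jobs/cb_9b22/result" \
  -H "X-API-Key: lp_xxxxxxx"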

The detail result carries the firmographics and — the part you actually came for — the funding history: each round's stage (Seed, Series A, …), the amount band, the announced date, and the lead and participating investors. That is everything your thesis post-filter and your scoring need: stage to gate, date to rank by recency, investors to spot when a fund you respect is already in.
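
Picking the latest round out of that history is a one-liner, but it is worth guarding against rounds with no date. The sketch below assumes the field names used later in this post (`funding_rounds`, `announced_on`, `stage`) and assumes dates arrive as ISO `YYYY-MM-DD` strings; check both against a real response before relying on it.

```python
def latest_round(detail: dict) -> dict:
    """Return the most recently announced round, or {} if none is usable."""
    rounds = detail.get("funding_rounds") or []
    # Rounds without a date are skipped, since they cannot be ranked by recency.
    dated = [r for r in rounds if r.get("announced_on")]
    if not dated:
        return {}
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    return max(dated, key=lambda r: r["announced_on"])


# Made-up sample shaped like the assumed detail result.
sample = {
    "funding_rounds": [
        {"stage": "Seed", "announced_on": "2023-02-10"},
        {"stage": "Series A", "announced_on": "2024-06-03"},
        {"stage": "Grant", "announced_on": None},
    ]
}
```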

Because both the search and the detail calls are async and Crunchbase pages render slowly behind the challenge, the right way to call them from code is a submit-and-poll helper. api.logposervices.com sits behind Cloudflare with roughly a 90-second edge timeout, so a synchronous call on a slow render returns a 524 even though the job keeps going server-side — always poll.

import os, time, requests

API_KEY = os.environ["LOGPOSE_API_KEY"]
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def submit_and_wait(path: str, params: dict, timeout_s: int = 120) -> dict:
    r = requests.get(f"{BASE}/{path}", params=params, headers=HEADERS, timeout=30)
    r.raise_for_status()
    job_id = r.json()["job_id"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        s = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS, timeout=15).json()
        if s["status"] == "completed":
            return requests.get(
                f"{BASE}/jobs/{job_id}/result", headers=HEADERS, timeout=15
            ).json()
        if s["status"] == "failed":
            raise RuntimeError(s.get("error", "unknown failure"))
        time.sleep(2)
    raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")
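
A fixed two-second sleep is fine for a handful of companies. For longer runs, a backoff variant makes fewer status calls on slow renders. The following is a sketch rather than a drop-in replacement: `poll_until_done` and its parameters are hypothetical, and it takes the status lookup as a callable so the loop can be exercised without touching the network.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], dict],
    timeout_s: float = 120.0,
    first_delay_s: float = 1.0,
    max_delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll get_status() with exponential backoff until completed or failed."""
    deadline = time.monotonic() + timeout_s
    delay = first_delay_s
    while time.monotonic() < deadline:
        status = get_status()
        if status.get("status") == "completed":
            return status
        if status.get("status") == "failed":
            raise RuntimeError(status.get("error", "unknown failure"))
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # 1s, 2s, 4s, 8s, then capped at 10s
    raise TimeoutError(f"job did not finish in {timeout_s}s")
```

In the helper above, `get_status` would be a lambda wrapping the `requests.get(f"{BASE}/jobs/{job_id}", ...)` call.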

Step 4 — Score and Rank by Recency and Fit

Once each company has a latest round attached, the deal-flow list is a sort, not a search. A simple, transparent score beats a clever one here — you want to be able to explain why a row is at the top.

Two factors carry most of the signal: how recently the round was announced, and how well the company fits the thesis. Recency is a decay on the announced date; fit is a small set of boolean bonuses you can read off the detail result.

import datetime as dt


def score(company: dict, stage_band: set[str]) -> float:
    rounds = company.get("funding_rounds", [])
    if not rounds:
        return 0.0
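    # Treat a missing or null date as the oldest possible value when sorting.
    latest = max(rounds, key=lambda x: x.get("announced_on") or "")

    # Recency: 1.0 on the announcement day, decaying linearly to 0 over a year.
    try:
        announced = dt.date.fromisoformat(latest["announced_on"])
        days_old = (dt.date.today() - announced).days
    except (KeyError, TypeError, ValueError):
        # Missing, null, or malformed date: treat the round as a year old.
        days_old = 365
    # Clamp so a future-dated round cannot score above 1.0.
    recency = min(1.0, max(0.0, 1.0 - days_old / 365.0))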

    # Fit: stage in band, HQ in target geography, a known lead investor.
    fit = 0.0
    if latest.get("stage") in stage_band:
        fit += 0.5
    if company.get("hq_country") in {"United States", "Canada"}:
        fit += 0.3
    if latest.get("lead_investors"):
        fit += 0.2

    return round(0.6 * recency + 0.4 * fit, 3)

Tune the weights to your fund — an early-stage fund leans harder on recency because the window to get in is short; a thesis-driven fund leans harder on fit. The output is one number per company, and the deal-flow sheet is that list sorted descending.
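
To see what tuning does in practice, it helps to pull the final blend into its own function and try two weightings on the same pair of companies. This is an illustrative sketch: `blend` is a hypothetical helper, and the recency and fit values are made-up inputs.

```python
def blend(recency: float, fit: float, w_recency: float = 0.6) -> float:
    """Blend two 0..1 signals; fit gets whatever weight recency leaves over."""
    if not 0.0 <= w_recency <= 1.0:
        raise ValueError("w_recency must be between 0 and 1")
    return round(w_recency * recency + (1.0 - w_recency) * fit, 3)


# Company A: very recent round, partial fit. Company B: older round, full fit.
a_default = blend(0.9, 0.5)                 # 0.6*0.9 + 0.4*0.5 = 0.74
b_default = blend(0.5, 1.0)                 # 0.6*0.5 + 0.4*1.0 = 0.70
a_fit_led = blend(0.9, 0.5, w_recency=0.3)  # 0.3*0.9 + 0.7*0.5 = 0.62
b_fit_led = blend(0.5, 1.0, w_recency=0.3)  # 0.3*0.5 + 0.7*1.0 = 0.85
```

With the default weights A ranks above B. Shifting the weight toward fit flips the order, which is the trade-off described above.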

Step 5 — The Weekly Diff: Surface Only Net-New Raises

This is the step that turns a one-time list into a deal-flow engine. Persist the org slugs you have already reviewed along with the funding stage you last saw them at. Each week, re-run the thesis search and detail pulls, and surface only two kinds of rows: slugs you have never seen, and slugs whose stage advanced since last week (a Seed company that is now Series A raised in the interim — exactly the event you want to catch).

The persistence is tiny — one row per slug. SQLite is plenty; there is no reason to run Postgres for a deal-flow tracker.

import sqlite3, datetime as dt
from contextlib import closing

DB = "deal_flow.db"


def init_db():
    with closing(sqlite3.connect(DB)) as c:
        c.execute(
            """CREATE TABLE IF NOT EXISTS seen (
                   slug TEXT PRIMARY KEY,
                   name TEXT,
                   last_stage TEXT,
                   first_seen_at TEXT,
                   last_seen_at TEXT
               )"""
        )
        c.commit()


def diff_thesis(keywords: list[str], stage_band: set[str]) -> list[dict]:
    """Return only net-new or stage-advanced companies for this run."""
    init_db()
    now = dt.datetime.now(dt.timezone.utc).isoformat()
    candidates: dict[str, dict] = {}

    # 1) Discover org slugs across every thesis keyword.
    for kw in keywords:
        res = submit_and_wait(
            "ecommerce/crunchbase/orgsearch", {"query": kw, "pages": 1}
        )
        for org in res.get("organizations", []):
            slug = org.get("slug")
            if slug:
                candidates[slug] = org  # dedupe across keywords by slug

    # 2) Pull detail, score, and diff against what we have already reviewed.
    new_rows: list[dict] = []
    with closing(sqlite3.connect(DB)) as c:
        cur = c.cursor()
        prior = {row[0]: row[1] for row in cur.execute("SELECT slug, last_stage FROM seen")}

        for slug in candidates:
            detail = submit_and_wait(
                "ecommerce/crunchbase/organization", {"url": slug}
            )
            rounds = detail.get("funding_rounds") or []
            # A null announced_on sorts as oldest instead of raising on comparison.
            latest = max(rounds, key=lambda x: x.get("announced_on") or "", default={})
            stage = latest.get("stage")

            is_new = slug not in prior
            advanced = (not is_new) and stage and stage != prior.get(slug)
            if (is_new or advanced) and stage in stage_band:
                new_rows.append(
                    {
                        "slug": slug,
                        "name": detail.get("name", ""),
                        "stage": stage,
                        "amount": latest.get("amount", ""),
                        "announced_on": latest.get("announced_on", ""),
                        "lead_investors": ", ".join(latest.get("lead_investors", [])),
                        "score": score(detail, stage_band),
                        "reason": "new" if is_new else f"advanced→{stage}",
                    }
                )

            cur.execute(
                """INSERT INTO seen (slug, name, last_stage, first_seen_at, last_seen_at)
                   VALUES (?, ?, ?, ?, ?)
                   ON CONFLICT(slug) DO UPDATE SET
                       last_stage = COALESCE(excluded.last_stage, last_stage),
                       last_seen_at = excluded.last_seen_at""",
                (slug, detail.get("name", ""), stage, now, now),
            )
        c.commit()

    new_rows.sort(key=lambda r: r["score"], reverse=True)
    return new_rows

On the first run, every slug is "new" relative to an empty database, so you get the full backfill — that is your initial deal-flow list. From the second Monday on, the function returns only the genuinely new and the stage-advanced, ranked. That is the feed you read.
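
One detail worth knowing: the function above flags any change of stage label as "advanced", including a relabel in the wrong direction. If you only want forward movement, the decision can be isolated in a small pure function. This is a sketch under assumptions: `classify` is a hypothetical helper, and `STAGE_ORDER` is an illustrative ordering that should be extended to match the labels you actually see.

```python
from typing import Optional

# Assumed ordering for illustration; extend it to match the stage labels you see.
STAGE_ORDER = ["Pre-Seed", "Seed", "Series A", "Series B", "Series C"]


def classify(slug: str, stage: Optional[str], prior: dict) -> Optional[str]:
    """Return "new", "advanced→<stage>", or None when there is nothing to surface."""
    if slug not in prior:
        return "new"
    last = prior[slug]
    if stage in STAGE_ORDER and last in STAGE_ORDER:
        if STAGE_ORDER.index(stage) > STAGE_ORDER.index(last):
            return f"advanced→{stage}"
        return None
    # Unknown labels: fall back to "changed", ignoring a missing stage.
    if stage and stage != last:
        return f"advanced→{stage}"
    return None
```

Because it takes the `prior` mapping as an argument, the function can be checked against a hand-written dictionary before it is wired into the weekly run.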

Step 6 — Output a Deal-Flow Sheet

The last step is making the ranked rows usable by a human who is not looking at a terminal. A CSV drops straight into a spreadsheet, a shared Notion table, or a CRM import.

import csv, datetime as dt

KEYWORDS = ["climate hardware", "grid software", "industrial decarbonization"]
STAGE_BAND = {"Seed", "Series A"}


def write_sheet(rows: list[dict]) -> str:
    stamp = dt.date.today().isoformat()
    path = f"deal_flow_{stamp}.csv"
    cols = ["score", "name", "stage", "amount", "announced_on",
            "lead_investors", "reason", "slug"]
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=cols)
        w.writeheader()
        for r in rows:
            w.writerow({c: r.get(c, "") for c in cols})
    return path


if __name__ == "__main__":
    rows = diff_thesis(KEYWORDS, STAGE_BAND)
    out = write_sheet(rows)
    print(f"{len(rows)} new/advanced companies → {out}")
    for r in rows[:15]:
        print(f"  {r['score']:>5}  {r['name']} — {r['stage']} ({r['announced_on']}) [{r['reason']}]")

Schedule the script for early Monday with cron, GitHub Actions, or any hosted scheduler, and the sheet is waiting before the partner meeting. If you would rather the new rows land in Slack or an inbox, the same rows list POSTs to a webhook in a few lines — the diff has already done the hard part of deciding what is worth an alert.
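
As a rough sketch of the webhook route, the rows can be folded into one text message and sent with the standard library. `build_payload` and `post_rows` are hypothetical helpers. The `{"text": ...}` body is the shape Slack incoming webhooks accept; other receivers may expect something different, so treat the payload as an assumption to adjust.

```python
import json
import urllib.request


def build_payload(rows: list[dict], limit: int = 10) -> dict:
    """Format the top rows as one plain-text message."""
    lines = [f"{len(rows)} new/advanced companies this week"]
    for r in rows[:limit]:
        lines.append(
            f"{r.get('score', '')}  {r.get('name', '')}: {r.get('stage', '')} "
            f"({r.get('announced_on', '')}) [{r.get('reason', '')}]"
        )
    return {"text": "\n".join(lines)}


def post_rows(webhook_url: str, rows: list[dict]) -> int:
    """POST the message as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_payload(rows)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=15) as resp:
        return resp.status
```

Calling `post_rows(url, rows)` only when `rows` is non-empty keeps quiet weeks quiet.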

Scaling Deal Flow Across Theses and Geographies with LogPose

The single-thesis script scales to a full fund's coverage by varying two things: the keyword list and the stage band. A multi-strategy fund running early-stage climate, growth-stage fintech, and a watch-list of competitors' portfolios is three keyword sets, three stage bands, three SQLite tables — or one table keyed by a thesis column the same way the Yellow Pages monitor keys by category_geo.
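
The one-table variant could look like the sketch below, which keys each row by thesis and slug so the same company can sit at different stages of review under different theses. The helper names and the thesis labels in the usage note are assumptions for illustration. Like the script above, it relies on SQLite's `ON CONFLICT` upsert, which needs SQLite 3.24 or newer.

```python
import sqlite3


def init_multi_thesis(conn: sqlite3.Connection) -> None:
    """Create one seen-table shared by every thesis, keyed by (thesis, slug)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS seen (
               thesis TEXT NOT NULL,
               slug TEXT NOT NULL,
               last_stage TEXT,
               last_seen_at TEXT,
               PRIMARY KEY (thesis, slug)
           )"""
    )


def upsert_seen(conn: sqlite3.Connection, thesis: str, slug: str,
                stage: str, now: str) -> None:
    """Insert a row, or update stage and timestamp if (thesis, slug) exists."""
    conn.execute(
        """INSERT INTO seen (thesis, slug, last_stage, last_seen_at)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(thesis, slug) DO UPDATE SET
               last_stage = excluded.last_stage,
               last_seen_at = excluded.last_seen_at""",
        (thesis, slug, stage, now),
    )


def prior_for(conn: sqlite3.Connection, thesis: str) -> dict:
    """Return {slug: last_stage} for a single thesis."""
    rows = conn.execute(
        "SELECT slug, last_stage FROM seen WHERE thesis = ?", (thesis,)
    )
    return {slug: stage for slug, stage in rows}
```

Each weekly run then calls `prior_for(conn, "climate-early")` (or whichever label you use) in place of the unfiltered `SELECT`, and the rest of the diff logic stays the same.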

Two operational realities make this easier to run on a managed endpoint than on a homegrown scraper. First, Crunchbase fronts its pages with Cloudflare Turnstile, which typically blocks stock headless browsers, so a DIY scraper either loops on the challenge or burns engineering time keeping a headful, fingerprint-managed browser alive. A managed endpoint absorbs that challenge for you, so your code only ever sees clean JSON. Second, the async submit-and-poll pattern is identical across LogPose's other endpoints, so the same diff harness that watches Crunchbase rounds can watch Yellow Pages categories for new businesses, or e-commerce product listings, without a second integration shape. For multi-page discovery on broad theses, connecting a Crunchbase Pro account via account_id unlocks pagination past the first page; for sharp filters, the first page is usually the cohort that matters.

Where Crunchbase Deal Flow Works Well — and Where It Doesn't

This pipeline is the right tool when your sourcing is thesis-driven and recurring: you have a small set of sectors and stages, you care about reaching founders soon after a raise, and you want the same list refreshed weekly without re-reading it by hand. The diff loop is genuinely high-leverage there, and Crunchbase's funding coverage is deep enough that the latest-round read is reliable for announced and disclosed rounds.

It is the wrong tool for a few honest cases. If you need stealth or pre-announcement deals, Crunchbase is structurally late — it surfaces rounds after they are public, so the diff catches the announcement, not the close. If your thesis is broad and you need hundreds of results per keyword, you will hit the first-page pagination limit unless you connect a Pro account. And if your sourcing is relationship-driven rather than data-driven — warm intros, a tight network — a deal-flow scraper is a supplement to your funnel, not the funnel itself. The tool earns its place when the volume of companies in your space is larger than you can track by reading, which for most sector-focused funds it is.

A word on legality, since it always comes up: this pipeline reads public company profiles — the same pages any logged-out visitor sees. Scraping public, non-authenticated web data has been treated as permissible in the US (hiQ Labs v. LinkedIn, 9th Cir. 2022), though that addresses the CFAA, not a site's terms of service or copyright in the underlying compilation. Respect Crunchbase's terms, do not scrape behind a login you are not entitled to use, and treat the output as sourcing signal rather than redistributable data. If you are operating at scale or commercializing the dataset, get your own legal read.

Get Started

Sign up at logposervices.com, generate an API key from Tool → API Keys, and submit your first thesis search against /api/v1/ecommerce/crunchbase/orgsearch?query=.... Poll the returned job_id, fetch the result, and you have the first column of your deal-flow sheet — then drop in the diff script above to make it net-new from the second Monday on.

Related reading: How to scrape Crunchbase startup funding data for the end-to-end extraction details, Crunchbase API alternatives for funding and investor data for the tool landscape, and How to monitor Yellow Pages for new businesses for the same diff pattern applied to local lead-gen.

Frequently asked questions

How fresh is Crunchbase funding data?
Crunchbase is one of the faster-updating funding sources because it blends machine ingestion of press releases and SEC filings with a community of contributors and a venture-data team. A priced round announced via press release typically appears on the company profile within a few days; rounds disclosed through regulatory filings can lag a week or two until the filing is processed. The point of a weekly diff is to absorb that variance — you do not need to know the exact hour a round landed, you need to catch it the first week it shows up in your thesis filter, which a Monday re-run reliably does. Treat the date on the profile as 'announced or disclosed,' not 'closed,' since founders often close weeks before the public announcement.

Can you get alerted when a company raises a new round?
Not natively from a single Crunchbase endpoint: there is no public 'new round' webhook you can subscribe to per company. The standard pattern is the weekly-diff loop described in this post: store the org slugs (and their last-seen funding stage) you have already reviewed, re-run your thesis search and detail pulls on a schedule, and surface only the organizations whose stage changed or that appeared for the first time. That is a few dozen lines of Python on top of any scraper that returns a stable slug and the funding stage. If you want the alert to land in Slack or an inbox instead of a console, the diff script's output rows are trivial to POST to a webhook the same way you would for any other lead feed.
