Is it legal to scrape public Crunchbase profile data?

The firmographic data on a public Crunchbase organization page (company name, description, headquarters, founding year, and the headline rounds it has disclosed) is displayed without authentication to anyone who opens the page. In hiQ Labs v. LinkedIn (9th Cir. 2022), the court held that accessing data a site makes publicly available likely does not constitute access "without authorization" under the CFAA. That precedent addresses the access-method question under one U.S. statute; it is not a blanket permission, and it does not settle contract claims. The important distinction for an investor: if you hold a paid Crunchbase license (Pro, Enterprise, or an API contract), that agreement carries its own Terms of Service governing bulk and structured export of its dataset, and those contractual terms bind you independently of the public-access question. The pipeline in this guide reads the same public organization pages a human visitor sees; if you also pay for a Crunchbase license to get its full structured financials, treat the license ToS as the controlling document for anything you export from inside the paid product.

Why does Crunchbase need the async submit-and-poll pattern instead of a direct request?

Crunchbase organization and search pages sit behind a Cloudflare challenge that has to be cleared before the real content renders, so a fetch is not a single fast HTTP round-trip — it is a managed browser session that waits for the challenge to pass and the page to hydrate, which routinely takes longer than a normal API call. On top of that, api.logposervices.com is itself proxied through Cloudflare, which kills any single inbound connection at roughly 90 seconds. If you tried to hold one synchronous request open while a Crunchbase page cleared its challenge and paged through results, you would hit that 90-second edge cap and get a dropped connection rather than data. Submitting a job, getting a job id back immediately, and polling for the result sidesteps both problems: the slow Cloudflare-gated work happens server-side with no open connection to time out, and your script just watches the queue.


The Deal Scout's Weekly Funding Digest from Crunchbase

June 23, 2026 · 12 min read

If you scout deals for a fund or run corporate development, your job is not "watch the whole startup market." It is to know, every week, which companies inside your specific thesis just raised — climate hardware, vertical SaaS for logistics, developer tooling, whatever your partners actually write checks into. Crunchbase is the canonical public source for that signal: founders and PR teams keep their funding announcements current there, so a fresh round usually shows up on the org page within days. The problem is that staying current means refreshing a dozen saved searches by hand every Monday, eyeballing which names are actually new, and copying the interesting ones into a doc before standup.

This guide builds the thing that replaces that ritual: an automated weekly funding digest, scoped to your thesis keywords, deduped so the same company surfacing under two searches counts once, enriched with detail from each org page, filtered down to genuinely recent rounds, and formatted to drop into Notion or Slack. We will cover why Crunchbase forces an asynchronous scraping pattern, how to fan a thesis keyword list across two search endpoints, how to merge and dedupe by organization, how to enrich the top hits, and finally how a scheduled monitor turns the one-off run into a standing weekly digest with a net-new diff. The example thesis is "climate + logistics + dev tools," but the same code covers any thesis by swapping the keyword list.

Why Crunchbase Coverage Is a Fan-Out Problem

A single Crunchbase search is anchored on one query string, and any real investment thesis is wider than one keyword. "Climate" alone misses "carbon capture," "grid software," and "battery"; "logistics" misses "freight," "supply chain," and "last mile." If you run one search you cover a sliver of your thesis; if you want the whole thesis you have to run many searches and stitch the results together.

That fan-out immediately creates two structural problems, and the entire pipeline is built around solving them.

The first is duplication. Overlapping keywords are a feature, not a bug — broad coverage requires that "freight software" and "supply chain SaaS" both return the same standout company, because you would rather see it twice than miss it. But that means the merged result set is full of duplicate organizations, and you cannot dedupe on company name reliably (subsidiaries and rebrands share names, and the same company can render with slightly different display strings across searches). The clean key is the organization's own identifier, derived from its Crunchbase URL slug, which is stable regardless of which keyword surfaced it.

The second is freshness. A keyword search returns companies matching the keyword, not companies that raised this week matching the keyword. The recency filter is a separate step that reads each candidate's actual round data and keeps only the recent ones — which means you have to enrich before you can filter, and you only want to enrich the candidates worth the slower per-org call.

So the shape is: fan a keyword list across the search endpoints, merge and dedupe by org, enrich the survivors, filter to recent rounds, format the digest. Because Crunchbase is Cloudflare-gated, every one of those scraping steps runs asynchronously.

Step 1: Define the Thesis and Confirm One Search

Start by writing your thesis down as a flat list of keywords. This list is your coverage — a missing keyword is a blind spot — so err toward more terms, since dedupe collapses the overlap they create.

THESIS = [
    "climate software", "carbon capture", "grid software",
    "freight software", "supply chain saas", "last mile logistics",
    "developer tools", "api infrastructure", "observability",
]

Before looping over nine keywords, confirm one search works end to end with curl. Crunchbase has two complementary search endpoints: orgsearch finds organizations matching a keyword, and fundsearch finds funds and funding entities matching it — together they cover both the companies raising and the investors moving in a space. Both are asynchronous: you submit, get a job id back, then poll.

# 1) Submit one org search — returns a job id immediately
curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/orgsearch" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "query=climate software" \
  --data-urlencode "pages=2"
# → {"job_id": "cb_4a91...", "status": "pending"}

# 2) Poll the job until status == "completed"
curl -H "X-API-Key: lp_xxxxxxx" \
  https://api.logposervices.com/api/v1/jobs/cb_4a91

# 3) Fetch the result rows
curl -H "X-API-Key: lp_xxxxxxx" \
  https://api.logposervices.com/api/v1/jobs/cb_4a91/result

The submit-then-poll dance is not optional here. Crunchbase pages sit behind a Cloudflare challenge that must be cleared in a real browser session before the content renders, so the work is slower than a plain HTTP request — and api.logposervices.com is itself behind Cloudflare, which severs any single connection at roughly 90 seconds. Hold one synchronous request open across a challenge-clearing, multi-page search and you will hit that edge cap. Submit the job, let it run server-side, poll for the result.

pages=2 is a sensible default per keyword for a weekly digest — the most recently active companies surface in the first pages of a search, and you do not need deep paging when the recency filter downstream is doing the real narrowing.

Step 2: Fan the Thesis Across Both Search Endpoints

Now wire the search step into a fire-all-then-poll loop. For a nine-keyword thesis hitting two endpoints, that is eighteen jobs — you submit them all up front (each returns instantly with a job id) and then poll the outstanding ids until they finish, so the whole fan-out runs concurrently server-side instead of one keyword at a time.

import os, time, requests

API_KEY = os.environ["LOGPOSE_API_KEY"]
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def submit_search(kind, query, pages=2):
    """kind is 'orgsearch' or 'fundsearch'."""
    r = requests.get(
        f"{BASE}/ecommerce/crunchbase/{kind}",
        params={"query": query, "pages": pages},
        headers=HEADERS, timeout=30,
    )
    r.raise_for_status()
    return r.json()["job_id"]


def collect(job_ids, poll_every=6, timeout_s=1200):
    """Poll a batch of job ids; return the merged list of result rows."""
    pending = set(job_ids)
    rows, deadline = [], time.time() + timeout_s
    while pending and time.time() < deadline:
        for jid in list(pending):
            s = requests.get(f"{BASE}/jobs/{jid}", headers=HEADERS, timeout=15).json()
            status = s.get("status")
            if status == "completed":
                res = requests.get(f"{BASE}/jobs/{jid}/result",
                                   headers=HEADERS, timeout=30).json()
                rows.extend(res.get("results", []))
                pending.discard(jid)
            elif status == "failed":
                print(f"  search job {jid} failed: {s.get('error')}")
                pending.discard(jid)
        if pending:
            time.sleep(poll_every)
    if pending:
        print(f"  {len(pending)} jobs still running at timeout — collect later")
    return rows


# Fan the thesis across both endpoints, then poll the whole batch
job_ids = []
for kw in THESIS:
    job_ids.append(submit_search("orgsearch", kw, pages=2))
    job_ids.append(submit_search("fundsearch", kw, pages=2))

print(f"submitted {len(job_ids)} search jobs")
raw = collect(job_ids)
print(f"collected {len(raw)} raw rows (pre-dedupe)")

Submitting first and polling second is what keeps the wall-clock time flat as the thesis grows — eighteen Cloudflare-gated searches run in parallel up to your account's concurrency cap, and your script just watches the queue drain rather than waiting on each one in sequence.
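The fixed polling interval in collect is fine for eighteen jobs. For larger batches, one possible refinement is a capped exponential backoff between polling passes. The sketch below is an illustration, not part of the LogPose client: poll_with_backoff and its fetch_status parameter are hypothetical names, and the status lookup is injected as a callable so the loop can be exercised without touching the network.

```python
import time
from typing import Callable, Dict, Iterable


def poll_with_backoff(
    job_ids: Iterable[str],
    fetch_status: Callable[[str], str],
    first_delay: float = 2.0,
    max_delay: float = 30.0,
    timeout_s: float = 1200.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until every job is 'completed' or 'failed', or the timeout passes.

    Returns {job_id: final_status}. Jobs still running at the timeout are
    reported as 'timeout' so the caller can collect them later.
    """
    pending = set(job_ids)
    final: Dict[str, str] = {}
    delay = first_delay
    deadline = time.monotonic() + timeout_s
    while pending and time.monotonic() < deadline:
        # sorted() copies the set, so discarding inside the loop is safe
        for jid in sorted(pending):
            status = fetch_status(jid)
            if status in ("completed", "failed"):
                final[jid] = status
                pending.discard(jid)
        if pending:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # back off, but never past the cap
    for jid in pending:
        final[jid] = "timeout"
    return final
```

In the real pipeline, fetch_status would wrap the GET on /jobs/{jid} and return its status field. The backoff trades a little latency on the last job for far fewer status requests while the batch is still running.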

Step 3: Merge and Dedupe by Organization

The merged set now contains the same company multiple times — once per keyword it matched, plus any overlap between the org and fund searches. Collapse it on the organization identifier derived from the Crunchbase URL, which is stable per company across every search that surfaced it. Deduping on the URL slug beats deduping on display name, because the same company can render with different name strings across searches but its /organization/<slug> path is constant.

import re


def org_key(row):
    """Stable per-org key from the Crunchbase URL slug."""
    url = row.get("url") or ""
    m = re.search(r"/organization/([^/?#]+)", url)
    if m:
        return m.group(1).lower()
    # Fallback for fund rows or odd URLs: normalized name
    name = (row.get("name") or "").strip().lower()
    return name or None


def dedupe(rows):
    seen, unique = {}, []
    for r in rows:
        key = org_key(r)
        if not key:
            continue
        if key in seen:
            # first row wins; if it lacks a URL, merge in the later row's fields
            if not seen[key].get("url") and r.get("url"):
                seen[key].update(r)
            continue
        seen[key] = r
        unique.append(r)
    return unique


candidates = dedupe(raw)
print(f"{len(candidates)} unique organizations after dedupe")
# e.g. 18 searches x ~40 rows -> ~700 raw -> ~250 unique orgs

The fallback to a normalized name covers fund rows or any row where the URL was not in the expected shape, so a real candidate is never silently dropped because one identifier was missing. After this step you have a clean list of unique organizations across the whole thesis — but you still do not know which of them raised recently. That is the enrichment step.
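As a quick check on the choice of key, here is a small worked comparison of slug-based and name-based grouping. The three rows are invented sample data, and slug_key is a stand-alone copy of the slug extraction used in org_key so that the snippet runs on its own.

```python
import re


def slug_key(url):
    """Lower-cased /organization/<slug> segment, or None if the URL has no slug."""
    m = re.search(r"/organization/([^/?#]+)", url or "")
    return m.group(1).lower() if m else None


# Invented rows: the first two are the same company rendered two ways,
# the third is a different organization that shares the display name.
rows = [
    {"name": "Acme Freight",
     "url": "https://www.crunchbase.com/organization/acme-freight"},
    {"name": "Acme Freight, Inc.",
     "url": "https://www.crunchbase.com/organization/Acme-Freight?utm_source=search"},
    {"name": "Acme Freight",
     "url": "https://www.crunchbase.com/organization/acme-freight-europe"},
]

by_slug = {slug_key(r["url"]) for r in rows}
by_name = {r["name"].strip().lower() for r in rows}

print(sorted(by_slug))  # two keys: rows 1 and 2 merge, row 3 stays separate
print(sorted(by_name))  # also two keys, but rows 1 and 3 are wrongly merged
```

Both approaches end up with two groups here, which is exactly why name-based dedupe is hard to catch: the count looks right while the grouping is wrong.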

Step 4: Enrich the Candidates from Their Org Pages

A search result row carries the headline fields (name, URL, a short description) but not the structured round detail you need to decide whether a company belongs in this week's digest. The organization endpoint resolves a single org's detail page by URL and returns the fuller record, including its disclosed rounds. It is the slowest of the three calls because it loads and clears the Cloudflare challenge for one full org page, so you enrich deliberately: only the candidates worth the per-org cost.

def submit_org(url):
    r = requests.get(
        f"{BASE}/ecommerce/crunchbase/organization",
        params={"url": url},
        headers=HEADERS, timeout=30,
    )
    r.raise_for_status()
    return r.json()["job_id"]


# Enrich the candidates that have a resolvable org URL
to_enrich = [c for c in candidates if "/organization/" in (c.get("url") or "")]
print(f"enriching {len(to_enrich)} organizations")

org_jobs = {submit_org(c["url"]): c["url"] for c in to_enrich}
detail_rows = collect(list(org_jobs.keys()))

# Index the enriched detail by org key for the next step
details = {org_key(d): d for d in detail_rows if org_key(d)}
print(f"enriched {len(details)} organizations")

This is the same fire-all-then-poll pattern as the search step — submit every org job, then poll the batch — so even a couple hundred enrichments run concurrently rather than serially. If your thesis is broad enough that enriching every candidate is more work than you want each week, narrow to_enrich first: keep only orgs whose search-result description matches a tighter sub-thesis, or only the first N per keyword, and enrich those. The recency filter in the next step assumes you have the round data, so enrich whatever you intend to consider.
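The narrowing described above could look something like the following. This is a minimal sketch under two assumptions: search rows expose name and description fields, and a plain substring match is good enough for a sub-thesis. The narrow function and its parameters are hypothetical names introduced for illustration.

```python
from typing import Dict, Iterable, List


def narrow(candidates: Iterable[Dict], sub_thesis: Iterable[str], limit: int = 50) -> List[Dict]:
    """Keep candidates whose name or description mentions a sub-thesis term.

    Stops once `limit` candidates have been kept, preserving input order.
    """
    terms = [t.lower() for t in sub_thesis]
    kept: List[Dict] = []
    for c in candidates:
        if len(kept) >= limit:
            break
        text = f"{c.get('name') or ''} {c.get('description') or ''}".lower()
        if any(t in text for t in terms):
            kept.append(c)
    return kept


# Invented sample rows standing in for deduped search results
sample = [
    {"name": "GridFlow", "description": "Grid software for utilities"},
    {"name": "Parcelly", "description": "Last mile delivery marketplace"},
    {"name": "CarbonVault", "description": "Direct air carbon capture hardware"},
    {"name": "LogLens", "description": None},
]
shortlist = narrow(sample, ["carbon capture", "grid"], limit=10)
print([c["name"] for c in shortlist])
```

Passing the shortlist to the enrichment step in place of the full candidate list keeps the per-org job count bounded regardless of how wide the keyword list grows.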

Step 5: Filter to Recent Rounds

Now apply the freshness step that turns a thesis-matched list into a funding digest. Read each enriched org's most recent round, parse its date, and keep only the ones inside your digest window — one week for a weekly cadence, with a little slack because Crunchbase entries sometimes lag the announcement by a few days.

from datetime import datetime, timezone, timedelta

WINDOW_DAYS = 10  # 7-day cadence + a few days of slack for late entries


def latest_round(detail):
    """Return (date, round_row) for the most recent disclosed round, or None."""
    rounds = detail.get("funding_rounds") or []
    # Each format is paired with the number of characters it occupies, so a
    # longer string such as "2026-06-20T14:30:00Z" still parses as a date.
    formats = (("%Y-%m-%d", 10), ("%b %d, %Y", 12), ("%Y-%m", 7))
    dated = []
    for fr in rounds:
        raw = (fr.get("announced_on") or fr.get("date") or "").strip()
        for fmt, width in formats:
            try:
                dt = datetime.strptime(raw[:width], fmt).replace(tzinfo=timezone.utc)
                dated.append((dt, fr))
                break
            except ValueError:
                continue
    if not dated:
        return None
    return max(dated, key=lambda x: x[0])


cutoff = datetime.now(timezone.utc) - timedelta(days=WINDOW_DAYS)
recent = []
for key, detail in details.items():
    lr = latest_round(detail)
    if not lr:
        continue
    when, fr = lr
    if when >= cutoff:
        recent.append((when, detail, fr))

recent.sort(key=lambda x: x[0], reverse=True)
print(f"{len(recent)} organizations with a round in the last {WINDOW_DAYS} days")

The multi-format date parsing earns its place because Crunchbase does not render dates uniformly across pages, and a digest that silently dropped a fresh round because of one unexpected date string would be worse than useless. Sorting newest-first means the most recent raises sit at the top of the digest, which is the order a partner wants to read them in.
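One way to keep unexpected date strings from vanishing quietly is to record every non-empty value that no format matched and review that list after each run. The sketch below assumes the same three date shapes as the parser above. parse_or_report and the unparsed list are hypothetical names, and the sample strings are invented.

```python
from datetime import datetime, timezone
from typing import List, Optional

# (format, number of characters that format occupies in a well-formed string)
FORMATS = (("%Y-%m-%d", 10), ("%b %d, %Y", 12), ("%Y-%m", 7))


def parse_or_report(raw: Optional[str], unparsed: List[str]) -> Optional[datetime]:
    """Parse a round date, or append the raw string to `unparsed` and return None."""
    text = (raw or "").strip()
    for fmt, width in FORMATS:
        try:
            return datetime.strptime(text[:width], fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    if text:
        unparsed.append(text)  # surface it instead of dropping it silently
    return None


unparsed: List[str] = []
samples = ["2026-06-20", "2026-06-20T14:30:00Z", "Jun 5, 2026", "2026-06", "Q2 2026", ""]
parsed = [parse_or_report(s, unparsed) for s in samples]

for s, dt in zip(samples, parsed):
    print(repr(s), "->", dt.date().isoformat() if dt else None)
print("needs a new format:", unparsed)
```

Note that %b matches month abbreviations in the current locale, so the "Jun 5, 2026" form assumes an English or C locale. A non-empty unparsed list after a run is the signal to add a format rather than trust that week's digest.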

Step 6: Format the Weekly Digest

The last step turns the recent-round list into something a human reads in thirty seconds. Build a compact per-company block — name, the round and amount if disclosed, the date, a one-line description, and the link — and join them into a single digest payload you can post to Notion or Slack.

def fmt_amount(fr):
    amt = fr.get("money_raised") or fr.get("amount")
    return f" · {amt}" if amt else ""


def build_digest(recent):
    today = datetime.now(timezone.utc).strftime("%b %d, %Y")
    lines = [f"*Thesis funding digest — week of {today}*",
             f"{len(recent)} new rounds across your thesis", ""]
    for when, detail, fr in recent:
        name = detail.get("name", "Unknown")
        stage = fr.get("round_type") or fr.get("type") or "Round"
        date = when.strftime("%b %d")
        desc = (detail.get("description") or "").strip()
        if len(desc) > 140:
            desc = desc[:137] + "..."
        url = detail.get("url", "")
        lines.append(f"• *{name}* — {stage}{fmt_amount(fr)} ({date})")
        if desc:
            lines.append(f"  {desc}")
        if url:
            lines.append(f"  {url}")
        lines.append("")
    return "\n".join(lines)


digest = build_digest(recent)
print(digest)

That string drops straight into a Slack message or a Notion block. The truncated description keeps each entry skimmable; the round type and amount give the partner the one number they care about; the link is there for the one or two companies worth a deeper look. You now have, in a single run, the artifact that used to take a Monday morning of tab-refreshing — but it is still a manual run. The next section removes the manual part.
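For the Slack case mentioned above, a Slack incoming webhook accepts a JSON body with a text field, and the single-asterisk bold used in the digest lines matches Slack's message markup. The sketch below uses only the standard library. build_slack_request and post_digest are hypothetical helper names, and the webhook URL is assumed to come from your own Slack app configuration.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the digest as a Slack message."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_digest(webhook_url: str, text: str, timeout: float = 15.0) -> int:
    """Send the digest to the webhook and return the HTTP status code."""
    req = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

A call such as post_digest(os.environ["SLACK_WEBHOOK_URL"], digest) would send the string built in this step; SLACK_WEBHOOK_URL is an assumed environment variable name. Splitting construction from sending keeps the payload easy to inspect in a dry run.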

Scaling Into a Standing Weekly Digest

Everything above is one execution: you ran it, you got this week's digest. The deal-scout shape, though, is a recurring cadence — the same thesis, re-pulled every week, where what you actually care about is the rounds you have not already seen. Two things make that practical, and the second is where you stop maintaining infrastructure.

First, the thesis is just data, so re-running is free in engineering terms: the same submit_search / collect / dedupe / enrich / filter / build_digest functions run unchanged each week, and only the date window moves. Second, a weekly cadence needs a net-new diff, because consecutive ten-day windows overlap by three days: a company that raised two days before this week's run will be nine days old at next week's run, still inside the window, and you do not want to report it twice. Because every entry carries a stable org key, the diff is trivial in principle: store last week's set of org keys, re-run, and surface only the keys you have not seen before.
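To see the diff in isolation, here is a minimal sketch that keeps the seen-set in a local JSON file. The file location and the load_seen and net_new helpers are hypothetical; they only illustrate the store, re-run, and subtract logic described above.

```python
import json
from pathlib import Path
from typing import Iterable, Set


def load_seen(path: Path) -> Set[str]:
    """Read the set of previously reported keys; empty on the first run."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def net_new(current_keys: Iterable[str], path: Path) -> Set[str]:
    """Return keys not seen in earlier runs, then persist the union."""
    seen = load_seen(path)
    current = set(current_keys)
    fresh = current - seen
    path.write_text(json.dumps(sorted(seen | current)), encoding="utf-8")
    return fresh
```

Keying on the org key alone means a company that raises a second round months later would be suppressed as already seen. If that matters for your thesis, one option is to key on the org key combined with the round date instead.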

The part worth not building yourself is the scheduler and the state store behind that diff. LogPose exposes a monitor primitive — POST /api/v1/monitors — that polls a saved search on a schedule and fires when new organizations appear, with notify_channels of email, webhook, telegram, slack, or discord. Pointed at your saved thesis searches, it does the weekly poll, holds the seen-set for the net-new diff, and pushes only the new rounds to Slack (or a webhook that writes them into Notion) — which removes the cron job, the database of seen org keys, and the "did the run actually fire on Monday" babysitting from your build. That is the piece that turns a script you remember to run into a digest that simply arrives.

The Honest Fit

This approach fits well when your sourcing is thesis-driven and your cadence is weekly: a defined set of keywords, deduped coverage across them, enrichment from public org pages, a recency filter, and a digest that lands in the tool your team already lives in — all without standing up your own Cloudflare-clearing browser fleet and proxy rotation. The async search-then-enrich pattern and the org-key dedupe are the two primitives that make broad thesis coverage reliable rather than a pile of duplicate tabs.

Where it is not the right tool: this is not a real-time deal wire. The cadence is a weekly poll of public pages, so a round that broke an hour ago is a next-cycle event, not an instant alert — if your edge depends on being first within minutes, you need a live feed, not a digest. And it is not a substitute for a paid Crunchbase enterprise license: if you need their full structured financials, investor graphs, and sanctioned bulk export, license the product and use this pipeline alongside it for thesis-scoped weekly coverage rather than as a replacement. Used for what it is — automated, deduped, thesis-scoped weekly sourcing — it replaces the Monday ritual cleanly.

Get Started

Sign up at logposervices.com and generate an API key under Tool → API Keys, then export it in your shell:

export LOGPOSE_API_KEY=lp_xxxxxxx

Confirm one thesis keyword resolves, then build the fan-out:

curl -G "https://api.logposervices.com/api/v1/ecommerce/crunchbase/orgsearch" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "query=climate software" \
  --data-urlencode "pages=2"

Then run the fan-out over your full thesis list, dedupe by org key, enrich the survivors via /api/v1/ecommerce/crunchbase/organization?url=..., filter to recent rounds, and format the digest. Once the script produces the digest you want, point a monitor at your saved thesis searches so it polls weekly, diffs against the org keys it has already seen, and pushes only the net-new rounds to Slack or Notion — and export the digest to your team's workspace from there.

Related reading: How to build a VC deal-flow list from Crunchbase for the sourcing fundamentals, How to scrape Crunchbase startup funding data for the field-level extraction detail, and Crunchbase API alternatives for funding and investor data for the tooling landscape.

External: Crunchbase, hiQ Labs v. LinkedIn.
