Is it legal to scrape Yellow Pages for cold-email leads?

Yellow Pages business listings (name, address, phone, website, and category) are public B2B data displayed without authentication to anyone who opens the page, and the contact enrichment in this pipeline reads each business's own public website, which is the same information a human visitor sees. In the United States, the Ninth Circuit held in hiQ Labs v. LinkedIn (9th Cir. 2022) that scraping publicly accessible data likely does not violate the CFAA, and in the EU and UK public business contact information is generally treated as collectible under a legitimate-interest basis. The genuinely regulated step is downstream and worth taking seriously: how you contact the leads, not how you collected them, is governed by CAN-SPAM (US email), GDPR (EU personal data), and CASL (Canada). That means honoring unsubscribe requests, identifying yourself, and avoiding deceptive subject lines are the real compliance work; the data collection is the easy part.

Why dedupe across cycles instead of just scraping the same searches each week?

Directory listings change slowly — most businesses in a category-and-city search are the same week over week. If you re-scrape and email the whole list every cycle, you re-contact people who already got your sequence, which burns your domain reputation and trips spam filters faster than anything else. A weekly cadence only works if each cycle delivers net-new businesses. The fix is a stable identifier: store the set of business ids you have already pulled, re-run the searches, and surface only the rows whose id you have not seen before. That diff is what turns a static list into a standing pipeline, and it is why the dedupe key matters as much as the scrape itself.


How a Cold-Email Agency Pulls 500 Fresh Local Leads a Week

June 23, 2026 · 12 min read

If you run a cold-email agency serving local SMB clients, your product is not "a list" — it is a steady supply of fresh, reachable businesses in a defined niche and geography, delivered every week without you re-doing the work. A client selling bookkeeping to dental practices in three states needs new dentists to email each Monday, not the same 2,000 rows recycled until the open rate collapses. The hard part is not finding businesses; directories are full of them. The hard part is turning a directory into an outreach-ready, deduplicated, net-new feed on a weekly cadence — and doing it across the dozens of category-and-city combinations a single client's territory implies.

This guide is the full weekly pipeline. We will cover why a single directory search falls short of a territory, how to build a list of category-by-city searches, how to fire an enriched leads job per combination that visits each business's own site for emails and phones, how to poll and collect the results, how to dedupe by a stable id so you only keep net-new businesses across cycles, and how to write an outreach-ready CSV that drops straight into your sending tool or CRM. The example niche is dental practices across a few Texas cities, but the same code covers HVAC contractors in the Southeast or law firms in the Midwest by swapping two lists.

Why One Search Will Never Cover a Territory

A cold-email agency's unit of work is a niche × geography — "bookkeeping for dentists in Texas," "pest control for restaurants in Florida." Neither dimension is one query. A directory like Yellow Pages organizes results by a category term and a location term, and a single search returns one category in one city, paginated:

https://www.yellowpages.com/search?search_terms=dentists&geo_location_terms=Austin%2C+TX

The structure is ?search_terms=<category>&geo_location_terms=<city, state>. That single search covers exactly one category in one city, and even there it is paginated — you get roughly 30 listings per page, so a city with a few hundred dentists takes several pages to exhaust. Asking for one search does not give you a territory; it gives you one cell of a grid.
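To make that structure concrete, here is a minimal sketch that reads the category and city back out of a search URL using only the standard library. The helper name parse_yp_url is hypothetical and chosen for illustration; it is handy later if you want to label collected rows with the grid cell they came from.

```python
from urllib.parse import urlsplit, parse_qs


def parse_yp_url(url):
    """Return (category, city) from a Yellow Pages search URL.

    parse_yp_url is a hypothetical helper name, not part of any API.
    """
    query = parse_qs(urlsplit(url).query)
    # parse_qs decodes %2C and '+' and returns a list of values per parameter
    category = query.get("search_terms", [""])[0]
    city = query.get("geo_location_terms", [""])[0]
    return category, city


url = ("https://www.yellowpages.com/search?"
       "search_terms=dentists&geo_location_terms=Austin%2C+TX")
print(parse_yp_url(url))  # ('dentists', 'Austin, TX')
```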

So the geography has to be enumerated. "Texas" is not a geo_location_terms value that returns the whole state — it is Austin, Dallas, Houston, San Antonio, Fort Worth, El Paso, and the secondary cities, each its own search. And the niche is usually more than one category term: a dental client cares about "dentists," but also "orthodontists," "pediatric dentists," and "oral surgeons," each a separate search with its own results.

That turns the week's work into a grid problem: build the full list of {category} × {city} searches, run one enriched job per cell, then merge and dedupe everything into a single net-new feed. Get the grid right and coverage is just a matter of how many cells you run; get it wrong and you are emailing the same downtown listings every week while the suburbs go untouched.

Step 1: Build the Category-by-City Search Grid

You do not want to hand-write URLs for forty combinations. Define the niche as a list of category terms and the territory as a list of cities, then take the cross product and build one Yellow Pages search URL per pair.

Here is a small, dependency-free helper that does it:

from urllib.parse import quote_plus


def yp_url(category, city):
    """Build a Yellow Pages search URL for one category in one city.

    city should include the state, e.g. "Austin, TX".
    """
    return (
        "https://www.yellowpages.com/search?"
        f"search_terms={quote_plus(category)}"
        f"&geo_location_terms={quote_plus(city)}"
    )


def search_grid(categories, cities):
    """Cross product of categories x cities -> list of search URLs."""
    return [yp_url(cat, city) for cat in categories for city in cities]


# One client's niche x territory
CATEGORIES = ["dentists", "orthodontists", "pediatric dentists", "oral surgeons"]
CITIES = ["Austin, TX", "Dallas, TX", "Houston, TX",
          "San Antonio, TX", "Fort Worth, TX"]

urls = search_grid(CATEGORIES, CITIES)
print(f"{len(urls)} searches to cover the territory")
# → 20 searches (4 categories x 5 cities)

Twenty searches is a typical week for one client. Build the city list once per client territory — pull the cities from the client's service-area definition, or from a metro list if they sell nationally by region. The category list is the niche, and it is worth being generous: adding "cosmetic dentists" or "emergency dentists" as extra terms surfaces businesses that the bare "dentists" search ranks below the fold, and dedupe collapses the overlap, so extra category terms only ever add coverage. If twenty searches is more than a first pass needs, drop to the primary category and the top three cities, then widen once the pipeline is running.
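The narrow-then-widen approach can be sketched as follows. This variant keeps the category and city next to each URL so every grid cell stays traceable; labeled_grid is a hypothetical name, and the URL format is the same one used by yp_url above.

```python
from itertools import product
from urllib.parse import quote_plus


def labeled_grid(categories, cities):
    """Cross product that keeps the category and city next to each URL."""
    cells = []
    for category, city in product(categories, cities):
        url = (
            "https://www.yellowpages.com/search?"
            f"search_terms={quote_plus(category)}"
            f"&geo_location_terms={quote_plus(city)}"
        )
        cells.append({"category": category, "city": city, "url": url})
    return cells


# First pass: primary category, top three cities
first_pass = labeled_grid(
    ["dentists"], ["Austin, TX", "Dallas, TX", "Houston, TX"]
)

# Widened: the full niche x territory
full = labeled_grid(
    ["dentists", "orthodontists", "pediatric dentists", "oral surgeons"],
    ["Austin, TX", "Dallas, TX", "Houston, TX", "San Antonio, TX", "Fort Worth, TX"],
)
print(len(first_pass), len(full))  # 3 20
```

Because both grids come from the same function, widening is a change to the two input lists and nothing else.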

Step 2: Fire an Enriched Leads Job per Search

The LogPose API exposes two Yellow Pages endpoints for a search. The plain search endpoint returns the directory rows: name, address, phone, website, category. The leads endpoint does that and then visits each business's own website to pull contact details the directory does not carry: email addresses, additional phones, and social profiles. For cold email, the leads endpoint is the one you want, because a directory phone number is not an outreach-ready email.

Every call is asynchronous: you submit, get a job id back, then poll. Confirm one search works with curl before you loop:

# 1) Submit one search — returns a job id immediately
curl -G "https://api.logposervices.com/api/v1/ecommerce/yellowpages/leads" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "url=https://www.yellowpages.com/search?search_terms=dentists&geo_location_terms=Austin%2C+TX" \
  --data-urlencode "pages=3"
# → {"job_id": "yp_4c1e...", "status": "pending"}

# 2) Poll the job until status == "completed"
curl -H "X-API-Key: lp_xxxxxxx" \
  https://api.logposervices.com/api/v1/jobs/yp_4c1e

# 3) Fetch the enriched rows
curl -H "X-API-Key: lp_xxxxxxx" \
  https://api.logposervices.com/api/v1/jobs/yp_4c1e/result

The leads path does more work than a plain search because the website-enrichment step opens each business site — so it is the slower of the two Yellow Pages endpoints. That makes the async pattern non-negotiable, especially across a grid. api.logposervices.com sits behind Cloudflare, which kills any single connection at roughly 90 seconds. A 3-page enriched search can run longer than that, so never wait on one inline request — submit the job, let it run server-side, and poll for the result.

pages=3 is about 90 listings from one category-city search, which exhausts most local categories before the long tail thins out. You rarely need deep paging on a single search; the grid — more categories and more cities — is what buys you volume, not page 12 of one search.
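The arithmetic behind that claim is easy to check. The sketch below assumes the rough figure of 30 listings per page quoted earlier, which is an approximation and not an API guarantee, and computes an upper bound on raw rows before any dedupe.

```python
LISTINGS_PER_PAGE = 30  # approximate figure from the text, not a guarantee


def raw_row_ceiling(n_categories, n_cities, pages):
    """Upper bound on raw rows for a grid, before any dedupe."""
    return n_categories * n_cities * pages * LISTINGS_PER_PAGE


print(raw_row_ceiling(1, 1, 3))   # 90: one search, 3 pages
print(raw_row_ceiling(1, 1, 12))  # 360: one search paged 12 deep
print(raw_row_ceiling(4, 5, 3))   # 1800: the 4 x 5 grid at 3 pages each
```

Paging one search twelve deep yields at most 360 rows, and most local categories run out of listings well before that, while the 20-cell grid at three pages has a ceiling of 1,800.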

Step 3: Submit the Grid and Poll It

For a full territory you are submitting twenty-plus jobs, so the right pattern is fire-all-then-poll: submit every search up front (each returns instantly with a job id), then poll the outstanding job ids in a loop until they all finish. This keeps the whole grid running in parallel server-side instead of waiting on each search in sequence.

import os, time, requests

API_KEY = os.environ["LOGPOSE_API_KEY"]
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def submit(url, pages=3):
    r = requests.get(
        f"{BASE}/ecommerce/yellowpages/leads",
        params={"url": url, "pages": pages},
        headers=HEADERS, timeout=30,
    )
    r.raise_for_status()
    return r.json()["job_id"]


def collect(job_ids, poll_every=5, timeout_s=1200):
    """Poll a batch of job ids; return the merged list of business rows."""
    pending = set(job_ids)
    rows, deadline = [], time.time() + timeout_s
    while pending and time.time() < deadline:
        for jid in list(pending):
            try:
                s = requests.get(f"{BASE}/jobs/{jid}", headers=HEADERS, timeout=15)
                s.raise_for_status()
                job = s.json()
                if job.get("status") == "completed":
                    res = requests.get(f"{BASE}/jobs/{jid}/result",
                                       headers=HEADERS, timeout=30)
                    res.raise_for_status()
                    rows.extend(res.json().get("listings", []))
                    pending.discard(jid)
                elif job.get("status") == "failed":
                    print(f"  search job {jid} failed: {job.get('error')}")
                    pending.discard(jid)
            except (requests.RequestException, ValueError) as exc:
                # Transient network, HTTP, or JSON error: keep the job
                # pending and retry on the next pass instead of losing rows
                print(f"  poll error for {jid}: {exc}")
        if pending:
            time.sleep(poll_every)
    if pending:
        print(f"  {len(pending)} jobs still running at timeout — collect later")
    return rows


# Submit the whole grid, then poll it
job_ids = [submit(u, pages=3) for u in urls]
print(f"submitted {len(job_ids)} search jobs")
all_rows = collect(job_ids)
print(f"collected {len(all_rows)} raw rows (pre-dedupe)")

Submitting first and polling second is what turns a twenty-search grid from twenty sequential waits into a few minutes of wall-clock time — the jobs run concurrently on the server up to your account's concurrency cap, and your script just watches the queue drain. The enriched endpoint is the slower one, so give collect a generous timeout; a territory of enriched searches can take longer than a plain directory pull.
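The text does not say whether submissions beyond the concurrency cap are queued or rejected. If they turn out to be rejected, one option is to run the grid in batches sized to the cap. The sketch below is a hedged illustration: run_in_batches and chunked are hypothetical helpers, batch_size=5 is a placeholder rather than a documented limit, and submit and collect are passed in as arguments so the dry run makes no network calls.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(urls, submit, collect, batch_size=5):
    """Submit and collect the grid one batch at a time.

    submit(url) -> job id and collect(job_ids) -> rows are injected, so the
    functions from this step can be plugged in unchanged. batch_size=5 is a
    placeholder; use your account's real concurrency cap.
    """
    rows = []
    for batch in chunked(urls, batch_size):
        job_ids = [submit(u) for u in batch]
        rows.extend(collect(job_ids))
    return rows


# Dry run with stand-in functions (no network calls)
def fake_submit(url):
    return f"job-{url}"


def fake_collect(job_ids):
    return [{"job": j} for j in job_ids]


demo = run_in_batches([f"u{i}" for i in range(12)], fake_submit, fake_collect)
print(len(demo))  # 12
```

Batching trades some wall-clock time for predictability: each batch still runs in parallel server-side, but the script never has more than batch_size jobs outstanding.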

Step 4: Dedupe to Net-New Businesses

Deduping happens at two levels, and a weekly pipeline needs both. The first is within this cycle: overlapping category terms guarantee the same dental practice appears under "dentists" and "cosmetic dentists," so the merged list has duplicates. The second is across cycles: most businesses this week were also there last week, and re-emailing them is what kills a sending domain. Both collapse on a stable per-business identifier.

import json, os


def dedupe_within(rows):
    """Collapse duplicates inside one cycle's merged rows."""
    seen, unique = set(), []
    for r in rows:
        key = r.get("id") or r.get("phone_raw") or r.get("website")
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append(r)
    return unique


def net_new(rows, state_path="seen_ids.json"):
    """Return only rows whose id was not seen in a previous cycle."""
    seen = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            seen = set(json.load(f))

    fresh = []
    for r in rows:
        key = r.get("id") or r.get("phone_raw") or r.get("website")
        if not key or key in seen:
            continue
        seen.add(key)
        fresh.append(r)

    with open(state_path, "w") as f:
        json.dump(sorted(seen), f)
    return fresh


cycle = dedupe_within(all_rows)
print(f"{len(cycle)} unique businesses this cycle")

leads = net_new(cycle)
print(f"{len(leads)} net-new businesses vs last cycle")
# e.g. 20 searches -> ~1,200 raw -> ~600 unique -> ~500 net-new on a weekly run

The fallback chain (id → phone_raw → website) covers the rare row where the directory omitted the stable id, so you never silently drop a real lead just because one identifier was missing. And the persisted seen_ids.json is the whole trick behind the weekly cadence: each run loads the businesses you have already pulled, emits only the new ones, and writes the union back. On a fresh territory the first cycle is large and every cycle after is the net-new tail — typically a few hundred businesses a week as new practices open and the directory updates.
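One refinement worth considering: the fallback values only match across rows if they are formatted the same way. The sketch below normalizes each fallback and prefixes the key with its source so an id can never collide with a phone number. dedupe_key is a hypothetical helper, and the normalization rules (last ten digits of the phone, lowercased host without "www.") are assumptions you should adjust to your data.

```python
import re
from urllib.parse import urlsplit


def dedupe_key(row):
    """Normalized, prefixed key following the id -> phone_raw -> website chain."""
    if row.get("id"):
        return f"id:{row['id']}"
    digits = re.sub(r"\D", "", row.get("phone_raw") or "")
    if digits:
        # Keep the last 10 digits so a leading country code does not matter
        return f"phone:{digits[-10:]}"
    site = (row.get("website") or "").strip().lower()
    if site:
        if "//" not in site:
            site = "//" + site  # lets urlsplit treat a bare host as the netloc
        host = urlsplit(site).netloc
        if host.startswith("www."):
            host = host[4:]
        if host:
            return f"site:{host}"
    return None  # no usable identifier on this row


print(dedupe_key({"phone_raw": "+1 (512) 555-0100"}))        # phone:5125550100
print(dedupe_key({"website": "https://www.Example.com/contact"}))  # site:example.com
```

The trade-off is that keying on the host collapses multi-location businesses that share one domain, but that only affects rows missing both an id and a phone.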

Step 5: Understand the Enriched Fields (and What's Missing)

Be honest with yourself about where each field comes from, because it changes how you treat it. The directory listing gives you the firmographic core. The website enrichment step is what adds the outreach-grade contact fields — and those only exist when the business maintains a reachable website.

Field	Source	Notes
name	Directory listing	Always present
address	Directory listing	Full + parsed parts
category	Directory listing	e.g. "Dentists", "Orthodontists"
phone / phone_raw	Directory listing	Formatted + digits-only
website	Directory listing	Present for most established businesses
id	Directory listing	Stable dedupe key
emails	Website enrichment	Scraped from the business's own site; empty if no site or no public email
socials	Website enrichment	Facebook / Instagram / LinkedIn handles found on the site
extra_phones	Website enrichment	Numbers on the site beyond the directory listing

The honest caveat: the directory does not publish email addresses. Nothing in the Yellow Pages listing contains an email. The emails and socials fields are derived by visiting the website and reading what the business put on its own public site — a contact page, a footer mailto:, a "follow us" bar. That means email coverage tracks website quality: a practice with a real site and a contact page enriches cleanly; a small operator with only a directory listing and a phone number will have an empty emails field, and no enrichment step can invent one. For a cold-email agency this is the field that matters most, so it is worth measuring per niche — established categories like dental and legal enrich well, while trades and one-person shops lean more on phone.
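Measuring coverage per niche takes only a few lines. The sketch below assumes each row carries the category and emails fields from the table above; email_coverage is a hypothetical helper and the sample rows are made up for illustration.

```python
from collections import defaultdict


def email_coverage(rows):
    """Return {category: (rows_with_email, total_rows, share)}."""
    totals = defaultdict(int)
    with_email = defaultdict(int)
    for r in rows:
        cat = r.get("category") or "unknown"
        totals[cat] += 1
        if r.get("emails"):
            with_email[cat] += 1
    return {
        cat: (with_email[cat], totals[cat], with_email[cat] / totals[cat])
        for cat in totals
    }


# Made-up sample rows, for illustration only
sample = [
    {"category": "Dentists", "emails": ["info@example.com"]},
    {"category": "Dentists", "emails": []},
    {"category": "Orthodontists", "emails": ["hello@example.org"]},
]
for cat, (hit, total, share) in email_coverage(sample).items():
    print(f"{cat}: {hit}/{total} ({share:.0%})")
```

Running this on each cycle's deduped rows shows which categories can carry an email sequence and which ones need a phone-first plan.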

Step 6: Write an Outreach-Ready CSV

The last step turns the deduped, enriched, net-new list into a CSV your sending tool or CRM importer can consume directly. Keep the first email per business, flatten the list-valued socials field into a delimited string, drop rows with no way to reach the business, and split out an email-only file for the cold-email sequence.

import csv


def write_csv(leads, out_path, require_email=False):
    fields = ["name", "category", "email", "phone", "website",
              "socials", "address", "id"]
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        w.writeheader()
        for r in leads:
            phone = r.get("phone_raw") or ""
            emails = r.get("emails") or []
            # Reachable = has a phone OR at least one website-derived email
            if len(phone) < 10 and not emails:
                continue
            # Cold-email file: only rows that actually have an email
            if require_email and not emails:
                continue
            w.writerow({
                "name": r.get("name", ""),
                "category": r.get("category", ""),
                "email": (emails[0] if emails else ""),
                "phone": r.get("phone", ""),
                "website": r.get("website", ""),
                "socials": " | ".join(r.get("socials") or []),
                "address": r.get("address", ""),
                "id": r.get("id", ""),
            })
            written += 1
    return written


# Full reachable list (email or phone), plus an email-only sending file
total = write_csv(leads, "dentists_tx_all.csv")
emailable = write_csv(leads, "dentists_tx_email.csv", require_email=True)
print(f"wrote {total} reachable rows, {emailable} with an email")

Two choices earn their keep. Keeping the first email only (emails[0]) makes the CSV one-row-per-business, which is what a sending tool's importer expects. And splitting an email-only file from the full reachable list keeps your two channels clean: the email file feeds the cold-email sequence directly, while the rows that enriched to a phone but no email go to a call list or a LinkedIn touch instead of being deleted. The result is a tight, deduped, net-new feed for the week — an email on the majority of rows and a phone as the fallback on the rest — ready to import Monday morning.

Scaling This Across Clients and Cycles

The pipeline above is one client for one week. The agency shape is several clients, each with its own niche and territory, each refreshed weekly. Two things make that practical.

First, the grid is just two lists, so a multi-client run is a loop over {client: (categories, cities)} feeding the same submit / collect / dedupe_within / net_new / write_csv functions — nothing in the pipeline changes per client except the category and city lists and the path to that client's seen_ids.json. Second, the weekly cadence is really a new-business detector, and you do not have to host the cron-plus-state yourself. LogPose exposes a monitor primitive that polls a saved search on a schedule and fires when new businesses appear, so instead of re-running the whole grid blind every Monday, you let the monitor watch each search and notify you when there is something net-new to pull. It removes the scheduler and most of the state store from your build:

curl -X POST "https://api.logposervices.com/api/v1/monitors" \
  -H "X-API-Key: lp_xxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.yellowpages.com/search?search_terms=dentists&geo_location_terms=Austin%2C+TX",
    "name": "TX dentists — new listings",
    "metric": "new_listings",
    "condition": "gte",
    "threshold": 1,
    "check_interval_hours": 24,
    "notify_channels": ["email", "slack"]
  }'

That is the piece that turns a one-time grid into a standing per-client pipeline: a monitor per saved search, an email or Slack ping when new businesses appear, and your net_new diff producing the week's outreach file.
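The multi-client loop described in this section can be sketched as a small planning step. Everything client-specific here is hypothetical: the CLIENTS registry, the client names, and the state-file naming scheme are illustrations. The monitor body reuses only the fields shown in the curl example above, and nothing is sent over the network.

```python
from urllib.parse import quote_plus

# Hypothetical client registry: {client: (categories, cities)}
CLIENTS = {
    "acme_books": (["dentists", "orthodontists"], ["Austin, TX", "Dallas, TX"]),
    "coolair": (["hvac contractors"], ["Atlanta, GA"]),
}


def yp_url(category, city):
    return (
        "https://www.yellowpages.com/search?"
        f"search_terms={quote_plus(category)}"
        f"&geo_location_terms={quote_plus(city)}"
    )


def monitor_payload(category, city):
    """JSON body for one monitor, using only fields from the curl example."""
    return {
        "url": yp_url(category, city),
        "name": f"{city} {category}: new listings",
        "metric": "new_listings",
        "condition": "gte",
        "threshold": 1,
        "check_interval_hours": 24,
        "notify_channels": ["email"],
    }


def client_plan(clients):
    """Yield one plan per client: search URLs, state file, monitor bodies."""
    for client, (categories, cities) in clients.items():
        cells = [(cat, city) for cat in categories for city in cities]
        yield {
            "client": client,
            "urls": [yp_url(cat, city) for cat, city in cells],
            "state_path": f"{client}_seen_ids.json",  # one dedupe state per client
            "monitors": [monitor_payload(cat, city) for cat, city in cells],
        }


for plan in client_plan(CLIENTS):
    print(plan["client"], len(plan["urls"]), plan["state_path"])
```

Each plan's urls feed submit and collect, its state_path goes to net_new so one client's history never suppresses another client's leads, and each entry in monitors could be posted as the JSON body of the monitor request shown earlier.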

The Honest Fit

This approach fits well when your clients sell to local SMBs in defined categories and metros — home services, dental and medical clinics, trades, restaurants, local retail — and you want a clean, deduped, contact-enriched, net-new feed each week without standing up your own headless-browser fleet, proxy rotation, and dedupe-state store. The async leads endpoint, the explicit category-by-city grid, and the stable id used both within and across cycles are the primitives that make a weekly cadence reliable rather than a Monday scramble.

Where it is not the right tool: if your client sells to enterprise and needs firmographics like employee count, revenue band, or funding history, a directory does not carry those and a B2B data vendor will serve you better. And the email caveat is worth repeating honestly: enrichment reads public business websites, so coverage is strong for niches where every business maintains a site and thinner for trades where many do not. For local SMB outreach that trade is the right one — the email carries the cold-email channel, and the phone carries the rest of the list.

Get Started

Sign up at logposervices.com and generate an API key under Tool → API Keys, then export it so the Python scripts above can read it:

export LOGPOSE_API_KEY=lp_xxxxxxx

Test one search, then build the grid:

curl -G "https://api.logposervices.com/api/v1/ecommerce/yellowpages/leads" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "url=https://www.yellowpages.com/search?search_terms=dentists&geo_location_terms=Austin%2C+TX" \
  --data-urlencode "pages=3"

Then run the search_grid helper over your client's category and city lists, submit one /api/v1/ecommerce/yellowpages/leads?url=...&pages=3 job per cell, dedupe within the cycle and against seen_ids.json for net-new, and write the outreach-ready CSV. If you also want a second source, GET /api/v1/ecommerce/googlemaps/leads runs the same enrich-and-dedupe shape on Google Maps and merges cleanly into the same feed.

Related reading: How to build a B2B lead list from Yellow Pages with no code for the directory fundamentals, How to enrich business leads with emails, phones, and socials for the website-enrichment step in depth, and How to scrape Yellow Pages emails for cold outreach for the email-coverage details.

External: Yellow Pages, hiQ Labs v. LinkedIn, CAN-SPAM Act Compliance Guide.
