How to Enrich Business Leads with Emails, Phones, and Socials

12 min read

The most common lead-gen complaint sounds the same regardless of niche: "I have a list of company names, but I don't have emails or phone numbers." A spreadsheet of business names is almost worthless on its own — the work that makes the list usable is enrichment, which means starting with whatever public source has the broadest coverage for your niche and then chaining through two or three additional steps to fill in the contact fields your sales team actually dials and emails. This guide walks the full pipeline end to end: seed from a public source, extract the website, discover the email, find the LinkedIn, and run quality control on the merged output.

The Enrichment Ladder

Every working enrichment pipeline is the same four rungs. You climb them in order because each rung depends on the field the previous one filled in.

Step | Input | Output | Coverage
1. Seed | Category + city ("dentists in Austin") | Name, address, phone, category, website, rating | 95–100%
2. Website normalization | Website URL from step 1 | Cleaned domain, canonical homepage | 60–80%
3. Email discovery | Domain from step 2 | One or more verified emails | 30–60%
4. Social discovery | Name + domain | LinkedIn company page, optionally Facebook/Instagram | 40–70%

The match rate compounds, so the realistic end-to-end yield from a 1,000-row seed is 200–400 fully-enriched leads. Setting expectations honestly with the team consuming the list is the single most important non-technical step.
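To make the compounding concrete, here is a minimal sketch that multiplies the per-rung coverage ranges from the table. It assumes the yield refers to emailable rows (rungs 1 to 3) and that the rungs are independent, which is a simplification; the function name is illustrative.

```python
# Per-rung coverage ranges (low, high) copied from the ladder table.
# Only the rungs needed to reach an email are included.
RUNGS = {
    "seed": (0.95, 1.00),
    "website": (0.60, 0.80),
    "email": (0.30, 0.60),
}


def compound_yield(seed_rows: int, rungs: dict = RUNGS) -> tuple[int, int]:
    """Multiply the low and high coverage of every rung to get a yield range."""
    low = high = float(seed_rows)
    for low_rate, high_rate in rungs.values():
        low *= low_rate
        high *= high_rate
    return round(low), round(high)


if __name__ == "__main__":
    low, high = compound_yield(1000)
    print(f"Expected emailable rows from 1,000 seeds: {low} to {high}")
```

The pure product gives 171 to 480 rows, a somewhat wider band than the 200–400 working range; multiplying in the social rung as well would lower both ends further.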

Step 1: The Seed Pull

Two public sources cover the vast majority of B2B niches: Yellow Pages and Google Maps. They have different strengths.

  • Yellow Pages is best for US-only national coverage of trades and local services. Coverage is very high for plumbers, electricians, roofers, contractors, attorneys, doctors. The weakness is the website field — Yellow Pages returns a website for only about 40% of listings.
  • Google Maps has the broadest international coverage and a much higher website fill rate (about 70–80% for active small businesses). It is the better seed when website is critical, which it is for steps 2 and 3.

The general rule: if you need email, start from Maps. If you only need phone and your targets are US trades or local services, start from Yellow Pages because its coverage of those categories is very high and the data structure is simpler.

Submit a Yellow Pages search:

curl -G "https://api.logposervices.com/api/v1/ecommerce/yellowpages/search" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "search_terms=dentists" \
  --data-urlencode "geo_location_terms=Austin, TX" \
  --data-urlencode "pages=3"
# → {"job_id": "yp_8f3a..."}

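Poll the job until it completes (the submit_and_wait helper in the full pipeline shows the status call), then fetch the result: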

curl -H "X-API-Key: lp_xxxxxxx" \
  "https://api.logposervices.com/api/v1/jobs/yp_8f3a/result"

The shape that comes back includes name, phone, website, address, categories, rating, review_count, and the YP-internal business_id. Three pages of YP return roughly 90 rows.
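The full pipeline later in this guide reads a seed CSV, so the fetched rows need to be written out first. The sketch below assumes you have already pulled the list of row dicts out of the result payload (the exact wrapper around that list is not shown here and may differ); write_seed_csv and SEED_FIELDS are illustrative names.

```python
import csv

# Column order for the seed file; field names follow the result shape above.
SEED_FIELDS = [
    "business_id", "name", "phone", "website",
    "address", "categories", "rating", "review_count",
]


def write_seed_csv(rows: list[dict], out_path: str) -> int:
    """Write result rows to a seed CSV, skipping repeated business_id values."""
    seen: set = set()
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as fo:
        writer = csv.DictWriter(fo, fieldnames=SEED_FIELDS, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            key = row.get("business_id")
            if key is not None:
                if key in seen:
                    continue  # same listing returned on more than one page
                seen.add(key)
            flat = dict(row)
            cats = flat.get("categories")
            # Assumption: categories may arrive as a list; flatten it for CSV.
            if isinstance(cats, (list, tuple)):
                flat["categories"] = "; ".join(str(c) for c in cats)
            writer.writerow(flat)
            written += 1
    return written
```

Deduplicating on business_id at this point keeps overlapping pages or overlapping city searches from inflating the row count before enrichment.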

If you want richer data and the website coverage matters, swap in Maps — same async submit-poll-result pattern, see How to scrape Google Maps for local business leads for the URL-building details.

Step 2: Normalizing the Website Field

The raw website field from any public source is noisy. Common cases the pipeline has to handle:

  • Tracking-redirect wrappers (https://yellowpages.com/r?...)
  • Subpages (example.com/contact) instead of the homepage
  • www versus apex inconsistencies
  • Trailing slashes and query strings
  • Facebook page URLs in the website slot (facebook.com/companyname)
  • Empty string or null for the ~25% of businesses with no website

The goal of this step is to derive a clean domain — example.com — that step 3 can plug into an email-discovery API.

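import re
from urllib.parse import urlparse

SOCIAL_HOSTS = ("facebook.com", "instagram.com", "twitter.com", "x.com", "linkedin.com")
DIRECTORY_HOSTS = ("yellowpages.com", "yelp.com", "google.com")


def _is_host_or_subdomain(host: str, domains: tuple[str, ...]) -> bool:
    """True if host equals one of the domains or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def normalize_domain(raw: str | None) -> str | None:
    """Take a noisy website field and return a clean host without a leading www."""
    if not raw:
        return None
    raw = raw.strip()
    if not raw:
        return None

    # Add scheme if missing so urlparse puts the host in the right slot
    if not raw.lower().startswith(("http://", "https://")):
        raw = "https://" + raw

    try:
        # .hostname lowercases the host and drops any port or userinfo
        host = urlparse(raw).hostname
    except ValueError:
        return None
    if not host:
        return None

    # Reject social-as-website. Match whole domains, not substrings:
    # a plain `"x.com" in host` test would also reject dentalmax.com.
    if _is_host_or_subdomain(host, SOCIAL_HOSTS):
        return None

    # Reject directory redirect wrappers
    if _is_host_or_subdomain(host, DIRECTORY_HOSTS):
        return None

    # Strip leading www. so the result can be used as the email-discovery key
    return re.sub(r"^www\.", "", host) or None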

The two non-obvious calls in that function are rejecting Facebook URLs and directory wrappers. Both happen often enough in the raw output that not handling them drops your enrichment match rate by 5–10 points.

Step 3: Discovering the Email

This is the rung where you stop building and start consuming a third-party service. There is no free way to derive a verified email from a domain at scale, so most pipelines call out to an established provider (Hunter, Apollo, Snov, and FindThatLead are four common ones), all of which expose REST APIs with paid, credit-based or per-lookup pricing.

The common interface is domain_search (or equivalent), which takes the apex domain and returns every email the provider has indexed for that domain plus a confidence score.

import os
import requests


def hunter_domain_emails(domain: str) -> list[dict]:
    """Return Hunter's known emails for a domain, sorted by confidence."""
    r = requests.get(
        "https://api.hunter.io/v2/domain-search",
        params={"domain": domain, "api_key": os.environ["HUNTER_API_KEY"]},
        timeout=15,
    )
    r.raise_for_status()
    data = r.json().get("data", {})
    emails = data.get("emails", [])
    return sorted(emails, key=lambda e: e.get("confidence", 0), reverse=True)

The pattern that gives the highest signal-to-noise: for each domain, pull the top three emails by confidence, then filter to ones with a job-title pattern that matches your buyer (owner, founder, manager, director) and discard the generic mailboxes (info@, sales@, contact@) unless that is your only option for a row.

When the third-party provider returns nothing for a domain — and this happens for roughly half of small-business domains — the practical fallback is to scrape the website's /contact and /about pages directly and regex out any mailto links. That fallback recovers an additional 10–15% of the previously-empty rows.
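A minimal sketch of the parsing half of that fallback follows. Fetching the /contact and /about pages is left to the caller; the function takes raw HTML and returns the addresses found in mailto links. The names are illustrative, and the regex is a pragmatic approximation, not a full HTML parser.

```python
import re
from html import unescape
from urllib.parse import unquote

# Capture the address part of href="mailto:..." up to a quote, "?" or space.
MAILTO_RE = re.compile(r"""href\s*=\s*["']mailto:([^"'?>\s]+)""", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[^\s@]+@[^\s@]+\.[^\s@]+$")


def extract_mailto_emails(html: str) -> list[str]:
    """Return unique, lowercased emails found in mailto links, in page order."""
    found: list[str] = []
    for raw in MAILTO_RE.findall(html):
        # A mailto link may be entity- or percent-encoded and may list
        # several comma-separated recipients.
        for candidate in unquote(unescape(raw)).split(","):
            email = candidate.strip().lower()
            if EMAIL_RE.match(email) and email not in found:
                found.append(email)
    return found


if __name__ == "__main__":
    page = '<a href="mailto:Owner@Example.com?subject=Hi">Email us</a>'
    print(extract_mailto_emails(page))
```

Emails recovered this way are unverified, so they should go through the same quality-control checks as provider results.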

Step 4: Finding the LinkedIn

LinkedIn does not publish a public company-search API for non-customers, but Google indexes LinkedIn company pages and a site-restricted Google query finds them reliably. You build the query as "Company Name" site:linkedin.com/company, run it through a search-API call, and take the first organic result.

curl -G "https://api.logposervices.com/api/v1/search/google/search" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode 'q="Smith Family Dental" site:linkedin.com/company' \
  --data-urlencode "pages=1"
# → {"job_id": "g_8f3a..."}

Three caveats matter here. First, the match rate is roughly 40–70% — many small businesses simply do not have a LinkedIn company page. Second, false positives happen when a query like "Smith Dental" matches the wrong location's LinkedIn page; defending against that requires checking that the LinkedIn page's listed city matches the seed row's city, which means a second scrape. Third, this step is high-cost per row compared to steps 1–3, so most pipelines only enrich the rows that already have a verified email — that is, you climb the ladder in order and skip the upper rungs for rows that fall off lower down.

The Full Pipeline

Putting all four steps together. This script reads a seed CSV produced by step 1, enriches each row, and writes a fully-enriched output CSV. The structure is deliberately linear so the failure modes are easy to debug.

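import csv
import os
import time

import requests

# normalize_domain (step 2) and hunter_domain_emails (step 3) are the functions
# defined earlier in this guide; paste or import them into the same module.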

API_KEY = os.environ["LOGPOSE_API_KEY"]
HUNTER_KEY = os.environ["HUNTER_API_KEY"]
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def submit_and_wait(path: str, params: dict, timeout_s: int = 120) -> dict:
    r = requests.get(f"{BASE}/{path}", params=params, headers=HEADERS, timeout=30)
    r.raise_for_status()
    job_id = r.json()["job_id"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
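        status_resp = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS, timeout=15)
        status_resp.raise_for_status()  # surface auth or server errors instead of a KeyError
        s = status_resp.json()
        if s["status"] == "completed":
            result_resp = requests.get(f"{BASE}/jobs/{job_id}/result", headers=HEADERS, timeout=15)
            result_resp.raise_for_status()
            return result_resp.json()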
        if s["status"] == "failed":
            raise RuntimeError(s.get("error", "unknown failure"))
        time.sleep(2)
    raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")


def enrich_row(row: dict) -> dict:
    out = {**row, "email": "", "linkedin": "", "enriched_at": ""}

    domain = normalize_domain(row.get("website"))
    if not domain:
        return out

    # Step 3: emails for the domain
    try:
        emails = hunter_domain_emails(domain)
        if emails:
            out["email"] = emails[0]["value"]
    except requests.RequestException:  # HTTP errors, timeouts, connection failures
        pass

    # Step 4: LinkedIn — only run if email already found, to control cost
    if out["email"]:
        try:
            serp = submit_and_wait(
                "search/google/search",
                {"q": f'"{row["name"]}" site:linkedin.com/company', "pages": 1},
            )
            first = next((r for r in serp.get("organic", []) if "linkedin.com/company" in r.get("url", "")), None)
            if first:
                out["linkedin"] = first["url"]
        except (requests.RequestException, RuntimeError, TimeoutError):
            pass

    out["enriched_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    return out


def enrich_csv(in_path: str, out_path: str) -> tuple[int, int]:
    enriched_count = 0
    total = 0
    with open(in_path, encoding="utf-8") as fi, open(out_path, "w", newline="", encoding="utf-8") as fo:
        reader = csv.DictReader(fi)
        fieldnames = (reader.fieldnames or []) + ["email", "linkedin", "enriched_at"]
        writer = csv.DictWriter(fo, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            total += 1
            enriched = enrich_row(row)
            if enriched["email"] or enriched["linkedin"]:
                enriched_count += 1
            writer.writerow(enriched)
    return enriched_count, total


if __name__ == "__main__":
    matched, total = enrich_csv("seed_austin_dentists.csv", "enriched_austin_dentists.csv")
    rate = matched * 100 // total if total else 0  # avoid dividing by zero on an empty seed
    print(f"{matched}/{total} rows enriched ({rate}% match rate)")

Run it against the CSV from step 1 and you have an enriched list. The match-rate line at the end is the only metric that matters — track it across runs and you'll quickly learn which niches enrich well and which do not.

Quality Control

Before the enriched list goes to the sales team, four checks save the most pain.

Email format validation. Run every discovered email through a regex (^[^\s@]+@[^\s@]+\.[^\s@]+$) and reject anything that fails. Hunter and similar providers occasionally return malformed values; catching them at the pipeline boundary saves bounce-rate damage later.

Bounce verification. Either run the discovered emails through a verification API (NeverBounce, ZeroBounce, MailboxValidator) before they go into a sequence, or rely on the warmup tooling in your outbound platform to bounce-test inside the first send. The first option is more expensive per row; the second risks burning your sending domain's reputation if the bounce rate exceeds 5%.

Generic-mailbox filter. Emails like info@, contact@, sales@, support@ should be flagged so the sales team knows they did not receive a personal email and adjusts the outreach accordingly. Most pipelines move these to a separate sheet for a different sequence.

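Phone normalization. The phone field from Yellow Pages and Maps comes in three formats ((512) 555-0142, 512-555-0142, +1 512 555 0142). Normalize to +1XXXXXXXXXX for the dialer integration. Stripping punctuation alone is not enough, because only the third format carries the country code: strip every non-digit with re.sub(r"\D", "", phone), drop a leading 1 from 11-digit results, and prefix +1 to the remaining 10 digits.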

The cleaned, validated, enriched CSV is what the sales team gets. The raw enriched output is what you keep in cold storage in case you need to re-derive.

Scaling to Thousands of Leads

Three scale points start to bite around the 1,000-row mark.

The seed scrape. A 1,000-row seed needs roughly 10 separate searches because a single Yellow Pages or Maps query maxes out around 100 unique results. Run them as a bulk submission instead of sequentially:

requests.post(
    f"{BASE}/ecommerce/yellowpages/search/bulk",
    headers=HEADERS,
    json={
        "targets": [
            {"search_terms": "dentists", "geo_location_terms": "Austin, TX", "pages": 3},
            {"search_terms": "dentists", "geo_location_terms": "Round Rock, TX", "pages": 3},
            {"search_terms": "dentists", "geo_location_terms": "Cedar Park, TX", "pages": 3},
        ]
    },
).raise_for_status()

Bulk runs the targets in parallel up to your concurrency cap, which cuts a 10-target seed pull from 10 minutes sequential to roughly 2 minutes wall-clock.

The enrichment loop. Step 3 and step 4 are I/O-bound and embarrassingly parallel. Switch the enrich_row loop to a thread pool or, better, an asyncio.gather over an httpx.AsyncClient pool of 10–20 workers. Empirically this drops a 1,000-row enrich from about 90 minutes to 10 minutes, and the third-party providers' rate limits become the bottleneck before the platform does.
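A minimal thread-pool version of that loop might look like the sketch below. It takes the per-row function as a parameter, so enrich_row from the full pipeline can be passed in unchanged; enrich_parallel is an illustrative name, and the worker count is an assumption to tune against your provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def enrich_parallel(
    rows: list[dict],
    enrich_fn: Callable[[dict], dict],
    max_workers: int = 10,
) -> list[dict]:
    """Run enrich_fn over rows in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input rows,
        # so the output CSV stays aligned with the seed.
        return list(pool.map(enrich_fn, rows))
```

Because pool.map re-raises the first exception it encounters, this works best with a per-row function that catches its own network errors, as enrich_row does.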

Recurring refresh. Once the pipeline runs, the natural next step is to monitor the seed for net-new businesses — businesses that have appeared in a Yellow Pages or Maps result since the last scrape. The pattern is documented for the Yellow Pages case in How to monitor Yellow Pages for new businesses in your category; the same diff-loop logic applies if you key on the seed's stable identifier (Yellow Pages business_id, Google Maps cid).
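The diff itself is small once both scrapes share a stable identifier. This sketch assumes each row is a dict and that the identifier lives under business_id (swap in cid for Maps); net_new is an illustrative name.

```python
def net_new(
    previous_rows: list[dict],
    current_rows: list[dict],
    key: str = "business_id",
) -> list[dict]:
    """Return rows from the current scrape whose identifier was not seen before."""
    seen = {row[key] for row in previous_rows if row.get(key)}
    # Rows without an identifier cannot be compared, so they are skipped.
    return [row for row in current_rows if row.get(key) and row[key] not in seen]
```

Only the rows this returns need to go through enrichment, which keeps provider credits proportional to what actually changed.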

Legality and Ethics

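The seed-scrape step (Yellow Pages, Maps) is on comparatively favorable legal ground in the US for public business data. The usual reference point is hiQ Labs v. LinkedIn (9th Cir. 2022), which held that scraping publicly accessible pages is unlikely to violate the CFAA; that ruling binds only courts in the Ninth Circuit and does not rule out breach-of-contract claims based on a site's terms of service. In the EU, B2B contact data is commonly processed under GDPR's legitimate-interest basis, but a named individual's business email is still personal data, so that basis has to be assessed, not assumed. The email-discovery step relies on third-party providers that have their own ToS and describe their data sources in their documentation; using a provider does not remove your own responsibility for how you use the data.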

The real compliance work is the outreach. CAN-SPAM (US), CASL (Canada), and the GDPR / ePrivacy regime (EU) each impose distinct rules on cold-email and cold-call campaigns: opt-out mechanisms, sender identification, transparency on data source when asked, and in some jurisdictions an explicit consent requirement before a first email. None of those are pipeline problems — they are outreach-tooling and legal-review problems.

Common Mistakes

  • Enriching every seed row instead of only the qualified ones. Filter the seed before enrichment — drop rows with no website (no chance of an email), no phone (no chance of a dial), or zero reviews on Google Maps (disproportionately closed businesses).
  • Trusting the website field unfiltered. Maps and Yellow Pages occasionally put a Facebook URL or a directory redirect in the website slot. Without the normalize_domain step, those silently destroy your email match rate.
  • Skipping verification before launch. Sending to an unverified enriched list can push bounce rates above 5%, which risks the sending domain being throttled by major inbox providers for weeks.
  • Re-running enrichment too often. Most enriched fields change slowly — emails maybe once a year, websites less than that. Re-enriching the same row every week burns provider credits with near-zero new signal. Re-enrich quarterly at most.
  • Treating phone and email as interchangeable for local services. For trades and home services, phone is the working channel and email is the polite-rejection channel. Build the pipeline around the phone column and treat email as a bonus field.
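The pre-enrichment filter from the first mistake can be sketched as a single predicate. It assumes seed rows are dicts read from CSV with website, phone, and review_count columns; the review check is optional because it only applies to Maps seeds. The function name is illustrative.

```python
def is_qualified(row: dict, require_reviews: bool = False) -> bool:
    """Decide whether a seed row is worth spending enrichment credits on."""
    has_website = bool((row.get("website") or "").strip())
    has_phone = bool((row.get("phone") or "").strip())
    if not (has_website and has_phone):
        return False
    if require_reviews:
        try:
            # CSV values arrive as strings, so convert before comparing.
            return int(row.get("review_count") or 0) > 0
        except (TypeError, ValueError):
            return False  # unparseable count is treated as unqualified
    return True


if __name__ == "__main__":
    seed = [
        {"name": "A", "website": "a.com", "phone": "512-555-0142", "review_count": "12"},
        {"name": "B", "website": "", "phone": "512-555-0143", "review_count": "3"},
    ]
    print([r["name"] for r in seed if is_qualified(r, require_reviews=True)])
```

If the campaign is phone-first, relaxing the website requirement is a reasonable variation, since those rows can still be dialed.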

Get Started

  1. Sign up at logposervices.com and generate an API key under Tool → API Keys.
  2. export LOGPOSE_API_KEY=lp_xxxxxxx and export HUNTER_API_KEY=... (or your provider of choice).
  3. Run a Yellow Pages or Google Maps seed scrape for your target niche and city.
  4. Pipe the resulting CSV through the enrichment script above.
  5. Validate, dedupe, and hand off to the sales team.

Related reading: How to build a B2B lead list from Yellow Pages (no code) for the simplest possible seed, How to scrape Google Maps for local business leads for the higher-coverage alternative, and How to monitor Yellow Pages for new businesses for the recurring-refresh pattern that turns this from a one-off into a pipeline.

External: Hunter.io, Apollo.io, Snov.io, hiQ Labs v. LinkedIn, CAN-SPAM Act.

Frequently asked questions

What does 'lead enrichment' actually mean?
Enrichment is the process of taking a sparse lead record — usually just a business name and a city — and progressively filling in the fields that make it usable for outbound sales: phone, address, website, decision-maker name, email, and LinkedIn profile. You start with whatever public source has the cheapest broadest coverage (a Maps scrape or a directory like Yellow Pages), then chain through a website discovery step, then an email-discovery step. Each step has its own match rate; the realistic end-to-end yield from a 1,000-row seed list is 200–400 emailable contacts.
Why not just buy a list from ZoomInfo or Apollo?
Two reasons. First, the bought-list market is heavily biased toward enterprise titles and US-based companies — for local-services niches (plumbers, dentists, contractors, gyms, salons) and for non-US markets, coverage is poor and stale. Second, the marginal cost per enriched lead in a self-built pipeline is roughly an order of magnitude lower than a bought list once you factor in the per-seat platform fees. The trade-off is build time: a pipeline that pulls from Maps and then enriches via Hunter or Apollo's API takes a day to set up, but then runs on a cron for the next year.
What is a realistic email-discovery match rate?
The honest number is 30–60% for most niches, and lower for some. Discovery rates are highest for software companies and professional services where the company website prominently lists staff (60%+), and lowest for local services and contractor businesses where the website is often a one-page site with only a generic info@ address (20–35%). The realistic working number for a mixed B2B prospect list is 40–45% of seed rows ending with a verified email.
Is phone or email the more useful identifier for local-services outbound?
Phone, by a wide margin. For local services (home services, healthcare, hospitality, professional services), the decision-maker is the owner-operator, the owner-operator answers the phone, and a five-minute conversation outperforms a multi-step email sequence on response rate by roughly an order of magnitude. Lead programs targeting local services spend disproportionate effort on email discovery when they should be cleaning the phone column and dialing. For B2B SaaS-to-enterprise outbound, the inverse is true — email is the only channel that scales.
What's the legal posture on combining a Maps scrape with an email-discovery API?
In the US, both steps sit on comparatively favorable ground for public business data: in hiQ Labs v. LinkedIn (9th Cir. 2022) the court held that scraping publicly accessible pages is unlikely to violate the CFAA, although the ruling binds only the Ninth Circuit and leaves terms-of-service claims open. In the EU, B2B contact data is commonly processed under GDPR's legitimate-interest basis, which still has to be assessed where named individuals are involved. The downstream step, cold-email outreach, is where most of the compliance work sits. CAN-SPAM in the US, CASL in Canada, and the GDPR / ePrivacy regime in the EU each impose distinct requirements on the outreach itself: opt-out language, identification, transparency about the data source on request. The enrichment pipeline is rarely the riskiest step; the outreach campaign that consumes it usually is.