How to Enrich Business Leads with Emails, Phones, and Socials
The most common lead-gen complaint sounds the same regardless of niche: "I have a list of company names, but I don't have emails or phone numbers." A spreadsheet of business names is almost worthless on its own. What makes the list usable is enrichment: start with whichever public source has the broadest coverage for your niche, then chain two or three further steps to fill in the contact fields your sales team actually dials and emails. This guide walks the full pipeline end to end: seed from a public source, normalize the website, discover the email, find the LinkedIn, and run quality control on the merged output.
The Enrichment Ladder
Every working enrichment pipeline is the same four rungs. You climb them in order because each rung depends on the field the previous one filled in.
| Step | Input | Output | Coverage |
|---|---|---|---|
| 1. Seed | Category + city ("dentists in Austin") | Name, address, phone, category, website, rating | 95–100% |
| 2. Website normalization | Website URL from step 1 | Cleaned domain, canonical homepage | 60–80% |
| 3. Email discovery | Domain from step 2 | One or more verified emails | 30–60% |
| 4. Social discovery | Name + domain | LinkedIn company page, optionally Facebook/Instagram | 40–70% |
The match rates compound, so the realistic end-to-end yield from a 1,000-row seed is 200–400 leads with a usable email, and fewer once a LinkedIn match is also required. Setting expectations honestly with the team consuming the list is the single most important non-technical step.
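To make the compounding concrete, the sketch below multiplies the low and high bounds from the table. It assumes each coverage figure is conditional on the previous rung having succeeded, which the table does not state explicitly; the `COVERAGE` dictionary and `compound_yield` helper are illustrative names, not part of any API.

```python
from math import prod

# Per-step coverage ranges copied from the table: (low, high).
COVERAGE = {
    "seed": (0.95, 1.00),
    "website": (0.60, 0.80),
    "email": (0.30, 0.60),
    "social": (0.40, 0.70),
}


def compound_yield(steps: list[str], seed_rows: int = 1000) -> tuple[int, int]:
    """Multiply the low and high bounds of the chosen steps and scale to seed_rows."""
    low = prod(COVERAGE[s][0] for s in steps)
    high = prod(COVERAGE[s][1] for s in steps)
    return round(low * seed_rows), round(high * seed_rows)


print(compound_yield(["seed", "website", "email"]))            # (171, 480)
print(compound_yield(["seed", "website", "email", "social"]))  # (68, 336)
```

Under that assumption, the 200–400 figure sits inside the range for the first three rungs; requiring a LinkedIn match as well pulls the range down noticeably.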
Step 1: The Seed Pull
Two public sources cover the vast majority of B2B niches: Yellow Pages and Google Maps. They have different strengths.
- Yellow Pages is best for US-only national coverage of trades and local services. Coverage is very high for plumbers, electricians, roofers, contractors, attorneys, doctors. The weakness is the website field — Yellow Pages returns a website for only about 40% of listings.
- Google Maps has the broadest international coverage and a much higher website fill rate (about 70–80% for active small businesses). It is the better seed when website is critical, which it is for steps 2 and 3.
The general rule: if you need email, start from Maps. If you only need phone and your market is US trades or local services, start from Yellow Pages, because its coverage there is very high and the data structure is simpler.
Submit a Yellow Pages search:
curl -G "https://api.logposervices.com/api/v1/ecommerce/yellowpages/search" \
-H "X-API-Key: lp_xxxxxxx" \
--data-urlencode "search_terms=dentists" \
--data-urlencode "geo_location_terms=Austin, TX" \
--data-urlencode "pages=3"
# → {"job_id": "yp_8f3a..."}
Poll, then fetch:
curl -H "X-API-Key: lp_xxxxxxx" \
"https://api.logposervices.com/api/v1/jobs/yp_8f3a/result"
The shape that comes back includes name, phone, website, address, categories, rating, review_count, and the YP-internal business_id. Three pages of YP return roughly 90 rows.
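The later steps read the seed from a CSV, so the result rows need to be flattened to disk at some point. A minimal sketch follows. It assumes the result has already been unwrapped into a list of dicts carrying the field names listed above (the envelope key of the real response is not shown here), and that categories may arrive as a list. `write_seed_csv` and `SEED_FIELDS` are hypothetical names.

```python
import csv
from typing import Iterable, TextIO

# Column order for the seed file; names follow the fields listed in the text.
SEED_FIELDS = [
    "business_id", "name", "phone", "website",
    "address", "categories", "rating", "review_count",
]


def write_seed_csv(rows: Iterable[dict], out: TextIO) -> int:
    """Write result rows to an open text file as CSV; return the row count."""
    writer = csv.DictWriter(out, fieldnames=SEED_FIELDS, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for row in rows:
        flat = {key: row.get(key, "") for key in SEED_FIELDS}
        # Assumption: categories may be a list; join it into a single cell.
        if isinstance(flat["categories"], (list, tuple)):
            flat["categories"] = "; ".join(map(str, flat["categories"]))
        writer.writerow(flat)
        count += 1
    return count
```

In use, open the target with `open("seed_austin_dentists.csv", "w", newline="", encoding="utf-8")` and pass the file object as `out`.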
If you want richer data and the website coverage matters, swap in Maps — same async submit-poll-result pattern, see How to scrape Google Maps for local business leads for the URL-building details.
Step 2: Normalizing the Website Field
The raw website field from any public source is noisy. Common cases the pipeline has to handle:
- Tracking-redirect wrappers (https://yellowpages.com/r?...)
- Subpages (example.com/contact) instead of the homepage
- www versus apex inconsistencies
- Trailing slashes and query strings
- Facebook page URLs in the website slot (facebook.com/companyname)
- Empty string or null for the ~25% of businesses with no website
The goal of this step is to derive a clean domain — example.com — that step 3 can plug into an email-discovery API.
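import re
from urllib.parse import urlparse

# Hosts that are never the business's own website
SOCIAL_HOSTS = ("facebook.com", "instagram.com", "twitter.com", "x.com", "linkedin.com")
DIRECTORY_HOSTS = ("yellowpages.com", "yelp.com", "google.com")


def _matches(host: str, bases: tuple[str, ...]) -> bool:
    """True if host equals one of bases or is a subdomain of one."""
    return any(host == b or host.endswith("." + b) for b in bases)


def normalize_domain(raw: str | None) -> str | None:
    """Take a noisy website field and return a clean domain without a leading www."""
    if not raw:
        return None
    raw = raw.strip()
    if not raw:
        return None
    # Add scheme if missing so urlparse works
    if not raw.lower().startswith(("http://", "https://")):
        raw = "https://" + raw
    try:
        # .hostname is lowercased and excludes any port or userinfo
        host = urlparse(raw).hostname or ""
    except ValueError:
        return None
    # Reject social-as-website. Match whole hosts: a substring test for
    # "x.com" would also reject unrelated domains such as dropbox.com.
    if _matches(host, SOCIAL_HOSTS):
        return None
    # Reject directory redirect wrappers
    if _matches(host, DIRECTORY_HOSTS):
        return None
    # Strip leading www. so the result works as the email-discovery key
    apex = re.sub(r"^www\.", "", host)
    return apex or None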
The two non-obvious calls in that function are rejecting Facebook URLs and directory wrappers. Both happen often enough in the raw output that not handling them drops your enrichment match rate by 5–10 points.
Step 3: Discovering the Email
This is the rung where you stop building and start consuming a third-party service. There is no free way to derive a verified email from a domain at scale, so working pipelines typically call out to an established provider such as Hunter, Apollo, Snov, or FindThatLead. These expose REST APIs with per-lookup or credit-based pricing.
The common interface is domain_search (or equivalent), which takes the apex domain and returns every email the provider has indexed for that domain plus a confidence score.
import os

import requests


def hunter_domain_emails(domain: str) -> list[dict]:
    """Return Hunter's known emails for a domain, sorted by confidence."""
    r = requests.get(
        "https://api.hunter.io/v2/domain-search",
        params={"domain": domain, "api_key": os.environ["HUNTER_API_KEY"]},
        timeout=15,
    )
    r.raise_for_status()
    data = r.json().get("data", {})
    emails = data.get("emails", [])
    return sorted(emails, key=lambda e: e.get("confidence", 0), reverse=True)
The pattern that gives the highest signal-to-noise: for each domain, pull the top three emails by confidence, then filter to ones with a job-title pattern that matches your buyer (owner, founder, manager, director) and discard the generic mailboxes (info@, sales@, contact@) unless that is your only option for a row.
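The selection rule above can be sketched as a small ranking function. This is one possible reading of it, not a provider-defined algorithm. It assumes each email dict carries `value`, `confidence`, and optionally `position` and `type` keys, as Hunter-style responses generally do; treat those key names as assumptions for other providers. `pick_best_email` is a hypothetical helper.

```python
import re
from typing import Optional

# Generic mailboxes named in the text; extend to suit your niche.
GENERIC_LOCAL_PARTS = {"info", "sales", "contact", "support"}
# Job titles that match the buyer roles named in the text.
BUYER_TITLE = re.compile(r"\b(owner|founder|manager|director)\b", re.IGNORECASE)


def _is_generic(email: dict) -> bool:
    """Treat an address as generic if the provider says so or the local part is a known mailbox."""
    local_part = (email.get("value") or "").split("@", 1)[0].lower()
    return email.get("type") == "generic" or local_part in GENERIC_LOCAL_PARTS


def pick_best_email(emails: list[dict], top_n: int = 3) -> Optional[str]:
    """Pick one address: buyer-titled first, then any personal one, then generic."""
    ranked = sorted(emails, key=lambda e: e.get("confidence") or 0, reverse=True)[:top_n]
    personal = [e for e in ranked if not _is_generic(e)]
    titled = [e for e in personal if BUYER_TITLE.search(e.get("position") or "")]
    for bucket in (titled, personal, ranked):
        if bucket:
            return bucket[0]["value"]
    return None
```

The final fallback to `ranked` is what keeps a generic mailbox when it is the only option for a row.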
When the third-party provider returns nothing for a domain — and this happens for roughly half of small-business domains — the practical fallback is to scrape the website's /contact and /about pages directly and regex out any mailto links. That fallback recovers an additional 10–15% of the previously-empty rows.
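A minimal sketch of the mailto fallback is shown below. It works on an HTML string, so the fetch is left to the caller (for example `requests.get(f"https://{domain}/contact", timeout=10).text`). It only recovers addresses inside `mailto:` links, not addresses written as plain text, and `extract_mailto_emails` is a hypothetical name.

```python
import html
import re
from urllib.parse import unquote

# Capture the address part of href="mailto:..." up to a quote, "?" or whitespace.
MAILTO_RE = re.compile(r"""href\s*=\s*["']mailto:([^"'?\s]+)""", re.IGNORECASE)


def extract_mailto_emails(page_html: str) -> list[str]:
    """Return unique, lowercased addresses from mailto: links, in page order."""
    seen: dict[str, None] = {}
    for raw in MAILTO_RE.findall(page_html):
        # Undo HTML entities (&#64;) and percent-encoding (%40).
        decoded = unquote(html.unescape(raw))
        # A single mailto: link may list several comma-separated recipients.
        for addr in decoded.split(","):
            addr = addr.strip().lower()
            if addr and "@" in addr:
                seen.setdefault(addr, None)
    return list(seen)
```

Results from this fallback are unverified, so they should go through the same quality-control checks as provider results.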
Step 4: Finding the LinkedIn
LinkedIn does not publish a public company-search API for non-customers, but Google indexes LinkedIn company pages and a site-restricted Google query finds them reliably. You build the query as "Company Name" site:linkedin.com/company, run it through a search-API call, and take the first organic result.
curl -G "https://api.logposervices.com/api/v1/search/google/search" \
-H "X-API-Key: lp_xxxxxxx" \
--data-urlencode 'q="Smith Family Dental" site:linkedin.com/company' \
--data-urlencode "pages=1"
# → {"job_id": "g_8f3a..."}
Three caveats matter here. First, the match rate is roughly 40–70% — many small businesses simply do not have a LinkedIn company page. Second, false positives happen when a query like "Smith Dental" matches the wrong location's LinkedIn page; defending against that requires checking that the LinkedIn page's listed city matches the seed row's city, which means a second scrape. Third, this step is high-cost per row compared to steps 1–3, so most pipelines only enrich the rows that already have a verified email — that is, you climb the ladder in order and skip the upper rungs for rows that fall off lower down.
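Two small helpers cover the mechanical parts of this step: building the query safely and picking the first company page from the results. The sketch assumes the search result exposes a list of organic results with a `url` key, as the pipeline script in this guide does. Both function names are hypothetical, and neither performs the city check described in the second caveat.

```python
from typing import Optional
from urllib.parse import urlparse


def build_linkedin_query(company_name: str) -> str:
    """Build the site-restricted query; embedded quotes would break the phrase match."""
    cleaned = " ".join(company_name.replace('"', " ").split())
    return f'"{cleaned}" site:linkedin.com/company'


def first_company_page(organic: list[dict]) -> Optional[str]:
    """Return the first result URL that points at a LinkedIn company page."""
    for result in organic:
        url = result.get("url") or ""
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        on_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
        if on_linkedin and parsed.path.startswith("/company/"):
            return url
    return None
```

Checking the parsed host and path, instead of a substring, avoids accepting a third-party page that merely mentions a LinkedIn company URL in its own address.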
The Full Pipeline
Putting all four steps together. This script reads a seed CSV produced by step 1, enriches each row, and writes a fully-enriched output CSV. The structure is deliberately linear so the failure modes are easy to debug.
import csv
import os
import time

import requests

# normalize_domain (step 2) and hunter_domain_emails (step 3) are the
# functions defined earlier in this guide.
API_KEY = os.environ["LOGPOSE_API_KEY"]
HUNTER_KEY = os.environ["HUNTER_API_KEY"]  # fail fast if unset; hunter_domain_emails reads it
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}


def submit_and_wait(path: str, params: dict, timeout_s: int = 120) -> dict:
    """Submit an async job, poll every 2 seconds until it finishes, and return the result."""
    r = requests.get(f"{BASE}/{path}", params=params, headers=HEADERS, timeout=30)
    r.raise_for_status()
    job_id = r.json()["job_id"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        s = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS, timeout=15).json()
        if s["status"] == "completed":
            return requests.get(f"{BASE}/jobs/{job_id}/result", headers=HEADERS, timeout=15).json()
        if s["status"] == "failed":
            raise RuntimeError(s.get("error", "unknown failure"))
        time.sleep(2)
    raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")


def enrich_row(row: dict) -> dict:
    """Run steps 2-4 for one seed row; failed lookups leave the field empty."""
    out = {**row, "email": "", "linkedin": "", "enriched_at": ""}
    domain = normalize_domain(row.get("website"))
    if not domain:
        return out
    # Step 3: emails for the domain
    try:
        emails = hunter_domain_emails(domain)
        if emails:
            out["email"] = emails[0]["value"]
    except requests.RequestException:
        # Covers HTTP errors, timeouts, and connection failures alike
        pass
    # Step 4: LinkedIn, only run if an email was found, to control cost
    if out["email"]:
        try:
            serp = submit_and_wait(
                "search/google/search",
                {"q": f'"{row["name"]}" site:linkedin.com/company', "pages": 1},
            )
            first = next(
                (hit for hit in serp.get("organic", []) if "linkedin.com/company" in hit.get("url", "")),
                None,
            )
            if first:
                out["linkedin"] = first["url"]
        except (requests.RequestException, RuntimeError, TimeoutError):
            pass
    out["enriched_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    return out


def enrich_csv(in_path: str, out_path: str) -> tuple[int, int]:
    """Enrich every row of in_path into out_path; return (enriched rows, total rows)."""
    enriched_count = 0
    total = 0
    with open(in_path, encoding="utf-8") as fi, open(out_path, "w", newline="", encoding="utf-8") as fo:
        reader = csv.DictReader(fi)
        fieldnames = (reader.fieldnames or []) + ["email", "linkedin", "enriched_at"]
        writer = csv.DictWriter(fo, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            total += 1
            enriched = enrich_row(row)
            if enriched["email"] or enriched["linkedin"]:
                enriched_count += 1
            writer.writerow(enriched)
    return enriched_count, total


if __name__ == "__main__":
    matched, total = enrich_csv("seed_austin_dentists.csv", "enriched_austin_dentists.csv")
    # Guard against an empty seed file before computing the percentage
    rate = matched * 100 // total if total else 0
    print(f"{matched}/{total} rows enriched ({rate}% match rate)")
Run it against the CSV from step 1 and you have an enriched list. The match-rate line at the end is the only metric that matters — track it across runs and you'll quickly learn which niches enrich well and which do not.
Quality Control
Before the enriched list goes to the sales team, four checks save the most pain.
Email format validation. Run every discovered email through a regex (^[^\s@]+@[^\s@]+\.[^\s@]+$) and reject anything that fails. Hunter and similar providers occasionally return malformed values; catching them at the pipeline boundary saves bounce-rate damage later.
Bounce verification. Either run the discovered emails through a verification API (NeverBounce, ZeroBounce, MailboxValidator) before they go into a sequence, or rely on the warmup tooling in your outbound platform to bounce-test inside the first send. The first option is more expensive per row; the second risks burning your sending domain's reputation if the bounce rate exceeds 5%.
Generic-mailbox filter. Emails like info@, contact@, sales@, support@ should be flagged so the sales team knows they did not receive a personal email and adjusts the outreach accordingly. Most pipelines move these to a separate sheet for a different sequence.
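The format check and the generic-mailbox flag can share one pass over the email column. The sketch below uses the regex quoted above and the four mailbox names listed in this section; `classify_email` and its three labels are hypothetical. It does not replace bounce verification, which needs an external service.

```python
import re

# Same pattern as quoted in the text: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^\s@]+@[^\s@]+\.[^\s@]+$")
# Generic mailboxes named in the text.
GENERIC_LOCAL_PARTS = {"info", "contact", "sales", "support"}


def classify_email(email: str) -> str:
    """Return 'invalid', 'generic', or 'personal' for one discovered address."""
    email = email.strip().lower()
    if not EMAIL_RE.match(email):
        return "invalid"
    local_part = email.split("@", 1)[0]
    return "generic" if local_part in GENERIC_LOCAL_PARTS else "personal"
```

Writing the label to its own column lets the sales team split generic rows into a separate sheet without rerunning the pipeline.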
Phone normalization. The phone field from Yellow Pages and Maps comes in three formats ((512) 555-0142, 512-555-0142, +1 512 555 0142). Normalize to +1XXXXXXXXXX for the dialer integration. The simplest first pass is re.sub(r"[^\d+]", "", phone), which strips the punctuation; ten-digit results then still need the +1 prefix added.
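A slightly fuller sketch of the phone step, assuming US numbers only: it strips everything but digits, drops a leading country code, and returns None for anything that is not ten digits, so bad values are visible instead of being passed to the dialer. `normalize_us_phone` is a hypothetical name.

```python
import re
from typing import Optional


def normalize_us_phone(phone: Optional[str]) -> Optional[str]:
    """Normalize a US phone number to +1XXXXXXXXXX, or return None if it is not one."""
    digits = re.sub(r"\D", "", phone or "")
    # Drop a leading US country code so all three input formats converge.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return "+1" + digits
```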
The cleaned, validated, enriched CSV is what the sales team gets. The raw enriched output is what you keep in cold storage in case you need to re-derive.
Scaling to Thousands of Leads
Three scale points start to bite around the 1,000-row mark.
The seed scrape. A 1,000-row seed needs roughly 10 separate searches because a single Yellow Pages or Maps query maxes out around 100 unique results. Run them as a bulk submission instead of sequentially:
requests.post(
    f"{BASE}/ecommerce/yellowpages/search/bulk",
    headers=HEADERS,
    json={
        "targets": [
            {"search_terms": "dentists", "geo_location_terms": "Austin, TX", "pages": 3},
            {"search_terms": "dentists", "geo_location_terms": "Round Rock, TX", "pages": 3},
            {"search_terms": "dentists", "geo_location_terms": "Cedar Park, TX", "pages": 3},
        ]
    },
).raise_for_status()
Bulk runs the targets in parallel up to your concurrency cap, which cuts a 10-target seed pull from 10 minutes sequential to roughly 2 minutes wall-clock.
The enrichment loop. Step 3 and step 4 are I/O-bound and embarrassingly parallel. Switch the enrich_row loop to a thread pool or, better, an asyncio.gather over an httpx.AsyncClient pool of 10–20 workers. Empirically this drops a 1,000-row enrich from about 90 minutes to 10 minutes, and the third-party providers' rate limits become the bottleneck before the platform does.
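As a sketch of the simpler of the two options, the thread-pool version needs only the standard library. The helper below takes the per-row function as a parameter so it can wrap an enrich function like the one in the pipeline script; `enrich_concurrently` is a hypothetical name, and the default of 10 workers is the low end of the range mentioned above, not a measured optimum.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def enrich_concurrently(
    rows: Iterable[dict],
    enrich: Callable[[dict], dict],
    max_workers: int = 10,
) -> list[dict]:
    """Apply enrich to every row on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order; an exception raised inside
        # enrich is re-raised here when its result is consumed.
        return list(pool.map(enrich, rows))
```

Because results come back in input order, the output can be written to the CSV exactly as in the sequential loop. Keep `max_workers` below the provider's rate limit.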
Recurring refresh. Once the pipeline runs, the natural next step is to monitor the seed for net-new businesses — businesses that have appeared in a Yellow Pages or Maps result since the last scrape. The pattern is documented for the Yellow Pages case in How to monitor Yellow Pages for new businesses in your category; the same diff-loop logic applies if you key on the seed's stable identifier (Yellow Pages business_id, Google Maps cid).
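The diff itself is a set-membership test on the stable identifier. A minimal sketch, assuming the previous run's identifiers have been saved somewhere and loaded into a set; `net_new_rows` is a hypothetical name, and `business_id` is the Yellow Pages field mentioned above (pass a different `id_field` for Maps).

```python
def net_new_rows(
    previous_ids: set[str],
    current_rows: list[dict],
    id_field: str = "business_id",
) -> list[dict]:
    """Return rows whose identifier was not present in the previous scrape."""
    seen = set(previous_ids)  # copy so the caller's set is not mutated
    fresh = []
    for row in current_rows:
        key = str(row.get(id_field) or "")
        # Skip rows with no identifier, and duplicates within the current scrape.
        if key and key not in seen:
            seen.add(key)
            fresh.append(row)
    return fresh
```

Only the rows returned here need to go through the email and LinkedIn steps on a refresh run.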
Legality and Ethics
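The seed-scrape step (Yellow Pages, Maps) is on comparatively firm legal ground in the US for public business data: in hiQ Labs v. LinkedIn (9th Cir. 2022) the court held that scraping publicly accessible pages likely does not violate the Computer Fraud and Abuse Act, though a site's terms of service can still support a breach-of-contract claim. In the EU, B2B contact data is commonly processed under GDPR's legitimate-interest basis, which depends on a balancing assessment rather than applying automatically. The email-discovery step relies on third-party providers that have their own ToS and disclose their data sources in their documentation; using them adds ordinary breach-of-contract exposure under those terms, and it does not remove your own responsibility for how you use the data.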
The real compliance work is the outreach. CAN-SPAM (US), CASL (Canada), and the GDPR / ePrivacy regime (EU) each impose distinct rules on cold-email and cold-call campaigns: opt-out mechanisms, sender identification, transparency on data source when asked, and in some jurisdictions an explicit consent requirement before a first email. None of those are pipeline problems — they are outreach-tooling and legal-review problems.
Common Mistakes
- Enriching every seed row instead of only the qualified ones. Filter the seed before enrichment — drop rows with no website (no chance of an email), no phone (no chance of a dial), or zero reviews on Google Maps (disproportionately closed businesses).
- Trusting the website field unfiltered. Maps and Yellow Pages occasionally put a Facebook URL or a directory redirect in the website slot. Without the normalize_domain step, those silently destroy your email match rate.
- Skipping verification before launch. Sending to an unverified enriched list drives bounce rates above 5%, which gets the sending domain throttled by every major inbox provider for weeks.
- Re-running enrichment too often. Most enriched fields change slowly — emails maybe once a year, websites less than that. Re-enriching the same row every week burns provider credits with near-zero new signal. Re-enrich quarterly at most.
- Treating phone and email as interchangeable for local services. For trades and home services, phone is the working channel and email is the polite-rejection channel. Build the pipeline around the phone column and treat email as a bonus field.
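The first mistake in the list, enriching unqualified rows, is cheap to avoid with a filter over the seed before any provider call. A minimal sketch, assuming the seed columns are named website, phone, and review_count as in the earlier steps; `is_qualified` is a hypothetical helper, and the review check is optional because it is described above for Google Maps seeds only.

```python
def is_qualified(row: dict, require_reviews: bool = False) -> bool:
    """Decide whether a seed row is worth spending enrichment credits on."""
    if not (row.get("website") or "").strip():
        return False  # no website, so no domain to search for an email
    if not (row.get("phone") or "").strip():
        return False  # no phone, so nothing to dial
    if require_reviews:
        try:
            # CSV values arrive as strings; tolerate "12" and "12.0".
            reviews = int(float(row.get("review_count") or 0))
        except (TypeError, ValueError):
            reviews = 0
        if reviews == 0:
            return False  # zero reviews: disproportionately closed businesses
    return True
```

For a phone-first campaign in trades or home services, the website condition may be too strict; dropping it keeps dialable rows that will never get an email.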
Get Started
- Sign up at logposervices.com and generate an API key under Tool → API Keys.
- export LOGPOSE_API_KEY=lp_xxxxxxx and export HUNTER_API_KEY=... (or your provider of choice).
- Run a Yellow Pages or Google Maps seed scrape for your target niche and city.
- Pipe the resulting CSV through the enrichment script above.
- Validate, dedupe, and hand off to the sales team.
Related reading: How to build a B2B lead list from Yellow Pages (no code) for the simplest possible seed, How to scrape Google Maps for local business leads for the higher-coverage alternative, and How to monitor Yellow Pages for new businesses for the recurring-refresh pattern that turns this from a one-off into a pipeline.
External: Hunter.io, Apollo.io, Snov.io, hiQ Labs v. LinkedIn, CAN-SPAM Act.