How to Scrape Google Search Results for Competitor Research

Every competitor-research project starts with the same question: what does my rival actually rank for, and where am I losing the page they are winning? The answer lives in the Google SERP — the ordered organic block, the People Also Ask box, and the related searches at the bottom of the page. Those three blocks together describe a competitor's keyword footprint: the queries they own, the topics Google associates with them, and the adjacent questions their content has not yet answered. Ahrefs and SEMrush will sell you a version of this, but their numbers come from their own crawl on their own cadence, and you cannot pin the country, language, or exact query operators you care about.

This guide builds the footprint yourself: pull a keyword's SERP and read which competitor domains own which positions, restrict a scrape to a single rival domain to enumerate what they rank for, mine PAA and related searches for content-gap topics, loop a whole keyword list into a footprint table, and write the result to a CSV your strategist can sort. The whole pipeline is roughly 60 lines of Python on one SERP-scrape endpoint.

What's Actually in a Google SERP

Before scraping anything, it helps to be precise about the four blocks a single results page hands back, because each one answers a different competitor-research question.

| Block | What's inside | What it tells you |
| --- | --- | --- |
| organic | Ordered organic results: position, title, url, displayed_url, description | Who owns the query and at what rank |
| ads | Sponsored slots above and below organic, labeled separately | Who is paying to defend or attack the term |
| paa | People Also Ask: the expandable Q&A panel Google injects mid-SERP | The questions Google ties to this topic cluster |
| related | Related searches at the bottom of the page | Adjacent queries and the shape of the topic graph |

For footprint research you care about all four. The organic block is the scoreboard. The ads block tells you which competitors are spending to defend a term — a strong signal of commercial value. The paa and related blocks are the content-gap goldmine: Google telling you, for free, which sub-questions and adjacent queries belong to this topic. A rival who ranks for the head term but answers none of the PAA questions has left a gap you can take.

One honest caveat up front: the SERP is volatile and partly personalized. The same query from the same country can shift by a position or two between scrapes minutes apart, and Google biases results on the requesting IP's inferred location. Footprint research tolerates this far better than daily rank tracking does — you are reading the shape of who-owns-what, not chasing single-position movements — but set country and language explicitly so the SERP you analyze is a clean, reproducible reference rather than your own logged-in browser's personalized view.

Step 1: Pull a Keyword's SERP and Read the Owners

Start with a single query and look at who holds the top positions. Every Google Search endpoint here is asynchronous: you submit a job, poll until it completes, then fetch the result. Confirm the parameters with curl before writing any code.

# 1) Submit the search job
curl -G "https://api.logposervices.com/api/v1/search/google/search" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "q=project management software" \
  --data-urlencode "pages=2" \
  --data-urlencode "country=us" \
  --data-urlencode "language=en"
# → {"job_id": "gs_8f3a...", "status": "pending"}

# 2) Poll the job until status == "completed"
curl -H "X-API-Key: lp_xxxxxxx" \
  "https://api.logposervices.com/api/v1/jobs/gs_8f3a"
# → {"job_id": "gs_8f3a...", "status": "completed"}

# 3) Fetch the parsed result
curl -H "X-API-Key: lp_xxxxxxx" \
  "https://api.logposervices.com/api/v1/jobs/gs_8f3a/result"

Each page returns roughly 10 organic results, so pages=2 gives you the top 20 — the right depth for footprint work, since positions past 20 rarely receive measurable traffic and the page-to-page variance starts to dominate. Reading the organic array, you get the ordered list of domains that own this query. That alone answers the first competitor question: for "project management software," who sits at positions 1–10, and where does my domain land — if it lands at all? Mapping the domain at each position is the foundation everything else builds on.
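As a minimal sketch of that position-to-domain mapping, the snippet below walks a result shaped like the organic block described earlier. The sample payload is hypothetical, and the fallback to the list index when a row has no position field is an assumption, not documented API behavior.

```python
from urllib.parse import urlparse

# Hypothetical sample shaped like the organic block; real payloads may differ.
sample_serp = {
    "organic": [
        {"position": 1, "title": "Tool A", "url": "https://www.competitor.com/pm"},
        {"position": 2, "title": "Tool B", "url": "https://rival.io/features"},
        {"position": 3, "title": "Tool A blog", "url": "https://www.competitor.com/blog/pm"},
    ]
}


def owners_by_position(serp: dict) -> list:
    """Return (position, domain) pairs in rank order."""
    owners = []
    for i, row in enumerate(serp.get("organic", []), start=1):
        host = urlparse(row.get("url", "")).netloc.lower()
        domain = host[4:] if host.startswith("www.") else host
        # Assumption: fall back to list order if the row carries no position
        owners.append((row.get("position", i), domain))
    return owners


for pos, domain in owners_by_position(sample_serp):
    print(pos, domain)
```

A domain that shows up more than once, as competitor.com does here, holds multiple listings for the query, which is worth noting separately from its best position.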

Step 2: Restrict a Scrape to a Rival's Domain

Knowing a competitor owns a query is useful. Knowing every page of theirs that ranks for a query — and how many indexed pages they have competing for it — is what shapes strategy. The sites filter applies a site: restriction to the query, so the SERP comes back limited to the domain (or comma-separated domains) you name.

# What does competitor.com rank for on this topic?
curl -G "https://api.logposervices.com/api/v1/search/google/search" \
  -H "X-API-Key: lp_xxxxxxx" \
  --data-urlencode "q=project management software" \
  --data-urlencode "sites=competitor.com" \
  --data-urlencode "pages=3" \
  --data-urlencode "country=us"

The result is every page on competitor.com that Google considers relevant to that query, in rank order. Run it across your shortlist of head terms and you get a per-rival map of which of their URLs compete for which topics — the raw material for "they have a dedicated comparison page ranking for this, we only have a blog post." Pair it with exclude_sites to strip user-generated noise (exclude_sites=reddit.com,quora.com) and see the editorial SERP your content actually competes in, or with in_title (which applies intitle:) to find which of a rival's pages target a keyword in the title tag versus merely mention it in the body. The exact_phrase, exclude_terms, filetype, and time_range filters extend the same idea — filetype=pdf surfaces a competitor's gated whitepapers that rank, and time_range=month isolates what they have published recently.
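The filters above compose as ordinary query parameters, so a small helper can keep them consistent across calls. This is a sketch only: build_search_params is a hypothetical name, and passing in_title as a plain string is an assumption about the value format, which the text above does not spell out.

```python
def build_search_params(q, *, sites=None, exclude_sites=None, in_title=None,
                        pages=2, country="us", language="en"):
    """Assemble query parameters, omitting any filter left unset."""
    params = {"q": q, "pages": pages, "country": country, "language": language}
    if sites:
        # Comma-separated list of domains, as in sites=competitor.com
        params["sites"] = ",".join(sites)
    if exclude_sites:
        # As in exclude_sites=reddit.com,quora.com
        params["exclude_sites"] = ",".join(exclude_sites)
    if in_title:
        # Assumption: the filter takes the title phrase as a plain string
        params["in_title"] = in_title
    return params


params = build_search_params(
    "project management software",
    sites=["competitor.com"],
    exclude_sites=["reddit.com", "quora.com"],
    pages=3,
)
print(params)
```

The resulting dict can be passed as the params argument of a requests.get call against the search endpoint.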

Step 3: Mine PAA and Related Searches for Content Gaps

This is where SERP scraping does something the position number alone cannot. The paa and related blocks are Google's own map of the topic cluster around your keyword. The People Also Ask box lists the sub-questions users ask; related searches list the adjacent queries. Cross-reference those against what a competitor's pages actually cover, and the gaps are the content you should write next.

The mining logic is simple: collect the PAA questions and related searches for each head term, then check which competitor domains rank when you fire each of those as its own query. A PAA question that no top competitor answers well is an opening. A related search that one rival dominates and you are absent from is a defensive priority. You are turning Google's own topic graph into a prioritized content backlog, with the competitor footprint layered on top so you know whether each gap is an open field or a contested one.

Be honest about what is and is not in this data. PAA and related searches are themselves personalized and volatile — the set you get depends partly on country and the moment you scrape. Treat them as directional topic signals, not a fixed taxonomy. The value is in the overlap: questions and related queries that appear across several head terms in your set are the durable topic pillars; one-off entries are noise.
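One way to turn the mining logic into a label per follow-up query is sketched below. It assumes you have already fired each PAA question or related search as its own query and reduced the result to a map of watchlist domain to best position. The label names, the top_n cutoff, and using "ranks in the top N" as a stand-in for "answers well" are all assumptions made for illustration.

```python
def classify_gap(positions: dict, own_domain: str, top_n: int = 10) -> str:
    """Label one follow-up query from its {domain: best_position} map."""
    # Keep only watchlist domains that rank inside the cutoff
    ranked = {d: p for d, p in positions.items() if p <= top_n}
    rivals = [d for d in ranked if d != own_domain]
    if own_domain in ranked:
        return "contested" if rivals else "covered"
    if not rivals:
        return "open field"  # nobody on the watchlist ranks for it
    return "defensive priority"  # rivals rank and you are absent


print(classify_gap({"rival.io": 2}, own_domain="example.com"))
```

Sorting the backlog so that open fields come first and contested queries last gives a rough ordering by effort, though the cutoff deserves tuning against your own set.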

Step 4: Loop a Keyword List Into a Footprint Table

The single-query view is a spot check. The deliverable a strategist wants is a table: for every keyword in your research set, which competitor domains appear in the top 20 (the two pages scraped per keyword), at what positions, plus the PAA questions attached to each. Define the input as a small CSV.

keyword,country,language
project management software,us,en
best gantt chart tool,us,en
agile sprint planning tool,us,en
kanban board app,us,en
team task tracker,gb,en

The pipeline reads each row, scrapes the SERP, extracts the organic domains and their positions, pulls the PAA questions, and accumulates a footprint matrix. The script below does exactly that with requests and the standard library — no framework.

import os, time, csv
from collections import defaultdict
from urllib.parse import urlparse
import requests

API_KEY = os.environ["LOGPOSE_API_KEY"]
BASE = "https://api.logposervices.com/api/v1"
HEADERS = {"X-API-Key": API_KEY}

# Domains you consider competitors. Anything outside this set is noise.
WATCHLIST = {"competitor.com", "rival.io", "challenger.co", "example.com"}


def submit_and_wait(path: str, params: dict, timeout_s: int = 90) -> dict:
    """Submit an async SERP job, poll until done, return the parsed result.

    Cloudflare drops a connection at ~90s, so never expect an inline
    answer — always poll the job_id.
    """
    r = requests.get(f"{BASE}/{path}", params=params, headers=HEADERS, timeout=30)
    r.raise_for_status()
    job_id = r.json()["job_id"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        poll = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS, timeout=15)
        poll.raise_for_status()  # surface HTTP errors instead of a KeyError on "status"
        s = poll.json()
        if s["status"] == "completed":
            return requests.get(
                f"{BASE}/jobs/{job_id}/result", headers=HEADERS, timeout=15
            ).json()
        if s["status"] == "failed":
            raise RuntimeError(s.get("error", "unknown failure"))
        time.sleep(2)
    raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")


def domain_of(url: str) -> str:
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def footprint_for(keyword: str, country: str, language: str) -> dict:
    """Return competitor positions + PAA questions for one keyword."""
    serp = submit_and_wait(
        "search/google/search",
        {"q": keyword, "pages": 2, "country": country, "language": language},
    )
    positions = {}  # competitor domain -> best (lowest) position seen
    for i, row in enumerate(serp.get("organic", []), start=1):
        d = domain_of(row.get("url", ""))
        if d in WATCHLIST and d not in positions:
            positions[d] = i
    paa = [q.get("question") for q in serp.get("paa", []) if q.get("question")]
    related = serp.get("related", [])
    return {"keyword": keyword, "positions": positions, "paa": paa, "related": related}

The footprint_for function is the whole algorithm: walk the organic block once, record the best position for each domain on your watchlist, and grab the PAA questions and related searches alongside. Everything competitor-research needs falls out of that single pass — who ranks, where, and what topics Google attaches to the query.
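One limitation worth knowing: domain_of strips only a leading www., so a result on blog.competitor.com will not match a watchlist entry of competitor.com. If you want subdomains to count toward the parent domain, a suffix match is one option. The helper below is a hypothetical addition, not part of the pipeline as written.

```python
from typing import Optional


def watchlist_match(host: str, watchlist: set) -> Optional[str]:
    """Return the watchlist domain that host equals or is a subdomain of."""
    host = host.lower().rstrip(".")
    for domain in watchlist:
        # The leading dot prevents notcompetitor.com matching competitor.com
        if host == domain or host.endswith("." + domain):
            return domain
    return None


print(watchlist_match("blog.competitor.com", {"competitor.com", "rival.io"}))
```

To use it, replace the `d in WATCHLIST` check with a call to watchlist_match and key the positions dict on the returned domain.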

Step 5: Store the Footprint to CSV

A footprint table is only useful if a non-engineer can sort it. The driver below loops the keyword CSV, builds two outputs — a wide footprint.csv (one row per keyword, one column per competitor's position) and a long content_gaps.csv (every PAA question, keyed to its source keyword) — and writes both.

def main():
    with open("keywords.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    results = []
    for row in rows:
        try:
            results.append(footprint_for(row["keyword"], row["country"], row["language"]))
        except Exception as e:
            print(f"  ! {row['keyword']}: {e}")
            results.append({"keyword": row["keyword"], "positions": {}, "paa": [], "related": []})

    competitors = sorted(WATCHLIST)

    # 1) Wide footprint: position of each competitor per keyword
    with open("footprint.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["keyword"] + competitors)
        for r in results:
            w.writerow([r["keyword"]] + [r["positions"].get(c, "") for c in competitors])

    # 2) Long content-gap list: every PAA question, deduped, keyed to keyword
    seen = set()
    with open("content_gaps.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["source_keyword", "paa_question"])
        for r in results:
            for q in r["paa"]:
                if q not in seen:
                    seen.add(q)
                    w.writerow([r["keyword"], q])

    print(f"Wrote footprint.csv ({len(results)} keywords) and content_gaps.csv ({len(seen)} questions)")


if __name__ == "__main__":
    main()

Open footprint.csv in a spreadsheet and you have the artifact: keywords down the rows, competitors across the columns, each cell the rival's best organic position (blank where they do not rank in the top 20). Sort by a competitor's column to see exactly which queries they dominate; scan the blank cells in your own column (your domain has to be on the watchlist for it to get one) to find the terms you are losing. content_gaps.csv is the deduped backlog of every question Google associates with your topic set: the raw writing brief, with each question listed once under the keyword that first surfaced it. The driver does not count how often a question recurs across keywords, so rank by recurrence yourself before treating the list as prioritized.
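Because the driver dedupes PAA questions without counting them, ordering the backlog by recurrence needs a separate pass over the results list. The sketch below is one way to do it; rank_questions is a hypothetical helper that expects the same dicts footprint_for returns.

```python
from collections import Counter


def rank_questions(results: list) -> list:
    """Return (question, keyword_count) pairs, most widespread first."""
    counts = Counter()
    for r in results:
        # set() so a question repeated within one SERP counts once per keyword
        counts.update(set(r["paa"]))
    # Sort by count descending, then alphabetically for a stable order
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


demo = [
    {"keyword": "k1", "paa": ["what is a gantt chart", "is kanban agile"]},
    {"keyword": "k2", "paa": ["is kanban agile"]},
]
for question, n in rank_questions(demo):
    print(n, question)
```

Questions that recur under several keywords are the durable topic pillars; those with a count of one are more likely noise.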

Scaling This Across a Wide Keyword List

The driver above scrapes one keyword at a time, which is fine for a 50-term competitive set and becomes the wall-clock bottleneck somewhere past a few hundred. Two adjustments handle a real research portfolio.

Parallelize the submit step. Each job completes in seconds, but issuing them sequentially serializes the waiting. Wrap footprint_for in a concurrent.futures.ThreadPoolExecutor with 10–20 workers and the polling inside each submit_and_wait overlaps. A 2,000-keyword footprint drops from hours to minutes.
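A minimal sketch of that thread-pool wrapper follows. run_parallel is a hypothetical helper; it takes the worker as an argument so it can wrap footprint_for without modification, and it mirrors the sequential driver by logging a failure and keeping an empty row. The default of 10 workers is the low end of the range suggested above, not a tested limit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(rows, worker, max_workers=10):
    """Run worker(row) across a thread pool, returning results in input order."""
    results = [None] * len(rows)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its row index so output order matches input
        futures = {pool.submit(worker, row): i for i, row in enumerate(rows)}
        for fut in as_completed(futures):
            i = futures[fut]
            try:
                results[i] = fut.result()
            except Exception as e:
                # Same fallback as the sequential driver: log, keep an empty row
                print(f"  ! {rows[i]['keyword']}: {e}")
                results[i] = {"keyword": rows[i]["keyword"],
                              "positions": {}, "paa": [], "related": []}
    return results
```

In main, the for loop would become something like `results = run_parallel(rows, lambda r: footprint_for(r["keyword"], r["country"], r["language"]))`. Threads suit this workload because each worker spends nearly all its time waiting on network polls.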

Manage the proxy and CAPTCHA problem. This is the wall most DIY SERP scrapers hit by week two. Google starts CAPTCHA-walling an IP after a few hundred queries from the same address per day, and country-bound personalization means a US footprint needs a US IP, a UK footprint a UK IP. The manual version — proxy pool maintenance, IP warm-up, country-bound residential rotation, CAPTCHA recovery — quietly becomes more code than the footprint pipeline itself. The LogPose web scraping API handles rotation, country-bound IP selection, and CAPTCHA recovery behind the /api/v1/search/google/search endpoint shown above: set country=us and a US residential IP is selected automatically, set country=de and the next request comes from a German one. The footprint pipeline stays at ~60 lines because the infrastructure lives on the API side. The same async pattern powers Google News and Shopping — switch the path to /search/google/news or /search/google/shopping to footprint those verticals without rewriting the loop.

Honest Limits of SERP-Based Competitor Research

A footprint built from SERP scrapes is a real, auditable snapshot, but be clear about its edges so you read the table correctly.

Positions oscillate. Treat a competitor at position 3 versus 4 as the same tier; only sustained, multi-position shifts across repeated scrapes are real movement. Footprint research is about who-owns-what bands, not pixel-perfect rank.
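One way to encode that rule is to compare the median position across two batches of repeated scrapes and flag a change only past a threshold. This is a sketch: real_movement is a hypothetical helper, and the default min_shift of 3 places is an assumption to tune, not a figure from this guide.

```python
from statistics import median


def real_movement(before: list, after: list, min_shift: int = 3) -> bool:
    """True when the median position moved by at least min_shift places."""
    return abs(median(after) - median(before)) >= min_shift


# A 3-versus-4 wobble is the same tier; a jump from about 3 to about 8 is not.
print(real_movement([3, 4, 3], [4, 3, 4]))
print(real_movement([3, 4, 3], [8, 9, 8]))
```

Using the median rather than the latest scrape keeps a single outlier reading from registering as movement.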

Personalization is partial. Setting country and language on a clean session (the scraper does not log in) gets you close to the anonymous SERP for that geography, but Google still varies results on signals you cannot fully control. Run the same set from the same country at a consistent time for comparability.

PAA and related are directional. They are Google's live topic associations, not a stable taxonomy — the durable signal is the overlap across keywords, not any single entry.

The Cloudflare edge timeout is ~90 seconds. api.logposervices.com sits behind Cloudflare, so a synchronous request that runs long returns an edge error even though the job continues server-side. Always poll the job_id; never expect an inline result on a large page count. The pipeline above already does this correctly.

LogPose fits this workflow when you want clean, structured SERP blocks — organic, ads, PAA, related — across many keywords and countries without owning the proxy and CAPTCHA layer, and the same async shape extends to News and Shopping when your footprint research grows past plain web search. The honest constraint is that it is a SERP-scrape endpoint, not a backlink or traffic-estimation tool: it tells you who ranks and for what, not how much traffic that rank earns. Pair it with your analytics for the volume side of the picture.

Get Started

  1. Sign up at logposervices.com and generate an API key under Tool → API Keys.
  2. export LOGPOSE_API_KEY=lp_xxxxxxx
  3. Build a keywords.csv with your competitive set and call /api/v1/search/google/search?q=... against the first row to confirm the shape.
  4. Run the pipeline above and open footprint.csv — your competitor keyword footprint, ready to sort.

Related reading: How to track your Google search rankings daily for the day-over-day rank-monitoring version of this pipeline, a SerpAPI alternative for SERP, Maps, and News for the broader endpoint comparison, and DataForSEO alternatives for SERP and rank data if you are evaluating managed SERP-data vendors.

External: Google Search Central, hiQ Labs v. LinkedIn.

Frequently asked questions

Is it legal to scrape Google Search results?
Commercial SEO platforms (Ahrefs, SEMrush, Moz, Serpstat) have collected public Google SERP data for years. The rendered HTML of `google.com/search?q=...` is public and sits behind no login wall. In hiQ Labs v. LinkedIn (9th Cir. 2022), the court held that scraping publicly accessible web data likely does not violate the CFAA; that ruling does not settle contract or terms-of-service claims. Google's Terms of Service do restrict automated access, but reading the rendered public results page is a materially different risk profile from scraping an authenticated service, because no Google account is involved and nothing is republished as a competing product. Treat the data as research input, respect rate limits, and do not resell raw SERP HTML as a product, and you are in the same posture rank-tracking vendors have operated in for years.

Why not use the Google Custom Search JSON API for competitor research?
The Custom Search JSON API returns results from a Programmable Search Engine, which is a filtered, re-ranked subset of Google's index, not the SERP a normal user sees in their browser. For competitor research that gap is disqualifying: the positions, the domains that appear, and the People Also Ask and related-search blocks you are specifically mining for content gaps either diverge from the real SERP or are absent from the JSON product entirely. The API also caps free usage at 100 queries per day, with paid usage topping out at 10,000 queries per day, and its Terms of Service forbid using it to build a rank-tracking or competitor-intelligence product. Commercial SEO tools scrape the actual rendered SERP instead, because matching what a competitor's customer actually sees is the entire point of the exercise. A footprint table built on JSON-API positions is unlikely to match the screenshot a client takes in their own browser.
