GEO data backbone · stealth scraping · multi-cloud

AI-answer collection engine

The data layer behind the GEO metrics: a multi-provider engine that collects AI answers and citations from ChatGPT, Perplexity, and Google AI Overview/AI Mode. It draws on three kinds of source: reverse-engineered anonymous endpoints (curl_cffi TLS-fingerprint impersonation, plus a Cloudflare Turnstile and proof-of-work bypass), seven paid scraper/SERP vendors, and official APIs. It is deployed across local, AWS Lambda, and Google Cloud Run, with every result normalized to one schema. This is internal platform infrastructure, and the bypass cores adapt open-source bases.

GEO · SCRAPING · ANTI-BOT · AWS-LAMBDA · CLOUD-RUN

Honest outcomes

answer engines collected: 3 + APIs (ChatGPT · Perplexity · Google AI Mode/Overview)

deployment targets: 3 (local · AWS Lambda · Cloud Run)

paid scraper/SERP vendors: 7 (normalized to one schema)

Perplexity TLS path: ~5s/query (self-reported; no proxy)

endurance test: 1,000 requests (IP-degradation study)

01 —

Why

Every GEO metric in this portfolio — visibility %, share of voice, citation position — is only as good as the data underneath it. To know whether a brand is mentioned and cited inside an AI answer, you first have to collect that answer, faithfully, from each engine, at scale, on a schedule. That collection layer is the unglamorous part nobody demos, and it is the hardest part to keep working.

The major answer engines actively resist automated collection: Cloudflare challenges, Turnstile, proof-of-work, bot fingerprinting, and IP reputation all sit between you and a clean answer. Off-the-shelf browser automation works on a laptop and then falls over the moment it runs from a data-centre IP. I needed a collection engine that survives in the cloud, across providers, without quietly returning garbage.

Honest scope up front: this is the internal data backbone for a GEO platform, not a public product, and a teammate owns a separate parallel scraper set that is not my work. The hardest bypass cores adapt and harden open-source bases rather than being invented from scratch — credited as such below.

Browser automation kept getting blocked in the cloud; impersonating the TLS handshake did not — so the engine stopped driving a browser and started speaking the protocol directly.

the pivot that made cloud collection reliable

02 —

What

The engine collects AI answers and their citations three ways, normalized to one schema: official provider APIs (OpenAI, Anthropic Claude, Google Gemini, Perplexity Sonar), seven paid scraper/SERP vendors (Bright Data, Oxylabs, DataForSEO, ScrapingDog, SearchAPI, Serper, ScraperAPI), and bespoke anti-bot scrapers for the engines whose answers are not available any other way: ChatGPT, Perplexity, and Google AI Overview / AI Mode.

It runs in three places: locally for development, on AWS Lambda for parallel fan-out, and on Google Cloud Run for the heaviest stealth paths. The same query can be answered by whichever route is cheapest and most reliable that day, and the results come back in one comparable shape regardless of source.
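A minimal sketch of how "cheapest and most reliable that day" could be decided, assuming each route tracks a recent success rate and a per-query cost. The names `Route` and `pick_route`, the `min_success` floor, and the sample figures are all hypothetical illustrations, not taken from the engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str              # e.g. "official_api", "paid_vendor", "tls_scraper"
    cost_per_query: float  # assumed unit: USD per query
    success_rate: float    # recent fraction of successful collections, 0.0 to 1.0


def pick_route(routes: list[Route], min_success: float = 0.9) -> Route:
    """Pick the cheapest route that meets the reliability floor.

    If no route meets the floor, fall back to the most reliable one.
    """
    if not routes:
        raise ValueError("no routes configured")
    healthy = [r for r in routes if r.success_rate >= min_success]
    if healthy:
        # Cheapest first; ties broken in favour of the more reliable route.
        return min(healthy, key=lambda r: (r.cost_per_query, -r.success_rate))
    return max(routes, key=lambda r: r.success_rate)


# Made-up numbers, for illustration only.
routes = [
    Route("official_api", cost_per_query=0.010, success_rate=0.99),
    Route("paid_vendor", cost_per_query=0.004, success_rate=0.95),
    Route("tls_scraper", cost_per_query=0.001, success_rate=0.70),
]
chosen = pick_route(routes)
print(chosen.name)
```

With these sample figures the scraper is cheapest but falls below the floor, so the paid vendor is chosen. Raising or lowering `min_success` shifts the balance between cost and reliability.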

The standout components are a self-healing AWS Lambda scraper for Google AI Mode — a three-state S3 fallback with manifest locks, a safe-upload gate that refuses to poison good state, and quarantine cooldowns — and a Perplexity collector that defeats Cloudflare with TLS-fingerprint impersonation alone, no browser and no proxy.
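The self-healing idea can be sketched in memory, without S3, under stated assumptions. The source does not name the three states, so the tier names `latest`, `last_good`, and `baseline` are hypothetical, as are the `StateStore` class, the 900-second cooldown, and the single-owner lock standing in for a manifest lock.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical tier names: the source only says the fallback has three states.
TIERS = ("latest", "last_good", "baseline")


@dataclass
class StateStore:
    """In-memory stand-in for S3-backed session state (assumed layout)."""
    blobs: dict = field(default_factory=dict)              # tier -> state payload
    quarantined_until: dict = field(default_factory=dict)  # tier -> unix time
    lock_owner: Optional[str] = None                       # stand-in for a manifest lock

    def acquire_lock(self, owner: str) -> bool:
        """Grant the lock if it is free or already held by this owner."""
        if self.lock_owner in (None, owner):
            self.lock_owner = owner
            return True
        return False

    def pick_state(self, now: float):
        """Return the first tier that exists and is not in cooldown."""
        for tier in TIERS:
            if tier in self.blobs and self.quarantined_until.get(tier, 0.0) <= now:
                return tier, self.blobs[tier]
        return None, None

    def quarantine(self, tier: str, now: float, cooldown_s: float = 900.0) -> None:
        """Take a tier out of rotation until the cooldown expires."""
        self.quarantined_until[tier] = now + cooldown_s

    def safe_upload(self, owner: str, new_state: dict, run_succeeded: bool) -> bool:
        """Promote new state only after a successful run, and only under the lock."""
        if not run_succeeded or self.lock_owner != owner:
            return False  # refuse to overwrite good state with unverified state
        # Only demote the old "latest" if it was never flagged as bad.
        if "latest" in self.blobs and "latest" not in self.quarantined_until:
            self.blobs["last_good"] = self.blobs["latest"]
        self.blobs["latest"] = new_state
        self.quarantined_until.pop("latest", None)
        return True


store = StateStore(blobs={"latest": {"v": 2}, "last_good": {"v": 1}, "baseline": {"v": 0}})
now = 1_000.0
tier, _ = store.pick_state(now)            # first healthy tier
store.quarantine(tier, now)                # pretend the run on that tier was blocked
fallback_tier, _ = store.pick_state(now)   # next healthy tier
store.acquire_lock("worker-a")
rejected = store.safe_upload("worker-a", {"v": 3}, run_succeeded=False)
```

The design choice this illustrates is that a failed run can only remove a state from rotation temporarily; it can never write over a state that previously worked.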

03 —

How

The key insight was to stop pretending to be a browser and start matching one at the byte level. Instead of driving headless Chrome (which gets blocked as soon as it runs from cloud IPs), the stealth paths use curl_cffi to impersonate a real browser’s TLS/JA3 fingerprint, then complete the anonymous-session handshake, including a Cloudflare Turnstile bytecode-VM solver and a proof-of-work step, to reach the same endpoints the web app uses. Those Turnstile/PoW cores adapt published open-source work, hardened and wrapped for deployment.
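The TLS-impersonation step on its own can be sketched with curl_cffi's documented `impersonate` argument. This is a minimal illustration, not the engine's collector: the function names, the status codes in `BLOCK_STATUSES`, and the strings in `CHALLENGE_MARKERS` are assumptions, and the Turnstile and proof-of-work steps are deliberately left out.

```python
# Assumed heuristics for spotting a block instead of a real answer.
BLOCK_STATUSES = {403, 429, 503}
CHALLENGE_MARKERS = ("just a moment", "cf-chl", "challenge-platform")


def fetch_with_browser_fingerprint(url: str, target: str = "chrome", timeout: float = 30.0):
    """GET a URL with a browser-like TLS handshake via curl_cffi.

    Returns (status_code, body_text). No browser is launched.
    """
    # Imported lazily so this module loads even where curl_cffi is not installed.
    from curl_cffi import requests

    response = requests.get(url, impersonate=target, timeout=timeout)
    return response.status_code, response.text


def looks_challenged(status_code: int, body: str) -> bool:
    """Return True if the response looks like an anti-bot block page.

    Checking this before parsing keeps a challenge page from being
    stored as if it were an answer.
    """
    if status_code in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

A caller would pass the result of `fetch_with_browser_fingerprint("https://example.com")` to `looks_challenged` and only hand the body to a parser when it returns `False`.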

Reliability is treated as the product. Browser-automation routes (playwright-stealth, patchright) were tried first and documented as cloud dead-ends; the pivot to TLS impersonation is backed by a 1,000-request endurance study of how a single IP degrades under load, which is what justified the residential-proxy and fingerprint-rotation strategy. The Lambda state machine assumes failure and recovers from it instead of corrupting the dataset.
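One way an endurance run like this could be summarized is by success rate per window of consecutive requests from the same IP. The sketch below assumes the run is logged as an ordered list of pass/fail outcomes; the function names, the window size, and the 0.8 floor are hypothetical, and the sample data is synthetic, not the study's results.

```python
from typing import Optional


def windowed_success(outcomes: list[bool], window: int = 100) -> list[float]:
    """Success rate for each consecutive block of `window` requests."""
    if window <= 0:
        raise ValueError("window must be positive")
    rates = []
    for start in range(0, len(outcomes), window):
        chunk = outcomes[start:start + window]
        rates.append(sum(chunk) / len(chunk))
    return rates


def first_degraded_window(rates: list[float], floor: float = 0.8) -> Optional[int]:
    """Index of the first window whose success rate drops below `floor`."""
    for index, rate in enumerate(rates):
        if rate < floor:
            return index
    return None


# Synthetic outcomes: clean at first, then flaky, then fully blocked.
outcomes = [True] * 300 + [True, False] * 100 + [False] * 100
rates = windowed_success(outcomes, window=100)
degraded_at = first_degraded_window(rates)
print(rates, degraded_at)
```

The index of the first degraded window, multiplied by the window size, gives a rough request budget per IP, which is the kind of number a rotation policy needs.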

Everything funnels through a normalization layer so an answer from a paid vendor, an official API, and a hand-built scraper are directly comparable — same fields, same citation/source split — which is what makes the downstream GEO scoring trustworthy rather than apples-to-oranges.
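A sketch of what "same fields, same citation/source split" might look like, assuming citations are links the answer references inline and sources are links the engine attached without citing. The field names, the `normalize_vendor_payload` adapter, and the vendor payload shape (`q`, `text`, `links`, `cited`) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    url: str
    title: str = ""
    position: int = 0  # 1-based order within its list


@dataclass
class NormalizedAnswer:
    engine: str       # e.g. "chatgpt", "perplexity", "google_ai_mode"
    route: str        # e.g. "official_api", "paid_vendor", "scraper"
    query: str
    answer_text: str
    citations: list[Citation] = field(default_factory=list)  # referenced in the answer
    sources: list[Citation] = field(default_factory=list)    # attached but not cited


def normalize_vendor_payload(payload: dict, engine: str) -> NormalizedAnswer:
    """Map one (hypothetical) vendor response shape onto the shared schema."""
    citations: list[Citation] = []
    sources: list[Citation] = []
    for link in payload.get("links", []):
        bucket = citations if link.get("cited") else sources
        bucket.append(
            Citation(url=link["href"], title=link.get("title", ""), position=len(bucket) + 1)
        )
    return NormalizedAnswer(
        engine=engine,
        route="paid_vendor",
        query=payload["q"],
        answer_text=payload.get("text", "").strip(),
        citations=citations,
        sources=sources,
    )


payload = {
    "q": "best crm",
    "text": "  Answer text.  ",
    "links": [
        {"href": "https://a.example", "cited": True},
        {"href": "https://b.example"},
        {"href": "https://c.example", "cited": True, "title": "C"},
    ],
}
answer = normalize_vendor_payload(payload, engine="perplexity")
```

Each route would get its own small adapter like this one, so downstream scoring only ever sees `NormalizedAnswer` objects and never a vendor-specific shape.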

04 —

Where it stands

A multi-provider, multi-cloud AI-answer collection engine that keeps working where naive scrapers fail: reverse-engineered anonymous endpoints for ChatGPT and Perplexity, a self-healing Lambda scraper for Google AI Mode, seven paid-vendor fallbacks, and one normalized output schema — the measurable foundation the GEO metrics stand on.

Kept honest: latency figures (e.g. the ~5s/query Perplexity path) are self-reported, not independently benchmarked; the Turnstile/PoW bypass cores adapt open-source bases rather than being invented here; and a co-author’s parallel scraper set is excluded from this credit. The engineering I claim is the hardening, the cloud deployment, the self-healing state machine, and the normalization layer.

05 —

Stack

Python · curl_cffi · playwright-stealth / patchright · AWS Lambda · Google Cloud Run · S3 state machine
