Large-Scale Data Crawling Infrastructure

Multi-Market Extraction Pipeline on AWS — Campaign Orchestration, Proxy Management, and Structured Data at Scale

Advanced · 24 min read · 2026-01

EU Markets: 5+

Data Loss Events: 0

Deploy Time: < 5 min

01_THE_CHALLENGE

Each target market runs a different platform with different HTML structure, different anti-bot vendors, and different data schemas. The system needed to: run campaigns concurrently across markets, handle anti-bot defenses per-platform, normalize heterogeneous data into a unified schema, recover from failures without data loss, and be operable by non-engineers through a management interface.

02_THE_SOLUTION

Layered architecture across four planes:

Orchestration Layer: Campaign state machine (PENDING → RUNNING → PAUSED → COMPLETE → ERROR) managed in PostgreSQL, with advisory locks preventing duplicate crawl workers. AWS SQS queues decouple discovery from detail collection.

Worker Layer: Dockerized Node.js workers deployed on AWS ECS. Each worker pulls tasks from SQS, executes requests through the proxy layer, and writes raw HTML to S3 and parsed data to PostgreSQL. Workers scale horizontally via ECS task count.

Anti-Detection Layer: Per-platform proxy pools (residential/datacenter/Tor), TLS fingerprint selection, request timing with Gaussian jitter, UA rotation, and session-consistent cookie management. Proxy health is tracked per domain with automatic failover.

Data Layer: Raw HTML archived to S3. Parsed records normalized into a unified schema with EUR currency conversion and deduplication by canonical identifier. Strapi CMS serves as the admin interface for campaign configuration and data review. A publication pipeline outputs structured datasets for downstream consumers.
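The orchestration layer can be pictured with a small sketch. The source names the five campaign states but not the allowed transitions between them, so the transition table below is an assumption, as are all function names. The lock-key helper shows one plausible way to turn a campaign identifier into a numeric key for PostgreSQL's `pg_try_advisory_lock`; the actual keying scheme is not described in the source.

```typescript
// Hypothetical sketch: state names come from the text, the allowed
// edges between them are assumed for illustration.
type CampaignState = "PENDING" | "RUNNING" | "PAUSED" | "COMPLETE" | "ERROR";

const TRANSITIONS: Record<CampaignState, readonly CampaignState[]> = {
  PENDING: ["RUNNING"],
  RUNNING: ["PAUSED", "COMPLETE", "ERROR"],
  PAUSED: ["RUNNING", "ERROR"],
  COMPLETE: [],
  ERROR: ["PENDING"], // assumed: a failed campaign may be re-queued
};

// True when the state machine permits moving from `from` to `to`.
function canTransition(from: CampaignState, to: CampaignState): boolean {
  return TRANSITIONS[from].includes(to);
}

// Returns the new state, or throws on an illegal transition.
function transition(from: CampaignState, to: CampaignState): CampaignState {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal campaign transition: ${from} -> ${to}`);
  }
  return to;
}

// Derives a stable unsigned 32-bit key from a campaign id using an
// FNV-1a style hash over UTF-16 code units. A worker could pass it to
//   SELECT pg_try_advisory_lock($1)
// and skip the campaign when the call returns false (assumed usage).
function advisoryLockKey(campaignId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < campaignId.length; i++) {
    hash ^= campaignId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}
```

Keeping the transition table as data rather than scattered conditionals makes it straightforward to validate a state change in the same database transaction that holds the lock.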

03_IMPACT_METRICS

Technical_Impact

5+ EU markets running concurrently on the same infrastructure
Campaign state machine with full recovery — no data loss on worker crash
Proxy toolbox: residential, datacenter, Tor — automatically selected per target
GUI-based selector tool: new markets configurable without code changes
Terraform-managed AWS infrastructure: ECS, SQS, S3, RDS, CloudFront
Sentry error tracking with per-campaign alerting and debugging queries
Deploy pipeline: GitHub Actions → ECR → ECS rolling update, < 5 min deploys
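Two of the anti-detection pieces listed above, per-domain proxy health with failover and Gaussian request jitter, can be sketched as follows. This is a minimal illustration under assumptions: the class name, the failure-rate threshold, the minimum sample count, and the preference-order configuration are all hypothetical, and the source does not say how health is scored or how jitter is sampled.

```typescript
type ProxyKind = "residential" | "datacenter" | "tor";

interface ProxyHealth {
  successes: number;
  failures: number;
}

// Hypothetical selector: walks a per-domain preference order and
// returns the first proxy kind that is not considered unhealthy.
class ProxySelector {
  private health = new Map<string, ProxyHealth>();
  private preference: Record<string, ProxyKind[]>;
  private maxFailureRate: number;
  private minSamples: number;

  constructor(
    preference: Record<string, ProxyKind[]>,
    maxFailureRate: number = 0.5, // assumed threshold
    minSamples: number = 5,       // assumed warm-up before judging a pool
  ) {
    this.preference = preference;
    this.maxFailureRate = maxFailureRate;
    this.minSamples = minSamples;
  }

  private key(domain: string, kind: ProxyKind): string {
    return `${domain}|${kind}`;
  }

  // Record the outcome of one request through `kind` against `domain`.
  record(domain: string, kind: ProxyKind, ok: boolean): void {
    const k = this.key(domain, kind);
    const h = this.health.get(k) ?? { successes: 0, failures: 0 };
    if (ok) h.successes += 1;
    else h.failures += 1;
    this.health.set(k, h);
  }

  // A pool is healthy until enough samples show a high failure rate.
  isHealthy(domain: string, kind: ProxyKind): boolean {
    const h = this.health.get(this.key(domain, kind));
    if (!h) return true;
    const total = h.successes + h.failures;
    if (total < this.minSamples) return true;
    return h.failures / total <= this.maxFailureRate;
  }

  // Failover: the first healthy kind in the domain's preference order.
  select(domain: string): ProxyKind | undefined {
    const order = this.preference[domain] ?? [];
    return order.find((kind) => this.isHealthy(domain, kind));
  }
}

// Gaussian jitter via the Box-Muller transform, clamped at zero.
// `rand` is injectable so the function can be tested deterministically.
function jitteredDelayMs(
  meanMs: number,
  stdDevMs: number,
  rand: () => number = Math.random,
): number {
  const u1 = 1 - rand(); // shift to (0, 1] so Math.log never sees 0
  const u2 = rand();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.max(0, meanMs + stdDevMs * z);
}
```

A worker would call `select(domain)` before each request, sleep for `jitteredDelayMs(...)`, and report the outcome through `record(...)`, so a pool that starts failing against one domain is bypassed without affecting other domains.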

Business_Impact

Sole engineer — designed, built, deployed, and maintained the entire system
Millions of records collected and normalized across production campaigns
Non-technical operators manage campaigns via Strapi without engineering involvement
Data latency: new market listings available within hours of publication
System runs continuously — campaigns pause and resume without operator intervention

04_TECHNICAL_DEEP_DIVE
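As one illustration of the Data Layer described in the solution section, the sketch below normalizes a parsed record into a unified shape, converts its price to EUR, and deduplicates by canonical identifier. The field names, the `market:sourceId` identifier format, the rate table passed in by the caller, and the last-write-wins policy are all assumptions made for the example; the source does not specify the schema.

```typescript
// Hypothetical shape of a record as parsed from one market's HTML.
interface RawListing {
  market: string;
  sourceId: string;
  title: string;
  price: number;
  currency: string;
}

// Hypothetical unified schema shared by all markets.
interface NormalizedListing {
  canonicalId: string;
  market: string;
  title: string;
  priceEur: number;
}

// Normalize one record. `ratesToEur` maps a currency code to the value
// of one unit of that currency in EUR and is supplied by the caller.
function normalize(
  raw: RawListing,
  ratesToEur: Record<string, number>,
): NormalizedListing {
  const rate = ratesToEur[raw.currency];
  if (rate === undefined) {
    throw new Error(`No EUR rate for currency ${raw.currency}`);
  }
  const market = raw.market.trim().toLowerCase();
  return {
    canonicalId: `${market}:${raw.sourceId.trim()}`,
    market,
    title: raw.title.trim().replace(/\s+/g, " "), // collapse whitespace
    priceEur: Math.round(raw.price * rate * 100) / 100, // round to cents
  };
}

// Deduplicate by canonical identifier; a later record replaces an
// earlier one with the same id (assumed policy).
function dedupe(records: NormalizedListing[]): NormalizedListing[] {
  const byId = new Map<string, NormalizedListing>();
  for (const record of records) {
    byId.set(record.canonicalId, record);
  }
  return Array.from(byId.values());
}
```

Throwing on an unknown currency, rather than silently passing the raw price through, keeps unconverted values out of the unified dataset.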

05_TECH_STACK

AWS · Distributed Systems · Web Crawling · Docker · PostgreSQL · Campaign Orchestration · Proxy Networks · TypeScript

Want the full technical breakdown?

The wire-format analysis, architecture diagrams, and protocol-level detail live on Al Bayrouni. The contact form is for consulting and engagement discussions.
