Large-Scale Data Crawling Infrastructure
Multi-Market Extraction Pipeline on AWS — Campaign Orchestration, Proxy Management, and Structured Data at Scale
Advanced · 24 min read · 2026-01
01_THE_CHALLENGE
Each target market runs a different platform with its own HTML structure, anti-bot vendor, and data schema. The system needed to run campaigns concurrently across markets, handle anti-bot defenses per platform, normalize heterogeneous data into a unified schema, recover from failures without data loss, and be operable by non-engineers through a management interface.
02_THE_SOLUTION
The architecture is split into four layers:
Orchestration Layer: Campaign state machine with five states (PENDING, RUNNING, PAUSED, COMPLETE, ERROR), persisted in PostgreSQL, with advisory locks preventing two workers from crawling the same campaign at once. AWS SQS queues decouple discovery from detail collection.
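A minimal sketch of how such a state machine could be guarded in code. The case study names the five states but not the legal moves between them, so the transition table below is an assumption, as are the names `ALLOWED`, `canTransition`, `transition`, and `advisoryLockKey`. The only PostgreSQL feature referenced is the built-in `pg_try_advisory_lock(bigint)` function. How the real system derives its lock keys is not stated.

```typescript
type CampaignState = "PENDING" | "RUNNING" | "PAUSED" | "COMPLETE" | "ERROR";

// ASSUMPTION: these edges are illustrative; the source does not list them.
const ALLOWED: Record<CampaignState, readonly CampaignState[]> = {
  PENDING: ["RUNNING"],
  RUNNING: ["PAUSED", "COMPLETE", "ERROR"],
  PAUSED: ["RUNNING", "ERROR"],
  COMPLETE: [],
  ERROR: ["PENDING"], // assumed: a failed campaign can be re-queued
};

function canTransition(from: CampaignState, to: CampaignState): boolean {
  return ALLOWED[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: CampaignState, to: CampaignState): CampaignState {
  if (!canTransition(from, to)) {
    throw new Error("Illegal campaign transition: " + from + " -> " + to);
  }
  return to;
}

// Hypothetical helper: hash a campaign id to a stable unsigned 32-bit integer
// (FNV-1a) so it can be passed as the key of a PostgreSQL advisory lock.
function advisoryLockKey(campaignId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < campaignId.length; i++) {
    hash ^= campaignId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// A worker would run this with the key above and only proceed when the
// returned column is true; the lock is released when the session ends.
const TRY_LOCK_SQL = "SELECT pg_try_advisory_lock($1) AS acquired";
```

Keeping the table as data means an illegal move fails loudly at the call site instead of silently writing an inconsistent state to the database.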
Worker Layer: Dockerized Node.js workers deployed on AWS ECS. Each worker pulls tasks from SQS, executes requests through the proxy layer, and writes raw HTML + parsed data back to S3 and PostgreSQL. Workers scale horizontally via ECS task count.
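The pull, process, write cycle can be illustrated with an in-memory stand-in for the queue. This is not the AWS SDK: `VisibilityQueue` and `processOne` are hypothetical names, and the class only imitates the SQS behaviour where a received message stays hidden for a visibility timeout and reappears unless it is explicitly deleted. The sketch assumes the worker acknowledges a task only after its results are persisted, which is one common way to avoid losing work when a worker crashes mid-task.

```typescript
interface Message {
  id: string;
  body: string;
}

// In-memory imitation of a queue with a visibility timeout (assumption:
// stands in for SQS purely for illustration).
class VisibilityQueue {
  private messages: { id: string; body: string; visibleAt: number }[] = [];

  constructor(private visibilityTimeoutMs: number) {}

  send(id: string, body: string): void {
    this.messages.push({ id: id, body: body, visibleAt: 0 });
  }

  // Returns the first visible message and hides it for the timeout window.
  receive(nowMs: number): Message | undefined {
    for (let i = 0; i < this.messages.length; i++) {
      const m = this.messages[i];
      if (m.visibleAt <= nowMs) {
        m.visibleAt = nowMs + this.visibilityTimeoutMs;
        return { id: m.id, body: m.body };
      }
    }
    return undefined;
  }

  // Permanently removes a message once its work is safely stored.
  ack(id: string): void {
    this.messages = this.messages.filter((m) => m.id !== id);
  }

  get size(): number {
    return this.messages.length;
  }
}

// Handles at most one task. If persist throws (simulating a crash), the
// message is never acked and becomes visible again after the timeout.
function processOne(
  queue: VisibilityQueue,
  nowMs: number,
  persist: (body: string) => void
): boolean {
  const msg = queue.receive(nowMs);
  if (!msg) return false;
  persist(msg.body);
  queue.ack(msg.id);
  return true;
}
```

Because a task can be delivered more than once under this scheme, the persist step should be idempotent, for example by writing under a key derived from the task itself.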
Anti-Detection Layer: Per-platform proxy pools (residential/datacenter/Tor), TLS fingerprint selection, request timing with Gaussian jitter, UA rotation, and session-consistent cookie management. Proxy health tracked per-domain with automatic failover.
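Two of the mechanisms named above, Gaussian jitter and per-domain proxy failover, can be sketched as follows. The jitter uses the standard Box-Muller transform. The failover rule (skip a proxy after a number of consecutive failures on a given domain) is an assumption, since the source does not describe how health is scored; `jitteredDelayMs`, `ProxyHealth`, and all parameters are hypothetical names.

```typescript
// Standard normal sample via the Box-Muller transform.
// rand is injectable so the function can be tested deterministically.
function gaussian(rand: () => number = Math.random): number {
  const u1 = 1 - rand(); // shift to (0, 1] so Math.log never sees 0
  const u2 = rand();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Delay drawn from N(meanMs, stdDevMs^2), clamped so it never drops below minMs.
function jitteredDelayMs(
  meanMs: number,
  stdDevMs: number,
  minMs: number = 0,
  rand: () => number = Math.random
): number {
  return Math.max(minMs, meanMs + stdDevMs * gaussian(rand));
}

type ProxyKind = "residential" | "datacenter" | "tor";

interface ProxyEntry {
  url: string;
  kind: ProxyKind;
}

// Tracks consecutive failures per (domain, proxy) pair.
// ASSUMPTION: a simple consecutive-failure threshold; the real scoring
// used by the system is not described in the source.
class ProxyHealth {
  private failures: { [key: string]: number } = {};

  constructor(private maxFailures: number) {}

  private key(domain: string, proxyUrl: string): string {
    return domain + "|" + proxyUrl;
  }

  // A success resets the counter; a failure increments it.
  report(domain: string, proxyUrl: string, ok: boolean): void {
    const k = this.key(domain, proxyUrl);
    this.failures[k] = ok ? 0 : (this.failures[k] || 0) + 1;
  }

  // Returns the first proxy in pool order that is still healthy for this domain.
  pick(domain: string, pool: ProxyEntry[]): ProxyEntry | undefined {
    for (let i = 0; i < pool.length; i++) {
      const count = this.failures[this.key(domain, pool[i].url)] || 0;
      if (count < this.maxFailures) return pool[i];
    }
    return undefined;
  }
}
```

Keying health by domain rather than globally means a proxy blocked on one platform stays available for the others.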
Data Layer: Raw HTML archived to S3. Parsed records normalized into a unified schema with EUR currency conversion and deduplication by canonical identifier. Strapi CMS as the admin interface for campaign configuration and data review. Publication pipeline outputs structured datasets for downstream consumers.
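The normalization step might look roughly like the sketch below. The record shapes (`RawListing`, `NormalizedListing`), the way the canonical identifier is built, and the caller-supplied rate table are all assumptions for illustration. The source states only that records are converted to EUR and deduplicated by a canonical identifier.

```typescript
interface RawListing {
  market: string;
  sourceId: string;
  title: string;
  price: number;
  currency: string;
}

interface NormalizedListing {
  canonicalId: string;
  market: string;
  title: string;
  priceEur: number;
}

// Converts an amount to EUR, rounded to cents. Rates are passed in by the
// caller; where the real system sources them is not stated.
function toEur(
  amount: number,
  currency: string,
  ratesToEur: { [currency: string]: number | undefined }
): number {
  const rate = currency === "EUR" ? 1 : ratesToEur[currency];
  if (rate === undefined) {
    throw new Error("No EUR rate for currency: " + currency);
  }
  return Math.round(amount * rate * 100) / 100;
}

// ASSUMPTION: market plus platform id is unique enough to act as the
// canonical identifier.
function canonicalId(raw: RawListing): string {
  return raw.market.toLowerCase() + ":" + raw.sourceId.trim();
}

// Maps raw records to the unified schema, keeping the first record seen
// for each canonical id and dropping later duplicates.
function normalize(
  raws: RawListing[],
  ratesToEur: { [currency: string]: number | undefined }
): NormalizedListing[] {
  const seen: { [id: string]: boolean } = {};
  const out: NormalizedListing[] = [];
  for (let i = 0; i < raws.length; i++) {
    const raw = raws[i];
    const id = canonicalId(raw);
    if (seen[id]) continue;
    seen[id] = true;
    out.push({
      canonicalId: id,
      market: raw.market,
      title: raw.title.trim(),
      priceEur: toEur(raw.price, raw.currency, ratesToEur),
    });
  }
  return out;
}
```

Failing on an unknown currency, rather than passing the price through unconverted, keeps mixed-currency values out of the unified dataset.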
03_IMPACT_METRICS
Technical_Impact
- 5+ EU markets running concurrently on the same infrastructure
- Campaign state machine with full recovery — no data loss on worker crash
- Proxy toolbox: residential, datacenter, Tor — automatically selected per target
- GUI-based selector tool: new markets configurable without code changes
- Terraform-managed AWS infrastructure: ECS, SQS, S3, RDS, CloudFront
- Sentry error tracking with per-campaign alerting and debugging queries
- Deploy pipeline: GitHub Actions → ECR → ECS rolling update, < 5 min deploys
Business_Impact
- Sole engineer — designed, built, deployed, and maintained the entire system
- Millions of records collected and normalized across production campaigns
- Non-technical operators manage campaigns via Strapi without engineering involvement
- Data latency: new market listings available within hours of publication
- System runs continuously — campaigns pause and resume without operator intervention
05_TECH_STACK
AWS · Distributed Systems · Web Crawling · Docker · PostgreSQL · Campaign Orchestration · Proxy Networks · TypeScript