A Mumbai-based PropTech SaaS company was manually pulling property data from three major portals (MagicBricks, 99acres, and Housing.com) for five Indian cities: Mumbai, Delhi, Bangalore, Pune, and Hyderabad. The process consumed 40+ hours of analyst time per week, and the data was still incomplete and outdated by the time it reached their ML valuation models.
The client needed a fully automated pipeline, updated daily, that would deliver city-wise property listings (new construction, resale, and rental) with prices, floor plans, amenities, and agent contact info into their PostgreSQL database, without purchasing the data licenses the portals had quoted at ₹15–20L/year.
Parallel Playwright Scrapers
Built a separate scraper for each of the three portals, all running in parallel on AWS EC2. MagicBricks required full headless browser automation because of its Cloudflare protection. 99acres and Housing.com required JS rendering but were accessible via Playwright with proper session management.
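As a rough illustration of running the three portal scrapers side by side, here is a minimal sketch using `asyncio`. The `scrape_portal` coroutine is a hypothetical placeholder, not the project's scraper: a real version would drive a Playwright browser session at that point. The sketch only shows the fan-out pattern and how one portal failing need not stop the others.

```python
import asyncio

PORTALS = ["MagicBricks", "99acres", "Housing.com"]


async def scrape_portal(portal: str, city: str) -> list[dict]:
    """Hypothetical placeholder; a real version would drive a browser here."""
    await asyncio.sleep(0)  # stands in for network and rendering time
    return [{"portal": portal, "city": city}]


async def scrape_city(city: str) -> dict[str, list[dict]]:
    """Run all portal scrapers for one city concurrently."""
    results = await asyncio.gather(
        *(scrape_portal(portal, city) for portal in PORTALS),
        return_exceptions=True,  # collect failures instead of cancelling siblings
    )
    listings: dict[str, list[dict]] = {}
    for portal, result in zip(PORTALS, results):
        if isinstance(result, Exception):
            print(f"{portal} failed for {city}: {result!r}")
            listings[portal] = []
        else:
            listings[portal] = result
    return listings


if __name__ == "__main__":
    print(asyncio.run(scrape_city("Pune")))
```

Whether the production system used one process with coroutines or separate EC2 workers per portal is not stated; the same isolation idea applies in either case.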
Indian Residential IP Pool
Integrated a pool of 50,000+ Indian residential IPs (Mumbai, Delhi, Bangalore subnets) to avoid geo-blocks and appear as genuine Indian users. Each scraper session used a fresh IP per city per portal.
PostgreSQL with Daily Delta Updates
Instead of running full re-scrapes, we built a hashing system that detects listing changes (price updates, status changes). Only new or changed listings are written to the database, reducing compute costs by 70%.
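The change-detection idea can be sketched as follows. This is a minimal example under stated assumptions, not the production code: the tracked fields, the `listing_id` key, and the `stored_hashes` lookup (which in practice would come from a hash column in PostgreSQL) are all hypothetical names chosen for illustration.

```python
import hashlib
import json

# Fields whose changes should trigger a write (an assumed list).
TRACKED_FIELDS = ("price", "status", "amenities")


def listing_fingerprint(listing: dict) -> str:
    """Return a stable SHA-256 hash over the tracked fields of a listing."""
    subset = {field: listing.get(field) for field in TRACKED_FIELDS}
    # sort_keys gives a canonical serialization, so equal data hashes equally
    canonical = json.dumps(subset, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def select_changed(scraped: list[dict], stored_hashes: dict[str, str]) -> list[dict]:
    """Keep listings that are new or whose fingerprint differs from the stored one."""
    changed = []
    for listing in scraped:
        fingerprint = listing_fingerprint(listing)
        if stored_hashes.get(listing["listing_id"]) != fingerprint:
            changed.append({**listing, "content_hash": fingerprint})
    return changed
```

Only the rows returned by `select_changed` would then be upserted, along with their new `content_hash`, so unchanged listings cost a hash comparison and nothing more.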
City-Wise Scheduling
Each of the 15 city×portal combinations (5 cities across 3 portals) runs on a staggered schedule, distributed across 6-hour windows to avoid rate-limit spikes.
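One way to stagger the 15 combinations is to space their start times evenly across a window. The sketch below assumes a single 6-hour window and even spacing; the actual offsets used in the project are not stated, and `build_schedule` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from itertools import product

CITIES = ["Mumbai", "Delhi", "Bangalore", "Pune", "Hyderabad"]
PORTALS = ["MagicBricks", "99acres", "Housing.com"]
WINDOW = timedelta(hours=6)  # assumed: one window holds all 15 jobs


def build_schedule(window_start: datetime) -> list[tuple[datetime, str, str]]:
    """Return (start_time, city, portal) tuples spaced evenly across the window."""
    # City-major ordering means consecutive jobs target different portals.
    jobs = list(product(CITIES, PORTALS))
    step = WINDOW / len(jobs)
    return [
        (window_start + i * step, city, portal)
        for i, (city, portal) in enumerate(jobs)
    ]


if __name__ == "__main__":
    for start, city, portal in build_schedule(datetime(2024, 1, 1)):
        print(start.strftime("%H:%M"), city, portal)
```

With 15 jobs in 6 hours, starts are 24 minutes apart, and any single portal sees a new job only every 72 minutes.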
- MagicBricks: Cloudflare JS challenge requiring a browser-grade TLS fingerprint
- 99acres: aggressive rate limiting at 3 requests/minute per IP
- Housing.com: a full React SPA with lazy-loaded listing grids that require scroll simulation
- Dynamic price rendering via XHR calls requiring API interception
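To stay under a limit like the 3 requests/minute per IP noted for 99acres, requests from each IP can be spaced at least 20 seconds apart. The following is a minimal sketch of that idea, not the project's implementation; `PerIpRateLimiter` is a hypothetical class, and even spacing is a deliberately conservative choice compared with a sliding-window counter.

```python
import time


class PerIpRateLimiter:
    """Space requests from each IP evenly so no IP exceeds the per-minute cap."""

    def __init__(self, max_per_minute: int = 3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock  # injectable for testing
        self._sleep = sleep
        self._next_allowed: dict[str, float] = {}

    def wait(self, ip: str) -> float:
        """Block until `ip` may send its next request; return seconds waited."""
        now = self._clock()
        ready_at = self._next_allowed.get(ip, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # Reserve the next slot relative to when this request actually goes out.
        self._next_allowed[ip] = max(now, ready_at) + self.min_interval
        return delay
```

A scraper would call `limiter.wait(proxy_ip)` before each page request. Because the limit is tracked per IP, a larger proxy pool raises total throughput without any single address crossing the threshold.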