The Client Challenge
A Mumbai-based PropTech startup (name withheld for confidentiality) had built a property valuation model that required fresh listing data from all three major Indian real estate portals: MagicBricks, 99acres, and Housing.com.
Their data science team was manually downloading CSV exports from these platforms, a process that took 40+ hours per week and still missed thousands of new listings. The data was always 3 to 7 days stale by the time it entered their model.
They needed: daily fresh data, automated delivery, coverage across 5 cities (Mumbai, Pune, Hyderabad, Bangalore, Delhi NCR), and at least 95% completeness. Their valuation model's accuracy directly depended on data freshness and completeness.
Our Technical Approach
After analyzing all three portals, we designed a parallel scraping architecture:
- 3 separate Playwright scrapers: one per portal, each tuned for that portal's specific anti-bot system and URL structure.
- Indian residential proxy pool: an ISP-level residential pool, critical because all three portals geo-restrict some data and 99acres blocked non-Indian IPs.
- City-wise partitioning: each city ran in parallel, with scrapers starting simultaneously across all 5 cities.
- PostgreSQL delivery with daily delta updates: only new or changed listings are inserted or updated each day, keeping the database lean and efficient.
- Deduplication layer: the same property often appears on multiple portals under different IDs. Our system fingerprinted properties by location, size, and price to remove cross-portal duplicates.
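To illustrate the fingerprinting idea behind the deduplication layer, here is a minimal Python sketch. The `Listing` fields, the bucket sizes (25 sq ft, ₹1 lakh), and the "keep the first copy seen" policy are assumptions made for this example, not the production schema or rules.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    # Hypothetical fields, chosen only for this illustration.
    portal: str       # e.g. "magicbricks", "99acres", "housing"
    portal_id: str    # the portal's own listing ID
    locality: str
    city: str
    area_sqft: float
    price_inr: int


def fingerprint(listing: Listing, area_step: int = 25,
                price_step: int = 100_000) -> str:
    """Hash normalized location plus coarse size and price buckets.

    Bucketing lets small differences between portals (1,000 vs 1,004 sq ft)
    map to the same key. Values near a bucket edge can still split.
    """
    locality = " ".join(listing.locality.lower().split())
    city = listing.city.strip().lower()
    area_bucket = round(listing.area_sqft / area_step)
    price_bucket = round(listing.price_inr / price_step)
    key = f"{city}|{locality}|{area_bucket}|{price_bucket}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def deduplicate(listings):
    """Keep the first listing seen for each fingerprint."""
    seen = {}
    for listing in listings:
        seen.setdefault(fingerprint(listing), listing)
    return list(seen.values())
```

With this sketch, a flat listed as "Bandra West, 1,000 sq ft" on one portal and " bandra west, 1,004 sq ft" on another collapses to a single record, while a flat in a different locality stays separate.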
Technical Challenges We Solved
MagicBricks: Uses Cloudflare Enterprise with aggressive JS fingerprinting. We solved this with Playwright stealth mode, realistic viewport settings, and controlled request timing to mimic human browsing patterns.
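The "controlled request timing" part can be pictured with a small Python sketch. It only shows one way to randomize pauses between page actions; the base delay, jitter, and long-pause probability are illustrative assumptions, not the values used in production, and stealth configuration is not covered.

```python
import random


def human_delay(rng: random.Random, base: float = 2.0, jitter: float = 1.5,
                long_pause_chance: float = 0.1) -> float:
    """Return a randomized pause, in seconds, between page actions.

    All default values are assumptions chosen for illustration.
    """
    delay = base + rng.uniform(0.0, jitter)
    # Occasionally add a longer pause, as if the visitor stopped to read.
    if rng.random() < long_pause_chance:
        delay += rng.uniform(5.0, 15.0)
    return delay
```

A scraper loop could pass the result to `time.sleep` before each navigation, so that requests never arrive at a fixed, machine-like interval.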
99acres: Has aggressive rate limiting, where more than 3 requests per second from the same IP triggers a soft block. We distributed requests across a proxy pool of 200+ IPs with per-IP rate limiting. The portal also uses React-based lazy loading for listing cards, which required scroll simulation.
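A per-IP limiter of this kind can be sketched in a few lines of Python. This is an illustration under assumptions, not the production logic: the `ProxyRateLimiter` class, the conservative cap of 2 requests per second per IP (below the 3 per second threshold described above), and the "pick whichever proxy frees up soonest" policy are all hypothetical.

```python
import time
from typing import Callable, List, Tuple


class ProxyRateLimiter:
    """Spread requests across proxies while capping each proxy's rate."""

    def __init__(self, proxies: List[str], max_per_second: float = 2.0,
                 clock: Callable[[], float] = time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        # Earliest time each proxy may send its next request.
        self._next_free = {proxy: float("-inf") for proxy in proxies}

    def acquire(self) -> Tuple[str, float]:
        """Return (proxy, seconds_to_wait) for the next request."""
        now = self._clock()
        proxy = min(self._next_free, key=self._next_free.get)
        start = max(now, self._next_free[proxy])
        # Reserve the slot so the next caller sees this proxy as busy.
        self._next_free[proxy] = start + self._min_interval
        return proxy, start - now
```

The caller sleeps for the returned wait time and then sends the request through the returned proxy. With 200 proxies at 2 requests per second each, the pool as a whole could sustain roughly 400 requests per second while no single IP exceeds its cap.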
Housing.com: Dynamic React SPA with client-side data fetching via internal APIs. We reverse-engineered their GraphQL API (used internally by their own frontend) and called it directly, which proved more reliable than parsing the rendered HTML.
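Calling a GraphQL API directly usually means sending a JSON body with `query` and `variables` in an HTTP POST. The Python sketch below shows that general shape only. The endpoint URL, the query text, and every field name are hypothetical placeholders, since the portal's real schema is not described here.

```python
import json
import urllib.request

# Hypothetical placeholders: not the portal's real endpoint or schema.
GRAPHQL_URL = "https://example.invalid/graphql"
SEARCH_QUERY = """
query SearchListings($city: String!, $page: Int!) {
  listings(city: $city, page: $page) { id title price areaSqft locality }
}
"""


def build_graphql_request(city: str, page: int) -> urllib.request.Request:
    """Build (but do not send) a GraphQL POST request for one results page."""
    payload = {"query": SEARCH_QUERY,
               "variables": {"city": city, "page": page}}
    return urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


def parse_listings(response_body: bytes) -> list:
    """Extract listings from a GraphQL JSON response, failing on errors."""
    doc = json.loads(response_body)
    if doc.get("errors"):
        raise RuntimeError(f"GraphQL errors: {doc['errors']}")
    return doc["data"]["listings"]
```

The request would be sent with `urllib.request.urlopen` (or any HTTP client), and the response body passed to `parse_listings`. Because the response is already structured JSON, there are no CSS selectors to break when the page layout changes.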
Results & Performance
The pipeline achieved:
- 2.1 million listings scraped and delivered on the first production run
- 48 hours from project kick-off to first delivery
- Daily delta updates running automatically via cron at 2 AM IST
- 99.3% uptime over 18 months of continuous operation
- ₹12L/year savings vs. the cost of buying equivalent data from portal data divisions
- 18% improvement in their valuation model accuracy due to data freshness
The client has since expanded the pipeline to cover 3 additional cities and added Nobroker.in as a fourth source. The system now runs entirely on autopilot, with alerts whenever the completion rate drops below 95%.
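A completion-rate alert of this kind might look like the Python sketch below. How completeness is measured is not specified above, so the sketch assumes it is the ratio of delivered listings to an expected count per city (for instance, the result count a portal reports). The function names and message format are hypothetical.

```python
def completion_rate(expected: int, delivered: int) -> float:
    """Fraction of expected listings that were actually delivered."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return delivered / expected


def check_run(expected_by_city: dict, delivered_by_city: dict,
              threshold: float = 0.95) -> list:
    """Return one alert message per city that falls below the threshold."""
    alerts = []
    for city, expected in expected_by_city.items():
        rate = completion_rate(expected, delivered_by_city.get(city, 0))
        if rate < threshold:
            alerts.append(
                f"{city}: completion {rate:.1%} below {threshold:.0%}")
    return alerts
```

A nightly job could run `check_run` after the delta update finishes and forward any returned messages to email or chat. An empty list means every city met the threshold.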