Why Scrape Amazon in 2025?
Amazon hosts over 350 million product listings and processes billions in sales daily. For e-commerce businesses, researchers, and data analysts, Amazon is one of the richest sources of product pricing, review sentiment, Best Sellers Rank (BSR) data, and market intelligence on the internet.
Whether you're monitoring competitor pricing in real time, building price comparison tools, conducting consumer research, or training AI models on product descriptions, Amazon scraping is a critical capability for data-driven businesses in 2025.
The challenge is that Amazon is also among the most aggressively bot-protected websites. Naive scrapers get blocked within minutes. This guide covers the professional techniques we use in production at DataScraper.in to reliably extract Amazon data at scale.
What Data Can You Extract From Amazon?
Amazon product pages expose a wealth of structured data. Here's what you can reliably scrape:
- Product details: Title, ASIN, description, bullet points, brand, category, product dimensions, weight
- Pricing data: Current price, list price, discount percentage, deal prices, and price history (accumulated from repeated scrapes, since a product page shows only current prices)
- Reviews & ratings: Overall rating, review count, review text, reviewer profiles, helpful votes, verified purchase status
- Seller data: Seller name, fulfillment method (FBA vs FBM), offer count, buy box winner
- Rankings: Best Sellers Rank (BSR) by category, search rank for specific keywords
- Images: Product image URLs in multiple resolutions
- Search results: SERP listings, sponsored vs organic results, product positions
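A minimal sketch of how a few of these fields might be held in a typed record, assuming US-style price formatting. The names `ProductRecord`, `parse_price`, and `discount_pct` are hypothetical and chosen for illustration; they are not part of any library.

```python
import re
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import List, Optional


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Turn a displayed price such as '$1,299.99' into a Decimal (None if unparseable)."""
    # Assumption: ',' is the thousands separator and '.' the decimal point.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    if not match:
        return None
    try:
        return Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        return None


@dataclass
class ProductRecord:
    asin: str
    title: str
    price: Optional[Decimal] = None
    list_price: Optional[Decimal] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None
    image_urls: List[str] = field(default_factory=list)

    @property
    def discount_pct(self) -> Optional[Decimal]:
        """Discount relative to list price, in percent, rounded to one decimal place."""
        if self.price is None or not self.list_price:
            return None
        pct = (self.list_price - self.price) / self.list_price * 100
        return pct.quantize(Decimal("0.1"))
```

Using `Decimal` rather than `float` avoids rounding surprises when prices are later compared or aggregated. Non-US marketplaces use different separators, so the parser would need a locale-aware variant there.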
Choosing the Right Tool: Scrapy vs Playwright
For Amazon scraping, your tool choice significantly affects reliability and scale:
Scrapy is excellent for high-volume, efficient crawling of simple product pages. It's fast, memory-efficient, and has built-in support for proxy middleware, retry logic, and pipelines. For static product pages (common on Amazon), Scrapy is often sufficient and faster.
Playwright (or Puppeteer/Selenium) is necessary when Amazon serves bot challenges, requires JavaScript execution, or when you need to simulate human behavior to bypass detection. It's slower but far more reliable for protected endpoints.
At DataScraper.in, we typically use a hybrid approach: Scrapy for bulk product data extraction with smart proxy rotation, falling back to Playwright for pages that trigger anti-bot challenges.

    import scrapy


    class AmazonSpider(scrapy.Spider):
        name = 'amazon_products'
        custom_settings = {
            'DOWNLOADER_MIDDLEWARES': {
                # The scrapy-rotating-proxies package installs as the `rotating_proxies` module
                'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
                'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
            },
            'ROTATING_PROXY_LIST_PATH': 'proxies.txt',
            'DOWNLOAD_DELAY': 2,
            'RANDOMIZE_DOWNLOAD_DELAY': True,
        }

        def start_requests(self):
            asins = ['B08N5KWB9H', 'B09G9FPCDV', 'B0BDHWDR12']
            for asin in asins:
                url = f'https://www.amazon.com/dp/{asin}'
                yield scrapy.Request(
                    url,
                    headers={
                        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
                        'Accept-Language': 'en-US,en;q=0.9',
                        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                    },
                    # Pass the ASIN along instead of re-parsing it from a possibly redirected URL
                    cb_kwargs={'asin': asin},
                )

        def parse(self, response, asin):
            yield {
                'asin': asin,
                'title': response.css('#productTitle::text').get('').strip(),
                'price': response.css('.a-price .a-offscreen::text').get(''),
                'rating': response.css('[data-hook="average-star-rating"] .a-size-base::text').get(''),
                'review_count': response.css('[data-hook="total-review-count"]::text').get(''),
                'bsr': response.css('#SalesRank .a-list-item::text').getall(),
            }

Handling Amazon's Anti-Bot Defenses
Amazon uses several layers of bot detection that you must address for reliable scraping:
- IP-based rate limiting: Amazon blocks datacenter IPs within minutes. Solution: use residential proxy rotation (ISP or residential proxies, not datacenter).
- Browser fingerprinting: Amazon checks browser headers, TLS fingerprint, and JavaScript properties. Solution: use realistic browser headers and Playwright with proper viewport/user-agent settings.
- CAPTCHA challenges: Amazon may serve image CAPTCHAs for suspicious activity. Solution: implement exponential backoff and retry with a different proxy, or use CAPTCHA-solving services as last resort.
- Session cookies: Amazon tracks sessions. Solution: keep one cookie jar per proxy session, and never send a session's cookies from a different IP than the one that created them.
The most important defense is residential proxy rotation. Datacenter proxies (AWS, GCP, etc.) are blocklisted by Amazon. You need genuine residential IPs from ISPs like Comcast, AT&T, etc.
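The backoff-and-switch-proxy strategy from the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: the strings in `BLOCK_MARKERS` are assumed markers of Amazon's robot-check page and should be verified against real responses, and `fetch` is a hypothetical callable you supply that takes a URL and a proxy and returns `(status, body)`.

```python
import random
import time
from typing import Callable, Optional, Sequence, Tuple

# Assumed markers of a robot-check page; confirm them against real blocked responses.
BLOCK_MARKERS = (
    "/errors/validateCaptcha",
    "Enter the characters you see below",
    "api-services-support@amazon.com",
)


def looks_blocked(status: int, body: str) -> bool:
    """Heuristic: throttling status codes or a CAPTCHA interstitial in the body."""
    if status in (429, 503):
        return True
    return any(marker in body for marker in BLOCK_MARKERS)


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def fetch_with_retries(
    url: str,
    proxies: Sequence[str],
    fetch: Callable[[str, str], Tuple[int, str]],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try the URL through successive proxies, backing off after each failed attempt."""
    for attempt in range(max_attempts):
        proxy = proxies[attempt % len(proxies)]  # use a different proxy on each attempt
        status, body = fetch(url, proxy)
        if status == 200 and not looks_blocked(status, body):
            return body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return None  # caller can hand the URL to a browser-based fallback
```

Returning `None` instead of raising lets the caller decide what happens next, for example queueing the URL for a Playwright worker. The jitter spreads retries out so that many workers blocked at the same moment do not all retry together.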
Parsing Product Data Correctly
Amazon's HTML structure changes frequently. Instead of relying on brittle CSS selectors alone, we recommend a multi-signal approach for critical fields:
For pricing, Amazon can expose data in multiple places: the displayed price, the offscreen accessibility price, and, when the page includes it, JSON-LD structured data. Where JSON-LD is present, it is often more reliable than selectors, so try it first and fall back to CSS:
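
    import json


    def extract_product_data(response):
        """Extract name, price, currency and rating, preferring JSON-LD over CSS selectors."""
        # Try every JSON-LD block first; a page may contain several
        for raw in response.css('script[type="application/ld+json"]::text').getall():
            try:
                data = json.loads(raw)
            except json.JSONDecodeError:
                continue
            if isinstance(data, list):
                data = data[0] if data else {}
            if not isinstance(data, dict) or data.get('@type') != 'Product':
                continue
            offers = data.get('offers') or {}
            if isinstance(offers, list):  # 'offers' may be a list of Offer objects
                offers = offers[0] if offers else {}
            rating = data.get('aggregateRating') or {}
            return {
                'name': data.get('name'),
                'price': offers.get('price'),
                'currency': offers.get('priceCurrency'),
                'rating': rating.get('ratingValue'),
            }

        # Fall back to CSS selectors (same keys, so downstream code sees one shape)
        return {
            'name': response.css('#productTitle::text').get('').strip(),
            'price': response.css('.a-price .a-offscreen::text').get(''),
            'currency': None,
            'rating': None,
        }

Scaling to Thousands of ASINs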
For large-scale Amazon scraping (10,000+ products), you need a distributed architecture:
- Job queue: Use Redis or AWS SQS to manage ASIN lists and retry failed extractions
- Multiple workers: Run 5–20 parallel Scrapy/Playwright instances across multiple servers
- Proxy pool management: Rotate through a pool of 500+ residential proxies; retire blocked ones automatically
- Scheduling: For price monitoring, use cron or Airflow to schedule recurring runs
- Error handling: Track success rates per proxy; auto-detect structural changes (HTML selectors breaking)
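As one way to picture the proxy pool and per-proxy success tracking from the list above, here is a small in-memory sketch. The class names `ProxyPool` and `ProxyStats`, and the thresholds `min_rate` and `min_samples`, are hypothetical choices for illustration; a real deployment would likely keep this state in a shared store so that all workers see it.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class ProxyStats:
    successes: int = 0
    failures: int = 0

    @property
    def total(self) -> int:
        return self.successes + self.failures

    @property
    def success_rate(self) -> float:
        # A proxy with no traffic yet is treated as healthy.
        return self.successes / self.total if self.total else 1.0


class ProxyPool:
    """Tracks per-proxy outcomes and retires proxies whose success rate drops too low."""

    def __init__(self, proxies: List[str], min_rate: float = 0.5, min_samples: int = 10):
        self.stats: Dict[str, ProxyStats] = {p: ProxyStats() for p in proxies}
        self.min_rate = min_rate
        self.min_samples = min_samples
        self.retired: Set[str] = set()

    def active(self) -> List[str]:
        return [p for p in self.stats if p not in self.retired]

    def pick(self) -> str:
        candidates = self.active()
        if not candidates:
            raise RuntimeError("proxy pool exhausted")
        return random.choice(candidates)

    def report(self, proxy: str, ok: bool) -> None:
        stats = self.stats[proxy]
        if ok:
            stats.successes += 1
        else:
            stats.failures += 1
        # Retire only once there is enough evidence, so one bad request does not drop a proxy.
        if stats.total >= self.min_samples and stats.success_rate < self.min_rate:
            self.retired.add(proxy)
```

A worker would call `pick()` before each request and `report()` afterwards. The same success-rate signal, aggregated across all proxies, can hint at a structural change: if every proxy starts failing at once, the selectors are a more likely culprit than the IPs.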
At DataScraper.in, we've scraped 1M+ Amazon ASINs in single batch runs for clients building price intelligence tools. The key is robust infrastructure, not just good scraping code.
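The job-queue item in the list above, with retries for failed extractions, can be illustrated with an in-memory stand-in. This sketch uses `collections.deque` in place of Redis or SQS purely so that it runs anywhere; `run_jobs` and the `scrape` callable are hypothetical names, and the retry policy shown is an assumption rather than a description of our pipeline.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def run_jobs(
    asins: List[str],
    scrape: Callable[[str], dict],
    max_attempts: int = 3,
) -> Tuple[Dict[str, dict], List[str]]:
    """Process ASINs from a FIFO queue, re-queueing failures up to max_attempts tries each."""
    queue: Deque[Tuple[str, int]] = deque((asin, 1) for asin in asins)
    results: Dict[str, dict] = {}
    dead_letter: List[str] = []
    while queue:
        asin, attempt = queue.popleft()
        try:
            results[asin] = scrape(asin)
        except Exception:
            if attempt < max_attempts:
                queue.append((asin, attempt + 1))  # retry later, behind other work
            else:
                dead_letter.append(asin)  # give up, but keep the ASIN for inspection
    return results, dead_letter
```

Re-queueing at the back rather than retrying immediately gives a blocked ASIN some time before its next attempt. With a real broker, the dead-letter list would typically become a separate queue that someone reviews.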
Legal Considerations
Amazon's Terms of Service prohibit automated scraping, so scraping can expose you to breach-of-contract claims. Separately, scraping publicly accessible product data (pricing, descriptions, reviews) is generally considered lawful under US and EU law when done for legitimate purposes like price comparison, research, and competitive intelligence, though the law is not fully settled and varies by jurisdiction.
Key principles we follow at DataScraper.in: (1) Only scrape publicly accessible data — no account authentication required. (2) Respect rate limits and don't overload Amazon's servers. (3) Don't scrape personal data of users. (4) Consult legal counsel if your use case involves sensitive jurisdictions or purposes.
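For commercial projects, we recommend checking with a lawyer familiar with hiQ Labs v. LinkedIn and similar cases. In hiQ, the Ninth Circuit held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act, but a district court later found that hiQ had breached LinkedIn's user agreement, so terms-of-service claims remain a real risk.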