AI Training Data Case Study: 847M Hindi Tokens in 21 Days

The Challenge

A Bangalore-based AI startup was fine-tuning a large language model on Hindi language tasks: news summarization, question answering, and conversational Hindi. Existing Hindi datasets (CC-100, IndicCorp) were heavily skewed toward formal news text and contained significant amounts of English-Hindi code-switching that degraded model performance on pure Hindi tasks.

The startup needed a custom, domain-diverse corpus of 500M+ clean Hindi tokens covering colloquial language, informal registers, government communications, and domain-specific vocabulary (legal, medical, educational). Budget and timeline constraints meant the entire corpus had to be collected, cleaned, and delivered within 21 days.

Our Approach

01

Multi-Source Web Crawling

Built specialized crawlers for 120 Hindi-language sources: national and regional news portals (Dainik Bhaskar, Amar Ujala, Navbharat Times), government press releases (PIB Hindi, state government sites), Hindi Quora, discussion forums, Wikipedia Hindi, and Hindi web blogs. Each source had a custom extraction template to isolate article text from navigation, ads, and boilerplate.
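
As an illustration of what a per-source extraction template can look like, here is a minimal sketch using only the Python standard library. It is not the crawler code used on the project (the tech stack lists Scrapy and Playwright). The domain name, the CSS class, and the names `TEMPLATES`, `ContainerTextExtractor`, and `extract_article` are all hypothetical, and the sketch assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

# Hypothetical template table: source domain -> CSS class of the article container.
TEMPLATES = {"example-news.test": "article-body"}

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "source", "wbr"}


class ContainerTextExtractor(HTMLParser):
    """Collect text inside elements whose class list contains `css_class`."""

    def __init__(self, css_class, skip_tags=("script", "style", "nav", "aside")):
        super().__init__(convert_charrefs=True)
        self.css_class = css_class
        self.skip_tags = set(skip_tags)
        self.depth = 0       # nesting depth inside the target container
        self.skip_depth = 0  # nesting depth inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1
            return
        self.depth += 1
        if tag in self.skip_tags or self.skip_depth:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside the container and outside skipped tags.
        if self.depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_article(html, source):
    """Return article text for `source`, one text chunk per line."""
    parser = ContainerTextExtractor(TEMPLATES[source])
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Keeping the selector in a table rather than in code means that adding a source is mostly a matter of adding one entry, which is one plausible way to keep 120 layouts manageable.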

02

Language Detection & Quality Filtering

Applied FastText language identification to filter out English-heavy or Hinglish paragraphs, keeping only those with a Hindi LID score above 0.92. Removed paragraphs with more than 15% non-Devanagari characters. Stripped HTML, applied Unicode normalization (NFC), and removed exact duplicates using MD5 hashing at the paragraph level.
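
The script-ratio, normalization, and exact-duplicate steps can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the production filter. The FastText step is left out because it needs a trained model file. The function names are hypothetical, and the ratio is assumed to be computed over non-whitespace characters, with "Devanagari" meaning the Unicode block U+0900 to U+097F; the source does not specify either detail.

```python
import hashlib
import unicodedata


def non_devanagari_ratio(text):
    """Fraction of non-whitespace characters outside the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # treat empty text as fully non-Devanagari
    outside = sum(1 for c in chars if not ("\u0900" <= c <= "\u097F"))
    return outside / len(chars)


def clean_paragraphs(paragraphs, max_non_devanagari=0.15):
    """NFC-normalize, drop low-Devanagari paragraphs, drop exact duplicates."""
    seen = set()
    kept = []
    for para in paragraphs:
        para = unicodedata.normalize("NFC", para).strip()
        if not para or non_devanagari_ratio(para) > max_non_devanagari:
            continue
        # MD5 is used only as a compact fingerprint, not for security.
        digest = hashlib.md5(para.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(para)
    return kept
```

Normalizing before hashing matters here: two paragraphs that differ only in how a character is encoded would otherwise produce different MD5 digests and survive as duplicates.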

03

MinHash LSH Deduplication

Applied MinHash with Locality Sensitive Hashing (LSH) across the full corpus to detect near-duplicate documents (Jaccard similarity > 0.8). Eliminated ~180M tokens of near-duplicates, improving corpus diversity.
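
For readers unfamiliar with the technique, the following is a single-machine sketch of MinHash with LSH banding in plain Python. The project ran a distributed version on Spark, so this only shows the idea. The word 3-gram shingles, the 128 hash functions, the 16 bands of 8 rows, and all function names are assumptions chosen for illustration. Only the 0.8 Jaccard threshold comes from the description above.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # modulus for the (a*h + b) % PRIME hash family


def shingles(text, k=3):
    """Set of word k-grams (the whole text if it has fewer than k words)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def make_hash_params(num_perm, seed=42):
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_perm)]


def minhash_signature(shingle_set, params):
    """One minimum per hash function over all shingles."""
    base = [int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
            for s in shingle_set]
    return [min((a * h + b) % PRIME for h in base) for a, b in params]


def near_duplicate_pairs(docs, threshold=0.8, bands=16, rows=8):
    """Return index pairs (i, j), i < j, with estimated Jaccard > threshold."""
    params = make_hash_params(bands * rows)
    sigs = [minhash_signature(shingles(d), params) for d in docs]

    # LSH: documents sharing any identical band land in the same bucket.
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(sigs):
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)

    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(ids, 2))

    # Verify candidates by the fraction of matching signature positions.
    pairs = set()
    for i, j in candidates:
        estimate = sum(x == y for x, y in zip(sigs[i], sigs[j])) / len(sigs[i])
        if estimate > threshold:
            pairs.add((i, j))
    return pairs
```

With 16 bands of 8 rows, the banding step starts surfacing candidate pairs at a Jaccard similarity of roughly 0.7, so the explicit check against 0.8 afterwards is what enforces the stated threshold. The banding only avoids comparing every document with every other one.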

04

HuggingFace Datasets Packaging

Delivered the final corpus in Apache Parquet format, organized into train/validation splits, with HuggingFace Datasets metadata. The client could load it directly with datasets.load_dataset() and plug it straight into their training pipeline.
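
Loading Parquet splits with the `datasets` library might look like the sketch below. The directory layout and the file names `train.parquet` and `validation.parquet` are assumptions, since the actual delivery layout is not described here. The helper names are hypothetical. The `load_dataset("parquet", data_files=...)` call is the library's generic Parquet loader.

```python
from pathlib import Path


def parquet_data_files(root):
    """Map split names to Parquet paths under `root` (hypothetical layout)."""
    root = Path(root)
    return {split: str(root / f"{split}.parquet") for split in ("train", "validation")}


def load_corpus(root):
    """Load both splits as a DatasetDict; requires `pip install datasets`."""
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    return load_dataset("parquet", data_files=parquet_data_files(root))
```

Calling `load_corpus("hindi_corpus")` would then return a `DatasetDict` with `train` and `validation` keys, assuming those two files exist in that directory.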

Technical Challenges Solved

Identifying and parsing 120 different Hindi site layouts at scale
Language detection edge cases for Hinglish (mixed Hindi-English) content
Near-duplicate detection across a 1B+ token raw corpus, which required distributed MinHash with Spark
Heavy JavaScript rendering on several government sites, which required Playwright

Tech Stack

Python, Scrapy, Playwright, FastText, MinHash LSH, Apache Parquet, HuggingFace Datasets, Apache Spark, AWS S3, AWS EMR

Key Results

847M Hindi tokens

Delivered in 21 days

18% benchmark improvement
