A Bangalore-based AI startup was fine-tuning a large language model on Hindi-language tasks: news summarization, question answering, and conversational Hindi. Existing Hindi datasets (CC-100, IndicCorp) were heavily skewed toward formal news text and contained significant amounts of English-Hindi code-switching that degraded model performance on pure Hindi tasks.
The startup needed a custom, domain-diverse corpus of 500M+ clean Hindi tokens covering colloquial language, informal registers, government communications, and domain-specific vocabulary (legal, medical, educational). Budget and timeline constraints meant the entire corpus had to be collected, cleaned, and delivered within 21 days.
Multi-Source Web Crawling
Built specialized crawlers for 120 Hindi-language sources: national and regional news portals (Dainik Bhaskar, Amar Ujala, Navbharat Times), government press releases (PIB Hindi, state government sites), Hindi Quora, discussion forums, Wikipedia Hindi, and Hindi web blogs. Each source had a custom extraction template to isolate article text from navigation, ads, and boilerplate.
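A per-source extraction template can be as small as "which container holds the article, and which tags inside it are noise". The sketch below illustrates that idea using only the Python standard library. The class name `TemplateExtractor`, the `story` container class, and the list of skipped tags are all hypothetical, not the templates used in the project, and the parser assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TemplateExtractor(HTMLParser):
    """Collect text inside one container class, skipping noisy child tags."""

    def __init__(self, container_class,
                 skip_tags=("nav", "aside", "script", "style", "footer")):
        super().__init__(convert_charrefs=True)
        self.container_class = container_class
        self.skip_tags = set(skip_tags)
        self.depth = 0       # nesting depth inside the container (0 = outside)
        self.skip_depth = 0  # nesting depth inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth == 0:
            # Outside the container: only look for the container itself.
            classes = (dict(attrs).get("class") or "").split()
            if self.container_class in classes:
                self.depth = 1
            return
        self.depth += 1
        if self.skip_depth or tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_article(html, container_class):
    """Return the article text for one page, one text chunk per line."""
    parser = TemplateExtractor(container_class)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

In a setup like this, each of the 120 sources would map to its own container class (or a richer selector), so adding a source means adding a template entry rather than writing a new parser.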
Language Detection & Quality Filtering
Applied fastText language identification and kept only paragraphs whose Hindi confidence score exceeded 0.92, which filtered out English-heavy and Hinglish text. Removed paragraphs with more than 15% non-Devanagari characters. Stripped HTML, applied Unicode normalization (NFC), and removed exact duplicates by MD5-hashing each paragraph.
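The filtering steps above can be sketched with the standard library alone. The two thresholds come from the description. Everything else is an assumption made for illustration: the function names, counting every non-whitespace character toward the Devanagari ratio, and `hindi_score`, a hypothetical callable standing in for the fastText identifier that returns the model's confidence that a paragraph is Hindi.

```python
import hashlib
import unicodedata

MIN_LID_SCORE = 0.92        # from the text: Hindi confidence must exceed this
MAX_NON_DEVANAGARI = 0.15   # from the text: drop paragraphs above this share


def non_devanagari_ratio(text):
    """Share of non-whitespace characters outside U+0900..U+097F.

    Assumption: punctuation and digits count as characters; whitespace does not.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    outside = sum(1 for c in chars if not "\u0900" <= c <= "\u097f")
    return outside / len(chars)


def clean_paragraphs(paragraphs, hindi_score):
    """Normalize, filter, and exact-deduplicate paragraphs, preserving order.

    hindi_score: hypothetical callable, paragraph -> Hindi confidence in [0, 1].
    """
    seen = set()
    kept = []
    for para in paragraphs:
        para = unicodedata.normalize("NFC", para).strip()
        if not para:
            continue
        if hindi_score(para) <= MIN_LID_SCORE:
            continue
        if non_devanagari_ratio(para) > MAX_NON_DEVANAGARI:
            continue
        # MD5 is used here only as a cheap fingerprint, not for security.
        digest = hashlib.md5(para.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(para)
    return kept
```

Normalizing before hashing matters: two paragraphs that differ only in how a character is composed would otherwise produce different MD5 digests and both survive.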
MinHash LSH Deduplication
Applied MinHash with Locality Sensitive Hashing (LSH) across the full corpus to detect near-duplicate documents (Jaccard similarity > 0.8). Eliminated ~180M tokens of near-duplicates, improving corpus diversity.
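To make the technique concrete, here is a minimal single-machine sketch of MinHash with LSH banding, written in pure Python. Only the Jaccard threshold of 0.8 comes from the description. The shingle size, the 128 hash functions, the 16 x 8 banding, and the final exact-Jaccard check on candidate pairs are illustrative assumptions, and a corpus of this size would need a distributed implementation instead.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 128            # signature length (assumed)
BANDS = 16                # 16 bands x 8 rows puts the LSH threshold near 0.7
ROWS = NUM_PERM // BANDS


def shingles(text, k=5):
    """Set of character k-grams after collapsing runs of whitespace."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(shingle_set):
    """MinHash signature: the minimum of each salted hash over all shingles."""
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for s in shingle_set))
    return signature


def jaccard(a, b):
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.8):
    """Return sorted (id, id) pairs whose shingle sets have Jaccard > threshold."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    buckets = defaultdict(set)
    for doc_id, shingle_set in sets.items():
        sig = minhash(shingle_set)
        for band in range(BANDS):
            # Documents sharing every row of any one band become candidates.
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].add(doc_id)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    # Confirm candidates exactly so LSH false positives are discarded.
    return sorted((a, b) for a, b in candidates
                  if jaccard(sets[a], sets[b]) > threshold)
```

Banding is what avoids comparing every pair of documents: only documents that land in the same bucket for at least one band are ever compared directly.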
HuggingFace Datasets Packaging
Delivered the final corpus in Apache Parquet format, organized into train/validation splits, with HuggingFace Datasets metadata. The client could load it directly with datasets.load_dataset() and plug it straight into the training pipeline.
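Loading a Parquet delivery like this typically looks like the sketch below. The directory layout and shard naming (`train-*.parquet`, `validation-*.parquet`) and both function names are assumptions made for illustration, since the actual file layout is not described. The `datasets` import is kept inside the loader so the path helper can be used without the library installed.

```python
def parquet_data_files(root, splits=("train", "validation")):
    """Map each split name to a glob pattern for its Parquet shards.

    Assumption: shards are named <split>-*.parquet under a single root folder.
    """
    return {split: f"{root}/{split}-*.parquet" for split in splits}


def load_corpus(root):
    """Load the corpus as a DatasetDict with one entry per split."""
    # Third-party dependency: pip install datasets
    from datasets import load_dataset

    return load_dataset("parquet", data_files=parquet_data_files(root))
```

With that layout, `load_corpus("hindi_corpus")["train"]` would give the training split, ready to pass to a tokenization step.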
- Identifying and parsing 120 different Hindi site layouts at scale
- Language detection edge cases for Hinglish (mixed Hindi-English) content
- Near-duplicate detection across a raw corpus of 1B+ tokens required distributed MinHash with Spark
- Several government sites had heavy JavaScript rendering requiring Playwright