Add multilingual OCR language selection to the /extract endpoint and improve LLM prompt for metadata extraction
📄 Multilingual OCR Enhancement – Feature Specification
🧩 Description
The current OCR pipeline always uses the Telugu (te) language pack, which forces users to manually edit code when they need a different language (e.g., English or Hindi).
This limits usability for the intended multilingual use case (English + Telugu + Hindi).
🎯 Goals
1. Expose OCR Language as a Request Parameter
- Add a new field `lang` to the FastAPI `/extract` endpoint
- Supported values (enum):
  - `en` → English
  - `te` → Telugu
  - `devanagari` → Hindi
- Ensure the selected language:
  - Flows through the API layer
  - Is passed to the CLI (internally, without breaking interface)
  - Reaches `ExtractionPipeline.process_pdf()`
  - Is forwarded to the VParse client
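
A minimal sketch of how the value could travel from the endpoint to the client. Only `ExtractionPipeline.process_pdf()`, `parse_pdf_via_vparse` and the three language codes come from this issue; the `OcrLang` enum, the `handle_extract` helper and the stub bodies are assumptions made for illustration.

```python
from enum import Enum


class OcrLang(str, Enum):
    """The three values this issue lists for the lang field."""
    EN = "en"                  # English
    TE = "te"                  # Telugu
    DEVANAGARI = "devanagari"  # Hindi


def parse_pdf_via_vparse(file_path: str, lang: str = "en") -> dict:
    # Stand-in for the real client: it only echoes what it received.
    return {"file": file_path, "lang": lang}


class ExtractionPipeline:
    # Stand-in for the real pipeline class.
    def process_pdf(self, file_path: str, lang: str = "en") -> dict:
        # Forward the language unchanged to the VParse client.
        return parse_pdf_via_vparse(file_path, lang=lang)


def handle_extract(file_path: str, lang: str = "en") -> dict:
    """Hypothetical body of the /extract handler."""
    selected = OcrLang(lang)  # raises ValueError for anything outside the enum
    return ExtractionPipeline().process_pdf(file_path, lang=selected.value)
```

Typing the endpoint parameter as an `Enum` subclass like this is also what generally lets FastAPI list the allowed values as a dropdown in Swagger UI.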
2. Replace Hardcoded Language in vparse_client.py
- Update function signature: `parse_pdf_via_vparse(file_path, lang="en")`
- Remove hardcoded Telugu (`te`)
- Use the passed `lang` dynamically
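
The change could look roughly like the sketch below. The signature is the one given above; everything else is assumed, since the issue does not show the client internals: the `_send_to_vparse` transport stub, the `"lang"` form-field name and the `SUPPORTED_LANGS` check are hypothetical.

```python
from pathlib import Path

SUPPORTED_LANGS = {"en", "te", "devanagari"}


def _send_to_vparse(file_path: Path, form: dict) -> dict:
    """Hypothetical transport stub; the real client would call VParse here."""
    return {"file": file_path.name, "form": form}


def parse_pdf_via_vparse(file_path, lang="en"):
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported OCR language: {lang!r}")
    # Previously the language was a hardcoded "te"; now it comes from the caller.
    form = {"lang": lang}  # field name is an assumption
    return _send_to_vparse(Path(file_path), form)
```

Failing fast on an unknown code is a design choice in this sketch, not a requirement stated in the issue.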
3. Upgrade the LLM Prompt
Enhance the extraction prompt to handle noisy multilingual OCR output.
Prompt Requirements:
- Clearly explain:
  - OCR noise
  - Mixed-language text scenarios
- Include extraction heuristics:
  - Look for:
    - “By”, “©”, “Published by”
    - Telugu equivalents
    - Hindi (Devanagari) equivalents
- Enforce strict output rules:
  - JSON only
  - No Markdown
  - No extra text
  - Use `null` for uncertain values
  - Avoid ISBN duplication
Target Fields:
- Title
- Author
- Publisher
- Published Date
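
One possible shape for the upgraded prompt, together with a strict parser for the reply. The prompt wording, the JSON key names (`title`, `author`, `publisher`, `published_date`) and both function names are assumptions; only the requirements themselves come from the list above.

```python
import json

FIELDS = ("title", "author", "publisher", "published_date")  # assumed key names

PROMPT_TEMPLATE = """\
You extract book metadata from OCR text.
The text is noisy: characters may be misread, words split or merged, and
English, Telugu and Hindi (Devanagari) may appear mixed on the same page.

Heuristics:
- Author and publisher often follow markers such as "By", "©" or
  "Published by", or their Telugu and Hindi (Devanagari) equivalents.

Output rules:
- Return one JSON object only, with exactly these keys: {keys}.
- No Markdown, no code fences, no extra text.
- Use null for any value you are not sure about.
- Avoid ISBN duplication.

OCR text:
{ocr_text}
"""


def build_prompt(ocr_text: str) -> str:
    """Fill the template with the key list and the raw OCR text."""
    return PROMPT_TEMPLATE.format(keys=", ".join(FIELDS), ocr_text=ocr_text)


def parse_llm_reply(raw: str) -> dict:
    """Accept only a bare JSON object; missing keys become None."""
    data = json.loads(raw)  # raises a ValueError subclass on chatty or fenced replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return {key: data.get(key) for key in FIELDS}
```

Parsing strictly on the receiving side means a reply that ignores the "JSON only" rule is rejected instead of being silently repaired.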
4. CLI Compatibility
- Keep existing CLI unchanged
- Do not introduce breaking changes
- `lang` option applies only to API usage
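
A sketch of one way to keep the CLI surface untouched while the API gains a language choice: both call a shared function, and only the API path passes `lang`. The existing CLI arguments are not shown in this issue, so the single `pdf_path` positional, the `bookextractor` program name and all function names here are hypothetical.

```python
import argparse


def run_extraction(pdf_path: str, lang: str = "en") -> dict:
    """Shared entry point (hypothetical name); lang is an internal keyword."""
    return {"pdf": pdf_path, "lang": lang}


def build_cli_parser() -> argparse.ArgumentParser:
    # Same arguments as before: no --lang flag is added.
    parser = argparse.ArgumentParser(prog="bookextractor")
    parser.add_argument("pdf_path")
    return parser


def cli_main(argv: list) -> dict:
    args = build_cli_parser().parse_args(argv)
    return run_extraction(args.pdf_path)  # falls back to the default language


def api_extract(pdf_path: str, lang: str) -> dict:
    return run_extraction(pdf_path, lang=lang)  # only the API chooses a language
```

Note that with a default of `en`, CLI runs would no longer use the previously hardcoded Telugu pack, which may need to be decided explicitly.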
✅ Acceptance Criteria
- Swagger UI displays:
  - Dropdown for `lang`
  - Options: `en`, `te`, `devanagari`
- All tests pass: `bookextractor/tests/*`
- Manual validation:
  - Upload English PDF → correct metadata with `lang=en`
- No regressions:
  - Dockerfiles remain functional
  - CI pipelines unaffected
  - Other endpoints unchanged
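
A regression check along these lines could confirm that the language reaches the client for each supported value. It is a sketch only: the `ExtractionPipeline` here is a stand-in that takes the client as a constructor argument, which is an assumption about how the real class could be made testable, and `check_lang_forwarded` is a hypothetical helper.

```python
from unittest import mock


class ExtractionPipeline:
    """Stand-in with the client injected so a test can observe the call."""

    def __init__(self, parse_fn):
        self._parse_fn = parse_fn

    def process_pdf(self, file_path, lang="en"):
        return self._parse_fn(file_path, lang=lang)


def check_lang_forwarded(lang):
    # Replace the VParse client with a mock and record how it is called.
    fake_client = mock.Mock(return_value={"text": ""})
    ExtractionPipeline(fake_client).process_pdf("sample.pdf", lang=lang)
    # Fails with AssertionError if lang was dropped or altered on the way.
    fake_client.assert_called_once_with("sample.pdf", lang=lang)
    return True


results = {lang: check_lang_forwarded(lang) for lang in ("en", "te", "devanagari")}
```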
🏷 Labels
- feature
- multilingual
- documentation
- enhancement
🗺 Milestone
Phase 2 – Multilingual OCR
🔗 Related Merge Request
- MR !6 (merged)
- Branch: feat/dockerize
Edited by Lakshy Yarlagadda