Add multilingual OCR language selection to the /extract endpoint and improve the LLM prompt for metadata extraction

📄 Multilingual OCR Enhancement – Feature Specification

🧩 Description

The current OCR pipeline hardcodes the Telugu (te) language pack, so users must edit the code whenever a document is in another language (e.g., English or Hindi).

This limits usability for the intended multilingual use case (English + Telugu + Hindi).

🎯 Goals

1. Expose OCR Language as a Request Parameter

Add a new field lang to the FastAPI /extract endpoint
Supported values (enum):
- en → English
- te → Telugu
- devanagari → Hindi
Ensure the selected language:
- Flows through the API layer
- Is passed through the CLI layer internally, without changing the CLI's public interface
- Reaches ExtractionPipeline.process_pdf()
- Is forwarded to the VParse client
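
The flow described in this goal can be sketched as follows. This is a minimal, stdlib-only illustration, not the project's actual code: `OcrLang` and `handle_extract` are hypothetical names, the `en` default at the API layer is an assumption, and the pipeline and client bodies are stand-ins that only echo their inputs. In FastAPI, declaring the request parameter with a `str`-based `Enum` like this one is the usual way to get a fixed set of values shown in Swagger UI.

```python
from enum import Enum


class OcrLang(str, Enum):
    """OCR languages accepted by /extract (values taken from this spec)."""
    EN = "en"
    TE = "te"
    DEVANAGARI = "devanagari"


def parse_pdf_via_vparse(file_path: str, lang: str = "en") -> dict:
    # Stand-in for the real VParse client call; it only echoes its inputs.
    return {"file_path": file_path, "lang": lang}


class ExtractionPipeline:
    def process_pdf(self, file_path: str, lang: str = "en") -> dict:
        # Forward the language unchanged to the VParse client.
        return parse_pdf_via_vparse(file_path, lang=lang)


def handle_extract(file_path: str, lang: str = "en") -> dict:
    """Hypothetical body of the /extract handler."""
    selected = OcrLang(lang)  # raises ValueError for unsupported values
    return ExtractionPipeline().process_pdf(file_path, lang=selected.value)
```

Validating once at the API boundary means the lower layers can accept a plain string and stay unaware of the enum.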

2. Replace Hardcoded Language in `vparse_client.py`

Update function signature:

parse_pdf_via_vparse(file_path, lang="en")

Remove the hardcoded Telugu (te) value
Use the lang argument supplied by the caller instead
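
A rough sketch of the client-side change, under stated assumptions: the real request format of VParse is not described in this issue, so `build_vparse_options` and the returned dictionary are hypothetical placeholders for whatever the client actually sends. The point is only that the language comes from the argument rather than from a literal.

```python
from pathlib import Path

# Values listed under Goal 1 of this spec.
SUPPORTED_LANGS = ("en", "te", "devanagari")


def build_vparse_options(lang: str = "en") -> dict:
    """Hypothetical helper: options sent to VParse along with the file."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported OCR language: {lang!r}")
    # Before the change, this value would have been the literal "te".
    return {"lang": lang}


def parse_pdf_via_vparse(file_path, lang="en"):
    """Sketch only: returns what would be sent instead of calling VParse."""
    path = Path(file_path)
    return {"file_name": path.name, "options": build_vparse_options(lang)}
```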

3. Upgrade the LLM Prompt

Enhance the extraction prompt to handle noisy multilingual OCR output.

Prompt Requirements:

Clearly explain:
- OCR noise
- Mixed-language text scenarios
Include extraction heuristics:
- Look for:
  - “By”, “©”, “Published by”
  - Telugu equivalents
  - Hindi (Devanagari) equivalents
Enforce strict output rules:
- JSON only
- No Markdown
- No extra text
- Use null for uncertain values
- Avoid ISBN duplication
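
One possible way to assemble a prompt that meets these requirements is sketched below. The wording of each rule, the function name `build_extraction_prompt`, and the snake_case JSON keys are illustrative assumptions, not the project's actual prompt.

```python
import json

# Hypothetical JSON keys for the four target fields.
TARGET_FIELDS = ("title", "author", "publisher", "published_date")


def build_extraction_prompt(ocr_text: str, lang: str = "en") -> str:
    """Assemble a metadata-extraction prompt (wording is illustrative)."""
    # Example of the expected output shape, with every value set to null.
    schema = json.dumps({field: None for field in TARGET_FIELDS})
    rules = [
        "The text below is OCR output and may contain misread characters, "
        "broken words, and stray symbols.",
        "It may mix English, Telugu, and Hindi (Devanagari) text.",
        'Look for cues such as "By", "©", and "Published by", and for '
        "their Telugu and Hindi (Devanagari) equivalents.",
        "Respond with a single JSON object and nothing else: no Markdown, "
        "no code fences, no commentary.",
        "Use null for any value you are not certain about.",
        "Avoid duplicating ISBN values.",
    ]
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, 1))
    return (
        f"Extract book metadata from OCR text (OCR language: {lang}).\n\n"
        f"Rules:\n{numbered}\n\n"
        f"Output exactly this JSON shape:\n{schema}\n\n"
        f"OCR text:\n{ocr_text}"
    )
```

Keeping the rules in a list makes it easy to add or reword one without touching the rest of the template.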

Target Fields:

Title
Author
Publisher
Published Date
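
Because the output rules are strict, the model's reply can be checked mechanically before it is used. The sketch below shows one way to do that, assuming hypothetical snake_case keys for the four target fields; `parse_metadata_response` is an illustrative name, not an existing function in the codebase.

```python
import json

# Hypothetical JSON keys for the four target fields.
TARGET_FIELDS = ("title", "author", "publisher", "published_date")
FENCE = "`" * 3  # Markdown code-fence marker


def parse_metadata_response(raw: str) -> dict:
    """Validate a model reply against the strict output rules."""
    text = raw.strip()
    if text.startswith(FENCE):
        raise ValueError("Markdown code fence found; expected bare JSON")
    # json.loads raises JSONDecodeError (a ValueError subclass) when the
    # reply has any text before or after the JSON value.
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Keep only the target fields; missing or blank values become None.
    result = {}
    for field in TARGET_FIELDS:
        value = data.get(field)
        if isinstance(value, str):
            value = value.strip() or None
        result[field] = value
    return result
```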

4. CLI Compatibility

Keep the existing CLI interface unchanged
Do not introduce breaking changes
The lang option is exposed only through the API; the CLI gains no new flag
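
A minimal sketch of how the CLI could stay unchanged while the shared entry point accepts a language. The parser layout, the `bookextractor` program name, and the `run` function are hypothetical; the actual CLI arguments are not listed in this issue.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser; note that it gains no --lang flag."""
    parser = argparse.ArgumentParser(prog="bookextractor")
    parser.add_argument("pdf_path")
    return parser


def run(pdf_path: str, lang: str = "en") -> dict:
    """Shared entry point; stands in for the real pipeline call."""
    return {"pdf_path": pdf_path, "lang": lang}


def main(argv=None) -> dict:
    # CLI path: same arguments as before, so lang falls back to its default.
    args = build_parser().parse_args(argv)
    return run(args.pdf_path)
```

The API layer would call `run(path, lang=...)` directly. One thing worth checking under this design: with a default of `en`, CLI runs would no longer use Telugu as they do today, so the default for the CLI path may need to be chosen deliberately.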

✅ Acceptance Criteria

Swagger UI displays:
- Dropdown for lang
- Options: en, te, devanagari
All tests pass:

bookextractor/tests/*

Manual validation:
- Upload English PDF → correct metadata with lang=en
No regressions:
- Dockerfiles remain functional
- CI pipelines unaffected
- Other endpoints unchanged

🏷 Labels

feature
multilingual
documentation
enhancement

🗺 Milestone

Phase 2 – Multilingual OCR

🔗 Related Merge Request

MR !6 (merged)
Branch: feat/dockerize

Edited Apr 29, 2026 by Lakshy Yarlagadda