feat: add OCR language selection to /extract API and improve LLM prompt
📄 Merge Request Description
🧩 Overview
This MR introduces parameterized OCR language selection and a refined LLM prompt to improve multilingual extraction accuracy and usability.
🔧 Changes
| Change | File(s) | Summary |
|---|---|---|
| API update | bookextractor/main.py | Added lang enum (en, te, devanagari) as a Form field to /extract. The selected value is passed to process_pdf. |
| Pipeline signature | bookextractor/pipeline.py | Updated method signature to process_pdf(pdf_path, benchmark=False, lang="en"), forwarding lang to parse_pdf_via_vparse. |
| VParse client | bookextractor/vparse_client.py | Updated function parse_pdf_via_vparse(file_path, lang="en") to include lang_list: <lang> in the request payload. |
| Prompt overhaul | bookextractor/pipeline.py (method: extract_semantic_fields) | Rewritten prompt with explicit instructions, multilingual handling, and strict JSON output rules. |
| Enum definition | bookextractor/main.py | Added OCRLanguage enum for Swagger UI dropdown. |
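
For reviewers who want the data flow at a glance, here is a minimal sketch of how the selected language might travel from the enum to the VParse payload. It uses only the standard library. The enum member names, the build_vparse_payload helper, and the shape of the returned dict are assumptions made for illustration; the real functions in bookextractor do more than this and call the VParse service.

```python
from enum import Enum


class OCRLanguage(str, Enum):
    """Languages listed in this MR; the member names are illustrative."""
    EN = "en"
    TE = "te"
    DEVANAGARI = "devanagari"


def build_vparse_payload(lang: str = "en") -> dict:
    # Hypothetical helper: the MR only states that the payload gains a
    # lang_list entry set to the selected language. Other fields are omitted.
    return {"lang_list": lang}


def parse_pdf_via_vparse(file_path: str, lang: str = "en") -> dict:
    # Stand-in for the real client: returns the request it would send
    # instead of contacting the VParse service.
    return {"file_path": file_path, "data": build_vparse_payload(lang)}


def process_pdf(pdf_path: str, benchmark: bool = False, lang: str = "en") -> dict:
    # Only the forwarding of lang is shown; the rest of the pipeline is elided.
    return parse_pdf_via_vparse(pdf_path, lang=lang)


# An unsupported value raises ValueError here, mirroring what enum
# validation on the Form field is expected to reject.
selected = OCRLanguage("te")
request = process_pdf("sample.pdf", lang=selected.value)
print(request["data"])  # {'lang_list': 'te'}
```

Keeping "en" as the default at every layer means callers that never pass lang behave as they did before this MR.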
🧾 Commits
- f771dea (HEAD → feat/dockerize) feat: add OCR language selection to /extract API and improve LLM prompt
- 523ec9c feat(docker): containerize bookextractor + vparse with model persistence
- 77ba5ba fix(docker): add build deps and CPU torch to BookExtractor Dockerfile
- d82fdc6 feat(docker): add multi-backend VParse OCR with per-mode Dockerfiles
- 132a006 (origin/feat/multi-format-pipeline-phase1, feat/multi-format-pipeline-phase1) feat: implement multi-format pipeline (Phase 1)
- 629b082 feat: implement multi-format pipeline (Phase 1)
🧪 Testing
- API
  - Swagger UI displays lang dropdown with supported values
- Manual Test
  - curl -F file=@sample.pdf -F lang=en http://localhost:8000/extract
  - Returns JSON with correctly extracted English metadata
- Unit Tests
  - Existing tests pass without modification: tests/test_pipeline.py, tests/test_vparse_client.py
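
The existing tests predate the lang parameter, so a small forwarding check could be added alongside them. The sketch below is one possible shape for such a test, not code from this MR. Both functions are stand-ins defined inline so the example runs on its own. In the real test suite the patch target would be wherever pipeline.py looks up parse_pdf_via_vparse, which is an assumption to verify against the actual imports.

```python
from unittest import mock


def parse_pdf_via_vparse(file_path, lang="en"):
    # Stand-in for the real VParse client; a test should never reach it.
    raise RuntimeError("network call; should be patched in tests")


def process_pdf(pdf_path, benchmark=False, lang="en"):
    # Stand-in for the pipeline entry point: forwarding only.
    return parse_pdf_via_vparse(pdf_path, lang=lang)


def check_lang_is_forwarded(lang):
    # Swap the client for a mock in this module's namespace, call the
    # pipeline, then assert the mock received the same lang value.
    fake = mock.Mock(return_value={})
    with mock.patch.dict(globals(), {"parse_pdf_via_vparse": fake}):
        process_pdf("sample.pdf", lang=lang)
    fake.assert_called_once_with("sample.pdf", lang=lang)
    return True


for value in ("en", "te", "devanagari"):
    check_lang_is_forwarded(value)
```

Looping over all three supported values keeps the check aligned with the enum without duplicating test bodies.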
✅ Checklist
- Code compiles: python -m py_compile $(git ls-files '*.py')
- Documentation updated (README.md, API usage section)
- No new lint warnings (ruff)
- CI pipeline passes (gitlab-ci.yml)
🔗 Related Issues
- Closes #5 (closed) – “Add multilingual OCR language selection”
🚀 Status
Ready for review & merge into feat/multi-format-pipeline-phase1.