feat: add OCR language selection to /extract API and improve LLM prompt

📄 Merge Request Description

🧩 Overview

This MR introduces parameterized OCR language selection and a refined LLM prompt to improve multilingual extraction accuracy and usability.


🔧 Changes

| Change | File(s) | Summary |
| --- | --- | --- |
| API update | bookextractor/main.py | Added lang enum (en, te, devanagari) as a Form field to /extract. The selected value is passed to process_pdf. |
| Pipeline signature | bookextractor/pipeline.py | Updated method signature to process_pdf(pdf_path, benchmark=False, lang="en"), forwarding lang to parse_pdf_via_vparse. |
| VParse client | bookextractor/vparse_client.py | Updated function parse_pdf_via_vparse(file_path, lang="en") to include lang_list: <lang> in the request payload. |
| Prompt overhaul | bookextractor/pipeline.py (method: extract_semantic_fields) | Rewritten prompt with explicit instructions, multilingual handling, and strict JSON output rules. |
| Enum definition | bookextractor/main.py | Added OCRLanguage enum for Swagger UI dropdown. |
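
For reviewers, the way lang travels from the endpoint to the VParse payload can be sketched roughly as below. This is a stdlib-only illustration, not the code in this MR: the function bodies are stand-ins, build_vparse_payload and extract are hypothetical names, and the real handler declares lang as a FastAPI Form field. The enum subclasses str because FastAPI renders a str-based Enum parameter as a dropdown in Swagger UI.

```python
from enum import Enum


class OCRLanguage(str, Enum):
    """OCR languages accepted by /extract (values from the table above)."""
    en = "en"
    te = "te"
    devanagari = "devanagari"


def build_vparse_payload(lang: str = "en") -> dict:
    # Hypothetical helper: the extra field sent alongside the PDF.
    return {"lang_list": lang}


def parse_pdf_via_vparse(file_path: str, lang: str = "en") -> dict:
    # Stand-in: the real client sends the file to VParse. Here we only
    # return what would be sent so the data flow is visible.
    return {"file_path": file_path, "payload": build_vparse_payload(lang)}


def process_pdf(pdf_path: str, benchmark: bool = False, lang: str = "en") -> dict:
    # Forward lang unchanged to the OCR client.
    return parse_pdf_via_vparse(pdf_path, lang=lang)


def extract(file_path: str, lang: OCRLanguage = OCRLanguage.en) -> dict:
    # Stand-in for the /extract handler: pass the enum's string value down.
    return process_pdf(file_path, lang=lang.value)


if __name__ == "__main__":
    print(extract("sample.pdf", OCRLanguage.te))
```

Because the value is validated against the enum before the pipeline runs, an unsupported language should be rejected at the API boundary rather than reaching VParse.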

🧾 Commits

  • f771dea (HEAD → feat/dockerize) feat: add OCR language selection to /extract API and improve LLM prompt

  • 523ec9c feat(docker): containerize bookextractor + vparse with model persistence

  • 77ba5ba fix(docker): add build deps and CPU torch to BookExtractor Dockerfile

  • d82fdc6 feat(docker): add multi-backend VParse OCR with per-mode Dockerfiles

  • 132a006 (origin/feat/multi-format-pipeline-phase1, feat/multi-format-pipeline-phase1) feat: implement multi-format pipeline (Phase 1)

  • 629b082 feat: implement multi-format pipeline (Phase 1)


🧪 Testing

  • API

    • Swagger UI displays lang dropdown with supported values

  • Manual Test

    curl -F file=@sample.pdf -F lang=en http://localhost:8000/extract

    • Returns JSON with correctly extracted English metadata

  • Unit Tests

    • Existing tests pass without modification:

      • tests/test_pipeline.py
      • tests/test_vparse_client.py
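
The existing tests do not exercise the new parameter, so a follow-up test for lang forwarding might look like the sketch below. It is illustrative only: pipeline here is a SimpleNamespace standing in for bookextractor.pipeline so the example runs on its own, and the test names are hypothetical. A real test would import the module and patch parse_pdf_via_vparse where the pipeline looks it up.

```python
import types
from unittest import mock

# Stand-in for bookextractor.pipeline so the sketch is self-contained.
pipeline = types.SimpleNamespace()
pipeline.parse_pdf_via_vparse = lambda file_path, lang="en": {}


def _process_pdf(pdf_path, benchmark=False, lang="en"):
    # Same signature as in this MR; the body only forwards lang.
    return pipeline.parse_pdf_via_vparse(pdf_path, lang=lang)


pipeline.process_pdf = _process_pdf


def test_lang_is_forwarded():
    with mock.patch.object(pipeline, "parse_pdf_via_vparse", return_value={}) as fake:
        pipeline.process_pdf("sample.pdf", lang="te")
    fake.assert_called_once_with("sample.pdf", lang="te")


def test_lang_defaults_to_en():
    with mock.patch.object(pipeline, "parse_pdf_via_vparse", return_value={}) as fake:
        pipeline.process_pdf("sample.pdf")
    fake.assert_called_once_with("sample.pdf", lang="en")
```

Both functions follow pytest naming, so they would be collected automatically if dropped into a test module.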

Checklist

  • Code compiles

    python -m py_compile $(git ls-files '*.py')
  • Documentation updated (README.md, API usage section)
  • No new lint warnings (ruff)
  • CI pipeline passes (.gitlab-ci.yml)

🔗 Related Issues

  • Closes #5 (closed): “Add multilingual OCR language selection”

🚀 Status

Ready for review & merge into feat/multi-format-pipeline-phase1.

Edited by Lakshy Yarlagadda
