feat: Add `backfill` command group to trigger server-side database backfills

Ahlad Pataparla requested to merge feat/server-backfill into develop

Overview

This MR adds a new top-level backfill command group to the Corpus Client CLI. It lets operators trigger the existing server-side backfill scripts (media_duration.py, file_hashing.py, snr_frequency.py, backfill_version_0.py) directly from the CLI via subprocess execution.

The CLI stays lightweight and free of server-side dependencies: it does not pull in FastAPI, SQLModel, moviepy, librosa, or Celery. Instead, it wraps the existing backfill logic by invoking the scripts in corpus-server-app/backfill/ with uv run python.
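
As a rough illustration of that wrapping, the launcher command could be assembled as below. This is a minimal sketch, not the code in backfill.py: the function name build_command and the injectable which parameter are hypothetical, and the fallback to plain python mirrors the behaviour described under Technical Details.

```python
import shutil
import sys
from pathlib import Path
from typing import Callable, Optional, Sequence


def build_command(
    script: Path,
    script_args: Sequence[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the argv list used to launch one backfill script.

    Hypothetical helper: `which` is injectable only so the lookup can be
    faked in tests; by default it is shutil.which.
    """
    if which("uv") is not None:
        # Preferred launcher: run the script through uv.
        launcher = ["uv", "run", "python"]
    else:
        # Fallback: plain python, with a warning on stderr.
        print("warning: uv not found, falling back to python", file=sys.stderr)
        launcher = ["python"]
    # A list (not a shell string) keeps the call compatible with shell=False.
    return [*launcher, str(script), *script_args]
```

Passing the command as a list means user-supplied values are never interpreted by a shell, which is the property the "Safe" bullet relies on.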

What does this MR do and why?

Running backfill scripts previously required SSH-ing into the server, navigating to the server app directory, and manually executing Python scripts with the right environment. This MR centralizes that workflow into the CLI, making it:

  • Discoverable: corpus-client backfill --help shows all available operations
  • Consistent: unified argument interface across all backfill types
  • Safe: no shell execution; controlled inputs; exit codes forwarded; dry-run supported
  • Flexible: auto-detects corpus-server-app location or accepts explicit override

Changes Made

New Files

| File | Purpose |
| --- | --- |
| src/corpus_client_cli/backfill.py | Full implementation of the backfill Typer app with 4 subcommands, server path resolution, argument building, and subprocess execution |
| tests/test_backfill.py | 19 unit tests covering path resolution, validation, argument construction, defaults, exit code forwarding, env injection, and uv fallback |

Modified Files

| File | Change |
| --- | --- |
| src/corpus_client_cli/cli.py | Registers backfill_app as a top-level Typer subcommand |
| CHANGELOG.md | Documents the new command group and capabilities |
| README.md | Adds backfill to feature list, quick-start example, and dedicated command section with options table |
| docs/command-reference.md | Adds "Backfill Commands" section with full syntax, options, and examples for all 4 subcommands |
| docs/user-manual.md | Adds backfill quick-reference entries and a new "Backfill Operations" section |
| docs/troubleshooting.md | Adds "Backfill Issues" TOC entry and 5 new troubleshooting scenarios |

Technical Details

  • Server path resolution priority:
    1. --server-path CLI option
    2. CORPUS_SERVER_APP_PATH environment variable
    3. Auto-detect by walking up from CLI install location looking for a sibling corpus-server-app/
  • Validation: Checks that the resolved path contains all 4 expected backfill scripts before execution
  • Environment: DATABASE_URL is injected into the subprocess environment; the -d CLI option defaults to the value of the $DATABASE_URL environment variable
  • Execution: Uses subprocess.run with shell=False; prefers uv run python but falls back to plain python with a warning
  • Output: stdout/stderr stream live; the CLI forwards the script's exit code
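
The resolution priority and the script validation above can be sketched as follows. This is an illustrative reading of the description, not the actual implementation: the function names resolve_server_path and missing_scripts are hypothetical, and "sibling" is interpreted as a corpus-server-app directory sitting next to any ancestor of the CLI install location.

```python
from pathlib import Path
from typing import Mapping, Optional

# Script names and the backfill/ subdirectory are taken from the MR description.
EXPECTED_SCRIPTS = (
    "media_duration.py",
    "file_hashing.py",
    "snr_frequency.py",
    "backfill_version_0.py",
)
ENV_VAR = "CORPUS_SERVER_APP_PATH"
SERVER_DIR_NAME = "corpus-server-app"


def resolve_server_path(
    cli_option: Optional[Path],
    env: Mapping[str, str],
    start: Path,
) -> Optional[Path]:
    """Apply the three-step priority; return None if nothing is found."""
    if cli_option is not None:  # 1. --server-path wins
        return cli_option
    env_value = env.get(ENV_VAR)
    if env_value:  # 2. then the environment variable
        return Path(env_value)
    # 3. walk up from `start`, checking for a sibling directory at each level
    for ancestor in (start, *start.parents):
        candidate = ancestor.parent / SERVER_DIR_NAME
        if candidate.is_dir():
            return candidate
    return None


def missing_scripts(server_path: Path) -> list[str]:
    """List the expected backfill scripts that are absent."""
    backfill_dir = server_path / "backfill"
    return [n for n in EXPECTED_SCRIPTS if not (backfill_dir / n).is_file()]
```

A caller would exit with code 1 when resolve_server_path returns None or missing_scripts returns a non-empty list, matching the test cases listed under Testing Done.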

Type of Change

  • New feature (non-breaking change that adds functionality)
  • 📝 Documentation update
  • 🧪 Test update

Related Issues / References

  • Follows implementation plan: plans/backfill-command.md

How to Set Up and Validate Locally

  1. Pull this branch
  2. Ensure you have a local corpus-server-app sibling directory (or set CORPUS_SERVER_APP_PATH)
  3. Install dependencies:
    uv sync
  4. Run a dry-run backfill:
    uv run corpus-client backfill duration -d $DATABASE_URL --dry-run --limit 10
  5. Verify help output:
    uv run corpus-client backfill --help
    uv run corpus-client backfill file-hash --help

Testing Done

  • Manual testing completed
  • Unit tests added/updated
  • pytest passes

Test Cases Covered:

| Scenario | Expected Result | Status |
| --- | --- | --- |
| Auto-detect server path from sibling directory | Resolves correctly | |
| Override via --server-path | Uses provided path | |
| Override via CORPUS_SERVER_APP_PATH env var | Uses env var path | |
| Missing server path | Exits with code 1 and helpful message | |
| Missing expected backfill scripts | Exits with code 1 and lists missing scripts | |
| Build common args with all flags set | Produces correct argument list | |
| Build common args with defaults only | Produces minimal argument list | |
| duration command default options | Uses both media type, 3 workers, 10 batch size | |
| file-hash command with algorithm override | Passes --algorithm md5 | |
| snr command with media type filter | Passes --media-type video | |
| version-0 command with no specific options | Only common args passed | |
| Subprocess execution forwards exit code 0 | CLI exits 0 | |
| Subprocess execution forwards exit code 2 | CLI exits 2 | |
| DATABASE_URL injected into subprocess env | Env var set correctly | |
| uv available → uses uv run python | Command starts with uv | |
| uv missing → falls back to python | Command starts with python and shows warning | |
| FileNotFoundError during execution | Returns exit code 1 | |
| Unexpected exception during execution | Returns exit code 1 | |
| ruff check | Passes | |
| mypy check | Passes | |
| bandit check | Passes (with # nosec annotations on controlled subprocess calls) | |
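
For readers unfamiliar with how the exit-code rows can be tested without launching a real process, the sketch below mocks subprocess.run. It is not taken from tests/test_backfill.py: run_and_forward is a hypothetical stand-in for the CLI's runner, and the real tests may be structured differently.

```python
import subprocess
from unittest import mock


def run_and_forward(command: list[str]) -> int:
    """Hypothetical stand-in for the CLI's runner: return the child's exit code."""
    return subprocess.run(command, shell=False, check=False).returncode


def test_forwards_exit_code_2() -> None:
    # Fake the finished process so no real script is executed.
    fake = subprocess.CompletedProcess(args=["python", "x.py"], returncode=2)
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert run_and_forward(["python", "x.py"]) == 2
    # The runner must call subprocess.run exactly once, without a shell.
    run.assert_called_once()
    assert run.call_args.kwargs["shell"] is False
```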

Code Quality Checklist

Code Standards

  • Code follows project conventions (naming, structure, formatting)
  • No print() statements or debug code left in the codebase
  • No unused imports, variables, or functions
  • No duplicate code
  • Type hints are properly defined (no type: ignore unless justified)
  • ruff and mypy checks pass

Python Best Practices

  • Exception handling is appropriate
  • Context managers used where applicable

CLI Best Practices

  • Typer commands follow project patterns
  • Help text is clear and complete
  • Exit codes are appropriate
  • Error messages are user-friendly

Error Handling

  • Errors are caught and handled gracefully
  • User-friendly error messages displayed
  • Appropriate exit codes for different error types

Documentation

  • README.md updated (command interface changed)
  • docs/*.md updated (command-reference, user-manual, troubleshooting)
  • CHANGELOG.md updated
  • Docstrings added/updated for new functions

Known Limitations / Technical Debt

  • Subprocess execution requires the corpus-server-app source code to be available on the host running the CLI. It cannot run against a deployed Docker container or remote server.
  • The backfill scripts' Python dependencies (moviepy, librosa, etc.) must be resolvable in the server app's environment; the CLI does not manage them.
  • No progress streaming from the subprocess beyond raw stdout/stderr.

Additional Notes

  • The subprocess usage is intentionally constrained: script names are hardcoded, shell=False is used, and all user inputs are validated or typed through Typer. # nosec annotations suppress bandit warnings on the two controlled lines.
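
A constrained runner along these lines might look like the sketch below. It is an assumption-laden illustration rather than the code under review: the name run_backfill and its parameters are hypothetical, while the behaviour (shell=False, DATABASE_URL injection, live output, exit code forwarding, and exit code 1 on FileNotFoundError or any unexpected exception) follows the description and test table in this MR.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional, Sequence


def run_backfill(
    command: Sequence[str],
    database_url: Optional[str],
    base_env: Optional[Mapping[str, str]] = None,
) -> int:
    """Run a prepared command and return the exit code to forward."""
    # Copy the environment so the parent process is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    if database_url:
        env["DATABASE_URL"] = database_url
    try:
        # stdout/stderr are not captured, so the child's output streams live.
        completed = subprocess.run(list(command), env=env, shell=False, check=False)
    except FileNotFoundError:
        print(f"error: executable not found: {command[0]}", file=sys.stderr)
        return 1
    except Exception as exc:  # unexpected failures also map to exit code 1
        print(f"error: backfill failed to start: {exc}", file=sys.stderr)
        return 1
    return completed.returncode
```

The caller would pass the returned value to the CLI's exit mechanism so that a script exiting with 2 makes the CLI exit with 2 as well.
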
Edited by Ahlad Pataparla
