feat: Add `backfill` command group to trigger server-side database backfills
Overview
This MR adds a new top-level `backfill` command group to the Corpus Client CLI, allowing operators to trigger the existing server-side backfill scripts (`media_duration.py`, `file_hashing.py`, `snr_frequency.py`, `backfill_version_0.py`) directly from the CLI via subprocess execution.
The CLI stays lightweight and free of server-side dependencies: it does not pull in FastAPI, SQLModel, moviepy, librosa, or Celery. Instead, it wraps the existing backfill logic by invoking the scripts in `corpus-server-app/backfill/` with `uv run python`.
What does this MR do and why?
Running backfill scripts previously required SSH-ing into the server, navigating to the server app directory, and manually executing Python scripts with the right environment. This MR centralizes that workflow into the CLI, making it:
- Discoverable: `corpus-client backfill --help` shows all available operations
- Consistent: unified argument interface across all backfill types
- Safe: no shell execution; controlled inputs; exit codes forwarded; dry-run supported
- Flexible: auto-detects the `corpus-server-app` location or accepts an explicit override
Changes Made
New Files
| File | Purpose |
|---|---|
| `src/corpus_client_cli/backfill.py` | Full implementation of the backfill Typer app with 4 subcommands, server path resolution, argument building, and subprocess execution |
| `tests/test_backfill.py` | 19 unit tests covering path resolution, validation, argument construction, defaults, exit code forwarding, env injection, and uv fallback |
Modified Files
| File | Change |
|---|---|
| `src/corpus_client_cli/cli.py` | Registers `backfill_app` as a top-level Typer subcommand |
| `CHANGELOG.md` | Documents the new command group and capabilities |
| `README.md` | Adds backfill to feature list, quick-start example, and dedicated command section with options table |
| `docs/command-reference.md` | Adds "Backfill Commands" section with full syntax, options, and examples for all 4 subcommands |
| `docs/user-manual.md` | Adds backfill quick-reference entries and a new "Backfill Operations" section |
| `docs/troubleshooting.md` | Adds "Backfill Issues" TOC entry and 5 new troubleshooting scenarios |
Technical Details
- Server path resolution priority:
  - `--server-path` CLI option
  - `CORPUS_SERVER_APP_PATH` environment variable
  - Auto-detect by walking up from the CLI install location looking for a sibling `corpus-server-app/`
- Validation: Checks that the resolved path contains all 4 expected backfill scripts before execution
- Environment: `DATABASE_URL` is injected into the subprocess environment; the CLI option defaults to the `$DATABASE_URL` env var
- Execution: Uses `subprocess.run` with `shell=False`; prefers `uv run python` but falls back to plain `python` with a warning
- Output: stdout/stderr stream live; the CLI forwards the script's exit code
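The resolution order and script validation described above can be sketched roughly as follows. This is an illustrative outline, not the code in `backfill.py`: the function names `resolve_server_path` and `missing_scripts` are hypothetical, and the walk-up logic is one plausible reading of "looking for a sibling `corpus-server-app/`".

```python
from pathlib import Path
from typing import Mapping, Optional

# Script names are taken from this MR description; they are assumed to live
# directly under corpus-server-app/backfill/.
EXPECTED_SCRIPTS = (
    "media_duration.py",
    "file_hashing.py",
    "snr_frequency.py",
    "backfill_version_0.py",
)
SERVER_DIR_NAME = "corpus-server-app"


def resolve_server_path(
    cli_option: Optional[str],
    env: Mapping[str, str],
    start_dir: Path,
) -> Optional[Path]:
    """Return the server app directory using the documented priority order."""
    # 1. An explicit --server-path value wins.
    if cli_option:
        return Path(cli_option)
    # 2. Next, the CORPUS_SERVER_APP_PATH environment variable.
    env_value = env.get("CORPUS_SERVER_APP_PATH")
    if env_value:
        return Path(env_value)
    # 3. Finally, walk up from start_dir and check each ancestor's siblings.
    for ancestor in [start_dir, *start_dir.parents]:
        candidate = ancestor.parent / SERVER_DIR_NAME
        if candidate.is_dir():
            return candidate
    return None


def missing_scripts(server_path: Path) -> list[str]:
    """List expected backfill scripts absent from server_path/backfill."""
    backfill_dir = server_path / "backfill"
    return [name for name in EXPECTED_SCRIPTS if not (backfill_dir / name).is_file()]
```

A caller would exit with code 1 when `resolve_server_path` returns `None` or `missing_scripts` returns a non-empty list, printing the missing names in the latter case.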
Type of Change
- ✨ New feature (non-breaking change that adds functionality)
- 📝 Documentation update
- 🧪 Test update
Related Issues / References
- Follows implementation plan: `plans/backfill-command.md`
How to Set Up and Validate Locally
- Pull this branch
- Ensure you have a local `corpus-server-app` sibling directory (or set `CORPUS_SERVER_APP_PATH`)
- Install dependencies: `uv sync`
- Run a dry-run backfill: `uv run corpus-client backfill duration -d $DATABASE_URL --dry-run --limit 10`
- Verify help output: `uv run corpus-client backfill --help` and `uv run corpus-client backfill file-hash --help`
Testing Done
- Manual testing completed
- Unit tests added/updated
- `pytest` passes
Test Cases Covered:
| Scenario | Expected Result | Status |
|---|---|---|
| Auto-detect server path from sibling directory | Resolves correctly | |
| Override via `--server-path` | Uses provided path | |
| Override via `CORPUS_SERVER_APP_PATH` env var | Uses env var path | |
| Missing server path | Exits with code 1 and helpful message | |
| Missing expected backfill scripts | Exits with code 1 and lists missing scripts | |
| Build common args with all flags set | Produces correct argument list | |
| Build common args with defaults only | Produces minimal argument list | |
| `duration` command default options | Uses `both` media type, 3 workers, 10 batch size | |
| `file-hash` command with algorithm override | Passes `--algorithm md5` | |
| `snr` command with media type filter | Passes `--media-type video` | |
| `version-0` command with no specific options | Only common args passed | |
| Subprocess execution forwards exit code 0 | CLI exits 0 | |
| Subprocess execution forwards exit code 2 | CLI exits 2 | |
| `DATABASE_URL` injected into subprocess env | Env var set correctly | |
| `uv` available → uses `uv run python` | Command starts with `uv` | |
| `uv` missing → falls back to `python` | Command starts with `python` and shows warning | |
| `FileNotFoundError` during execution | Returns exit code 1 | |
| Unexpected exception during execution | Returns exit code 1 | |
| `ruff` check | Passes | |
| `mypy` check | Passes | |
| `bandit` check | Passes (with `# nosec` annotations on controlled subprocess calls) | |
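To illustrate the two "Build common args" rows, here is a minimal sketch of an argument builder with matching test functions. The name `build_common_args` and its signature are assumptions for illustration; only the `--dry-run` and `--limit` flags are taken from the validation steps in this description, and the real helper likely handles more options.

```python
from typing import Optional


def build_common_args(dry_run: bool = False, limit: Optional[int] = None) -> list[str]:
    """Translate CLI options into script arguments, omitting unset ones."""
    args: list[str] = []
    if dry_run:
        args.append("--dry-run")
    if limit is not None:
        # Values are passed as separate list items, never joined into a shell string.
        args.extend(["--limit", str(limit)])
    return args


def test_common_args_all_flags() -> None:
    assert build_common_args(dry_run=True, limit=10) == ["--dry-run", "--limit", "10"]


def test_common_args_defaults() -> None:
    # With defaults only, nothing is forwarded to the script.
    assert build_common_args() == []
```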
Code Quality Checklist
Code Standards
- Code follows project conventions (naming, structure, formatting)
- No `print()` statements or debug code left in code
- No unused imports, variables, or functions
- No duplicate code
- Type hints are properly defined (no `type: ignore` unless justified)
- `ruff` and `mypy` checks pass
Python Best Practices
- Exception handling is appropriate
- Context managers used where applicable
CLI Best Practices
- Typer commands follow project patterns
- Help text is clear and complete
- Exit codes are appropriate
- Error messages are user-friendly
Error Handling
- Errors are caught and handled gracefully
- User-friendly error messages displayed
- Appropriate exit codes for different error types
Documentation
- `README.md` updated (command interface changed)
- `docs/*.md` updated (command-reference, user-manual, troubleshooting)
- `CHANGELOG.md` updated
- Docstrings added/updated for new functions
Known Limitations / Technical Debt
- Subprocess execution requires the `corpus-server-app` source code to be available on the host running the CLI. It cannot run against a deployed Docker container or remote server.
- The backfill scripts' Python dependencies (moviepy, librosa, etc.) must be resolvable in the server app's environment; the CLI does not manage them.
- No progress streaming from the subprocess beyond raw stdout/stderr.
Additional Notes
- The subprocess usage is intentionally constrained: script names are hardcoded, `shell=False` is used, and all user inputs are validated or typed through Typer. `# nosec` annotations suppress bandit warnings on the two controlled lines.
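For reviewers who want a picture of what "constrained" means here, the following is a rough sketch of the pattern, not the actual implementation. The helper names (`build_command`, `build_env`, `run_command`) are hypothetical, and the pairing of subcommand names to script files is inferred from their names rather than confirmed by this description.

```python
import os
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Mapping, Optional, Sequence

# Hardcoded mapping: user input selects a key, never a file path.
# ASSUMPTION: this subcommand-to-script pairing is inferred from the names.
SCRIPTS = {
    "duration": "media_duration.py",
    "file-hash": "file_hashing.py",
    "snr": "snr_frequency.py",
    "version-0": "backfill_version_0.py",
}


def build_command(server_path: Path, subcommand: str, args: Sequence[str],
                  uv_available: bool) -> list[str]:
    """Build the argv list; unknown subcommands raise KeyError."""
    script = str(server_path / "backfill" / SCRIPTS[subcommand])
    if uv_available:
        return ["uv", "run", "python", script, *args]
    print("warning: uv not found, falling back to plain python", file=sys.stderr)
    return ["python", script, *args]


def build_env(database_url: str, base_env: Mapping[str, str]) -> dict[str, str]:
    """Copy the parent environment and inject DATABASE_URL."""
    env = dict(base_env)
    env["DATABASE_URL"] = database_url
    return env


def run_command(cmd: Sequence[str], env: Mapping[str, str],
                cwd: Optional[Path] = None) -> int:
    """Run without a shell and return the exit code to forward."""
    try:
        # stdout/stderr are inherited, so script output streams live.
        completed = subprocess.run(  # nosec
            list(cmd), env=dict(env), cwd=cwd, shell=False, check=False
        )
    except FileNotFoundError:
        return 1
    except Exception:
        return 1
    return completed.returncode
```

A caller could combine these as `run_command(build_command(path, "duration", args, shutil.which("uv") is not None), build_env(url, os.environ), cwd=path)` and pass the result to `sys.exit`.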