fix: Async Infrastructure Refactor & Reliability Stabilization
🚀 MR Description: Async Infrastructure Refactor & Reliability Stabilization
fixes #65
The Improvement: Why this change?
The previous architecture relied on multi-threaded parallelization within a blocking synchronous framework. While functional for single-user lookups, it suffered from three critical flaws under load (e.g., Team Leaderboard scans):
Multiplicative Thread Explosion: A single team scan of 10 users spawned 10 user threads, each of which spawned another 10+ commit threads. This "Nested Parallelism" (10 × 10 = 100+ concurrent threads) caused extreme context-switching overhead and hit GitLab's connection pool limits, leading to frequent Connection Reset errors.
Zombie Thread Accumulation: Streamlit's "re-run on every interaction" model often orphaned the ad-hoc ThreadPoolExecutors stored in session state. These orphaned "Zombie Threads" kept consuming memory and network sockets in the background, eventually destabilizing the entire application.
Network Starvation: Blocking I/O meant that threads were sitting idle waiting for network responses, wasting OS resources and making the application feel sluggish during "batch" operations.
🛠️ Key Technical Implementations
- Async-Native Infrastructure Layer
We have implemented true asynchronous versions for all core GitLab API operations (_async functions in commits.py, projects.py, issues.py, etc.).
Benefit: Replaces expensive OS-level threads with lightweight Python coroutines.
Impact: 100+ concurrent network requests can now be managed by a single event loop thread, drastically reducing CPU and memory overhead.
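As a rough illustration of the single-event-loop claim, the sketch below runs 100 simulated requests concurrently and records which OS thread served each one. fetch_commits_async and its payload are hypothetical stand-ins for the real _async functions, and asyncio.sleep stands in for network latency.

```python
import asyncio
import threading


async def fetch_commits_async(user_id: int, seen_threads: set) -> dict:
    """Hypothetical stand-in for an `_async` GitLab call (not the real API)."""
    await asyncio.sleep(0.01)  # simulated network wait; yields to the event loop
    seen_threads.add(threading.get_ident())  # record which OS thread ran this coroutine
    return {"user_id": user_id, "commits": user_id % 5}


async def scan(user_ids):
    seen_threads: set = set()
    # All coroutines are scheduled on the same event loop and awaited together.
    results = await asyncio.gather(
        *(fetch_commits_async(uid, seen_threads) for uid in user_ids)
    )
    return results, seen_threads


results, seen_threads = asyncio.run(scan(range(100)))
print(len(results), len(seen_threads))  # 100 requests, 1 thread
```

Every coroutine reports the same thread identifier, because waiting on I/O suspends the coroutine instead of parking an OS thread.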
- Streamlit Resource Lifecycle Optimization (ui/main.py)
Infrastructure initialization has been migrated to st.cache_resource with a mandatory cleanup callback.
Callback: cleanup_gitlab_client explicitly shuts down the background thread and connection pool on resource release.
Benefit: Ensures exactly one persistent client instance per session, completely eliminating the "Zombie Thread" leak.
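The cleanup contract can be sketched without Streamlit. The example below assumes a client that owns one background event loop thread. BackgroundLoopClient and its attributes are hypothetical, only the name cleanup_gitlab_client comes from this MR, and the wiring into st.cache_resource is left out on purpose.

```python
import asyncio
import threading


class BackgroundLoopClient:
    """Hypothetical client that owns a single background event loop thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()
        self.closed = False

    @property
    def thread_alive(self) -> bool:
        return self._thread.is_alive()

    def close(self) -> None:
        if self.closed:  # idempotent: a second call is a no-op
            return
        # loop.stop() must be scheduled from outside the loop's own thread.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5)
        self._loop.close()
        self.closed = True


def cleanup_gitlab_client(client: BackgroundLoopClient) -> None:
    """Release the background thread when the cached resource is dropped."""
    client.close()


client = BackgroundLoopClient()
cleanup_gitlab_client(client)
print(client.thread_alive)  # False
```

Making close() idempotent keeps the callback safe if the resource is released more than once.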
- Structured Batch Concurrency (batch.py)
Refactored the batch processing logic to use asyncio.gather for flattened, controlled concurrency.
Fail-Safe Processing: Implemented _safe_process which captures individual worker exceptions and returns a Crash status rather than failing the entire batch.
Benefit: The Team Leaderboard is now resilient to flaky API responses or individual user data inconsistencies.
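A minimal sketch of the fail-safe pattern, assuming each worker is a coroutine that takes a username. The result dictionary shape, the "Success" label, the semaphore cap, and the process_batch and flaky_worker names are illustrative assumptions. Only _safe_process, asyncio.gather, and the Crash status come from this MR.

```python
import asyncio


async def _safe_process(worker, username: str, limit: asyncio.Semaphore) -> dict:
    """Run one worker and convert any exception into a Crash record."""
    async with limit:  # assumed concurrency cap; not confirmed by the MR
        try:
            data = await worker(username)
            return {"user": username, "status": "Success", "data": data}
        except Exception as exc:
            return {"user": username, "status": "Crash", "error": str(exc)}


async def process_batch(worker, usernames, max_concurrency: int = 10) -> list:
    limit = asyncio.Semaphore(max_concurrency)
    # One flat gather over all users; results keep the input order.
    return await asyncio.gather(
        *(_safe_process(worker, name, limit) for name in usernames)
    )


async def flaky_worker(username: str) -> dict:
    """Hypothetical worker that fails for one user."""
    await asyncio.sleep(0)
    if username == "bob":
        raise ConnectionResetError("connection reset by peer")
    return {"commits": 3}


results = asyncio.run(process_batch(flaky_worker, ["alice", "bob", "carol"]))
print([r["status"] for r in results])  # ['Success', 'Crash', 'Success']
```

Catching the exception inside the wrapper means asyncio.gather never sees a failure, so the other users' results are still returned.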
- 100% Backward Compatibility
All existing synchronous public APIs (e.g., get_user_commits) have been preserved to ensure existing UI components and tests continue to work. They now serve as high-level "sync bridges" to the underlying async core.
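One simple way to build such a sync bridge is shown below. It is a sketch under assumptions: the name get_user_commits_async and the returned payload are hypothetical, and the real bridge may hand the coroutine to a long-lived loop instead of calling asyncio.run.

```python
import asyncio


async def get_user_commits_async(username: str) -> list:
    """Hypothetical async core; the real function would call the GitLab API."""
    await asyncio.sleep(0)  # simulated network wait
    return [{"author": username, "sha": "abc123"}]


def get_user_commits(username: str) -> list:
    """Synchronous public API preserved for existing callers and tests."""
    # asyncio.run creates a fresh event loop, runs the coroutine, then closes the loop.
    return asyncio.run(get_user_commits_async(username))


print(get_user_commits("alice"))
```

asyncio.run raises RuntimeError when called from a thread that already has a running event loop. An application that keeps a dedicated loop thread could submit the coroutine with asyncio.run_coroutine_threadsafe and block on the returned future.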
🚦 Verification & Testing
Test Suite Pass Rate: 64/64 tests passed.
Async Verification: Updated test_batch.py and test_async_batch.py to specifically validate the new _async codepaths.
Infrastructure Standardization: Verified that all commit stats (morning_commits, afternoon_commits) and project count mappings are consistent between the new and old models.
📈 Performance Impact
Users should see noticeably better responsiveness when scanning large teams. Team Leaderboard scans now issue their requests through a single event loop with flattened, controlled concurrency instead of nested thread pools, which reduces the risk of GitLab-side rate limiting and local environment instability.