fix: Async Infrastructure Refactor & Reliability Stabilization

Bikkumalla Sai Krishna requested to merge issue_75 into dev

🚀 MR Description: Async Infrastructure Refactor & Reliability Stabilization

fixes #65 (closed)

The Improvement: Why this change?

The previous architecture relied on multi-threaded parallelization within a blocking synchronous framework. While functional for single-user lookups, it suffered from three critical flaws under load (e.g., Team Leaderboard scans):

Multiplicative Thread Explosion: A single team scan of 10 users would spawn 10 user threads, each of which spawned another 10+ commit threads. This "Nested Parallelism" (10 × 10 = 100+ concurrent threads) caused extreme context-switching overhead and triggered GitLab's connection pool limits, leading to frequent Connection Reset errors.

Zombie Thread Accumulation: Streamlit's "re-run on every interaction" model often orphaned ad-hoc ThreadPoolExecutors stored in session state. These orphaned threads continued to consume memory and network sockets in the background (Zombie Threads), eventually destabilizing the entire application.

Network Starvation: Blocking I/O meant that threads were sitting idle waiting for network responses, wasting OS resources and making the application feel sluggish during "batch" operations.

🛠️ Key Technical Implementations

  1. Async-Native Infrastructure Layer
    We have implemented true asynchronous versions of all core GitLab API operations (the _async functions in commits.py, projects.py, issues.py, etc.).

Benefit: Replaces expensive OS-level threads with lightweight Python coroutines.
Impact: 100+ concurrent network requests can now be managed by a single event loop thread, drastically reducing CPU and memory overhead.
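
To illustrate the claim above, here is a minimal sketch of 100 concurrent "requests" multiplexed on one event loop thread. The function fetch_commits_async is a hypothetical stand-in, not one of the MR's actual _async functions, and asyncio.sleep simulates network latency in place of a real GitLab call.

```python
import asyncio
import threading


async def fetch_commits_async(user_id: int) -> dict:
    """Hypothetical stand-in for an _async GitLab call (no real network I/O)."""
    # Simulated network latency; awaiting yields control back to the event loop.
    await asyncio.sleep(0.05)
    # Record which OS thread executed this coroutine.
    return {"user_id": user_id, "thread": threading.get_ident()}


async def fetch_many(n: int) -> list[dict]:
    # Schedule n coroutines on the current event loop and wait for all of them.
    return await asyncio.gather(*(fetch_commits_async(i) for i in range(n)))


results = asyncio.run(fetch_many(100))
threads_used = {r["thread"] for r in results}
print(len(results), "requests handled by", len(threads_used), "thread(s)")
```

All 100 coroutines wait concurrently, so the total wall time stays close to a single request's latency rather than 100 times it, and no additional OS threads are created.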

  2. Streamlit Resource Lifecycle Optimization (ui/main.py)
    Infrastructure initialization has been migrated to st.cache_resource with a mandatory cleanup callback.

Callback: cleanup_gitlab_client explicitly shuts down the background thread and connection pool on resource release.

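Benefit: Ensures a single persistent client instance is reused across reruns instead of being recreated on every interaction (note that st.cache_resource shares the cached object across sessions, not per session), eliminating the "Zombie Thread" leak.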
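
The lifecycle described above can be sketched without Streamlit. The GitLabClient class below is hypothetical: it assumes the client owns one background event-loop thread, and only the name cleanup_gitlab_client comes from this MR. How the callback is registered with st.cache_resource is not shown here and is left out of the sketch.

```python
import asyncio
import threading


class GitLabClient:
    """Hypothetical client that owns a single background event-loop thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        # Submit a coroutine to the background loop and block until it finishes.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def close(self) -> None:
        # Ask the loop to stop from its own thread, wait for the thread to
        # exit, then release the loop's resources.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5)
        self._loop.close()

    @property
    def alive(self) -> bool:
        return self._thread.is_alive()


def cleanup_gitlab_client(client: GitLabClient) -> None:
    # Cleanup callback: shuts down the background thread on resource release.
    client.close()


async def ping() -> str:
    await asyncio.sleep(0)
    return "pong"


client = GitLabClient()
reply = client.run(ping())
cleanup_gitlab_client(client)
print(reply, "| background thread alive:", client.alive)
```

The point of the explicit close is that the thread is joined rather than abandoned, so nothing keeps running after the resource is released.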

  3. Structured Batch Concurrency (batch.py)
    Refactored the batch processing logic to use asyncio.gather for flattened, controlled concurrency.

Fail-Safe Processing: Implemented _safe_process which captures individual worker exceptions and returns a Crash status rather than failing the entire batch.

Benefit: The Team Leaderboard is now resilient to flaky API responses or individual user data inconsistencies.
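
A minimal sketch of this fail-safe pattern follows. Only the names _safe_process and the Crash status come from this MR; the worker process_user_async, the "Success" label, the result dictionary shape, and the semaphore-based cap are assumptions made for illustration, since the MR does not say how concurrency is bounded.

```python
import asyncio


async def process_user_async(username: str) -> dict:
    """Hypothetical worker; fails for one user to simulate a flaky API response."""
    await asyncio.sleep(0.01)
    if username == "bob":
        raise ConnectionError("simulated connection reset")
    return {"user": username, "status": "Success"}


async def _safe_process(username: str) -> dict:
    # Convert any worker exception into a Crash result so that one failure
    # cannot abort the whole batch.
    try:
        return await process_user_async(username)
    except Exception as exc:
        return {"user": username, "status": "Crash", "error": str(exc)}


async def process_batch(usernames: list[str], limit: int = 5) -> list[dict]:
    # Assumption: a semaphore caps the number of in-flight workers.
    semaphore = asyncio.Semaphore(limit)

    async def bounded(name: str) -> dict:
        async with semaphore:
            return await _safe_process(name)

    # One flat gather over all users; results come back in input order.
    return await asyncio.gather(*(bounded(n) for n in usernames))


batch_results = asyncio.run(process_batch(["alice", "bob", "carol"]))
statuses = [r["status"] for r in batch_results]
print(statuses)
```

Wrapping each worker, rather than passing return_exceptions=True to asyncio.gather, keeps the result list homogeneous: callers always receive a status record and never have to check for raw exception objects.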

  4. 100% Backward Compatibility
    All existing synchronous public APIs (e.g., get_user_commits) have been preserved to ensure existing UI components and tests continue to work. They now serve as high-level "sync bridges" to the underlying async core.
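
One possible shape for such a sync bridge is sketched below. The name get_user_commits comes from this MR; get_user_commits_async is assumed from the _async naming convention, and the return shape is invented for illustration. The bridge here uses asyncio.run, which is an assumption: the MR does not describe the actual mechanism, and asyncio.run cannot be called from a thread that already has a running event loop.

```python
import asyncio


async def get_user_commits_async(username: str) -> list[dict]:
    """Hypothetical async core; sleeps instead of calling the GitLab API."""
    await asyncio.sleep(0.01)
    return [{"author": username, "title": "example commit"}]


def get_user_commits(username: str) -> list[dict]:
    # Sync bridge: existing callers keep the same blocking signature while
    # the work is delegated to the async implementation.
    return asyncio.run(get_user_commits_async(username))


commits = get_user_commits("alice")
print(commits)
```

Existing UI components and tests can keep calling get_user_commits unchanged, while new batch code awaits the _async variant directly.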

🚦 Verification & Testing

Test Suite Pass Rate: 64/64 tests passed.

Async Verification: Updated test_batch.py and test_async_batch.py to specifically validate the new _async codepaths.

Infrastructure Standardization: Verified that all commit stats (morning_commits, afternoon_commits) and project count mappings are consistent between the new and old models.

📈 Performance Impact

Users should observe improved responsiveness when scanning large teams. The "Team Leaderboard" now completes scans with consolidated network usage, which reduces the risk of GitLab-side rate limiting or local environment instability.

Edited by Bikkumalla Sai Krishna
