Bug: Incorrect Commit Attribution Due to Overly Aggressive Fuzzy Identity Matching
Root Cause Analysis: Commit Mixing
This report details why commits from different users are being incorrectly aggregated into a single user's report in the GitLab Compliance Checker.
1. Technical Root Cause: Aggressive Fuzzy Matching
The primary issue lies in the Identity Matching logic located in services/batch/client.py. To handle cases where users commit with different names/emails, the system implements a "fuzzy" matching strategy that is currently too broad.
The "Prefix/Suffix" Bug
The code specifically uses startswith() and endswith() on normalized strings.
    # From services/batch/client.py (Lines 325-338)
    elif (
        ns_cname
        and ns_uname
        and len(ns_uname) >= 3
        and (ns_cname.startswith(ns_uname) or ns_cname.endswith(ns_uname))  # <-- BROKEN LOGIC
    ):
        is_match = True
Scenario:
- Your Username: sai
- Other User: saikrishna
- The Result: Because saikrishna starts with sai, the system marks all of Saikrishna's commits as belonging to you.
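The false positive in this scenario can be reproduced in isolation. The sketch below is not the code from services/batch/client.py: `normalize` is a hypothetical stand-in for whatever normalization the real code applies, and `fuzzy_match` only mirrors the prefix/suffix condition quoted above. `exact_match` shows the stricter comparison for contrast.

```python
def normalize(value: str) -> str:
    """Hypothetical normalizer: lowercase and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def fuzzy_match(commit_name: str, username: str) -> bool:
    """Mirrors the prefix/suffix condition quoted from client.py."""
    ns_cname = normalize(commit_name)
    ns_uname = normalize(username)
    return bool(
        ns_cname
        and ns_uname
        and len(ns_uname) >= 3
        and (ns_cname.startswith(ns_uname) or ns_cname.endswith(ns_uname))
    )


def exact_match(commit_name: str, username: str) -> bool:
    """Stricter alternative: the normalized strings must be identical."""
    ns_cname = normalize(commit_name)
    return bool(ns_cname) and ns_cname == normalize(username)


print(fuzzy_match("Saikrishna", "sai"))  # True: the false positive
print(exact_match("Saikrishna", "sai"))  # False
print(exact_match("Sai", "sai"))         # True
```

The `len(ns_uname) >= 3` guard does not help here, because a three-character username such as sai already passes it.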
Loose Email Local-Part Matching
The system also tries to match the part of the email before the @ symbol to your username.
    # From services/batch/client.py (Lines 305-312)
    elif target_username and c_email_local == target_username:
        is_match = True
If your username is common (e.g., dev or admin) and another user has an email address with the same local part (dev@ on any domain), their commits will be pulled into your report.
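A small sketch of why the local part alone is not a safe key. The helper names and the example.com / example.org addresses are illustrative assumptions, not taken from the codebase. The stricter variant compares the full address against a set of emails known to belong to the target user.

```python
from typing import Iterable


def local_part(email: str) -> str:
    """Return the lowercased text before the '@' of an email address."""
    return email.strip().lower().partition("@")[0]


def loose_email_match(commit_email: str, target_username: str) -> bool:
    """Mirrors the quoted check: local part equals the username."""
    return bool(target_username) and local_part(commit_email) == target_username


def strict_email_match(commit_email: str, known_emails: Iterable[str]) -> bool:
    """Stricter alternative: the whole address must be a known one."""
    known = {e.strip().lower() for e in known_emails}
    return commit_email.strip().lower() in known


my_emails = ["dev@example.com"]    # addresses that belong to the target user
other_commit = "dev@example.org"   # a different person on another domain

print(loose_email_match(other_commit, "dev"))       # True: wrongly attributed
print(strict_email_match(other_commit, my_emails))  # False
```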
2. Infrastructure Issue: Lack of Server-Side Filtering
Currently, the application fetches every single commit in a project and brings them to your local machine before filtering them.
    # From services/batch/client.py (Line 281)
    commits = await self.client._async_get_paginated(
        f"/projects/{p_id}/repository/commits",
        params={"all": True},  # <-- No author filter used here
        ...
    )
Because the API is not told which author to look for, the application processes every commit from every contributor in the project. On large projects this can mean a very large number of commits from hundreds of developers, and each additional author name is another chance for the fuzzy matching logic to produce a false positive.
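As a rough sketch of what a narrower request could look like, the helper below builds the query parameters with an optional author filter. `build_commit_params` is a hypothetical name, and the sketch assumes the target GitLab version accepts an `author` query parameter on the repository commits endpoint; that should be checked against the instance's API documentation before relying on it.

```python
from typing import Any, Dict, Optional


def build_commit_params(author: Optional[str] = None) -> Dict[str, Any]:
    """Build query params for /projects/{id}/repository/commits.

    `author` is assumed to be supported by the GitLab instance in use.
    """
    params: Dict[str, Any] = {"all": True}
    if author:
        # Let the server do the first pass of filtering.
        params["author"] = author
    return params


print(build_commit_params())                   # {'all': True}
print(build_commit_params("sai@example.com"))  # {'all': True, 'author': 'sai@example.com'}
```

A server-side filter of this kind only reduces the candidate set. The client-side identity check still needs to be strict, since the server's author search may itself match more than one person.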
3. Recommended Fixes
To resolve this, we will move to a Strict Identity Model:
- Remove Prefix/Suffix Matching: Only allow exact matches for usernames and normalized names.
- Server-Side Pre-Filtering: Add author=YOUR_EMAIL or author=YOUR_USERNAME to the API request to let GitLab perform the first pass of filtering.
- Prioritize GitLab Internal IDs: If the commit author object contains a GitLab id or username, we should use that as the primary source of truth, as it is already linked by GitLab's own internal logic.
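The Strict Identity Model above could be sketched roughly as follows. This is an illustration under assumptions rather than the planned implementation: the `Identity` class, the `is_strict_match` function, and the commit dictionary shape (an optional `author` object with `id` / `username`, plus `author_email`) are all hypothetical and may not match what the API client actually returns.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet


@dataclass(frozen=True)
class Identity:
    """Hypothetical description of the user a report is built for."""
    user_id: int
    username: str
    emails: FrozenSet[str] = frozenset()  # expected to be lowercase


def is_strict_match(commit: Dict[str, Any], target: Identity) -> bool:
    """Exact matching only, with linked GitLab identity taking priority."""
    author = commit.get("author") or {}

    # 1. A linked GitLab id decides the outcome on its own.
    if author.get("id") is not None:
        return author["id"] == target.user_id

    # 2. Otherwise a linked username, compared exactly (case-insensitive).
    if author.get("username"):
        return author["username"].lower() == target.username.lower()

    # 3. Fall back to the full email address; no prefix or suffix logic.
    email = (commit.get("author_email") or "").strip().lower()
    return bool(email) and email in target.emails


sai = Identity(user_id=1, username="sai", emails=frozenset({"sai@example.com"}))

print(is_strict_match({"author": {"id": 1}}, sai))                        # True
print(is_strict_match({"author": {"id": 2, "username": "saikrishna"}}, sai))  # False
print(is_strict_match({"author_email": "saikrishna@example.com"}, sai))   # False
```

Returning early once a linked id or username is present means a commit that GitLab has already attributed to someone else can never fall through to the weaker email check.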
[!NOTE] I have prepared an implementation plan to apply these fixes immediately upon your approval.