Vulnerability deduplication process

Tier: Free, Premium, Ultimate
Offering: GitLab.com, GitLab Self-Managed, GitLab Dedicated

When a pipeline contains jobs that produce multiple security reports of the same type, it is possible that the same vulnerability is present in multiple reports. This duplication is common when different scanners are used to increase coverage, but can also exist in a single report. Vulnerability deduplication automatically consolidates duplicate vulnerabilities across scans, helping you focus on unique vulnerabilities while maintaining full scanning coverage.

The logic for deduplicating vulnerabilities varies depending on the scan type:

SAST vulnerabilities are deduplicated using the scope-offset algorithm.
Secret detection vulnerabilities are deduplicated per value and file.
All other vulnerabilities are considered a duplicate of another vulnerability when their scan type, location, and identifiers are the same.

The scan type must match because each scan type can define the location of a vulnerability differently. For example, static analyzers locate a vulnerability by file path and line number, whereas a container scanning analyzer uses the image name instead.

When comparing identifiers, GitLab does not compare CWE and WASC during deduplication because they are "type identifiers" and are used to classify groups of vulnerabilities. Including these identifiers would result in many vulnerabilities being incorrectly considered duplicates. Two vulnerabilities are considered unique if none of their identifiers match.
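The comparison rule for scan types other than SAST and secret detection can be sketched as follows. This is a minimal illustration, not GitLab's implementation: the `Finding` struct, the method names, and the plain-string location are assumptions made for the example.

```ruby
# Identifier types that classify groups of vulnerabilities. Per the text
# above, they are skipped when comparing identifiers.
TYPE_IDENTIFIER_PREFIXES = %w[CWE WASC].freeze

# Hypothetical value object; the field names are illustrative only.
Finding = Struct.new(:scan_type, :location, :identifiers, keyword_init: true)

# Drop type identifiers such as CWE-259 or WASC-13.
def comparable_identifiers(finding)
  finding.identifiers.reject do |id|
    TYPE_IDENTIFIER_PREFIXES.any? { |prefix| id.upcase.start_with?("#{prefix}-") }
  end
end

# Duplicates need the same scan type, the same location, and at least one
# shared identifier that is not a type identifier.
def duplicate?(a, b)
  a.scan_type == b.scan_type &&
    a.location == b.location &&
    !(comparable_identifiers(a) & comparable_identifiers(b)).empty?
end

first = Finding.new(scan_type: "container_scanning", location: "image1:libcrypto3",
                    identifiers: %w[CVE-2019-12345 CVE-2022-25510 CWE-259])
second = Finding.new(scan_type: "container_scanning", location: "image1:libcrypto3",
                     identifiers: %w[CVE-2022-25510 CWE-798])
cwe_only = Finding.new(scan_type: "container_scanning", location: "image1:libcrypto3",
                       identifiers: %w[CWE-259])

puts duplicate?(first, second)   # true: CVE-2022-25510 is shared
puts duplicate?(first, cwe_only) # false: only a CWE is shared, and CWEs are skipped
```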

In a set of duplicated vulnerabilities, the first occurrence of a vulnerability is kept and the remaining are skipped. Security reports are processed in alphabetical file path order, and vulnerabilities are processed sequentially in the order they appear in a report.
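The processing order can be illustrated with a short sketch. The report paths, the `key` field (standing in for whatever comparison decides that two vulnerabilities are duplicates), and the method name are hypothetical.

```ruby
# Hypothetical input: report file path => vulnerabilities in report order.
reports = {
  "reports/b-scanner.json" => [{ key: "k1", source: "b" }, { key: "k2", source: "b" }],
  "reports/a-scanner.json" => [{ key: "k1", source: "a" }, { key: "k3", source: "a" }]
}

# Walk reports in alphabetical path order and keep only the first
# occurrence of each key; later occurrences are skipped.
def keep_first_occurrences(reports)
  seen = {}
  reports.sort_by { |path, _findings| path }.each do |_path, findings|
    findings.each { |finding| seen[finding[:key]] ||= finding }
  end
  seen.values
end

kept = keep_first_occurrences(reports)
kept.each { |finding| puts "#{finding[:key]} from #{finding[:source]}" }
# k1 from a
# k3 from a
# k2 from b
```

In this sketch, `k1` is kept from `a-scanner.json` because that path sorts first, even though `b-scanner.json` is listed first in the input.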

For scan types that can run more than one GitLab analyzer, such as SAST, scanner priority also determines which vulnerability is kept. When two GitLab analyzers detect the same vulnerability, the result from the higher-priority analyzer is kept. This is how GitLab maintains continuity when one analyzer replaces another. For example, results from the Gemnasium analyzer take priority over results from a deprecated analyzer that covered the same dependencies.

How deduplication is applied

GitLab deduplicates vulnerabilities at more than one stage:

Within a single report. GitLab analyzers remove duplicate vulnerabilities from their own report before the report is sent to GitLab. Reports from third-party scanners do not include this step.
Across reports of the same scan type in a pipeline. When a pipeline produces multiple reports of the same scan type, GitLab deduplicates vulnerabilities across those reports.
Across pipeline runs. GitLab tracks each vulnerability across pipeline runs so that a vulnerability detected again in a later run is recognized as the same vulnerability rather than a new one.

GitLab never deduplicates vulnerabilities across different scan types. For more information, see scan type.

Primary identifier stability

Deduplication and cross-pipeline tracking depend on the primary identifier remaining stable across scans. The primary identifier is the first entry in a vulnerability's identifiers.

If a scanner reports a different primary identifier for the same vulnerability in a later run, GitLab cannot match the new result to the existing one. The existing vulnerability is marked as no longer detected, and the new result is recorded as a separate vulnerability.

Third-party scanners should use a stable rule key as the primary identifier, such as a SonarQube rule key. A primary identifier that changes between runs, such as a value generated for each scan, breaks tracking and produces duplicate vulnerabilities.
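The effect of an unstable primary identifier can be sketched as follows. The key composition shown here (scan type, location, and primary identifier joined and hashed) is an assumption for illustration, and the identifier values are made up. GitLab's actual tracking key might include other fields.

```ruby
require "digest"

# Hypothetical tracking key built from the scan type, the location, and the
# primary identifier (the first entry in the identifiers list).
def tracking_key(scan_type:, location:, identifiers:)
  primary = identifiers.first
  Digest::SHA256.hexdigest([scan_type, location, primary].join("|"))
end

location = "src/config.rb:12"

# A stable rule key yields the same tracking key in both runs.
stable_run1 = tracking_key(scan_type: "sast", location: location,
                           identifiers: ["example-rule:hardcoded-secret", "CWE-798"])
stable_run2 = tracking_key(scan_type: "sast", location: location,
                           identifiers: ["example-rule:hardcoded-secret", "CWE-798"])

# A value generated for each scan yields a different key in every run.
unstable_run1 = tracking_key(scan_type: "sast", location: location,
                             identifiers: ["scan-1001-finding-7", "CWE-798"])
unstable_run2 = tracking_key(scan_type: "sast", location: location,
                             identifiers: ["scan-1002-finding-7", "CWE-798"])

puts stable_run1 == stable_run2     # true: tracked as the same vulnerability
puts unstable_run1 == unstable_run2 # false: recorded as a separate vulnerability
```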

Location definitions by scan type

The location used for deduplication depends on the scan type.

Container scanning

Location is usually defined only by the Docker image name, not the image tag.
However, the image tag is considered part of the location if the image tag matches semantic versioning (semver) syntax and doesn't look like a Git commit hash. For example:
The following locations are treated as duplicates:
- registry.gitlab.com/group-name/project-name/image1:12345019:libcrypto3
- registry.gitlab.com/group-name/project-name/image1:libcrypto3
The following locations are treated as unique:
- registry.gitlab.com/group-name/project-name/image1:v19202021:libcrypto3
- registry.gitlab.com/group-name/project-name/image1:libcrypto3
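The tag rule can be approximated with the following sketch. Both regular expressions are assumed heuristics chosen to reproduce the examples above. They are not GitLab's actual patterns, and the parsing assumes the registry host has no port number.

```ruby
# Assumed heuristics, for illustration only.
SEMVER_LIKE = /\Av?\d+(\.\d+)*\z/          # for example: 1.2.3, v2, v19202021
COMMIT_HASH_LIKE = /\A[0-9a-f]{7,40}\z/i   # for example: 12345019, 3f2a9bc

# The tag is part of the location only when it looks like a version
# and does not look like a Git commit hash.
def tag_counts?(tag)
  tag.match?(SEMVER_LIKE) && !tag.match?(COMMIT_HASH_LIKE)
end

# Reduce "image[:tag]:package" to the value used for comparison.
def comparable_location(location)
  image, *rest = location.split(":")
  package = rest.pop
  tag = rest.first
  if tag && tag_counts?(tag)
    "#{image}:#{tag}:#{package}"
  else
    "#{image}:#{package}"
  end
end

image = "registry.gitlab.com/group-name/project-name/image1"
untagged = comparable_location("#{image}:libcrypto3")
hash_tagged = comparable_location("#{image}:12345019:libcrypto3")
version_tagged = comparable_location("#{image}:v19202021:libcrypto3")

puts hash_tagged == untagged    # true: the tag is ignored, so these are duplicates
puts version_tagged == untagged # false: the tag is kept, so these are unique
```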

Dynamic application security testing (DAST)

Location is defined by the URL path, HTTP method, and HTTP parameters.
Two vulnerabilities are considered duplicates if they occur at the same URL endpoint with the same HTTP method.

Dependency scanning

Location is defined by the package name and version.
Two vulnerabilities are considered duplicates if they affect the same package version.

Scope-offset signatures

When security scanners analyze your code, they sometimes report the same vulnerability multiple times, especially when code is refactored or moved around. Advanced vulnerability tracking uses scope-offset signatures to recognize when these reports describe the same issue rather than a new one.

Imagine you have a security issue in a function. If a developer refactors the code and moves that function to a different line, the scanner might report it as a new vulnerability. Without deduplication, you'd see duplicate alerts for the same problem, making it harder to track what you actually need to fix.

When using scope-offset signatures, GitLab creates a unique "fingerprint" for each vulnerability using the following information:

Filename: The file that contains the vulnerability.
Scope: The code context where the vulnerability lives (like a function name or class name).
Offset: The position relative to that scope.

This combination creates a signature that stays the same even when code moves around, as long as it stays within the same scope.

Scope-offset tracking applies to GitLab SAST analyzers, which include the required tracking data in their reports. This tracking data is specific to the GitLab report format, so third-party SARIF uploads cannot provide it. Third-party SARIF uploads instead use a location fingerprint based on the file path and line numbers. With this fingerprint, a vulnerability can be recorded as a new finding when unrelated code changes shift its line numbers.
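The sensitivity of a line-based fingerprint to unrelated edits can be shown with a small sketch. The input format (path, start line, and end line joined by colons, then hashed with SHA-1) is an assumption for illustration.

```ruby
require "digest"

# Assumed fingerprint format: a digest over the file path and line range.
def line_based_fingerprint(file_path, start_line, end_line)
  Digest::SHA1.hexdigest("#{file_path}:#{start_line}:#{end_line}")
end

# The vulnerable call is on line 5.
before = line_based_fingerprint("lib/outer_class.rb", 5, 5)

# Three unrelated lines are added above it, so the same call is now on line 8.
after = line_based_fingerprint("lib/outer_class.rb", 8, 8)

puts before == after # false: the unchanged code is recorded as a new finding
```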

Example

Say you have this Ruby code:

```ruby
class OuterClass
  class InnerClassA
    def function_A(x)
      puts "calling call1"
      call1(x)        # ← Vulnerability found here on line 5
    end
    call2("calling call 2")
  end
end
```

The scanner finds a vulnerability on line 5. GitLab needs to figure out whether the vulnerability is in OuterClass, InnerClassA, or function_A. The scanner calculates which scope is the best fit by measuring the distance from the vulnerability to the beginning and to the end of each scope:

OuterClass (lines 1-9): Distance = (5-1) + (9-5) = 8
InnerClassA (lines 2-8): Distance = (5-2) + (8-5) = 6
function_A (lines 3-6): Distance = (5-3) + (6-5) = 3

The smallest distance wins, so GitLab identifies function_A as the scope.
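The distance rule above can be written as a short sketch. The `Scope` struct and method names are illustrative, and the line ranges are the ones from the example.

```ruby
# Illustrative scope record: a name plus its first and last line.
Scope = Struct.new(:name, :first_line, :last_line)

# Distance from the line to the start of the scope plus distance to its end.
def distance(scope, line)
  (line - scope.first_line) + (scope.last_line - line)
end

# Among the scopes that contain the line, pick the one with the smallest distance.
def best_fit_scope(scopes, line)
  scopes
    .select { |scope| (scope.first_line..scope.last_line).cover?(line) }
    .min_by { |scope| distance(scope, line) }
end

scopes = [
  Scope.new("OuterClass", 1, 9),
  Scope.new("InnerClassA", 2, 8),
  Scope.new("function_A", 3, 6)
]

scopes.each { |scope| puts "#{scope.name}: #{distance(scope, 5)}" }
# OuterClass: 8
# InnerClassA: 6
# function_A: 3

puts best_fit_scope(scopes, 5).name # function_A
```

For any scope that contains the line, this distance equals the last line minus the first line, so the rule selects the innermost enclosing scope.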

GitLab creates a signature like lib/outer_class.rb|OuterClass[0]|InnerClassA[0]|function_A[0]:2 to identify the location of the vulnerability. If the function or class that contains the vulnerability is moved to a different location within its parent scope, the signature stays the same and the vulnerability is not reported as new. However, if OuterClass is renamed, the scope is different and a new vulnerability is created.
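A signature of this shape can be assembled as in the following sketch. It assumes that the bracketed number is an index for each scope and that the offset is the number of lines between the start of the innermost scope and the vulnerable line. Both assumptions are consistent with the example but are not confirmed by it. The method name and arguments are hypothetical.

```ruby
# scope_path lists [name, index] pairs from the outermost to the innermost scope.
def scope_offset_signature(file, scope_path, scope_start_line, vulnerable_line)
  scopes = scope_path.map { |name, index| "#{name}[#{index}]" }.join("|")
  offset = vulnerable_line - scope_start_line
  "#{file}|#{scopes}:#{offset}"
end

path = [["OuterClass", 0], ["InnerClassA", 0], ["function_A", 0]]

# function_A starts on line 3 and the vulnerable call is on line 5.
original = scope_offset_signature("lib/outer_class.rb", path, 3, 5)
puts original
# lib/outer_class.rb|OuterClass[0]|InnerClassA[0]|function_A[0]:2

# Ten lines are added above function_A. The scope and the offset are unchanged.
moved = scope_offset_signature("lib/outer_class.rb", path, 13, 15)
puts moved == original # true

# Renaming OuterClass changes the scope, so the signature changes.
renamed_path = [["RenamedClass", 0], ["InnerClassA", 0], ["function_A", 0]]
renamed = scope_offset_signature("lib/outer_class.rb", renamed_path, 3, 5)
puts renamed == original # false
```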

Deduplication examples

Here are some examples of how vulnerability deduplication behaves.

Matching identifiers and location, mismatching scan type

First vulnerability:
- Scan type: dependency_scanning
- Location fingerprint: adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
- Identifiers: CVE-2022-25510
Second vulnerability:
- Scan type: container_scanning
- Location fingerprint: adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
- Identifiers: CVE-2022-25510
Deduplication result: no deduplication is performed because the scan type is different.

Matching location and scan type, mismatching type identifiers

First vulnerability:
- Scan type: sast
- Location fingerprint: adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
- Identifiers: CWE-259
Second vulnerability:
- Scan type: sast
- Location fingerprint: adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
- Identifiers: CWE-798
Deduplication result: no deduplication is performed because CWE identifiers are ignored, which leaves the two vulnerabilities with no matching identifier.

Matching scan type, location and an identifier

First vulnerability:
- Scan type: container_scanning
- Location fingerprint: adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
- Identifiers: CVE-2019-12345, CVE-2022-25510, CWE-259
Second vulnerability:
- Scan type: container_scanning
- Location fingerprint: adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
- Identifiers: CVE-2022-25510, CWE-798
Deduplication result: the vulnerabilities are deduplicated because they have the same scan type and location fingerprint, and both are identified as CVE-2022-25510.