requirements/output-duplicate-detection.md
Build systems may invoke the same compiler command multiple times for the
same source file (e.g. parallel make retries, ccache wrappers, or repeated
builds with --append). The compilation database specification
(https://clang.llvm.org/docs/JSONCompilationDatabase.html) allows multiple
entries for the same file but notes this is for "different configurations."
Bear filters out true duplicates to keep the output clean and reduce
downstream tool confusion.

- In append mode (`output-append`), the original entry from the existing database takes priority over a new entry with identical fields.
- The default matching fields are (`directory`, `file`, `arguments`).
- The matching fields are configurable through the `duplicates` section in the configuration file.
- The configuration must not list both `command` and `arguments` in the same list (they are alternative representations of the same data).

Bear uses a hash-based approach for duplicate detection. For each entry, a
hash is computed over the configured fields using Rust's DefaultHasher
(SipHash). The hash is checked against a set of previously seen hashes. If
new, the entry is accepted; if already seen, it is rejected.
The hash set grows with the number of unique entries (O(n) memory), but entries are processed one at a time without buffering the full stream.
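
The approach described above can be sketched roughly as follows. This is a minimal illustration, not Bear's actual code: the `Entry` struct, `DuplicateFilter`, and `deduplicate` are hypothetical names, and the fields hashed are hard-coded to `directory`, `file`, and `arguments`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified entry; the real Bear type is not shown here.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    directory: String,
    file: String,
    arguments: Vec<String>,
}

/// Remembers the 64-bit hashes of the entries seen so far (O(n) memory).
#[derive(Default)]
struct DuplicateFilter {
    seen: HashSet<u64>,
}

impl DuplicateFilter {
    /// Returns true if the entry is new and should be kept.
    fn accept(&mut self, entry: &Entry) -> bool {
        let mut hasher = DefaultHasher::new();
        entry.directory.hash(&mut hasher);
        entry.file.hash(&mut hasher);
        entry.arguments.hash(&mut hasher);
        // `insert` returns false when the hash was already present.
        self.seen.insert(hasher.finish())
    }
}

/// Streams entries through the filter one at a time; only hashes are stored.
fn deduplicate(entries: impl IntoIterator<Item = Entry>) -> Vec<Entry> {
    let mut filter = DuplicateFilter::default();
    entries.into_iter().filter(|e| filter.accept(e)).collect()
}
```

Because only the hashes are retained, the filter never needs to hold the full entry stream in memory; the first occurrence of each entry passes through and later ones are dropped.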
Hash collisions are theoretically possible. A collision would silently drop a non-duplicate entry (false positive). With a 64-bit hash this is extremely unlikely for typical compilation databases (thousands of entries) but the probability grows with database size. This is an accepted trade-off for simplicity and performance.
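
As a rough estimate of how the collision risk scales, the standard birthday bound can be applied. This assumes the 64-bit hash values behave as if uniformly and independently distributed, which is an idealization. For $n$ unique entries, the probability $p$ of at least one collision is approximately:

$$
p \approx 1 - e^{-n(n-1)/2^{65}} \approx \frac{n^2}{2^{65}}
$$

Under that assumption, $n = 10^4$ entries gives $p \approx 2.7 \times 10^{-12}$, and $n = 10^6$ entries gives $p \approx 2.7 \times 10^{-8}$. The probability grows quadratically with the number of entries but stays very small at realistic database sizes.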
Duplicate detection operates on entries after path formatting (output-path-format).
This means the configured path format affects which entries are considered
duplicates. For example, two entries with different relative paths that
resolve to the same absolute path would only be detected as duplicates if
absolute or canonical path formatting is active.
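
To illustrate why the path format matters, here is a small sketch. The helper `to_absolute` is hypothetical and only normalizes paths lexically (it does not touch the filesystem or resolve symlinks); Bear's real path formatting may work differently.

```rust
use std::path::{Component, Path, PathBuf};

/// Joins `file` onto `directory` and removes `.` and `..` lexically.
/// Hypothetical helper: no filesystem access, no symlink resolution.
fn to_absolute(directory: &Path, file: &Path) -> PathBuf {
    let mut result = PathBuf::new();
    for component in directory.join(file).components() {
        match component {
            // A leading `.` contributes nothing to the result.
            Component::CurDir => {}
            // `..` removes the most recently added component.
            Component::ParentDir => {
                result.pop();
            }
            // Root, prefix, and normal components are appended as-is.
            other => result.push(other.as_os_str()),
        }
    }
    result
}
```

With a working directory of `/project/build`, the relative paths `../src/a.c` and `../src/lib/../a.c` differ as written, so a filter comparing them verbatim would keep both entries. After conversion with a helper like this one, both become `/project/src/a.c` and would hash identically.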
The duplicate filter runs after the source filter and before final
serialization. It processes the combined stream of existing and new entries
when append mode (output-append) is active.
The following fields from the compilation database entry can be used for
duplicate matching (see output-json-compilation-database for field definitions):
| Field | Config name | Description |
|---|---|---|
| directory | directory | Working directory of the compilation |
| file | file | Source file path |
| arguments | arguments | Argument array (mutually exclusive with command) |
| command | command | Command string (mutually exclusive with arguments) |
| output | output | Output file path |

The default configuration is equivalent to:

```yaml
duplicates:
  match_on:
    - directory
    - file
    - arguments
```

This means two entries are duplicates only if they have the same working directory, the same source file, and the same compiler arguments.
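
One way a configurable `match_on` list could drive the hash computation is sketched below. The `Field` enum, the `Entry` struct, and `entry_hash` are hypothetical names for illustration. In particular, deriving the `command` string by joining arguments with spaces is a simplifying assumption that ignores shell quoting.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical enum mirroring the config names in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Field {
    Directory,
    File,
    Arguments,
    Command,
    Output,
}

/// Hypothetical, simplified entry.
struct Entry {
    directory: String,
    file: String,
    arguments: Vec<String>,
    output: Option<String>,
}

/// Hashes only the fields named in `match_on`, in the order given.
fn entry_hash(entry: &Entry, match_on: &[Field]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for field in match_on {
        match field {
            Field::Directory => entry.directory.hash(&mut hasher),
            Field::File => entry.file.hash(&mut hasher),
            Field::Arguments => entry.arguments.hash(&mut hasher),
            // Assumption: a naive space-joined command string, no quoting.
            Field::Command => entry.arguments.join(" ").hash(&mut hasher),
            Field::Output => entry.output.hash(&mut hasher),
        }
    }
    hasher.finish()
}
```

With this shape, two entries for the same file with different flags produce the same hash under `[Field::File]` but different hashes under the default `[Field::Directory, Field::File, Field::Arguments]`.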
Given a build that compiles file.c twice with identical flags:
When Bear generates the compilation database, then only one entry for file.c appears in the output.
Given a build that compiles file.c with -O2 and then with -O3:
When Bear generates the compilation database with default duplicate config, then both entries appear (different arguments means not a duplicate).
Given files src/util.c and lib/util.c (same basename, different directories):
When Bear generates the compilation database, then both entries are preserved (different directory means not a duplicate).
Given duplicate detection configured with match_on: [file]:
When a build compiles file.c twice with different flags, then only the first entry is kept (matching on file alone).
Given duplicate detection configured with match_on: [file, output]:
When file.c is compiled to both debug/file.o and release/file.o, then both entries are preserved (different output paths).
Given duplicate detection configured with match_on: [command, arguments]:
Then configuration validation rejects it with an error explaining the conflict.
Given duplicate detection configured with match_on: []:
Then configuration validation rejects it with an error explaining the empty field list.
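
The two validation rules above could be checked along these lines. This is a sketch only: `Field`, `ConfigError`, and `validate_match_on` are hypothetical names, and the error messages are illustrative rather than Bear's actual wording.

```rust
use std::fmt;

/// Hypothetical enum of the configurable field names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Field {
    Directory,
    File,
    Arguments,
    Command,
    Output,
}

/// Hypothetical validation errors for the `match_on` list.
#[derive(Debug, PartialEq, Eq)]
enum ConfigError {
    EmptyFieldList,
    CommandAndArguments,
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::EmptyFieldList => {
                write!(f, "match_on must list at least one field")
            }
            ConfigError::CommandAndArguments => {
                write!(f, "match_on must not contain both command and arguments")
            }
        }
    }
}

/// Rejects an empty list and the conflicting command/arguments pair.
fn validate_match_on(fields: &[Field]) -> Result<(), ConfigError> {
    if fields.is_empty() {
        return Err(ConfigError::EmptyFieldList);
    }
    if fields.contains(&Field::Command) && fields.contains(&Field::Arguments) {
        return Err(ConfigError::CommandAndArguments);
    }
    Ok(())
}
```

Running the check at configuration load time, before any entries are processed, means a bad field list fails fast instead of silently producing a wrongly filtered database.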
Given an --append run where file.c exists in the old database, and the
new build also compiles file.c with the same flags:
When Bear generates the output, then only one entry for file.c appears (the original from the old database, because existing entries come first).
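
The "existing entries come first" behavior falls out naturally from chaining the two streams before filtering. The sketch below is an assumption-laden illustration: `Entry`, its `origin` tag, and `merge_append` are hypothetical, and for readability it stores the full key in the set rather than a 64-bit hash.

```rust
use std::collections::HashSet;

/// Hypothetical entry; `origin` exists only to make the result visible.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Entry {
    file: String,
    arguments: Vec<String>,
    origin: &'static str,
}

/// Chains existing entries before new ones and keeps the first occurrence.
fn merge_append(existing: Vec<Entry>, new: Vec<Entry>) -> Vec<Entry> {
    let mut seen = HashSet::new();
    existing
        .into_iter()
        .chain(new)
        // `insert` returns false for a key that was already seen, so any
        // later entry with the same file and arguments is dropped.
        .filter(|e| seen.insert((e.file.clone(), e.arguments.clone())))
        .collect()
}
```

Since the old database is consumed first, its entry for a given key is always the one that reaches the set first, and the matching entry from the new build is rejected.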

- [...] `directory` and `file` to prevent this.
- [...] `-cc1` frontend invocations. These are filtered by the semantic analyzer before reaching the duplicate filter, but the duplicate filter provides a safety net.
- [...] `--update` concept where duplicates are replaced rather than dropped. This is not currently implemented in the Rust version, but the configurable field matching provides a foundation for it.