Back to Bear

Duplicate entry detection and filtering

docs/requirements/output-duplicate-detection.md

4.1.55.3 KB
Original Source

Intent

Build systems may invoke the compiler for the same source file more than once - parallel make retries, ccache wrappers, or repeated builds with --append after a file's flags change. The compilation database specification (https://clang.llvm.org/docs/JSONCompilationDatabase.html) allows multiple entries for the same file, noting this is for "different configurations", but stale duplicates confuse downstream tools and let old flags linger after a rebuild. By default Bear keeps a single entry per source file (identified by its directory and path), so recompiling a file with new flags updates its entry instead of accumulating a stale one. Users who genuinely need several configurations recorded for one file can widen the set of fields that distinguish entries.

Acceptance criteria

  • Duplicate entries are detected and only the first occurrence is kept
  • Duplicate detection is based on configurable fields (default: directory and file)
  • By default two entries for the same source file in the same directory are duplicates regardless of their compiler arguments, so only one survives; distinguishing entries by their flags requires adding the arguments field to the configured set
  • Two entries are considered duplicates when all configured fields match
  • Entries that differ in any configured field are preserved as distinct
  • The first-occurrence guarantee combines with append ordering (output-append): a newly generated entry is emitted before the matching entry from the existing database, so the new entry wins and its flags replace the old ones
  • Accepted entries appear in the output in the same order they were received
  • The set of fields used for matching is configurable via the duplicates section in the configuration file
  • Configuration validation rejects:
    • Empty field lists
    • Duplicate fields in the list
    • Both command and arguments in the same list (they are alternative representations of the same data)

Non-functional constraints

  • The first occurrence of each identical entry is kept and later duplicates are dropped
  • Entries are processed as a stream, one at a time, without buffering the whole input; every unique entry seen so far is remembered for the rest of the run, so a duplicate is detected no matter how far apart the two occurrences are

Testing

Given a build that compiles file.c twice with identical flags:

When Bear generates the compilation database, then only one entry for file.c appears in the output.

Given a build that compiles file.c with -O2 and then with -O3:

When Bear generates the compilation database with default duplicate config, then only one entry for file.c appears (default matching is directory and file, so arguments are ignored).

Given duplicate detection configured with match_on: [directory, file, arguments] and a build that compiles file.c with -O2 and then with -O3:

When Bear generates the compilation database, then both entries appear (arguments are part of the match, so they differ).

Given files src/util.c and lib/util.c (same basename, different directories):

When Bear generates the compilation database, then both entries are preserved (different directory means not a duplicate).

Given duplicate detection configured with match_on: [file]:

When a build compiles file.c twice with different flags, then only the first entry is kept (matching on file alone).

Given duplicate detection configured with match_on: [file, output]:

When file.c is compiled to both debug/file.o and release/file.o, then both entries are preserved (different output paths).

Given duplicate detection configured with match_on: [command, arguments]:

Then configuration validation rejects it with an error explaining the conflict.

Given duplicate detection configured with match_on: []:

Then configuration validation rejects it with an error explaining the empty field list.

Given an --append run where file.c exists in the old database, and the new build compiles file.c with different flags:

When Bear generates the output, then only one entry for file.c appears, recording the new flags (the new entry wins, because new entries come first and default matching ignores arguments).

Notes

  • GitHub issue #667 reported that files with identical basenames in separate directories were incorrectly dropped. This was caused by matching on filename alone without considering the directory. The default config still includes directory alongside file, so same-basename files in different directories remain distinct.
  • GitHub issue #638 reported duplicate entries from clang's internal -cc1 frontend invocations. These are filtered by the semantic analyzer before reaching the duplicate filter, but the duplicate filter provides a safety net.
  • GitHub PR #497 introduced an --update concept where existing entries are replaced when a file is recompiled with new flags. Bear now delivers this as the default rather than a separate flag: dropping arguments from the default match set collapses a file to one entry, and append ordering (output-append) makes the newest entry win. GitHub discussion #712 requested this for partial builds where changed flags previously left stale duplicates.

Rationale