Back to Bear

Disambiguate ambiguous compiler names by version probe

requirements/recognition-ambiguous-name-probe.md

4.1.36.4 KB
Original Source

Intent

When a build invokes a compiler under an ambiguous name (notably cc and c++), Bear must dispatch to the correct interpreter (GCC vs Clang) regardless of which toolchain the system actually installs under that name. On Linux cc is typically GCC; on FreeBSD, OpenBSD, NetBSD, DragonFly, and macOS cc is Clang. Misidentifying the compiler causes flag-arity mistakes (e.g., Clang's -Xclang <arg> consumes the next argv slot, GCC's flag table does not), which corrupts source/output detection in the compilation database.

The user expects bear -- cc -c hello.c to produce a correct entry on every host, without per-platform configuration, and without losing the ability to override Bear's guess when needed.

Acceptance criteria

  • For executables whose basename matches a known-ambiguous name (cc, c++), Bear runs the executable once with --version to classify it as GCC or Clang before dispatching to an interpreter.
  • The probe runs at most once per distinct canonical executable path for the lifetime of the process. Subsequent invocations of the same path reuse the cached result.
  • The probe is the sole classifier for ambiguous names. The compiler recognition regex (gcc.yaml and friends) deliberately does not list cc/c++, so when the probe declines (timeout, unrecognizable output, failed spawn, non-zero exit) recognition returns NotRecognized rather than guessing. A missing entry is visible and debuggable; a wrongly-classified entry corrupts the compilation database silently via mismatched flag-arity tables, which is the bug this requirement exists to prevent.
  • Probes never deadlock. Stdin is closed; the call returns within a bounded time budget even for hung children.
  • Probes do not re-enter Bear's own interception. LD_PRELOAD and DYLD_INSERT_LIBRARIES are stripped from the probe's environment.
  • A user compilers: config entry for a path takes priority over the probe and disables it for that path. This is the supported override mechanism and the only way to recover recognition for a quirky cc whose --version output does not match the probe's signature rules.
  • Wrapper basenames (ccache, distcc, sccache) are never probed even if they appear under an ambiguous name. The wrapper interpreter handles them as today.
  • Names that are not in the ambiguous set (e.g. gcc, clang, gfortran, cross-prefixed or versioned variants) are not probed and continue to resolve via regex.
  • On non-Unix targets (Windows) the probe is not available and the recognizer wires up a no-op probe. Windows toolchains use unambiguous basenames (cl.exe, clang-cl, gcc.exe) that the regex layer classifies directly. Bare cc/c++ on Windows falls through to NotRecognized; in practice no Windows toolchain installs them.

Non-functional constraints

  • The probe must not be invoked from the per-execution hot path more than once per unique canonical path. The intended cost on a clean build is one or two fork+exec pairs total, not per invocation.
  • Probe timeout: short (single-digit seconds), bounded.
  • The classification rule is conservative: when in doubt, return no classification rather than guess wrong. A misclassification corrupts the database; a non-classification falls back to existing behavior.

Testing

Given a host where /usr/bin/cc is Clang:

When Bear recognizes an execution of cc -c hello.c, then it dispatches to the Clang interpreter, and a Clang-only flag with a follow-on argument (e.g. -Xclang -ast-dump) is parsed with correct arity.

Given a host where /usr/bin/cc is GCC:

When Bear recognizes an execution of cc -c hello.c, then it dispatches to the GCC interpreter.

Given any host:

When Bear recognizes the same cc path 1000 times in one run, then the executable is fork-exec'd at most once for probing.

Given a user config containing compilers: [{ path: /usr/bin/cc, as: gcc }]:

When Bear recognizes /usr/bin/cc, then the result is GCC and no probe is performed.

Given an executable that hangs on --version:

When Bear probes it, then the call returns within the configured timeout and the execution is reported as NotRecognized (no entry is written; there is no regex fallback for ambiguous names).

Given an executable named cc whose --version output contains no recognizable signature (e.g. a custom wrapper that prints a vendor banner):

When Bear recognizes it, then the probe declines, recognition returns None, and the execution is reported as NotRecognized. The user can recover the entry by adding the path to compilers: with an explicit as: field.

Given an executable that reads from stdin on --version (e.g. a misplaced bash in PATH named cc):

When Bear probes it, then the call does not block indefinitely (stdin is closed, so the read returns EOF and the process exits).

Given Bear is running with LD_PRELOAD set to its own interception library:

When Bear probes a compiler, then the probe's environment has LD_PRELOAD removed and the probe execution is not itself recorded as a build event.

Given an executable named /usr/lib/ccache/cc that resolves (after canonicalization) to the ccache wrapper:

When Bear recognizes /usr/lib/ccache/cc, then it does not probe the binary and the wrapper interpreter handles the invocation as today.

Notes

  • See default_cc.md for the design discussion that selected this approach over PR #695 (per-invocation probe) and over the per-OS defaults variant.
  • Override mechanism: by user request, the only way to disable the probe for a given path is to declare it in compilers:. There is no process-wide off switch; the override is per-path and explicit.
  • No regex fallback for ambiguous names: the original implementation layered the probe on top of the regex (probe-then-regex), so probe failure silently defaulted to GCC. That re-introduced the bug on the very platforms the probe was meant to fix. The current design makes the probe the sole classifier so the failure mode is loud (no entry) rather than wrong (entry with mis-parsed flags). gcc.yaml carries a comment explaining why cc/c++ are absent from its recognize list.
  • Ambiguous-names list is intentionally minimal (cc, c++). Cross- prefixed variants (aarch64-linux-gnu-cc) are not in the list because cross-toolchains are overwhelmingly GCC and the regex already handles them; if a real BSD cross-toolchain case appears, the list can grow.