proposals/20251205-multi-thread-falco-design.md
This document outlines a high-level design for implementing multi-threading in Falco. The goal of this proposal is to overcome the limits of Falco's single-threaded architecture and improve scalability in scenarios where the number of events produced cannot be processed by a single thread. This is achieved by leveraging multiple threads for event processing, rule evaluation, and output handling, enabling Falco to better utilize multi-core systems and reduce event drops under high event rates.
The success of this multi-threading initiative will be measured by the following key metrics:
These metrics will be evaluated through benchmarking and real-world deployment scenarios to validate that the multi-threaded architecture achieves its scalability goals without compromising correctness or introducing significant overhead.
O(n_cpus) scan on every next() call, it peeks at the head event from each ring buffer, finds the event with the minimum timestamp across all buffers and returns that event to Falco for processing. The consumer position is only advanced after the event has been consumed (on the next call), ensuring the caller can safely read the event data and avoiding the need to perform copies of the event data.

BPF_MAP_TYPE_PERF_EVENT_ARRAY used by the legacy eBPF probe.

libsinsp state (e.g., the thread state) is maintained in a shared data structure, allowing all workers to access data pushed by other workers. This is crucial for handling events like clone() that rely on data written by other partitions. This requires designing lightweight synchronization mechanisms to ensure efficient access to shared state without introducing significant contention. A dedicated proposal document will address the design of the shared state and synchronization mechanisms, and data consistency.

falco_outputs class implements a thread-safe, queue-based architecture using Intel TBB's concurrent_bounded_queue, which is specifically designed for multi-producer, single-consumer scenarios. Multiple worker threads can concurrently call handle_event() to enqueue alert messages using the thread-safe try_push() operation. A dedicated output worker thread consumes messages from the queue using pop() and sends them to all configured outputs (stdout, file, syslog, gRPC, HTTP, etc.). This design is already proven in production, as Falco's multi-source support (where different event sources run in separate threads) already uses this same queue concurrently. The existing implementation requires no changes to support multi-threaded event processing. Note that while outputs are processed in order within the queue, alerts from different worker threads may be interleaved, meaning strict temporal ordering of alerts across different processes is not guaranteed.
This is acceptable for security monitoring use cases where the primary concern is detecting and reporting security events rather than maintaining precise event ordering.

A crucial and challenging design aspect is partitioning the work to achieve a good trade-off among the following properties:

1. Load balancing
2. Contention
3. Temporal consistency
The first two properties are primarily focused on performance, while the third is essential for the correctness of the solution. These aspects are intrinsically linked.
Based on the analysis below, Static Partitioning by TGID is the proposed approach for the initial implementation.
Events are routed based on the TGID in kernel space (within the eBPF program) to a ring buffer dedicated to a specific partition. The routing logic executes at the point where events are captured, before they are written to any ring buffer. This partition is then consumed by a dedicated worker thread in userspace. The routing in the eBPF program can be accomplished with a simple hash and modulo operation, depending on the desired number of worker threads:
ring_buffer_index = hash(event->tgid) % num_workers
The hash function and number of workers are configured at eBPF program initialization time, allowing the kernel to route events directly to the appropriate ring buffer without userspace intervention.
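The routing formula above can be sketched in userspace C++ as follows. This is a minimal illustration, not the eBPF implementation: the proposal does not say which hash function is used, so `mix32` (the 32-bit finalizer from MurmurHash3) is an assumption, and `partition_histogram` is a hypothetical helper for eyeballing how evenly a range of TGIDs spreads across workers.

```cpp
#include <cstdint>
#include <vector>

// Assumed hash: the 32-bit finalizer ("fmix32") from MurmurHash3.
// The proposal does not specify the actual hash function.
inline std::uint32_t mix32(std::uint32_t x) {
    x ^= x >> 16;
    x *= 0x85ebca6bu;
    x ^= x >> 13;
    x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

// ring_buffer_index = hash(tgid) % num_workers
// num_workers must be greater than zero.
inline std::uint32_t ring_buffer_index(std::uint32_t tgid, std::uint32_t num_workers) {
    return mix32(tgid) % num_workers;
}

// Hypothetical helper: counts how many TGIDs in [first, last] land on each partition.
std::vector<std::uint32_t> partition_histogram(std::uint32_t first_tgid,
                                               std::uint32_t last_tgid,
                                               std::uint32_t num_workers) {
    std::vector<std::uint32_t> counts(num_workers, 0);
    for (std::uint32_t tgid = first_tgid; tgid <= last_tgid; ++tgid) {
        ++counts[ring_buffer_index(tgid, num_workers)];
    }
    return counts;
}
```

Because the index depends only on the TGID, every event of a given process lands in the same ring buffer. The histogram counts TGIDs, not events, so it says nothing about the "hot" process case where a single TGID produces most of the traffic.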
Pros:
Cons:
Load Imbalance / "Hot" Process Vulnerability: This static partitioning is susceptible to uneven worker load distribution, as a small number of high-activity ("hot") processes can overload the specific worker thread assigned to their TGID, creating a bottleneck.
Cross-Partition Temporal Inconsistency: Events that require information from a parent thread (e.g., fork/clone events) can still lead to causality issues. If the parent's related event is handled by a different, lagging partition, the required context might be incomplete or arrive out of order. Note that load imbalance amplifies this issue. Missing thread information is easy to detect, but there are also cases where the information is present but is either stale or already reflects state from after the clone event happened.
Ancestor information during rule evaluation: When evaluating rules that require ancestor information, the worker thread may need to access thread data from other partitions. Falco rules commonly check ancestor process attributes using fields that traverse the process hierarchy. Based on actual usage in Falco rules, commonly used ancestor fields include:
- proc.aname / proc.aname[N] - ancestor process name (where N is the generation level: 1=parent, 2=grandparent, 3=great-grandparent, etc., up to at least level 7)
- proc.aexepath[N] - ancestor executable path (e.g., proc.aexepath[2] for grandparent)
- proc.aexe[N] - ancestor executable (e.g., proc.aexe[2] for grandparent)

Accessing stale or "ahead" ancestor data (where the ancestor's state may be out of date or from events processed by other partitions with different timestamps) could lead to false positives or false negatives in rule evaluation. We acknowledge this potential issue and plan to assess its impact and determine appropriate mitigations once we have a running prototype.
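To make the ancestor traversal concrete, here is a minimal sketch of an N-level lookup over a thread table. `ThreadInfo`, `ThreadTable`, and `ancestor_name` are hypothetical names for illustration and do not reflect the libsinsp data structures. The sketch only shows the easy-to-detect case, a missing link in the chain; it cannot tell whether an ancestor entry that is present is stale or ahead of the event being evaluated.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical, simplified thread entry (not the libsinsp type).
struct ThreadInfo {
    std::int64_t ptid;  // parent thread id
    std::string name;   // process name
};

using ThreadTable = std::unordered_map<std::int64_t, ThreadInfo>;

// Returns the name of the ancestor `level` generations above `tid`
// (1 = parent, 2 = grandparent, ...), or std::nullopt if any entry
// along the chain is missing from the table.
std::optional<std::string> ancestor_name(const ThreadTable& table,
                                         std::int64_t tid,
                                         int level) {
    auto it = table.find(tid);
    for (int i = 0; it != table.end() && i < level; ++i) {
        it = table.find(it->second.ptid);
    }
    if (it == table.end()) {
        return std::nullopt;
    }
    return it->second.name;
}
```

With TGID partitioning, each step of this walk may touch an entry that is written by a different worker, which is why the shared thread state needs its own synchronization design.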
Mitigations:
Last-Resort Fetching: Fetch the thread information from a different channel to resolve the drift (e.g., proc scan, eBPF iterator). This is considered a last resort because it risks slowing down the event processing loop, potentially negating the performance benefits of multi-threading.
Context Synchronization: Wait for the required thread information to become available. This can be decomposed into two orthogonal concerns:
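How to handle the wait: Wait/Sleep (the worker blocks until the data is ready) or Deferring (the event is set aside and processed later).
How to detect data readiness: Polling (check repeatedly) or Signaling (be notified when the data becomes available).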
These combine into four possible approaches:
| | Polling | Signaling |
|---|---|---|
| Wait/Sleep | Spin-check until ready | Sleep on condition variable, wake on signal |
| Deferring | Periodically retry deferred events | Process deferred events when signaled |
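As an illustration of the Deferring + Signaling cell of the table, the following sketch parks events that depend on a thread whose state is not yet available and releases them once a readiness signal arrives. All names (`PendingEvent`, `DeferredEvents`, `defer`, `release`) are hypothetical. The sketch assumes the structure is owned by a single worker and that signals from other workers are delivered to that worker by some other means, so it contains no locking.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified event (not the libsinsp type).
struct PendingEvent {
    std::uint64_t ts;  // event timestamp
    std::int64_t tid;  // thread that generated the event
};

// Events parked until the state of the thread they depend on is available.
// Not thread-safe: assumed to be owned by one worker.
class DeferredEvents {
public:
    // Park `ev` until the state of `awaited_tid` has been published.
    void defer(std::int64_t awaited_tid, PendingEvent ev) {
        pending_[awaited_tid].push_back(ev);
    }

    // Called when the worker is signaled that `awaited_tid` is ready.
    // Returns the parked events in the order they were deferred.
    std::vector<PendingEvent> release(std::int64_t awaited_tid) {
        auto it = pending_.find(awaited_tid);
        if (it == pending_.end()) {
            return {};
        }
        std::vector<PendingEvent> ready = std::move(it->second);
        pending_.erase(it);
        return ready;
    }

    // Number of distinct threads that still have events waiting on them.
    std::size_t waiting_on() const { return pending_.size(); }

private:
    std::unordered_map<std::int64_t, std::vector<PendingEvent>> pending_;
};
```

A polling variant would keep the same container and periodically call `release` for each awaited thread after checking the shared state, instead of reacting to a signal.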
Synchronization point: A natural synchronization point is the clone exit parent event. At this point, the parent process has completed setting up the child's initial state (inherited file descriptors, environment, etc.), making it safe to start processing events for the newly created thread group.
Special case — vfork() / CLONE_VFORK: When vfork() is used, the parent thread is blocked until the child calls exec() or exits, delaying the clone exit parent event. An alternative synchronization point may be needed (e.g., adding back clone enter parent).
Similar to the previous approach, but events are routed by TID instead of TGID.
Pros:
Cons:
This approach routes events based on the CPU core where the event was captured. Each CPU core has its own ring buffer (per-CPU buffers), and multiple CPU buffers are assigned to the same partition. Each partition is consumed by a dedicated worker thread that reads from all the per-CPU buffers assigned to it. The number of partitions does not necessarily match the number of CPU cores—a single partition can read from multiple per-CPU buffers, allowing flexibility in choosing the number of worker threads independently from the number of CPU cores. This leverages the existing per-CPU ring buffer infrastructure used by the kernel module (kmod) and legacy eBPF probe, where events are written to per-CPU buffers that are then grouped into partitions consumed by worker threads.
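The grouping of per-CPU buffers into partitions, and the per-partition scan for the oldest event, can be sketched as follows. This is a simplified model under stated assumptions: `CpuEvent`, `CpuBuffer`, `assign_cpus`, and `next_event` are hypothetical names, a `std::deque` stands in for a real per-CPU ring buffer, and the event is copied and popped immediately, whereas the existing implementation described earlier advances the consumer position only on the following call to avoid copies.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

// Hypothetical, simplified event (not the libsinsp type).
struct CpuEvent {
    std::uint64_t ts;  // event timestamp
    int cpu;           // CPU on which the event was captured
};

// Stand-in for a per-CPU ring buffer.
using CpuBuffer = std::deque<CpuEvent>;

// partition = cpu_id % num_workers
// Returns, for each partition, the list of CPU ids it owns.
std::vector<std::vector<int>> assign_cpus(int n_cpus, int num_workers) {
    std::vector<std::vector<int>> parts(static_cast<std::size_t>(num_workers));
    for (int cpu = 0; cpu < n_cpus; ++cpu) {
        parts[static_cast<std::size_t>(cpu % num_workers)].push_back(cpu);
    }
    return parts;
}

// Linear scan over the heads of the buffers owned by one partition:
// removes and returns the event with the smallest timestamp, if any.
std::optional<CpuEvent> next_event(std::vector<CpuBuffer>& buffers,
                                   const std::vector<int>& my_cpus) {
    CpuBuffer* best = nullptr;
    for (int cpu : my_cpus) {
        CpuBuffer& b = buffers[static_cast<std::size_t>(cpu)];
        if (!b.empty() && (best == nullptr || b.front().ts < best->front().ts)) {
            best = &b;
        }
    }
    if (best == nullptr) {
        return std::nullopt;
    }
    CpuEvent ev = best->front();
    best->pop_front();
    return ev;
}
```

Each worker preserves timestamp order only among the buffers it owns. Events of a process that migrates to a CPU owned by another partition are ordered independently by that other worker.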
Pros:
partition = cpu_id % num_workers), and each worker thread reads from all per-CPU buffers assigned to its partition.

Cons:

O(n_cpus) scan to maintain event ordering.

BPF_MAP_TYPE_RINGBUF, which does not have a per-CPU design. This approach would only be viable with the kernel module (kmod) or legacy eBPF probe that use BPF_MAP_TYPE_PERF_EVENT_ARRAY with per-CPU buffers.

Instead of partitioning the data, this approach partitions the work by splitting processing into phases:
Pros:
Cons:
next() does not consume the event. We would also need some flow control (e.g., backpressure) to avoid processing too many events in parallel. This problem would arise only if the rules evaluation phase is slower than the parsing phase.

| Approach | Load Balancing | Contention | Temporal Consistency |
|---|---|---|---|
| TGID | Moderate (hot process risk) | Low | Good (within process) |
| TID | Good | Higher | Partial (thread-level only) |
| CPU Core | Good | Low | Poor (process migration issues) |
| Pipelining | Good (rules evaluation phase) | Low (writes) | Requires MVCC |
TGID partitioning was chosen because it offers the best balance between synchronization complexity and correctness guarantees. TID partitioning increases cross-partition access for thread group leader data (e.g., file descriptor table, working directory, environment variables), increasing the coordination cost. Per-CPU partitioning, while leveraging existing infrastructure, suffers from process migration issues that can cause significant temporal inconsistencies when processes move between CPUs. Functional partitioning, while elegant in its separation of concerns, introduces a single-threaded bottleneck in the parsing phase that limits scalability regardless of available cores, and requires complex MVCC mechanisms for data consistency and mechanisms for handling multiple events in parallel.