content/en/docs/troubleshooting/dropping.md
Falco monitors each syscall based on deployed Falco rules. Additionally, Falco requires a few more syscalls to function properly; see Adaptive Syscalls Selection (`base_syscalls` with `repair: true`).

Falco monitors syscalls by hooking into kernel tracepoints. To transfer events from the kernel to userspace, it uses buffers, allocating a separate buffer for each CPU. If you're using the `modern_ebpf` driver, you can instead choose fewer, larger buffers shared among multiple CPUs (contention, according to kernel experts, should not be a problem). Each buffer's size is fixed at startup but can be adjusted via the `buf_size_preset` config option. Increasing the size helps, but keep in mind that the benefits may not grow proportionally. Also, remember that a larger buffer means more preallocated memory.
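For illustration, here is a sketch of where these knobs live in falco.yaml, assuming the engine configuration block used by recent Falco versions; the values shown are illustrative starting points to adjust, not recommendations:

```yaml
# Sketch only: assumed engine config block of recent Falco versions.
engine:
  kind: modern_ebpf          # or kmod / ebpf
  modern_ebpf:
    buf_size_preset: 4       # size index; higher means larger buffers
    cpus_for_each_buffer: 2  # number of CPUs sharing one buffer
  # For the kmod or ebpf drivers, the analogous knob would be, e.g.:
  # kmod:
  #   buf_size_preset: 4
```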
A `buf_size_preset` of 5 or 6 could be a valid option for large machines, assuming you use the `kmod` or `ebpf` drivers. With the `modern_ebpf` driver, try a `modern_ebpf.buf_size_preset` of 6 or 7, along with a `modern_ebpf.cpus_for_each_buffer` of 4 or 6. Feel free to experiment and adjust these values as needed.

Lastly, while it may sound appealing to push all filtering into the kernel, it is not that straightforward. In the kernel, you are in the application context, and you can slow down both the kernel and the application (for example, apps may then experience lower request rates). Check out the Driver Kernel Testing Framework for more information. Additionally, in the kernel you only have raw syscall arguments and can't easily correlate them with other events. All that being said, we are actively looking into ways to make the kernel logic smarter without sacrificing performance.
Falco's metrics config (see also Falco Metrics) enables you to measure Falco's kernel-side syscall drops and provides a range of useful metrics about how Falco is functioning. Key settings include:

- `kernel_event_counters_enabled: true`
- `libbpf_stats_enabled: true` (for the `ebpf` or `modern_ebpf` drivers; also enable `/proc/sys/kernel/bpf_stats_enabled`)

Here is an example metrics log snippet highlighting the fields crucial for this analysis. Pay close attention to `falco.evts_rate_sec` and `scap.evts_rate_sec`, as well as the monotonic drop counters that categorize syscalls into coarse-grained (non-comprehensive) categories. Refer to the dedicated metrics section in the Falco Performance guide for a more detailed explanation.
```yaml
{
  "output_fields": {
    "evt.source": "syscall",
    "falco.host_num_cpus": 96, # Divide *rate_sec by CPUs
    "falco.evts_rate_sec": 93345.1, # Taken between 2 metrics snapshots
    "falco.num_evts": 44381403800,
    "falco.num_evts_prev": 44045361392,
    # scap kernel-side counters
    "scap.evts_drop_rate_sec": 0.0, # Taken between 2 metrics snapshots
    "scap.evts_rate_sec": 93546.8, # Taken between 2 metrics snapshots
    "scap.n_drops": 0, # Monotonic counter all-time kernel side drops
    # Coarse-grained (non-comprehensive) categories for more granular insights
    "scap.n_drops_buffer_clone_fork_exit": 0,
    "scap.n_drops_buffer_close_exit": 0,
    "scap.n_drops_buffer_connect_enter": 0,
    "scap.n_drops_buffer_connect_exit": 0,
    "scap.n_drops_buffer_dir_file_exit": 0,
    "scap.n_drops_buffer_execve_exit": 0,
    "scap.n_drops_buffer_open_enter": 0,
    "scap.n_drops_buffer_open_exit": 0,
    "scap.n_drops_buffer_other_interest_exit": 0,
    "scap.n_drops_buffer_proc_exit": 0,
    "scap.n_drops_buffer_total": 0,
    "scap.n_drops_bug": 0,
    "scap.n_drops_page_faults": 0,
    "scap.n_drops_perc": 0.0, # Taken between 2 metrics snapshots
    "scap.n_drops_prev": 0,
    "scap.n_drops_scratch_map": 0,
    "scap.n_evts": 48528636923,
    "scap.n_evts_prev": 48191868502,
    # libbpf stats -> all-time kernel tracepoints invocations stats for a x86_64 machine
    "scap.sched_process_e.avg_time_ns": 2041, # scheduler process exit tracepoint
    "scap.sched_process_e.run_cnt": 151463770,
    "scap.sched_process_e.run_time_ns": 181866667867268,
    "scap.sys_enter.avg_time_ns": 194, # syscall enter (raw) tracepoint
    "scap.sys_enter.run_cnt": 933995602769,
    "scap.sys_enter.run_time_ns": 181866667867268,
    "scap.sys_exit.avg_time_ns": 205, # syscall exit (raw) tracepoint
    "scap.sys_exit.run_cnt": 934000454069,
    "scap.sys_exit.run_time_ns": 192201218598457
  },
  "rule": "Falco internal: metrics snapshot"
}
```
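To produce snapshots like the one above, enable the metrics block in falco.yaml. A minimal sketch might look like the following; the interval is an example value, and for libbpf stats you also need to enable BPF stats on the host (e.g. `sudo sysctl -w kernel.bpf_stats_enabled=1`):

```yaml
metrics:
  enabled: true
  interval: 15m                        # example snapshot interval
  output_rule: true                    # emit snapshots as Falco log events
  kernel_event_counters_enabled: true  # kernel-side event/drop counters
  libbpf_stats_enabled: true           # ebpf / modern_ebpf drivers only
```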
Since Falco 0.35.0, you have precise control over the syscalls Falco monitors. Refer to the Adaptive Syscalls Selection blog post and carefully read the base_syscalls config description for detailed information.
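For orientation, the `base_syscalls` block in falco.yaml has the following shape; the values shown here are, to our knowledge, the defaults (no custom set, no automatic repair):

```yaml
base_syscalls:
  # Explicit set of syscalls to monitor in addition to those required
  # by the loaded rules; empty by default.
  custom_set: []
  # When true, Falco automatically adds the syscalls it needs to keep
  # its internal state engine consistent.
  repair: false
```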
{{% pageinfo color=info %}} Falco's current metrics system lacks direct per-syscall counters that would let you pinpoint high-volume culprits. Until such counters become available, you need to derive these insights step by step, as shown below. {{% /pageinfo %}}
Generate a dummy rule designed not to trigger any alerts:
```yaml
- macro: spawned_process
  condition: (evt.type in (execve, execveat))

- rule: TEST Simple Spawned Process
  desc: "Test base_syscalls config option"
  enabled: true
  condition: spawned_process and proc.name=iShouldNeverAlert
  output: "%evt.type"
  priority: WARNING
```
Now, run Falco with the dummy rule and each of the test cases below, editing the `base_syscalls` config accordingly. If you're open to it, consider sharing anonymized logs for further assessment by the Falco maintainers or the community to explore potential solutions.
For each test, run Falco in dry-run debug mode initially to print the final set of syscalls.
```bash
sudo /usr/bin/falco -c /etc/falco/falco.yaml -r falco_rules_dummy.yaml -o "log_level=debug" -o "log_stderr=true" --dry-run
```
```
# Example Output for Test 2
XXX: (2) syscalls in rules: execve, execveat
XXX: +(16) syscalls (Falco's state engine set of syscalls): capset, chdir, chroot, clone, clone3, fchdir, fork, prctl, procexit, setgid, setpgid, setresgid, setresuid, setsid, setuid, vfork
XXX: (18) syscalls selected in total (final set): capset, chdir, chroot, clone, clone3, execve, execveat, fchdir, fork, prctl, procexit, setgid, setpgid, setresgid, setresuid, setsid, setuid, vfork
```
Subsequently, run Falco normally.
```bash
sudo /usr/bin/falco -c /etc/falco/falco.yaml -r falco_rules_dummy.yaml
```
Test 1: `spawned_process` only

```yaml
base_syscalls:
  custom_set: [clone, clone3, fork, vfork, execve, execveat, procexit]
  repair: false
```
{{% pageinfo color=info %}}
If Test 1 already fails and you see drops even after adjusting `buf_size_preset` and other parameters, Falco may unfortunately be less usable on this particular system.
{{% /pageinfo %}}
Test 2: `spawned_process` + the minimum syscalls needed for Falco's state engine (internal process cache table)

```yaml
base_syscalls:
  custom_set: []
  repair: true
```
Test 3: network `accept*`

```yaml
base_syscalls:
  custom_set: [clone, clone3, fork, vfork, execve, execveat, getsockopt, socket, bind, accept, accept4, close]
  repair: false
```
Test 4: network `connect`

```yaml
base_syscalls:
  custom_set: [clone, clone3, fork, vfork, execve, execveat, getsockopt, socket, connect, close]
  repair: false
```
Test 5: `open*` syscalls

```yaml
base_syscalls:
  custom_set: [clone, clone3, fork, vfork, execve, execveat, open, openat, openat2, close]
  repair: false
```
Test n

Continue with custom tests until you can effectively monitor all desired syscalls on your servers without event drops, or with a minimal, acceptable level of drops; for example, by combining test sets as in the sketch below.
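As one hypothetical example of such a custom test, you could merge the sets from Test 1 and Test 4 to monitor process spawning and outbound connections together:

```yaml
base_syscalls:
  # Union of the Test 1 (spawned_process) and Test 4 (network connect) sets
  custom_set: [clone, clone3, fork, vfork, execve, execveat, procexit, getsockopt, socket, connect, close]
  repair: false
```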
The question of how high an event rate Falco can handle is challenging, as it's not solely about the pure "kernel event rate". In unrealistic benchmarking tests, you can artificially drive the rates very high without dropping events. In real-life production, we believe the answer is more complex, involving not just event rates but also the actual nature of the events, and possibly bursts of events within very short periods of time.
Additionally, we believe it's best to normalize event rates by the number of CPUs (e.g. `scap.evts_rate_sec / falco.host_num_cpus`). Busier servers with 96, 128, or more CPUs naturally have higher event rates than VMs with 12 CPUs, for instance.
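Assuming you redirect Falco's JSON output to a file (hypothetically named `metrics.json` here), a quick sketch of this calculation with `jq` could look like:

```bash
# Sketch: per-CPU event rate from metrics snapshots.
# With the example snapshot above: 93546.8 / 96 ≈ 974 events/sec per CPU.
jq 'select(.rule == "Falco internal: metrics snapshot")
    | .output_fields["scap.evts_rate_sec"] / .output_fields["falco.host_num_cpus"]' metrics.json
```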
Nevertheless, here are some numbers we have heard from various adopters. Please take them with a grain of salt: