src/discover/README.md
Full rewrite pipeline diagram: docs/contributing/TECHNICAL.md
This module has two jobs:
Rewrite commands — Every LLM agent hook calls rtk rewrite "git status". This module decides whether to rewrite it (rtk git status) or pass it through unchanged. This is the hot path — every command the LLM runs goes through here.
Analyze history — rtk discover scans past LLM sessions to find commands that could have been rewritten but weren't. Same classification logic, different consumer.
When a hook sends cargo fmt --all && cargo test 2>&1 | tail -20:
Tokenization — The lexer (lexer.rs) turns the raw string into typed tokens. It's a single-pass state machine that understands shell quoting, escapes, redirects, and operators. This is critical because naive string splitting breaks on quoted content like git commit -m "fix && update".
"cargo test 2>&1 && git status"
→ [Arg("cargo"), Arg("test"), Redirect("2>&1"), Operator("&&"), Arg("git"), Arg("status")]
Compound splitting — The rewrite engine walks the tokens, splitting on Operator (&&, ||, ;) and Pipe (|). Each segment is rewritten independently. For pipes, only the left side is rewritten (the pipe consumer like grep or head runs raw). find/fd before a pipe is never rewritten because rtk's grouped output format breaks pipe consumers like xargs.
Per-segment rewriting — Each segment goes through:
2>&1, >/dev/null) — matched via lexer tokens, set aside, re-appended after rewritinghead -20 file → rtk read file --max-lines 20, tail -n 5 file → rtk read file --tail-lines 5. These can't go through generic prefix replacement because it would produce rtk read -20 file (wrong flag position)sudo, FOO="bar baz"), normalize paths (/usr/bin/grep → grep), strip git global opts (git -C /tmp → git), then match against 60+ regex patterns from rules.rsrtk <cmd>, re-prepend the env prefix, re-append the redirect suffixGuards along the way:
RTK_DISABLED=1 in the env prefix → skip rewritegh with --json/--jq/--template → skip (structured output, rtk would corrupt it)cat with flags other than -n → skip (different semantics than rtk read)cat/head/tail with > or >> → skip (write operation, not a read)hooks.exclude_commands config → skipResult: rtk cargo fmt --all && rtk cargo test 2>&1 | tail -20. Bash handles the && and | at execution time — each rtk invocation is a separate process.
rtk discover reads Claude Code JSONL session files. Each file contains tool_use/tool_result pairs for every command the LLM ran. The module:
SessionProvider trait — currently only Claude Code)The classification logic is shared between discover and rewrite — same patterns, same rules, different consumers.
The ENV_PREFIX regex strips env variable assignments, sudo, and env from the front of commands. It handles:
FOO=barFOO="bar baz"FOO='bar baz'FOO="he said \"hello\""A="x y" B=1 sudo git statusThe prefix is stripped twice: once in classify_command() to match the underlying command against rules, and again in rewrite_segment() to extract it for re-prepending to the rewritten command.
Add an entry to rules.rs. Each rule has:
pattern — regex that matches the command (used by RegexSet for fast matching)rtk_cmd — the RTK command it maps to (e.g., "rtk cargo")rewrite_prefixes — command prefixes to replace (e.g., &["cargo"])category, savings_pct — metadata for discover reportssubcmd_savings, subcmd_status — per-subcommand overridesNo other files need to change. The registry compiles the patterns at first use via lazy_static.