docs/research/rf-topological-sensing/03-attention-mechanisms-rf-sensing.md
Document: 03-attention-mechanisms-rf-sensing
Date: 2026-03-08
Status: Research Reference
Scope: Attention architectures for graph-based RF sensing where ESP32 nodes form a dynamic signal topology and minimum cut partitioning detects human presence, pose, and activity.
RF topological sensing models a physical space as a dynamic signal graph G = (V, E, W), where V is the set of ESP32 nodes, E is the set of RF links between node pairs, and W assigns each link a weight derived from its measured signal coherence.
A person moving through the space perturbs the RF field, causing coherence drops along links whose Fresnel zones intersect the person's body. Minimum cut partitioning of this weighted graph identifies the boundary between perturbed and unperturbed subgraphs, localizing the person.
RF Topological Sensing — Conceptual Model
==========================================
Physical Space Signal Graph G = (V, E, W)
+-----------------------+
| | N1 ----0.92---- N2
| [N1] [N2] | / \ / \
| \ / | 0.31 0.87 0.45 0.91
| \ P / | / \ / \
| \../ | N4 --0.28-- N5 --0.89-- N3
| [N4]...[P]....[N3] | \ /
| / \ | 0.93 ------ 0.90
| / \ |
| [N5] [N6] | Low weights (0.28, 0.31, 0.45) indicate
| | links crossing the person P's position.
+-----------------------+ Mincut separates {N4,N5} from {N1,N2,N3,N6}.
Traditional RF sensing uses hand-crafted features: amplitude variance, phase difference, subcarrier correlation. The fundamental limitation of these features is that they weight every subcarrier, time step, and link the same way regardless of context.
Attention mechanisms address this by learning to weight information sources (subcarriers, time steps, links, and nodes) according to their relevance for the downstream task.
| Symbol | Meaning |
|---|---|
| N | Number of ESP32 nodes |
| L = N(N-1)/2 | Number of bidirectional links |
| S | Number of OFDM subcarriers (typically 52 or 114) |
| T | Number of time steps in a CSI window |
| H_l(t) in C^S | CSI vector for link l at time t |
| d_k | Attention key/query dimension |
| h | Number of attention heads |
In a standard graph formulation, the adjacency matrix A has entries a_ij representing signal coherence between nodes i and j. Graph Attention Networks (GATs) replace these fixed weights with learned attention coefficients that adapt based on the node features.
Given node feature vectors x_i in R^F for each ESP32 node i, GAT computes attention coefficients:
e_ij = LeakyReLU(a^T [W x_i || W x_j])
alpha_ij = softmax_j(e_ij) = exp(e_ij) / sum_k(exp(e_ik))
where W in R^{F' x F} is a shared linear projection applied to every node, a in R^{2F'} is a learnable attention vector, || denotes concatenation, and the softmax normalizes over the neighbors j in N(i).
The updated node representation becomes:
x_i' = sigma( sum_j alpha_ij W x_j )
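A minimal NumPy sketch of a single GAT head implementing the equations above; the layer sizes, the ELU output nonlinearity, and the `gat_layer` helper are illustrative choices, not taken from any RuVector code.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(X, adj, W, a):
    """Single-head GAT update.
    X:   (N, F)   node features (one row per ESP32 node)
    adj: (N, N)   binary adjacency (1 if a link exists)
    W:   (F, Fp)  shared linear projection
    a:   (2*Fp,)  attention vector
    Returns updated node features (N, Fp)."""
    H = X @ W                                    # projected features (N, Fp)
    # e_ij = LeakyReLU(a^T [W x_i || W x_j]) for every node pair
    src = H @ a[: H.shape[1]]                    # a_left  . h_i, shape (N,)
    dst = H @ a[H.shape[1]:]                     # a_right . h_j, shape (N,)
    e = leaky_relu(src[:, None] + dst[None, :])  # (N, N)
    e = np.where(adj > 0, e, -1e9)               # mask non-edges before the softmax
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # softmax over neighbors j
    out = alpha @ H                              # x_i' = sigma(sum_j alpha_ij W x_j)
    return np.where(out > 0, out, np.exp(out) - 1)     # ELU as sigma

# toy 6-node mesh, fully connected
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
adj = np.ones((6, 6)) - np.eye(6)
W = rng.normal(size=(8, 16)) * 0.1
a = rng.normal(size=(32,)) * 0.1
print(gat_layer(X, adj, W, a).shape)             # (6, 16)
```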
For RF sensing, node features are not given directly. Each ESP32 node participates in multiple links, and each link produces CSI streams. We construct node features by aggregating incoming link information:
x_i = AGG({ f(H_ij(t)) : j in N(i), t in [T] })
where f is a feature extractor (e.g., amplitude statistics, phase slope) and AGG is mean or max pooling over neighbors and time.
Node Feature Construction
=========================
Links to Node N1: Feature Extraction: Node Feature:
N2->N1: H_21(1..T) ---> f(H_21) = [amp_var, \
N3->N1: H_31(1..T) ---> f(H_31) = phase_slope, > AGG --> x_1 in R^F
N4->N1: H_41(1..T) ---> f(H_41) = corr, ...] /
N5->N1: H_51(1..T) ---> f(H_51) /
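A small sketch of this aggregation, assuming CSI arrives as complex (T x S) arrays per link; the specific features (`amp_var`, `phase_slope`, temporal correlation) mirror the diagram, and mean pooling stands in for AGG.

```python
import numpy as np

def link_features(H):
    """f(H_ij): summary features for one link's CSI window.
    H: (T, S) complex CSI (time steps x subcarriers)."""
    amp = np.abs(H)
    phase = np.unwrap(np.angle(H), axis=1)
    amp_var = amp.var(axis=0).mean()                  # motion-induced amplitude variance
    phase_slope = np.polyfit(np.arange(H.shape[1]), phase.mean(axis=0), 1)[0]
    # mean correlation between consecutive time steps (temporal coherence)
    corr = np.mean([np.corrcoef(amp[t], amp[t + 1])[0, 1]
                    for t in range(H.shape[0] - 1)])
    return np.array([amp_var, phase_slope, corr])

def node_feature(csi_by_neighbor):
    """x_i = AGG over incoming links: mean pooling of per-link features."""
    feats = np.stack([link_features(H) for H in csi_by_neighbor.values()])
    return feats.mean(axis=0)

# toy data: node N1 hears links from nodes 2..5
rng = np.random.default_rng(1)
csi = {j: rng.normal(size=(128, 52)) + 1j * rng.normal(size=(128, 52))
       for j in (2, 3, 4, 5)}
print(node_feature(csi))          # 3-dimensional node feature for N1
```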
Single-head attention captures one notion of relevance. Multi-head attention runs h independent attention computations and concatenates or averages:
x_i' = ||_{k=1}^{h} sigma( sum_j alpha_ij^(k) W^(k) x_j )
For RF sensing, different heads can specialize in different phenomena:
| Head | Learned Specialization |
|---|---|
| Head 1 | Line-of-sight path quality |
| Head 2 | Multipath richness (scattering) |
| Head 3 | Temporal stability (static vs dynamic) |
| Head 4 | Frequency selectivity (subcarrier variance) |
Standard GAT only uses node features to compute attention. In RF sensing, edges carry rich information (the CSI itself). Edge-featured GAT incorporates edge attributes u_ij directly:
e_ij = LeakyReLU(a^T [W_n x_i || W_n x_j || W_e u_ij])
where u_ij in R^E contains link-level features such as amplitude variance, phase slope, and subcarrier correlation extracted from the link's CSI (a sketch follows the diagram below).
Edge-Featured GAT — RF Sensing
================================
x_i x_j
| |
v v
[W_n x_i] [W_n x_j]
| |
+--- CONCAT ---+--- CONCAT ---+
| |
[W_e u_ij] |
| |
[ a^T [...] ] |
| |
LeakyReLU |
| |
alpha_ij |
| |
alpha_ij * W x_j ---+---> contribution to x_i'
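A sketch of the edge-featured scoring, with the attention vector split into node and edge parts; the tensor layout of the edge attributes u_ij and the layer sizes are assumptions for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def edge_featured_scores(X, U, W_n, W_e, a):
    """e_ij = LeakyReLU(a^T [W_n x_i || W_n x_j || W_e u_ij]).
    X: (N, F) node features; U: (N, N, E) CSI-derived edge attributes."""
    Hn = X @ W_n                          # (N, Fp)
    He = U @ W_e                          # (N, N, Ep)
    Fp, Ep = Hn.shape[1], He.shape[2]
    a_i, a_j, a_e = a[:Fp], a[Fp:2 * Fp], a[2 * Fp:]
    scores = (Hn @ a_i)[:, None] + (Hn @ a_j)[None, :] + He @ a_e
    return leaky_relu(scores)             # (N, N); softmax over j is applied afterwards

rng = np.random.default_rng(2)
X, U = rng.normal(size=(6, 8)), rng.normal(size=(6, 6, 3))   # 3 link-level features
scores = edge_featured_scores(X, U, rng.normal(size=(8, 16)) * 0.1,
                              rng.normal(size=(3, 16)) * 0.1,
                              rng.normal(size=(48,)) * 0.1)
print(scores.shape)                       # (6, 6)
```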
The original GAT has a "static attention" limitation: the ranking of attention coefficients over neighbors is the same for every query node, because the scoring function cannot condition the ranking on the query. GATv2 fixes this by applying the nonlinearity after concatenation but before the dot product:
e_ij = a^T LeakyReLU(W [x_i || x_j])
This is strictly more expressive and important for RF sensing where the same node should attend differently depending on which neighbor it is evaluating — a dynamic property essential for tracking moving targets.
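A sketch contrasting the two scoring orders on toy shapes; the hidden dimension of W in the GATv2 variant is arbitrary.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_scores(H, a):
    """Original GAT: nonlinearity applied outside the dot product, which yields
    an effectively static ranking of neighbors across all query nodes."""
    Fp = H.shape[1]
    return leaky_relu((H @ a[:Fp])[:, None] + (H @ a[Fp:])[None, :])

def gatv2_scores(X, W, a):
    """GATv2: e_ij = a^T LeakyReLU(W [x_i || x_j]). The nonlinearity comes before
    the dot product, so the ranking over neighbors j can change with the query i."""
    N = X.shape[0]
    pairs = np.concatenate([np.repeat(X, N, axis=0), np.tile(X, (N, 1))], axis=1)
    return (leaky_relu(pairs @ W.T) @ a).reshape(N, N)

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 8))
W = rng.normal(size=(16, 16)) * 0.1     # maps [x_i || x_j] in R^16 to a hidden space
a = rng.normal(size=(16,))
print(gatv2_scores(X, W, a).shape)      # (6, 6)
```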
CSI measurements arrive as time series at 100-1000 Hz. Human motion creates characteristic temporal patterns: periodic breathing modulates amplitude at 0.2-0.5 Hz, walking creates 1-2 Hz Doppler signatures, and gestures produce transient bursts. Self-attention over CSI sequences identifies which time steps carry the most information for graph weight updates.
Given a CSI sequence H = [h_1, h_2, ..., h_T] where h_t in R^S is the CSI vector at time t, self-attention computes:
Q = H W_Q, K = H W_K, V = H W_V
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
The attention matrix A in R^{T x T} has entry A_ts representing how much time step t attends to time step s. This captures periodic self-similarity (breathing and gait cycles) as well as transient events such as gestures (a sketch follows the diagram below).
Self-Attention on CSI Time Series
==================================
Input: T time steps of S-dimensional CSI vectors
h_1 h_2 h_3 ... h_T Time steps
| | | |
v v v v
[ Linear Projections Q, K, V ]
| | | |
v v v v
[ Scaled Dot-Product Attention ]
| | | |
v v v v
z_1 z_2 z_3 ... z_T Contextualized representations
Attention Pattern (breathing example):
t1 t2 t3 t4 t5 t6 t7 t8
t1 [ .9 .3 .1 .0 .7 .2 .1 .0 ] <-- attends to t1, t5
t2 [ .3 .9 .3 .1 .2 .7 .3 .1 ] (same phase of
t3 [ .1 .3 .9 .3 .1 .2 .7 .3 ] breathing cycle)
t4 [ .0 .1 .3 .9 .0 .1 .3 .8 ]
...
Off-diagonal bands (here at a lag of 4 steps) indicate periodic self-similarity.
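A minimal sketch of scaled dot-product self-attention applied to one link's CSI amplitudes; the projection width `d` is arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_Q, W_K, W_V):
    """H: (T, S) real-valued CSI features (e.g., amplitudes) for one link.
    Returns contextualized representations Z: (T, d) and the (T, T) attention map."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    d_k = Q.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=1)   # A[t, s]: step t attending to step s
    return A @ V, A

rng = np.random.default_rng(4)
T, S, d = 128, 52, 32
H = np.abs(rng.normal(size=(T, S)) + 1j * rng.normal(size=(T, S)))   # amplitude CSI
Z, A = self_attention(H, *(rng.normal(size=(S, d)) * 0.05 for _ in range(3)))
print(Z.shape, A.shape)                           # (128, 32) (128, 128)
```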
CSI time series require positional encoding to preserve temporal ordering. Sinusoidal positional encodings work well, but learnable encodings tuned to the CSI sampling rate can capture hardware-specific timing patterns:
PE(t, 2i) = sin(t / 10000^{2i/d})
PE(t, 2i+1) = cos(t / 10000^{2i/d})
For 100 Hz CSI with T=128 window, the positional encoding must resolve 10 ms differences. An alternative is relative positional encoding (RPE) which encodes the time difference (t - s) rather than absolute position, making the model invariant to window start time.
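A sketch of both encodings; `relative_bias` here is a simple lookup-table form of RPE, one of several possible variants.

```python
import numpy as np

def sinusoidal_pe(T, d):
    """PE(t, 2i) = sin(t / 10000^(2i/d)), PE(t, 2i+1) = cos(t / 10000^(2i/d))."""
    t = np.arange(T)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = t / np.power(10000.0, 2 * i / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def relative_bias(T, table):
    """RPE variant: a learned bias indexed by the time difference (t - s).
    table: (2*T - 1,) parameters; returns a (T, T) additive bias for the logits."""
    idx = np.arange(T)[:, None] - np.arange(T)[None, :] + (T - 1)
    return table[idx]

pe = sinusoidal_pe(128, 32)                 # added to CSI features before attention
bias = relative_bias(128, np.zeros(255))    # zero-initialized bias table
print(pe.shape, bias.shape)                 # (128, 32) (128, 128)
```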
For real-time sensing, causal (masked) attention is necessary — time step t can only attend to steps 1..t:
Mask_ts = { 0     if s <= t
          { -inf  if s > t
A = softmax((Q K^T + Mask) / sqrt(d_k))
For offline analysis (e.g., training data labeling), bidirectional attention provides richer context by allowing each step to attend to the full window.
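A minimal sketch of the mask construction; the mask is added to the logits before the softmax, as in the formula above.

```python
import numpy as np

def causal_mask(T):
    """(T, T) mask with 0 where key s <= query t and -inf where s > t (future steps)."""
    m = np.zeros((T, T))
    m[np.triu_indices(T, k=1)] = -np.inf   # strict upper triangle = future keys
    return m

# added to the logits before the softmax:
#   A = softmax((Q @ K.T + causal_mask(T)) / sqrt(d_k))
print(causal_mask(4))
```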
The key application is collapsing the time dimension into a single edge weight for graph construction. Attention-weighted temporal pooling:
w_ij = sum_t alpha_t * g(z_t^{ij})
where alpha_t = softmax(v^T tanh(W_a z_t^{ij}))
Here z_t^{ij} is the contextualized CSI representation for link (i,j) at time t, and g maps to a scalar coherence score. The attention weights alpha_t learn to focus on the most informative moments — for example, the peak of a Doppler burst during a gesture.
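A sketch of this pooling, assuming the per-step score g is a sigmoid of a linear map; that is one simple choice, any scalar head would do.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pool(Z, W_a, v, w_out):
    """Collapse a link's contextualized sequence Z: (T, d) into one edge weight.
    alpha_t = softmax(v^T tanh(W_a z_t)); w_ij = sum_t alpha_t * g(z_t)."""
    scores = np.tanh(Z @ W_a.T) @ v          # (T,) unnormalized temporal scores
    alpha = softmax(scores)                  # temporal attention weights
    g = 1.0 / (1.0 + np.exp(-(Z @ w_out)))   # per-step coherence score in (0, 1)
    return float(alpha @ g), alpha

rng = np.random.default_rng(5)
Z = rng.normal(size=(128, 32))
w_ij, alpha = attentive_pool(Z, rng.normal(size=(16, 32)), rng.normal(size=(16,)),
                             rng.normal(size=(32,)) * 0.1)
print(round(w_ij, 3), int(alpha.argmax()))   # edge weight and the most attended step
```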
In a multistatic RF sensing setup, links are not independent. A person walking between nodes N1 and N3 simultaneously affects links (N1,N3), (N2,N3), and (N1,N4) to varying degrees. Cross-attention captures these correlations by allowing each link's representation to attend to all other links.
Let Z^{ij} in R^{T x d} be the temporal CSI embedding for link (i,j) after self-attention. Cross-attention between link (i,j) and all other links:
Q = Z^{ij} W_Q (query from target link)
K = [Z^{kl}] W_K (keys from all links, stacked)
V = [Z^{kl}] W_V (values from all links, stacked)
CrossAttn(ij) = softmax(Q K^T / sqrt(d_k)) V
Cross-Attention for Multi-Link Fusion
======================================
Link (1,2) Link (1,3) Link (2,3) Link (2,4) ...
| | | |
[Self-Attn] [Self-Attn] [Self-Attn] [Self-Attn]
| | | |
v v v v
Z^12 Z^13 Z^23 Z^24
| | | |
+------+-------+------+------+------+------+
| | |
[Cross-Attn] [Cross-Attn] [Cross-Attn] ...
| | |
v v v
C^12 C^13 C^23
| | |
[Edge Score] [Edge Score] [Edge Score]
| | |
v v v
w_12 w_13 w_23
Each link attends to all other links to capture
spatial correlations from shared human targets.
Links that are physically close or share a node should receive a higher attention baseline. We introduce a geometric bias G_bias:
A = softmax((Q K^T + G_bias) / sqrt(d_k)) V
where G_bias_mn encodes the geometric relationship between link m and link n:
G_bias_mn = -beta * d_Fresnel(m, n) + gamma * shared_node(m, n)
This is the concept implemented in RuVector's CrossViewpointAttention
with GeometricBias — the attention mechanism is biased toward
geometrically meaningful link combinations while still allowing the model
to discover non-obvious correlations.
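The following is a generic NumPy sketch of cross-link attention with an additive geometric bias, written from the equations above. It is not the RuVector CrossViewpointAttention implementation; to keep it small, keys and values come from per-link temporal means rather than the full stacked sequences, and the Fresnel distances and shared-node indicators are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometric_bias(d_fresnel, shared_node, beta=1.0, gamma=0.5):
    """G_bias_mn = -beta * d_Fresnel(m, n) + gamma * shared_node(m, n)."""
    return -beta * d_fresnel + gamma * shared_node

def cross_link_attention(Z, G_bias, W_Q, W_K, W_V):
    """Z: (L, T, d) temporal embeddings for all L links (after self-attention).
    Each link's sequence attends to pooled summaries of every other link."""
    summaries = Z.mean(axis=1)                   # (L, d): one summary per link
    Q = Z @ W_Q                                  # (L, T, d): queries per time step
    K, V = summaries @ W_K, summaries @ W_V      # (L, d), (L, d)
    logits = Q @ K.T / np.sqrt(K.shape[1])       # (L, T, L)
    logits += G_bias[:, None, :]                 # geometric prior between link pairs
    return softmax(logits, axis=-1) @ V          # (L, T, d) fused representations

rng = np.random.default_rng(6)
L, T, d = 15, 128, 32
Z = rng.normal(size=(L, T, d))
d_f = rng.uniform(0, 3, size=(L, L))                      # placeholder Fresnel distances
shared = rng.integers(0, 2, size=(L, L)).astype(float)    # placeholder shared-node flags
C = cross_link_attention(Z, geometric_bias(d_f, shared),
                         *(rng.normal(size=(d, d)) * 0.05 for _ in range(3)))
print(C.shape)     # (15, 128, 32)
```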
For N nodes with L = N(N-1)/2 links, full cross-attention is O(L^2). A hierarchical approach reduces this:
Level 1 (Link -> Node): Links incident to Ni --> aggregate --> n_i
Level 2 (Node -> Node): {n_1, ..., n_N} --> Cross-Attn --> {n_1', ..., n_N'}
Level 3 (Node -> Link): n_i', n_j' --> project --> w_ij
This reduces complexity from O(L^2) = O(N^4) to O(N^2), critical for dense meshes with 6-8 nodes (15-28 links).
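A sketch of the three levels, assuming mean aggregation at Level 1 and a sigmoid edge head at Level 3; both are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_edge_weights(link_emb, links, n_nodes, W_Q, W_K, w_out):
    """Link -> Node -> Node -> Link refinement, O(N^2) instead of O(L^2).
    link_emb: (L, d) pooled link embeddings; links: list of (i, j) node index pairs."""
    d = link_emb.shape[1]
    node = np.zeros((n_nodes, d))
    count = np.zeros(n_nodes)
    # Level 1: mean-aggregate each node's incident links
    for idx, (i, j) in enumerate(links):
        node[i] += link_emb[idx]
        node[j] += link_emb[idx]
        count[i] += 1
        count[j] += 1
    node /= count[:, None]
    # Level 2: node-to-node attention over the N node summaries
    Q, K = node @ W_Q, node @ W_K
    node = softmax(Q @ K.T / np.sqrt(Q.shape[1]), axis=1) @ node
    # Level 3: project each node pair back to a scalar edge weight in (0, 1)
    return {(i, j): float(1 / (1 + np.exp(-np.concatenate([node[i], node[j]]) @ w_out)))
            for (i, j) in links}

rng = np.random.default_rng(7)
links = [(i, j) for i in range(6) for j in range(i + 1, 6)]   # 6 nodes, 15 links
w = hierarchical_edge_weights(rng.normal(size=(15, 32)), links, 6,
                              rng.normal(size=(32, 16)) * 0.1,
                              rng.normal(size=(32, 16)) * 0.1,
                              rng.normal(size=(64,)) * 0.1)
print(len(w), round(w[(0, 1)], 3))
```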
Given graph G = (V, E, W), the minimum s-t cut partitions V into S and T such that s in S, t in T, and the cut weight is minimized:
mincut(S, T) = sum_{(i,j): i in S, j in T} w_ij
For RF sensing, we seek the normalized cut (Ncut) which balances partition sizes:
Ncut(S, T) = cut(S,T)/assoc(S,V) + cut(S,T)/assoc(T,V)
where assoc(S,V) = sum of all edge weights incident to S.
The discrete mincut problem is NP-hard. The spectral relaxation uses the graph Laplacian L = D - W (D is the degree matrix):
min_y y^T L y / y^T D y subject to y in {-1, +1}^N
Relaxed: min_y y^T L y / y^T D y, y in R^N
The solution is the Fiedler vector: the generalized eigenvector of L y = lambda D y (equivalently, of the normalized Laplacian) associated with the smallest nonzero eigenvalue.
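A sketch of the spectral relaxation on a toy 6-node graph in which the links crossing the person's position carry low weights; the weight values are invented for the example.

```python
import numpy as np

def fiedler_partition(W):
    """Spectral relaxation of the normalized cut for a weighted adjacency W: (N, N).
    Returns a boolean two-way partition of the nodes."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                        # graph Laplacian L = D - W
    # generalized problem L y = lambda D y  <=>  D^{-1/2} L D^{-1/2} v = lambda v
    d_isqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = d_isqrt[:, None] * L * d_isqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)           # eigenvalues in ascending order
    fiedler = d_isqrt * vecs[:, 1]            # second-smallest eigenvalue's eigenvector
    return fiedler >= 0                       # sign gives the partition

# toy adjacency: links between {N4, N5} and the rest are weak (person in between)
W = np.array([
    #  N1    N2    N3    N4    N5    N6
    [0.00, 0.92, 0.88, 0.25, 0.30, 0.90],
    [0.92, 0.00, 0.91, 0.22, 0.28, 0.89],
    [0.88, 0.91, 0.00, 0.27, 0.31, 0.93],
    [0.25, 0.22, 0.27, 0.00, 0.90, 0.24],
    [0.30, 0.28, 0.31, 0.90, 0.00, 0.29],
    [0.90, 0.89, 0.93, 0.24, 0.29, 0.00],
])
print(fiedler_partition(W))   # N4 and N5 (indices 3, 4) group apart from the rest
```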
The key insight: replace fixed edge weights with attention-computed scores that are differentiable end-to-end. Given raw CSI features, attention produces edge weights, which feed into a differentiable mincut layer:
Attention-Weighted Differentiable MinCut Pipeline
==================================================
Raw CSI Frames Differentiable MinCut
per link (i,j)
H_12 --+ W = {w_ij}
H_13 --+--> [Attention ] --> |
H_23 --+ [ Modules ] [Build Laplacian L = D - W]
H_24 --+ [Sec 2,3,4,7 ] |
H_34 --+ [Soft assignment S = softmax(X)]
... --+ |
[MinCut loss: Tr(S^T L S) / Tr(S^T D S)]
|
[Backprop through attention weights]
Instead of hard cluster assignments, use a soft assignment matrix S in R^{N x K} where K is the number of clusters:
S = softmax(MLP(X)) where X = GNN(node_features, W)
L_cut = -Tr(S^T A S) / Tr(S^T D S) (MinCut loss)
L_orth = || S^T S / ||S^T S||_F - I/sqrt(K) ||_F (Orthogonality)
L_total = L_cut + lambda * L_orth
The attention-computed edge weights W flow into A (adjacency), D (degree), and through the GNN into S. The entire pipeline is differentiable, allowing the attention mechanism to learn edge weights that produce meaningful cuts.
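A sketch that evaluates the two loss terms; in training they would be computed in an autograd framework so gradients flow back into the attention-derived adjacency.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mincut_losses(S, A):
    """MinCutPool-style objectives.
    S: (N, K) soft cluster assignments, A: (N, N) attention-weighted adjacency."""
    D = np.diag(A.sum(axis=1))
    K = S.shape[1]
    l_cut = -np.trace(S.T @ A @ S) / np.trace(S.T @ D @ S)
    StS = S.T @ S
    l_orth = np.linalg.norm(StS / np.linalg.norm(StS) - np.eye(K) / np.sqrt(K))
    return l_cut, l_orth

rng = np.random.default_rng(8)
A = rng.uniform(0.1, 1.0, size=(6, 6))
A = (A + A.T) / 2
np.fill_diagonal(A, 0)
S = softmax(rng.normal(size=(6, 2)), axis=1)      # K = 2 clusters
l_cut, l_orth = mincut_losses(S, A)
print(round(l_cut, 3), round(l_orth, 3))          # combined as l_cut + lambda * l_orth
```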
The training signal for attention comes from two sources:
Supervised: Ground-truth person location determines which links should have low weights (those crossing the person's body).
Self-supervised: The mincut objective itself provides a training signal — attention weights that produce cleaner cuts (lower Ncut value with balanced partitions) are reinforced.
L_attention = L_supervised + alpha * L_mincut + beta * L_regularization
L_supervised = BCE(w_ij, y_ij) (y_ij = 1 if link unobstructed)
L_mincut = Ncut(S*, T*) (quality of resulting partition)
L_regularization = sum_i H(alpha_i), H(alpha_i) = -sum_j alpha_ij log alpha_ij (attention entropy)
The entropy term keeps attention away from both extremes: collapse onto a single link (near-zero entropy) and uniform attention over all links (maximum entropy, no discrimination).
Not all ESP32 nodes contribute equally. A node in a corner has fewer intersecting Fresnel zones than a central node. A node with hardware degradation may produce noisy CSI. Spatial attention learns to weight nodes by their information contribution.
For each node i, compute an importance score:
s_i = sigma(w^T [x_i || g_i || q_i])
where x_i is the node's feature vector, g_i encodes its geometric context (e.g., position and Fresnel-zone coverage), and q_i summarizes its signal quality (e.g., CSI noise level), matching the corner-node and hardware-degradation examples above.
The importance score gates the node's contribution:
x_i_gated = s_i * x_i
Adapted from channel attention in CNNs, Squeeze-and-Excitation (SE) for node graphs:
1. Squeeze: z = (1/N) sum_i x_i (global node pooling)
2. Excite: s = sigma(W_2 ReLU(W_1 z)) (per-node importance)
3. Scale: x_i' = s_i * x_i (reweight nodes)
Squeeze-and-Excitation for ESP32 Node Graph
=============================================
Node features: x_1 x_2 x_3 x_4 x_5 x_6
| | | | | |
+--+--+--+--+--+--+--+--+--+--+
|
[Global Pool z]
|
[FC -> ReLU -> FC -> Sigmoid]
|
s_1 s_2 s_3 s_4 s_5 s_6
| | | | | |
* * * * * *
| | | | | |
x_1' x_2' x_3' x_4' x_5' x_6'
Example: Node 3 (occluded corner) gets s_3 = 0.2
Node 5 (central, clear LoS) gets s_5 = 0.9
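A sketch of the SE-style node attention, including the importance threshold used for the dynamic node dropout described later in this section; `tau` and the layer sizes are illustrative.

```python
import numpy as np

def se_node_attention(X, W1, W2, tau=0.3):
    """Squeeze-and-Excitation over ESP32 node features X: (N, F).
    Returns reweighted features, per-node scores, and the active-node mask."""
    z = X.mean(axis=0)                                    # squeeze: global node pooling
    s = 1 / (1 + np.exp(-(W2 @ np.maximum(W1 @ z, 0))))   # excite: per-node importance
    return s[:, None] * X, s, s > tau                     # scale + dropout mask

rng = np.random.default_rng(9)
X = rng.normal(size=(6, 16))                # 6-node mesh
W1 = rng.normal(size=(8, 16)) * 0.3
W2 = rng.normal(size=(6, 8)) * 0.3          # one output score per node
X_gated, s, active = se_node_attention(X, W1, W2)
print(np.round(s, 2), active)               # low-importance nodes are gated down/dropped
```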
From estimation theory, the Fisher Information quantifies how much a measurement contributes to parameter estimation. For node i observing target at position theta:
FI_i(theta) = E[ (d/d_theta log p(H_i | theta))^2 ]
Nodes with higher Fisher Information provide more localization accuracy.
This can be computed analytically for simple signal models or approximated
via the Cramer-Rao bound. The Geometric Diversity Index from RuVector's
geometry.rs module implements a related concept.
Spatial attention naturally enables dynamic node dropout — nodes with importance below a threshold are excluded from graph construction:
V_active = { i in V : s_i > tau }
E_active = { (i,j) in E : i in V_active AND j in V_active }
This provides robustness to node failures and reduces computation when some nodes are uninformative (e.g., all links from a node are in deep shadow).
Each CSI measurement contains S subcarriers (52 for 20 MHz, 114 for 40 MHz 802.11n). Not all subcarriers are equally informative: subcarriers whose Fresnel geometry makes them sensitive to the target show strong motion-induced variation, while others are dominated by static multipath or hardware noise.
RuVector's apply_antenna_attention concept applies attention at the
subcarrier level before any graph construction. For a CSI vector
h in C^S:
h_real = [Re(h) || Im(h)] in R^{2S}
a = softmax(W_2 ReLU(W_1 h_real + b_1) + b_2) in R^S
h_attended = a odot h in C^S
where odot is element-wise multiplication (the attention weights are real-valued but applied to complex CSI).
Antenna-Level Attention (Before Graph Construction)
====================================================
Raw CSI: h = [h_1, h_2, ..., h_S] (S complex subcarriers)
| | |
[Re/Im decompose + concat]
|
[FC -> ReLU -> FC -> Softmax]
|
Attention: a = [a_1, a_2, ..., a_S] (S real weights, sum = 1)
| | |
* * * (element-wise)
| | |
Attended: h' = [a_1*h_1, a_2*h_2, ..., a_S*h_S]
|
[Feature extraction]
|
[Graph edge weight w_ij]
Subcarrier attention map (example, 52 subcarriers):
Attention ^
weight | ** **
| * * ***** * *
| * * * * * *
| * * * * * *
|*** ****** ********* ***
+------------------------------------------------->
10 20 30 40 50
Subcarrier index
Peaks at subcarriers most affected by target motion.
Nulls at subcarriers dominated by static multipath.
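A sketch of this soft subcarrier weighting for a single CSI vector; the hidden width and initialization are arbitrary, and this is a generic rendering of the equations above rather than the apply_antenna_attention code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def subcarrier_attention(h, W1, b1, W2, b2):
    """Soft subcarrier weighting of one complex CSI vector h: (S,):
    real/imag decomposition -> 2-layer MLP -> softmax -> reweight."""
    h_real = np.concatenate([h.real, h.imag])                # (2S,)
    a = softmax(W2 @ np.maximum(W1 @ h_real + b1, 0) + b2)   # (S,), sums to 1
    return a * h, a                                          # attended complex CSI

rng = np.random.default_rng(10)
S = 52
h = rng.normal(size=S) + 1j * rng.normal(size=S)
W1 = rng.normal(size=(32, 2 * S)) * 0.1
b1 = np.zeros(32)
W2 = rng.normal(size=(S, 32)) * 0.1
b2 = np.zeros(S)
h_att, a = subcarrier_attention(h, W1, b1, W2, b2)
print(round(float(a.sum()), 3), h_att.shape)                 # 1.0 (52,)
```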
With multiple antennas (MIMO), attention operates across both antenna and subcarrier dimensions. For an A-antenna, S-subcarrier system, the CSI tensor H in C^{A x S}:
Antenna attention: a_ant in R^A (which antennas matter)
Subcarrier attention: a_sub in R^S (which frequencies matter)
Joint attention: A_joint = a_ant * a_sub^T in R^{A x S}
Attended CSI: H' = A_joint odot H in C^{A x S}
This factored attention (rank-1) is parameter-efficient. A full attention matrix A in R^{AS x AS} is more expressive but requires A*S times more computation.
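A sketch of the rank-1 factored form; the antenna and subcarrier weights here are fixed vectors purely for illustration (in practice both would come from small attention heads).

```python
import numpy as np

def joint_antenna_subcarrier_attention(H, a_ant, a_sub):
    """Rank-1 factored attention over a CSI tensor H: (A, S) complex.
    A_joint = a_ant a_sub^T; H' = A_joint elementwise-multiplied with H."""
    return np.outer(a_ant, a_sub) * H

rng = np.random.default_rng(11)
A, S = 2, 52
H = rng.normal(size=(A, S)) + 1j * rng.normal(size=(A, S))
a_ant = np.array([0.7, 0.3])        # which antennas matter
a_sub = np.full(S, 1.0 / S)         # which frequencies matter
print(joint_antenna_subcarrier_attention(H, a_ant, a_sub).shape)   # (2, 52)
```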
Combining subcarrier attention with temporal attention creates a 2D attention map over the time-frequency representation of CSI:
Time-Frequency Attention Map
=============================
Subcarrier ^
(freq) | . . . . . . . . . . . .
52 | . . . . . . . . . . . .
| . . . . # # . . . . . .
40 | . . . # # # # . . . . .
| . . . # # # # . . . . .
30 | . . # # # # # # . . . .
| . . . # # # # . . . . .
20 | . . . . # # . . . . . .
| . . . . . . . . . . . .
10 | . . . . . . . . . . . .
| . . . . . . . . . . . .
1 | . . . . . . . . . . . .
+---+---+---+---+---+---+---+---+---+--->
20 40 60 80 100 120 140 160 180
Time step
'#' = high attention (motion event at t=60-120, f=20-45)
'.' = low attention (static or noise)
This is essentially a learned spectrogram filter that isolates the time-frequency regions containing target motion signatures.
RuVector's subcarrier_selection.rs uses mincut-based selection to reduce
114 subcarriers to 56 for efficiency. Antenna-level attention provides a
soft version of this: instead of hard selection, it continuously weights
subcarriers. The hard selection can be derived from attention weights:
selected_subcarriers = top_k(a, k=56)
Alternatively, Gumbel-Softmax gives a differentiable discrete selection during training.
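A sketch of both derivations from a soft attention vector; these are generic implementations of top-k and Gumbel-Softmax, not the mincut-based selection in subcarrier_selection.rs.

```python
import numpy as np

def topk_from_attention(a, k=56):
    """Hard subcarrier selection derived from soft attention weights a: (S,)."""
    return np.sort(np.argsort(a)[-k:])             # indices of the k strongest

def gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0)):
    """Differentiable relaxation of discrete selection used during training."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel noise
    y = np.exp((logits + g) / tau)
    return y / y.sum()

a = np.random.default_rng(12).dirichlet(np.ones(114))   # toy attention over 114 subcarriers
print(topk_from_attention(a, k=56)[:8])                  # first few selected indices
print(int(gumbel_softmax(np.log(a + 1e-9)).argmax()))    # one stochastic "hard" pick
```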
Standard self-attention has O(T^2) time and memory complexity. For CSI sequences with T=512 at 100 Hz (5.12 seconds), the attention matrix has 262,144 entries per head. On ESP32 with 520 KB SRAM, this is prohibitive.
Linear attention replaces the softmax with kernel decomposition:
Standard: Attn(Q,K,V) = softmax(QK^T/sqrt(d)) V O(T^2 d)
Linear: Attn(Q,K,V) = phi(Q) (phi(K)^T V) O(T d^2)
where phi is a feature map (e.g., elu(x) + 1, or random Fourier features). The key insight is associativity: computing (K^T V) first yields a d x d matrix, then multiplying by Q is O(T d^2), which is linear in T when d << T.
For CSI with d_k = 64 and T = 512, this reduces computation by 8x.
Standard vs Linear Attention
=============================
Standard (O(T^2 d)): Linear (O(T d^2)):
Q [T x d] phi(Q) [T x d']
\ \
* K^T [d x T] * (phi(K)^T V) [d' x d]
\ \
[T x T] (large!) [T x d] (small!)
\ |
* V [T x d] | (done)
\ |
[T x d] [T x d]
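A sketch of linear attention with the phi(x) = elu(x) + 1 feature map; note that the T x T matrix is never formed.

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))    # phi(x) = elu(x) + 1 > 0

def linear_attention(Q, K, V):
    """O(T d^2) attention: phi(Q) (phi(K)^T V), normalized per query."""
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    kv = Kp.T @ V                                  # (d, d) computed once
    z = Kp.sum(axis=0)                             # (d,) normalizer terms
    return (Qp @ kv) / (Qp @ z)[:, None]

rng = np.random.default_rng(13)
T, d = 512, 64
Q, K, V = (rng.normal(size=(T, d)) * 0.1 for _ in range(3))
print(linear_attention(Q, K, V).shape)             # (512, 64), no 512 x 512 matrix built
```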
Instead of full T x T attention, use structured sparsity:
Local Window Attention: Each position attends to a window of w neighbors:
A_st = { QK^T/sqrt(d) if |s - t| <= w/2
{ -inf otherwise
Complexity: O(T * w) with w << T. For CSI at 100 Hz, w = 32 covers 320 ms — sufficient for most motion events.
Dilated Attention: Attend to positions at exponentially increasing gaps:
Attend to: t-1, t-2, t-4, t-8, t-16, t-32, ...
This provides O(T log T) complexity while maintaining long-range context.
Strided Attention: Combine local and strided patterns (as in Longformer):
Attention Pattern (T=16, window=3, stride=4):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 [ x x . x . . . . x . . . x . . . ]
2 [ x x x . x . . . . x . . . x . . ]
3 [ . x x x . x . . . . x . . . x . ]
4 [ x . x x x . x . . . . x . . . x ]
...
x = attends, . = masked
Local window (3) + every 4th position for global context
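A sketch of the local-plus-strided mask and the dilated offset list; the strided variant here keeps positions with the same index modulo the stride, which is one common layout.

```python
import numpy as np

def sparse_mask(T, window=3, stride=4):
    """Boolean (T, T) attention pattern: local window plus strided positions."""
    t = np.arange(T)
    local = np.abs(t[:, None] - t[None, :]) <= window // 2
    strided = (t[None, :] % stride) == (t[:, None] % stride)
    return local | strided

def dilated_offsets(max_offset=32):
    """Dilated pattern: attend to t-1, t-2, t-4, ... back to t-max_offset."""
    return [2 ** k for k in range(int(np.log2(max_offset)) + 1)]

mask = sparse_mask(16)
print(mask.sum(axis=1))    # attended positions per query (much smaller than T for large T)
print(dilated_offsets())   # [1, 2, 4, 8, 16, 32]
```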
LSH attention (from Reformer) groups similar queries and keys into buckets, computing attention only within buckets:
1. Hash Q and K into b buckets using LSH
2. Sort by bucket assignment
3. Compute attention within each bucket
Complexity: O((T/b)^2) per bucket, O(T^2 / b) total over b buckets
With b = sqrt(T): O(T * sqrt(T))
For RF sensing, LSH naturally groups similar CSI patterns — time steps with similar signal characteristics attend to each other, which is physically meaningful (similar body poses produce similar CSI).
For edge deployment on ESP32:
INT8 Quantized Attention:
Q_int8 = clamp(round(Q / scale_Q), -128, 127)
K_int8 = clamp(round(K / scale_K), -128, 127)
Scores_int32 = Q_int8 * K_int8^T (INT8 matmul, INT32 accumulation)
A = softmax(dequantize(Scores_int32)) (back to FP32 for softmax)
Memory: Q,K in INT8 uses 1/4 the SRAM of FP32
Compute: INT8 matmul is 2-4x faster on ESP32-S3
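A sketch of the quantized score computation; accumulation is done in INT32 here (a wide accumulator is needed once d_k exceeds a few elements), and the symmetric per-tensor scales are an illustrative choice.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization to INT8."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8), scale

def int8_attention_scores(Q, K):
    """INT8 Q/K matmul with an INT32 accumulator, dequantized before the softmax."""
    Q8, sq = quantize_int8(Q)
    K8, sk = quantize_int8(K)
    scores = Q8.astype(np.int32) @ K8.astype(np.int32).T    # integer accumulation
    return scores.astype(np.float32) * (sq * sk)             # dequantize

rng = np.random.default_rng(14)
Q, K = rng.normal(size=(128, 64)), rng.normal(size=(128, 64))
approx, exact = int8_attention_scores(Q, K), Q @ K.T
print(float(np.abs(approx - exact).max()))    # small quantization error on the logits
```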
For the most constrained scenarios, attention-free architectures that approximate attention behavior:
Gated Linear Units (GLU):
y = (X W_1 + b_1) odot sigma(X W_2 + b_2)
State Space Models (S4/Mamba):
x_t = A x_{t-1} + B u_t
y_t = C x_t + D u_t
With structured A matrix: O(T log T) via FFT
S4 models are particularly promising for CSI sequences because they run as a constant-memory recurrence at inference, capture long-range temporal context without materializing a T x T attention matrix, and train efficiently on long windows via the FFT-based convolution view (a recurrence sketch follows).
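A sketch of the discrete state-space recurrence with a diagonal transition matrix; the parameter values and the scalar input stream are invented for illustration.

```python
import numpy as np

def ssm_scan(u, A, B, C, D):
    """Discrete state-space recurrence x_t = A x_{t-1} + B u_t, y_t = C x_t + D u_t.
    With a diagonal (structured) A this runs in O(T d) time and O(d) memory,
    which is the property that makes S4-style models attractive on ESP32."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:                      # streaming over CSI-derived scalars
        x = A @ x + B * u_t
        ys.append(C @ x + D * u_t)
    return np.array(ys)

T, d = 512, 16
A = np.diag(np.exp(-np.linspace(0.01, 1.0, d)))    # stable diagonal transition
B = np.ones(d)
C = np.ones(d) / d
D = 0.1
u = np.sin(2 * np.pi * 0.3 * np.arange(T) / 100)   # 0.3 Hz "breathing" at 100 Hz sampling
print(ssm_scan(u, A, B, C, D).shape)               # (512,)
```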
+--------------------+--------+---------+--------+----------+
| Method | Memory | Compute | Range | Platform |
+--------------------+--------+---------+--------+----------+
| Full Attention | O(T^2) | O(T^2d) | Global | Server |
| Linear Attention | O(Td) | O(Td^2) | Global | Edge GPU |
| Window Attention | O(Tw) | O(Twd) | Local | RPi/Jetson|
| Dilated Attention | O(TlgT)| O(TlgTd)| Global | RPi |
| LSH Attention | O(TsqT)| O(TsqTd)| Global | Edge GPU |
| INT8 Quantized | O(T^2) | O(T^2d) | Global | ESP32-S3 |
| GLU (no attention) | O(Td) | O(Td) | Local | ESP32 |
| S4/Mamba | O(d^2) | O(Td) | Global | ESP32 |
+--------------------+--------+---------+--------+----------+
T = sequence length, d = model dimension, w = window size
Combining all attention mechanisms into a unified RF sensing pipeline:
Unified Attention Architecture for RF Topological Sensing
==========================================================
LAYER 0: RAW CSI ACQUISITION
+-----------------------------------------------------------+
| ESP32 Node i <---> ESP32 Node j |
| H_ij in C^{A x S x T} (antennas x subcarriers x time) |
+-----------------------------------------------------------+
|
v
LAYER 1: ANTENNA-LEVEL ATTENTION (Section 7)
+-----------------------------------------------------------+
| Per-link subcarrier weighting |
| a_sub = SoftAttn(H_ij) in R^S |
| H_ij' = a_sub odot H_ij |
| Reduces noise, emphasizes motion-sensitive subcarriers |
+-----------------------------------------------------------+
|
v
LAYER 2: TEMPORAL SELF-ATTENTION (Section 3)
+-----------------------------------------------------------+
| Per-link temporal context |
| Z_ij = SelfAttn(H_ij'[t=1..T]) |
| Captures breathing, gait, gesture patterns |
| Uses efficient attention (Section 8) for long sequences |
+-----------------------------------------------------------+
|
v
LAYER 3: CROSS-LINK ATTENTION (Section 4)
+-----------------------------------------------------------+
| Inter-link dependency modeling |
| C_ij = CrossAttn(Z_ij, {Z_kl : all links}) |
| With geometric bias G_bias from node positions |
| Captures multi-link correlations from shared targets |
+-----------------------------------------------------------+
|
v
LAYER 4: EDGE WEIGHT COMPUTATION
+-----------------------------------------------------------+
| w_ij = MLP(TemporalPool(C_ij)) |
| Temporal pooling with attention (Section 3.5) |
| Produces scalar edge weight per link |
+-----------------------------------------------------------+
|
v
LAYER 5: GRAPH ATTENTION NETWORK (Section 2)
+-----------------------------------------------------------+
| Multi-head GAT with edge features |
| x_i' = GAT(x_i, {x_j, w_ij, e_ij}) |
| Refines node representations using graph structure |
+-----------------------------------------------------------+
|
v
LAYER 6: SPATIAL NODE ATTENTION (Section 6)
+-----------------------------------------------------------+
| Node importance weighting |
| s_i = SE_Block(x_i') |
| Suppresses noisy or uninformative nodes |
+-----------------------------------------------------------+
|
v
LAYER 7: DIFFERENTIABLE MINCUT (Section 5)
+-----------------------------------------------------------+
| Soft cluster assignment with attention-weighted edges |
| S = softmax(MLP(x')) |
| L = L_cut + L_orth + L_supervised |
| Partitions graph at human body boundaries |
+-----------------------------------------------------------+
|
v
OUTPUT: Person detection, localization, pose estimation
Stage 1: Pretrain antenna attention (Section 7) on single-link CSI with signal quality labels. This bootstraps meaningful subcarrier weighting before full pipeline training.
Stage 2: Train temporal + cross-link attention (Sections 3-4) with link-level activity labels. The model learns to identify active links.
Stage 3: End-to-end fine-tuning with mincut loss (Section 5) and person location supervision. All attention mechanisms adapt jointly.
Stage 4: Distillation for edge deployment — train efficient variants (Section 8) to match the full model's attention patterns using KL divergence between attention distributions.
For a 6-node mesh (15 links, 52 subcarriers, T=128 time steps):
Component | FLOPs/frame | Parameters | Memory
-----------------------+---------------+------------+---------
Antenna attention (x15)| 15 * 5K | 5K | 15 KB
Temporal self-attn | 15 * 1M | 50K | 200 KB
Cross-link attention | 15^2 * 100K | 100K | 500 KB
GAT (2 layers) | 6 * 50K | 30K | 50 KB
Spatial attention | 6 * 1K | 2K | 5 KB
MinCut MLP | 6 * 10K | 10K | 10 KB
-----------------------+---------------+------------+---------
Total | ~40M | ~200K | ~800 KB
This fits within a Raspberry Pi 4 (1 GB RAM, 4-core ARM Cortex-A72) for real-time inference at 10 Hz. For ESP32 deployment, the efficient variants from Section 8 reduce this by 10-50x.
The unified architecture maps directly to existing RuView modules:
| Architecture Layer | RuView Module | File |
|---|---|---|
| Antenna Attention | ruvector-attn-mincut | model.rs (apply_antenna_attention) |
| Temporal Self-Attention | ruvsense | gesture.rs, intention.rs |
| Cross-Link Attention | ruvector viewpoint | attention.rs (CrossViewpointAttention) |
| Geometric Bias | ruvector viewpoint | geometry.rs (GeometricDiversityIndex) |
| Edge Weight Computation | ruvsense | coherence.rs, coherence_gate.rs |
| Graph Attention | ruvector-mincut | metrics.rs (DynamicPersonMatcher) |
| Spatial Node Attention | ruvsense | multistatic.rs (attention-weighted fusion) |
| Differentiable MinCut | ruvector-mincut | core mincut algorithm |
Vaswani et al., "Attention Is All You Need," NeurIPS 2017.
Velickovic et al., "Graph Attention Networks," ICLR 2018.
Brody et al., "How Attentive are Graph Attention Networks?" ICLR 2022.
Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention," ICML 2020.
Kitaev et al., "Reformer: The Efficient Transformer," ICLR 2020.
Beltagy et al., "Longformer: The Long-Document Transformer," 2020.
Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces (S4)," ICLR 2022.
Gu and Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," 2023.
Wang et al., "Wi-Pose: WiFi-based Multi-Person Pose Estimation," 2021.
Yang et al., "MM-Fi: Multi-Modal Non-Intrusive 4D Human Dataset," 2024.
Wang et al., "Person-in-WiFi: Fine-Grained Person Perception Using WiFi," ICCV 2019.
Bianchi et al., "Spectral Clustering with Graph Neural Networks for Graph Pooling," ICML 2020.
Stoer and Wagner, "A Simple Min-Cut Algorithm," JACM 1997.
Adib and Katabi, "See Through Walls with WiFi!" SIGCOMM 2013.
Wang et al., "Placement Matters: Understanding the Effects of Device Placement for WiFi Sensing," 2022.
End of document. This research reference supports the attention mechanism design choices in the RuView/WiFi-DensePose RF topological sensing system.