docs/research/sota-surveys/sota-wifi-sensing-2025.md
Date: 2026-04-02
Focus: New architectures, lightweight models, edge deployment, ESP32+Pi Zero inference
Complements: wifi-sensing-ruvector-sota-2026.md (February 2026 survey)
Paper: WiFlow: A Lightweight WiFi-based Continuous Human Pose Estimation Network with Spatio-Temporal Feature Decoupling (arXiv:2602.08661)
WiFlow is the most directly relevant architecture for our ESP32 + Pi Zero deployment target.
Three-stage encoder-decoder with spatio-temporal decoupling:
Stage 1: Temporal Encoder (TCN)
Stage 2: Spatial Encoder (Asymmetric Convolution)
Stage 3: Axial Self-Attention
Decoder:
| Metric | WiFlow | WPformer | WiSPPN |
|---|---|---|---|
| Parameters | 4.82M | 10.04M | 121.5M |
| FLOPs | 0.47B | 35.00B | 338.45B |
| PCK@20 (random split) | 97.00% | 70.02% | 85.87% |
| MPJPE (random split) | 0.008m | 0.028m | 0.016m |
| PCK@20 (cross-subject) | 86.89% | -- | -- |
| Training time (5-fold) | 18.17h | 137.5h | -- |
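For reference, the two accuracy metrics in the table can be computed as in the minimal sketch below. It assumes keypoints are given as coordinate tuples and that PCK is normalized by a caller-supplied reference length; `ref_length` is a hypothetical parameter, since the normalization (torso length, bounding-box size, and so on) differs between papers and is not specified here.

```python
import math

def mpjpe(pred, truth):
    """Mean per-joint position error: average Euclidean distance per keypoint."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists)

def pck(pred, truth, ref_length, alpha=0.2):
    """Fraction of keypoints whose error is at most alpha * ref_length.

    alpha=0.2 corresponds to PCK@20; ref_length is an assumed normalizer.
    """
    threshold = alpha * ref_length
    hits = sum(1 for p, t in zip(pred, truth) if math.dist(p, t) <= threshold)
    return hits / len(pred)
```

Lower MPJPE and higher PCK are better, so the two metrics should be read together: a model can score well on PCK while still having a few large outlier errors that dominate MPJPE.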
Critical observations for our project:
Loss function:
L = L_H + lambda * L_B
L_H = SmoothL1(predicted_keypoints, ground_truth, beta=0.1)
L_B = sum of bone length constraint violations across 14 bone connections
lambda = 0.2
The bone constraint loss is particularly important for edge deployment where noisy predictions need physical plausibility enforcement.
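The loss above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not WiFlow's implementation: the Smooth L1 term follows the common beta-parameterized definition with mean reduction, and the bone term is assumed to be the absolute difference between predicted and ground-truth bone lengths, summed over the bone list. The exact form of the "constraint violation" is not given here, so `bone_loss` is a hypothetical stand-in.

```python
import math

def smooth_l1(x, beta=0.1):
    """Quadratic below beta, linear above it."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def keypoint_loss(pred, truth, beta=0.1):
    """L_H: mean Smooth L1 over every coordinate of every keypoint."""
    terms = [smooth_l1(p - t, beta)
             for kp, kt in zip(pred, truth)
             for p, t in zip(kp, kt)]
    return sum(terms) / len(terms)

def bone_loss(pred, truth, bones):
    """L_B (assumed form): summed absolute bone-length error.

    bones is a list of (i, j) keypoint index pairs.
    """
    return sum(abs(math.dist(pred[i], pred[j]) - math.dist(truth[i], truth[j]))
               for i, j in bones)

def total_loss(pred, truth, bones, lam=0.2, beta=0.1):
    """L = L_H + lambda * L_B."""
    return keypoint_loss(pred, truth, beta) + lam * bone_loss(pred, truth, bones)
```

With a perfect prediction both terms vanish. A prediction that keeps every joint close to its target but stretches one limb is penalized twice, once per coordinate and once through the bone term, which is what pushes the model toward physically plausible skeletons.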
WiFlow's architecture maps well to our hardware:
Paper: MultiFormer: A Multi-Person Pose Estimation System Based on CSI and Attention Mechanism (arXiv:2505.22555)
Teacher-student framework with OpenPose teacher providing ground truth labels.
Time-Frequency Dual-Dimensional Tokenization (TFDDT):
Dual Transformer Encoder:
Multi-Stage Pose Estimation:
| Variant | Encoder Layers | Input | Parameters |
|---|---|---|---|
| MultiFormer | 8 | 64x1296 | 11.93M |
| MultiFormer-24 | 8 | 64x576 | 4.05M |
| MultiFormer-18 | 6 | 64x324 | 2.80M |
Key result on MM-Fi dataset: MultiFormer achieves PCK@20 of 0.7225, outperforming CSI2Pose (0.6841). The compact MultiFormer-18 at 2.80M parameters is edge-deployable.
MultiFormer's dual-token approach is valuable because:
Paper: Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi (CVPR 2024)
First multi-person 3D WiFi pose estimation.
Key results:
Relevance: Establishes the accuracy ceiling for WiFi 3D pose. Our ESP32+Pi system should target comparable single-person performance (sub-100mm MPJPE) as a milestone.
Paper: arXiv:2410.16303
Novel approach: generates 3D point clouds from WiFi CSI data using transformer networks.
Key innovation: Positional encoding with learned embeddings for antennas and subcarriers, followed by multi-head attention over antenna-subcarrier pairs. This captures both spatial (antenna geometry) and spectral (subcarrier frequency response) dependencies.
Relevance: Point cloud output is a richer representation than keypoints alone, enabling:
Paper: Graph-based 3D Human Pose Estimation using WiFi Signals (arXiv:2511.19105)
Uses graph neural networks where nodes represent keypoints and edges represent skeletal connections. CSI features are injected as node/edge attributes.
Relevance: The graph structure maps naturally onto our RuvSense pose_tracker, which already maintains a 17-keypoint skeleton with Kalman filtering. Adding graph-based message passing between keypoints could improve joint prediction coherence.
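To make the message-passing idea concrete, here is a minimal sketch of one round of mean-aggregation message passing over a skeleton graph. It is not the paper's method: a real GNN layer would use learned weights, whereas this version blends each node's feature vector with the mean of its neighbors using a fixed, hypothetical weight `alpha`. The function name and the edge-list representation are assumptions for illustration.

```python
def message_pass(node_features, edges, alpha=0.5):
    """One round of message passing with mean aggregation.

    node_features: list of equal-length tuples, one per keypoint.
    edges: list of undirected (i, j) skeletal connections.
    alpha: fixed blend weight standing in for a learned update.
    """
    neighbors = {i: [] for i in range(len(node_features))}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    updated = []
    for i, feat in enumerate(node_features):
        if not neighbors[i]:
            # Isolated nodes keep their features unchanged.
            updated.append(tuple(feat))
            continue
        mean = [sum(node_features[j][d] for j in neighbors[i]) / len(neighbors[i])
                for d in range(len(feat))]
        updated.append(tuple((1 - alpha) * feat[d] + alpha * mean[d]
                             for d in range(len(feat))))
    return updated
```

Each round lets information travel one edge further along the skeleton, so a wrist feature reaches the shoulder after two rounds on an arm chain.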
Repository: github.com/winwinashwin/CSI-Sense-Zero
The most directly relevant prior art for our hardware target.
Architecture:
/tmp/csififo (~256 CSI records)
Data flow:
ESP32 TX -> WiFi signal -> ESP32 RX -> Serial (921.6 kbaud) -> Pi Zero FIFO -> Model -> WebSocket
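The serial link in this data flow sets a hard ceiling on the CSI record rate, which a short calculation makes explicit. The sketch assumes 8N1 framing (10 bits on the wire per payload byte) and a hypothetical record size; neither is stated for CSI-Sense-Zero, so treat the numbers as illustrative.

```python
def serial_bytes_per_sec(baud, bits_per_byte=10):
    """Payload throughput of a UART link; 10 bits/byte assumes 8N1 framing."""
    return baud / bits_per_byte

def max_records_per_sec(baud, record_bytes, bits_per_byte=10):
    """Upper bound on records per second for a given (assumed) record size."""
    return serial_bytes_per_sec(baud, bits_per_byte) / record_bytes

# 921.6 kbaud as in the data flow above; 256 bytes per record is hypothetical.
BAUD = 921_600
print(serial_bytes_per_sec(BAUD))       # payload bytes per second
print(max_records_per_sec(BAUD, 256))   # records per second at 256 B/record
```

Under these assumptions the link carries about 92 KB/s, so record size directly trades off against the achievable frame rate.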
Limitations:
What we improve:
Repository: github.com/vitoplantamura/OnnxStream
Runs Stable Diffusion XL on Pi Zero 2 W in 298 MB RAM. Key features:
Benchmark estimates for our model sizes:
| Model | Parameters | INT8 Size | Est. Pi Zero 2 Latency |
|---|---|---|---|
| MultiFormer-18 | 2.80M | ~2.8 MB | ~30-50ms |
| WiFlow | 4.82M | ~4.8 MB | ~50-80ms |
| MultiFormer | 11.93M | ~11.9 MB | ~120-200ms |
| DensePose-WiFi | ~25M (est.) | ~25 MB | ~300-500ms |
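These estimates assume XNNPACK-accelerated INT8 inference on a Cortex-A53 at 1 GHz. At the estimated latencies, WiFlow would run at roughly 12-20 Hz and MultiFormer-18 at roughly 20-33 Hz, so MultiFormer-18 meets our 20 Hz TDMA cycle target across its estimated range, while WiFlow reaches it only at the optimistic end.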
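The size and rate figures follow from simple arithmetic, sketched below: one byte per weight for INT8 storage, and rate as the reciprocal of latency. The latency bounds are the estimates from the table, not measurements, and the helper names are hypothetical.

```python
def int8_size_mb(params):
    """INT8 stores one byte per weight; ignores graph and metadata overhead."""
    return params / 1e6

def rate_hz(latency_ms):
    """Inference rate implied by a per-frame latency."""
    return 1000.0 / latency_ms

def meets_target(latency_ms, target_hz=20.0):
    """True if the latency sustains the target cycle rate."""
    return rate_hz(latency_ms) >= target_hz

# (parameters, best-case latency ms, worst-case latency ms) from the table above.
MODELS = {
    "MultiFormer-18": (2.80e6, 30, 50),
    "WiFlow": (4.82e6, 50, 80),
    "MultiFormer": (11.93e6, 120, 200),
}

for name, (params, fast_ms, slow_ms) in MODELS.items():
    print(f"{name}: {int8_size_mb(params):.2f} MB, "
          f"{rate_hz(slow_ms):.1f}-{rate_hz(fast_ms):.1f} Hz, "
          f"20 Hz at worst case: {meets_target(slow_ms)}")
```

Checking the worst-case latency against the target is the conservative choice for a fixed TDMA cycle, since a single slow frame would otherwise spill into the next slot.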
ONNX Runtime officially supports Raspberry Pi deployment with:
For Rust integration, the ort crate (ONNX Runtime Rust bindings) supports cross-compilation to aarch64-linux-gnu.
Paper: EfficientFi: Towards Large-Scale Lightweight WiFi Sensing via CSI Compression (arXiv:2204.04138)
Proposes compressing CSI data on the sensing device before transmission to the inference node. Key idea: train a CSI autoencoder where the encoder runs on the constrained device and the decoder runs on the more powerful inference node.
Relevance: For our ESP32 -> Pi Zero pipeline, CSI compression on ESP32 reduces:
| Criterion | WiFlow | MultiFormer-18 | DensePose-WiFi | Graph-3D |
|---|---|---|---|---|
| Parameters | 4.82M | 2.80M | ~25M | ~8M (est.) |
| FLOPs | 0.47B | ~0.3B (est.) | ~5B (est.) | ~1B (est.) |
| Multi-person | No | Yes (PAF+Hungarian) | Yes (RCNN-based) | No |
| 3D output | No (2D) | No (2D) | No (UV map) | Yes (3D) |
| Amplitude-only | Yes | Yes | No (amp+phase) | Unknown |
| Edge-viable | Yes | Yes | No | Marginal |
| Open source | Not yet | Not yet | Limited | Not yet |
For the ESP32 + Pi Zero deployment, we recommend a hybrid architecture:
This hybrid achieves:
Based on the surveyed papers:
| Dataset | Subjects | Frames | Hardware | Availability |
|---|---|---|---|---|
| CMU DensePose-WiFi | 8 | ~250K | Intel 5300 | Limited |
| Person-in-WiFi 3D | 7 | 97K | Custom WiFi | GitHub |
| MM-Fi | Multiple | Large | WiFi + mmWave | Public |
| Wi-Pose | Multiple | Large | Intel 5300 | Public |
Our approach:
| Capability | Status | Module |
|---|---|---|
| ESP32 CSI capture | Production | wifi-densepose-hardware |
| Multi-node fusion | Production | ruvsense/multistatic.rs |
| Phase alignment | Production | ruvsense/phase_align.rs |
| Coherence gating | Production | ruvsense/coherence_gate.rs |
| 17-keypoint tracking | Production | ruvsense/pose_tracker.rs |
| ONNX inference engine | Production | wifi-densepose-nn |
| Modality translator | Production | wifi-densepose-nn/translator.rs |
| Training pipeline | Production | wifi-densepose-train |
| Subcarrier interpolation | Production | wifi-densepose-train/subcarrier.rs |
| Gap | Required For | Priority |
|---|---|---|
| Pi Zero deployment target | Edge inference node | Critical |
| Lightweight model architecture | Sub-100ms inference on Cortex-A53 | Critical |
| Temporal causal convolution | Real-time streaming inference | High |
| Axial attention module | Efficient spatial encoding | High |
| Bone constraint loss | Physical plausibility | High |
| CSI compression on ESP32 | Bandwidth reduction | Medium |
| INT8 quantization pipeline | Model size reduction | Medium |
| Cross-environment adaptation | Deployment generalization | Medium |
| Multi-person PAF decoding | Multiple subject support | Low (Phase 2) |
| 3D pose lifting | Z-axis estimation | Low (Phase 3) |
| Diffusion-based pose refinement | Uncertainty quantification | Research |
1. No lightweight inference path. The current wifi-densepose-nn crate assumes GPU or high-end CPU inference. We need an EdgeInferenceEngine optimized for:
2. No ESP32 -> Pi Zero communication protocol. The wifi-densepose-hardware crate handles ESP32 CSI capture and UDP aggregation to a server, but has no lightweight protocol for ESP32 -> Pi Zero direct communication. We need:
3. No temporal convolution module. The existing signal processing pipeline uses frame-by-frame processing. WiFlow and MultiFormer both show that temporal context (20 frames for WiFlow, 64 frames for MultiFormer) significantly improves accuracy. We need a ring buffer + TCN module in the inference path.
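A minimal sketch of the two pieces named above, a fixed-size ring buffer and a causal (optionally dilated) 1D convolution, might look like this. It is a simplification under stated assumptions: each frame is reduced to a single scalar, the weights are hand-picked rather than learned, and `FrameWindow` and `causal_conv1d` are hypothetical names, not existing modules in our crates.

```python
from collections import deque

class FrameWindow:
    """Ring buffer holding the most recent `size` frames (20 for WiFlow)."""

    def __init__(self, size=20):
        self.frames = deque(maxlen=size)  # oldest frame is dropped automatically

    def push(self, frame):
        self.frames.append(frame)

    def ready(self):
        """True once the window is full and inference can start."""
        return len(self.frames) == self.frames.maxlen

    def snapshot(self):
        """Frames in arrival order, oldest first."""
        return list(self.frames)

def causal_conv1d(xs, weights, dilation=1):
    """Causal convolution: y[t] = sum_k weights[k] * xs[t - k * dilation].

    Samples before the start of the sequence are treated as zero, so y[t]
    never depends on any input later than t.
    """
    out = []
    for t in range(len(xs)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * xs[idx]
        out.append(acc)
    return out
```

Causality is what makes this suitable for streaming: the output for the newest frame can be computed as soon as that frame arrives, without waiting for future CSI samples.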
4. No bone/skeleton constraint enforcement during training. pose_tracker.rs applies Kalman filtering and skeleton constraints, but only as post-hoc corrections at inference time. WiFlow shows that baking bone constraints into the training loss produces better models that need less post-processing.