
SOTA WiFi Sensing for Edge Pose Estimation (2024-2026 Update)


Date: 2026-04-02
Focus: New architectures, lightweight models, edge deployment, ESP32+Pi Zero inference
Complements: wifi-sensing-ruvector-sota-2026.md (February 2026 survey)


1. New Architectures Since Last Survey

1.1 WiFlow: Lightweight Continuous Pose Estimation (February 2026)

Paper: WiFlow: A Lightweight WiFi-based Continuous Human Pose Estimation Network with Spatio-Temporal Feature Decoupling (arXiv:2602.08661)

WiFlow is the most directly relevant architecture for our ESP32 + Pi Zero deployment target.

Architecture

Three-stage encoder-decoder with spatio-temporal decoupling:

Stage 1: Temporal Encoder (TCN)

  • Dilated causal convolution with exponentially growing dilation factors (1, 2, 4, 8)
  • Input: 540x20 tensor (18 antenna links x 30 subcarriers = 540 features, 20 time steps)
  • Progressive channel compression: 540 -> 440 -> 340 -> 240
  • Preserves temporal causality while achieving full receptive field coverage
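The receptive-field claim for Stage 1 can be checked with a small sketch. The kernel size is not stated in this survey, so `KERNEL_SIZE = 3` is an assumption, as is the simplification of one convolution per dilation factor; `causal_dilated_conv1d` is an illustrative single-channel helper, not WiFlow's implementation.

```python
from typing import List, Sequence

# Assumption: the survey does not give WiFlow's kernel size; 3 is illustrative.
KERNEL_SIZE = 3
DILATIONS = (1, 2, 4, 8)  # dilation factors listed above


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Input time steps that influence one output of the stack,
    assuming one causal convolution per dilation factor."""
    return 1 + (kernel_size - 1) * sum(dilations)


def causal_dilated_conv1d(x: Sequence[float], weights: Sequence[float],
                          dilation: int) -> List[float]:
    """Single-channel causal convolution.

    y[t] depends only on x[t], x[t - d], x[t - 2d], ... Samples before
    t = 0 are treated as zero, so the output keeps the input length.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


if __name__ == "__main__":
    print(receptive_field(KERNEL_SIZE, DILATIONS))  # 31
    print(receptive_field(2, DILATIONS))            # 16
    print(causal_dilated_conv1d([1, 0, 0, 0, 0], [1.0, 1.0], dilation=2))
    # [1.0, 0.0, 1.0, 0.0, 0.0]
```

Under these assumptions a kernel size of 3 gives a receptive field of 31 steps, which covers the 20-step window, while a kernel size of 2 would stop at 16.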

Stage 2: Spatial Encoder (Asymmetric Convolution)

  • 1xk kernels operating only in the subcarrier dimension
  • 4 residual blocks: 8 -> 16 -> 32 -> 64 channels
  • Subcarrier compression: 240 -> 120 -> 60 -> 30 -> 15
  • Stride (1,2) downsampling -- no pooling layers
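The subcarrier compression chain follows from the standard strided-convolution output-length formula. A minimal sketch, assuming a kernel width of 3 and padding of 1 (neither is given in this survey; only the stride of 2 along the subcarrier axis is):

```python
from typing import List


def conv_out_len(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a 1-D convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


def subcarrier_widths(n: int, blocks: int, kernel: int = 3,
                      stride: int = 2, padding: int = 1) -> List[int]:
    """Width of the subcarrier axis after each stride-2 residual block.

    kernel=3 and padding=1 are assumptions chosen for illustration.
    """
    widths = [n]
    for _ in range(blocks):
        n = conv_out_len(n, kernel, stride, padding)
        widths.append(n)
    return widths


if __name__ == "__main__":
    print(subcarrier_widths(240, 4))  # [240, 120, 60, 30, 15]
```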

Stage 3: Axial Self-Attention

  • Two-stage axial attention reduces complexity from O(H^2 W^2) to O(H^2 W + HW^2)
  • Stage one: width direction (temporal axis), 8 groups
  • Stage two: height direction (keypoint axis)
  • Input reshaped to (B x K) x C x T for first stage
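To make the complexity reduction concrete, the sketch below counts pairwise attention scores for full 2-D attention versus two axial passes. The grid size H = 15, W = 20 is only an illustrative choice, not a figure from the paper.

```python
def full_attention_cost(h: int, w: int) -> int:
    """Pairwise scores when every position attends to every position."""
    return (h * w) ** 2


def axial_attention_cost(h: int, w: int) -> int:
    """Pairwise scores for one height-axis pass plus one width-axis pass.

    Height pass: each of the w columns attends over h positions (h*h*w).
    Width pass: each of the h rows attends over w positions (h*w*w).
    """
    return h * h * w + h * w * w


if __name__ == "__main__":
    h, w = 15, 20  # illustrative grid size (assumption)
    print(full_attention_cost(h, w))   # 90000
    print(axial_attention_cost(h, w))  # 10500
```

For this grid the axial variant needs roughly 8.6x fewer score computations, and the gap widens as either axis grows.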

Decoder:

  • Adaptive average pooling instead of fully connected layers
  • Direct coordinate regression to 2D keypoint positions

Key Metrics

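| Metric | WiFlow | WPformer | WiSPPN |
| --- | --- | --- | --- |
| Parameters | 4.82M | 10.04M | 121.5M |
| FLOPs | 0.47B | 35.00B | 338.45B |
| PCK@20 (random split) | 97.00% | 70.02% | 85.87% |
| MPJPE (random split) | 0.008m | 0.028m | 0.016m |
| PCK@20 (cross-subject) | 86.89% | -- | -- |
| Training time (5-fold) | 18.17h | 137.5h | -- |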
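For readers less familiar with the two accuracy metrics, here is a minimal sketch of how MPJPE and PCK are typically computed. The normalization length used for PCK varies between papers, so `ref_length` is a hypothetical parameter and not the definition used by WiFlow.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def joint_errors(pred: Sequence[Point], gt: Sequence[Point]) -> List[float]:
    """Euclidean distance between each predicted and ground-truth joint."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def mpjpe(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean per-joint position error, in the units of the inputs."""
    errs = joint_errors(pred, gt)
    return sum(errs) / len(errs)


def pck(pred: Sequence[Point], gt: Sequence[Point],
        ref_length: float, alpha: float = 0.2) -> float:
    """Fraction of joints whose error is within alpha * ref_length.

    alpha=0.2 corresponds to PCK@20; ref_length (for example a torso or
    bounding-box size) is an assumption that differs across papers.
    """
    threshold = alpha * ref_length
    errs = joint_errors(pred, gt)
    return sum(1 for e in errs if e <= threshold) / len(errs)
```

For example, with joint errors of 0, 1, and 3 units and `ref_length=10`, MPJPE is 4/3 and PCK@20 is 2/3.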

Critical observations for our project:

  • 4.82M parameters at INT8 quantization = ~4.8 MB model size -- fits in Pi Zero 2 W RAM (512 MB)
  • 0.47B FLOPs suggests ~50-80ms inference on Cortex-A53 with NEON SIMD (estimated; see Section 2.2)
  • Only uses amplitude, discards phase (phase is "heavily corrupted by CFO and SFO in commercial WiFi devices")
  • ESP32-S3 CSI has similar CFO/SFO issues, so amplitude-only approach is pragmatic
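The size and latency observations above reduce to simple arithmetic, sketched here. The one-byte-per-parameter figure ignores quantization scales and file-format overhead, and the throughput number is only what a 50 ms latency would imply, not a measured value.

```python
def model_size_mb(params: float, bytes_per_param: float) -> float:
    """Raw weight storage in MB (1 MB = 1e6 bytes), ignoring file overhead."""
    return params * bytes_per_param / 1e6


def required_gflops(flops: float, latency_s: float) -> float:
    """Sustained throughput (GFLOP/s) needed to hit a target latency."""
    return flops / latency_s / 1e9


if __name__ == "__main__":
    wiflow_params = 4.82e6
    wiflow_flops = 0.47e9
    print(model_size_mb(wiflow_params, 1))   # INT8: about 4.82 MB
    print(model_size_mb(wiflow_params, 4))   # FP32: about 19.28 MB
    print(required_gflops(wiflow_flops, 0.050))  # about 9.4 GFLOP/s
```

A 50 ms latency therefore presumes about 9.4 GFLOP/s of sustained throughput, which is worth confirming with an on-device benchmark before committing to the estimate.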

Loss function:

L = L_H + lambda * L_B
L_H = SmoothL1(predicted_keypoints, ground_truth, beta=0.1)
L_B = sum of bone length constraint violations across 14 bone connections
lambda = 0.2

The bone constraint loss is particularly important for edge deployment where noisy predictions need physical plausibility enforcement.
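A minimal sketch of this loss in plain Python follows. The Smooth L1 form matches the common definition with a `beta` threshold; the bone term is written as the absolute difference between predicted and ground-truth bone lengths, which is an assumption, since the exact form of the "constraint violation" is not given here. Averaging the keypoint term over coordinates is also an assumption.

```python
import math
from typing import Sequence, Tuple

Point = Sequence[float]
Bone = Tuple[int, int]  # indices of the two joints a bone connects


def smooth_l1(diff: float, beta: float = 0.1) -> float:
    """Quadratic below beta, linear above it."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def keypoint_loss(pred: Sequence[Point], gt: Sequence[Point],
                  beta: float = 0.1) -> float:
    """L_H: Smooth L1 averaged over all keypoint coordinates."""
    terms = [smooth_l1(p - g, beta)
             for pk, gk in zip(pred, gt) for p, g in zip(pk, gk)]
    return sum(terms) / len(terms)


def bone_loss(pred: Sequence[Point], gt: Sequence[Point],
              bones: Sequence[Bone]) -> float:
    """L_B: summed bone-length mismatch (assumed form of the violation)."""
    return sum(abs(math.dist(pred[i], pred[j]) - math.dist(gt[i], gt[j]))
               for i, j in bones)


def total_loss(pred: Sequence[Point], gt: Sequence[Point],
               bones: Sequence[Bone], lam: float = 0.2,
               beta: float = 0.1) -> float:
    """L = L_H + lambda * L_B."""
    return keypoint_loss(pred, gt, beta) + lam * bone_loss(pred, gt, bones)
```

In training code the same structure would be expressed with the framework's tensor operations so that gradients flow through both terms.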

Adaptation for ESP32 + Pi Zero

WiFlow's architecture maps well to our hardware:

  • TCN runs on ESP32 (temporal feature extraction from raw CSI stream)
  • Asymmetric conv + axial attention runs on Pi Zero (spatial encoding + pose regression)
  • The 540-dimensional input assumes an Intel 5300 NIC (18 links x 30 subcarriers); for ESP32-S3 with 1 TX x 1 RX and 52 subcarriers, the input is a 52x20 tensor (1,040 values versus 10,800 for 540x20) -- roughly 10x smaller

1.2 MultiFormer: Multi-Person WiFi Pose (May 2025)

Paper: MultiFormer: A Multi-Person Pose Estimation System Based on CSI and Attention Mechanism (arXiv:2505.22555)

Architecture

Teacher-student framework with OpenPose teacher providing ground truth labels.

Time-Frequency Dual-Dimensional Tokenization (TFDDT):

  • Input: CSI matrix from 1 TX, 3 RX, 30 subcarriers
  • Upsampled via zero-insertion + low-pass filtering to 64x3x64
  • Two parallel token streams:
    • Frequency tokens F_j: N_S tokens of length M x N_R (subcarrier-centric view)
    • Temporal tokens T_i: M tokens of length N_S x N_R (time-centric view)
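The two token views can be illustrated on a small nested-list CSI tensor indexed as `csi[t][s][r]` (time step, subcarrier, receive antenna). The axis order and the flattening order inside each token are assumptions made for this sketch, not details taken from the paper.

```python
from typing import List

Csi = List[List[List[float]]]  # csi[t][s][r]: M x N_S x N_R


def frequency_tokens(csi: Csi) -> List[List[float]]:
    """N_S tokens, each of length M * N_R (one token per subcarrier)."""
    m, n_s, n_r = len(csi), len(csi[0]), len(csi[0][0])
    return [[csi[t][s][r] for t in range(m) for r in range(n_r)]
            for s in range(n_s)]


def temporal_tokens(csi: Csi) -> List[List[float]]:
    """M tokens, each of length N_S * N_R (one token per time step)."""
    return [[v for per_subcarrier in frame for v in per_subcarrier]
            for frame in csi]


if __name__ == "__main__":
    # Toy tensor with M=4, N_S=3, N_R=2; values encode their own indices.
    toy = [[[t * 100 + s * 10 + r for r in range(2)] for s in range(3)]
           for t in range(4)]
    f, tt = frequency_tokens(toy), temporal_tokens(toy)
    print(len(f), len(f[0]))    # 3 8
    print(len(tt), len(tt[0]))  # 4 6
```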

Dual Transformer Encoder:

  • 8 layers per branch (frequency and temporal)
  • Multi-head self-attention: MSA(X) = (1/H) * sum(Softmax(QK^T / sqrt(d_k)) V)
  • Each branch followed by FFN with ReLU, dropout, residual connections

Multi-Stage Pose Estimation:

  • Part Confidence Maps (PCM): 19x36x36 heatmaps (18 keypoints + average)
  • Part Affinity Fields (PAF): 38x36x36 directional fields for 19 limb connections
  • Pose-Attentive Perception Module (PAPM): channel + spatial attention on PCM/PAF
  • Multi-person assignment via Hungarian algorithm on PAF integrals
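The PAF-based association step can be sketched as follows: score a candidate limb by averaging the PAF component along the segment between two keypoint candidates, then pick the assignment with the highest total score. For brevity this sketch uses an exhaustive search over permutations instead of the Hungarian algorithm; that is practical only for a handful of people. The function names, nearest-neighbor sampling, and sample count are all illustrative assumptions.

```python
import math
from itertools import permutations
from typing import List, Sequence, Tuple

Grid = Sequence[Sequence[float]]  # indexed as grid[y][x]


def paf_score(paf_x: Grid, paf_y: Grid, a: Tuple[float, float],
              b: Tuple[float, float], samples: int = 10) -> float:
    """Approximate line integral of the PAF along the segment a -> b.

    Assumes both endpoints lie inside the grid and samples >= 2.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length
    total = 0.0
    for k in range(samples):
        t = k / (samples - 1)
        x = int(round(a[0] + t * dx))
        y = int(round(a[1] + t * dy))
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / samples


def best_assignment(scores: Sequence[Sequence[float]]) -> List[int]:
    """result[i] is the column matched to row i (square matrix).

    Brute-force stand-in for the Hungarian algorithm, O(n!).
    """
    n = len(scores)
    best = max(permutations(range(n)),
               key=lambda perm: sum(scores[i][perm[i]] for i in range(n)))
    return list(best)
```

On the Pi Zero, a proper Hungarian implementation would replace `best_assignment` once more than a few candidates per joint type are expected.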

Model Variants

VariantEncoder LayersInputParameters
MultiFormer864x129611.93M
MultiFormer-24864x5764.05M
MultiFormer-18664x3242.80M

Key result on MM-Fi dataset: MultiFormer achieves PCK@20 of 0.7225, outperforming CSI2Pose (0.6841). The compact MultiFormer-18 at 2.80M parameters is edge-deployable.

Relevance to Our Project

MultiFormer's dual-token approach is valuable because:

  1. It explicitly separates temporal and frequency information (like WiFlow's decoupling)
  2. The PAF-based multi-person assignment using Hungarian algorithm can run on Pi Zero
  3. The 2.80M parameter variant (MultiFormer-18) at INT8 = ~2.8 MB, well within Pi Zero constraints

1.3 Person-in-WiFi 3D (CVPR 2024)

Paper: Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi (CVPR 2024)

First multi-person 3D WiFi pose estimation.

Key results:

  • Single person MPJPE: 91.7mm
  • Two persons: 108.1mm
  • Three persons: 125.3mm
  • Dataset: 97K frames, 4m x 3.5m area, 7 volunteers
  • Transformer-based end-to-end architecture

Relevance: Establishes the accuracy ceiling for WiFi 3D pose. Our ESP32+Pi system should target comparable single-person performance (sub-100mm MPJPE) as a milestone.

1.4 Spatio-Temporal 3D Point Clouds from WiFi-CSI (October 2024)

Paper: arXiv:2410.16303

Novel approach: generates 3D point clouds from WiFi CSI data using transformer networks.

Key innovation: Positional encoding with learned embeddings for antennas and subcarriers, followed by multi-head attention over antenna-subcarrier pairs. This captures both spatial (antenna geometry) and spectral (subcarrier frequency response) dependencies.

Relevance: Point cloud output is a richer representation than keypoints alone, enabling:

  • Silhouette estimation for activity recognition
  • Body volume estimation for person identification
  • Occlusion reasoning when fused with multiple viewpoints

1.5 Graph-Based 3D Human Pose from WiFi (November 2025)

Paper: Graph-based 3D Human Pose Estimation using WiFi Signals (arXiv:2511.19105)

Uses graph neural networks where nodes represent keypoints and edges represent skeletal connections. CSI features are injected as node/edge attributes.

Relevance: Graph structure naturally maps to our RuvSense pose_tracker which already maintains a 17-keypoint skeleton with Kalman filtering. Adding graph-based message passing between keypoints could improve joint prediction coherence.
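As a rough illustration of message passing over a skeleton, the sketch below blends each keypoint's feature vector with the mean of its neighbors' features. It has no learned weights and is not the method of the cited paper; `self_weight` and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple


def message_passing_step(features: Sequence[Sequence[float]],
                         edges: Sequence[Tuple[int, int]],
                         self_weight: float = 0.5) -> List[List[float]]:
    """One round of mean-aggregation message passing on an undirected graph.

    Each node keeps self_weight of its own feature and takes the rest
    from the average of its neighbors. Isolated nodes are unchanged.
    """
    neighbors: List[List[int]] = [[] for _ in features]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    updated = []
    for i, feat in enumerate(features):
        if not neighbors[i]:
            updated.append(list(feat))
            continue
        count = len(neighbors[i])
        mean = [sum(features[j][d] for j in neighbors[i]) / count
                for d in range(len(feat))]
        updated.append([self_weight * feat[d] + (1 - self_weight) * mean[d]
                        for d in range(len(feat))])
    return updated


if __name__ == "__main__":
    # Three joints in a chain: 0 - 1 - 2
    print(message_passing_step([[0.0], [2.0], [4.0]], [(0, 1), (1, 2)]))
    # [[1.0], [2.0], [3.0]]
```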

2. Edge Deployment Landscape

2.1 CSI-Sense-Zero: ESP32 + Pi Zero Reference Implementation

Repository: github.com/winwinashwin/CSI-Sense-Zero

The most directly relevant prior art for our hardware target.

Architecture:

  • Two ESP32-WROOM-32: one TX, one RX (captures CSI)
  • Pi Zero: inference node
  • Communication: USB serial at 921,600 baud
  • Buffer: 235KB FIFO at /tmp/csififo (~256 CSI records)
  • Inference rate: 2 Hz (configurable)
  • WebSocket output for real-time visualization

Data flow:

ESP32 TX -> WiFi signal -> ESP32 RX -> Serial (921.6 kbaud) -> Pi Zero FIFO -> Model -> WebSocket
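A quick sanity check on this link is to estimate how many CSI records per second the serial connection can carry. The sketch assumes 8N1 framing (10 bits on the wire per byte), 1 KB = 1024 bytes, and evenly sized records derived from the buffer figures above; none of these are confirmed by the repository.

```python
def serial_bytes_per_second(baud: int, bits_per_byte: int = 10) -> float:
    """Payload rate of a UART link; 10 bits per byte assumes 8N1 framing."""
    return baud / bits_per_byte


def record_size_bytes(buffer_bytes: int, records: int) -> float:
    """Average record size implied by a buffer that holds `records` items."""
    return buffer_bytes / records


if __name__ == "__main__":
    rate = serial_bytes_per_second(921_600)     # 92160.0 bytes/s
    size = record_size_bytes(235 * 1024, 256)   # 940.0 bytes
    print(rate, size, rate / size)              # about 98 records/s
```

Under these assumptions the link tops out near 98 records per second, which comfortably exceeds what a 2 Hz inference loop consumes.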

Limitations:

  • Original Pi Zero (single-core ARM11) -- very slow inference
  • Activity recognition only (not pose estimation)
  • Python inference (not optimized for ARM)

What we improve:

  • Pi Zero 2 W has quad-core Cortex-A53 -- roughly 5-10x faster than Pi Zero
  • Rust inference (ONNX/Candle) vs Python -- an estimated 3-10x faster
  • ESP32-S3 vs ESP32-WROOM-32 -- better CSI quality, more subcarriers
  • Pose estimation instead of just activity classification
  • UDP transport instead of USB serial -- supports multi-node mesh

2.2 OnnxStream: Lightweight ONNX on Pi Zero 2 W

Repository: github.com/vitoplantamura/OnnxStream

Runs Stable Diffusion XL on Pi Zero 2 W in 298 MB RAM. Key features:

  • C++ implementation, XNNPACK acceleration
  • ARM NEON SIMD optimization
  • Memory-efficient streaming execution (processes one operator at a time)
  • Supports INT8 quantization

Benchmark estimates for our model sizes:

| Model | Parameters | INT8 Size | Est. Pi Zero 2 Latency |
| --- | --- | --- | --- |
| MultiFormer-18 | 2.80M | ~2.8 MB | ~30-50ms |
| WiFlow | 4.82M | ~4.8 MB | ~50-80ms |
| MultiFormer | 11.93M | ~11.9 MB | ~120-200ms |
| DensePose-WiFi | ~25M (est.) | ~25 MB | ~300-500ms |

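These estimates assume XNNPACK-accelerated INT8 inference on Cortex-A53 @ 1 GHz. At these latencies MultiFormer-18 would run at roughly 20-33 Hz and WiFlow at 12.5-20 Hz, so MultiFormer-18 matches our 20 Hz TDMA cycle target across its estimated range, while WiFlow reaches it only at the optimistic end.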
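The conversion from the latency estimates to achievable frame rates can be checked directly. The latency ranges below are the estimates from the table, not measurements, and `meets_target` is a hypothetical helper that applies a worst-case test.

```python
from typing import Dict, Tuple

# Estimated latency ranges in milliseconds (from the table above).
LATENCY_MS: Dict[str, Tuple[float, float]] = {
    "MultiFormer-18": (30, 50),
    "WiFlow": (50, 80),
    "MultiFormer": (120, 200),
    "DensePose-WiFi": (300, 500),
}


def rate_range_hz(lo_ms: float, hi_ms: float) -> Tuple[float, float]:
    """(worst-case, best-case) inference rate for a latency range."""
    return 1000.0 / hi_ms, 1000.0 / lo_ms


def meets_target(lo_ms: float, hi_ms: float, target_hz: float) -> bool:
    """True if even the slowest estimated latency sustains the target rate."""
    return rate_range_hz(lo_ms, hi_ms)[0] >= target_hz


if __name__ == "__main__":
    for name, (lo, hi) in LATENCY_MS.items():
        print(name, rate_range_hz(lo, hi), meets_target(lo, hi, 20.0))
```

With a 20 Hz target, only MultiFormer-18 passes the worst-case test; WiFlow's range is 12.5 to 20 Hz.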

2.3 ONNX Runtime on ARM

ONNX Runtime officially supports Raspberry Pi deployment with:

  • CPU execution provider with ARM NEON-optimized kernels
  • INT8 quantization support
  • Python and C++ APIs
  • Model optimization tools (graph optimization, operator fusion)

For Rust integration, the ort crate (ONNX Runtime Rust bindings) supports cross-compilation to 64-bit ARM Linux (Rust target aarch64-unknown-linux-gnu).

2.4 EfficientFi: CSI Compression for Edge

Paper: EfficientFi: Towards Large-Scale Lightweight WiFi Sensing via CSI Compression (arXiv:2204.04138)

Proposes compressing CSI data on the sensing device before transmission to the inference node. Key idea: train a CSI autoencoder where the encoder runs on the constrained device and the decoder runs on the more powerful inference node.

Relevance: For our ESP32 -> Pi Zero pipeline, CSI compression on ESP32 reduces:

  • UDP packet size (lower bandwidth, less packet loss)
  • Pi Zero preprocessing time (compressed features are more compact)
  • Effective latency (less data to transmit per frame)

3. Comparative Analysis: Architecture Selection for ESP32 + Pi Zero

3.1 Decision Matrix

| Criterion | WiFlow | MultiFormer-18 | DensePose-WiFi | Graph-3D |
| --- | --- | --- | --- | --- |
| Parameters | 4.82M | 2.80M | ~25M | ~8M (est.) |
| FLOPs | 0.47B | ~0.3B (est.) | ~5B (est.) | ~1B (est.) |
| Multi-person | No | Yes (PAF+Hungarian) | Yes (RCNN-based) | No |
| 3D output | No (2D) | No (2D) | No (UV map) | Yes (3D) |
| Amplitude-only | Yes | Yes | No (amp+phase) | Unknown |
| Edge-viable | Yes | Yes | No | Marginal |
| Open source | Not yet | Not yet | Limited | Not yet |

3.2 Recommended Hybrid Architecture

For the ESP32 + Pi Zero deployment, we recommend a hybrid architecture:

  1. WiFlow's TCN temporal encoder on ESP32 -- extract temporal features from raw CSI
  2. MultiFormer's dual-token approach on Pi Zero -- process both frequency and temporal views
  3. WiFlow's bone constraint loss during training -- enforce physical skeleton plausibility
  4. RuvSense coherence gating before inference -- reject low-quality CSI frames

This hybrid achieves:

  • ~3-5M parameters (between WiFlow and MultiFormer-18)
  • Amplitude-only input (robust to ESP32 CFO/SFO)
  • Sub-100ms inference on Pi Zero 2 W
  • Optional multi-person support via PAF module

3.3 Training Data Strategy

Based on the surveyed papers:

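| Dataset | Subjects | Frames | Hardware | Availability |
| --- | --- | --- | --- | --- |
| CMU DensePose-WiFi | 8 | ~250K | Intel 5300 | Limited |
| Person-in-WiFi 3D | 7 | 97K | Custom WiFi | GitHub |
| MM-Fi | Multiple | Large | WiFi + mmWave | Public |
| Wi-Pose | Multiple | Large | Intel 5300 | Public |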

Our approach:

  1. Pre-train on MM-Fi/Wi-Pose public datasets (Intel 5300 CSI format)
  2. Apply domain adaptation for ESP32-S3 CSI format (different subcarrier count, CFO characteristics)
  3. Fine-tune on self-collected ESP32-S3 data in target environments
  4. Augment with synthetic CSI from ray-tracing forward model (Arena Physica insight)
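One simple ingredient of step 2 is resampling the subcarrier axis so that ESP32-S3 frames (52 subcarriers) match the 30-subcarrier layout of Intel 5300 data. The sketch below uses plain linear interpolation over evenly spaced indices; it ignores the true subcarrier frequencies and null subcarriers, and is not the implementation in wifi-densepose-train/subcarrier.rs.

```python
from typing import List, Sequence


def resample_subcarriers(amplitudes: Sequence[float],
                         target_count: int) -> List[float]:
    """Linearly interpolate an amplitude vector to target_count points.

    Assumes subcarriers are evenly spaced, which is a simplification.
    """
    n = len(amplitudes)
    if n == 1 or target_count == 1:
        return [float(amplitudes[0])] * target_count
    out = []
    for k in range(target_count):
        pos = k * (n - 1) / (target_count - 1)  # position in source index space
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1 - frac) * amplitudes[lo] + frac * amplitudes[hi])
    return out


if __name__ == "__main__":
    esp32_frame = [float(i) for i in range(52)]  # synthetic ramp
    resampled = resample_subcarriers(esp32_frame, 30)
    print(len(resampled), resampled[0], resampled[-1])  # 30 0.0 51.0
```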

4. Gap Analysis: Current wifi-densepose vs SOTA

4.1 What We Have

| Capability | Status | Module |
| --- | --- | --- |
| ESP32 CSI capture | Production | wifi-densepose-hardware |
| Multi-node fusion | Production | ruvsense/multistatic.rs |
| Phase alignment | Production | ruvsense/phase_align.rs |
| Coherence gating | Production | ruvsense/coherence_gate.rs |
| 17-keypoint tracking | Production | ruvsense/pose_tracker.rs |
| ONNX inference engine | Production | wifi-densepose-nn |
| Modality translator | Production | wifi-densepose-nn/translator.rs |
| Training pipeline | Production | wifi-densepose-train |
| Subcarrier interpolation | Production | wifi-densepose-train/subcarrier.rs |

4.2 What We Are Missing

| Gap | Required For | Priority |
| --- | --- | --- |
| Pi Zero deployment target | Edge inference node | Critical |
| Lightweight model architecture | Sub-100ms inference on Cortex-A53 | Critical |
| Temporal causal convolution | Real-time streaming inference | High |
| Axial attention module | Efficient spatial encoding | High |
| Bone constraint loss | Physical plausibility | High |
| CSI compression on ESP32 | Bandwidth reduction | Medium |
| INT8 quantization pipeline | Model size reduction | Medium |
| Cross-environment adaptation | Deployment generalization | Medium |
| Multi-person PAF decoding | Multiple subject support | Low (Phase 2) |
| 3D pose lifting | Z-axis estimation | Low (Phase 3) |
| Diffusion-based pose refinement | Uncertainty quantification | Research |

4.3 Architecture Gaps in Detail

1. No lightweight inference path. The current wifi-densepose-nn crate assumes GPU or high-end CPU inference. We need an EdgeInferenceEngine optimized for:

  • INT8 ONNX models
  • ARM NEON SIMD via XNNPACK
  • Streaming inference (process CSI frames as they arrive, not in batches)
  • Memory-mapped model loading (avoid loading entire model into RAM)

2. No ESP32 -> Pi Zero communication protocol. The wifi-densepose-hardware crate handles ESP32 CSI capture and UDP aggregation to a server, but has no lightweight protocol for ESP32 -> Pi Zero direct communication. We need:

  • Compact binary frame format (not the full ADR-018 format)
  • Optional CSI compression (autoencoder on ESP32 or simple PCA)
  • Heartbeat and synchronization for multi-ESP32 setups
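To give a sense of scale for a compact frame format, here is a purely hypothetical layout: a 16-byte little-endian header followed by one unsigned byte per subcarrier amplitude. The field names, magic value, and 8-bit amplitude encoding are illustrative assumptions and do not describe ADR-018 or any existing project format.

```python
import struct
from typing import List, Sequence, Tuple

# Hypothetical header: magic (u16), node id (u8), subcarrier count (u8),
# sequence number (u32), timestamp in microseconds (u64). Little-endian.
HEADER = struct.Struct("<HBBIQ")
MAGIC = 0xC511  # arbitrary illustrative value


def pack_frame(node_id: int, seq: int, timestamp_us: int,
               amplitudes: Sequence[float]) -> bytes:
    """Serialize one CSI frame; amplitudes are clamped to 0..255."""
    if len(amplitudes) > 255:
        raise ValueError("too many subcarriers for a one-byte count")
    payload = bytes(min(255, max(0, int(round(a)))) for a in amplitudes)
    return HEADER.pack(MAGIC, node_id, len(amplitudes), seq, timestamp_us) + payload


def unpack_frame(data: bytes) -> Tuple[int, int, int, List[int]]:
    """Parse a frame produced by pack_frame, validating magic and length."""
    magic, node_id, count, seq, timestamp_us = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("bad magic")
    payload = data[HEADER.size:HEADER.size + count]
    if len(payload) != count:
        raise ValueError("truncated frame")
    return node_id, seq, timestamp_us, list(payload)


if __name__ == "__main__":
    frame = pack_frame(3, 42, 123_456_789, list(range(52)))
    print(len(frame))  # 68
```

With 52 subcarriers such a frame is 68 bytes, small enough to fit many frames in a single UDP datagram. The sequence number and timestamp fields are where heartbeat and multi-node synchronization logic would attach.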

3. No temporal convolution module. The existing signal processing pipeline uses frame-by-frame processing. WiFlow and MultiFormer both show that temporal context (20 frames for WiFlow, 64 frames for MultiFormer) significantly improves accuracy. We need a ring buffer + TCN module in the inference path.
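The ring-buffer half of this requirement is small. A minimal sketch, assuming a 20-frame window as in WiFlow; the class name `CsiWindow` is hypothetical, and the production version would live in the Rust inference path:

```python
from collections import deque
from typing import Deque, List, Sequence


class CsiWindow:
    """Fixed-length sliding window over the most recent CSI frames."""

    def __init__(self, length: int = 20) -> None:
        self._frames: Deque[List[float]] = deque(maxlen=length)

    def push(self, frame: Sequence[float]) -> bool:
        """Append a frame, dropping the oldest once full.

        Returns True when the window holds `length` frames and is
        ready to be passed to the temporal model.
        """
        self._frames.append(list(frame))
        return len(self._frames) == self._frames.maxlen

    def window(self) -> List[List[float]]:
        """Frames in arrival order, oldest first."""
        return list(self._frames)


if __name__ == "__main__":
    buf = CsiWindow(length=3)
    for i in range(1, 5):
        ready = buf.push([float(i)])
    print(ready, buf.window())  # True [[2.0], [3.0], [4.0]]
```

Because the window slides by one frame per push, a causal TCN can emit one pose estimate per incoming CSI frame once the buffer has filled.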

4. No bone/skeleton constraint enforcement at training time. The pose_tracker.rs module has Kalman filtering and skeleton constraints, but these are post-hoc corrections applied at inference time. WiFlow shows that baking bone constraints into the loss function during training produces better models that need less post-processing.

5. References

  1. DensePose From WiFi, Geng et al., arXiv:2301.00250, 2023
  2. Person-in-WiFi 3D, Yan et al., CVPR 2024
  3. WiFlow, arXiv:2602.08661, 2026
  4. MultiFormer, arXiv:2505.22555, 2025
  5. CSI-Channel Spatial Decomposition, MDPI Electronics 14(4), 2025
  6. CSI-Former, MDPI Entropy 25(1), 2023
  7. Spatio-Temporal 3D Point Clouds from WiFi-CSI, arXiv:2410.16303, 2024
  8. Graph-based 3D Human Pose from WiFi, arXiv:2511.19105, 2025
  9. EfficientFi, arXiv:2204.04138, 2022
  10. CSI-Sense-Zero, github.com/winwinashwin/CSI-Sense-Zero
  11. OnnxStream, github.com/vitoplantamura/OnnxStream
  12. Arena Physica, arenaphysica.com (Atlas RF Studio, Heaviside-0/Marconi-0)
  13. Tools and Methods for WiFi Sensing in Embedded Devices, MDPI Sensors 25(19), 2025
  14. Real-Time HAR using WiFi CSI and LSTM on Edge Devices, SASI-ITE 2025