ADR-079: Camera Ground-Truth Training Pipeline

  • Status: Accepted
  • Date: 2026-04-06
  • Deciders: ruv
  • Relates to: ADR-072 (WiFlow Architecture), ADR-070 (Self-Supervised Pretraining), ADR-071 (ruvllm Training Pipeline), ADR-024 (AETHER Contrastive), ADR-064 (Multimodal Ambient Intelligence), ADR-075 (MinCut Person Separation)

Context

WiFlow (ADR-072) currently trains without ground-truth pose labels, using proxy poses generated from presence/motion heuristics. This produces a PCK@20 of only 2.5% — far below the 30-50% achievable with supervised training. The fundamental bottleneck is the absence of spatial keypoint labels.

Academic WiFi pose estimation systems (Wi-Pose, Person-in-WiFi 3D, MetaFi++) all train with synchronized camera ground truth and achieve PCK@20 of 40-85%. They discard the camera at deployment — the camera is a training-time teacher, not a runtime dependency.

ADR-064 already identified this: "Record CSI + mmWave while performing signs with a camera as ground truth, then deploy camera-free." This ADR specifies the implementation.

Current Training Pipeline Gap

Current:  CSI amplitude → WiFlow → 17 keypoints (proxy-supervised, PCK@20 = 2.5%)
                                    ↑
                            Heuristic proxies:
                            - Standing skeleton when presence > 0.3
                            - Limb perturbation from motion energy
                            - No spatial accuracy

Target Pipeline

Training: CSI amplitude ──→ WiFlow ──→ 17 keypoints (camera-supervised, PCK@20 target: 35%+)
                                        ↑
          Laptop camera ──→ MediaPipe ──→ 17 COCO keypoints (ground truth)
                                        (time-synchronized, 30 fps)

Deploy:   CSI amplitude ──→ WiFlow ──→ 17 keypoints (camera-free, trained model only)

Decision

Build a camera ground-truth collection and training pipeline using the laptop webcam as a teacher signal. The camera is used only during training data collection and is not required at deployment.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    Data Collection Phase                         │
│                                                                 │
│  ESP32-S3 nodes ──UDP──→ Sensing Server ──→ CSI frames (.jsonl) │
│                              ↑ time sync                        │
│  Laptop Camera ──→ MediaPipe Pose ──→ Keypoints (.jsonl)        │
│                              ↑                                  │
│                     collect-ground-truth.py                      │
│                     (single orchestrator)                        │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    Training Phase                                │
│                                                                 │
│  Paired dataset: { csi_window[128,20], keypoints[17,2], conf }  │
│         ↓                                                       │
│  train-wiflow-supervised.js                                     │
│    Phase 1: Contrastive pretrain (ADR-072, reuse)               │
│    Phase 2: Supervised keypoint regression (NEW)                │
│    Phase 3: Fine-tune with bone constraints + confidence        │
│         ↓                                                       │
│  WiFlow model (1.8M params) → SafeTensors export                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    Deployment (camera-free)                      │
│                                                                 │
│  ESP32-S3 CSI → Sensing Server → WiFlow inference → 17 keypoints│
│  (No camera. Trained model runs on CSI input only.)             │
└─────────────────────────────────────────────────────────────────┘

Component 1: scripts/collect-ground-truth.py

Single Python script that orchestrates synchronized capture from the laptop camera and the ESP32 CSI stream.

Dependencies: mediapipe, opencv-python, requests (all pip-installable, no GPU)

Capture flow:

```python
# Pseudocode (simplified from scripts/collect-ground-truth.py)
import time

import cv2
import mediapipe as mp
import requests

camera = cv2.VideoCapture(0)              # Laptop webcam
sensing_api = "http://localhost:3000"     # Sensing server
mp_pose = mp.solutions.pose.Pose()        # MediaPipe Pose (33 landmarks)

# Start CSI recording via existing API
requests.post(f"{sensing_api}/api/v1/recording/start")

while recording:
    ok, frame = camera.read()
    if not ok:
        continue
    t = time.time_ns()                    # Nanosecond timestamp

    # MediaPipe Pose: 33 landmarks → map to 17 COCO keypoints
    result = mp_pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks is None:
        continue                          # No person detected in this frame
    keypoints_17, visibility_17 = map_mediapipe_to_coco(result.pose_landmarks.landmark)
    confidence = sum(visibility_17) / len(visibility_17)

    # Write to ground-truth JSONL (one line per frame)
    write_jsonl({
        "ts_ns": t,
        "keypoints": keypoints_17,        # [[x, y], ...] normalized to [0, 1]
        "confidence": confidence,         # 0-1, used for loss weighting
        "n_visible": sum(v > 0.5 for v in visibility_17),
    })

    # Optional: show live preview with skeleton overlay
    if preview:
        draw_skeleton(frame, keypoints_17)
        cv2.imshow("Ground Truth", frame)
        cv2.waitKey(1)

# Stop CSI recording
requests.post(f"{sensing_api}/api/v1/recording/stop")
```

MediaPipe → COCO keypoint mapping:

| COCO Index | Joint | MediaPipe Index |
|---|---|---|
| 0 | Nose | 0 |
| 1 | Left Eye | 2 |
| 2 | Right Eye | 5 |
| 3 | Left Ear | 7 |
| 4 | Right Ear | 8 |
| 5 | Left Shoulder | 11 |
| 6 | Right Shoulder | 12 |
| 7 | Left Elbow | 13 |
| 8 | Right Elbow | 14 |
| 9 | Left Wrist | 15 |
| 10 | Right Wrist | 16 |
| 11 | Left Hip | 23 |
| 12 | Right Hip | 24 |
| 13 | Left Knee | 25 |
| 14 | Right Knee | 26 |
| 15 | Left Ankle | 27 |
| 16 | Right Ankle | 28 |
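
A minimal Python sketch of this mapping as it could appear in collect-ground-truth.py. The table above is the source of truth; the helper name map_mediapipe_to_coco matches the pseudocode, but its exact signature here is an assumption.

```python
# 17 MediaPipe landmark indices in COCO keypoint order (see the table above).
COCO_FROM_MEDIAPIPE = [0, 2, 5, 7, 8, 11, 12, 13, 14, 15, 16, 23, 24, 25, 26, 27, 28]

def map_mediapipe_to_coco(landmarks):
    """landmarks: sequence of 33 MediaPipe landmarks with .x, .y, .visibility fields.
    Returns ([17][x, y] normalized coordinates, [17] visibility scores)."""
    keypoints = [[landmarks[i].x, landmarks[i].y] for i in COCO_FROM_MEDIAPIPE]
    visibility = [landmarks[i].visibility for i in COCO_FROM_MEDIAPIPE]
    return keypoints, visibility
```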

Component 2: Time Alignment (scripts/align-ground-truth.js)

CSI frames arrive at ~100 Hz with server-side timestamps. Camera keypoints arrive at ~30 fps with client-side timestamps. Alignment is needed because:

  1. Camera and sensing server clocks differ (typically < 50ms on LAN)
  2. CSI is aggregated into 20-frame windows for WiFlow input
  3. Ground-truth keypoints must be averaged over the same window

Alignment algorithm:

For each CSI window W_i (20 frames, ~200ms at 100Hz):
  t_start = W_i.first_frame.timestamp
  t_end   = W_i.last_frame.timestamp

  # Find all camera keypoints within this time window
  matching_keypoints = [k for k in camera_data if t_start <= k.ts <= t_end]

  if len(matching_keypoints) >= 3:   # At least 3 camera frames per window
    # Average keypoints, weighted by confidence
    avg_keypoints = weighted_mean(matching_keypoints, weights=confidences)
    avg_confidence = mean(confidences)

    paired_dataset.append({
      csi_window: W_i.amplitudes,    # [128, 20] float32
      keypoints: avg_keypoints,       # [17, 2] float32
      confidence: avg_confidence,     # scalar
      n_camera_frames: len(matching_keypoints),
    })

Clock sync strategy:

  • NTP is sufficient (< 20ms error on LAN)
  • The 200ms CSI window is 10x larger than typical clock drift
  • For tighter sync: use a handclap/jump as a sync marker — visible spike in both CSI motion energy and camera skeleton velocity. Auto-detect and align.
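
A minimal sketch of the sync-marker idea, assuming per-frame CSI motion energy and camera skeleton velocity have already been computed; the function and parameter names are illustrative, not part of align-ground-truth.js.

```python
import numpy as np

def estimate_clock_offset(csi_ts_ns, csi_energy, cam_ts_ns, cam_velocity,
                          max_offset_ms=500, step_ms=5):
    """Return the camera clock offset (ns) that best aligns the handclap/jump spike
    in CSI motion energy with the spike in camera skeleton velocity."""
    grid = np.arange(min(csi_ts_ns), max(csi_ts_ns), step_ms * 1e6)
    csi_series = np.interp(grid, csi_ts_ns, csi_energy)
    best_offset, best_score = 0, -np.inf
    for offset_ms in range(-max_offset_ms, max_offset_ms + 1, step_ms):
        shifted = np.interp(grid, np.asarray(cam_ts_ns) + offset_ms * 1e6, cam_velocity)
        score = np.corrcoef(csi_series, shifted)[0, 1]    # normalized cross-correlation
        if score > best_score:
            best_offset, best_score = offset_ms * 1e6, score
    return best_offset   # add to camera timestamps before windowed alignment
```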

Output: data/recordings/paired-{timestamp}.jsonl — one line per paired sample:

```json
{"csi": [128x20 flat], "kp": [[0.45,0.12], ...], "conf": 0.92, "ts": 1775300000000}
```

Component 3: Supervised Training (scripts/train-wiflow-supervised.js)

Extends the existing train-ruvllm.js pipeline with a supervised phase.

Phase 1: Contrastive Pretrain (reuse ADR-072)

  • Same as existing: temporal + cross-node triplets
  • Learns CSI representation without labels
  • 50 epochs, ~5 min on laptop

Phase 2: Supervised Keypoint Regression (NEW)

  • Load paired dataset from Component 2
  • Loss: confidence-weighted SmoothL1 on keypoints
L_supervised = (1/N) * sum_i [ conf_i * SmoothL1(pred_i, gt_i, beta=0.05) ]
  • Only train on samples where conf > 0.5 (discard frames where MediaPipe lost tracking)
  • Learning rate: 1e-4 with cosine decay
  • 200 epochs, ~15 min on laptop CPU (1.8M params, no GPU needed)
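
The Phase 2 loss written out as a numpy sketch. The training script itself is JavaScript, so this only restates the formula above.

```python
import numpy as np

def smooth_l1(pred, gt, beta=0.05):
    """Elementwise SmoothL1: quadratic below beta, linear above."""
    diff = np.abs(pred - gt)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)

def supervised_loss(pred_kp, gt_kp, conf, beta=0.05):
    """pred_kp, gt_kp: [N, 17, 2] keypoints; conf: [N] MediaPipe confidence per sample."""
    per_sample = smooth_l1(pred_kp, gt_kp, beta).mean(axis=(1, 2))   # [N]
    return float(np.mean(conf * per_sample))
```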

Phase 3: Refinement with Bone Constraints

  • Fine-tune with combined loss:
L = L_supervised + 0.3 * L_bone + 0.1 * L_temporal

L_bone     = (1/14) * sum_b (bone_len_b - prior_b)^2   # ADR-072 bone priors
L_temporal = SmoothL1(kp_t, kp_{t-1})                   # Temporal smoothness
  • 50 epochs at lower LR (1e-5)
  • Tighten bone constraint weight from 0.3 → 0.5 over epochs
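
A sketch of the Phase 3 combined loss, reusing smooth_l1 and supervised_loss from the Phase 2 sketch. The BONE_PAIRS list and the 14 bone-length priors are assumed to follow ADR-072's skeleton definition.

```python
import numpy as np

# 14 bones as (joint_a, joint_b) COCO index pairs — an assumed ADR-072-style skeleton.
BONE_PAIRS = [(5, 7), (7, 9), (6, 8), (8, 10),           # arms
              (11, 13), (13, 15), (12, 14), (14, 16),     # legs
              (5, 6), (11, 12), (5, 11), (6, 12),         # torso
              (0, 5), (0, 6)]                             # nose to shoulders

def refinement_loss(pred_kp, prev_kp, gt_kp, conf, bone_priors,
                    w_bone=0.3, w_temporal=0.1):
    """pred_kp, prev_kp, gt_kp: [N, 17, 2]; bone_priors: [14] expected bone lengths."""
    l_sup = supervised_loss(pred_kp, gt_kp, conf)
    bone_len = np.stack([np.linalg.norm(pred_kp[:, a] - pred_kp[:, b], axis=-1)
                         for a, b in BONE_PAIRS], axis=1)            # [N, 14]
    l_bone = float(np.mean((bone_len - bone_priors) ** 2))
    l_temporal = float(np.mean(smooth_l1(pred_kp, prev_kp)))
    return l_sup + w_bone * l_bone + w_temporal * l_temporal
```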

Phase 4: Quantization + Export

  • Reuse ruvllm TurboQuant: float32 → int8 (4x smaller, ~881 KB)
  • Export via SafeTensors for cross-platform deployment
  • Validate quantized model PCK@20 within 2% of full-precision

Component 4: Evaluation Script (scripts/eval-wiflow.js)

Measure actual PCK@20 using held-out paired data (20% split).

PCK@k = (1/N) * sum_i [ (||pred_i - gt_i|| < k * torso_length) ? 1 : 0 ]

Metrics reported:

| Metric | Description | Target |
|---|---|---|
| PCK@20 | % of keypoints within 20% torso length | > 35% |
| PCK@50 | % within 50% torso length | > 60% |
| MPJPE | Mean per-joint position error (pixels) | < 40px |
| Per-joint PCK | Breakdown by joint (wrists are hardest) | Report all 17 |
| Inference latency | Single window prediction time | < 50ms |
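
A numpy sketch of PCK@k as defined above. Taking the torso length as the ground-truth left-shoulder-to-right-hip distance is an assumption; eval-wiflow.js may normalize differently.

```python
import numpy as np

def pck(pred_kp, gt_kp, k=0.20):
    """pred_kp, gt_kp: [N, 17, 2]. Fraction of keypoints within k * torso length of ground truth."""
    torso = np.linalg.norm(gt_kp[:, 5] - gt_kp[:, 12], axis=-1)      # [N] shoulder-to-hip length
    dist = np.linalg.norm(pred_kp - gt_kp, axis=-1)                  # [N, 17]
    return float(np.mean(dist < k * torso[:, None]))
```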

Optimization Strategy

O1: Curriculum Learning

Train easy poses first, hard poses later:

| Stage | Epochs | Data Filter | Rationale |
|---|---|---|---|
| 1 | 50 | conf > 0.9, standing only | Establish stable skeleton baseline |
| 2 | 50 | conf > 0.7, low motion | Add sitting, subtle movements |
| 3 | 50 | conf > 0.5, all poses | Full dataset including occlusions |
| 4 | 50 | All data, with augmentation | Robustness via noise injection |
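
One way to express these stage filters over the paired dataset. The "motion" field, used here as a stand-in for the "standing only" / "low motion" criteria, is an assumption.

```python
# Curriculum filters over paired samples (dicts with at least "conf" and a motion-energy field).
STAGE_FILTERS = [
    lambda s: s["conf"] > 0.9 and s["motion"] < 0.1,   # Stage 1: near-static, high confidence
    lambda s: s["conf"] > 0.7 and s["motion"] < 0.3,   # Stage 2: low motion
    lambda s: s["conf"] > 0.5,                         # Stage 3: all poses
    lambda s: True,                                    # Stage 4: everything, plus augmentation
]

def stage_dataset(samples, stage):
    return [s for s in samples if STAGE_FILTERS[stage](s)]
```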

O2: Data Augmentation (CSI domain)

Augment CSI windows to increase effective dataset size without collecting more data:

| Augmentation | Implementation | Expected Gain |
|---|---|---|
| Time shift | Roll CSI window by ±2 frames | +30% data |
| Amplitude noise | Gaussian noise, sigma=0.02 | Robustness |
| Subcarrier dropout | Zero 10% of subcarriers randomly | Robustness |
| Temporal flip | Reverse window + reverse keypoint velocity | +100% data |
| Multi-node mix | Swap node CSI, keep same-time keypoints | Cross-node generalization |
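
Numpy sketches of the first three augmentations on a csi window of shape [128, 20]; the JavaScript training script would apply the equivalents.

```python
import numpy as np

rng = np.random.default_rng()

def time_shift(csi, max_shift=2):
    """Roll the window along the time axis by a random ±max_shift frames."""
    return np.roll(csi, rng.integers(-max_shift, max_shift + 1), axis=1)

def amplitude_noise(csi, sigma=0.02):
    """Add Gaussian noise to the CSI amplitudes."""
    return csi + rng.normal(0.0, sigma, size=csi.shape)

def subcarrier_dropout(csi, p=0.10):
    """Zero a random p fraction of subcarriers (rows)."""
    mask = rng.random(csi.shape[0]) >= p
    return csi * mask[:, None]
```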

O3: Knowledge Distillation from MediaPipe

Instead of raw keypoint regression, distill MediaPipe's confidence and heatmap information:

L_distill = KL_div(softmax(wifi_heatmap / T), softmax(camera_heatmap / T))
  • Temperature T=4 for soft targets (transfers inter-joint relationships)
  • WiFlow predicts a 17-channel heatmap [17, H, W] instead of direct [17, 2]
  • Argmax for final keypoint extraction
  • Trade-off: Adds ~200K params for heatmap decoder, but improves spatial precision
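
A sketch of the distillation term, assuming both models expose per-joint heatmap logits of shape [17, H, W]; the argument order follows the formula above.

```python
import numpy as np

def softened(logits, T):
    """Temperature-softened softmax over a flattened heatmap."""
    x = logits.ravel() / T
    e = np.exp(x - x.max())
    return e / e.sum()

def distill_loss(wifi_heatmap, camera_heatmap, T=4.0):
    """Mean KL divergence between softened WiFi and camera heatmaps over 17 joints."""
    loss = 0.0
    for j in range(wifi_heatmap.shape[0]):
        p = softened(wifi_heatmap[j], T)
        q = softened(camera_heatmap[j], T)
        loss += float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
    return loss / wifi_heatmap.shape[0]
```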

O4: Active Learning Loop

Identify which poses the model is worst at and collect more data for those:

1. Train initial model on first collection session
2. Run inference on new CSI data, compute prediction entropy
3. Flag high-entropy windows (model is uncertain)
4. During next collection, the preview overlay highlights these moments:
   "Hold this pose — model needs more examples"
5. Re-train with augmented dataset

Expected: accuracy saturates after 2-3 active learning iterations.

O6: Subcarrier Selection (ruvector-solver)

Variance-based top-K subcarrier selection, equivalent to ruvector-solver's sparse interpolation (114→56). Removes noise/static subcarriers before training:

For each subcarrier d in [0, dim):
  variance[d] = mean over samples of temporal_variance(csi[d, :])
Select top-K by variance (K = dim * 0.5)

Validated: 128 → 56 subcarriers (56% input reduction), proportional model size reduction.
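
The O6 selection in numpy over a stacked dataset of CSI windows [num_samples, dim, window].

```python
import numpy as np

def select_subcarriers(csi, keep_ratio=0.5):
    """csi: [num_samples, dim, window]. Keep the top-K subcarriers by mean temporal variance."""
    variance = csi.var(axis=2).mean(axis=0)          # [dim] temporal variance averaged over samples
    k = int(csi.shape[1] * keep_ratio)
    keep = np.sort(np.argsort(variance)[-k:])        # top-K indices, preserving subcarrier order
    return csi[:, keep, :], keep
```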

O7: Attention-Weighted Subcarriers (ruvector-attention)

Compute per-subcarrier attention weights based on temporal energy correlation with ground-truth keypoint motion. High-energy subcarriers that covary with skeleton movement get amplified:

For each subcarrier d:
  energy[d] = sum of squared first-differences over time
  weight[d] = softmax(energy, temperature=0.1)
Apply: csi[d, :] *= weight[d] * dim  (mean weight = 1)

Validated: Top-5 attention subcarriers identified automatically per dataset.
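
The O7 weighting in numpy, following the pseudocode above (energy-based softmax with temperature 0.1, rescaled so the mean weight is 1).

```python
import numpy as np

def attention_weight_subcarriers(csi, temperature=0.1):
    """csi: [num_samples, dim, window]. Amplify subcarriers with high first-difference energy."""
    energy = (np.diff(csi, axis=2) ** 2).sum(axis=2).mean(axis=0)    # [dim]
    z = (energy - energy.max()) / temperature
    weight = np.exp(z) / np.exp(z).sum()                             # softmax over subcarriers
    scaled = weight * csi.shape[1]                                   # mean weight = 1
    return csi * scaled[None, :, None], weight
```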

O8: Stoer-Wagner MinCut Person Separation (ruvector-mincut / ADR-075)

JS implementation of the Stoer-Wagner algorithm for person separation in CSI, equivalent to DynamicPersonMatcher in wifi-densepose-train/src/metrics.rs. Builds a subcarrier correlation graph and finds the minimum cut to identify person-specific subcarrier clusters:

1. Build dim×dim Pearson correlation matrix across subcarriers
2. Run Stoer-Wagner min-cut on correlation graph
3. Partition subcarriers into person-specific groups
4. Train per-partition models for multi-person scenarios

Validated: Stoer-Wagner executes on 56-dim graph, identifies partition boundaries.
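
A sketch of steps 1-3, using networkx's stoer_wagner as a stand-in for the project's JS implementation; networkx itself is an assumption here.

```python
import numpy as np
import networkx as nx

def person_partitions(csi):
    """csi: [num_samples, dim, window]. Split subcarriers via min-cut on their correlation graph."""
    flat = csi.transpose(1, 0, 2).reshape(csi.shape[1], -1)      # [dim, num_samples * window]
    corr = np.corrcoef(flat)                                      # dim x dim Pearson correlation
    g = nx.Graph()
    dim = corr.shape[0]
    for i in range(dim):
        for j in range(i + 1, dim):
            g.add_edge(i, j, weight=float(abs(corr[i, j])))       # non-negative edge weights
    cut_value, (part_a, part_b) = nx.stoer_wagner(g)
    return sorted(part_a), sorted(part_b)
```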

O9: Multi-SPSA Gradient Estimation

Average over K=3 random perturbation directions per gradient step. Reduces variance by sqrt(K) = 1.73x compared to single SPSA, at 3x forward pass cost (net win for convergence quality):

For k in 1..K:
  delta_k = random ±1 per parameter
  grad_k = (loss(w + eps*delta_k) - loss(w - eps*delta_k)) / (2*eps*delta_k)
grad = mean(grad_1, ..., grad_K)
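
The O9 estimator as a numpy sketch; loss_fn is any scalar loss over a flat parameter vector, and eps is an assumed perturbation scale.

```python
import numpy as np

def multi_spsa_grad(loss_fn, w, eps=1e-3, K=3, rng=None):
    """Average SPSA gradient estimates over K random ±1 (Rademacher) perturbation directions."""
    rng = rng or np.random.default_rng()
    grads = []
    for _ in range(K):
        delta = rng.choice([-1.0, 1.0], size=w.shape)
        g = (loss_fn(w + eps * delta) - loss_fn(w - eps * delta)) / (2 * eps * delta)
        grads.append(g)
    return np.mean(grads, axis=0)
```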

O10: Mac M4 Pro Training via Tailscale

Training runs on Mac Mini M4 Pro (16-core GPU, ARM NEON SIMD) via Tailscale SSH, using ruvllm's native Node.js SIMD ops:

| | Windows (CPU) | Mac M4 Pro |
|---|---|---|
| Node.js | v24.12.0 (x86) | v25.9.0 (ARM) |
| SIMD | SSE4/AVX2 | NEON |
| Cores | Consumer laptop | 12P + 4E cores |
| Training | Slow (minutes/epoch) | Fast (seconds/epoch) |

O5: Cross-Environment Transfer

Train on one room, deploy in another:

| Strategy | Implementation |
|---|---|
| Room-invariant features | Normalize CSI by running mean/variance |
| LoRA adapters | Train a 4-rank LoRA per room (ADR-071) — 7.3 KB each |
| Few-shot calibration | 2 min of camera data in new room → fine-tune LoRA only |
| AETHER embeddings | Use contrastive room-independent features (ADR-024) as input |

The LoRA approach is most practical: ship a base model + collect 2 min of calibration data per new room using the laptop camera.

Data Collection Protocol

Recommended collection sessions per room:

| Session | Duration | Activity | People | Total CSI Frames |
|---|---|---|---|---|
| 1. Baseline | 5 min | Empty + 1 person entry/exit | 0-1 | 30,000 |
| 2. Standing poses | 5 min | Stand, arms up/down/sides, turn | 1 | 30,000 |
| 3. Sitting | 5 min | Sit, type, lean, stand up/sit down | 1 | 30,000 |
| 4. Walking | 5 min | Walk paths across room | 1 | 30,000 |
| 5. Mixed | 5 min | Varied activities, transitions | 1 | 30,000 |
| 6. Multi-person | 5 min | 2 people, varied activities | 2 | 30,000 |
| Total | 30 min | | | 180,000 |

At 20-frame windows: 9,000 paired training samples per 30-min session. With augmentation (O2): ~27,000 effective samples.

Camera placement: position laptop so the camera has a clear view of the sensing area. The camera FOV should cover the same space the ESP32 nodes cover.

File Structure

scripts/
  collect-ground-truth.py     # Camera capture + MediaPipe + CSI sync
  align-ground-truth.js       # Time-align CSI windows with camera keypoints
  train-wiflow-supervised.js  # Supervised training pipeline
  eval-wiflow.js              # PCK evaluation on held-out data

data/
  ground-truth/               # Raw camera keypoint captures
    gt-{timestamp}.jsonl
  paired/                     # Aligned CSI + keypoint pairs
    paired-{timestamp}.jsonl

models/
  wiflow-supervised/          # Trained model outputs
    wiflow-v1.safetensors
    wiflow-v1-int8.safetensors
    training-log.json
    eval-report.json

Privacy Considerations

  • Camera frames are processed locally by MediaPipe — no cloud upload
  • Raw video is never saved — only extracted keypoint coordinates are stored
  • The .jsonl ground-truth files contain only [x,y] joint coordinates, not images
  • The trained model runs on CSI only — no camera data leaves the laptop
  • Users can delete data/ground-truth/ after training; the model is self-contained

Consequences

Positive

  • 10-20x accuracy improvement: PCK@20 from 2.5% → 35%+ with real supervision
  • Reuses existing infrastructure: sensing server recording API, ruvllm training, SafeTensors
  • No new hardware: laptop webcam + existing ESP32 nodes
  • Privacy preserved at deployment: camera only needed during 30-min training session
  • Incremental: can improve with more collection sessions + active learning
  • Distributable: trained model weights can be shared on HuggingFace (ADR-070)

Negative

  • Camera placement matters: must see the same area ESP32 nodes sense
  • Single-room models: need LoRA calibration per room (2 min + camera)
  • MediaPipe limitations: occlusion, side views, multiple people reduce keypoint quality
  • Time sync: NTP drift can misalign frames (mitigated by 200ms windows)

Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| MediaPipe keypoints too noisy | Low | Medium | Filter by confidence; MediaPipe is robust indoors |
| Clock drift > 100ms | Low | High | Add handclap sync marker detection |
| Single camera can't see all poses | Medium | Medium | Position camera centrally; collect from 2 angles |
| Model overfits to one room | High | Medium | LoRA adapters + AETHER normalization (O5) |
| Insufficient data (< 5K pairs) | Low | High | Augmentation (O2) + active learning (O4) |

Implementation Plan

| Phase | Task | Effort | Status |
|---|---|---|---|
| P1 | collect-ground-truth.py — camera + MediaPipe capture | 2 hrs | Done |
| P2 | align-ground-truth.js — time alignment + pairing | 1 hr | Done |
| P3 | train-wiflow-supervised.js — supervised training | 3 hrs | Done |
| P4 | eval-wiflow.js — PCK evaluation | 1 hr | Done |
| P5 | ruvector optimizations (O6-O9) | 2 hrs | Done |
| P6 | Mac M4 Pro training via Tailscale (O10) | 1 hr | Done |
| P7 | Data collection session (30 min recording) | 1 hr | Pending |
| P8 | Training + evaluation on real paired data | 30 min | Pending |
| P9 | LoRA cross-room calibration (O5) | 2 hrs | Pending |

Validated Hardware

| Component | Spec | Validated |
|---|---|---|
| Mac Mini camera | 1920x1080, 30fps | Yes — 14/17 keypoints, conf 0.94-1.0 |
| MediaPipe PoseLandmarker | v0.10.33 Tasks API, lite model | Yes — via Tailscale SSH |
| Mac M4 Pro GPU | 16-core, Metal 4, NEON SIMD | Yes — Node.js v25.9.0 |
| Tailscale SSH | LAN-accessible Mac, passwordless | Yes |
| ESP32-S3 CSI | 128 subcarriers, 100Hz | Yes — existing recordings |
| Sensing server recording API | /api/v1/recording/start\|stop | Yes — existing |

Baseline Benchmark

Proxy-pose baseline (no camera supervision, standing skeleton heuristic):

PCK@10:  11.8%
PCK@20:  35.3%
PCK@50:  94.1%
MPJPE:   0.067
Latency: 0.03ms/sample

Per-joint PCK@20: upper body (nose, shoulders, wrists) at 0% — proxy has no spatial accuracy for these. Camera supervision targets these joints specifically.

References

  • WiFlow: arXiv:2602.08661 — WiFi-based pose estimation with TCN + axial attention
  • Wi-Pose (CVPR 2021) — 3D CNN WiFi pose with camera supervision
  • Person-in-WiFi 3D (CVPR 2024) — Deformable attention with camera labels
  • MediaPipe Pose — Google's real-time 33-landmark body pose estimator
  • MetaFi++ (NeurIPS 2023) — Meta-learning cross-modal WiFi sensing