ADR-079: Camera Ground-Truth Training Pipeline

  • Status: Accepted
  • Date: 2026-04-06
  • Deciders: ruv
  • Relates to: ADR-072 (WiFlow Architecture), ADR-070 (Self-Supervised Pretraining), ADR-071 (ruvllm Training Pipeline), ADR-024 (AETHER Contrastive), ADR-064 (Multimodal Ambient Intelligence), ADR-075 (MinCut Person Separation)

Context

WiFlow (ADR-072) currently trains without ground-truth pose labels, using proxy poses generated from presence/motion heuristics. This produces a PCK@20 of only 2.5% — far below the 30-50% achievable with supervised training. The fundamental bottleneck is the absence of spatial keypoint labels.

Academic WiFi pose estimation systems (Wi-Pose, Person-in-WiFi 3D, MetaFi++) all train with synchronized camera ground truth and achieve PCK@20 of 40-85%. They discard the camera at deployment — the camera is a training-time teacher, not a runtime dependency.

ADR-064 already identified this: "Record CSI + mmWave while performing signs with a camera as ground truth, then deploy camera-free." This ADR specifies the implementation.

Current Training Pipeline Gap

Current:  CSI amplitude → WiFlow → 17 keypoints (proxy-supervised, PCK@20 = 2.5%)
                                    ↑
                            Heuristic proxies:
                            - Standing skeleton when presence > 0.3
                            - Limb perturbation from motion energy
                            - No spatial accuracy

Target Pipeline

Training: CSI amplitude ──→ WiFlow ──→ 17 keypoints (camera-supervised, PCK@20 target: 35%+)
                                        ↑
          Laptop camera ──→ MediaPipe ──→ 17 COCO keypoints (ground truth)
                                        (time-synchronized, 30 fps)

Deploy:   CSI amplitude ──→ WiFlow ──→ 17 keypoints (camera-free, trained model only)

Decision

Build a camera ground-truth collection and training pipeline using the laptop webcam as a teacher signal. The camera is used only during training data collection and is not required at deployment.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    Data Collection Phase                         │
│                                                                 │
│  ESP32-S3 nodes ──UDP──→ Sensing Server ──→ CSI frames (.jsonl) │
│                              ↑ time sync                        │
│  Laptop Camera ──→ MediaPipe Pose ──→ Keypoints (.jsonl)        │
│                              ↑                                  │
│                     collect-ground-truth.py                      │
│                     (single orchestrator)                        │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    Training Phase                                │
│                                                                 │
│  Paired dataset: { csi_window[128,20], keypoints[17,2], conf }  │
│         ↓                                                       │
│  train-wiflow-supervised.js                                     │
│    Phase 1: Contrastive pretrain (ADR-072, reuse)               │
│    Phase 2: Supervised keypoint regression (NEW)                │
│    Phase 3: Fine-tune with bone constraints + confidence        │
│         ↓                                                       │
│  WiFlow model (1.8M params) → SafeTensors export                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    Deployment (camera-free)                      │
│                                                                 │
│  ESP32-S3 CSI → Sensing Server → WiFlow inference → 17 keypoints│
│  (No camera. Trained model runs on CSI input only.)             │
└─────────────────────────────────────────────────────────────────┘

Component 1: scripts/collect-ground-truth.py

Single Python script that orchestrates synchronized capture from the laptop camera and the ESP32 CSI stream.

Dependencies: mediapipe, opencv-python, requests (all pip-installable, no GPU)

Capture flow:

```python
# Pseudocode (simplified from scripts/collect-ground-truth.py)
import time

import cv2
import mediapipe as mp
import requests

camera = cv2.VideoCapture(0)              # Laptop webcam
sensing_api = "http://localhost:3000"     # Sensing server
mp_pose = mp.solutions.pose.Pose()        # MediaPipe Pose (33 landmarks)

# Start CSI recording via existing API
requests.post(f"{sensing_api}/api/v1/recording/start")

while recording:
    ok, frame = camera.read()
    if not ok:
        continue
    t = time.time_ns()                    # Nanosecond timestamp

    # MediaPipe Pose: 33 landmarks → map to 17 COCO keypoints
    result = mp_pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks is None:
        continue                          # No person detected in this frame
    keypoints_17, visibility_17 = map_mediapipe_to_coco(result.pose_landmarks.landmark)
    confidence = sum(visibility_17) / len(visibility_17)

    # Write to ground-truth JSONL (one line per frame)
    write_jsonl({
        "ts_ns": t,
        "keypoints": keypoints_17,        # [[x, y], ...] normalized to [0, 1]
        "confidence": confidence,         # 0-1, used for loss weighting
        "n_visible": sum(v > 0.5 for v in visibility_17),
    })

    # Optional: show live preview with skeleton overlay
    if preview:
        draw_skeleton(frame, keypoints_17)
        cv2.imshow("Ground Truth", frame)
        cv2.waitKey(1)

# Stop CSI recording
requests.post(f"{sensing_api}/api/v1/recording/stop")
```

MediaPipe → COCO keypoint mapping:

| COCO Index | Joint | MediaPipe Index |
|---|---|---|
| 0 | Nose | 0 |
| 1 | Left Eye | 2 |
| 2 | Right Eye | 5 |
| 3 | Left Ear | 7 |
| 4 | Right Ear | 8 |
| 5 | Left Shoulder | 11 |
| 6 | Right Shoulder | 12 |
| 7 | Left Elbow | 13 |
| 8 | Right Elbow | 14 |
| 9 | Left Wrist | 15 |
| 10 | Right Wrist | 16 |
| 11 | Left Hip | 23 |
| 12 | Right Hip | 24 |
| 13 | Left Knee | 25 |
| 14 | Right Knee | 26 |
| 15 | Left Ankle | 27 |
| 16 | Right Ankle | 28 |
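
A minimal Python sketch of this mapping as it could appear in collect-ground-truth.py. The table above is the source of truth; the helper name map_mediapipe_to_coco matches the pseudocode, but its exact signature here is an assumption.

```python
# 17 MediaPipe landmark indices in COCO keypoint order (see the table above).
COCO_FROM_MEDIAPIPE = [0, 2, 5, 7, 8, 11, 12, 13, 14, 15, 16, 23, 24, 25, 26, 27, 28]

def map_mediapipe_to_coco(landmarks):
    """landmarks: sequence of 33 MediaPipe landmarks with .x, .y, .visibility fields.
    Returns ([17][x, y] normalized coordinates, [17] visibility scores)."""
    keypoints = [[landmarks[i].x, landmarks[i].y] for i in COCO_FROM_MEDIAPIPE]
    visibility = [landmarks[i].visibility for i in COCO_FROM_MEDIAPIPE]
    return keypoints, visibility
```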

Component 2: Time Alignment (scripts/align-ground-truth.js)

CSI frames arrive at ~100 Hz with server-side timestamps. Camera keypoints arrive at ~30 fps with client-side timestamps. Alignment is needed because:

  1. Camera and sensing server clocks differ (typically < 50ms on LAN)
  2. CSI is aggregated into 20-frame windows for WiFlow input
  3. Ground-truth keypoints must be averaged over the same window

Alignment algorithm:

For each CSI window W_i (20 frames, ~200ms at 100Hz):
  t_start = W_i.first_frame.timestamp
  t_end   = W_i.last_frame.timestamp

  # Find all camera keypoints within this time window
  matching_keypoints = [k for k in camera_data if t_start <= k.ts <= t_end]

  if len(matching_keypoints) >= 3:   # At least 3 camera frames per window
    # Average keypoints, weighted by confidence
    avg_keypoints = weighted_mean(matching_keypoints, weights=confidences)
    avg_confidence = mean(confidences)

    paired_dataset.append({
      csi_window: W_i.amplitudes,    # [128, 20] float32
      keypoints: avg_keypoints,       # [17, 2] float32
      confidence: avg_confidence,     # scalar
      n_camera_frames: len(matching_keypoints),
    })

Clock sync strategy:

  • NTP is sufficient (< 20ms error on LAN)
  • The 200ms CSI window is 10x larger than typical clock drift
  • For tighter sync: use a handclap/jump as a sync marker — visible spike in both CSI motion energy and camera skeleton velocity. Auto-detect and align.
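
A minimal sketch of the sync-marker idea, assuming per-frame CSI motion energy and camera skeleton velocity have already been computed; the function and parameter names are illustrative, not part of align-ground-truth.js.

```python
import numpy as np

def estimate_clock_offset(csi_ts_ns, csi_energy, cam_ts_ns, cam_velocity,
                          max_offset_ms=500, step_ms=5):
    """Return the camera clock offset (ns) that best aligns the handclap/jump spike
    in CSI motion energy with the spike in camera skeleton velocity."""
    grid = np.arange(min(csi_ts_ns), max(csi_ts_ns), step_ms * 1e6)
    csi_series = np.interp(grid, csi_ts_ns, csi_energy)
    best_offset, best_score = 0, -np.inf
    for offset_ms in range(-max_offset_ms, max_offset_ms + 1, step_ms):
        shifted = np.interp(grid, np.asarray(cam_ts_ns) + offset_ms * 1e6, cam_velocity)
        score = np.corrcoef(csi_series, shifted)[0, 1]    # normalized cross-correlation
        if score > best_score:
            best_offset, best_score = offset_ms * 1e6, score
    return best_offset   # add to camera timestamps before windowed alignment
```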

Output: data/recordings/paired-{timestamp}.jsonl — one line per paired sample:

```json
{"csi": [128x20 flat], "kp": [[0.45,0.12], ...], "conf": 0.92, "ts": 1775300000000}
```

Component 3: Supervised Training (scripts/train-wiflow-supervised.js)

Extends the existing train-ruvllm.js pipeline with a supervised phase.

Phase 1: Contrastive Pretrain (reuse ADR-072)

  • Same as existing: temporal + cross-node triplets
  • Learns CSI representation without labels
  • 50 epochs, ~5 min on laptop

Phase 2: Supervised Keypoint Regression (NEW)

  • Load paired dataset from Component 2
  • Loss: confidence-weighted SmoothL1 on keypoints
L_supervised = (1/N) * sum_i [ conf_i * SmoothL1(pred_i, gt_i, beta=0.05) ]
  • Only train on samples where conf > 0.5 (discard frames where MediaPipe lost tracking)
  • Learning rate: 1e-4 with cosine decay
  • 200 epochs, ~15 min on laptop CPU (1.8M params, no GPU needed)
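
The Phase 2 loss written out as a numpy sketch. The training script itself is JavaScript, so this only restates the formula above.

```python
import numpy as np

def smooth_l1(pred, gt, beta=0.05):
    """Elementwise SmoothL1: quadratic below beta, linear above."""
    diff = np.abs(pred - gt)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)

def supervised_loss(pred_kp, gt_kp, conf, beta=0.05):
    """pred_kp, gt_kp: [N, 17, 2] keypoints; conf: [N] MediaPipe confidence per sample."""
    per_sample = smooth_l1(pred_kp, gt_kp, beta).mean(axis=(1, 2))   # [N]
    return float(np.mean(conf * per_sample))
```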

Phase 3: Refinement with Bone Constraints

  • Fine-tune with combined loss:
L = L_supervised + 0.3 * L_bone + 0.1 * L_temporal

L_bone     = (1/14) * sum_b (bone_len_b - prior_b)^2   # ADR-072 bone priors
L_temporal = SmoothL1(kp_t, kp_{t-1})                   # Temporal smoothness
  • 50 epochs at lower LR (1e-5)
  • Tighten bone constraint weight from 0.3 → 0.5 over epochs
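
A sketch of the Phase 3 combined loss, reusing smooth_l1 and supervised_loss from the Phase 2 sketch. The BONE_PAIRS list and the 14 bone-length priors are assumed to follow ADR-072's skeleton definition.

```python
import numpy as np

# 14 bones as (joint_a, joint_b) COCO index pairs — an assumed ADR-072-style skeleton.
BONE_PAIRS = [(5, 7), (7, 9), (6, 8), (8, 10),           # arms
              (11, 13), (13, 15), (12, 14), (14, 16),     # legs
              (5, 6), (11, 12), (5, 11), (6, 12),         # torso
              (0, 5), (0, 6)]                             # nose to shoulders

def refinement_loss(pred_kp, prev_kp, gt_kp, conf, bone_priors,
                    w_bone=0.3, w_temporal=0.1):
    """pred_kp, prev_kp, gt_kp: [N, 17, 2]; bone_priors: [14] expected bone lengths."""
    l_sup = supervised_loss(pred_kp, gt_kp, conf)
    bone_len = np.stack([np.linalg.norm(pred_kp[:, a] - pred_kp[:, b], axis=-1)
                         for a, b in BONE_PAIRS], axis=1)            # [N, 14]
    l_bone = float(np.mean((bone_len - bone_priors) ** 2))
    l_temporal = float(np.mean(smooth_l1(pred_kp, prev_kp)))
    return l_sup + w_bone * l_bone + w_temporal * l_temporal
```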

Phase 4: Quantization + Export

  • Reuse ruvllm TurboQuant: float32 → int8 (4x smaller, ~881 KB)
  • Export via SafeTensors for cross-platform deployment
  • Validate quantized model PCK@20 within 2% of full-precision

Component 4: Evaluation Script (scripts/eval-wiflow.js)

Measure actual PCK@20 using held-out paired data (20% split).

PCK@k = (1/N) * sum_i [ (||pred_i - gt_i|| < k * torso_length) ? 1 : 0 ]

Metrics reported:

| Metric | Description | Target |
|---|---|---|
| PCK@20 | % of keypoints within 20% torso length | > 35% |
| PCK@50 | % within 50% torso length | > 60% |
| MPJPE | Mean per-joint position error (pixels) | < 40px |
| Per-joint PCK | Breakdown by joint (wrists are hardest) | Report all 17 |
| Inference latency | Single window prediction time | < 50ms |
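
A numpy sketch of PCK@k as defined above. Taking the torso length as the ground-truth left-shoulder-to-right-hip distance is an assumption; eval-wiflow.js may normalize differently.

```python
import numpy as np

def pck(pred_kp, gt_kp, k=0.20):
    """pred_kp, gt_kp: [N, 17, 2]. Fraction of keypoints within k * torso length of ground truth."""
    torso = np.linalg.norm(gt_kp[:, 5] - gt_kp[:, 12], axis=-1)      # [N] shoulder-to-hip length
    dist = np.linalg.norm(pred_kp - gt_kp, axis=-1)                  # [N, 17]
    return float(np.mean(dist < k * torso[:, None]))
```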

Optimization Strategy

O1: Curriculum Learning

Train easy poses first, hard poses later:

| Stage | Epochs | Data Filter | Rationale |
|---|---|---|---|
| 1 | 50 | conf > 0.9, standing only | Establish stable skeleton baseline |
| 2 | 50 | conf > 0.7, low motion | Add sitting, subtle movements |
| 3 | 50 | conf > 0.5, all poses | Full dataset including occlusions |
| 4 | 50 | All data, with augmentation | Robustness via noise injection |
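
One way to express these stage filters over the paired dataset. The "motion" field, used here as a stand-in for the "standing only" / "low motion" criteria, is an assumption.

```python
# Curriculum filters over paired samples (dicts with at least "conf" and a motion-energy field).
STAGE_FILTERS = [
    lambda s: s["conf"] > 0.9 and s["motion"] < 0.1,   # Stage 1: near-static, high confidence
    lambda s: s["conf"] > 0.7 and s["motion"] < 0.3,   # Stage 2: low motion
    lambda s: s["conf"] > 0.5,                         # Stage 3: all poses
    lambda s: True,                                    # Stage 4: everything, plus augmentation
]

def stage_dataset(samples, stage):
    return [s for s in samples if STAGE_FILTERS[stage](s)]
```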

O2: Data Augmentation (CSI domain)

Augment CSI windows to increase effective dataset size without collecting more data:

| Augmentation | Implementation | Expected Gain |
|---|---|---|
| Time shift | Roll CSI window by ±2 frames | +30% data |
| Amplitude noise | Gaussian noise, sigma=0.02 | Robustness |
| Subcarrier dropout | Zero 10% of subcarriers randomly | Robustness |
| Temporal flip | Reverse window + reverse keypoint velocity | +100% data |
| Multi-node mix | Swap node CSI, keep same-time keypoints | Cross-node generalization |
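
Numpy sketches of the first three augmentations on a csi window of shape [128, 20]; the JavaScript training script would apply the equivalents.

```python
import numpy as np

rng = np.random.default_rng()

def time_shift(csi, max_shift=2):
    """Roll the window along the time axis by a random ±max_shift frames."""
    return np.roll(csi, rng.integers(-max_shift, max_shift + 1), axis=1)

def amplitude_noise(csi, sigma=0.02):
    """Add Gaussian noise to the CSI amplitudes."""
    return csi + rng.normal(0.0, sigma, size=csi.shape)

def subcarrier_dropout(csi, p=0.10):
    """Zero a random p fraction of subcarriers (rows)."""
    mask = rng.random(csi.shape[0]) >= p
    return csi * mask[:, None]
```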

O3: Knowledge Distillation from MediaPipe

Instead of raw keypoint regression, distill MediaPipe's confidence and heatmap information:

L_distill = KL_div(softmax(wifi_heatmap / T), softmax(camera_heatmap / T))
  • Temperature T=4 for soft targets (transfers inter-joint relationships)
  • WiFlow predicts a 17-channel heatmap [17, H, W] instead of direct [17, 2]
  • Argmax for final keypoint extraction
  • Trade-off: Adds ~200K params for heatmap decoder, but improves spatial precision
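
A sketch of the distillation term, assuming both models expose per-joint heatmap logits of shape [17, H, W]; the argument order follows the formula above.

```python
import numpy as np

def softened(logits, T):
    """Temperature-softened softmax over a flattened heatmap."""
    x = logits.ravel() / T
    e = np.exp(x - x.max())
    return e / e.sum()

def distill_loss(wifi_heatmap, camera_heatmap, T=4.0):
    """Mean KL divergence between softened WiFi and camera heatmaps over 17 joints."""
    loss = 0.0
    for j in range(wifi_heatmap.shape[0]):
        p = softened(wifi_heatmap[j], T)
        q = softened(camera_heatmap[j], T)
        loss += float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
    return loss / wifi_heatmap.shape[0]
```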

O4: Active Learning Loop

Identify which poses the model is worst at and collect more data for those:

1. Train initial model on first collection session
2. Run inference on new CSI data, compute prediction entropy
3. Flag high-entropy windows (model is uncertain)
4. During next collection, the preview overlay highlights these moments:
   "Hold this pose — model needs more examples"
5. Re-train with augmented dataset

Expected: accuracy saturates after 2-3 active learning iterations.

O6: Subcarrier Selection (ruvector-solver)

Variance-based top-K subcarrier selection, equivalent to ruvector-solver's sparse interpolation (114→56). Removes noise/static subcarriers before training:

For each subcarrier d in [0, dim):
  variance[d] = mean over samples of temporal_variance(csi[d, :])
Select top-K by variance (K = dim * 0.5)

Validated: 128 → 56 subcarriers (56% input reduction), proportional model size reduction.
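
The O6 selection in numpy over a stacked dataset of CSI windows [num_samples, dim, window].

```python
import numpy as np

def select_subcarriers(csi, keep_ratio=0.5):
    """csi: [num_samples, dim, window]. Keep the top-K subcarriers by mean temporal variance."""
    variance = csi.var(axis=2).mean(axis=0)          # [dim] temporal variance averaged over samples
    k = int(csi.shape[1] * keep_ratio)
    keep = np.sort(np.argsort(variance)[-k:])        # top-K indices, preserving subcarrier order
    return csi[:, keep, :], keep
```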

O7: Attention-Weighted Subcarriers (ruvector-attention)

Compute per-subcarrier attention weights based on temporal energy correlation with ground-truth keypoint motion. High-energy subcarriers that covary with skeleton movement get amplified:

For each subcarrier d:
  energy[d] = sum of squared first-differences over time
  weight[d] = softmax(energy, temperature=0.1)
Apply: csi[d, :] *= weight[d] * dim  (mean weight = 1)

Validated: Top-5 attention subcarriers identified automatically per dataset.
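
The O7 weighting in numpy, following the pseudocode above (energy-based softmax with temperature 0.1, rescaled so the mean weight is 1).

```python
import numpy as np

def attention_weight_subcarriers(csi, temperature=0.1):
    """csi: [num_samples, dim, window]. Amplify subcarriers with high first-difference energy."""
    energy = (np.diff(csi, axis=2) ** 2).sum(axis=2).mean(axis=0)    # [dim]
    z = (energy - energy.max()) / temperature
    weight = np.exp(z) / np.exp(z).sum()                             # softmax over subcarriers
    scaled = weight * csi.shape[1]                                   # mean weight = 1
    return csi * scaled[None, :, None], weight
```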

O8: Stoer-Wagner MinCut Person Separation (ruvector-mincut / ADR-075)

JS implementation of the Stoer-Wagner algorithm for person separation in CSI, equivalent to DynamicPersonMatcher in wifi-densepose-train/src/metrics.rs. Builds a subcarrier correlation graph and finds the minimum cut to identify person-specific subcarrier clusters:

1. Build dim×dim Pearson correlation matrix across subcarriers
2. Run Stoer-Wagner min-cut on correlation graph
3. Partition subcarriers into person-specific groups
4. Train per-partition models for multi-person scenarios

Validated: Stoer-Wagner executes on 56-dim graph, identifies partition boundaries.
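
A sketch of steps 1-3, using networkx's stoer_wagner as a stand-in for the project's JS implementation; networkx itself is an assumption here.

```python
import numpy as np
import networkx as nx

def person_partitions(csi):
    """csi: [num_samples, dim, window]. Split subcarriers via min-cut on their correlation graph."""
    flat = csi.transpose(1, 0, 2).reshape(csi.shape[1], -1)      # [dim, num_samples * window]
    corr = np.corrcoef(flat)                                      # dim x dim Pearson correlation
    g = nx.Graph()
    dim = corr.shape[0]
    for i in range(dim):
        for j in range(i + 1, dim):
            g.add_edge(i, j, weight=float(abs(corr[i, j])))       # non-negative edge weights
    cut_value, (part_a, part_b) = nx.stoer_wagner(g)
    return sorted(part_a), sorted(part_b)
```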

O9: Multi-SPSA Gradient Estimation

Average over K=3 random perturbation directions per gradient step. Reduces variance by sqrt(K) = 1.73x compared to single SPSA, at 3x forward pass cost (net win for convergence quality):

For k in 1..K:
  delta_k = random ±1 per parameter
  grad_k = (loss(w + eps*delta_k) - loss(w - eps*delta_k)) / (2*eps*delta_k)
grad = mean(grad_1, ..., grad_K)
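
The O9 estimator as a numpy sketch; loss_fn is any scalar loss over a flat parameter vector, and eps is an assumed perturbation scale.

```python
import numpy as np

def multi_spsa_grad(loss_fn, w, eps=1e-3, K=3, rng=None):
    """Average SPSA gradient estimates over K random ±1 (Rademacher) perturbation directions."""
    rng = rng or np.random.default_rng()
    grads = []
    for _ in range(K):
        delta = rng.choice([-1.0, 1.0], size=w.shape)
        g = (loss_fn(w + eps * delta) - loss_fn(w - eps * delta)) / (2 * eps * delta)
        grads.append(g)
    return np.mean(grads, axis=0)
```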

O10: Mac M4 Pro Training via Tailscale

Training runs on Mac Mini M4 Pro (16-core GPU, ARM NEON SIMD) via Tailscale SSH, using ruvllm's native Node.js SIMD ops:

| | Windows (CPU) | Mac M4 Pro |
|---|---|---|
| Node.js | v24.12.0 (x86) | v25.9.0 (ARM) |
| SIMD | SSE4/AVX2 | NEON |
| Cores | Consumer laptop | 12P + 4E cores |
| Training | Slow (minutes/epoch) | Fast (seconds/epoch) |

O5: Cross-Environment Transfer

Train on one room, deploy in another:

| Strategy | Implementation |
|---|---|
| Room-invariant features | Normalize CSI by running mean/variance |
| LoRA adapters | Train a 4-rank LoRA per room (ADR-071) — 7.3 KB each |
| Few-shot calibration | 2 min of camera data in new room → fine-tune LoRA only |
| AETHER embeddings | Use contrastive room-independent features (ADR-024) as input |

The LoRA approach is most practical: ship a base model + collect 2 min of calibration data per new room using the laptop camera.

Data Collection Protocol

Recommended collection sessions per room:

| Session | Duration | Activity | People | Total CSI Frames |
|---|---|---|---|---|
| 1. Baseline | 5 min | Empty + 1 person entry/exit | 0-1 | 30,000 |
| 2. Standing poses | 5 min | Stand, arms up/down/sides, turn | 1 | 30,000 |
| 3. Sitting | 5 min | Sit, type, lean, stand up/sit down | 1 | 30,000 |
| 4. Walking | 5 min | Walk paths across room | 1 | 30,000 |
| 5. Mixed | 5 min | Varied activities, transitions | 1 | 30,000 |
| 6. Multi-person | 5 min | 2 people, varied activities | 2 | 30,000 |
| Total | 30 min | | | 180,000 |

At 20-frame windows: 9,000 paired training samples per 30-min session. With augmentation (O2): ~27,000 effective samples.

Camera placement: position laptop so the camera has a clear view of the sensing area. The camera FOV should cover the same space the ESP32 nodes cover.

File Structure

scripts/
  collect-ground-truth.py     # Camera capture + MediaPipe + CSI sync
  align-ground-truth.js       # Time-align CSI windows with camera keypoints
  train-wiflow-supervised.js  # Supervised training pipeline
  eval-wiflow.js              # PCK evaluation on held-out data

data/
  ground-truth/               # Raw camera keypoint captures
    gt-{timestamp}.jsonl
  paired/                     # Aligned CSI + keypoint pairs
    paired-{timestamp}.jsonl

models/
  wiflow-supervised/          # Trained model outputs
    wiflow-v1.safetensors
    wiflow-v1-int8.safetensors
    training-log.json
    eval-report.json

Privacy Considerations

  • Camera frames are processed locally by MediaPipe — no cloud upload
  • Raw video is never saved — only extracted keypoint coordinates are stored
  • The .jsonl ground-truth files contain only [x,y] joint coordinates, not images
  • The trained model runs on CSI only — no camera data leaves the laptop
  • Users can delete data/ground-truth/ after training; the model is self-contained

Consequences

Positive

  • 10-20x accuracy improvement: PCK@20 from 2.5% → 35%+ with real supervision
  • Reuses existing infrastructure: sensing server recording API, ruvllm training, SafeTensors
  • No new hardware: laptop webcam + existing ESP32 nodes
  • Privacy preserved at deployment: camera only needed during 30-min training session
  • Incremental: can improve with more collection sessions + active learning
  • Distributable: trained model weights can be shared on HuggingFace (ADR-070)

Negative

  • Camera placement matters: must see the same area ESP32 nodes sense
  • Single-room models: need LoRA calibration per room (2 min + camera)
  • MediaPipe limitations: occlusion, side views, multiple people reduce keypoint quality
  • Time sync: NTP drift can misalign frames (mitigated by 200ms windows)

Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| MediaPipe keypoints too noisy | Low | Medium | Filter by confidence; MediaPipe is robust indoors |
| Clock drift > 100ms | Low | High | Add handclap sync marker detection |
| Single camera can't see all poses | Medium | Medium | Position camera centrally; collect from 2 angles |
| Model overfits to one room | High | Medium | LoRA adapters + AETHER normalization (O5) |
| Insufficient data (< 5K pairs) | Low | High | Augmentation (O2) + active learning (O4) |

Implementation Plan

| Phase | Task | Effort | Status |
|---|---|---|---|
| P1 | collect-ground-truth.py — camera + MediaPipe capture | 2 hrs | Done |
| P2 | align-ground-truth.js — time alignment + pairing | 1 hr | Done |
| P3 | train-wiflow-supervised.js — supervised training | 3 hrs | Done |
| P4 | eval-wiflow.js — PCK evaluation | 1 hr | Done |
| P5 | ruvector optimizations (O6-O9) | 2 hrs | Done |
| P6 | Mac M4 Pro training via Tailscale (O10) | 1 hr | Done |
| P7 | Data collection session (30 min recording) | 1 hr | Pending |
| P8 | Training + evaluation on real paired data | 30 min | Pending |
| P9 | LoRA cross-room calibration (O5) | 2 hrs | Pending |

Validated Hardware

| Component | Spec | Validated |
|---|---|---|
| Mac Mini camera | 1920x1080, 30fps | Yes — 14/17 keypoints, conf 0.94-1.0 |
| MediaPipe PoseLandmarker | v0.10.33 Tasks API, lite model | Yes — via Tailscale SSH |
| Mac M4 Pro GPU | 16-core, Metal 4, NEON SIMD | Yes — Node.js v25.9.0 |
| Tailscale SSH | LAN-accessible Mac, passwordless | Yes |
| ESP32-S3 CSI | 128 subcarriers, 100Hz | Yes — existing recordings |
| Sensing server recording API | /api/v1/recording/start\|stop | Yes — existing |

Baseline Benchmark

Proxy-pose baseline (no camera supervision, standing skeleton heuristic):

PCK@10:  11.8%
PCK@20:  35.3%
PCK@50:  94.1%
MPJPE:   0.067
Latency: 0.03ms/sample

Per-joint PCK@20: upper body (nose, shoulders, wrists) at 0% — proxy has no spatial accuracy for these. Camera supervision targets these joints specifically.

References

  • WiFlow: arXiv:2602.08661 — WiFi-based pose estimation with TCN + axial attention
  • Wi-Pose (CVPR 2021) — 3D CNN WiFi pose with camera supervision
  • Person-in-WiFi 3D (CVPR 2024) — Deformable attention with camera labels
  • MediaPipe Pose — Google's real-time 33-landmark body pose estimator
  • MetaFi++ (NeurIPS 2023) — Meta-learning cross-modal WiFi sensing