WARP-RM
A Warp-Augmented Relative Progress Reward Model for Data Curation

Anonymous Author(s)

TL;DR: WARP learns a self-supervised, dense, signed relative progress signal from raw demonstrations by training on time-warped playback. Reweighting behavior cloning with WARP-RM yields T-shirt-folding policies with up to 18× higher throughput than vanilla BC trained on the same data.
WARP-RM Progress Velocity Signal: predicted velocity \(\hat{v}_{t}\) plotted against episode time \(t\) (s).

Episodes A and B are real teleoperated episodes. The curve shows the dense, per-frame predicted progress velocity \(\hat{v}_{t}\), which is signed: positive for forward task progress and negative for regression.

Data

Our policy-training data is drawn from a single corpus of around 125 hours of successful, unannotated human-teleoperated T-shirt-folding demonstrations. On this task, episode length is a coarse proxy for execution efficiency: longer episodes tend to contain more hesitations, retries, and recoveries. To evaluate robustness as progressively more inefficient behavior is admitted into training, we define three nested, length-filtered tiers: \(\mathcal{D}_{1}\) (≤ 60s): 1,975 episodes (29.1 hours), \(\mathcal{D}_{2}\) (≤ 90s): 3,546 episodes (61.5 hours), and \(\mathcal{D}_{3}\) (≤ 120s): 5,718 episodes (124.7 hours). Policies are trained on the same underlying demonstrations, either with uniform weighting (vanilla BC) or WARP-based progress reweighting. A single WARP-RM model is used across all tiers: it is trained once on a fixed reference subset \(\mathcal{D}_{\mathrm{RM}}\) — the shortest demonstrations (≤ 59s, 1,807 episodes) — providing a clean reference signal for the canonical execution pace (\(\hat{v} = 1\)). For baseline comparisons, SARM requires human annotations, so an annotated supplement \(\mathcal{D}_{A}\) (929 expert demonstrations, 15.8 hours) is added, forming the augmented datasets \(\mathcal{D}_{4} = \mathcal{D}_{1} \cup \mathcal{D}_{A}\) and \(\mathcal{D}_{5} = \mathcal{D}_{2} \cup \mathcal{D}_{A}\); WARP-RM and all other baselines treat \(\mathcal{D}_{A}\) as unannotated.
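The nested, length-filtered tiers can be sketched as a simple filter over episode durations. The cutoffs come from the text above; the function name `build_nested_tiers` and the episode ids and durations are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Tier cutoffs in seconds, as stated in the text above.
TIER_CUTOFFS_S = {"D1": 60.0, "D2": 90.0, "D3": 120.0}


def build_nested_tiers(durations_s: Dict[str, float]) -> Dict[str, List[str]]:
    """Map each tier name to the sorted episode ids whose duration is within its cutoff."""
    return {
        tier: sorted(ep for ep, d in durations_s.items() if d <= cutoff)
        for tier, cutoff in TIER_CUTOFFS_S.items()
    }


# Hypothetical episode ids and durations (not from the real corpus).
example = {"ep_a": 48.2, "ep_b": 75.0, "ep_c": 118.9, "ep_d": 131.4}
tiers = build_nested_tiers(example)
```

Because every cutoff is applied to the same pool, each tier contains all episodes of the tier before it, and episodes longer than 120s fall outside every tier.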

Episode-length distribution of the demonstration corpus. The main unannotated corpus (\(\mathcal{D}_{1}\)–\(\mathcal{D}_{3}\), blue; 5,718 episodes), with the annotated supplement \(\mathcal{D}_{A}\) (orange; 929 episodes) stacked on top. Dashed lines mark the nested tier cutoffs \(\mathcal{D}_{1}\) (≤ 60s), \(\mathcal{D}_{2}\) (≤ 90s), and \(\mathcal{D}_{3}\) (≤ 120s).

Real-World Policy Rollouts

Across 380 real-world trials of T-shirt folding from a crumpled start, WARP-BC completes at least as many folds as vanilla BC and the other baselines trained on the same demonstration corpus, and completes them faster. Each video shows all 20 evaluation trials for one training tier, played simultaneously in a 4×5 grid; models are trained on each of the three demonstration tiers from the paper.

Vanilla BC
20/20 successes, 113.8s mean time-to-completion
WARP-BC
20/20 successes, 63.9s mean time-to-completion
\(\mathcal{D}_{1}\): model trained on demonstrations ≤ 60s — the cleanest, fastest tier.

Videos played at 1× speed.

Time-to-completion distribution for successful trials across \(\mathcal{D}_{1}\), \(\mathcal{D}_{2}\), and \(\mathcal{D}_{3}\). On every tier where both methods succeed, WARP-BC completes folds faster than vanilla BC. As the training corpus admits more suboptimal demonstrations (\(\mathcal{D}_{1}\rightarrow\mathcal{D}_{3}\)), vanilla BC's success count collapses from 20/20 to 0/20, while WARP-BC stays robust. The solid horizontal bar marks the mean.

Method

Prior progress models are trained to predict how far along a trajectory is — the fraction of the demonstration that has elapsed so far. This is a noisy learning target, because the same elapsed fraction in two different demonstrations can correspond to very different amounts of actual task progress. WARP-RM instead learns a relative signal — how fast, and in which direction, the task is advancing — using a self-supervised target obtained by replaying successful demonstrations at non-uniform velocities.

1. Time-warp playback → self-supervised progress labels

Resample a successful trajectory with smoothly varying playback speeds (AR(1) in log-space) and Poisson-sampled reversals. The signed source-frame displacement from the window's first frame is the per-frame progress label — no human annotation required.
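A minimal sketch of such a time-warp sampler follows, under several assumptions that are not stated on this page: the AR(1) coefficient `rho`, the innovation scale `sigma`, and the per-step `reversal_rate` are hypothetical values; reversals are drawn as one Bernoulli trial per step, a discrete-time approximation of a Poisson process; and labels are divided by `window - 1` so that 1× playback reaches a label of 1 at the last frame.

```python
import math
import random
from typing import List, Tuple


def sample_warp(
    num_source_frames: int,
    window: int = 32,
    rho: float = 0.9,             # assumed AR(1) coefficient on log-speed
    sigma: float = 0.3,           # assumed innovation std in log-space
    reversal_rate: float = 0.05,  # assumed per-step reversal probability
    seed: int = 0,
) -> Tuple[List[int], List[float]]:
    """Return (source frame indices, progress labels) for one warped window."""
    rng = random.Random(seed)
    pos = float(rng.randrange(max(1, num_source_frames - window + 1)))
    log_speed, direction = 0.0, 1.0
    positions = [pos]
    for _ in range(window - 1):
        # AR(1) update keeps the playback speed smooth and strictly positive.
        log_speed = rho * log_speed + sigma * rng.gauss(0.0, 1.0)
        if rng.random() < reversal_rate:
            direction = -direction  # flip between forward and reverse playback
        pos += direction * math.exp(log_speed)
        pos = min(max(pos, 0.0), float(num_source_frames - 1))  # stay in the episode
        positions.append(pos)
    frame_idx = [int(round(p)) for p in positions]
    # Signed source-frame displacement from the window's first frame.
    labels = [(f - frame_idx[0]) / (window - 1) for f in frame_idx]
    return frame_idx, labels


frames, labels = sample_warp(num_source_frames=1000)
```

With `sigma=0` and `reversal_rate=0`, the sampler reduces to ordinary 1× playback, which is a convenient sanity check for the label scale.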

2. Predict per-frame velocity from images

A frozen DINOv3 ViT-B/16 encoder followed by a 12-layer bidirectional transformer head outputs a per-frame categorical distribution over cumulative progress. Its temporal derivative gives the velocity \(\hat{v}_{t}\):

\(\hat{v}\) ≈ 1 → expert pace, \(\hat{v}\) ≈ 0 → stagnating, \(\hat{v}\) < 0 → regressing

WARP-RM architecture: frozen DINOv3, temporal differences, transformer, and a 30-bin categorical head.
WARP Reward Model. A window of \(N{=}32\) RGB frames is encoded by a frozen DINOv3, augmented with per-frame temporal differences, projected to the model dimension, and processed by a bidirectional transformer. The head emits a 30-bin categorical distribution over cumulative progress at each input frame. Per-step intra-window velocities \(v_j = (N{-}1)(\hat{y}_j - \hat{y}_{j-1})\) are averaged across overlapping sliding windows to give the dense velocity curve \(\hat{v}_{t}\).
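The velocity computation in the caption can be sketched as follows. This assumes that \(\hat{y}_j\) is the expected value of the per-frame categorical distribution and that window \(k\) starts at frame `k * stride`; the bin centers, the stride, and all function names are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Sequence


def expected_progress(probs: Sequence[float], bin_centers: Sequence[float]) -> float:
    """Collapse one categorical distribution to a scalar (assumed: its expectation)."""
    return sum(p * c for p, c in zip(probs, bin_centers))


def window_velocities(y_hat: Sequence[float]) -> List[float]:
    """v_j = (N - 1) * (y_j - y_{j-1}) for j = 1..N-1 within one window."""
    n = len(y_hat)
    return [(n - 1) * (y_hat[j] - y_hat[j - 1]) for j in range(1, n)]


def dense_velocity(
    per_window_y: Sequence[Sequence[float]], stride: int, num_frames: int
) -> List[Optional[float]]:
    """Average step velocities over all windows that cover each frame."""
    sums = [0.0] * num_frames
    counts = [0] * num_frames
    for k, y_hat in enumerate(per_window_y):
        start = k * stride
        for j, v in enumerate(window_velocities(y_hat), start=1):
            t = start + j
            if t < num_frames:
                sums[t] += v
                counts[t] += 1
    # Frames never covered by a step (e.g. frame 0) have no estimate.
    return [s / c if c else None for s, c in zip(sums, counts)]


# A window whose predicted progress rises linearly from 0 to 1 has velocity 1.
y = [j / 3 for j in range(4)]
dense = dense_velocity([y, y], stride=1, num_frames=5)
```

The \((N-1)\) factor rescales per-step differences so that a window played at the reference pace yields a velocity of 1, matching the interpretation of \(\hat{v} \approx 1\) as expert pace.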

3. WARP-BC: reweight action chunks by terminal velocity

For each training chunk, gate on the predicted velocity at its terminal frame. With \(\tau = 1.0\), only chunks ending in faster-than-expert progress are kept, and each is weighted continuously by its velocity:

\( w \;=\; \hat{v}_{\mathrm{end}} \cdot \mathbf{1}\{\,\hat{v}_{\mathrm{end}} > \tau\,\} \)
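As a sketch, the weighting rule above translates directly into code. The optional `max_weight` clip is an assumption modeled on the "max = 1 (binary)" ablation, and the weight-normalized loss in `weighted_bc_loss` is one plausible way to apply the weights, not necessarily the authors' training objective.

```python
from typing import Optional, Sequence


def warp_bc_weight(
    v_end: float, tau: float = 1.0, max_weight: Optional[float] = None
) -> float:
    """w = v_end * 1{v_end > tau}, optionally clipped at max_weight."""
    if v_end <= tau:
        return 0.0  # chunk is gated out
    return v_end if max_weight is None else min(v_end, max_weight)


def weighted_bc_loss(
    per_chunk_losses: Sequence[float], v_ends: Sequence[float], tau: float = 1.0
) -> float:
    """Weight-normalized mean of per-chunk losses (normalization is an assumption)."""
    weights = [warp_bc_weight(v, tau) for v in v_ends]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every chunk in the batch was gated out
    return sum(w * l for w, l in zip(weights, per_chunk_losses)) / total


# Hypothetical batch: the first chunk is slower than expert pace and is dropped.
loss = weighted_bc_loss([1.0, 3.0, 5.0], [0.5, 1.5, 2.5])
```

With \(\tau = 1\) and `max_weight=1.0`, every kept chunk gets weight 1, which corresponds to a binary filter.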

Quantitative Results

Cross-tier results on T-shirt folding

All policies are evaluated on 20 trials of T-shirt folding from a crumpled start with a 240s timeout. Mean time-to-completion (TTC) is reported over successful trials only. As the training pool admits more suboptimal demonstrations, vanilla BC degrades sharply while WARP-BC stays robust.

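| Method | Metric | \(\mathcal{D}_{1}\) (≤60s) | \(\mathcal{D}_{2}\) (≤90s) | \(\mathcal{D}_{3}\) (≤120s) |
| --- | --- | --- | --- | --- |
| Vanilla BC | Success ↑ | 20/20 | 2/20 | 0/20 |
| Vanilla BC | Mean TTC (s) ↓ | 113.8 | 199.0 | N/A |
| Vanilla BC | Throughput (/hr) ↑ | 31.6 | 1.5 | 0.0 |
| Vanilla BC | Action Chunks Kept | 100% | 100% | 100% |
| WARP-BC | Success ↑ | 20/20 | 19/20 | 14/20 |
| WARP-BC | Mean TTC (s) ↓ | 63.9 | 118.8 | 117.4 |
| WARP-BC | Throughput (/hr) ↑ | 56.3 | 27.4 | 16.3 |
| WARP-BC | Action Chunks Kept | 35.7% | 34.4% | 22.5% |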
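The page does not spell out how throughput is computed. One reading that reproduces the reported values to one decimal place is successful folds per hour of robot time, with each failed trial charged the full 240s timeout. The sketch below encodes that assumption; the function name is hypothetical.

```python
def throughput_per_hour(
    successes: int, trials: int, mean_ttc_s: float, timeout_s: float = 240.0
) -> float:
    """Successful folds per hour, assuming failures consume the full timeout."""
    if successes == 0:
        return 0.0
    total_s = successes * mean_ttc_s + (trials - successes) * timeout_s
    return 3600.0 * successes / total_s


# Success counts and mean TTC values taken from the cross-tier table.
vanilla_d1 = round(throughput_per_hour(20, 20, 113.8), 1)
vanilla_d2 = round(throughput_per_hour(2, 20, 199.0), 1)
warp_d2 = round(throughput_per_hour(19, 20, 118.8), 1)
warp_d3 = round(throughput_per_hour(14, 20, 117.4), 1)
```

Under this reading, the 18× headline figure corresponds to the \(\mathcal{D}_{2}\) tier, where WARP-BC reaches 27.4 folds per hour against 1.5 for vanilla BC.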

Matched baseline comparisons

Because SARM requires human-annotated subtask boundaries, all methods are trained on the augmented corpora \(\mathcal{D}_{4} = \mathcal{D}_{1} \cup \mathcal{D}_{A}\) and \(\mathcal{D}_{5} = \mathcal{D}_{2} \cup \mathcal{D}_{A}\), where \(\mathcal{D}_{A}\) is the annotated supplement (treated as unannotated by every method except SARM). WARP-BC achieves the highest throughput and the highest success count (20/20) on both tiers, without the human labels SARM needs. SARM and SCIZOR collapse to 2/20 successes on the noisier \(\mathcal{D}_{5}\), while DemInf stays robust on success but at lower throughput.

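| Method | Metric | \(\mathcal{D}_{4}\) | \(\mathcal{D}_{5}\) |
| --- | --- | --- | --- |
| SARM | Success ↑ | 19/20 | 2/20 |
| SARM | Mean TTC (s) ↓ | 90.5 | 156.0 |
| SARM | Throughput (/hr) ↑ | 34.9 | 1.55 |
| SARM | Action Chunks Kept | 78.5% | 66.6% |
| DemInf | Success ↑ | 19/20 | 18/20 |
| DemInf | Mean TTC (s) ↓ | 89.6 | 115.8 |
| DemInf | Throughput (/hr) ↑ | 35.2 | 25.3 |
| DemInf | Action Chunks Kept | 45.6% | 33.7% |
| SCIZOR | Success ↑ | 19/20 | 2/20 |
| SCIZOR | Mean TTC (s) ↓ | 98.4 | 206.2 |
| SCIZOR | Throughput (/hr) ↑ | 32.4 | 1.5 |
| SCIZOR | Action Chunks Kept | 77.9% | 66.7% |
| WARP-BC | Success ↑ | 20/20 | 20/20 |
| WARP-BC | Mean TTC (s) ↓ | 71.2 | 80.7 |
| WARP-BC | Throughput (/hr) ↑ | 50.6 | 44.6 |
| WARP-BC | Action Chunks Kept | 45.6% | 33.7% |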

Ablations

All ablations are run on dataset \(\mathcal{D}_{2}\). The Action Chunks Kept column reports the fraction of action chunks that survive the weighting filter.

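| Ablation | Variant | Success ↑ | Mean TTC (s) ↓ | Throughput (/hr) ↑ | Action Chunks Kept |
| --- | --- | --- | --- | --- | --- |
| Weighting function | \(\tau = 0\) | 3/20 | 201.4 | 2.3 | 97.0% |
| Weighting function | \(\tau = 1\), max = 1 (binary) | 16/20 | 139.6 | 18.0 | 34.4% |
| Weighting function | \(\tau = 1\), continuous (WARP) | 19/20 | 118.8 | 27.4 | 34.4% |
| RA-BC aggregation strategy | Mean over 1s action chunk | 15/20 | 127.0 | 17.4 | 34.0% |
| RA-BC aggregation strategy | Mean over 1s, one-chunk offset | 14/20 | 124.2 | 15.9 | 34.3% |
| RA-BC aggregation strategy | Terminal \(\hat{v}_{\mathrm{end}}\) (WARP) | 19/20 | 118.8 | 27.4 | 34.4% |
| WARP sampler | IID log-normal | 18/20 | 131.0 | 22.8 | 28.7% |
| WARP sampler | AR(1) process (WARP) | 19/20 | 118.8 | 27.4 | 34.4% |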