
Empirical Results

26 experiments, 236 training runs. All on CogVideoX-2B with Bridge V2 robot video data.

Key findings

| Finding | Effect | Significance |
| --- | --- | --- |
| Depth vs recurrence | 1.73x | p < 0.001 |
| Per-token adaptive compute | 1.24x | single seed |
| Weight sharing | 1.03x | medium |
| Curriculum vs fixed | 1.05x | -- |
| Frozen 3-loop (350K params) | 0.073 action loss | beats 11.7M unfrozen |
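
For context on the first row: the comparison is between a stack of n distinct transformer blocks (depth) and one weight-shared block applied n times (recurrence). A minimal sketch of the two configurations, where the Block internals, width, and loop count are illustrative placeholders rather than the paper's implementation:

```python
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a DiT-style attention block (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual update

class DeepStack(nn.Module):
    """Depth: n blocks, n sets of weights."""
    def __init__(self, dim, n=3):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

class LoopedBlock(nn.Module):
    """Recurrence: one block applied n times (weights shared across loops)."""
    def __init__(self, dim, n=3):
        super().__init__()
        self.block = Block(dim)
        self.n = n

    def forward(self, x):
        for _ in range(self.n):
            x = self.block(x)
        return x
```

At equal n the two variants do the same forward compute, but the looped one carries roughly 1/n the parameters; that parameter tying is what the 1.73x row measures distinct weights against.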

Falsified hypotheses

| Hypothesis | Result | Evidence |
| --- | --- | --- |
| Looping = iterative reasoning | Falsified | 3 independent nulls (p = 0.97, p > 0.05, p > 0.05) |
| Shared canvas = multi-modal binding | Falsified | Joint prediction 19% worse (p < 0.0001) |
| Token allocation follows power laws | Borderline | R² = 0.902, α = 0.011 |
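
On the borderline power-law row: an α and R² of this kind typically come from a least-squares line fit in log-log space, since a power law alloc ∝ rank^(−α) is linear there. A sketch with synthetic data standing in for the real token-allocation measurements (only the fitting method is standard; the numbers here are made up):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for per-token compute allocation by token rank.
rng = np.random.default_rng(0)
rank = np.arange(1, 513)
alloc = rank ** -0.011 * np.exp(rng.normal(0.0, 0.01, rank.size))

# A power law alloc = c * rank**(-alpha) is a line in log-log space:
# log(alloc) = log(c) - alpha * log(rank)
slope, intercept, r, p, stderr = stats.linregress(np.log(rank), np.log(alloc))
alpha, r_squared = -slope, r ** 2
print(f"alpha = {alpha:.3f}, R^2 = {r_squared:.3f}")
```

Note that α = 0.011 is an almost flat line: over 512 ranks the fitted allocation varies by only about 7% (512^0.011 ≈ 1.07). A high R² on a nearly flat fit is weak evidence for a power law, which is one plausible reading of "borderline".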

Fixed-point convergence

Loop representations converge toward fixed points:

| Loop | Cosine sim to loop 1 | Velocity |
| --- | --- | --- |
| 1 | 0.926 | 0.675 |
| 2 | 0.973 | 0.570 |
| 3 | 0.990 | 0.398 |
| 4 | 0.996 | 0.292 |

Token velocities decay exponentially; visual tokens converge slowest and action tokens fastest. Looping acts as weight-sharing regularization, not iterative refinement.
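
Both table columns can be recomputed from the hidden states after each pass through the looped block. A sketch, assuming cosine similarity is taken per token between each loop's output and its input, and velocity is the mean per-token update magnitude; these definitions are my reading of the columns, not necessarily the paper's exact formulas:

```python
import torch.nn.functional as F

def loop_convergence(block, x, n_loops=4):
    """Track fixed-point convergence of a looped block.

    x: [batch, tokens, dim] hidden state entering loop 1.
    Returns one (loop, cos_sim, velocity) row per loop.
    """
    rows = []
    for i in range(n_loops):
        y = block(x)
        # How aligned is this loop's output with its input?
        sim = F.cosine_similarity(y, x, dim=-1).mean().item()
        # How far did tokens move? Shrinks toward 0 near a fixed point.
        vel = (y - x).norm(dim=-1).mean().item()
        rows.append((i + 1, sim, vel))
        x = y
    return rows
```

The exponential-decay claim is easy to check from the velocity column: log(velocity) plotted against loop index should be roughly linear.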

Freeze strategy comparison

| Strategy | Trainable params | Action loss | Diffusion loss |
| --- | --- | --- | --- |
| Frozen | 350K | 0.073 | 1.48 |
| Half-frozen | 3.7M | 0.107 | 0.19 |
| Unfrozen | 11.7M | 0.088 | 0.18 |

Freeze level does not significantly affect action loss (p = 0.72); it only affects video generation quality, visible in the diffusion-loss column.
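
The three rows differ only in which parameters receive gradients. A minimal PyTorch sketch; model.backbone, model.backbone.blocks, and model.action_head are placeholder names for illustration, not CogVideoX-2B's actual attributes:

```python
import torch

def apply_freeze(model, strategy):
    """Configure trainability for 'frozen', 'half_frozen', or 'unfrozen'."""
    for p in model.backbone.parameters():
        p.requires_grad = (strategy == "unfrozen")
    if strategy == "half_frozen":
        # Unfreeze only the later half of the backbone blocks.
        blocks = list(model.backbone.blocks)
        for blk in blocks[len(blocks) // 2:]:
            for p in blk.parameters():
                p.requires_grad = True
    # The small (~350K-param) action head trains under every strategy.
    for p in model.action_head.parameters():
        p.requires_grad = True
    # Give the optimizer only the trainable subset.
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )
```

Given the p = 0.72 result above, the practical upshot is that freezing buys much cheaper training at no cost to the action objective; only the diffusion loss pays for it.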

Paper

Looped Attention in Video Diffusion Transformers: 26 Experiments on What Works, What Doesn't, and Why

Jacob Valdez and Claude Opus 4.6

Paper PDF | Video | Experiment data