Looped Attention¶

Looped attention iterates transformer blocks multiple times with learned iteration embeddings. The empirical result: 1.73x parameter efficiency through weight-sharing regularization.

Looped attention

How it works¶

For each loop iteration l in [0, current_loops):

h_input = h + loop_emb[l]                   # learned perturbation
h_out = frozen_block(h_input, ...)           # original pretrained forward
gate = sigmoid(loop_gate[l])                 # scalar in (0, 1)
h = gate * h_out + (1 - gate) * h           # gated residual mix

Zero-init safety¶

loop_emb initialized to zeros
loop_gate initialized to 0.0 → sigmoid(0) = 0.5

At init, the model behaves nearly identically to the pretrained backbone. No distribution shift. Safe to graft onto any frozen model.

Trainable parameters¶

Per block:  (embed_dim + 1) × max_loops
CogVideoX-2B (d=1920, 3 loops, 30 blocks):
  5,763 params/block × 30 = 172,890 + action_head ≈ 350K total

That's 0.02% of the 1.69B backbone.

Curriculum scheduling¶

scheduler = CurriculumScheduler(max_loops=3, total_steps=5000)
# Steps 0-1666:    1 loop
# Steps 1667-3333: 2 loops
# Steps 3334-5000: 3 loops

Why 3 loops?¶

From a 12-condition grid ablation (36 runs, $152 compute):

	Frozen (350K)	Half-frozen (3.7M)	Unfrozen (11.7M)
1 loop	0.121	0.115	0.108
2 loops	0.140	0.119	0.112
3 loops	0.073	0.107	0.088
4 loops	0.104	0.137	0.124

3 loops wins at every freeze level. 4 loops consistently regresses.

What looping is NOT¶

Hypothesis	Result	Evidence
Iterative reasoning	Falsified	p=0.97, p>0.05, p>0.05
Multi-modal binding	Falsified	19% worse (p<0.0001)

The benefit is weight-sharing regularization: fixed-point convergence (cosine sim 0.926 → 0.996), lower variance, better parameter efficiency. Not iterative reasoning.