CogVideoX Backbone¶
The default training backbone grafts onto a pretrained CogVideoX video diffusion transformer. Instead of training from random initialization, the world model inherits rich spatiotemporal priors from video pretraining and learns only a thin adaptation layer.
Why video diffusion?¶
Video diffusion models learn regularities of the visual world: object permanence, gravity, fluid dynamics, lighting. Although the world model operates on structured time series rather than video, the attention patterns learned by CogVideoX encode priors that carry over:
- Temporal continuity: Things change smoothly over time
- Spatial correlation: Nearby things are related
- Multi-scale dynamics: Fast changes happen on top of slow trends
- Compositional structure: The world is made of parts
These priors accelerate training on heterogeneous world data. The model doesn't need to learn temporal attention from scratch -- it already knows how sequences evolve.
Architecture¶
graph TD
    subgraph "Canvas Input"
        A["(B, N, d_model)\ncanvas tensor"]
        B["Positional Encoding"]
        C["proj_in: Linear(d_model -> inner_dim)"]
    end
    subgraph "Frozen CogVideoX Blocks (30x)"
        D["Block 0"]
        E["Block 1"]
        F["..."]
        G["Block 29"]
    end
    subgraph "Per-Block Loop Params (trainable)"
        H["loop_emb[i][l]\nloop_emb_enc[i][l]\nloop_gate[i][l]"]
    end
    subgraph "Output"
        I["proj_out: Linear(inner_dim -> d_model)"]
        J["LayerNorm"]
        K["(B, N, d_model)\ncanvas output"]
    end
    A --> B --> C
    C --> D
    H --> D
    D --> E --> F --> G
    G --> I --> J --> K
    style D fill:#ddd,stroke:#999
    style E fill:#ddd,stroke:#999
    style F fill:#ddd,stroke:#999
    style G fill:#ddd,stroke:#999
    style H fill:#afa,stroke:#393
How it works¶
- Project up: The canvas tensor (B, N, d_model) is projected to CogVideoX's inner_dim (e.g., 3072 for CogVideoX-2b)
- Loop through frozen blocks: Each pretrained block runs L times per forward pass. Each iteration adds a zero-initialized loop embedding and applies a sigmoid-gated residual
- Encoder conditioning: A single learned token participates in CogVideoX's joint attention as global context
- Project down: The output is projected back to d_model for the canvas loss
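The project-up, loop, and project-down steps can be sketched in a few lines of NumPy. This is a toy illustration and not the project's implementation: the sizes are tiny, `frozen_block` is a stand-in for a pretrained CogVideoX block, and the blend `g * out + (1 - g) * h` is an assumed form of the sigmoid-gated residual. Positional encoding, the encoder conditioning token, and the final LayerNorm are left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, d_model, inner_dim = 2, 5, 8, 16   # toy sizes, far smaller than the real model
num_blocks, num_loops = 3, 2             # the page describes 30 blocks; 3 keeps this fast


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Frozen "blocks": fixed random maps standing in for pretrained transformer blocks.
block_weights = [
    rng.normal(scale=0.1, size=(inner_dim, inner_dim)) for _ in range(num_blocks)
]


def frozen_block(i, h):
    """Hypothetical stand-in for pretrained block i (weights are never updated)."""
    return h + np.tanh(h @ block_weights[i])


# Trainable parameters: projections plus zero-initialized loop parameters.
W_in = rng.normal(scale=0.1, size=(d_model, inner_dim))
W_out = rng.normal(scale=0.1, size=(inner_dim, d_model))
loop_emb = np.zeros((num_blocks, num_loops, inner_dim))
loop_gate = np.zeros((num_blocks, num_loops))   # gate logits


def forward(canvas):
    h = canvas @ W_in                             # project up to inner_dim
    for i in range(num_blocks):
        for l in range(num_loops):                # each block runs num_loops times
            g = sigmoid(loop_gate[i, l])
            out = frozen_block(i, h + loop_emb[i, l])
            h = g * out + (1.0 - g) * h           # gated residual (assumed form)
    return h @ W_out                              # project back down to d_model


canvas = rng.normal(size=(B, N, d_model))
output = forward(canvas)
print(output.shape)  # (2, 5, 8)
```

The output keeps the canvas shape, so the loss can be computed directly against canvas targets.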
Zero-init safety¶
All loop embeddings and gate logits start at zero. At initialization:
- loop_emb = 0 means no perturbation to hidden states
- loop_gate = sigmoid(0) = 0.5 means a 50% blend of block output and input
- The backbone behaves like a standard (frozen) transformer pass-through
Training gradually activates the loop parameters, teaching the model domain-specific reasoning patterns on top of the pretrained video priors.
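The initialization arithmetic can be checked with a few lines of plain Python. This is a sketch under one assumption: the gate blends as `g * block_out + (1 - g) * h`, which matches the "50% blend" description but is not confirmed by the source. `gated_step` is a hypothetical helper name.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_step(h, block_out, gate_logit):
    """Blend block output and block input with a sigmoid gate (assumed form)."""
    g = sigmoid(gate_logit)
    return [g * o + (1.0 - g) * x for x, o in zip(h, block_out)]


h = [1.0, -2.0, 4.0]           # hidden state entering the block
block_out = [3.0, 0.0, 4.0]    # what the frozen block returned

gate_at_init = sigmoid(0.0)               # 0.5 when the logit is zero
blended = gated_step(h, block_out, 0.0)   # midpoint of input and output
print(gate_at_init, blended)  # 0.5 [2.0, -1.0, 4.0]
```

As the gate logit grows during training, the blend moves toward the block output; as it shrinks, the step moves toward the identity.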
Parameter budget¶
| Component | Params | Trainable? |
|---|---|---|
| CogVideoX frozen blocks (30) | ~3.3B | No |
| Loop embeddings (loop_emb and loop_emb_enc, each 30 blocks x 3 loops x inner_dim) | ~553K | Yes |
| Loop gates (30 blocks x 3 loops) | ~90 | Yes |
| Encoder conditioning token | ~3K | Yes |
| proj_in (d_model -> inner_dim) | ~400K | Yes |
| proj_out (inner_dim -> d_model) | ~400K | Yes |
| Total trainable | ~1.35M | |
Only about 0.04% of parameters (~1.35M of ~3.3B) are trainable. The rest provide frozen spatiotemporal priors.
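The table's figures can be reproduced with simple arithmetic. The sketch below assumes inner_dim = 3072 (the value quoted on this page), counts both loop_emb and loop_emb_enc, and uses a hypothetical d_model = 128 chosen only because it gives projection sizes close to the ~400K listed; the real d_model is not stated here.

```python
inner_dim = 3072              # value quoted on this page
d_model = 128                 # hypothetical; not stated in the docs
num_blocks, num_loops = 30, 3
frozen_params = 3.3e9         # approximate frozen block count from the table

# loop_emb and loop_emb_enc: one inner_dim vector per (block, loop) each
loop_embeddings = 2 * num_blocks * num_loops * inner_dim
loop_gates = num_blocks * num_loops           # one scalar logit per (block, loop)
encoder_token = inner_dim                     # a single learned token
proj_in = d_model * inner_dim + inner_dim     # Linear weight + bias
proj_out = inner_dim * d_model + d_model      # Linear weight + bias

trainable = loop_embeddings + loop_gates + encoder_token + proj_in + proj_out
fraction = trainable / (trainable + frozen_params)

print(f"{trainable:,} trainable ({fraction:.2%} of all parameters)")
# 1,345,754 trainable (0.04% of all parameters)
```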
Usage¶
Default (CogVideoX grafting)¶
from general_unified_world_model import DAGCurriculumTrainer

# dag and data_sources are assumed to be defined earlier
trainer = DAGCurriculumTrainer(
    nodes=dag,
    data_sources=data_sources,
    backbone="cogvideox",  # default
    pretrained_model_id="THUDM/CogVideoX-2b",  # default
    device="cuda",
)
trainer.run()
Fallback (from scratch)¶
trainer = DAGCurriculumTrainer(
    nodes=dag,
    data_sources=data_sources,
    backbone="scratch",  # train from random init
    device="cuda",
)
CLI¶
# CogVideoX (default)
python scripts/train_h100.py
# From scratch
python scripts/train_h100.py --backbone scratch
Installation¶
CogVideoX requires the diffusers library, which can typically be installed with pip install diffusers.
The pretrained model (~5GB) is downloaded from HuggingFace on first use and cached in ~/.cache/huggingface/hub/.
If diffusers is not installed, the trainer automatically falls back to the scratch backbone.
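One plausible way to implement such a fallback is to check whether diffusers is importable before choosing the backbone. The helper below is a hypothetical sketch, not the trainer's actual code; `pick_backbone` is an illustrative name.

```python
import importlib.util


def pick_backbone(requested: str = "cogvideox") -> str:
    """Return the backbone to use, falling back to "scratch" without diffusers.

    Hypothetical helper: the real trainer may detect this differently.
    """
    diffusers_missing = importlib.util.find_spec("diffusers") is None
    if requested == "cogvideox" and diffusers_missing:
        return "scratch"
    return requested


print(pick_backbone())  # "cogvideox" if diffusers is installed, else "scratch"
```

Checking with `find_spec` avoids importing the library just to test for its presence.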
Mixed precision¶
CogVideoX blocks run in bfloat16 (loaded with torch_dtype=torch.bfloat16). Trainable parameters (loop embeddings, projections) stay in float32 for stable gradients. The backbone handles dtype conversion at the block boundary automatically.
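The dtype handling at the block boundary might look like the following PyTorch sketch. It assumes a simple cast-in, cast-out wrapper; `BoundaryCast` is a hypothetical name, and an `nn.Linear` stands in for a CogVideoX block. The actual backbone may handle this differently.

```python
import torch
from torch import nn


class BoundaryCast(nn.Module):
    """Hypothetical wrapper: runs a frozen bfloat16 block on float32 activations."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block.to(torch.bfloat16)   # frozen weights live in bfloat16
        for p in self.block.parameters():
            p.requires_grad_(False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        out = self.block(h.to(torch.bfloat16))  # cast down at the boundary
        return out.to(h.dtype)                  # cast back up for trainable params


frozen = BoundaryCast(nn.Linear(16, 16))    # stand-in for one CogVideoX block
loop_emb = nn.Parameter(torch.zeros(16))    # trainable parameter, kept in float32

h = torch.randn(2, 4, 16)
out = frozen(h + loop_emb)
out.sum().backward()                        # gradients reach loop_emb in float32
print(out.dtype, loop_emb.grad.dtype)
```

Gradients flow through the casts, so the float32 parameters train while the bfloat16 weights stay fixed.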
Weight merging¶
When the DAG curriculum merges parent weights at join nodes, only trainable parameters are averaged. The frozen CogVideoX blocks are shared across all nodes (loaded once, kept on GPU throughout training). This means:
- Checkpoint files are small (~5MB instead of ~6GB)
- Memory usage is constant regardless of how many nodes have been trained
- Merging is fast (only ~1.35M parameters to average)
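A minimal sketch of merging at a join node, assuming a plain unweighted mean over the parents' trainable parameters (the source does not say whether the average is weighted). `merge_trainable` and the parameter names are hypothetical, and NumPy arrays stand in for tensors.

```python
import numpy as np


def merge_trainable(parents, trainable_names):
    """Average only the named trainable parameters across parent checkpoints."""
    merged = {}
    for name in trainable_names:
        merged[name] = np.mean([p[name] for p in parents], axis=0)
    return merged


# Two toy parent checkpoints; the frozen block weights are identical and shared.
shared_frozen = np.ones((4, 4))
parent_a = {"loop_gate": np.array([0.0, 2.0]), "blocks.0.weight": shared_frozen}
parent_b = {"loop_gate": np.array([2.0, 4.0]), "blocks.0.weight": shared_frozen}

merged = merge_trainable([parent_a, parent_b], trainable_names=["loop_gate"])
print(merged["loop_gate"])  # [1. 3.]
```

Because the frozen weights are never part of the merged dictionary, the result is small enough to save as a compact checkpoint.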