canvas_engineering.canvas¶
Core data structures for the spatiotemporal canvas.
canvas_engineering.canvas.RegionSpec
dataclass
¶
Declarative specification for a canvas region.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `bounds` | `Tuple[int, int, int, int, int, int]` | `(t0, t1, h0, h1, w0, w1)` spatiotemporal extent. | *required* |
| `period` | `int` | Canvas frames per real-world update (1 = every frame). A region with `period=4` spanning t=0..3 maps to real frames 0, 4, 8, 12. | `1` |
| `is_output` | `bool` | Whether this region participates in the diffusion loss. | `True` |
| `loss_weight` | `float` | Relative loss weight for positions in this region. | `1.0` |
| `semantic_type` | `Optional[str]` | Human-readable modality description, e.g. "RGB video 224x224 30fps from front-facing monocular camera". This is the source of truth; the embedding is derived from it. | `None` |
| `semantic_embedding` | `Optional[Tuple[float, ...]]` | Frozen vector from `embedding_model` applied to `semantic_type`. Used to compute transfer distance between modalities. Must be re-derived if `semantic_type` or `embedding_model` changes. | `None` |
| `embedding_model` | `str` | Identifier of the model that produced `semantic_embedding`. Should stay constant within a project/ecosystem. Declared explicitly so that different communities can use different models and the embedding can always be re-derived. | `'openai/text-embedding-3-small'` |
| `default_attn` | `str` | Default attention function type for outgoing connections from this region. Connections can override this per edge. See `ATTENTION_TYPES` for the full registry of supported types. | `'cross_attention'` |
| `carrier` | `str` | Dynamics carrier for this region. `"deterministic"` = standard forward latent updates (default); `"diffusive"` = noise/denoise; `"filter"` = predict/correct; `"memory"` = persistent lookup; `"residual"` = error traces. | `'deterministic'` |
canvas_engineering.canvas.ATTENTION_TYPES
module-attribute
¶
Registry of supported attention function types. Maps each type name to a human-readable description:

- `cross_attention`: Standard scaled dot-product QKV attention (softmax). The default. O(N*M) where N=|src|, M=|dst|.
- `linear_attention`: Dot-product without softmax normalization. O(N+M) via kernel trick (elu+1 or ReLU features). Good for low-dimensional or high-frequency regions where full attention is overkill.
- `cosine_attention`: Cosine similarity attention; normalized dot-product without learned temperature. Stable gradients, no scaling by sqrt(d).
- `sigmoid_attention`: Sigmoid instead of softmax over attention logits. Each position independently gates each key (no competition). Good for multi-label / non-exclusive attention patterns.
- `gated`: Gated cross-attention (Flamingo-style). A learned sigmoid gate on the cross-attention output controls whether to incorporate context. Good for optional conditioning (goals, instructions, memory).
- `perceiver`: Cross-attend through a learned latent bottleneck. Compresses dst into a small set of latent vectors, then src attends to those. O(N*K) where K << M. Good for very large dst regions.
- `pooling`: Mean-pool dst into a single vector, broadcast to all src positions. Cheapest possible information transfer. O(M+N). Good for scalar or low-dimensional conditioning signals.
- `copy`: Direct tensor transfer; no learned parameters, no attention. Requires src and dst to have compatible shapes. For broadcast regions or direct latent sharing between agents.
- `mamba`: Selective state-space model (S6). Input-dependent gating over a compressed state. O(N) sequential, hardware-efficient. Good for long temporal sequences.
- `rwkv`: Linear attention with learned exponential decay. O(N) via recurrent formulation. Time-mixing with position-dependent forgetting. Good for causal temporal connections.
- `hyena`: Long convolution operator with data-dependent gating. O(N log N) via FFT. Sub-quadratic alternative to attention for very long sequences.
- `sparse_attention`: Top-k attention; only the k highest logits survive the softmax. Sparse gradient flow. Good for regions that should selectively bind to specific positions.
- `local_attention`: Windowed attention; each position attends only within a local spatial/temporal window. O(N*W) where W is window size. Good for spatially local interactions.
- `none`: Connection exists in the schema but is disabled. Useful for ablation studies; the edge is declared but produces no information flow.
- `random_fixed`: Random sparse attention pattern, frozen at init. Each position attends to a random fixed subset of dst. Baseline for measuring whether learned patterns matter.
- `mixture`: Mixture-of-experts-style routing. Each src position is routed to a subset of dst positions by a learned router. Sparse but adaptive. Good for multi-modal hubs.
- `cogvideox`: CogVideoX-native attention. Within the canvas dispatcher this is standard cross-attention; when a CogVideoX backbone is grafted via `graft_looped_blocks()`, the backbone's native 3D-RoPE multi-head attention supersedes this entirely. Use as `default_attn` on any region trained inside a CogVideoX transformer.
canvas_engineering.canvas.transfer_distance(a, b)
¶
Cosine distance between two regions' semantic type embeddings.
Returns a value in [0, 2]: 0 = identical modality, 1 = orthogonal, 2 = opposite. Lower distance → cheaper to bridge (fewer adapter layers, less data).
Both specs must have semantic_embedding set and use the same embedding_model.
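The distance itself can be sketched in pure Python (assuming both embeddings are plain float sequences produced by the same `embedding_model`; the real function operates on `RegionSpec` objects):

```python
import math

def transfer_distance(a_emb, b_emb):
    # Cosine distance = 1 - cosine similarity, giving a value in [0, 2]:
    # 0 = identical direction, 1 = orthogonal, 2 = opposite.
    dot = sum(x * y for x, y in zip(a_emb, b_emb))
    norm_a = math.sqrt(sum(x * x for x in a_emb))
    norm_b = math.sqrt(sum(y * y for y in b_emb))
    return 1.0 - dot / (norm_a * norm_b)

transfer_distance((1.0, 0.0), (1.0, 0.0))   # identical -> 0.0
transfer_distance((1.0, 0.0), (0.0, 1.0))   # orthogonal -> 1.0
transfer_distance((1.0, 0.0), (-1.0, 0.0))  # opposite -> 2.0
```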
canvas_engineering.canvas.CanvasLayout
dataclass
¶
Declarative canvas geometry and modality region assignments.
Example
```python
layout = CanvasLayout(
    T=5, H=8, W=8, d_model=256,
    regions={
        "visual": (0, 5, 0, 6, 0, 6),  # 5 frames of 6x6 patches
        "action": (0, 5, 6, 7, 0, 1),  # per-frame actions
        "reward": (2, 3, 7, 8, 0, 1),  # single reward slot
    },
    t_current=2,  # t >= 2 is "future" (diffusion output)
)
```
canvas_frame(name, real_t)
¶
Map a real-world frame to a canvas timestep index (relative to region start).
Returns None if real_t is not aligned to this region's period.
loss_weight_mask(device='cpu')
¶
Per-position loss weights as an (N,) tensor.
Positions in is_output=True regions get their loss_weight; is_output=False or uncovered positions get 0. Overlapping regions accumulate weights additively.
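A pure-Python sketch of the accumulation rule (assuming positions are flattened row-major as `t*H*W + h*W + w`, which is an assumption about the layout; the real method returns a tensor on the requested device):

```python
def loss_weight_mask(T, H, W, regions):
    # regions: list of (bounds, is_output, loss_weight), where
    # bounds = (t0, t1, h0, h1, w0, w1).
    weights = [0.0] * (T * H * W)
    for (t0, t1, h0, h1, w0, w1), is_output, loss_weight in regions:
        if not is_output:
            continue  # non-output regions contribute nothing
        for t in range(t0, t1):
            for h in range(h0, h1):
                for w in range(w0, w1):
                    weights[t * H * W + h * W + w] += loss_weight  # overlaps add
    return weights

mask = loss_weight_mask(
    T=2, H=2, W=2,
    regions=[
        ((0, 2, 0, 2, 0, 2), True, 1.0),   # covers every position
        ((0, 1, 0, 1, 0, 1), True, 0.5),   # overlaps position 0
        ((1, 2, 1, 2, 1, 2), False, 3.0),  # is_output=False -> ignored
    ],
)
# mask[0] == 1.5 (overlap accumulates); every other position == 1.0
```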
output_mask()
¶
Flat indices that are diffusion outputs (t >= t_current).
real_frame(name, canvas_t)
¶
Map a canvas timestep index (relative to region start) to a real-world frame.
real_frame = canvas_t * period
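The two mappings above are inverses on period-aligned frames. A sketch as free functions (the real methods take a region name and read the period from the layout):

```python
def real_frame(canvas_t, period):
    # Canvas timestep (relative to region start) -> real-world frame.
    return canvas_t * period

def canvas_frame(real_t, period):
    # Real-world frame -> canvas timestep, or None if not period-aligned.
    return real_t // period if real_t % period == 0 else None

# A period=4 region: canvas t=0..3 maps to real frames 0, 4, 8, 12.
[real_frame(t, 4) for t in range(4)]  # [0, 4, 8, 12]
canvas_frame(8, 4)   # 2
canvas_frame(6, 4)   # None (not aligned to the period)
```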
region_indices(name)
¶
Flat indices for a named region.
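The flat indices can be sketched as follows, again assuming row-major `(t, h, w)` flattening with `index = t*H*W + h*W + w` (the real method resolves the name and geometry via the layout):

```python
def region_indices(bounds, H, W):
    # Enumerate flat canvas indices covered by a region's bounds.
    t0, t1, h0, h1, w0, w1 = bounds
    return [
        t * H * W + h * W + w
        for t in range(t0, t1)
        for h in range(h0, h1)
        for w in range(w0, w1)
    ]

region_indices((0, 1, 0, 2, 0, 2), H=8, W=8)  # [0, 1, 8, 9]
```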
region_indices_at_t(name, t_abs)
¶
Flat indices for a named region at a specific absolute timestep.
Returns empty list if t_abs is outside the region's temporal extent.
region_spec(name)
¶
Return the RegionSpec for a named region, wrapping raw tuples with defaults.
region_timesteps(name)
¶
Absolute timesteps covered by a named region.
canvas_engineering.canvas.PeriodEmbedding
¶
Bases: Module
Learned embedding indexed by log-bucketed temporal period.
Summed into position representations so the model knows each position's native update rate. Combined with temporal positional encoding, this lets the model infer staleness when reading held values from slower regions via temporal fill.
Period values are mapped to buckets via log scaling:

- period=1 → bucket 0 (tick-rate)
- period=4 → bucket 2 (hourly)
- period=16 → bucket 4 (daily)
- period=576 → bucket 10 (quarterly)
- period=4608 → bucket 13 (decadal)
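One plausible bucketing that reproduces the documented mapping is `ceil(log2(period))` clamped to the last bucket. This formula is an assumption, not the library's own code; the actual log scaling may use `max_period` differently:

```python
import math

def period_bucket(period, n_buckets=16):
    # ceil(log2(period)), clamped to n_buckets - 1. Reproduces the documented
    # examples: 1 -> 0, 4 -> 2, 16 -> 4, 576 -> 10, 4608 -> 13.
    bucket = 0 if period <= 1 else math.ceil(math.log2(period))
    return min(bucket, n_buckets - 1)
```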
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `d_model` | `int` | Embedding dimension (must match the canvas `d_model`). | *required* |
| `n_buckets` | `int` | Number of discrete period buckets. | `16` |
| `max_period` | `int` | Maximum expected period (for log scaling). | `10000` |
canvas_engineering.canvas.SpatiotemporalCanvas
¶
Bases: Module
Manages the unified canvas tensor with positional + modality + period embeddings.
Each position's representation is the sum of:

1. Content embedding (empty token or placed data)
2. 3D sinusoidal positional encoding (T, H, W)
3. Region identity (learned modality embedding or semantic conditioning)
4. Period embedding (learned, indexed by log-bucketed update period)
The period embedding tells the model each position's native update rate, enabling it to infer staleness when reading held values from slower regions via temporal fill connections.
Supports two modes of per-region identity:

1. Learned modality embeddings (default): one learned vector per region.
2. Semantic conditioning: frozen embeddings from a text model, projected to d_model with optional learned residuals. Pass a SemanticConditioner to __init__ to enable; this replaces the learned modality embeddings.
Example
```python
canvas_mod = SpatiotemporalCanvas(layout)
canvas = canvas_mod.create_empty(batch_size=4)  # (4, THW, d_model)
canvas = canvas_mod.place(canvas, visual_embs, "visual")
action_embs = canvas_mod.extract(canvas, "action")
```
With semantic conditioning:¶
```python
from canvas_engineering.semantic import SemanticConditioner

cond = SemanticConditioner(d_model=256, embed_dim=1536, region_embeddings={...})
canvas_mod = SpatiotemporalCanvas(layout, semantic_conditioner=cond)
```
create_empty(batch_size)
¶
(B, N, d_model) canvas filled with empty tokens + positional + period encoding.
Each position gets: empty_token + 3D_PE + period_embedding. If a semantic conditioner is present, also adds semantic conditioning. If a program was provided at init, also adds family+carrier embeddings.
extract(canvas, region_name)
¶
Read embeddings from a named region.
place(canvas, embeddings, region_name)
¶
Write embeddings into a named region, adding modality + period + family/carrier embeddings.