Attention Function Types

Not all connections should use the same attention mechanism. Canvas-engineering lets you declare the type of function used for each edge in the compute graph.

Declaration

# Per-region default (applies to all outgoing connections)
RegionSpec(bounds=..., default_attn="linear_attention")

# Per-connection override
Connection(src="thought", dst="visual", fn="perceiver")

Resolution order: connection.fn → region.default_attn → "cross_attention". The first value that is set wins, and region here is the connection's source region.
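
The resolution order above can be sketched in plain Python. The `Region` and `Edge` dataclasses below are hypothetical stand-ins, not the library's `RegionSpec` and `Connection` classes, and the sketch assumes an unset value is represented as `None`.

```python
from dataclasses import dataclass
from typing import Dict, Optional

FALLBACK_ATTN = "cross_attention"  # global default named in the text


@dataclass
class Region:  # hypothetical stand-in for RegionSpec
    default_attn: Optional[str] = None


@dataclass
class Edge:  # hypothetical stand-in for Connection
    src: str
    dst: str
    fn: Optional[str] = None


def resolve_attn(edge: Edge, regions: Dict[str, Region]) -> str:
    """Pick edge.fn, else the source region's default_attn, else the fallback."""
    if edge.fn is not None:
        return edge.fn
    region_default = regions[edge.src].default_attn
    if region_default is not None:
        return region_default
    return FALLBACK_ATTN


regions = {
    "thought": Region(default_attn="linear_attention"),
    "visual": Region(),  # no region default declared
}
print(resolve_attn(Edge("thought", "visual", fn="perceiver"), regions))  # perceiver
print(resolve_attn(Edge("thought", "visual"), regions))  # linear_attention
print(resolve_attn(Edge("visual", "thought"), regions))  # cross_attention
```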

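The full lineup

In the complexity columns, N is the number of src positions and M the number of dst positions. K and W are type-specific sizes such as the latent count, the number of dst positions kept per src position, or the window width.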

Dot-product family

Type	Complexity	Description
`cross_attention`	O(NM)	Standard scaled dot-product QKV with softmax. The default.
`linear_attention`	O(N+M)	No softmax — kernel trick with elu+1 or ReLU features. Good for low-dimensional streams where full quadratic attention is overkill.
`cosine_attention`	O(NM)	Cosine similarity instead of scaled dot-product. No temperature parameter. Stable gradients.
`sigmoid_attention`	O(NM)	Sigmoid instead of softmax — each position independently gates each key. Non-exclusive attention for multi-label patterns.
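
The O(N+M) cost of `linear_attention` comes from reordering the computation: keys and values are summarised once, and each query then reads that summary. The following is a minimal pure-Python sketch, assuming the elu+1 feature map mentioned in the table and plain lists of floats instead of tensors. The helper names `elu_plus_one` and `linear_attention` are illustrative and are not the library's module.

```python
import math


def elu_plus_one(x):
    """Feature map phi(x) = elu(x) + 1, which is always positive."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(queries, keys, values):
    """Linear attention over lists of vectors.

    Keys and values are folded into a summary once (work proportional to M),
    then every query reads that summary (work proportional to N).
    """
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum over j of phi(k_j) v_j^T
    k_sum = [0.0] * d_k  # sum over j of phi(k_j), used for normalisation
    for k, v in zip(keys, values):
        phi_k = [elu_plus_one(x) for x in k]
        for a in range(d_k):
            k_sum[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * v[b]

    outputs = []
    for q in queries:
        phi_q = [elu_plus_one(x) for x in q]
        norm = sum(phi_q[a] * k_sum[a] for a in range(d_k))
        outputs.append([
            sum(phi_q[a] * kv[a][b] for a in range(d_k)) / norm
            for b in range(d_v)
        ])
    return outputs


queries = [[0.5, -1.0], [2.0, 0.3]]
keys = [[1.0, 0.0], [0.0, 1.0], [-0.5, 0.5]]
values = [[1.0, 0.0], [3.0, 1.0], [2.0, -1.0]]
print(linear_attention(queries, keys, values))
```

The result matches the quadratic form that first builds an N by M weight matrix from phi(q)·phi(k); only the order of the sums changes.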

Gating family

Type	Complexity	Description
`gated`	O(NM)	Gated cross-attention (Flamingo-style). A learned sigmoid gate controls whether to incorporate context. Best for optional conditioning — goals, instructions, memory retrieval.
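
A minimal sketch of the gating idea, assuming a single learned scalar gate passed through a sigmoid and applied to the attended context before a residual add. The library's module may gate per channel or per head instead; `gated_residual` is an illustrative name.

```python
import math


def gated_residual(hidden, context, gate_logit):
    """Add attended context to hidden states, scaled by sigmoid(gate_logit)."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [h + gate * c for h, c in zip(hidden, context)]


hidden = [1.0, 2.0]
context = [10.0, 10.0]  # stands in for the output of a cross-attention step
print(gated_residual(hidden, context, gate_logit=0.0))  # [6.0, 7.0], gate is 0.5
print(gated_residual(hidden, context, gate_logit=-50.0))  # gate nearly closed
```

A strongly negative gate logit leaves the hidden states almost unchanged, which is what makes this a fit for conditioning that is only sometimes useful.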

Compression family

Type	Complexity	Description
`perceiver`	O(NK)	Cross-attend through a learned latent bottleneck (K << M). Compresses a large dst region into a fixed-size representation. Good for reading from large visual fields.
`pooling`	O(N+M)	Mean-pool dst into a single vector, broadcast to all src positions. Cheapest possible information transfer. Good for scalar conditioning.
`copy`	O(N)	Direct tensor transfer — no learned parameters. For broadcast regions, multi-agent latent sharing, or identity connections.
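
The `pooling` row can be sketched in a few lines: one pass over dst to average it, one pass over src to hand out copies. This assumes plain lists of floats; how the pooled vector is merged into src (added, concatenated, projected) is not specified here, so the sketch only returns the broadcast messages. Both helper names are illustrative.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector (work ~ M)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def pooling_messages(n_src, dst):
    """Give each of the n_src positions its own copy of the pooled dst vector."""
    pooled = mean_pool(dst)
    return [list(pooled) for _ in range(n_src)]


dst = [[1.0, 2.0], [3.0, 6.0]]
print(pooling_messages(3, dst))  # three copies of [2.0, 4.0]
```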

State-space / recurrence family

Type	Complexity	Description
`mamba`	O(N)	Selective state-space model (S6). Input-dependent gating over compressed state with query-based readout — each query position attends to the SSM context sequence via scaled dot-product, enabling position-selective reading of the state.
`rwkv`	O(N)	Linear attention with learned exponential decay. Recurrent formulation with position-dependent forgetting. Good for causal temporal connections.
`hyena`	O(N log N)	Long convolution with data-dependent gating via FFT. Sub-quadratic alternative for very long sequences.
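
What the recurrent types share is a fixed-size state updated once per step, which is where the O(N) cost comes from. The sketch below is a simplified decayed linear recurrence, not the RWKV or Mamba formulation itself: it assumes a single constant decay in (0, 1) rather than a learned or input-dependent one, and `decayed_scan` is an illustrative name.

```python
def decayed_scan(queries, keys, values, decay):
    """Causal scan with exponential forgetting.

    state_t = decay * state_{t-1} + k_t v_t^T
    out_t   = q_t^T state_t
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] = decay * state[a][b] + k[a] * v[b]
        outputs.append([
            sum(q[a] * state[a][b] for a in range(d_k)) for b in range(d_v)
        ])
    return outputs


ones = [[1.0], [1.0], [1.0]]
impulse = [[1.0], [0.0], [0.0]]  # a value written only at the first step
print(decayed_scan(ones, ones, impulse, decay=0.5))  # [[1.0], [0.5], [0.25]]
```

The impulse written at step 0 fades by the decay factor at every later step, which is the position-dependent forgetting described for `rwkv`.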

Sparse / structured family

Type	Complexity	Description
`sparse_attention`	O(NK)	Top-k attention — only the k highest logits survive softmax. Sparse gradient flow for selective binding.
`local_attention`	O(NW)	Windowed attention — each position only attends within a local spatial/temporal window W. For spatially local interactions.
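
Both sparse types restrict which dst positions a query may see before normalising. A minimal sketch for a single query, assuming a 1-D sequence and treating the window as a half-width on each side (the library may define W differently). The helper names are illustrative.

```python
import math


def topk_softmax(logits, k):
    """Softmax over the k largest logits; every other position gets weight 0."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in keep)  # subtract the max for stability
    exps = {i: math.exp(logits[i] - peak) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def local_window(position, length, window):
    """Indices within `window` steps of `position`, clipped to the sequence."""
    start = max(0, position - window)
    stop = min(length, position + window + 1)
    return list(range(start, stop))


print(topk_softmax([1.0, 3.0, 2.0, 0.0], k=2))  # only positions 1 and 2 are non-zero
print(local_window(5, length=10, window=2))  # [3, 4, 5, 6, 7]
print(local_window(0, length=10, window=2))  # [0, 1, 2]
```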

Meta / experimental

Type	Complexity	Description
`none`	O(0)	Edge exists in schema but is disabled. For ablation studies.
`random_fixed`	O(NK)	Random sparse attention pattern, frozen at init. Baseline for measuring whether learned patterns matter.
`mixture`	O(NK)	MoE-style learned routing. Each src position is routed to a subset of dst by a learned router.
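
A frozen random pattern, as in the `random_fixed` row, only needs a seeded generator at construction time. This sketch assumes each src position is wired to exactly K distinct dst positions; the function name and the seeding scheme are hypothetical.

```python
import random


def random_fixed_pattern(n_src, n_dst, k, seed=0):
    """For each src position, pick k distinct dst indices once and keep them."""
    rng = random.Random(seed)  # private generator, so the pattern is reproducible
    return [sorted(rng.sample(range(n_dst), k)) for _ in range(n_src)]


pattern = random_fixed_pattern(n_src=4, n_dst=16, k=3, seed=0)
print(pattern)  # 4 rows of 3 dst indices each
```

Because the pattern depends only on the seed, rebuilding it with the same seed reproduces the same baseline.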

Backbone-native

Type	Complexity	Description
`cogvideox`	O(NM)	CogVideoX-native attention. Within the canvas `AttentionDispatcher` this is standard cross-attention. When a CogVideoX backbone is grafted via `graft_looped_blocks()`, the backbone's native 3D-RoPE multi-head attention supersedes this entirely. Use as `default_attn` on any region trained inside a CogVideoX transformer.

Design recipes

Robot manipulation

"visual":  default_attn="cross_attention"   # spatial reasoning needs full attention
"proprio": default_attn="linear_attention"  # 12D vector, O(N²) is wasteful
"action":  default_attn="cross_attention"   # content-based visual selection
# proprio → action: fn="pooling"            # just inject the state vector
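
The comments in the recipe above trade expressiveness against cost. A rough way to compare the options is to plug region sizes into the complexity columns of the tables. The sizes below are hypothetical, chosen only for illustration, and the proxy ignores constant factors and feature dimensions.

```python
# Cost proxies copied from the complexity columns of the tables above.
COST = {
    "cross_attention": lambda n, m, k: n * m,
    "linear_attention": lambda n, m, k: n + m,
    "perceiver": lambda n, m, k: n * k,
    "pooling": lambda n, m, k: n + m,
    "copy": lambda n, m, k: n,
}


def cost_proxy(fn, n, m, k=0):
    """Return the big-O expression for `fn` evaluated at concrete sizes."""
    return COST[fn](n, m, k)


# Hypothetical sizes: 16 src positions reading from 1024 dst positions.
n_src, n_dst = 16, 1024
print(cost_proxy("cross_attention", n_src, n_dst))  # 16384
print(cost_proxy("pooling", n_src, n_dst))  # 1040
print(cost_proxy("perceiver", n_src, n_dst, k=32))  # 512
```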

Embodied agent with memory

"perception": default_attn="cross_attention"
"memory":     default_attn="mamba"           # O(N) over long episode history
"policy":     default_attn="cross_attention"
# memory → perception: fn="gated"           # selective memory retrieval
# perception → memory: fn="perceiver"       # compress into fixed-size buffer

Multi-agent coordination

"agent_a.thought": default_attn="rwkv"      # causal temporal within agent
"agent_b.thought": default_attn="rwkv"
# agent_a → agent_b: fn="copy"              # direct latent relay
# both → shared_task: fn="cross_attention"   # selective broadcast

Vision transformer

"patches":  default_attn="local_attention"   # each patch attends locally
"cls":      default_attn="cross_attention"   # global token aggregates
# cls → patches: fn="cross_attention"        # global readout
# patches → cls: fn="pooling"               # compress to single vector

Dispatch: from declaration to execution

Every attention type listed above is implemented as an nn.Module class in canvas_engineering.attention. The AttentionDispatcher routes each topology connection to its resolved function:

from canvas_engineering import AttentionDispatcher

dispatcher = AttentionDispatcher(
    topology=topology,
    layout=layout,
    d_model=256,
    n_heads=4,
)
output = dispatcher(hidden_states)  # per-connection dispatch

A frozen CogVideoX backbone runs all positions through the same full-attention blocks, so it cannot switch mechanisms per connection and can only honor weight modulation. Use default_attn="cogvideox" on regions trained inside a CogVideoX transformer to signal this. A custom or from-scratch backbone can use AttentionDispatcher for true per-connection dispatch with temporal fill modes.

Custom attention types can be registered at runtime:

from canvas_engineering import register_attention

# MyCustomAttentionModule stands for your own nn.Module subclass, defined elsewhere
register_attention("my_custom_attn", MyCustomAttentionModule)
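
To make the register-then-dispatch flow concrete, here is a toy registry in plain Python. It is not the library's implementation: `ATTENTION_REGISTRY`, `register`, and `dispatch` are hypothetical names, the registered callables take `(src_state, dst_state)` rather than being nn.Module classes, and collecting results under the src region is an assumption.

```python
from typing import Callable, Dict, List, Tuple

ATTENTION_REGISTRY: Dict[str, Callable] = {}  # hypothetical name -> callable map


def register(name: str, fn: Callable) -> None:
    """Add a named attention callable; refuse to overwrite an existing name."""
    if name in ATTENTION_REGISTRY:
        raise ValueError(f"attention type {name!r} is already registered")
    ATTENTION_REGISTRY[name] = fn


def dispatch(
    connections: List[Tuple[str, str, str]],
    states: Dict[str, list],
) -> Dict[str, list]:
    """Run each (src, dst, fn_name) connection and collect results per src."""
    updates: Dict[str, list] = {}
    for src, dst, fn_name in connections:
        fn = ATTENTION_REGISTRY[fn_name]
        updates.setdefault(src, []).append(fn(states[src], states[dst]))
    return updates


# A toy "attention" that ignores src and averages the dst values.
register("my_custom_attn", lambda src, dst: sum(dst) / len(dst))

states = {"thought": [0.0], "visual": [1.0, 3.0]}
connections = [("thought", "visual", "my_custom_attn")]
print(dispatch(connections, states))  # {'thought': [2.0]}
```

Looking names up at dispatch time is what lets a type registered at runtime be used by any connection that names it.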