Attention Function Types

Not all connections should use the same attention mechanism. Canvas-engineering lets you declare which attention function each edge in the compute graph uses.

Declaration

# Per-region default (applies to all outgoing connections)
RegionSpec(bounds=..., default_attn="linear_attention")

# Per-connection override
Connection(src="thought", dst="visual", fn="perceiver")

Resolution order: connection.fn → region.default_attn → "cross_attention".
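The resolution order amounts to a three-step fallback. A minimal sketch, assuming fn and default_attn are optional string attributes; resolve_attn is an illustrative name, not a library function:

```python
def resolve_attn(connection, region, fallback="cross_attention"):
    # 1. Per-connection override wins.
    if getattr(connection, "fn", None) is not None:
        return connection.fn
    # 2. Otherwise fall back to the region's default.
    if getattr(region, "default_attn", None) is not None:
        return region.default_attn
    # 3. Otherwise the global default.
    return fallback
```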

The full lineup

Dot-product family

Type Complexity Description
cross_attention O(NM) Standard scaled dot-product QKV with softmax. The default.
linear_attention O(N+M) No softmax — kernel trick with elu+1 or ReLU features. Good for low-dimensional streams where full quadratic attention is overkill.
cosine_attention O(NM) Cosine similarity instead of scaled dot-product. No temperature parameter. Stable gradients.
sigmoid_attention O(NM) Sigmoid instead of softmax — each position independently gates each key. Non-exclusive attention for multi-label patterns.
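The O(N+M) claim for linear_attention comes from reordering the matrix products: with a positive feature map, the (d, d) key-value summary is computed once instead of the N×M attention matrix. A minimal NumPy sketch of the elu+1 kernel trick, not the library's implementation (projections and heads omitted):

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: keeps features strictly positive so the normalizer is valid
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # q: (N, d), k: (M, d), v: (M, d)
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v                  # (d, d) summary, computed once over M keys
    z = qf @ kf.sum(axis=0)        # (N,) normalizer
    return (qf @ kv) / z[:, None]  # (N, d): the N×M matrix is never formed
```

Computing kf.T @ v before touching the queries is exactly what drops the quadratic cost.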

Gating family

Type Complexity Description
gated O(NM) Gated cross-attention (Flamingo-style). A learned sigmoid gate controls whether to incorporate context. Best for optional conditioning — goals, instructions, memory retrieval.
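The Flamingo-style gate wraps cross-attention in a learned tanh gate on a residual path, so the edge starts as an identity when the gate parameter is zero. A toy NumPy sketch with projections and multi-head structure omitted; all names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(x, context, alpha):
    # x: (N, d) src states, context: (M, d) conditioning, alpha: learned scalar
    d = x.shape[-1]
    attn = softmax(x @ context.T / np.sqrt(d))      # (N, M)
    # tanh gate: with alpha initialized at 0, the connection is a no-op
    return x + np.tanh(alpha) * (attn @ context)
```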

Compression family

Type Complexity Description
perceiver O(NK) Cross-attend through a learned latent bottleneck (K << M). Compresses a large dst region into a fixed-size representation. Good for reading from large visual fields.
pooling O(N+M) Mean-pool dst into a single vector, broadcast to all src positions. Cheapest possible information transfer. Good for scalar conditioning.
copy O(N) Direct tensor transfer — no learned parameters. For broadcast regions, multi-agent latent sharing, or identity connections.
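The perceiver bottleneck reads a large region through K learned latent queries, so the cost scales with K rather than the region size. A minimal sketch assuming the latents are a learned (K, d) array; projections omitted, names illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_read(latents, source):
    # latents: (K, d) learned queries; source: (M, d) large region, K << M
    d = latents.shape[-1]
    attn = softmax(latents @ source.T / np.sqrt(d))  # (K, M), the only K*M term
    return attn @ source                             # (K, d) fixed-size summary
```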

State-space / recurrence family

Type Complexity Description
mamba O(N) Selective state-space model (S6). Input-dependent gating over compressed state with query-based readout — each query position attends to the SSM context sequence via scaled dot-product, enabling position-selective reading of the state.
rwkv O(N) Linear attention with learned exponential decay. Recurrent formulation with position-dependent forgetting. Good for causal temporal connections.
hyena O(N log N) Long convolution with data-dependent gating via FFT. Sub-quadratic alternative for very long sequences.
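The "linear attention with learned exponential decay" idea behind rwkv can be illustrated with a toy scalar-key recurrence: each step keeps a running weighted sum whose past contributions are discounted by a decay factor. This is a sketch of the idea under simplifying assumptions, not the RWKV formulation the library uses:

```python
import numpy as np

def decayed_linear_attention(k, v, decay=0.9):
    # k: (T,) per-step key logits, v: (T, d) values, decay in (0, 1)
    num = np.zeros(v.shape[1])
    den = 0.0
    out = []
    for t in range(len(k)):
        w = np.exp(k[t])             # unnormalized weight for this step
        num = decay * num + w * v[t] # older steps are forgotten exponentially
        den = decay * den + w
        out.append(num / den)        # time-discounted weighted average
    return np.stack(out)
```

The recurrence is O(N) in sequence length, which is why this family suits long causal temporal connections.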

Sparse / structured family

Type Complexity Description
sparse_attention O(NK) Top-k attention — only the k highest logits survive softmax. Sparse gradient flow for selective binding.
local_attention O(NW) Windowed attention — each position only attends within a local spatial/temporal window W. For spatially local interactions.
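Top-k attention keeps only the k largest logits per query before softmax, masking the rest to negative infinity. A minimal NumPy sketch (illustrative, assumes no tied logits):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_attention(q, k, v, topk=2):
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                      # (N, M)
    kth = np.sort(logits, axis=-1)[:, -topk][:, None]  # k-th largest per row
    masked = np.where(logits >= kth, logits, -np.inf)  # drop all but top-k
    return softmax(masked) @ v
```

With topk equal to the number of keys this reduces to standard cross-attention.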

Meta / experimental

Type Complexity Description
none O(0) Edge exists in schema but is disabled. For ablation studies.
random_fixed O(NK) Random sparse attention pattern, frozen at init. Baseline for measuring whether learned patterns matter.
mixture O(NK) MoE-style learned routing. Each src position is routed to a subset of dst by a learned router.

Backbone-native

Type Complexity Description
cogvideox O(NM) CogVideoX-native attention. Within the canvas AttentionDispatcher this is standard cross-attention. When a CogVideoX backbone is grafted via graft_looped_blocks(), the backbone's native 3D-RoPE multi-head attention supersedes this entirely. Use as default_attn on any region trained inside a CogVideoX transformer.

Design recipes

Robot manipulation

"visual":  default_attn="cross_attention"   # spatial reasoning needs full attention
"proprio": default_attn="linear_attention"  # 12D vector, O(N²) is wasteful
"action":  default_attn="cross_attention"   # content-based visual selection
# proprio → action: fn="pooling"            # just inject the state vector

Embodied agent with memory

"perception": default_attn="cross_attention"
"memory":     default_attn="mamba"           # O(N) over long episode history
"policy":     default_attn="cross_attention"
# memory → perception: fn="gated"           # selective memory retrieval
# perception → memory: fn="perceiver"       # compress into fixed-size buffer

Multi-agent coordination

"agent_a.thought": default_attn="rwkv"      # causal temporal within agent
"agent_b.thought": default_attn="rwkv"
# agent_a → agent_b: fn="copy"              # direct latent relay
# both → shared_task: fn="cross_attention"   # selective broadcast

Vision transformer

"patches":  default_attn="local_attention"   # each patch attends locally
"cls":      default_attn="cross_attention"   # global token aggregates
# cls → patches: fn="cross_attention"        # global readout
# patches → cls: fn="pooling"               # compress to single vector

Dispatch: from declaration to execution

All 17 attention types listed above are fully implemented as nn.Module classes in canvas_engineering.attention. The AttentionDispatcher routes each topology connection to its resolved function:

from canvas_engineering import AttentionDispatcher

dispatcher = AttentionDispatcher(
    topology=topology,
    layout=layout,
    d_model=256,
    n_heads=4,
)
output = dispatcher(hidden_states)  # per-connection dispatch

A frozen CogVideoX backbone runs all positions through the same blocks (full attention), so it can only honor weight modulation. Use default_attn="cogvideox" on regions trained inside a CogVideoX transformer to signal this. A custom or scratch backbone can use AttentionDispatcher for true per-connection dispatch with temporal fill modes.

Custom attention types can be registered at runtime:

from canvas_engineering import register_attention

register_attention("my_custom_attn", MyCustomAttentionModule)