Core Concepts
This guide explains the fundamental concepts behind BSBR (Block Sparse Attention with Block Retrieval). Understanding these concepts will help you make better use of the library and customize it for your needs.
Attention Mechanisms
Standard Transformer Attention
The standard transformer attention mechanism computes attention scores between all pairs of tokens in a sequence:
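In the usual scaled dot-product form, with key dimension d_k:

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$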
This leads to O(n²) complexity in both computation and memory, where n is the sequence length.
BSBR's Approach
BSBR addresses this scalability issue by combining two types of attention:
- Within-chunk attention: Standard attention within fixed-size chunks
- Between-chunk attention: Efficient block retrieval between chunks
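One way to write the combined operation, reconstructed here from the variable definitions below (the exact form used in the reference implementation may differ):

$$
O \;=\; \underbrace{\text{softmax}\!\left(QK^\top \odot M_{\text{in}}\right)V}_{\text{within-chunk}} \;+\; \underbrace{\text{softmax}\!\left(RH^\top \odot M_{\text{out}}\right)F}_{\text{between-chunk}}
$$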
Where:
- Q, K, V: Query, Key, Value matrices
- R, H: Meta queries and keys for chunk-level attention
- F: State vectors (flattened K^T·V for each chunk)
- M_in: Block diagonal mask
- M_out: Causal mask for chunk-level attention
Chunking Strategy
Chunk Size Selection
The chunk size (B) is a crucial hyperparameter that affects:
- Memory Usage: Larger chunks use more memory but provide better local context
- Computation Time: Smaller chunks are faster but may miss important long-range dependencies
- Model Expressivity: Chunk size affects how well the model can capture different types of relationships
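To make these trade-offs concrete, the small helper below (illustrative only, not part of the BSBR API) counts attention-score entries for the within-chunk (n·B) and between-chunk (n²/B) components described later in this guide:

```python
def attention_cost(n: int, chunk_size: int) -> dict:
    """Rough count of attention-score entries for a sequence of length n.

    Illustrative only: uses the n*B (within-chunk) and n^2/B (between-chunk)
    terms from the complexity discussion; constants and head counts ignored.
    """
    within = n * chunk_size          # n/B chunks, each with B*B scores
    between = n * n // chunk_size    # chunk-level retrieval scores
    return {"within_chunk": within, "between_chunk": between,
            "total": within + between, "full_attention": n * n}

# Larger chunks raise the within-chunk cost but shrink the between-chunk cost.
for B in (64, 128, 256):
    print(B, attention_cost(4096, B))
```

Doubling the chunk size doubles the within-chunk term while halving the between-chunk term, which is why the choice of B is a genuine trade-off rather than "bigger is better".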
Chunk Overlap
BSBR supports overlapping chunks to improve information flow between adjacent chunks:
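As a purely illustrative sketch of the idea (the function and parameter names here are hypothetical, not BSBR's API), overlapping chunk boundaries can be generated like this:

```python
def overlapping_chunks(seq_len: int, chunk_size: int, overlap: int):
    """Yield (start, end) index pairs for chunks that share `overlap` tokens.

    Hypothetical helper for illustration; BSBR's own chunking may differ.
    """
    stride = chunk_size - overlap
    starts = range(0, max(seq_len - overlap, 1), stride)
    return [(s, min(s + chunk_size, seq_len)) for s in starts]

# e.g. chunks of 8 tokens that share 2 tokens with their neighbour
print(overlapping_chunks(seq_len=20, chunk_size=8, overlap=2))
# [(0, 8), (6, 14), (12, 20)]
```

The shared tokens give adjacent chunks a common context window, so information near a chunk boundary is not visible to only one side.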
Block Retrieval
Meta Attention
Between chunks, BSBR uses a meta-attention mechanism to efficiently retrieve information:
- State Compression: Each chunk's information is compressed into a state vector
- Meta Queries: Special queries that operate at the chunk level
- Efficient Retrieval: O(n/B) complexity for chunk-level attention
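A simplified, single-head PyTorch sketch of this retrieval path is shown below. It follows the variable definitions given earlier (F as the flattened K^T·V state of each chunk, R and H as chunk-level queries and keys), but it is an illustration of the idea, not the library's implementation:

```python
import torch

def between_chunk_retrieval(q, k, v, r, h, chunk_size):
    """Simplified single-head sketch of BSBR-style block retrieval.

    q, k, v : (seq_len, d)    token-level projections
    r, h    : (num_chunks, d) chunk-level meta queries / keys
    Returns : (seq_len, d)    contribution retrieved from earlier chunks
    Shapes, scaling, and masking details may differ from the real code.
    """
    seq_len, d = k.shape
    num_chunks = seq_len // chunk_size

    # F: one flattened K^T V state per chunk, shape (num_chunks, d*d)
    k_c = k.view(num_chunks, chunk_size, d)
    v_c = v.view(num_chunks, chunk_size, d)
    states = torch.matmul(k_c.transpose(1, 2), v_c)        # (num_chunks, d, d)
    f = states.reshape(num_chunks, d * d)

    # Chunk-level attention: each chunk attends to itself and earlier chunks
    scores = r @ h.T / d ** 0.5                             # (num_chunks, num_chunks)
    causal = torch.tril(torch.ones(num_chunks, num_chunks, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    retrieved = torch.softmax(scores, dim=-1) @ f           # (num_chunks, d*d)
    retrieved = retrieved.view(num_chunks, d, d)

    # Each token applies its own query to its chunk's retrieved state
    q_c = q.view(num_chunks, chunk_size, d)
    out = torch.matmul(q_c, retrieved)                      # (num_chunks, chunk_size, d)
    return out.reshape(seq_len, d)

# toy usage
seq_len, d, B = 16, 4, 4
q = k = v = torch.randn(seq_len, d)
r = h = torch.randn(seq_len // B, d)
out = between_chunk_retrieval(q, k, v, r, h, chunk_size=B)
```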
Compression Factor
The compression factor (c) controls how much information is preserved in chunk states:
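As an illustration of one way such compression can be realized (the actual mechanism and parameter names in BSBR may differ), each chunk's d×d state can be projected down to d×(d/c) before flattening:

```python
import torch
import torch.nn as nn

d, c = 64, 4                       # model dim and compression factor (illustrative)
compress = nn.Linear(d, d // c, bias=False)

def compressed_state(k_chunk, v_chunk):
    """Flattened, compressed K^T V state for one chunk (sketch only)."""
    state = k_chunk.transpose(0, 1) @ v_chunk       # (d, d)
    return compress(state).reshape(-1)              # (d * d // c,)

k_chunk = torch.randn(128, d)                       # 128 tokens in this chunk
v_chunk = torch.randn(128, d)
print(compressed_state(k_chunk, v_chunk).shape)     # torch.Size([1024])
```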
Higher compression factors:
- Reduce memory usage
- Speed up computation
- May lose some fine-grained information
Memory Management
Gradient Checkpointing
BSBR supports gradient checkpointing to trade computation for memory:
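In PyTorch this is typically done with `torch.utils.checkpoint`, which recomputes a layer's activations during the backward pass instead of storing them. A minimal sketch of the pattern, with plain `Linear` layers standing in for BSBR blocks:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Run a stack of layers, recomputing activations in the backward pass."""

    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            # Activations inside `layer` are not kept; they are recomputed
            # when gradients are needed.
            x = checkpoint(layer, x, use_reentrant=False)
        return x

# toy usage with placeholder layers
stack = CheckpointedStack([torch.nn.Linear(32, 32) for _ in range(4)])
out = stack(torch.randn(8, 32, requires_grad=True))
out.sum().backward()
```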
Memory-Efficient Attention
The implementation includes several memory optimizations:
- Sparse Attention: Only compute attention for relevant token pairs
- State Reuse: Reuse chunk states across layers
- Efficient Masking: Optimized attention masks for causal language modeling
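For reference, boolean versions of the two masks defined earlier (the block-diagonal M_in and the causal chunk-level M_out) can be built as below; this is a minimal sketch, and the tensors used internally may differ:

```python
import torch

def build_masks(seq_len: int, chunk_size: int):
    """Boolean masks: block-diagonal M_in and causal chunk-level M_out (sketch)."""
    num_chunks = seq_len // chunk_size

    # M_in: token i may attend to token j only if both fall in the same chunk
    # (for causal language modeling this is typically intersected with a
    #  token-level causal mask as well)
    chunk_id = torch.arange(seq_len) // chunk_size
    m_in = chunk_id[:, None] == chunk_id[None, :]            # (seq_len, seq_len)

    # M_out: chunk c may attend to chunk c' only if c' <= c
    m_out = torch.tril(torch.ones(num_chunks, num_chunks, dtype=torch.bool))
    return m_in, m_out

m_in, m_out = build_masks(seq_len=12, chunk_size=4)
print(m_in.shape, m_out.shape)   # torch.Size([12, 12]) torch.Size([3, 3])
```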
Performance Characteristics
Computational Complexity
BSBR achieves near-linear complexity in sequence length:
- Within-chunk: O(n·B) where B is chunk size
- Between-chunk: O(n + n²/B)
- Overall: O(n·B + n²/B), a large constant-factor reduction over the O(n²) of full attention (see the note below)
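The within-chunk and between-chunk terms pull in opposite directions, so for a given n there is a chunk size that balances them; setting the derivative of n·B + n²/B to zero gives:

$$
\frac{d}{dB}\left(nB + \frac{n^2}{B}\right) = n - \frac{n^2}{B^2} = 0 \;\;\Rightarrow\;\; B = \sqrt{n},
$$

for a total cost on the order of n^1.5. This balancing rule is only a heuristic implied by the terms above, not an official tuning recommendation.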
Memory Usage
Memory consumption scales linearly with sequence length:
- Within-chunk: O(n·B)
- Between-chunk: O(n/B)
- State vectors: O(n/c) where c is compression factor
Best Practices
Model Configuration
Recommended configurations for different use cases:
- Short Sequences (n < 512)
- Medium Sequences (512 ≤ n < 2048)
- Long Sequences (n ≥ 2048)
Training Tips
- Learning Rate: Use slightly higher learning rates than standard transformers
- Warmup: Longer warmup periods may be needed for very long sequences
- Gradient Clipping: Monitor gradient norms and clip them if necessary (see the sketch after this list)
- Batch Size: Adjust based on available memory and sequence length
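For the gradient-clipping tip above, PyTorch's built-in utility can be dropped into the training step; a minimal sketch (the model, loss, and optimizer are placeholders):

```python
import torch

def clip_and_step(model, loss, optimizer, max_norm=1.0):
    """One optimizer step with gradient-norm monitoring and clipping (sketch)."""
    optimizer.zero_grad()
    loss.backward()
    # clip_grad_norm_ returns the pre-clipping total norm, handy for logging
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return grad_norm.item()
```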
Next Steps
- Check the API Reference for detailed documentation
- Explore Examples for usage examples
- See Research for benchmark results