# Research Benchmarks

This document provides detailed benchmark results comparing BSBR with other transformer architectures, based on the experiments run in the `research/architecture_comparisons/` directory.
## Benchmark Setup

### Environment

- Hardware: CPU (Intel Core i9)
- Software: Python 3.12, PyTorch, Transformers, etc. (see `requirements.txt`)
- BSBR Version: 0.1.2
### Models Compared

The following models were compared, each configured with comparable hyperparameters (hidden dim: 256, heads: 4, layers: 2 where applicable):
- BSBR (Block Sparse Attention with Block Retrieval)
- Standard Transformer
- Linear Transformer
- DeltaNet
- Sliding Window Transformer
- Hopfield Network
- GAU (Gated Attention Unit)
### Parameter Counts
Model | Parameters (Millions) | Relative to Base (Standard) |
---|---|---|
BSBR | 6.0M | 1.66x |
Standard | 3.6M | 1.0x |
Linear | 3.6M | 1.0x |
DeltaNet | 3.6M | 1.0x |
SlidingWindow | 3.6M | 1.0x |
Hopfield | 3.6M | 1.0x |
GAU | 4.4M | 1.22x |
Data source: `research/architecture_comparisons/results/comparison_results.json`
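These totals come from the comparison script's model configurations. For reference, counts like these can be reproduced for any PyTorch module with a short helper; the `count_parameters` function and the stand-in model below are illustrative only, not part of the BSBR API.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Return the number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Stand-in module for illustration; substitute any of the compared models.
example = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
print(f"{count_parameters(example):.1f}M trainable parameters")
```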
## Performance Benchmarks (CPU)

Results are based on runs with sequence lengths [64, 128, 256, 512, 1024]. Data source: `research/architecture_comparisons/results/comparison_results.json`

### Inference Time (seconds)
Model | n=64 | n=128 | n=256 | n=512 | n=1024 |
---|---|---|---|---|---|
BSBR | 0.462 | 0.560 | 0.753 | 1.570 | 3.092 |
Linear | 1.570 | 2.742 | 4.896 | 8.879 | 17.322 |
DeltaNet | 8.085 | 13.31 | 23.71 | 46.166 | 92.276 |
Standard | 0.254 | 0.334 | 0.453 | 0.908 | 2.538 |
SlidingWindow | 0.514 | 0.748 | 1.289 | 2.442 | 5.568 |
Hopfield | 0.255 | 0.365 | 0.478 | 0.937 | 2.568 |
GAU | 0.488 | 0.880 | 1.950 | 5.381 | 17.649 |
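The timings above were produced by the comparison script. As a rough sketch (not the script's actual code), a CPU timing harness over the same sequence lengths could look like the following; the batch size, repeat count, and the assumption that models accept pre-embedded inputs of shape `(batch, seq_len, hidden_dim)` are illustrative choices.

```python
import time
import torch

def time_inference(model, seq_lengths=(64, 128, 256, 512, 1024),
                   hidden_dim=256, repeats=5):
    """Average CPU wall time for one forward pass at each sequence length."""
    model.eval()
    timings = {}
    with torch.no_grad():
        for n in seq_lengths:
            x = torch.randn(1, n, hidden_dim)  # dummy batch of embeddings
            start = time.perf_counter()
            for _ in range(repeats):
                model(x)
            timings[n] = (time.perf_counter() - start) / repeats
    return timings
```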
### Peak Memory Usage (MB)
Model | n=64 | n=128 | n=256 | n=512 | n=1024 |
---|---|---|---|---|---|
BSBR | 22.826 | 22.826 | 22.827 | 22.829 | 22.833 |
Linear | 13.790 | 13.790 | 13.791 | 13.793 | 13.797 |
DeltaNet | 13.790 | 13.790 | 13.791 | 13.793 | 13.797 |
Standard | 13.790 | 13.790 | 13.791 | 13.793 | 13.797 |
SlidingWindow | 13.790 | 13.790 | 13.791 | 13.793 | 13.797 |
Hopfield | 13.790 | 13.790 | 13.791 | 13.793 | 13.797 |
GAU | 16.799 | 16.800 | 16.801 | 16.803 | 16.807 |
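The near-constant values closely track each model's float32 parameter storage (3.6M parameters × 4 bytes ≈ 13.7 MiB; 6.0M ≈ 22.9 MiB), suggesting the metric is dominated by weights rather than activations at these lengths. A simple proxy in that spirit is sketched below; the comparison script may measure memory differently.

```python
import torch.nn as nn

def approx_memory_mb(model: nn.Module, seq_len: int, hidden_dim: int = 256) -> float:
    """Rough estimate in MiB: float32 parameter storage plus one input batch."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    input_bytes = seq_len * hidden_dim * 4  # one float32 sequence of embeddings
    return (param_bytes + input_bytes) / (1024 ** 2)
```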
## Complexity Analysis

Empirical complexity was estimated by fitting power-law curves of the form time ≈ c·n^α to the inference time data; a minimal reproduction sketch appears after the table and note below.
Model | Empirical Complexity | R-squared | Time at n=1024 (seconds) | Memory at n=1024 (MB) |
---|---|---|---|---|
BSBR | O(n^0.70) ≈ O(n) | 0.9380 | 3.0916 | 22.83 |
Standard | O(n^0.81) ≈ O(n) | 0.9212 | 2.5382 | 13.80 |
Linear | O(n^0.86) ≈ O(n) | 0.9988 | 17.3223 | 13.80 |
DeltaNet | O(n^0.88) ≈ O(n) | 0.9956 | 92.2763 | 13.80 |
SlidingWindow | O(n^0.86) ≈ O(n) | 0.9804 | 5.5680 | 13.80 |
Hopfield | O(n^0.80) ≈ O(n) | 0.9308 | 2.5681 | 13.80 |
GAU | O(n^1.30) ≈ O(n log n) | 0.9826 | 17.6486 | 16.81 |
Note: Empirical complexity measured on CPU for n ≤ 1024 may differ from theoretical asymptotic behavior; in particular, Standard attention is theoretically O(n²) but appears sub-quadratic at these lengths, likely because fixed per-layer overhead dominates the measurements.
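The exponents and R-squared values above are read from `comparison_results.json`. A fit of this kind can be reproduced with a linear regression in log-log space; the sketch below uses only NumPy and the BSBR timings from the inference table, and the research script may use a different fitting routine.

```python
import numpy as np

def fit_power_law(seq_lengths, times):
    """Fit time ~ c * n**alpha by least squares in log-log space.

    Returns (alpha, r_squared); R^2 is computed on the log-transformed data.
    """
    log_n = np.log(np.asarray(seq_lengths, dtype=float))
    log_t = np.log(np.asarray(times, dtype=float))
    alpha, log_c = np.polyfit(log_n, log_t, 1)
    residuals = log_t - (alpha * log_n + log_c)
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((log_t - log_t.mean()) ** 2)
    return alpha, r_squared

# BSBR inference times from the table above
alpha, r2 = fit_power_law([64, 128, 256, 512, 1024],
                          [0.462, 0.560, 0.753, 1.570, 3.092])
print(f"BSBR: O(n^{alpha:.2f}), R^2 = {r2:.3f}")  # roughly n^0.70
```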
## Visualizations

Note: Figure paths are relative to the `docs/` directory.

- Complexity & Scaling figures
- Performance Comparison figures
Note: The sections below contain placeholder data and are for illustrative purposes only.
## Training Benchmarks

### Convergence Speed
```python
convergence_results = {
    'BSBR': {
        'epochs_to_converge': 50,
        'final_loss': 0.15,
        'validation_accuracy': 0.92
    },
    'Standard': {
        'epochs_to_converge': 45,
        'final_loss': 0.18,
        'validation_accuracy': 0.89
    },
    'Linear': {
        'epochs_to_converge': 55,
        'final_loss': 0.20,
        'validation_accuracy': 0.87
    }
}
```
### Training Memory
```python
training_memory = {
    'BSBR': {
        'peak_memory': 2.5,       # GB
        'gradient_memory': 1.4,   # GB
        'activation_memory': 0.8  # GB
    },
    'Standard': {
        'peak_memory': 16.0,      # GB
        'gradient_memory': 12.8,  # GB
        'activation_memory': 8.0  # GB
    },
    'Linear': {
        'peak_memory': 2.5,       # GB
        'gradient_memory': 1.4,   # GB
        'activation_memory': 0.8  # GB
    }
}
```
## Task-Specific Benchmarks

### Document Classification
```python
document_results = {
    'sequence_lengths': [512, 1024, 2048],
    'BSBR': {
        'accuracy': [0.94, 0.92, 0.89],
        'inference_time': [0.4, 0.7, 1.2]  # seconds
    },
    'Standard': {
        'accuracy': [0.95, 0.88, 0.75],
        'inference_time': [1.5, 5.0, 18.0]  # seconds
    },
    'Linear': {
        'accuracy': [0.92, 0.89, 0.85],
        'inference_time': [0.4, 0.7, 1.2]  # seconds
    }
}
```
### Language Modeling
```python
language_modeling_results = {
    'sequence_lengths': [512, 1024, 2048],
    'BSBR': {
        'perplexity': [15.2, 16.8, 18.5],
        'inference_time': [0.4, 0.7, 1.2]  # seconds
    },
    'Standard': {
        'perplexity': [14.8, 16.5, 18.2],
        'inference_time': [1.5, 5.0, 18.0]  # seconds
    },
    'Linear': {
        'perplexity': [15.5, 17.2, 19.0],
        'inference_time': [0.4, 0.7, 1.2]  # seconds
    }
}
```
### Question Answering
```python
qa_results = {
    'sequence_lengths': [512, 1024, 2048],
    'BSBR': {
        'f1_score': [0.82, 0.80, 0.77],
        'inference_time': [0.4, 0.7, 1.2]  # seconds
    },
    'Standard': {
        'f1_score': [0.83, 0.79, 0.74],
        'inference_time': [1.5, 5.0, 18.0]  # seconds
    },
    'Linear': {
        'f1_score': [0.81, 0.78, 0.75],
        'inference_time': [0.4, 0.7, 1.2]  # seconds
    }
}
```
## Hardware Utilization

### GPU Utilization
```python
gpu_utilization = {
    'BSBR': {
        'gpu_util': 85,      # percentage
        'memory_util': 60,   # percentage
        'power_usage': 250   # watts
    },
    'Standard': {
        'gpu_util': 95,      # percentage
        'memory_util': 90,   # percentage
        'power_usage': 300   # watts
    },
    'Linear': {
        'gpu_util': 80,      # percentage
        'memory_util': 55,   # percentage
        'power_usage': 230   # watts
    }
}
```