Project Website

Efficient LLM Inference

Efficient inference through indexing, kernels, and KV-cache retrieval

This project advances algorithmic and systems foundations for efficient large language model inference, spanning exact acceleration for low-bit matrix-vector multiplication, optimized CPU/CUDA kernels, and recall-guaranteed sparse attention over the KV cache.

Project overview

The efficiency of large language model inference is shaped by more than one bottleneck. Low-bit quantization reduces model size, but inference still depends on fast matrix-vector multiplication over binary and ternary weight matrices. Long-context decoding introduces another cost: each generated token must use the KV cache to attend over previously computed keys and values.

This project studies those bottlenecks through indexing, exact algorithms, and hardware-aware implementation. RSR and RSR++ preprocess fixed low-bit weight matrices into compact indices for exact matrix-vector multiplication. RSR-core implements that direction as optimized CPU and CUDA kernels with model-inference integration.

Louver extends the project scope to attention itself by formulating sparse attention over the KV cache as a halfspace range-searching problem with zero false negatives relative to a specified relevance threshold.

The broader goal is to make LLM inference more efficient and reliable on practical hardware without treating accuracy loss as an unavoidable cost of speed. Across the publications, the project examines when preprocessing, data structures, sparse retrieval, and systems optimization can reduce runtime or memory pressure while preserving exactness, recall guarantees, or clearly characterized approximation limits.

Publications

CoRR 2026

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

Mohsen Dehghankar and Abolfazl Asudeh · CoRR abs/2605.06763 (Preprint), 2026

Illustration for the Louver sparse attention KV cache indexing paper

Sparse attention improves LLM inference efficiency by selecting a subset of key–value (KV) entries, but at the cost of potential accuracy degradation. In particular, omitting critical KV entries can induce substantial errors in model outputs. Existing methods typically operate under fixed or adaptive token budgets and provide empirical robustness or partial theoretical guarantees, yet they do not ensure zero false negatives across decoding steps, particularly since the set of relevant tokens is both query- and step-dependent.

Our empirical observations confirm that missing even one critical key can lead to sharp error spikes, especially in long-output reasoning tasks where the set of important tokens varies throughout decoding. This observation motivates the need for indexing methods that dynamically adapt to these variations across decoding steps while guaranteeing a full recall of the relevant keys above a certain threshold.

We address this challenge by reformulating sparse attention as the computational geometry problem of halfspace range searching. However, existing range searching data structures are not suitable for modern LLM inference due to their computational and implementation overheads. To overcome this, we introduce Louver, a novel index structure tailored for efficient KV cache retrieval.

Louver (i) guarantees zero false negatives with respect to a specified threshold in both theory and practice, (ii) is lightweight to integrate into existing LLM inference pipelines, and (iii) incorporates hardware-aware optimizations for both CPU and GPU executions.

Our experiments demonstrate that Louver outperforms prior sparse attention methods in both accuracy and runtime, and is faster than highly optimized dense implementations such as FlashAttention. These results highlight that recall guarantees are a critical and overlooked dimension of sparse attention, and open a new direction for building theoretically grounded, efficient KV cache indices.

Citation: Mohsen Dehghankar and Abolfazl Asudeh. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache. CoRR, abs/2605.06763 (Preprint) 2026.

ICML 2025

An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Mohsen Dehghankar, Mahdi Erfanian, and Abolfazl Asudeh · Proceedings of the 42nd International Conference on Machine Learning, Vol. 267, pp. 12969-12986, 2025

Illustration for the RSR and RSR++ matrix multiplication paper

Matrix-vector multiplication is the central computational bottleneck in inference for binary and ternary neural networks, including low-bit large language models. Although quantization reduces model size, the resulting inference pipelines often remain limited by memory movement and inefficient execution, preventing the full benefit of low-bit representations from being realized in practice.

This paper introduces Redundant Segment Reduction (RSR) and RSR++, the first algorithmic framework for accelerating low-bit matrix-vector multiplication without sacrificing exactness. Instead of trading accuracy for efficiency, the method preprocesses fixed weight matrices after training and builds compact index structures that reduce redundant computation during inference.

The result is a logarithmic-factor improvement in both time and effective memory complexity. For an n × n weight matrix, the method achieves O(n² / log n) time complexity, improving on standard multiplication while preserving exact computation.

Beyond the theoretical guarantees, the method delivers strong practical gains. Experiments show up to 30× faster matrix multiplication, up to 6× lower memory usage, up to 6× faster LLM inference on CPU without system-level optimization, and up to 2.5× faster LLM inference on GPU. The design also exposes substantial parallelism, since computations across column blocks are independent and can be executed concurrently.

More broadly, this work shows that efficient inference can be advanced not only through model compression, but also through better exact algorithms for the core operations that dominate runtime. In that sense, it provides an algorithmic foundation for making capable neural models more practical, affordable, and accessible on commodity hardware.

Citation: Mohsen Dehghankar, Mahdi Erfanian, and Abolfazl Asudeh. 2025. An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks. In Proceedings of the 42nd International Conference on Machine Learning, Vol. 267, pp. 12969-12986.

CoRR 2026

RSR-core: A High-Performance Engine for Low-Bit Matrix-Vector Multiplication

Mohsen Dehghankar and Abolfazl Asudeh · CoRR abs/2603.27462, 2026

Illustration for the RSR-core engine paper

Matrix-vector multiplication is a core operation in neural networks, vector databases, and large language models, especially at inference time. As low-bit quantization becomes increasingly important for efficient deployment, binary and ternary weight matrices offer strong opportunities for acceleration, but practical gains depend on whether the underlying multiplication can be executed efficiently at the kernel level.

This paper presents RSR-core, a high-performance engine that brings the algorithmic ideas of Redundant Segment Reduction into optimized low-level kernels for both CPU and CUDA environments. While prior implementations of RSR operated mainly at the application level, RSR-core closes the gap between theory and practice by enabling efficient integration into real inference systems.

The engine supports matrix-vector multiplication for binary and ternary weight matrices with general vectors, and is designed for production use. It also includes Hugging Face integration for preprocessing low-bit models and running accelerated inference within existing workflows.

Experimentally, RSR-core delivers substantial performance improvements over baseline Hugging Face PyTorch multiplication. The results show up to 62× speedup on CPU and up to 1.9× faster token generation on CUDA for popular ternary LLMs, demonstrating that low-bit acceleration can translate into meaningful end-to-end gains when paired with a systems-oriented implementation.

More broadly, this work advances the project’s goal of making efficient LLM inference practical by turning a theoretically grounded algorithm into a deployable engine. In doing so, it strengthens the systems foundation for low-bit LLM inference on both commodity CPUs and modern GPUs.

Citation: Mohsen Dehghankar and Abolfazl Asudeh. 2026. RSR-core: A High-Performance Engine for Low-Bit Matrix-Vector Multiplication. CoRR abs/2603.27462.

EDBT 2025

Evaluating the Feasibility of Sampling-Based Techniques for Training Multilayer Perceptrons

Sana Ebrahimi, Rishi Advani, and Abolfazl Asudeh · EDBT 2025

Illustration for the sampling-based MLP training paper

Matrix multiplication is the main computational bottleneck in deep neural networks, creating a fundamental challenge for training and personalization on low-resource machines. As demand grows for running capable models, including LLM-based systems, on commodity CPUs, it becomes increasingly important to understand whether approximation techniques can meaningfully reduce this cost without undermining learning quality.

This paper studies sampling-based approaches for accelerating multilayer perceptron training by approximating matrix multiplication. We identify two main families of methods in the literature: those that activate only a sampled subset of hidden-layer nodes at each iteration, and those that sample nodes from the previous layer to approximate the current layer’s activations. In both cases, the matrix products are computed using only the selected samples.

Through theoretical analysis and comprehensive experiments, the paper shows that these techniques face important scalability limitations. The central issue is that even small approximation errors can propagate across layers, leading to an exponential growth in estimation error as network depth increases.

While the work does not yet resolve this limitation, it provides an important negative result and clarifies the boundaries of currently available sampling-based methods. This makes the paper valuable not only as an evaluation of prior ideas, but also as a foundation for future research on more robust approaches to efficient neural network training.

More broadly, the paper complements the project’s focus on efficient inference by examining a related algorithmic question at training time: when approximation can help, and when it breaks down. In doing so, it helps sharpen the research agenda for efficient neural computation on resource-constrained devices.

Citation: Sana Ebrahimi, Rishi Advani, and Abolfazl Asudeh. 2025. Evaluating the Feasibility of Sampling-Based Techniques for Training Multilayer Perceptrons. In EDBT 2025.

Softwares

Software

RSR-core

A high-performance engine for low-bit matrix-vector multiplication across CPU and CUDA backends.

RSR-core demo visual comparing RSR against a Hugging Face baseline

RSR-core is the systems implementation of the Redundant Segment Reduction framework for efficient low-bit inference. The repository provides the core kernels, model integrations, and benchmarking pipeline needed to accelerate binary and ternary matrix-vector multiplication, which is a dominant operation in low-bit neural inference and LLM decoding.

The engine supports both CPU and CUDA backends and exposes optimized low-level kernels together with Python wrappers for 1-bit and 1.58-bit multiplication. It is designed to bridge the gap between the algorithmic gains of RSR and practical deployment in real inference pipelines.

The software also includes Hugging Face integration for preprocessing quantized models into RSR format and running accelerated inference from those preprocessed artifacts. In addition, the repository provides benchmarking scripts for kernel-level and end-to-end evaluation, making it easy to reproduce performance results on local hardware.

For interactive use, RSR-core includes a web dashboard built with FastAPI and Vite/React. The interface supports model browsing, preprocessing, side-by-side backend comparison, inference, and benchmark visualization, turning the project into a production-oriented workflow rather than a standalone prototype.

According to the repository benchmarks, the engine achieves substantial speedups over Hugging Face PyTorch baselines, including up to 62× higher throughput on CPU and up to 1.9× faster token generation on CUDA for popular ternary LLMs. The demo visual above links to the project’s video demonstration from the repository README.

Project investigators

A. Asudeh

Faculty

Mohsen Dehghankar

Lead PhD Student

Mahdi Erfanian, Sana Ebrahimi, Rishi Advani

Collaborators (PhD Students)