Project Website

Efficient LLM Inference

Efficient (and privacy-preserving) inference on personal devices

This project advances systems and algorithmic foundations for running large language models efficiently on personal devices, reducing dependence on cloud-only deployment while improving privacy, accessibility, and cost efficiency.

Project overview

The efficiency of large language model inference is fundamentally shaped by low-level kernel execution, especially matrix-vector multiplication. Although low-bit quantization has reduced model size, practical inference pipelines still suffer from inefficient memory access and limited hardware utilization, making high-quality LLM inference expensive and difficult to deploy on commodity devices.

As a result, advanced LLM capabilities remain concentrated in cloud environments backed by powerful GPU servers. While cloud-based inference broadens access, it also introduces recurring monetary and energy costs, network-dependent latency, and privacy concerns associated with transmitting user data to external infrastructure.

This project aims to shift LLM inference toward a more user-centric and widely accessible model by enabling accurate, private, and efficient execution on personal computers. The broader goal is to support LLM operation without reliance on costly, energy-intensive accelerators or continuous internet connectivity, thereby expanding access to advanced language technologies.

The project will deliver a unified framework for privacy-preserving on-device LLM inference. By improving efficiency on personal devices, this work seeks to reduce dependence on centralized infrastructure, strengthen data protection, and support more trustworthy and practical deployment of modern AI systems.

Software

Beyond papers, this project also produces software artifacts that make efficient low-bit inference practical in real systems.
RSR-core

A high-performance engine for low-bit matrix-vector multiplication across CPU and CUDA backends.

RSR-core demo visual comparing RSR against a Hugging Face baseline

RSR-core is the systems implementation of the Redundant Segment Reduction framework for efficient low-bit inference. The repository provides the core kernels, model integrations, and benchmarking pipeline needed to accelerate binary and ternary matrix-vector multiplication, which is a dominant operation in low-bit neural inference and LLM decoding.

The engine supports both CPU and CUDA backends and exposes optimized low-level kernels together with Python wrappers for 1-bit and 1.58-bit multiplication. It is designed to bridge the gap between the algorithmic gains of RSR and practical deployment in real inference pipelines.
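To make the target operation concrete, here is a minimal pure-Python sketch of the kind of multiplication these kernels accelerate. This is an illustration of the concept, not RSR-core's actual API: when weights are restricted to {-1, 0, +1} (ternary, i.e. 1.58-bit) or {-1, +1} (1-bit), every multiplication degenerates into an addition or subtraction, which is exactly what specialized low-bit kernels exploit.

```python
def ternary_matvec(W, x):
    """Naive ternary matrix-vector product: W has entries in {-1, 0, 1}.

    Illustrative sketch only; real kernels pack weights into bit planes
    and vectorize, but the arithmetic being replaced is the same.
    """
    out = []
    for row in W:
        acc = 0.0
        for w, v in zip(row, x):
            if w == 1:
                acc += v      # multiplication by +1 becomes an add
            elif w == -1:
                acc -= v      # multiplication by -1 becomes a subtract
        out.append(acc)       # zeros contribute nothing at all
    return out

W = [[1, -1, 0],
     [0,  1, 1]]
x = [2.0, 3.0, 4.0]
print(ternary_matvec(W, x))  # [-1.0, 7.0]
```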

The software also includes Hugging Face integration for preprocessing quantized models into RSR format and running accelerated inference from those preprocessed artifacts. In addition, the repository provides benchmarking scripts for kernel-level and end-to-end evaluation, making it easy to reproduce performance results on local hardware.

For interactive use, RSR-core includes a web dashboard built with FastAPI and Vite/React. The interface supports model browsing, preprocessing, side-by-side backend comparison, inference, and benchmark visualization, turning the project into a production-oriented workflow rather than a standalone prototype.

According to the repository benchmarks, the engine achieves substantial speedups over Hugging Face PyTorch baselines, including up to 62× higher throughput on CPU and up to 1.9× faster token generation on CUDA for popular ternary LLMs. The demo visual above links to the project’s video demonstration from the repository README.

Publications

This project currently includes three papers on algorithmic and systems foundations for efficient low-bit neural inference.

ICML 2025

An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Mohsen Dehghankar, Mahdi Erfanian, and Abolfazl Asudeh · Proceedings of the 42nd International Conference on Machine Learning, Vol. 267, pp. 12969-12986, 2025

Illustration for the RSR and RSR++ matrix multiplication paper

Matrix-vector multiplication is the central computational bottleneck in inference for binary and ternary neural networks, including low-bit large language models. Although quantization reduces model size, the resulting inference pipelines often remain limited by memory movement and inefficient execution, preventing the full benefit of low-bit representations from being realized in practice.

This paper introduces Redundant Segment Reduction (RSR) and RSR++, the first algorithmic framework for accelerating low-bit matrix-vector multiplication without sacrificing exactness. Instead of trading accuracy for efficiency, the method preprocesses fixed weight matrices after training and builds compact index structures that reduce redundant computation during inference.

The result is a logarithmic-factor improvement in both time and effective memory complexity. For an n × n weight matrix, the method achieves O(n² / log n) time complexity, improving on standard multiplication while preserving exact computation.
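The intuition behind the bound can be sketched in a few lines of Python. This is a simplified illustration of the segment-reuse idea (using 0/1 weights for clarity), not the paper's full algorithm: columns are processed in blocks of width k, and within a block there are only 2^k distinct row patterns, so each pattern's dot product with the vector segment is computed once and reused by every row that shares it. Choosing k on the order of log₂ n keeps the pattern table comparable in size to the number of rows, which is where the O(n² / log n) total cost comes from.

```python
from itertools import product

def segment_reuse_matvec(W, x, k):
    """Sketch of segment-reuse matvec for a 0/1 matrix W.

    Simplified illustration of the redundancy being removed: each of the
    2^k column-block patterns is evaluated against the vector segment
    exactly once, and rows look up their pattern's precomputed result.
    """
    n_rows, n_cols = len(W), len(W[0])
    out = [0.0] * n_rows
    for start in range(0, n_cols, k):
        seg = x[start:start + k]
        # Precompute the dot product of every possible pattern with this segment.
        table = {}
        for pattern in product((0, 1), repeat=len(seg)):
            table[pattern] = sum(b * v for b, v in zip(pattern, seg))
        # Rows reuse the shared results instead of recomputing them.
        for i, row in enumerate(W):
            out[i] += table[tuple(row[start:start + k])]
    return out

W = [[1, 0, 1, 1],
     [1, 0, 1, 0],
     [0, 1, 1, 1]]
x = [1.0, 2.0, 3.0, 4.0]
print(segment_reuse_matvec(W, x, k=2))  # [8.0, 4.0, 9.0]
```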

Beyond the theoretical guarantees, the method delivers strong practical gains. Experiments show up to 30× faster matrix multiplication, up to 6× lower memory usage, up to 6× faster LLM inference on CPU without system-level optimization, and up to 2.5× faster LLM inference on GPU. The design also exposes substantial parallelism, since computations across column blocks are independent and can be executed concurrently.

More broadly, this work shows that efficient on-device inference can be advanced not only through model compression, but also through better exact algorithms for the core operations that dominate runtime. In that sense, it provides an algorithmic foundation for making capable neural models more practical, affordable, and accessible on commodity hardware.

Citation: Mohsen Dehghankar, Mahdi Erfanian, and Abolfazl Asudeh. 2025. An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks. In Proceedings of the 42nd International Conference on Machine Learning, Vol. 267, pp. 12969-12986.

CoRR 2026

RSR-core: A High-Performance Engine for Low-Bit Matrix-Vector Multiplication

Mohsen Dehghankar and Abolfazl Asudeh · CoRR abs/2603.27462, 2026

Illustration for the RSR-core engine paper

Matrix-vector multiplication is a core operation in neural networks, vector databases, and large language models, especially at inference time. As low-bit quantization becomes increasingly important for efficient deployment, binary and ternary weight matrices offer strong opportunities for acceleration, but practical gains depend on whether the underlying multiplication can be executed efficiently at the kernel level.

This paper presents RSR-core, a high-performance engine that brings the algorithmic ideas of Redundant Segment Reduction into optimized low-level kernels for both CPU and CUDA environments. While prior implementations of RSR operated mainly at the application level, RSR-core closes the gap between theory and practice by enabling efficient integration into real inference systems.

The engine supports matrix-vector multiplication for binary and ternary weight matrices with general vectors, and is designed for production use. It also includes Hugging Face integration for preprocessing low-bit models and running accelerated inference within existing workflows.

Experimentally, RSR-core delivers substantial performance improvements over baseline Hugging Face PyTorch multiplication. The results show up to 62× speedup on CPU and up to 1.9× faster token generation on CUDA for popular ternary LLMs, demonstrating that low-bit acceleration can translate into meaningful end-to-end gains when paired with a systems-oriented implementation.

More broadly, this work advances the project’s goal of making efficient on-device inference practical by turning a theoretically grounded algorithm into a deployable engine. In doing so, it strengthens the systems foundation for low-bit LLM inference on both commodity CPUs and modern GPUs.

Citation: Mohsen Dehghankar and Abolfazl Asudeh. 2026. RSR-core: A High-Performance Engine for Low-Bit Matrix-Vector Multiplication. CoRR abs/2603.27462.

EDBT 2025

Evaluating the Feasibility of Sampling-Based Techniques for Training Multilayer Perceptrons

Sana Ebrahimi, Rishi Advani, and Abolfazl Asudeh · EDBT 2025

Illustration for the sampling-based MLP training paper

Matrix multiplication is the main computational bottleneck in deep neural networks, creating a fundamental challenge for training and personalization on low-resource machines. As demand grows for running capable models, including LLM-based systems, on commodity CPUs, it becomes increasingly important to understand whether approximation techniques can meaningfully reduce this cost without undermining learning quality.

This paper studies sampling-based approaches for accelerating multilayer perceptron training by approximating matrix multiplication. It identifies two main families of methods in the literature: those that activate only a sampled subset of hidden-layer nodes at each iteration, and those that sample nodes from the previous layer to approximate the current layer’s activations. In both cases, the matrix products are computed using only the selected samples.
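The second family can be sketched as importance sampling of the previous layer. The snippet below is a minimal illustration of the general idea, not any specific method from the paper: indices are drawn with probability proportional to the magnitude of the previous layer's activations and rescaled, yielding an unbiased but noisy estimate of the exact product.

```python
import random

def sampled_matvec(W, h, m, rng=random):
    """Unbiased Monte Carlo estimate of W @ h using m sampled entries of h.

    Illustrative sketch: sample index j with probability p_j ~ |h_j|,
    then add W[:, j] * h_j / (p_j * m). The estimate is unbiased, but
    its variance depends on m and on how concentrated h is.
    """
    total = sum(abs(v) for v in h)
    probs = [abs(v) / total for v in h]
    out = [0.0] * len(W)
    for _ in range(m):
        j = rng.choices(range(len(h)), weights=probs)[0]
        scale = h[j] / (probs[j] * m)   # rescale to keep the estimate unbiased
        for i in range(len(W)):
            out[i] += W[i][j] * scale
    return out

# When h has a single nonzero entry, the estimate is exact:
W = [[1.0, 2.0], [3.0, 4.0]]
print(sampled_matvec(W, [0.0, 5.0], m=4))  # [10.0, 20.0]
```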

Through theoretical analysis and comprehensive experiments, the paper shows that these techniques face important scalability limitations. The central issue is that even small approximation errors can propagate across layers, leading to an exponential growth in estimation error as network depth increases.
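A toy worst-case calculation (not the paper's formal analysis) illustrates why depth is the limiting factor: if each layer's approximation contributes a relative error of ε multiplicatively, the accumulated error after d layers behaves like (1 + ε)^d − 1, which grows exponentially in d.

```python
def compounded_error(eps, depth):
    """Worst-case relative error after `depth` layers when each layer
    adds relative error `eps` multiplicatively: (1 + eps)**depth - 1.

    Toy model of error propagation, not the paper's exact bound.
    """
    return (1.0 + eps) ** depth - 1.0

# A modest 5% per-layer error already exceeds 100% accumulated error
# by around 15 layers, even though each layer looks nearly exact.
for depth in (2, 5, 10, 15, 20):
    print(depth, round(compounded_error(0.05, depth), 3))
```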

While the work does not yet resolve this limitation, it provides an important negative result and clarifies the boundaries of currently available sampling-based methods. This makes the paper valuable not only as an evaluation of prior ideas, but also as a foundation for future research on more robust approaches to efficient neural network training.

More broadly, the paper complements the project’s focus on efficient inference by examining a related algorithmic question at training time: when approximation can help, and when it breaks down. In doing so, it helps sharpen the research agenda for efficient neural computation on resource-constrained devices.

Citation: Sana Ebrahimi, Rishi Advani, and Abolfazl Asudeh. 2025. Evaluating the Feasibility of Sampling-Based Techniques for Training Multilayer Perceptrons. In EDBT 2025.

Project investigators

Abolfazl Asudeh

Faculty

Mohsen Dehghankar

Lead PhD Student

Mahdi Erfanian

PhD Student

Sana Ebrahimi

PhD Student

Rishi Advani

PhD Student