RSR-core
RSR-core is the systems implementation of the Redundant Segment Reduction framework for efficient low-bit inference. The repository provides the core kernels, model integrations, and benchmarking pipeline needed to accelerate binary and ternary matrix-vector multiplication, which is a dominant operation in low-bit neural inference and LLM decoding.
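To make the core idea concrete: when weights are restricted to ±1, a row segment of length k can take only 2^k distinct patterns, so the partial dot product of each pattern with the matching slice of the input vector can be precomputed once and shared across all rows — the redundancy that segment reduction exploits. The sketch below illustrates this general idea in NumPy; it is a minimal reference, not the repository's kernels, and the function name `rsr_binary_matvec` is hypothetical.

```python
import numpy as np

def rsr_binary_matvec(W, x, k=8):
    """Binary (+1/-1) matrix-vector product via per-segment lookup tables.

    Illustrative sketch only: a segment of k +/-1 weights has just 2**k
    distinct patterns, so each pattern's dot product with the matching
    slice of x is precomputed once and reused across all rows.
    """
    m, n = W.shape
    assert n % k == 0
    num_seg = n // k
    # Enumerate all 2**k sign patterns: bit b set means weight +1, else -1.
    patterns = np.array([[1 if (p >> b) & 1 else -1 for b in range(k)]
                         for p in range(2 ** k)])           # (2**k, k)
    # Precompute one lookup table per segment position.
    x_seg = x.reshape(num_seg, k)                           # (num_seg, k)
    tables = patterns @ x_seg.T                             # (2**k, num_seg)
    # Encode each row segment of W as a pattern index.
    bits = (W.reshape(m, num_seg, k) > 0).astype(np.int64)  # +1 -> bit 1
    idx = (bits << np.arange(k)).sum(axis=2)                # (m, num_seg)
    # Each dot product becomes num_seg table lookups plus a sum.
    return tables[idx, np.arange(num_seg)].sum(axis=1)

# Check against the dense product.
rng = np.random.default_rng(0)
W = rng.choice([-1, 1], size=(16, 32))
x = rng.standard_normal(32)
assert np.allclose(rsr_binary_matvec(W, x, k=8), W @ x)
```

Precomputing the tables costs 2^k · n multiply-adds regardless of the number of rows, so for tall matrices (m ≫ 2^k) each additional row costs only lookups and additions.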
The engine supports both CPU and CUDA backends and exposes optimized low-level kernels, together with Python wrappers, for 1-bit (binary) and 1.58-bit (ternary) multiplication. It is designed to bridge the gap between RSR's algorithmic gains and practical deployment in real inference pipelines.
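The 1.58-bit case restricts weights to {-1, 0, +1} (log2 3 ≈ 1.58 bits per weight), which turns every dot product into pure additions and subtractions. The following sketch shows that property directly; it is a hypothetical reference implementation for clarity, not the repository's CPU/CUDA kernels, which operate on packed representations.

```python
import numpy as np

def ternary_matvec(W, x):
    """Ternary (1.58-bit) matrix-vector product without multiplications.

    For each row: add the entries of x where the weight is +1 and
    subtract those where it is -1; zero weights contribute nothing.
    Hypothetical reference code, not the repo's optimized kernels.
    """
    pos = W == 1
    neg = W == -1
    return np.array([x[p].sum() - x[n].sum() for p, n in zip(pos, neg)])

# Check against the dense product.
rng = np.random.default_rng(1)
W = rng.choice([-1, 0, 1], size=(8, 16))
x = rng.standard_normal(16)
assert np.allclose(ternary_matvec(W, x), W @ x)
```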
The software also includes Hugging Face integration for preprocessing quantized models into RSR format and running accelerated inference from those preprocessed artifacts. In addition, the repository provides benchmarking scripts for kernel-level and end-to-end evaluation, making it easy to reproduce performance results on local hardware.
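As a rough illustration of what ahead-of-time preprocessing of a quantized checkpoint involves, the sketch below packs ternary weights four per byte (2 bits each). This is a generic, hypothetical packing scheme for illustration only; the actual RSR artifact format produced by the preprocessing scripts differs.

```python
import numpy as np

def pack_ternary(W):
    """Pack ternary weights (-1/0/+1) four per byte, 2 bits each.

    Hypothetical example of offline checkpoint preprocessing; the
    real RSR format is different.
    """
    codes = (W + 1).astype(np.uint8)          # map -1,0,+1 -> 0,1,2
    flat = codes.reshape(-1)
    assert flat.size % 4 == 0
    quads = flat.reshape(-1, 4)
    shifts = np.arange(4, dtype=np.uint8) * 2
    packed = (quads << shifts).sum(axis=1).astype(np.uint8)
    return packed, W.shape

def unpack_ternary(packed, shape):
    """Invert pack_ternary: recover the ternary weight tensor."""
    quads = np.stack([(packed >> (2 * i)) & 0b11 for i in range(4)], axis=1)
    return quads.reshape(shape).astype(np.int8) - 1

# Round-trip check: 4x smaller on disk, losslessly recoverable.
W = np.random.default_rng(2).choice([-1, 0, 1], size=(4, 8)).astype(np.int8)
packed, shape = pack_ternary(W)
assert packed.nbytes == W.size // 4
assert np.array_equal(unpack_ternary(packed, shape), W)
```

Doing this packing once, offline, is what lets the inference path load a compact artifact and skip per-run quantization work.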
For interactive use, RSR-core includes a web dashboard built with FastAPI and Vite/React. The interface supports model browsing, preprocessing, side-by-side backend comparison, inference, and benchmark visualization, making the project a production-oriented workflow rather than a standalone prototype.
According to the repository benchmarks, the engine achieves substantial speedups over Hugging Face PyTorch baselines, including up to 62× higher throughput on CPU and up to 1.9× faster token generation on CUDA for popular ternary LLMs. A video demonstration is linked from the repository README.