Project Website

Theoretically Sound and Practically Efficient

Vector Data Retrieval

bridging rigorous retrieval foundations with deployable vector database systems

This project studies the algorithmic and systems foundations of vector retrieval, with emphasis on ranked retrieval, similarity search, and random-access operations that make modern vector databases and RAG pipelines both efficient and reliable.

Why this project matters

The rapid rise of large language models (LLMs) and Retrieval-Augmented Generation (RAG) for information retrieval, question answering, and knowledge-intensive reasoning, together with the growing prevalence of large-scale unstructured data, has driven the emergence of vector database management systems. These systems extend traditional database architectures by efficiently storing, indexing, and querying vector representations such as embeddings, which now sit at the core of many machine learning and AI applications.

At the heart of this landscape are ranked retrieval and similarity search. These problems underpin vector database systems, nearest-neighbor search, and retrieval pipelines in LLM-based applications, where both efficiency and reliability are essential. Classical approximate nearest neighbor algorithms provide strong theoretical guarantees but often fail to scale in practice, while heuristic methods perform remarkably well empirically yet offer little in the way of provable guarantees.

This project focuses on closing that gap. In particular, it studies how to design vector retrieval methods that preserve the rigor of theory while matching the performance expectations of real-world systems. A central challenge is that many ANN and ranked retrieval algorithms are optimized for top-k search when k is small, but modern interactive data systems also require random access to dynamically ordered items over high-dimensional feature spaces.

By connecting data management theory with practical systems building, the VectorDB project aims to support a new generation of vector retrieval engines that are scalable, analyzable, and dependable enough for production use in search, recommendation, multimodal analytics, and LLM-powered reasoning.

What we do

Our research examines the foundations and systems questions needed to make vector retrieval both theoretically principled and practically effective. The project asks:

  • How can we bridge the longstanding gap between theoretically grounded ANN algorithms and the heuristic methods that dominate practical vector search engines?
  • How can ranked retrieval and similarity search be made reliable enough for deployment in vector databases, RAG systems, and other knowledge-intensive AI pipelines?
  • How can we develop efficient algorithms for random-access ranked retrieval, where ordering is defined dynamically over high-dimensional vector spaces rather than by fixed relational attributes?
  • How should indexing, pruning, and access methods be redesigned so that random access and exploration remain efficient even when top-k is not the only retrieval mode that matters?
  • What theoretical guarantees are most useful in practice for vector data systems: approximation quality, stability, latency bounds, memory efficiency, or robustness under distribution shift?
  • How can vector retrieval algorithms better support interactive exploration, hybrid search, and adaptive ranking in systems that combine structured data, embeddings, and learned query operators?
  • How can these ideas inform end-to-end vector database architectures that are not only fast on benchmarks, but also dependable, maintainable, and extensible in real deployments?

Publications

This section will track publications on vector databases, ranked retrieval, similarity search, and random-access algorithms for high-dimensional data systems.

CoRR 2025

HENN: A Hierarchical Epsilon Net Navigation Graph for Approximate Nearest Neighbor Search

Mohsen Dehghankar and Abolfazl Asudeh · CoRR abs/2505.17368, 2025

Illustration for the HENN paper on hierarchical graph indexing for approximate nearest neighbor search

Hierarchical graph-based algorithms such as HNSW have become the dominant practical approach for Approximate Nearest Neighbor search, yet their success has long outpaced their theory. They achieve strong empirical performance, but their reliance on randomized heuristic graph construction leaves open critical questions about query-time guarantees and worst-case recall. At the same time, many theoretically grounded ANN structures remain difficult to implement and rarely match the scale or simplicity demanded by real systems.

This paper introduces HENN, the Hierarchical Epsilon Net Navigation Graph, as a graph-based indexing structure that combines rigorous guarantees with practical efficiency. Built on the theory of ε-nets, HENN guarantees polylogarithmic worst-case query time while preserving high recall and keeping implementation overhead low enough to remain realistic for deployment.
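The navigation pattern shared by HNSW-style indexes and HENN's ε-net hierarchy can be sketched in a few lines. This is an illustrative simplification, not the paper's HENN construction or its guarantees: it shows only the coarse-to-fine greedy descent, where the answer found in one layer seeds the search in the next, denser layer. All function and variable names here are hypothetical.

```python
import numpy as np

def greedy_layer_search(points, neighbors, query, entry):
    """Within one layer, greedily hop to the neighbor closest to `query`
    until no neighbor improves the current distance."""
    current = entry
    improved = True
    while improved:
        improved = False
        best_dist = np.linalg.norm(points[current] - query)
        for cand in neighbors.get(current, []):
            d = np.linalg.norm(points[cand] - query)
            if d < best_dist:
                best_dist, current, improved = d, cand, True
    return current

def hierarchical_search(points, layers, query):
    """Descend coarse-to-fine through layered proximity graphs.
    `layers` is a list of adjacency dicts, coarsest layer first;
    the node reached in each layer becomes the entry point of the next."""
    entry = next(iter(layers[0]))  # any node present in the coarsest layer
    for neighbors in layers:
        entry = greedy_layer_search(points, neighbors, query, entry)
    return entry
```

The point of the hierarchy is that the coarse layers place the search near the right region cheaply, so the fine layers only refine locally; HENN's contribution is choosing the layer contents (via ε-nets) so that this descent provably takes polylogarithmic time rather than relying on randomized construction alone.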

Beyond proposing a new structure, the paper also gives theoretical insight into the empirical success of HNSW by establishing a probabilistic polylogarithmic query-time bound for it. That comparison is important, because prior hierarchical methods may degrade to linear query time under adversarial inputs, whereas HENN maintains provable performance independent of the underlying data distribution.

Extensive experiments show that HENN achieves faster query time while maintaining competitive recall across a range of data distributions, including adversarial settings. The overall result is a robust and scalable ANN solution that closes part of the longstanding gap between principled retrieval theory and high-performance vector search in practice.

Citation: Dehghankar, Mohsen, and Abolfazl Asudeh. 2025. HENN: A Hierarchical Epsilon Net Navigation Graph for Approximate Nearest Neighbor Search. Preprint, CoRR abs/2505.17368.

Under Review 2026

Random-Access Ranked Retrieval and Similarity Search

Mohsen Dehghankar, Abolfazl Asudeh, Raghav Mittal, Suraj Shetiya, and Gautam Das · Under Review, 2026

Illustration for the Random Access paper on ranked retrieval and similarity search

Random access is a fundamental operation behind many efficient search and exploration algorithms, but modern interactive data systems increasingly rely on ranked retrieval and similarity search where orderings are defined dynamically over high-dimensional feature spaces. That shift creates a new challenge: how can systems directly access the tuple at a desired rank when the ranking itself is induced by geometric or embedding-based structure rather than static stored attributes?

This paper formalizes the Random-Access Ranked Retrieval (RAR) problem and extends the framework to similarity search. On the theoretical side, it develops an efficient algorithm based on geometric arrangements that achieves logarithmic query time. However, because that approach incurs exponential space complexity in high dimensions, the work also introduces a second family of algorithms based on ε-sampling that use only linear space.
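To make the RAR operation concrete, here is a naive linear-time baseline, not the paper's arrangement-based or ε-sampling algorithms: given a query point and a rank r, order the tuples by distance to the query and return the one at rank r. A selection algorithm (quickselect, via NumPy's `argpartition`) answers this in expected O(n) time per query; the paper's data structures aim to do better after preprocessing. The function name is hypothetical.

```python
import numpy as np

def rank_access(data, query, r):
    """Return the index of the tuple at rank r (0-based) when tuples are
    ordered by Euclidean distance to `query`. Linear-time baseline via
    selection; assumes distinct distances for a unique answer."""
    dists = np.linalg.norm(data - query, axis=1)  # the dynamic ordering key
    return int(np.argpartition(dists, r)[r])      # expected O(n) quickselect
```

Note that nothing here can be precomputed: the ordering key depends on the query point itself, which is exactly why classical index structures over stored attributes do not apply directly.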

Since exactly locating the tuple at a specific rank is closely tied to the range counting problem and therefore inherently difficult, the paper further defines a relaxed variant called κ-Random-Access Ranked Retrieval. Instead of returning one exact tuple, this formulation returns a small subset of size κ guaranteed to contain the target item. To support this efficiently, the work introduces an intermediate problem, Stripe Range Retrieval (SRR), together with a hierarchical sampling data structure specialized for narrow stripe queries.

The resulting methods achieve practical scalability across both dataset size and dimensionality. The paper proves near-optimal bounds for the proposed algorithms and validates them experimentally on real and synthetic datasets, showing scalability to millions of tuples and hundreds of dimensions while supporting efficient ranked retrieval and similarity search.

Citation: Dehghankar, Mohsen, Abolfazl Asudeh, Raghav Mittal, Suraj Shetiya, and Gautam Das. 2026. Random-Access Ranked Retrieval and Similarity Search. Under review.

Project investigators

Abolfazl Asudeh

Faculty

Mohsen Dehghankar

Lead PhD Student