Project Website

Generative AI for Data Management

Generative models as first-class operators in future data systems

This project studies how generative AI can move data management from rigid pipelines toward adaptive, model-centric systems that curate, retrieve, and improve data more responsibly and effectively.

Why this project matters

Recent advances in Generative AI are fundamentally reshaping the landscape of data-driven systems. Beyond their success in language and vision tasks, generative models now offer a powerful abstraction for understanding, synthesizing, and reasoning over data.

This project investigates how Generative AI can serve as a core enabler for the next generation of data management systems: systems that are more capable, more responsible, more efficient, and more adaptive than traditional pipeline-driven approaches.

At a high level, the project explores a shift toward model-centric data management, where generative models actively participate in data processing, augmentation, and retrieval. By leveraging GenAI's ability to generate high-quality synthetic data and interpret complex multimodal inputs, the project seeks new solutions to long-standing challenges in the field.

Across its directions, the work bridges algorithmic foundations, including robustness, approximation, and ranking, with scalable implementations suitable for real-world data platforms. The long-term goal is a unified framework where generative models help curate, query, and improve data rather than merely consume it.

What we do

Our research asks a broad set of questions about how generative AI can help address data management challenges. These questions include:

  • How can generative models become reliable first-class operators in data management systems rather than isolated add-ons for downstream analytics?
  • How can Generative AI improve fairness, robustness, and coverage in datasets through principled data augmentation, especially for under-served or under-represented groups?
  • How can synthetic data generation be made both useful and responsible, with stronger guarantees about quality, diversity, and downstream impact on learning systems?
  • How can multimodal databases support complex natural-language queries that require semantic understanding, cross-modal alignment, and compositional reasoning over text, images, and other modalities?
  • How should generative reasoning interact with retrieval, ranking, and approximation techniques so that query answering is both expressive and scalable in real-world systems?
  • How can model-centric data systems remain efficient and adaptive as data distributions, user needs, and modalities evolve over time?
  • How can theoretical foundations such as robustness, approximation, and ranking inform scalable implementations of generative data systems in practice?

Software

Beyond papers, this project will produce software artifacts for model-centric data management, including tools for responsible data augmentation, multimodal retrieval, and generative querying.


Needle

A deployment-ready open-source image retrieval database for complex natural-language queries.

Needle demo showing natural-language image retrieval

Needle馃馃攳 is a deployment-ready open-source image retrieval database with high accuracy that can handle complex queries in natural language. It is designed to be fast, efficient, and precise, outperforming state-of-the-art methods while remaining practical for real use.

Born from academic research, Needle is built to be accessible beyond a narrow research audience while still delivering strong performance. It enables researchers, developers, and enthusiasts to explore image datasets through richer query interfaces than standard keyword or basic embedding search.

Within this project, Needle represents the systems side of generative-AI-powered data management: a concrete platform for advanced retrieval over visual data, where natural-language interaction, semantic understanding, and practical deployment all matter.

Publications

This section will track publications on responsible generative data management, multimodal retrieval, and model-centric data systems.

arXiv 2024

Needle: A Generative-AI Powered Monte Carlo Method for Answering Complex Natural Language Queries on Multi-modal Data

Mahdi Erfanian, Mohsen Dehghankar, and Abolfazl Asudeh · arXiv:2412.00639, 2024

Illustration for the NEEDLE paper on answering complex natural-language queries over multimodal data

Multi-modal datasets such as image collections are full of rich information, but that richness is often locked away. The raw items may be visually expressive, yet the textual descriptions that would let a system understand them are usually sparse, incomplete, or entirely missing. As soon as a user asks a complex natural-language query, something that depends on nuanced semantics rather than a short keyword label, standard retrieval pipelines begin to fail.

The core difficulty is deeper than missing metadata. In traditional nearest-neighbor search, both the data and the query live in the same metric space, so comparison is direct. Here, however, the query is expressed in language while the tuples are multimodal objects, meaning the two sides start in fundamentally different spaces. Prior work tries to bridge that gap through jointly trained text-image representations, but those approaches often struggle once the query becomes compositionally rich or semantically demanding.

This paper takes a different path. Instead of forcing the query into a single brittle vector representation, it introduces a generative-AI-powered Monte Carlo method that uses foundation models to create synthetic samples capturing the intent and complexity of the natural-language request. Those generated samples are then represented in the same metric space as the multi-modal data, making retrieval possible in a way that better respects the structure of the original query.
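The sampling idea can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not NEEDLE's implementation: `generate_guide_embeddings` is a hypothetical stand-in for a text-to-image generator followed by an image encoder (here it returns deterministic pseudo-random vectors so the scoring logic is runnable in isolation), and each database image is scored by its mean similarity to the synthetic samples, a Monte Carlo average over the query's generated representatives.

```python
import numpy as np

def generate_guide_embeddings(query: str, n_samples: int, dim: int = 8) -> np.ndarray:
    """Stub for 'generate synthetic images for the query, then embed them'.
    A real system would call foundation models here; we return seeded
    random vectors so the retrieval logic below runs on its own."""
    seed = sum(ord(c) for c in query) % (2**32)
    return np.random.default_rng(seed).normal(size=(n_samples, dim))

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def monte_carlo_retrieve(query: str, image_embs: np.ndarray,
                         n_samples: int = 16, k: int = 3) -> list:
    """Score each database image by its mean similarity to the synthetic
    guide samples (a Monte Carlo estimate), then return the top-k indices."""
    guides = cosine(
        generate_guide_embeddings(query, n_samples, image_embs.shape[1]),
        image_embs,
    )
    scores = guides.mean(axis=0)  # average over the sampled representatives
    return list(np.argsort(-scores)[:k])

# Toy database of 100 image embeddings.
db = np.random.default_rng(0).normal(size=(100, 8))
top = monte_carlo_retrieve("a red bicycle leaning on a snowy fence", db)
print(top)
```

Because both the guide samples and the database items live in the same image-embedding space, the comparison is a plain nearest-neighbor computation rather than a fragile cross-modal match.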

Building on that idea, the paper presents NEEDLE, a database system for image retrieval that is driven by synthetic data generation rather than contrastive learning or metadata search. The result is not just a research idea, but an open-source, deployment-ready system designed for practical adoption by researchers and developers. Across benchmark datasets, NEEDLE significantly outperforms state-of-the-art text-to-image retrieval methods, while remaining flexible enough to benefit from future progress in foundation models and embedding technologies.

Citation: Mahdi Erfanian, Mohsen Dehghankar, and Abolfazl Asudeh. 2024. Needle: A Generative-AI Powered Monte Carlo Method for Answering Complex Natural Language Queries on Multi-modal Data. arXiv:2412.00639.

VLDB 2024

Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities

Mahdi Erfanian, H. V. Jagadish, and Abolfazl Asudeh · Proceedings of the VLDB Endowment, 17(11): 3470-3483, 2024

Illustration for the Chameleon paper on fairness-aware multimodal data augmentation

Machine learning systems are increasingly scrutinized for being unfair to minorities and other marginalized groups, and one recurring reason is simple but serious: those groups are often under-represented in the training data. In the ideal case, the dataset could be repaired by collecting more real examples from external sources. In practice, that option is often unavailable, too expensive, or too slow.

That gap creates a natural question for modern generative AI: if real data cannot be collected, can high-quality synthetic data be used to repair the dataset instead? This paper explores that question through Chameleon, a system that uses foundation models to generate synthetic multimodal tuples for fairness-aware data augmentation.

The appeal of this idea is immediate, because the dataset itself is rarely the final objective. It is usually an input to something else, such as training an ML model. If a small amount of carefully generated synthetic data can reduce under-representation and improve downstream fairness, then augmentation may be a practical substitute when real-world repair is impossible.

But making that work requires more than just generating extra examples. The system must identify the minimal set of synthetic tuples needed to address representation bias, preserve semantic integrity so the generated tuples fit the context and distribution of the original dataset, maintain visual and semantic quality that looks realistic to human evaluators, and remain cost-effective despite the monetary cost of querying foundation models. To address these challenges, the work introduces several technical ideas, including rejection tests and multi-armed bandit methods.
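To make the rejection-test and bandit ingredients concrete, here is a generic sketch, not Chameleon's algorithm: each "arm" is a hypothetical generation strategy (say, a different prompt or foundation model) with an unknown acceptance rate, `rejection_test` stands in for the paper's quality and semantic-integrity checks, and an epsilon-greedy bandit steers the generation budget toward whichever strategy yields the most accepted tuples. The acceptance rates below are made up for illustration.

```python
import random

random.seed(42)

# Hypothetical acceptance rates for three candidate generation strategies;
# in practice these are unknown and must be learned from feedback.
ARM_ACCEPT_RATE = [0.2, 0.6, 0.4]

def rejection_test(arm: int) -> bool:
    """Stub for quality checks on a generated tuple: accept it only if it
    passes validation (simulated here by the arm's acceptance rate)."""
    return random.random() < ARM_ACCEPT_RATE[arm]

def epsilon_greedy(n_rounds: int = 300, eps: float = 0.1):
    """Spend the generation budget on the strategy with the best observed
    acceptance rate, exploring other strategies with probability eps."""
    n_arms = len(ARM_ACCEPT_RATE)
    pulls = [0] * n_arms   # how often each strategy was tried
    wins = [0] * n_arms    # how many generated tuples were accepted
    accepted = 0
    for _ in range(n_rounds):
        if 0 in pulls:                      # try every strategy once first
            arm = pulls.index(0)
        elif random.random() < eps:         # explore
            arm = random.randrange(n_arms)
        else:                               # exploit best empirical rate
            arm = max(range(n_arms), key=lambda a: wins[a] / pulls[a])
        pulls[arm] += 1
        if rejection_test(arm):
            wins[arm] += 1
            accepted += 1
    return pulls, accepted

pulls, accepted = epsilon_greedy()
print(pulls, accepted)
```

The design choice mirrors the cost concern above: every call to a foundation model costs money, so the bandit's job is to waste as few calls as possible on strategies whose outputs keep getting rejected.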

Comprehensive experiments show that augmentation with synthetic tuples generated by advanced foundation models can come remarkably close to the effect of adding real tuples when the goal is to reduce downstream unfairness. In that sense, Chameleon demonstrates that generative AI can play a concrete role in responsible data management, not merely by analyzing biased datasets, but by actively helping repair them.

Citation: Mahdi Erfanian, H. V. Jagadish, and Abolfazl Asudeh. 2024. Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities. Proceedings of the VLDB Endowment, 17(11): 3470-3483.

CoRR 2026 (Demo)

NeedleDB: A Generative-AI Based System for Accurate and Efficient Image Retrieval using Complex Natural Language Queries

Mahdi Erfanian and Abolfazl Asudeh · CoRR abs/2603.27464, 2026

Illustration for the NeedleDB demo paper on accurate and efficient image retrieval

Retrieving images with natural language sounds straightforward until the query becomes genuinely descriptive. Existing systems often depend on contrastive-learning embeddings such as CLIP, which work reasonably well for short or simple prompts but begin to degrade when users ask for something nuanced, compositional, or visually specific. That gap is exactly where a practical retrieval system begins to matter.

NeedleDB is presented as a deployment-ready database system built to answer those harder queries over image collections. Its central idea is to use generative AI not just as an auxiliary tool, but as part of the retrieval mechanism itself. Rather than forcing a difficult text query into a single embedding and hoping it aligns well with the image space, the system synthesizes guide images that visually represent the query, effectively turning a fragile text-to-image search problem into a more tractable image-to-image retrieval task.

The system then strengthens retrieval quality by aggregating nearest-neighbor results across multiple vision embedders through a weighted rank-fusion strategy grounded in a Monte Carlo estimator with provable error bounds. That algorithmic core is paired with a complete systems stack: a command-line interface through needlectl, a browser-based web interface, and a modular microservice architecture backed by PostgreSQL and Milvus.
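Weighted rank fusion can be illustrated with a short, generic sketch. The snippet below uses weighted reciprocal-rank fusion, a standard technique, as a stand-in; it is not NeedleDB's estimator, and the embedder names and weights are hypothetical. Each embedder contributes a ranked list, and an item's fused score is a weighted sum of reciprocal-rank terms.

```python
from collections import defaultdict

def weighted_rank_fusion(rankings: dict, weights: dict, k: int = 60) -> list:
    """Fuse ranked lists with weighted reciprocal-rank fusion (generic RRF).
    rankings: embedder name -> ordered list of item ids (best first).
    weights:  embedder name -> trust weight for that embedder.
    Each item's score is sum over embedders of w / (k + rank)."""
    scores = defaultdict(float)
    for name, ranking in rankings.items():
        w = weights.get(name, 1.0)
        for rank, item in enumerate(ranking, start=1):
            scores[item] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two hypothetical vision embedders disagree on the ordering;
# the fused list balances both, weighted by trust in each embedder.
rankings = {
    "embedder_a": ["img_3", "img_7", "img_1"],
    "embedder_b": ["img_7", "img_3", "img_9"],
}
fused = weighted_rank_fusion(rankings, {"embedder_a": 1.0, "embedder_b": 0.5})
print(fused)  # img_3 and img_7 lead; items seen by only one embedder trail
```

Fusing at the rank level rather than the raw-score level sidesteps the fact that different embedders produce similarity scores on incompatible scales.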

The result is a system that is not only accurate, but operational. On challenging benchmarks, NeedleDB improves Mean Average Precision by up to 93% over the strongest baseline while maintaining sub-second query latency. In demonstration settings, users can interact with the system through realistic scenarios that highlight retrieval quality, ingestion workflows, and the configurability of the end-to-end pipeline.

Citation: Mahdi Erfanian and Abolfazl Asudeh. 2026. NeedleDB: A Generative-AI Based System for Accurate and Efficient Image Retrieval using Complex Natural Language Queries. CoRR abs/2603.27464.

Project investigators

A. Asudeh

Faculty

Mahdi Erfanian

Lead PhD Student

Mohsen Dehghankar

PhD Student

H. V. Jagadish

Collaborator · Distinguished University Professor · University of Michigan