NSF Project Website

Fairness-aware Data Structures for Approximate Query Processing

NSF IIS-2348919 (2024–2027) · III: Small

This project studies how to build approximate query processing methods and supporting data structures that are not only fast and scalable, but also fairness-aware. The goal is to develop principled foundations and practical techniques so approximate answers remain useful, efficient, and more equitable across groups.

Why this project matters

The abundance of data, coupled with recent advances in computation, has transformed almost every aspect of modern life. At the center of this transformation are data structures, which make it possible to store, index, summarize, and query massive datasets efficiently enough for real-world systems.

As data-driven technologies become increasingly embedded in everyday decision-making, however, their drawbacks and potential harms also become more visible, especially when systems produce biased outcomes. This project revisits classical data structures for approximate query processing through the lens of fairness. It envisions a future in which fairness is a first-class concern of database systems, and where fairness issues in database indices and summaries can be identified, analyzed, and resolved directly at the systems level.

To achieve this goal, the project develops data structures with theoretical guarantees that support group fairness across several core domains, including hashing, membership estimation, aggregate query estimation, and approximate answering of ranking and selection queries. The long-term objective is to build a system of fair data structures for approximate query processing that integrates naturally into existing data management systems, bringing efficiency, accuracy, and fairness together as inseparable design principles.

Publications

Fair Data Structures
SIGMOD 2024

FairHash: A Fair and Memory/Time-efficient Hashmap

Nima Shahbazi, Stavros Sintos, and Abolfazl Asudeh · Proc. ACM Manag. Data 2, 3, Article 136, June 2024, 29 pages

Comparison of traditional hashing, data-informed hashing, and FairHash

Hashmaps are among the most widely used data structures in computing, serving as a basic primitive in data processing, indexing, caching, and large-scale systems. Yet standard hash functions are designed to optimize efficiency, not fairness.

Given a dataset, a traditional random hashmap need not distribute points uniformly across a fixed number of buckets. Even data-informed hashmaps, while able to improve bucket balance, may still fail to preserve equal group representation inside each bucket. As a result, they can introduce systematic disparities that later propagate through downstream analytics and decision-making pipelines.

FairHash addresses this issue directly. It guarantees zero unfairness while simultaneously ensuring a uniform distribution of points across buckets and equal group ratios within each bucket, regardless of how the original data is distributed.

Formally, FairHash satisfies pairwise fairness: for pairs of points drawn from different demographic groups, the collision probability, and therefore the false positive rate, is exactly 1/m, where m is the number of buckets. This gives a precise fairness guarantee at the level of the data structure itself.
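As a quick sanity check of this property, the empirical cross-group collision rate can be measured and compared against the 1/m target. The snippet below is illustrative only; the function name and the toy hash functions are our own, not from the paper:

```python
from itertools import product

def cross_group_collision_rate(hash_fn, group_a, group_b, m):
    """Fraction of cross-group pairs whose points land in the same of m
    buckets; pairwise fairness asks this to be exactly 1/m."""
    pairs = list(product(group_a, group_b))
    collisions = sum(hash_fn(x) % m == hash_fn(y) % m for x, y in pairs)
    return collisions / len(pairs)

# Toy data: two groups interleaved on the integers, m = 2 buckets.
group_a, group_b, m = [0, 2], [1, 3], 2
print(cross_group_collision_rate(lambda x: x, group_a, group_b, m))       # 0.0, far from the 1/m = 0.5 target
print(cross_group_collision_rate(lambda x: x // 2, group_a, group_b, m))  # 0.5, exactly 1/m on this data
```

The first hash sends the two groups to disjoint buckets, so its cross-group collision rate is 0 rather than 1/m; FairHash rules out such disparities by construction.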

To the best of our knowledge, FairHash was the first work to study group fairness in a fundamental data structure. The design combines ideas from computational geometry with recent advances in the elegant yet less widely known problem of necklace splitting.

Beyond its theoretical contribution, FairHash has clear practical relevance across the responsible data science pipeline, wherever hashing is used as a building block for efficient storage, retrieval, or approximate processing.

Citation: Nima Shahbazi, Stavros Sintos, and Abolfazl Asudeh. 2024. FairHash: A Fair and Memory/Time-efficient Hashmap. Proc. ACM Manag. Data 2, 3, Article 136 (June 2024), 29 pages. https://doi.org/10.1145/3654939

SIGMOD 2026

Fair-Count-Min: Frequency Estimation under Equal Group-wise Approximation Factor

Nima Shahbazi, Stavros Sintos, and Abolfazl Asudeh · Proc. ACM Manag. Data 4, 3 (SIGMOD), Article 179 (June 2026), 42 pages

Illustration for the Fair-Count-Min paper

Frequency estimation is one of the core tasks in streaming data systems. When data arrives too quickly or in too large a volume to store explicitly, systems often rely on compact sketches such as Count-Min to answer aggregate queries approximately while using only sublinear space.

But compactness comes with error. In the classical Count-Min sketch, the additive approximation error is not experienced equally across all groups. In practice, this means that less frequent or less represented groups can suffer disproportionately larger relative distortion, creating an important fairness concern in data analysis and monitoring.

Fair-Count-Min revisits this classical sketch from a fairness perspective. The goal is not only to estimate frequencies efficiently, but also to ensure that the expected approximation factor is distributed equally across groups rather than being concentrated on the unpopular ones.

To accomplish this, the paper introduces a column-partitioning strategy together with group-aware semi-uniform hashing. This design eliminates collisions between elements from different groups, which is exactly the mechanism through which unequal error tends to arise in standard sketches.
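The column-partitioning idea can be illustrated with a minimal sketch. This is our own simplified construction, not the paper's exact design (the paper's semi-uniform hashing is more refined): each group receives a disjoint slice of columns in every row, so increments from one group can never land in another group's counters.

```python
import hashlib

class GroupPartitionedCMS:
    """Simplified illustration of column partitioning: a Count-Min table
    whose columns are split by group, so elements of different groups
    never share a counter."""

    def __init__(self, depth, width_per_group, groups):
        self.depth = depth
        self.w = width_per_group
        self.offset = {g: i * width_per_group for i, g in enumerate(groups)}
        total = width_per_group * len(groups)
        self.table = [[0] * total for _ in range(depth)]

    def _col(self, row, item, group):
        # Deterministic per-row hash, mapped into this group's column slice.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return self.offset[group] + int.from_bytes(h, "big") % self.w

    def add(self, item, group, count=1):
        for r in range(self.depth):
            self.table[r][self._col(r, item, group)] += count

    def estimate(self, item, group):
        return min(self.table[r][self._col(r, item, group)]
                   for r in range(self.depth))

cms = GroupPartitionedCMS(depth=3, width_per_group=16, groups=["A", "B"])
cms.add("x", "A", 5)
cms.add("y", "B", 7)
print(cms.estimate("x", "A"), cms.estimate("y", "B"))  # 5 7
```

With disjoint column slices, an element's estimate can only be inflated by collisions within its own group; here each group holds a single item, so both estimates are exact.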

We provide formal guarantees for group fairness, analyze the cost of enforcing it, and evaluate the method extensively on real-world datasets against representative state-of-the-art baselines. The results show that Fair-Count-Min achieves fairness at the cost of only a small additional error, while preserving efficiency comparable to the standard Count-Min sketch.

More broadly, this work demonstrates that even highly optimized streaming primitives can be redesigned so fairness is built into the approximation mechanism itself, rather than handled later as an afterthought.

Citation: Nima Shahbazi, Stavros Sintos, and Abolfazl Asudeh. 2026. Fair-Count-Min: Frequency Estimation under Equal Group-wise Approximation Factor. Proc. ACM Manag. Data 4, 3 (SIGMOD), Article 179 (June 2026), 42 pages. https://doi.org/10.1145/3802056

VLDB 2026

On Fair Epsilon Nets and Geometric Hitting Sets

Mohsen Dehghankar, Stavros Sintos, and Abolfazl Asudeh · Proceedings of the VLDB Endowment, Vol. 19, pp. 1032-1045

Illustration accompanying the fair epsilon nets and geometric hitting sets paper

Many data systems rely on small representative subsets of data to support fast queries and scalable analytics. In computational geometry, these summaries are captured by fundamental tools such as epsilon-nets, epsilon-samples, and geometric hitting sets, which provide strong guarantees for range queries and approximation.

However, these tools are inherently oblivious to fairness. In modern applications, datasets may contain multiple groups defined by sensitive application-specific attributes. When summaries are constructed without considering these groups, they can introduce systematic biases that propagate to downstream tasks, including query answering, visualization, and decision-making.

In this work, led by Mohsen Dehghankar, we revisit these classical geometric constructs through the lens of fairness. The goal is to ensure that summaries are not only accurate but also representative across groups. We introduce fairness-aware variants of epsilon-nets and geometric hitting sets under two natural notions of group fairness:

  • Demographic parity, which preserves the original proportions of groups in the data.
  • Custom ratio fairness, which allows user-specified representation across groups.

After formulating these fairness-aware variants, we design sampling-based algorithms for fair epsilon-nets that achieve near-optimal size guarantees, with only a logarithmic dependence on the number of groups. To further improve the quality of the summaries, we develop a discrepancy-based approach that produces smaller solutions with stronger theoretical guarantees.
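For intuition, demographic-parity sampling can be sketched as stratified sampling at the classic epsilon-net size. The snippet below is a simplified illustration using the textbook O((d/ε)·log(1/ε)) bound with constants omitted, not the paper's algorithm:

```python
import math
import random

def fair_sample(points_by_group, eps, vc_dim, seed=0):
    """Stratified sampling sketch: draw roughly an epsilon-net-sized
    sample while keeping each group's share equal to its share in the
    data (demographic parity)."""
    rng = random.Random(seed)
    n = sum(len(pts) for pts in points_by_group.values())
    # Classic epsilon-net sample size, constants omitted.
    m = math.ceil((vc_dim / eps) * math.log(1 / eps))
    sample = []
    for group, pts in points_by_group.items():
        quota = min(len(pts), max(1, round(m * len(pts) / n)))
        sample.extend(rng.sample(pts, quota))
    return sample

pts = {"a": list(range(100)), "b": list(range(100, 200))}
s = fair_sample(pts, eps=0.1, vc_dim=2)
```

Because quotas are proportional to group sizes, two equally sized groups contribute equally many sample points, whereas a plain uniform sample would preserve proportions only in expectation.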

Building on these results, we show that the fair geometric hitting set problem can be reduced to fair epsilon-nets, leading to approximation bounds with only a mild overhead due to fairness constraints. Our experimental evaluation demonstrates that enforcing fairness comes at a surprisingly small cost, requiring only a modest increase in summary size while effectively eliminating representation bias.

This work connects computational geometry with fairness-aware data management, showing that fairness can be incorporated directly into the core primitives underlying data summarization, rather than treated as a post-processing step. More broadly, it is part of our NSF-supported effort (Award No. 2348919) on fair data structures for approximate query processing, aiming to rethink the foundations of data systems so that efficiency, accuracy, and fairness are treated as first-class and inseparable design principles.

Citation: Mohsen Dehghankar, Stavros Sintos, and Abolfazl Asudeh. On Fair Epsilon Nets and Geometric Hitting Sets. Proceedings of the VLDB Endowment, Vol. 19, pp. 1032-1045.

Beyond Data Structures
KDD 2025

Fair Set Cover

Mohsen Dehghankar, Rahul Raychaudhury, Stavros Sintos, and Abolfazl Asudeh · Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD '25), August 3-7, 2025

Illustration for the Fair Set Cover paper

Set Cover is one of the most classical optimization problems in computer science, with deep theoretical roots and broad practical impact. It models a wide range of real-world tasks, including resource allocation, summarization, and selection problems, many of which directly affect how benefits, opportunities, or services are distributed.

Despite this wide applicability, Set Cover has largely been studied without an explicit notion of fairness. As a result, solutions that are optimal in size may still produce systematically unbalanced outcomes when the chosen sets overrepresent some groups while underrepresenting others.

In this work, led by Mohsen Dehghankar, we address this gap by introducing the Fair Set Cover problem: finding a minimum-size cover that also satisfies demographic parity in its selection of sets. This brings fairness directly into a core combinatorial optimization problem rather than treating it as a downstream correction.

We develop multiple variants of Fair Set Cover, analyze their computational hardness, and design efficient approximation algorithms for solving them. Under reasonable assumptions, our methods guarantee zero unfairness while incurring only a small increase in approximation ratio.
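As a point of reference, the unconstrained baseline is the classic greedy algorithm with its ln n approximation guarantee; the fair variants constrain selections like these with parity requirements. A minimal sketch of the baseline only, not the paper's fair algorithms:

```python
def greedy_set_cover(universe, sets):
    """Classic greedy set cover: repeatedly pick the set covering the
    most still-uncovered elements. Achieves a ln n approximation."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("universe is not coverable by the given sets")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

universe = {1, 2, 3, 4, 5}
sets = [{1, 2, 3}, {4, 5}, {1, 4}]
print(greedy_set_cover(universe, sets))  # [0, 1]
```

Greedy optimizes size alone; when sets carry group labels, the smallest cover it finds may draw all its sets from one group, which is exactly the imbalance the fair variants rule out.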

Our experimental results support the theory, showing that fairness comes at a negligible practical cost, with only minimal impact on solution size and runtime.

Beyond its direct applications, this work also provides algorithmic foundations for related problems such as representative dataset formation, fair regret-ratio minimizing sets, and some variants of fair clustering.

Citation: Mohsen Dehghankar, Rahul Raychaudhury, Stavros Sintos, and Abolfazl Asudeh. Fair Set Cover. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD '25), August 3-7, 2025.

ICDT 2026

Dynamic Necklace Splitting

Rishi Advani, Abolfazl Asudeh, Mohsen Dehghankar, and Stavros Sintos · 29th International Conference on Database Theory (EDBT/ICDT Conference 2026), Leibniz International Proceedings in Informatics (LIPIcs), Volume 365, pp. 19:1-19:20

Illustration for the Dynamic Necklace Splitting paper

Necklace splitting is a classic problem in combinatorics and fair division: given a sequence of colored beads, how can we divide it among multiple agents so that each receives an equal share of every color, using as few cuts as possible?

The story goes back to the early 1980s, when Bhatt and Leiserson introduced the problem in the context of parallel computation. Soon after, it became a beautiful example of the interplay between combinatorics and topology, with subsequent work connecting it to the Borsuk-Ulam theorem. Most recently, Noga Alon and Andrei Graur discovered an efficient approximation algorithm for finding a solution with few cuts [ICALP'21].

But the classical formulation is static: the necklace does not change. Modern data systems, however, are inherently dynamic.

In our recent work, led by Rishi Advani, we study dynamic necklace splitting: what happens when beads can be inserted, deleted, or relocated over time, while still maintaining fairness efficiently without recomputing from scratch?

This extension is motivated by practical applications, including fair hash maps, load balancing across distributed systems, and maintaining representative partitions in data processing pipelines.

Our key findings include:

  • The formalization of necklace splitting in dynamic settings.
  • Linear-time algorithms that maintain optimal fairness guarantees for the two-color case under updates.
  • Efficient algorithms for multi-color settings and batch updates.
  • A randomized approach that achieves approximate fairness with polylogarithmic update time when the number of agents is small.
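For intuition about the static two-color, two-agent case, a discrete intermediate-value argument yields a simple sliding-window solver: as a window of length n/2 slides by one bead, its count of the first color changes by at most one, so it must hit exactly half at some position. This is a sketch of the static problem only; the paper's contribution is maintaining such splits efficiently under updates:

```python
def two_color_split(necklace):
    """Find two cut positions splitting a two-color necklace ('A'/'B')
    between two agents so each gets exactly half of every color.
    Assumes each color appears an even number of times."""
    n = len(necklace)
    half = n // 2
    target = necklace.count("A") // 2
    cnt = necklace[:half].count("A")  # 'A's in window [0, half)
    for i in range(half):
        if cnt == target:
            return (i, i + half)  # agent 1 takes beads [i, i + half)
        # Slide the window one bead to the right.
        cnt -= necklace[i] == "A"
        cnt += necklace[i + half] == "A"
    return (half, n) if cnt == target else None

print(two_color_split("AABB"))  # (1, 3): agent 1 gets beads 1..2, one of each color
```

Since the window has length n/2, hitting the target for one color automatically balances the other, so two cuts always suffice here.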

To the best of our knowledge, this paper was the only submission to ICDT 2026 that received three Accept votes, reflecting the strength of the work led by our PhD student, Rishi Advani.

Despite its elegant origins and growing relevance, necklace splitting remains relatively underexplored.

A natural next direction is to extend the strongest of these guarantees beyond the two-color case.

Citation: Rishi Advani, Abolfazl Asudeh, Mohsen Dehghankar, and Stavros Sintos. Dynamic Necklace Splitting. In 29th International Conference on Database Theory (EDBT/ICDT Conference 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 365, pp. 19:1-19:20.

VLDB 2025

Mining the Minoria

Mohsen Dehghankar and Abolfazl Asudeh · Proceedings of the VLDB Endowment, 18(5): 1453-1465, 2025

Illustration for the Mining the Minoria paper

Suppose you are a responsible business or data owner and want to identify groups in your data that are systematically underrepresented and underperforming in downstream decisions. In practice, this is much harder than it sounds.

First, datasets often do not explicitly include demographic information, so the sensitive groups of interest are not directly visible. Second, even if one suspects that inequities exist, there are many possible ways to partition the data, which creates an unknown-unknown problem: you may not even know which groups to look for.

This paper, led by Mohsen Dehghankar, introduces the minoria mining problem to address exactly this challenge. Rather than assuming the minority groups are already known, the goal is to discover one-dimensional projections in the data cube that reveal potentially underrepresented and underperforming groups.

At a high level, the approach draws on ideas from computational geometry, especially levels of arrangements, to identify highly skewed projections where the tail of the distribution is also underperforming. These projections act as signals that something important and potentially unfair may be happening in the data.
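At its simplest, the idea can be sketched by brute force over individual attributes. This is a naive illustration of the kind of signal being sought, not the paper's algorithm, which uses levels of arrangements to search projections efficiently:

```python
import statistics

def flag_skewed_tails(rows, outcome, attrs, tail_frac=0.1, gap=0.2):
    """Naive scan: for each attribute (the simplest 1-D projection),
    flag it if the lower tail of its distribution also underperforms
    on the outcome by more than `gap`."""
    overall = statistics.mean(r[outcome] for r in rows)
    flagged = []
    for a in attrs:
        ordered = sorted(rows, key=lambda r: r[a])
        k = max(1, int(len(ordered) * tail_frac))
        tail_mean = statistics.mean(r[outcome] for r in ordered[:k])
        if overall - tail_mean > gap:
            flagged.append(a)
    return flagged

# Toy data: the low end of "x" systematically underperforms; "y" does not.
rows = [{"x": i, "y": -i, "outcome": 0 if i == 0 else 1} for i in range(10)]
print(flag_skewed_tails(rows, "outcome", ["x", "y"]))  # ['x']
```

Enumerating attributes one by one scales poorly and misses combined projections; handling the full space of 1-D projections efficiently is what requires the geometric machinery.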

The broader significance of the work is that it moves fairness analysis one step earlier in the pipeline. Instead of only correcting disparities after groups are predefined, it helps surface the groups that may deserve attention in the first place.

This is only one possible approach to minoria mining, and an important goal of the work is to open the door for future research on alternative formulations, algorithms, and system support for discovering hidden minority groups.

Citation: Mohsen Dehghankar and Abolfazl Asudeh. Mining the Minoria: Unknown, Under-represented, and Under-performing Minority Groups. Proceedings of the VLDB Endowment, 18(5): 1453-1465, 2025.

Project investigators

A. Asudeh

Principal Investigator

S. Sintos

Co-Principal Investigator

Nima Shahbazi

PhD Student Researcher

Mohsen Dehghankar

PhD Student Researcher

Rishi Advani

PhD Student Researcher