NSF Project Website

Fairness-aware Data Structures for Approximate Query Processing

NSF IIS-2348919 (2024–2027) · III: Small

This project studies how to build approximate query processing methods and supporting data structures that are not only fast and scalable, but also fairness-aware. The goal is to develop principled foundations and practical techniques so approximate answers remain useful, efficient, and more equitable across groups.

Why this project matters

The abundance of data, coupled with recent advances in computation, has transformed almost every aspect of modern life. At the center of this transformation are data structures, which make it possible to store, index, summarize, and query massive datasets efficiently enough for real-world systems.

As data-driven technologies become increasingly embedded in everyday decision-making, however, their drawbacks and potential harms also become more visible, especially when systems produce biased outcomes. This project revisits classical data structures for approximate query processing through the lens of fairness. It envisions a future in which fairness is a first-class concern of database systems, and where fairness issues in database indices and summaries can be identified, analyzed, and resolved directly at the systems level.

To achieve this goal, the project develops data structures with theoretical guarantees that support group fairness across several core domains, including hashing, membership estimation, aggregate query estimation, and approximate answering of ranking and selection queries. The long-term objective is to build a system of fair data structures for approximate query processing that integrates naturally into existing data management systems, bringing efficiency, accuracy, and fairness together as inseparable design principles.

Publications

Fair Data Structures
SIGMOD 2024

FairHash: A Fair and Memory/Time-efficient Hashmap

Nima Shahbazi, Stavros Sintos, and Abolfazl Asudeh · Proc. ACM Manag. Data 2, 3, Article 136, June 2024, 29 pages

Comparison of traditional hashing, data-informed hashing, and FairHash

Hashmaps are among the most widely used data structures in computing, serving as a basic primitive in data processing, indexing, caching, and large-scale systems. Yet standard hash functions are designed to optimize efficiency, not fairness.

Given a dataset, a traditional random hashmap need not distribute points uniformly across a fixed number of buckets. Even data-informed hashmaps, while able to improve bucket balance, may still fail to preserve equal group representation inside each bucket. As a result, they can introduce systematic disparities that later propagate through downstream analytics and decision-making pipelines.

FairHash addresses this issue directly. It guarantees zero unfairness while simultaneously ensuring a uniform distribution of points across buckets and equal group ratios within each bucket, regardless of how the original data is distributed.

Formally, FairHash satisfies pairwise fairness: for pairs of points drawn from different demographic groups, the collision probability, and therefore the false positive rate, is exactly 1/m, where m is the number of buckets. This gives a precise fairness guarantee at the level of the data structure itself.
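As a quick sanity check of this property, the empirical cross-group collision rate can be measured and compared against the 1/m target. The snippet below is illustrative only; the function name and the toy hash functions are our own, not from the paper:

```python
from itertools import product

def cross_group_collision_rate(hash_fn, group_a, group_b, m):
    """Fraction of cross-group pairs whose points land in the same of m
    buckets; pairwise fairness asks this to be exactly 1/m."""
    pairs = list(product(group_a, group_b))
    collisions = sum(hash_fn(x) % m == hash_fn(y) % m for x, y in pairs)
    return collisions / len(pairs)

# Toy data: two groups interleaved on the integers, m = 2 buckets.
group_a, group_b, m = [0, 2], [1, 3], 2
print(cross_group_collision_rate(lambda x: x, group_a, group_b, m))       # 0.0, far from the 1/m = 0.5 target
print(cross_group_collision_rate(lambda x: x // 2, group_a, group_b, m))  # 0.5, exactly 1/m on this data
```

The first hash sends the two groups to disjoint buckets, so its cross-group collision rate is 0 rather than 1/m; FairHash rules out such disparities by construction.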

To the best of our knowledge, FairHash was the first work to study group fairness in a fundamental data structure. The design combines ideas from computational geometry with recent advances in the elegant yet less widely known problem of necklace splitting.

Beyond its theoretical contribution, FairHash has clear practical relevance across the responsible data science pipeline, wherever hashing is used as a building block for efficient storage, retrieval, or approximate processing.

Citation: Nima Shahbazi, Stavros Sintos, and Abolfazl Asudeh. 2024. FairHash: A Fair and Memory/Time-efficient Hashmap. Proc. ACM Manag. Data 2, 3, Article 136 (June 2024), 29 pages. https://doi.org/10.1145/3654939

SIGMOD 2026

Fair-Count-Min: Frequency Estimation under Equal Group-wise Approximation Factor

Nima Shahbazi, Stavros Sintos, and Abolfazl Asudeh · Proc. ACM Manag. Data 4, 3 (SIGMOD), Article 179 (June 2026), 42 pages

Illustration for the Fair-Count-Min paper

Frequency estimation is one of the core tasks in streaming data systems. When data arrives too quickly or in too large a volume to store explicitly, systems often rely on compact sketches such as Count-Min to answer aggregate queries approximately while using only sublinear space.

But compactness comes with error. In the classical Count-Min sketch, the additive approximation error is not experienced equally across all groups. In practice, this means that less frequent or less represented groups can suffer disproportionately larger relative distortion, creating an important fairness concern in data analysis and monitoring.

Fair-Count-Min revisits this classical sketch from a fairness perspective. The goal is not only to estimate frequencies efficiently, but also to ensure that the expected approximation factor is distributed equally across groups rather than being concentrated on the unpopular ones.

To accomplish this, the paper introduces a column-partitioning strategy together with group-aware semi-uniform hashing. This design eliminates collisions between elements from different groups, which is exactly the mechanism through which unequal error tends to arise in standard sketches.
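The column-partitioning idea can be illustrated with a minimal sketch. This is our own simplified construction, not the paper's exact design (the paper's semi-uniform hashing is more refined): each group receives a disjoint slice of columns in every row, so increments from one group can never land in another group's counters.

```python
import hashlib

class GroupPartitionedCMS:
    """Simplified illustration of column partitioning: a Count-Min table
    whose columns are split by group, so elements of different groups
    never share a counter."""

    def __init__(self, depth, width_per_group, groups):
        self.depth = depth
        self.w = width_per_group
        self.offset = {g: i * width_per_group for i, g in enumerate(groups)}
        total = width_per_group * len(groups)
        self.table = [[0] * total for _ in range(depth)]

    def _col(self, row, item, group):
        # Deterministic per-row hash, mapped into this group's column slice.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return self.offset[group] + int.from_bytes(h, "big") % self.w

    def add(self, item, group, count=1):
        for r in range(self.depth):
            self.table[r][self._col(r, item, group)] += count

    def estimate(self, item, group):
        return min(self.table[r][self._col(r, item, group)]
                   for r in range(self.depth))

cms = GroupPartitionedCMS(depth=3, width_per_group=16, groups=["A", "B"])
cms.add("x", "A", 5)
cms.add("y", "B", 7)
print(cms.estimate("x", "A"), cms.estimate("y", "B"))  # 5 7
```

With disjoint column slices, an element's estimate can only be inflated by collisions within its own group; here each group holds a single item, so both estimates are exact.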

We provide formal guarantees for group fairness, analyze the cost of enforcing it, and evaluate the method extensively on real-world datasets against representative state-of-the-art baselines. The results show that Fair-Count-Min achieves fairness at the cost of only a small additional error, while preserving efficiency comparable to the standard Count-Min sketch.

More broadly, this work demonstrates that even highly optimized streaming primitives can be redesigned so fairness is built into the approximation mechanism itself, rather than handled later as an afterthought.

Citation: Nima Shahbazi, Stavros Sintos, and Abolfazl Asudeh. 2026. Fair-Count-Min: Frequency Estimation under Equal Group-wise Approximation Factor. Proc. ACM Manag. Data 4, 3 (SIGMOD), Article 179 (June 2026), 42 pages. https://doi.org/10.1145/3802056

VLDB 2026

On Fair Epsilon Nets and Geometric Hitting Sets

Mohsen Dehghankar, Stavros Sintos, and Abolfazl Asudeh · Proceedings of the VLDB Endowment, Vol. 19, pp. 1032-1045

Illustration accompanying the fair epsilon nets and geometric hitting sets paper

Many data systems rely on small representative subsets of data to support fast queries and scalable analytics. In computational geometry, these summaries are captured by fundamental tools such as epsilon-nets, epsilon-samples, and geometric hitting sets, which provide strong guarantees for range queries and approximation.

However, these tools are inherently oblivious to fairness. In modern applications, datasets may contain multiple groups defined by sensitive application-specific attributes. When summaries are constructed without considering these groups, they can introduce systematic biases that propagate to downstream tasks, including query answering, visualization, and decision-making.

In this work, led by Mohsen Dehghankar, we revisit these classical geometric constructs through the lens of fairness. The goal is to ensure that summaries are not only accurate but also representative across groups. We introduce fairness-aware variants of epsilon-nets and geometric hitting sets under two natural notions of group fairness:

  • Demographic parity, which preserves the original proportions of groups in the data.
  • Custom ratio fairness, which allows user-specified representation across groups.

After formulating these fairness-aware variants, we design sampling-based algorithms for fair epsilon-nets that achieve near-optimal size guarantees, with only a logarithmic dependence on the number of groups. To further improve the quality of the summaries, we develop a discrepancy-based approach that produces smaller solutions with stronger theoretical guarantees.
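For intuition, demographic-parity sampling can be sketched as stratified sampling at the classic epsilon-net size. The snippet below is a simplified illustration using the textbook O((d/ε)·log(1/ε)) bound with constants omitted, not the paper's algorithm:

```python
import math
import random

def fair_sample(points_by_group, eps, vc_dim, seed=0):
    """Stratified sampling sketch: draw roughly an epsilon-net-sized
    sample while keeping each group's share equal to its share in the
    data (demographic parity)."""
    rng = random.Random(seed)
    n = sum(len(pts) for pts in points_by_group.values())
    # Classic epsilon-net sample size, constants omitted.
    m = math.ceil((vc_dim / eps) * math.log(1 / eps))
    sample = []
    for group, pts in points_by_group.items():
        quota = min(len(pts), max(1, round(m * len(pts) / n)))
        sample.extend(rng.sample(pts, quota))
    return sample

pts = {"a": list(range(100)), "b": list(range(100, 200))}
s = fair_sample(pts, eps=0.1, vc_dim=2)
```

Because quotas are proportional to group sizes, two equally sized groups contribute equally many sample points, whereas a plain uniform sample would preserve proportions only in expectation.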

Building on these results, we show that the fair geometric hitting set problem can be reduced to fair epsilon-nets, leading to approximation bounds with only a mild overhead due to fairness constraints. Our experimental evaluation demonstrates that enforcing fairness comes at a surprisingly small cost, requiring only a modest increase in summary size while effectively eliminating representation bias.

This work connects computational geometry with fairness-aware data management, showing that fairness can be incorporated directly into the core primitives underlying data summarization, rather than treated as a post-processing step. More broadly, it is part of our NSF-supported effort (Award No. 2348919) on fair data structures for approximate query processing, aiming to rethink the foundations of data systems so that efficiency, accuracy, and fairness are treated as first-class and inseparable design principles.

Citation: Mohsen Dehghankar, Stavros Sintos, and Abolfazl Asudeh. On Fair Epsilon Nets and Geometric Hitting Sets. Proceedings of the VLDB Endowment, Vol. 19, pp. 1032-1045.

Beyond Data Structures
KDD 2025

Fair Set Cover

Mohsen Dehghankar, Rahul Raychaudhury, Stavros Sintos, and Abolfazl Asudeh · Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD '25), August 3-7, 2025

Illustration for the Fair Set Cover paper

Set Cover is one of the most classical optimization problems in computer science, with deep theoretical roots and broad practical impact. It models a wide range of real-world tasks, including resource allocation, summarization, and selection problems, many of which directly affect how benefits, opportunities, or services are distributed.

Despite this wide applicability, Set Cover has largely been studied without an explicit notion of fairness. As a result, solutions that are optimal in size may still produce systematically unbalanced outcomes when the chosen sets overrepresent some groups while underrepresenting others.

In this work, led by Mohsen Dehghankar, we address this gap by introducing the Fair Set Cover problem: finding a minimum-size cover that also satisfies demographic parity in its selection of sets. This brings fairness directly into a core combinatorial optimization problem rather than treating it as a downstream correction.

We develop multiple variants of Fair Set Cover, analyze their computational hardness, and design efficient approximation algorithms for solving them. Under reasonable assumptions, our methods guarantee zero unfairness while incurring only a small increase in approximation ratio.
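As a point of reference, the unconstrained baseline is the classic greedy algorithm with its ln n approximation guarantee; the fair variants constrain selections like these with parity requirements. A minimal sketch of the baseline only, not the paper's fair algorithms:

```python
def greedy_set_cover(universe, sets):
    """Classic greedy set cover: repeatedly pick the set covering the
    most still-uncovered elements. Achieves a ln n approximation."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("universe is not coverable by the given sets")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

universe = {1, 2, 3, 4, 5}
sets = [{1, 2, 3}, {4, 5}, {1, 4}]
print(greedy_set_cover(universe, sets))  # [0, 1]
```

Greedy optimizes size alone; when sets carry group labels, the smallest cover it finds may draw all its sets from one group, which is exactly the imbalance the fair variants rule out.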

Our experimental results support the theory, showing that fairness comes at a negligible practical cost, with only minimal impact on solution size and runtime.

Beyond its direct applications, this work also provides algorithmic foundations for related problems such as representative dataset formation, fair regret-ratio minimizing sets, and some variants of fair clustering.

Citation: Mohsen Dehghankar, Rahul Raychaudhury, Stavros Sintos, and Abolfazl Asudeh. Fair Set Cover. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD '25), August 3-7, 2025.

ICDT 2026

Dynamic Necklace Splitting

Rishi Advani, Abolfazl Asudeh, Mohsen Dehghankar, and Stavros Sintos · 29th International Conference on Database Theory (EDBT/ICDT Conference 2026), Leibniz International Proceedings in Informatics (LIPIcs), Volume 365, pp. 19:1-19:20

Illustration for the Dynamic Necklace Splitting paper

Necklace splitting is a classic problem in combinatorics and fair division: given a sequence of colored beads, how can we divide it among multiple agents so that each receives an equal share of every color, using as few cuts as possible?

The story goes back to the early 1980s, when Bhatt and Leiserson introduced the problem in the context of parallel computation. Soon after, it became a beautiful example of the interplay between combinatorics and topology, with subsequent work connecting it to the Borsuk-Ulam theorem. Most recently, Noga Alon and Andrei Graur discovered an efficient approximation algorithm for finding a solution with few cuts [ICALP'21].

But the classical formulation is static: the necklace does not change. Modern data systems, however, are inherently dynamic.

In our recent work, led by Rishi Advani, we study dynamic necklace splitting: what happens when beads can be inserted, deleted, or relocated over time, while still maintaining fairness efficiently without recomputing from scratch?

This extension is motivated by practical applications, including fair hash maps, load balancing across distributed systems, and maintaining representative partitions in data processing pipelines.

Our key findings include:

  • The formalization of necklace splitting in dynamic settings.
  • Linear-time algorithms that maintain optimal fairness guarantees for the two-color case under updates.
  • Efficient algorithms for multi-color settings and batch updates.
  • A randomized approach that achieves approximate fairness with polylogarithmic update time when the number of agents is small.
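For intuition about the static two-color, two-agent case, a discrete intermediate-value argument yields a simple sliding-window solver: as a window of length n/2 slides by one bead, its count of the first color changes by at most one, so it must hit exactly half at some position. This is a sketch of the static problem only; the paper's contribution is maintaining such splits efficiently under updates:

```python
def two_color_split(necklace):
    """Find two cut positions splitting a two-color necklace ('A'/'B')
    between two agents so each gets exactly half of every color.
    Assumes each color appears an even number of times."""
    n = len(necklace)
    half = n // 2
    target = necklace.count("A") // 2
    cnt = necklace[:half].count("A")  # 'A's in window [0, half)
    for i in range(half):
        if cnt == target:
            return (i, i + half)  # agent 1 takes beads [i, i + half)
        # Slide the window one bead to the right.
        cnt -= necklace[i] == "A"
        cnt += necklace[i + half] == "A"
    return (half, n) if cnt == target else None

print(two_color_split("AABB"))  # (1, 3): agent 1 gets beads 1..2, one of each color
```

Since the window has length n/2, hitting the target for one color automatically balances the other, so two cuts always suffice here.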

To the best of our knowledge, this paper was the only submission to ICDT 2026 that received three Accept votes, reflecting the strength of the work led by our PhD student, Rishi Advani.

Despite its elegant origins and growing relevance, necklace splitting remains relatively underexplored.

A natural next direction is to extend the strongest of these guarantees beyond the two-color case.

Citation: Rishi Advani, Abolfazl Asudeh, Mohsen Dehghankar, and Stavros Sintos. Dynamic Necklace Splitting. In 29th International Conference on Database Theory (EDBT/ICDT Conference 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 365, pp. 19:1-19:20.

VLDB 2025

Mining the Minoria

Mohsen Dehghankar and Abolfazl Asudeh · Proceedings of the VLDB Endowment, 18(5): 1453-1465, 2025

Illustration for the Mining the Minoria paper

Suppose you are a responsible business or data owner and want to identify groups in your data that are systematically underrepresented and underperforming in downstream decisions. In practice, this is much harder than it sounds.

First, datasets often do not explicitly include demographic information, so the sensitive groups of interest are not directly visible. Second, even if one suspects that inequities exist, there are many possible ways to partition the data, which creates an unknown-unknown problem: you may not even know which groups to look for.

This paper, led by Mohsen Dehghankar, introduces the minoria mining problem to address exactly this challenge. Rather than assuming the minority groups are already known, the goal is to discover one-dimensional projections in the data cube that reveal potentially underrepresented and underperforming groups.

At a high level, the approach draws on ideas from computational geometry, especially levels of arrangements, to identify highly skewed projections where the tail of the distribution is also underperforming. These projections act as signals that something important and potentially unfair may be happening in the data.
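At its simplest, the idea can be sketched by brute force over individual attributes. This is a naive illustration of the kind of signal being sought, not the paper's algorithm, which uses levels of arrangements to search projections efficiently:

```python
import statistics

def flag_skewed_tails(rows, outcome, attrs, tail_frac=0.1, gap=0.2):
    """Naive scan: for each attribute (the simplest 1-D projection),
    flag it if the lower tail of its distribution also underperforms
    on the outcome by more than `gap`."""
    overall = statistics.mean(r[outcome] for r in rows)
    flagged = []
    for a in attrs:
        ordered = sorted(rows, key=lambda r: r[a])
        k = max(1, int(len(ordered) * tail_frac))
        tail_mean = statistics.mean(r[outcome] for r in ordered[:k])
        if overall - tail_mean > gap:
            flagged.append(a)
    return flagged

# Toy data: the low end of "x" systematically underperforms; "y" does not.
rows = [{"x": i, "y": -i, "outcome": 0 if i == 0 else 1} for i in range(10)]
print(flag_skewed_tails(rows, "outcome", ["x", "y"]))  # ['x']
```

Enumerating attributes one by one scales poorly and misses combined projections; handling the full space of 1-D projections efficiently is what requires the geometric machinery.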

The broader significance of the work is that it moves fairness analysis one step earlier in the pipeline. Instead of only correcting disparities after groups are predefined, it helps surface the groups that may deserve attention in the first place.

This is only one possible approach to minoria mining, and an important goal of the work is to open the door for future research on alternative formulations, algorithms, and system support for discovering hidden minority groups.

Citation: Mohsen Dehghankar and Abolfazl Asudeh. Mining the Minoria: Unknown, Under-represented, and Under-performing Minority Groups. Proceedings of the VLDB Endowment, 18(5): 1453-1465, 2025.

Project investigators

A. Asudeh

Principal Investigator

S. Sintos

Co-Principal Investigator

Nima Shahbazi

PhD Student Researcher

Mohsen Dehghankar

PhD Student Researcher

Rishi Advani

PhD Student Researcher