Machine learning systems are increasingly scrutinized for being unfair to minorities and other marginalized groups, and one recurring reason is simple but serious: those groups are often under-represented in the training data. In the ideal case, the dataset could be repaired by collecting more real examples from external sources. In practice, that option is often unavailable, too expensive, or too slow.
That gap creates a natural question for modern generative AI: if real data cannot be collected, can high-quality synthetic data be used to repair the dataset instead? This paper explores that question through Chameleon, a system that uses foundation models to generate synthetic multimodal tuples for fairness-aware data augmentation.
The appeal of this idea is immediate: the dataset itself is rarely the final objective. It is usually an input to something else, such as training an ML model. If a small amount of carefully generated synthetic data can reduce under-representation and improve downstream fairness, then augmentation becomes a practical substitute when real-world repair is impossible.
But making that work requires more than just generating extra examples. The system must (1) identify the minimal set of synthetic tuples needed to address representation bias, (2) preserve semantic integrity so the generated tuples fit the context and distribution of the original dataset, (3) maintain visual and semantic quality that looks realistic to human evaluators, and (4) remain cost-effective despite the monetary cost of querying foundation models. To address these challenges, the work introduces several technical ideas, including rejection tests and multi-armed bandit methods.
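The paper's actual algorithms are more involved than this post can cover, but the three ingredients named above can be sketched in simplified form. Everything below is illustrative: the coverage threshold `tau`, the scalar `quality_fn`, and the epsilon-greedy policy are assumptions standing in for Chameleon's actual coverage measure, rejection tests, and bandit method.

```python
import random

def tuples_needed(group_counts, tau):
    # Minimum number of synthetic tuples per group to lift each
    # under-covered group's count up to the coverage threshold tau.
    return {g: tau - c for g, c in group_counts.items() if c < tau}

def rejection_test(candidate, quality_fn, threshold):
    # Accept a generated tuple only if its quality score clears the
    # threshold; rejected tuples are discarded but still cost a query,
    # which is why the choice of generation strategy matters.
    return quality_fn(candidate) >= threshold

class EpsilonGreedyBandit:
    # Treats each generation strategy (e.g., a prompt template or a
    # foundation model) as an arm, and learns which arm yields
    # accepted tuples most often per query spent.
    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        # Explore a random arm with probability epsilon; otherwise
        # exploit the arm with the highest observed acceptance rate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental-mean update of the arm's acceptance rate,
        # where reward is 1 if the tuple passed the rejection test.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In this toy framing, a run would first compute `tuples_needed` to get per-group quotas, then repeatedly ask the bandit which strategy to query, apply the rejection test to each generated candidate, and feed the accept/reject outcome back as the bandit's reward until the quotas are met.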
Comprehensive experiments show that augmentation with synthetic tuples generated by advanced foundation models can come remarkably close to the effect of adding real tuples when the goal is to reduce downstream unfairness. In that sense, Chameleon demonstrates that generative AI can play a concrete role in responsible data management, not merely by analyzing biased datasets, but by actively helping repair them.
Citation: Mahdi Erfanian, H. V. Jagadish, and Abolfazl Asudeh. 2024. Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities. Proceedings of the VLDB Endowment, 17(11): 3470–3483.
Paper
Slides
Video