AIR Initiative Event: A Decade's Battle on Dataset Bias: Are We There Yet?

Starts: 2:00 pm on Monday, August 4, 2025
Ends: 3:00 pm on Monday, August 4, 2025

Speaker: Zhuang Liu, Assistant Professor of Computer Science at Princeton University

Talk Title: “A Decade's Battle on Dataset Bias: Are We There Yet?”

Abstract: Data is the prime ingredient of modern AI. To move toward "general" intelligence, models must learn from datasets that are as broad and unbiased as possible. Yet today’s large-scale vision datasets remain surprisingly skewed. By revisiting the decade-old “Name That Dataset” experiment, I show that a simple neural network classifier can guess an image’s source with over 80% accuracy — underscoring persistent bias and dataset fingerprints. Controlled perturbation studies further reveal which visual attributes (color, semantics, geometry, etc.) carry the strongest signals. The problem isn’t limited to vision: a lightweight classifier can identify the originating large language model from its text with 97% accuracy, exposing similarly sharp idiosyncrasies in model outputs, and by extension, their training data. These results demonstrate that scale alone cannot solve dataset bias, and I’ll conclude by discussing potential paths forward.

Bio: Zhuang Liu is an Assistant Professor of Computer Science at Princeton University. His research areas are deep learning and computer vision, with an emphasis on empirical approaches to understanding how models work and behave. His work spans vision and language, unified by a focus on deep learning methods, representations, and architectures. Prior to joining Princeton, he was a Research Scientist at Meta AI Research (FAIR) in New York City. He received his Ph.D. from UC Berkeley and his B.E. from Tsinghua University, both in Computer Science. He is a recipient of the CVPR Best Paper Award.

Faculty Host: Boqing Gong, Assistant Professor in Computer Science at Boston University

Location: 665 Commonwealth Ave, Room 801
