BabyVFM: Data-Efficient Pretraining of Vision LLMs Inspired by Infant Learning
Guest Speaker: Dr. Boqing Gong
Moderated by Dr. Reza Rawassizadeh, Associate Professor of Computer Science
Friday, 10th October, 2025
Abstract: Pretraining vision (large) foundation models (VFMs) is prohibitively expensive, making it a privilege for institutions with abundant resources and relegating independent researchers to downstream tasks such as benchmarking, interpreting, and aligning VFMs. This situation is a crisis for computer vision research. As Richard Feynman put it, “What I cannot create, I do not understand.” Independent researchers and the public cannot gain a true understanding of VFMs, trust in them, or the ability to use them safely merely by consuming open weights or APIs. Meanwhile, the few privileged VFM creators could soon reach a plateau without the broader research community’s nurturing.
Hence, we propose democratizing VFM pretraining by scaling it down to a developmentally plausible framework that is scientifically reasonable and computationally affordable on university budgets. The goal is to promote exploration rather than exploitation of pretraining and to enable independent researchers to build general-purpose VFMs that approach “baby intelligence,” thereby benefiting efforts towards “grown-up” AI. This framework will closely mimic the minimal yet highly informative sensory experiences of human infants, encompassing:
- Pretraining data curated from longitudinal, egocentric audiovisual recordings of babies.
- A suite of developmentally aligned evaluation benchmarks assessing VFM capabilities against cognitive milestones such as object permanence, social skills, and language acquisition.
- A user-friendly pretraining codebase and baseline models.
Bio: Boqing Gong is a computer science faculty member at Boston University and a part-time research scientist at Google DeepMind. His research in machine learning and computer vision focuses on visual recognition, video, and the generalization and efficiency of AI models.