BabyVFM: Data-Efficient Pretraining of Vision LLMs Inspired by Infant Learning

Guest Speaker: Dr. Boqing Gong

Moderated by Dr. Reza Rawassizadeh, Associate Professor of Computer Science

Friday, October 10, 2025

Abstract: Pretraining vision (large) foundation models (VFMs) is prohibitively expensive, making it a privilege of institutions with abundant resources and relegating independent researchers to downstream tasks such as benchmarking, interpreting, and aligning VFMs. This situation is a crisis for computer vision research: as Richard Feynman put it, “What I cannot create, I do not understand.” Independent researchers and the public cannot gain genuine understanding, trust, and safe use of VFMs passively from open weights or APIs. Meanwhile, the few privileged VFM creators may soon reach a plateau without the broad research community’s nurturing.

Hence, we propose democratizing VFM pretraining by scaling it down to a developmentally plausible framework that is scientifically sound and computationally affordable on university budgets, aiming to promote exploration rather than exploitation in pretraining and to enable independent researchers to build general-purpose VFMs that approach “baby intelligence,” benefiting efforts toward “grown-up” AI. This framework will closely mimic the minimal yet highly informative sensory experiences of human infants, encompassing:

  1. Pretraining data curated from longitudinal, egocentric audiovisual recordings of babies.
  2. A suite of developmentally aligned evaluation benchmarks assessing VFM capabilities against cognitive milestones like object permanence, social skills, and language acquisition.
  3. A user-friendly pretraining codebase and baseline models.

Bio: Boqing Gong is a computer science faculty member at Boston University and a part-time research scientist at Google DeepMind. His research on machine learning and computer vision focuses on visual recognition, video, and AI models’ generalization and efficiency.
