• Starts: 11:00 am on Thursday, September 21, 2023
• Ends: 12:30 pm on Thursday, September 21, 2023

Title: Efficient Learning in Single- and Multi-Modal Vision Transformers

Abstract: Transformers have revolutionized the way we approach reasoning and learning tasks in computer vision, in both single- and multi-modal settings. Self-supervised pre-training methods, such as the Masked Autoencoder (MAE), have emerged as a way to maximize the potential of vision transformers, but MAE requires a large number of pre-training epochs, making it expensive in practice. Our work introduces a supervised pre-training approach called SupMAE, which is more efficient, more robust, and more effective in transfer learning than MAE and other supervised pre-training methods. SupMAE achieves performance similar to MAE on ImageNet with the ViT-B/16 model while using only 30% of the compute cost, and it outperforms MAE on ImageNet variants. In addition, techniques have been developed to reduce the computational cost of vision transformers through post-training quantization, often using mixed-precision schemes or partitioning the model. We propose Evol-Q, a contrastive-loss-based approach that uses block-based evolutionary search over quantization scales and yields 1.5% better accuracy for 3-, 4-, and 8-bit models in less time than existing methods. In multi-modal tasks such as audio-video event localization, effective multi-modal feature correspondence is necessary to understand the various temporal interactions. Existing approaches struggle in this regard because of ineffective multi-modal training strategies. We introduce AVE-CLIP, a framework that combines AudioCLIP, a model pre-trained on large-scale audio-visual data, with a multi-window temporal transformer to handle the different temporal scales of video frames. AVE-CLIP improves performance on the AVE dataset by 5.9% compared to previous work. Taken together, SupMAE, Evol-Q, and AVE-CLIP demonstrate how to improve the efficiency and effectiveness of vision transformers across a variety of tasks.

Bio: Diana Marculescu is Department Chair, Cockrell Family Chair for Engineering Leadership #5, and Professor, Motorola Regents Chair in Electrical and Computer Engineering #2, at the University of Texas at Austin. She is also the Founding Director of the iMAGiNE Consortium on Intelligent Machine Engineering, a joint industry-university partnership focusing on engineering the machines that support intelligent applications from cloud to edge. Prior to joining UT Austin in December 2019, she was the David Edward Schramm Professor of Electrical and Computer Engineering, the Founding Director of the College of Engineering Center for Faculty Success (2015-2019), and Associate Department Head for Academic Affairs in Electrical and Computer Engineering (2014-2018), all at Carnegie Mellon University. She received the Dipl.Ing. degree in computer science from the Polytechnic University of Bucharest, Bucharest, Romania (1991), and the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, CA (1998). Her research interests include energy- and reliability-aware computing, hardware-aware machine learning, and computing for sustainability and natural science applications. Diana is a recipient of the National Science Foundation Faculty Career Award (2000-2004), the ACM SIGDA Technical Leadership Award (2003), the Carnegie Institute of Technology George Tallman Ladd Research Award (2004), and several best paper awards. She was an IEEE Circuits and Systems Society Distinguished Lecturer (2004-2005) and the Chair of the Association for Computing Machinery (ACM) Special Interest Group on Design Automation (2005-2009). Diana has chaired several conferences and symposia in her area and has served as an Associate Editor for several IEEE and ACM journals. She was selected as an ELATE Fellow (2013-2014) and was the recipient of an Australian Research Council Future Fellowship (2013-2017), the Marie R. Pistilli Women in EDA Achievement Award (2014), and the Barbara Lazarus Award from Carnegie Mellon University (2018). Diana is a Fellow of ACM, IEEE, and AAAS.

PHO 906