Sights, Sounds, and Space: Audio-visual Learning in 3D Environments

Moving around in the world is naturally a multisensory experience, but today’s embodied agents are deaf: they are restricted solely to their visual perception of the environment. We explore audio-visual learning in complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object, use echolocation to anticipate its 3D surroundings, and discover the link between its visual inputs and spatial sound. To support this goal, we introduce SoundSpaces: a platform for audio rendering based on geometrical acoustic simulations for two sets of publicly available 3D environments. SoundSpaces makes it possible to insert arbitrary sound sources into an array of real-world scanned environments. Building on this platform, we pursue a series of audio-visual spatial learning tasks that suggest how audio can benefit visual understanding of 3D spaces.

When: 11:00 am to 12:00 pm, Monday, April 12, 2021
Speaker: Kristen Grauman, Professor, Department of Computer Science, The University of Texas at Austin