Bridging Linguistic and Visual Knowledge through Visual Genome
FALL 2018 RESEARCH INCUBATION AWARDEE
PI: Derry Wijaya, Assistant Professor, Computer Science
What is the Challenge?
Visual Question Answering (VQA) systems aim to answer questions about images, but current systems fall short in important ways. Strong performance on the VQA task could dramatically improve the lives of blind people, whose questions about images are currently quite difficult to answer, and the underlying technology has a wide range of potential applications, such as image manipulation and image search, among others. We see two problems with current systems: disorientation and myopia. Current VQA systems are disoriented in the sense that they do not know what sort of answer to give.
What is the Solution?
The solution builds on Visual Genome, a large dataset that associates images with ‘scene graphs’ composed of object types, attributes, and relations, as well as with question-answer pairs about the images. We hypothesize that an understanding of the distinction between descriptions and referring expressions will significantly improve the quality of VQA systems and related applications. We will first develop an extension of the Visual Genome corpus that links referring, describing, and attribute-denoting expressions to objects embedded in scene graphs. Building on this dataset, we will develop natural language generation systems for describing or referring to an object in a complex scene, test them through behavioral experiments, and incorporate them into a new VQA system.
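As a rough illustration, a record in the extended corpus might look something like the sketch below, which links referring and describing expressions to objects in a scene graph. The class names and fields are assumptions for illustration only; they are not the actual Visual Genome schema or the project's final annotation format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SceneObject:
    object_id: int
    name: str                                              # e.g. "man"
    attributes: List[str] = field(default_factory=list)    # e.g. ["bearded"]

@dataclass
class Relation:
    subject_id: int
    predicate: str                                          # e.g. "holding"
    object_id: int

@dataclass
class AnnotatedObject:
    # Proposed extension: expressions linked to a scene-graph node.
    object_id: int
    referring_expressions: List[str] = field(default_factory=list)
    descriptions: List[str] = field(default_factory=list)

@dataclass
class SceneGraph:
    image_id: int
    objects: List[SceneObject]
    relations: List[Relation]
    annotations: List[AnnotatedObject]
```

In this sketch, a referring expression is meant to pick out its object uniquely within the scene, while a description characterizes the object without that uniqueness requirement.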
What is the Process?
We will devise a system for generating a description and a referring expression for a given object in a scene, taking a data-driven approach built on the Visual Genome dataset. The first step will be the development of a corpus that associates references and descriptions with nodes in scene graphs. Descriptions will be further classified as ‘headline style’ (e.g. man with a microscope) or ‘full style’ (e.g. a man with a microscope). This dataset will be made publicly available for other researchers to use. We will then build a description and referring expression generation system and test it in behavioral eye-tracking studies that measure the speed and accuracy with which participants identify the object in question.
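To make the style distinction concrete, the toy sketch below buckets a description as ‘headline’ or ‘full’ according to whether it begins with a determiner. The function name and determiner list are illustrative assumptions, not part of the planned annotation pipeline.

```python
DETERMINERS = {"a", "an", "the"}

def description_style(description: str) -> str:
    """Classify a description as 'full' or 'headline' by its leading word."""
    first_word = description.strip().lower().split()[0]
    return "full" if first_word in DETERMINERS else "headline"

assert description_style("man with a microscope") == "headline"
assert description_style("a man with a microscope") == "full"
```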