CIF: Small: Learning Mixed Membership Models with a Separable Latent Structure: Theory, Provably Efficient Algorithms, and Applications

Sponsor: National Science Foundation

Award Number: 1527618

PI: Prakash Ishwar

Co-I/Co-PI: Venkatesh Saligrama

Abstract:

In a wide spectrum of problems in science and engineering that includes hyperspectral imaging, gene expression analysis, and metabolic networks, the observed data is high-dimensional and arises from an unknown random mixture of a small set of unknown shared latent (hidden) causes. Being able to successfully and efficiently identify the latent causes from the observed data is important not only for scientific understanding, but also for efficient data representation and decision making. Popular algorithms for such problems make use of approximations and heuristics for computational tractability. As a consequence, consistency or efficiency guarantees for such algorithms are either very weak or nonexistent. This research involves the development of algorithms for learning latent-cause models from high-dimensional data with provable statistical and computational efficiency guarantees.

The linchpin of this research is a natural separability property of the shared latent factors ? the presence of a signature component for each latent factor that is approximately satisfied by the estimates produced by popular approaches. This research aims to establish that approximate separability is not only a natural and convenient structural property of mixed-membership latent-factor models, but is, in fact, an inevitable consequence of high-dimensionality. This research also involves the development of a suite of provably consistent and statistically and computationally efficient algorithms for a diverse set of mixed-membership latent-factor problems by suitably leveraging the geometry induced by the signature components. The key insight is to identify the signature parts of each latent factor as extreme points in a suitable space. This can be done efficiently through appropriately defined random projections. The random-projections-based algorithm is naturally amenable to a low-communication-cost distributed implementation that is attractive for web-scale distributed data-mining applications.

For more information: click here.