Collaborative Research: CIF: Small: Learning from Multiple Biased Sources

Sponsor: National Science Foundation

Award Number: CCF-2007350

PI: Venkatesh Saligrama

Abstract:

The field of artificial intelligence, and especially machine learning, is concerned with automating the performance of a task by learning from past performances of that task. Examples include classifying images and successfully navigating a maze. Classical machine learning methods assume that past occurrences of a task, or "training data," accurately represent future occurrences of the task. In many applications, however, training data are drawn from multiple sources that reflect future occurrences with varying degrees of quality. Examples include images labeled by crowd-sourced users or navigation of randomly simulated mazes. The objective of this project is to develop theoretical foundations of learning from multiple biased sources. The work will be motivated by applications in crowdsourcing and autonomous navigation as described above, as well as in video surveillance and nuclear threat detection. This research will support the cross-disciplinary development of a diverse cohort of PhD and undergraduate students at the University of Michigan and at Boston University.

To achieve these goals, the investigators will establish theoretical foundations for four broad classes of machine learning problems for which virtually no theory presently exists: (1) classification from multiple corrupted sources, (2) clustering with overlapping, nonparametric clusters, (3) Sim2Real reinforcement learning, and (4) zero-shot learning. The project's theoretical contributions will take the form of generalization error bounds, regret bounds, and sample complexity bounds, while emphasizing distribution-free or general nonparametric models wherever possible. To address the challenges of eliciting and aggregating biased information from multiple sources, the analyses will develop new technical tools, including weighted Rademacher complexity, regret analysis under biased bandit feedback, and oracle inequalities for density estimation, which are likely to find application in other learning settings. The research resulting from this effort will highlight distinctive features of learning from multiple sources, including questions that arise when sources contribute samples of different sizes. More generally, the research develops principled approaches for integrating heterogeneous data sources in both batch and sequential learning settings and under a variety of inter-source dependence models.
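As one illustration of the kind of tool mentioned above, the LaTeX sketch below writes down a weighted empirical Rademacher complexity for data pooled from K sources. The notation, the particular weighting, and its use are assumptions for illustration only, not definitions taken from the award abstract.

% Minimal sketch (assumed notation, not from the award abstract):
% K sources, source k contributing n_k samples x_{k,1},...,x_{k,n_k},
% nonnegative weights w_1,...,w_K summing to one, and independent
% Rademacher signs sigma_{k,i} taking values +1/-1 with equal probability.
\[
  \widehat{\mathfrak{R}}_{w}(\mathcal{F})
  \;=\;
  \mathbb{E}_{\sigma}\!\left[
    \sup_{f \in \mathcal{F}}
    \sum_{k=1}^{K} \frac{w_k}{n_k} \sum_{i=1}^{n_k} \sigma_{k,i}\, f(x_{k,i})
  \right],
  \qquad
  \sum_{k=1}^{K} w_k = 1 .
\]
% Taking w_k = n_k / (n_1 + ... + n_K) recovers the usual empirical
% Rademacher complexity of the pooled sample; other weightings can
% down-weight sources believed to be more biased or noisier.

In this sketch, choosing the weights w_k is where the bias and sample size of each source enter: uniform proportional weights treat all data as equally trustworthy, while nonuniform weights trade off source quality against source quantity.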

This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.
