Clean Data and Unlimited Resources: A Probabilistic Journey to a Fantasyland: Barna Saha, U Maryland
- 11:00 am on Wednesday, March 5, 2014
- 12:15 pm on Wednesday, March 5, 2014
- Hariri Institute
Abstract: Over the last decade, while the quantity of digital data skyrocketed, the quality of data remained far from perfect. Use of automated tools and fallible humans for data collection and integration, complex storage, and unreliable channels for data processing and communication all contribute to low quality of data. Since poor data quality can have serious consequences on the results of data analyses, the importance of veracity, the fourth 'V' of Big Data, is increasingly being recognized. On the other hand, with the rise of data centers as a popular computing platform with intensive demand, the need for proper allocation of data center resources is more important than ever. In the first part of this talk, we develop scalable (near-linear time) algorithms to repair semi-structured data, such as XML, popular for data interchange and storage. While more than half of Web data is XML, a large fraction of it is erroneous, with the major source of errors being mismatch between open and close tags. This problem of repairing XML data leads to a broad class of problems that we call the ``Language Edit Distance Problem’’: given a string σ over alphabet Σ and a grammar G defined over Σ, compute the minimum number of repairs (insertions, deletions and substitutions), that are required to map σ into a valid member of G. The language edit distance problem is a significant generalization of the widely used string edit distance problem and has many applications beyond data quality, in compiler optimization, natural language processing, biological sciences, etc. Randomization here plays a crucial role in the design and analysis of algorithms. In the second part of the talk, we summarize our contributions in resource allocation problems. Specifically, we present our work on a powerful probabilistic method, the Lovász Local Lemma, and show how this leads to resolving a major open question, and the very first set of algorithms for a large class of combinatorial optimization problems. Bio. Barna Saha is a Senior Member of Technical Staff Research at AT&T-Shannon Labs, and has been there since August 2011 after completing her Ph.D. from the University of Maryland, College Park working with Samir Khuller. Her research interests span algorithm design and analysis, discrete optimization, and foundational aspects of databases and data management. She received a Dean's Dissertation Fellowship Award for excellence in dissertation research. She is the recipient of the best paper award at the Very Large Data Bases Conference (VLDB'09) for her work on uncertain ranking. Her work on designing scalable algorithms for data quality problems was the finalist for best paper awards at IEEE International Conference on Data Engineering (ICDE'12).