Stony the road we trod, Bitter the chastâ€™ning rod, Felt in the days when...
SE PhD Final Oral Defense of Jing Qian
- 10:00 am on Wednesday, August 20, 2014
- 12:00 pm on Wednesday, August 20, 2014
- 15 Saint Mary's Street, Rm 105
TITLE: Anomaly Detection in High-Dimensional Space ABSTRACT: Anomaly detection refers to identifying statistically significant deviations of data from an expected nominal behavior. It has found extensive use in a wide variety of applications such as fraud detection, intrusion detection, surveillance, etc. In this thesis we study two fundamental scenarios in anomaly detection: point-wise anomaly detection, and anomalous cluster detection on graphs. In the first part we consider the problem of point-wise anomaly detection. Given n i.i.d samples from some unknown nominal density, the goal is to design a mechanism that decides whether a new test point comes from the same distribution, under desired false alarm control. We focus on nonparametric approaches to this problem, more specifically, via estimating the minimum volume (MV) set corresponding to the unknown nominal distribution. Existing popular methods reformulate this problem into computing the volume of MV set containing the test point, which equals to the p-value at the test point, based on k-nearest neighbor (k-NN) distances among training samples. While having better performance than traditional algorithms such as one-class SVM, these k-NN based approaches have higher computational complexity at test stage. We propose a novel ranking-based approach. We first rank k-NN scores on the n-sample nominal training data. We then train limited complexity models to imitate the order of these scores based on the max-margin learning-to-rank framework. A test point is declared as an anomaly at α false alarm level if the predicted score is in the α-percentile. We perform asymptotic and finite sample analysis of our algorithm. Moreover, our approach only evaluates an SVM-type decision function for each test point, which significantly improves test stage complexity over density-based methods. In the second part we explore the problem of anomalous cluster detection on graphs, where signals are associated with nodes / edges. Existing work focuses on statistical aspects of the problem and has shown that the combinatorial test of maximizing the scan statistic over the collection of connected subgraphs performs well for signals following the exponential family of distributions. However this test is computationally infeasible, and there is little work on practical algorithms mainly due to difficulty of characterizing connectivity of nodes. We propose a novel approach that characterizes the family of connected sub-graphs in terms of linear matrix inequalities (LMI). We embed connected sub-graphs into matrices that satisfy the LMI and further relax the problem to a convex program that overcomes the exponential nature of searching for clusters on the graph. We show that the solution to this convex program guarantees connectivity. Furthermore, our approach is able to incorporate the shape information of the sub-graphs of interest. We show that our test is nearly minimax optimal for the exponential family of random variables on 1D and 2D lattice, and compare against other state-of-art methods on synthetic and real data sets. COMMITTEE: Advisor: Venkatesh Saligrama, SE/ECE; David Castañon, SE/ECE; Eric Kolaczyk, SE/Mathematics & Statistics; Henry Lam, Mathematics & Statistics; Chair: Mac Schwager, SE/ME