“Anomaly Detection with Score functions based on Nearest Neighbor Graphs”
Manqi Zhao (PhD ’11)
Prof. Venkatesh Saligrama
Funding: National Science Foundation
Background: Anomaly detection involves detecting statistically significant deviations of test data from nominal distribution. In typical applications the nominal distribution is unknown and generally cannot be reliably estimated from nominal training data due to a combination of factors such as limited data size and high dimensionality.
Description: We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. Anomalies are declared whenever the score of a test sample falls below alpha, which is supposed to be the desired false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the specified false alarm level, alpha, for the case when the anomaly density is a mixture of the nominal and a known density. Our algorithm is computationally efficient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artificial and real data sets in high dimensional feature spaces.
Results: Below are shown sample results obtained in this project. On left are level sets of the nominal bivariate Gaussian mixture distribution used to illustrate the K-LPE algorithm. In the middle are results of K-LPE with K= 6 and Euclidean distance metric for m = 150 test points drawn from an equal mixture of 2D uniform and the (nominal) bivariate distributions. Scores for the test points are based on 200 nominal training samples. Scores falling below a threshold level 0.05 are declared as anomalies. The dotted contour corresponds to the exact bivariate Gaussian density level set at level alpha= 0.05. On right is the empirical distribution of the test point scores associated with the bivariate Gaussian that appears to be uniform while scores for the test points drawn from 2D uniform distribution cluster around zero.
M. Zhao and V. Saligrama, “Anomaly Detection with Score functions based on Nearest Neighbor Graphs”, Neural Information Processing Systems (NIPS) Conference, Vancouver, B.C., Canada, Dec 2009