SE PhD Prospectus Defense of Christy Lin

Starts:
11:00 am on Wednesday, May 22, 2019
Ends:
1:00 pm on Wednesday, May 22, 2019
Location:
15 Saint Mary's Street, Rm 105
TITLE: RANDOM WALK NODE EMBEDDINGS FOR NETWORK COMMUNITY DETECTION: THEORY AND ALGORITHMS

ABSTRACT: There has been an explosion of network data in the physical, chemical, biological, computational, and social sciences in the last few decades. Node embeddings, i.e., Euclidean-space representations of nodes in a network, make it possible to apply to network data, tools and algorithms from multivariate statistics and machine learning that were developed for Euclidean-space data. Random Walk Embeddings are a class of recently developed node embedding techniques where the vector representations are learned from proximal node co-occurrences in sampled random walks on the network. They have been applied to a number of supervised learning problems such as link prediction and node classification and have demonstrated state-of-the-art performance. Yet, random walk embeddings remain poorly understood.

This doctoral research studies random walk based node embeddings within the context of the community detection problem, an unsupervised network learning task in which the goal is to partition a network into sub-networks (communities) of nodes that have a similar pattern of connectivity. The broad goals of this doctoral research are to (i) develop and experimentally validate VEC, a random walk based node embedding algorithm for the community detection problem under the Stochastic Block Model (SBM) and its generalizations and (ii) develop a mathematical framework to theoretically analyze the properties of random walk node embeddings and use it to characterize the large-network asymptotic performance limits of community detection based on these embeddings.

Extensive experiments on real world networks and SBM random networks are presented to demonstrate the superior accuracy and robustness of VEC over competing algorithms. Empirical results suggest that VEC can attain the informationtheoretic limits for exact and weak community recovery for SBM networks. We outline preliminary work and a multi-step analysis strategy for establishing concentration and sample complexity bounds for the embedding vectors generated by VEC. This requires a deeper examination of the underlying optimization problem, which is non-convex and dependent on the sampled random walks. As a first step, we have developed a nuclear norm constrained convex relaxation of the original problem. Work is proposed to establish concentration bounds for the convex objective function and its optimal solution and to conduct experiments to support and validate our analysis.

COMMITTEE: CO-ADVISORS Prakash Ishwar, SE/ECE/CISE and Daniel Lewis Sussman, Math/Stat/CISE; Eric Kolaczyk,SE/Math/Stat/CISE; Lorenzo Orecchia, CS/CISE