Thursday 28

Luke Miratrix - Harvard University

Title: An introspection on using sparse regression techniques to analyze text.

Abstract: In this talk, I propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: legal decisions on workers' compensation claims (to understand relevant case law) and an OSHA database of occupation-related accident reports (to search for high-risk circumstances). Our summarization framework, built on sparse classification methods, is a lightweight and flexible tool that offers a compromise between the simple word-frequency-based methods currently in wide use and more heavyweight, model-intensive methods such as Latent Dirichlet Allocation (LDA). For a particular topic of interest (e.g., emotional disability, or chemical gas), we automatically label documents as being either on- or off-topic, and then use sparse classification methods to predict these labels from the high-dimensional counts of all the other words and phrases in the documents. The small set of phrases found to be predictive is then harvested as the summary. Using a branch-and-bound approach, this method can be extended to phrases of arbitrary length, which allows for potentially rich summarization. I further discuss how a focus on specific aspects of the corpus and the purpose of the summaries can inform choices of regularization parameters and constraints on the model. Overall, I argue that sparse methods have much to offer text analysis, and hope that this work opens the door for a new branch of research in this important field.
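For readers unfamiliar with the idea, the labeling-then-sparse-classification step described in the abstract can be illustrated with a minimal sketch. This is not the speaker's implementation: it assumes L1-penalized logistic regression fit by plain proximal gradient descent, uses single words rather than phrases (no branch-and-bound), and runs on a made-up six-document corpus. The function name `summarize`, its parameters (`lam`, `lr`, `steps`), and the toy reports are all hypothetical.

```python
import math
from collections import Counter


def summarize(docs, topic_word, lam=0.05, lr=0.5, steps=500):
    """Return (word, weight) pairs with positive L1-penalized weights.

    Illustrative only: L1-regularized logistic regression fit by
    proximal gradient descent (gradient step, then soft-thresholding).
    """
    tokenized = [d.lower().split() for d in docs]
    # Automatic labeling: a document is on-topic if it mentions the topic word.
    labels = [1.0 if topic_word in toks else 0.0 for toks in tokenized]
    # Features are counts of all the *other* words.
    vocab = sorted({w for toks in tokenized for w in toks if w != topic_word})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        counts = Counter(w for w in toks if w != topic_word)
        rows.append({index[w]: float(c) for w, c in counts.items()})

    weights = [0.0] * len(vocab)
    bias = 0.0
    n = len(docs)
    for _ in range(steps):
        grad = [0.0] * len(vocab)
        grad_bias = 0.0
        for row, y in zip(rows, labels):
            z = bias + sum(weights[j] * x for j, x in row.items())
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_bias += err / n
            for j, x in row.items():
                grad[j] += err * x / n
        bias -= lr * grad_bias  # the intercept is not penalized
        for j in range(len(vocab)):
            w = weights[j] - lr * grad[j]
            # Soft-thresholding shrinks small weights to exactly zero.
            weights[j] = math.copysign(max(abs(w) - lr * lam, 0.0), w)

    selected = [(w, weights[index[w]]) for w in vocab if weights[index[w]] > 0]
    return sorted(selected, key=lambda pair: -pair[1])


# Hypothetical accident reports; "gas" is the topic of interest.
reports = [
    "gas leak caused fumes in tank",
    "worker inhaled gas fumes near tank",
    "gas fumes filled the tank room",
    "worker fell from ladder in room",
    "ladder slipped and worker fell",
    "the worker fell near the ladder",
]
summary = summarize(reports, "gas")
for word, weight in summary:
    print(f"{word}\t{weight:.3f}")
```

Words with positive weight are those whose presence predicts an on-topic document, and they form the summary. Raising `lam` yields a shorter list, which is one simple way the choice of regularization parameter shapes the result.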

When	4:00 pm to 5:00 pm on Thursday, March 28, 2013
Building	MCS 148
Fees	Free
