CISE Seminar: Chao Tian

Date: Friday, November 8, 2024
Time: 3:00pm – 4:00pm
Location: 665 Commonwealth Ave., CDS 1101

Chao Tian
Associate Professor of Electrical & Computer Engineering
Texas A&M University

Transformers Learn Variable-order Markov Chains in-Context

Large language models (LLMs) have demonstrated impressive in-context learning (ICL) capability. However, it is still unclear how the underlying transformers accomplish it, especially in more complex scenarios. Toward this end, several recent works have studied how transformers learn fixed-order Markov chains (FOMCs) in context, yet natural languages are more suitably modeled by variable-order Markov chains (VOMCs), i.e., context trees (CTs). We study the ICL of VOMCs by viewing language modeling as a form of data compression and focusing on small alphabets and low-order VOMCs. This perspective allows us to leverage mature compression algorithms, such as the context-tree weighting (CTW) and prediction by partial matching (PPM) algorithms, as baselines; the former is Bayesian optimal for a class of priors that we refer to as the CTW priors. We empirically observe a few phenomena: 1) transformers can indeed learn to compress VOMC sources in-context, whereas PPM suffers significantly; 2) the performance of transformers is not very sensitive to the number of layers, and even a two-layer transformer can learn in-context quite well; and 3) transformers trained and tested on non-CTW priors can significantly outperform the CTW algorithm. To explain these phenomena, we analyze the attention maps of the transformers and extract two mechanisms, for which we provide two transformer constructions: 1) a construction with D+2 layers that can accurately mimic the CTW algorithm for CTs of maximum order D; and 2) a two-layer transformer that uses the feed-forward network for probability blending. These constructions explain most of the phenomena mentioned above. One distinction from the FOMC setting is that a counting mechanism appears to play an important role. We implement these synthetic transformer layers and show that such hybrid transformers can match the ICL performance of trained transformers; more interestingly, some of them perform even better despite much-reduced parameter sets.
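
Background for the abstract's terminology: the short Python sketch below (illustrative only, not the speaker's code) builds a small binary context tree, samples from the resulting variable-order Markov chain, and computes the context-tree weighting code length of the sample. The specific tree, probabilities, and function names are assumptions chosen for illustration.

import math
import random
from collections import defaultdict

# Hypothetical binary context tree; each leaf is a context written with the most
# recent symbol first, together with P(next symbol = 1) given that context.
TREE = {"1": 0.1,    # previous symbol was 1
        "01": 0.4,   # previous symbol 0, the one before it 1
        "00": 0.8}   # previous two symbols both 0
MAX_ORDER = 2        # maximum context length D

def sample_vomc(n, seed=0):
    """Sample n symbols from the variable-order Markov chain defined by TREE."""
    rng = random.Random(seed)
    x = [0] * MAX_ORDER  # arbitrary initial context
    for _ in range(n):
        ctx, i = "", len(x) - 1
        while ctx not in TREE:        # grow the context until it reaches a leaf
            ctx += str(x[i])
            i -= 1
        x.append(1 if rng.random() < TREE[ctx] else 0)
    return x[MAX_ORDER:]

def kt(a, b):
    """Krichevsky-Trofimov block probability of observing a zeros and b ones."""
    p, n0, n1 = 1.0, 0, 0
    for _ in range(a):
        p *= (n0 + 0.5) / (n0 + n1 + 1)
        n0 += 1
    for _ in range(b):
        p *= (n1 + 0.5) / (n0 + n1 + 1)
        n1 += 1
    return p

def ctw_codelength(x, D=MAX_ORDER):
    """CTW code length (bits) of x[D:], conditioned on the first D symbols."""
    counts = defaultdict(lambda: [0, 0])  # context (most-recent first) -> [#0s, #1s]
    for t in range(D, len(x)):
        ctx = "".join(str(x[t - 1 - d]) for d in range(D))
        for d in range(D + 1):
            counts[ctx[:d]][x[t]] += 1

    def pw(node):  # weighted probability; a real implementation would work in the log domain
        a, b = counts.get(node, (0, 0))
        pe = kt(a, b)
        if len(node) == D or a + b == 0:
            return pe
        return 0.5 * pe + 0.5 * pw(node + "0") * pw(node + "1")

    return -math.log2(pw(""))

if __name__ == "__main__":
    x = sample_vomc(500)
    n1 = sum(x)
    print(f"CTW:     {ctw_codelength(x) / (len(x) - MAX_ORDER):.3f} bits/symbol")
    print(f"order-0: {-math.log2(kt(len(x) - n1, n1)) / len(x):.3f} bits/symbol")

On this toy source, the CTW code length per symbol should come out noticeably below the order-0 (memoryless) baseline, since the weighting adapts to the variable-order structure; this adaptivity is what makes CTW the Bayesian-optimal reference point used in the talk.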

Chao Tian obtained his B.E. degree from Tsinghua University, Beijing, China, and his M.S. and Ph.D. degrees from Cornell University, Ithaca, NY. He worked at AT&T Labs-Research (previously known as the Shannon Labs) as a researcher on communication and signal processing for seven years before returning to academia. He was with the University of Tennessee, Knoxville, for a few years before joining Texas A&M University, where he is now an associate professor. His co-authored papers have received several awards, including the 2014 IEEE Data Storage Best Paper Award. He is currently an IEEE Information Theory Society Distinguished Lecturer.

Faculty Host: Prakash Ishwar
Student Host: Akua Kodie Dickson