Quanta Magazine: To Understand AI, Watch How It Evolves

Excerpt from Quanta Magazine | By Ben Brubaker | September 24, 2025
These days, large language models such as ChatGPT are omnipresent. Yet their inner workings remain deeply mysterious. To Naomi Saphra, that’s an unsatisfying state of affairs. “We don’t know what makes a language model tick,” she said. “If we have these models everywhere, we should understand what they’re doing.”
Saphra, a research fellow at Harvard University’s Kempner Institute who will join Boston University in 2026 as an assistant professor in Computing & Data Sciences, has worked for over a decade in the growing field of interpretability, in which researchers poke around inside language models to uncover the mechanisms that make them work. While many of her fellow interpretability researchers draw inspiration from neuroscience, Saphra favors a different analogy. Interpretability, in her view, should take a cue from evolutionary biology.
“There’s this very famous quote by [the geneticist Theodosius] Dobzhansky: ‘Nothing makes sense in biology except in the light of evolution,’” she said. “Nothing makes sense in AI except in the light of stochastic gradient descent,” a classic algorithm that plays a central role in the training process through which large language models learn to generate coherent text.
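For readers who want a concrete picture of the algorithm Saphra is referring to, here is a minimal sketch, not taken from the article, of a single stochastic gradient descent step. The toy squared-error objective, the learning rate, and the variable names are all illustrative assumptions, not anything specific to language models.

```python
import numpy as np

# Illustrative sketch only: one stochastic gradient descent step on a toy
# least-squares problem. The update nudges the weights slightly in the
# direction that reduces the error on one example; training a language
# model repeats this kind of tiny tweak trillions of times.

rng = np.random.default_rng(0)
w = rng.normal(size=3)                  # weights start out random
x, y = rng.normal(size=3), 1.0          # one training example (input, target)

pred = w @ x                            # model's prediction
grad = 2 * (pred - y) * x               # gradient of the squared error w.r.t. w
lr = 0.1                                # learning rate (step size)
w -= lr * grad                          # the SGD update: a small tweak to w
```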
Language models are based on neural networks, mathematical structures that process data using connections between artificial “neurons.” The strength of each connection is random at first, but during the training process the connections get tweaked as the model repeatedly attempts to predict the next word in sentences from a vast text dataset. Somehow, through trillions of tiny tweaks, the model develops internal structures that enable it to “generalize,” or respond fluently to unfamiliar inputs.
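To make that description concrete, here is a hedged toy sketch, again not drawn from the article: a tiny bigram "next-word" predictor with a single weight matrix, trained by stochastic gradient descent on a made-up corpus. The corpus, learning rate, and model size are purely illustrative assumptions; real language models are vastly larger neural networks, but the loop follows the same cycle the paragraph describes, starting from random connection strengths, predicting the next word, measuring the error, and tweaking the weights slightly.

```python
import numpy as np

# Toy next-word predictor (bigram model) trained with stochastic gradient
# descent. Everything here is illustrative, not the training code of any
# real language model.

corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, V))   # connection strengths, random at first
lr = 0.5                                 # learning rate

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for epoch in range(200):
    for prev, nxt in zip(corpus[:-1], corpus[1:]):
        i, j = idx[prev], idx[nxt]
        probs = softmax(W[i])            # predicted distribution over next words
        grad = probs.copy()
        grad[j] -= 1.0                   # gradient of the cross-entropy loss
        W[i] -= lr * grad                # SGD tweak to this row of weights

# After training, the model has picked up which word tends to follow "the".
print(vocab[int(np.argmax(W[idx["the"]]))])   # prints "cat" for this corpus
```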
Most interpretability research focuses on understanding these structures in language models after the training process. Saphra is a prominent champion of an alternative approach that focuses on the training process itself. Just as biologists must understand an organism’s evolutionary history to fully understand the organism, she argues, interpretability researchers should pay more attention to what happens during training. “If you don’t understand the origins of the model, then you don’t understand why anything works,” she said.