MIT Latest News

MIT News is dedicated to communicating to the media and the public the news and achievements of the students, faculty, staff and the greater MIT community.

Guided learning lets “untrainable” neural networks realize their potential

Thu, 12/18/2025 - 4:20pm

Even networks long considered “untrainable” can learn effectively with a bit of a helping hand. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have shown that a brief period of alignment between neural networks, a method they call guidance, can dramatically improve the performance of architectures previously thought unsuitable for modern tasks.

Their findings suggest that many so-called “ineffective” networks may simply begin from poor starting points, and that short-term guidance can move them to a region of parameter space where learning is easier.

The team’s guidance method works by encouraging a target network to match the internal representations of a guide network during training. Unlike traditional methods like knowledge distillation, which focus on mimicking a teacher’s outputs, guidance transfers structural knowledge directly from one network to another. This means the target learns how the guide organizes information within each layer, rather than simply copying its behavior. Remarkably, even untrained networks contain architectural biases that can be transferred, while trained guides additionally convey learned patterns. 
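To make the idea of matching internal representations concrete, here is a minimal sketch of a layer-wise similarity loss. It assumes linear centered kernel alignment (CKA) as the similarity measure, which the article does not specify, and the function names (`linear_cka`, `guidance_loss`) are hypothetical. Each input is a matrix of activations with one row per example and one column per unit.

```python
import math


def center_columns(m):
    """Subtract each column's mean so similarity ignores constant offsets."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for matrices a (n x p) and b (n x q)."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            d = sum(a[r][i] * b[r][j] for r in range(len(a)))
            total += d * d
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices; a value in [0, 1]."""
    x, y = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(x, y)
    den = math.sqrt(cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y))
    return num / den


def guidance_loss(target_acts, guide_acts):
    """Assumed guidance term: small when the target layer organizes the
    examples the same way the guide layer does."""
    return 1.0 - linear_cka(target_acts, guide_acts)
```

Note that the two layers may have different widths, and that the loss compares how examples are arranged relative to one another rather than what the networks output. In a real training loop this term would be computed on differentiable tensors and added to the task loss.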

“We found these results pretty surprising,” says Vighnesh Subramaniam ’23, MEng ’24, MIT Department of Electrical Engineering and Computer Science (EECS) PhD student and CSAIL researcher, who is a lead author on a paper presenting these findings. “It’s impressive that we could use representational similarity to make these traditionally ‘crappy’ networks actually work.”

Guide-ian angel 

A central question was whether guidance must continue throughout training, or if its primary effect is to provide a better initialization. To explore this, the researchers performed an experiment with deep fully connected networks (FCNs). Before training on the real problem, the network spent a few steps practicing with another network using random noise, like stretching before exercise. The results were striking: Networks that would typically overfit immediately instead remained stable, achieved lower training loss, and avoided the classic performance degradation seen in standard FCNs. This alignment acted like a helpful warmup for the network, showing that even a short practice session can have lasting benefits without needing constant guidance.

The study also compared guidance to knowledge distillation, a popular approach in which a student network attempts to mimic a teacher’s outputs. When the teacher network was untrained, distillation failed completely, since the outputs contained no meaningful signal. Guidance, by contrast, still produced strong improvements because it leverages internal representations rather than final predictions. This result underscores a key insight: Untrained networks already encode valuable architectural biases that can steer other networks toward effective learning.

Beyond the experimental results, the findings have broad implications for understanding neural network architecture. The researchers suggest that success — or failure — often depends less on task-specific data, and more on the network’s position in parameter space. By aligning with a guide network, it’s possible to separate the contributions of architectural biases from those of learned knowledge. This allows scientists to identify which features of a network’s design support effective learning, and which challenges stem simply from poor initialization.

Guidance also opens new avenues for studying relationships between architectures. By measuring how easily one network can guide another, researchers can probe distances between functional designs and reexamine theories of neural network optimization. Since the method relies on representational similarity, it may reveal previously hidden structures in network design, helping to identify which components contribute most to learning and which do not.

Salvaging the hopeless

Ultimately, the work shows that so-called “untrainable” networks are not inherently doomed. With guidance, failure modes can be eliminated, overfitting avoided, and previously ineffective architectures brought into line with modern performance standards. The CSAIL team plans to explore which architectural elements are most responsible for these improvements and how these insights can influence future network design. By revealing the hidden potential of even the most stubborn networks, guidance provides a powerful new tool for understanding — and hopefully shaping — the foundations of machine learning.

“It’s generally assumed that different neural network architectures have particular strengths and weaknesses,” says Leyla Isik, Johns Hopkins University assistant professor of cognitive science, who wasn’t involved in the research. “This exciting research shows that one type of network can inherit the advantages of another architecture, without losing its original capabilities. Remarkably, the authors show this can be done using small, untrained ‘guide’ networks. This paper introduces a novel and concrete way to add different inductive biases into neural networks, which is critical for developing more efficient and human-aligned AI.”

Subramaniam wrote the paper with CSAIL colleagues: Research Scientist Brian Cheung; PhD student David Mayo ’18, MEng ’19; Research Associate Colin Conwell; principal investigators Boris Katz, a CSAIL principal research scientist, and Tomaso Poggio, an MIT professor in brain and cognitive sciences; and former CSAIL research scientist Andrei Barbu. Their work was supported, in part, by the Center for Brains, Minds, and Machines, the National Science Foundation, the MIT CSAIL Machine Learning Applications Initiative, the MIT-IBM Watson AI Lab, the U.S. Defense Advanced Research Projects Agency (DARPA), the U.S. Department of the Air Force Artificial Intelligence Accelerator, and the U.S. Air Force Office of Scientific Research.

Their work was recently presented at the Conference and Workshop on Neural Information Processing Systems (NeurIPS).

A new way to increase the capabilities of large language models

Wed, 12/17/2025 - 11:10pm

Most languages use word position and sentence structure to extract meaning. For example, “The cat sat on the box,” is not the same as “The box was on the cat.” Over a long text, like a financial document or a novel, the syntax of these words likely evolves. 

Similarly, a person might be tracking variables in a piece of code or following instructions that have conditional actions. These are examples of state changes and sequential reasoning that we expect state-of-the-art artificial intelligence systems to excel at; however, the existing, cutting-edge attention mechanism within transformers, the primary architecture used in large language models (LLMs) for determining the importance of words, has theoretical and empirical limitations when it comes to such capabilities.

An attention mechanism allows an LLM to look back at earlier parts of a query or document and, based on its training, determine which details and words matter most; however, this mechanism alone does not understand word order. It “sees” all of the input words, a.k.a. tokens, at the same time and handles them in the order that they’re presented, so researchers have developed techniques to encode position information. This is key for domains that are highly structured, like language. But the predominant position-encoding method, called rotary position encoding (RoPE), only takes into account the relative distance between tokens in a sequence and is independent of the input data. This means that, for example, words that are four positions apart, like “cat” and “box” in the example above, will all receive the same fixed mathematical rotation specific to that relative distance. 
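The point about a fixed rotation per relative distance can be illustrated with a two-dimensional toy version of rotary encoding. This is only a sketch: real models rotate many pairs of dimensions at different frequencies, and the `theta` value and function names here are assumptions made for illustration.

```python
import math


def rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def rope_score(query, key, q_pos, k_pos, theta=0.5):
    """Dot product of the position-rotated query and key."""
    q = rotate(query, q_pos, theta)
    k = rotate(key, k_pos, theta)
    return q[0] * k[0] + q[1] * k[1]


# Same vectors, same offset of four positions, different absolute positions.
near_start = rope_score((1.0, 2.0), (3.0, -1.0), q_pos=5, k_pos=1)
far_along = rope_score((1.0, 2.0), (3.0, -1.0), q_pos=104, k_pos=100)
```

The two scores agree up to floating-point error because only the offset between positions enters the result. Nothing about the tokens in between can change it, which is the limitation the article describes.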

Now research led by MIT and the MIT-IBM Watson AI Lab has produced an encoding technique known as “PaTH Attention” that makes positional information adaptive and context-aware rather than static, as with RoPE.

“Transformers enable accurate and scalable modeling of many domains, but they have these limitations vis-a-vis state tracking, a class of phenomena that is thought to underlie important capabilities that we want in our AI systems. So, the important question is: How can we maintain the scalability and efficiency of transformers, while enabling state tracking?” says the paper’s senior author Yoon Kim, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a researcher with the MIT-IBM Watson AI Lab.

A new paper on this work was presented earlier this month at the Conference on Neural Information Processing Systems (NeurIPS). Kim’s co-authors include lead author Songlin Yang, an EECS graduate student and former MIT-IBM Watson AI Lab Summer Program intern; Kaiyue Wen of Stanford University; Liliang Ren of Microsoft; and Yikang Shen, Shawn Tan, Mayank Mishra, and Rameswar Panda of IBM Research and the MIT-IBM Watson AI Lab.

Path to understanding 

Instead of assigning every word a fixed rotation based on the relative distance between tokens, as RoPE does, PaTH Attention is flexible, treating the in-between words as a path made up of small, data-dependent transformations. Each transformation, based on a mathematical operation called a Householder reflection, acts like a tiny mirror that adjusts depending on the content of each token it passes. Each step in a sequence can influence how the model interprets information later on. The cumulative effect lets the system model how the meaning changes along the path between words, not just how far apart they are. This approach allows transformers to keep track of how entities and relationships change over time, giving them a sense of “positional memory.” Think of this as walking a path while experiencing your environment and how it affects you. The team also developed a hardware-efficient algorithm for computing attention scores between every pair of tokens: the cumulative mathematical transformation from PaTH Attention is compressed and broken down into smaller computations, making it compatible with fast processing on GPUs.
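The path of reflections described above can be sketched in a few lines. This toy version assumes each in-between token contributes a Householder-style matrix of the form I - beta * w w^T, where the direction w and strength beta would come from that token's content; the parameterization and the name `path_score` are illustrative assumptions, and the naive loop below ignores the hardware-efficient algorithm entirely.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def householder_apply(w, beta, v):
    """Return (I - beta * w w^T) v for a unit-norm direction w."""
    scale = beta * dot(w, v)
    return [vi - scale * wi for vi, wi in zip(v, w)]


def path_score(query, key, path_ws, path_betas):
    """Attention logit after carrying the key along the path to the query.

    path_ws and path_betas list the in-between tokens in order, starting
    just after the key's position and ending at the query's position.
    """
    v = list(key)
    for w, beta in zip(path_ws, path_betas):
        v = householder_apply(normalize(w), beta, v)
    return dot(query, v)


q, k = [1.0, 0.0], [0.0, 1.0]
score_a = path_score(q, k, [[1.0, 1.0]], [2.0])  # one token reflects k onto -q
score_b = path_score(q, k, [[1.0, 0.0]], [2.0])  # a different token leaves k alone
```

Here the query, the key, and the distance between them are identical in both calls, yet the scores differ because the token in between differs. Setting every beta to zero recovers the plain dot product.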

The MIT-IBM researchers then explored PaTH Attention’s performance on synthetic and real-world tasks, including reasoning, long-context benchmarks, and full LLM training to see whether it improved a model’s ability to track information over time. The team tested its ability to follow the most recent “write” command despite many distracting steps, and ran multi-step recall tests; both kinds of task are difficult for standard positional encoding methods like RoPE. The researchers also trained mid-size LLMs and compared them against other methods. PaTH Attention improved perplexity and outcompeted other methods on reasoning benchmarks it wasn’t trained on. They also evaluated retrieval, reasoning, and stability with inputs of tens of thousands of tokens. PaTH Attention consistently proved capable of content-awareness.

“We found that both on diagnostic tasks that are designed to test the limitations of transformers and on real-world language modeling tasks, our new approach was able to outperform existing attention mechanisms, while maintaining their efficiency,” says Kim. Further, “I’d be excited to see whether these types of data-dependent position encodings, like PaTH, improve the performance of transformers on structured domains like biology, in [analyzing] proteins or DNA.”

Thinking bigger and more efficiently 

The researchers then investigated how the PaTH Attention mechanism would perform if it more closely mimicked human cognition, where we ignore old or less-relevant information when making decisions. To do this, they combined PaTH Attention with another position encoding scheme known as the Forgetting Transformer (FoX), which allows models to selectively “forget.” The resulting PaTH-FoX system adds a way to down-weight information in a data-dependent way, achieving strong results across reasoning, long-context understanding, and language modeling benchmarks. In this way, PaTH Attention extends the expressive power of transformer architectures.
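Data-dependent down-weighting of older information can be sketched as a bias added to attention logits before the softmax. The version below assumes one forget gate in (0, 1] per position, with the bias for an earlier token equal to the sum of the log gates between it and the current token. That formulation is an assumption for illustration, not a description taken from the article, and the name `forgetting_weights` is hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def forgetting_weights(logits, forget_gates):
    """Attention weights of the last position over positions 0..i.

    logits[j] is the content score between the current query and key j.
    forget_gates[s] is a gate in (0, 1] produced at position s.
    Key j receives the bias sum(log(forget_gates[s]) for s in j+1..i).
    """
    biased = []
    bias = 0.0
    # Walk backward from the current token, accumulating log gates.
    for j in range(len(logits) - 1, -1, -1):
        biased.append(logits[j] + bias)
        bias += math.log(forget_gates[j])
    biased.reverse()
    return softmax(biased)


no_forgetting = forgetting_weights([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
with_forgetting = forgetting_weights([0.0, 0.0, 0.0], [0.5, 0.5, 0.5])
```

With every gate at 1.0 the weights match an ordinary softmax. With gates of 0.5 and equal content scores, each step back in the sequence halves a token's weight. In a trained model the gates would be computed from the tokens themselves, so the model decides what to forget.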

Kim says research like this is part of a broader effort to develop the “next big thing” in AI. He explains that a major driver of both the deep learning and generative AI revolutions has been the creation of “general-purpose building blocks that can be applied to wide domains,” such as “convolution layers, RNN [recurrent neural network] layers,” and, most recently, transformers. Looking ahead, Kim notes that considerations like accuracy, expressivity, flexibility, and hardware scalability have been and will be essential. As he puts it, “the core enterprise of modern architecture research is trying to come up with these new primitives that maintain or improve the expressivity, while also being scalable.”

This work was supported, in part, by the MIT-IBM Watson AI Lab and the AI2050 program at Schmidt Sciences.
