MIT Latest News
Simpler models can outperform deep learning at climate prediction
Environmental scientists are increasingly using enormous artificial intelligence models to make predictions about changes in weather and climate, but a new study by MIT researchers shows that bigger models are not always better.
The team demonstrates that, in certain climate scenarios, much simpler, physics-based models can generate more accurate predictions than state-of-the-art deep-learning models.
Their analysis also reveals that a benchmarking technique commonly used to evaluate machine-learning techniques for climate predictions can be distorted by natural variations in the data, like fluctuations in weather patterns. This could lead someone to believe a deep-learning model makes more accurate predictions when that is not the case.
The researchers developed a more robust way of evaluating these techniques, which shows that, while simple models are more accurate when estimating regional surface temperatures, deep-learning approaches can be the best choice for estimating local rainfall.
They used these results to enhance a simulation tool known as a climate emulator, which can rapidly simulate the effects of human activities on the future climate.
The researchers see their work as a “cautionary tale” about the risk of deploying large AI models for climate science. While deep-learning models have shown incredible success in domains such as natural language, climate science contains a proven set of physical laws and approximations, and the challenge becomes how to incorporate those into AI models.
“We are trying to develop models that are going to be useful and relevant for the kinds of things that decision-makers need going forward when making climate policy choices. While it might be attractive to use the latest, big-picture machine-learning model on a climate problem, what this study shows is that stepping back and really thinking about the problem fundamentals is important and useful,” says study senior author Noelle Selin, a professor in the MIT Institute for Data, Systems, and Society (IDSS) and the Department of Earth, Atmospheric and Planetary Sciences (EAPS), and director of the Center for Sustainability Science and Strategy.
Selin’s co-authors are lead author Björn Lütjens, a former EAPS postdoc who is now a research scientist at IBM Research; senior author Raffaele Ferrari, the Cecil and Ida Green Professor of Oceanography in EAPS and co-director of the Lorenz Center; and Duncan Watson-Parris, assistant professor at the University of California at San Diego. Selin and Ferrari are also co-principal investigators of the Bringing Computation to the Climate Challenge project, out of which this research emerged. The paper appears today in the Journal of Advances in Modeling Earth Systems.
Comparing emulators
Because the Earth’s climate is so complex, running a state-of-the-art climate model to predict how pollution levels will impact environmental factors like temperature can take weeks on the world’s most powerful supercomputers.
Scientists often create climate emulators, simpler approximations of a state-of-the-art climate model, which are faster and more accessible. A policymaker could use a climate emulator to see how alternative assumptions about greenhouse gas emissions would affect future temperatures, helping them develop regulations.
But an emulator isn’t very useful if it makes inaccurate predictions about the local impacts of climate change. While deep learning has become increasingly popular for emulation, few studies have explored whether these models perform better than tried-and-true approaches.
The MIT researchers performed such a study. They compared a traditional technique called linear pattern scaling (LPS) with a deep-learning model using a common benchmark dataset for evaluating climate emulators.
Their results showed that LPS outperformed deep-learning models at predicting nearly all the parameters they tested, including temperature and precipitation.
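Linear pattern scaling is a long-standing idea: the local response at each grid cell is modeled as a linear function of global mean temperature. The sketch below illustrates the general approach with hypothetical array names; it is a minimal illustration, not the study's implementation.

```python
# Minimal sketch of linear pattern scaling (LPS); array names are hypothetical.
# "global_mean" would be a 1D array of global-mean temperature change per training
# sample, and "local_fields" an (n_samples, n_lat, n_lon) array of the corresponding
# local changes from a climate model.
import numpy as np

def fit_lps(global_mean, local_fields):
    """Fit one intercept and slope per grid cell by least squares."""
    n = local_fields.shape[0]
    X = np.column_stack([np.ones(n), global_mean])   # (n, 2) design matrix
    Y = local_fields.reshape(n, -1)                  # flatten the spatial grid
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)     # (2, n_cells) coefficients
    return coef

def predict_lps(coef, global_mean_new, grid_shape):
    """Scale the fitted local pattern by new global-mean temperatures."""
    X_new = np.column_stack([np.ones(len(global_mean_new)), global_mean_new])
    Y_hat = X_new @ coef
    return Y_hat.reshape(len(global_mean_new), *grid_shape)
```

Because each grid cell gets only an intercept and a slope, the fit is fast and transparent, which is part of what makes LPS such a strong baseline to compare against.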
“Large AI methods are very appealing to scientists, but they rarely solve a completely new problem, so implementing an existing solution first is necessary to find out whether the complex machine-learning approach actually improves upon it,” says Lütjens.
Some initial results seemed to fly in the face of the researchers’ domain knowledge. The powerful deep-learning model should have been more accurate when making predictions about precipitation, since those data don’t follow a linear pattern.
They found that the high amount of natural variability in climate model runs can cause the deep-learning model to perform poorly on unpredictable long-term oscillations, like El Niño/La Niña. This skews the benchmarking scores in favor of LPS, which averages out those oscillations.
Constructing a new evaluation
From there, the researchers constructed a new evaluation with more data that address natural climate variability. With this new evaluation, the deep-learning model performed slightly better than LPS for local precipitation, but LPS was still more accurate for temperature predictions.
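One common way to keep internal variability from distorting such scores, and a plausible reading of the fix described here rather than the authors' exact procedure, is to evaluate an emulator against an average over multiple initial-condition ensemble members instead of a single noisy realization. A minimal sketch, with hypothetical array shapes:

```python
# Sketch only: scoring an emulator against an ensemble mean so that internal
# variability (e.g., El Niño/La Niña phases) doesn't dominate the benchmark.
import numpy as np

def rmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

# truth_members: (n_members, n_years, n_lat, n_lon) climate-model ensemble
# emulator_pred: (n_years, n_lat, n_lon) emulator output for the same scenario
def naive_score(emulator_pred, truth_members):
    # Comparing against one member mixes the forced response with internal variability.
    return rmse(emulator_pred, truth_members[0])

def variability_aware_score(emulator_pred, truth_members):
    # Averaging members first isolates the forced signal the emulator is meant to capture.
    return rmse(emulator_pred, truth_members.mean(axis=0))
```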
“It is important to use the modeling tool that is right for the problem, but in order to do that you also have to set up the problem the right way in the first place,” Selin says.
Based on these results, the researchers incorporated LPS into a climate emulation platform to predict local temperature changes in different emission scenarios.
“We are not advocating that LPS should always be the goal. It still has limitations. For instance, LPS doesn’t predict variability or extreme weather events,” Ferrari adds.
Rather, they hope their results emphasize the need to develop better benchmarking techniques, which could provide a fuller picture of which climate emulation technique is best suited for a particular situation.
“With an improved climate emulation benchmark, we could use more complex machine-learning methods to explore problems that are currently very hard to address, like the impacts of aerosols or estimations of extreme precipitation,” Lütjens says.
Ultimately, more accurate benchmarking techniques will help ensure policymakers are making decisions based on the best available information.
The researchers hope others build on their analysis, perhaps by studying additional improvements to climate emulation methods and benchmarks. Such research could explore impact-oriented metrics like drought indicators and wildfire risks, or new variables like regional wind speeds.
This research is funded, in part, by Schmidt Sciences, LLC, and is part of the MIT Climate Grand Challenges team for “Bringing Computation to the Climate Challenge.”
On the joys of being head of house at McCormick Hall
While sharing a single cup of coffee, Raul Radovitzky, the Jerome C. Hunsaker Professor in the Department of Aeronautics and Astronautics, and his wife Flavia Cardarelli, senior administrative assistant in the Institute for Data, Systems, and Society, recently discussed the love they have for their “nighttime jobs” living in McCormick Hall as faculty heads of house, and explained why it is so gratifying for them to be a part of this community.
The couple, married for 32 years, first met playing in a sandbox at the age of 3 in Argentina (but didn't start dating until they were in their 20s). Radovitzky has been a part of the MIT ecosystem since 2001, while Cardarelli began working at MIT in 2006. They became heads of house at McCormick Hall, the only all-female residence hall on campus, in 2015, and recently applied to extend their stay.
“Our head-of-house role is always full of surprises. We never know what we’ll encounter, but we love it. Students think we do this just for them, but in truth, it’s very rewarding for us as well. It keeps us on our toes and brings a lot of joy,” says Cardarelli. “We like to think of ourselves as the cool aunt and uncle for the students,” Radovitzky adds.
Heads of house at MIT influence many areas of students’ development by acting as advisors and mentors to their residents. Additionally, they work closely with the residence hall’s student government, as well as staff from the Division of Student Life, to foster their community’s culture.
Vice Chancellor for Student Life Suzy Nelson explains, “Our faculty heads of house have the long view at MIT and care deeply about students’ academic and personal growth. We are fortunate to have such dedicated faculty who serve in this way. The heads of house enhance the student experience in so many ways — whether it is helping a student with a personal problem, hosting Thanksgiving dinner for students who were not able to go home, or encouraging students to get involved in new activities, they are always there for students.”
“Our heads of house help our students fully participate in residential life. They model civil discourse at community dinners, mentor and tutor residents, and encourage residents to try new things. With great expertise and aplomb, they formally and informally help our students become their whole selves,” says Chancellor Melissa Nobles.
“I love teaching, I love conducting research with my group, and I enjoy serving as a head of house. The community aspect is deeply meaningful to me. MIT has become such a central part of our lives. Our kids are both MIT graduates, and we are incredibly proud of them. We do have a life outside of MIT — weekends with friends and family, personal activities — but MIT is a big part of who we are. It’s more than a job; it’s a community. We live on campus, and while it can be intense and demanding, we really love it,” says Radovitzky.
Jessica Quaye ’20, a former resident of McCormick Hall, says, “What sets McCormick apart is the way Raul and Flavia transform the four dorm walls into a home for everyone. You might come to McCormick alone, but you never leave alone. If you ran into them somewhere on campus, you could be sure that they would call out to you and wave excitedly. You could invite Raul and Flavia to your concerts and they would show up to support your extracurricular endeavors. They built an incredible family that carries the fabric of MIT with a blend of academic brilliance, a warm open-door policy, and unwavering support for our extracurricular pursuits.”
Soundbytes
Q: What first drew you to the heads of house role?
Radovitzky: I had been aware of the role since I arrived at MIT, and over time, I started to wonder if it might be something we’d consider. When our kids were young, it didn’t seem feasible — we lived in the suburbs, and life there was good. But I always had an innate interest in building stronger connections with the student community.
Later, several colleagues encouraged us to apply. I discussed it with the family. Everyone was excited about it. Our teenagers were thrilled by the idea of living on a college campus. We applied together, submitting a letter as a family explaining why we were so passionate about it. We interviewed at McCormick, Baker, and MacGregor. When we were offered McCormick, I’ll admit — I was nervous. I wasn’t sure I’d be the right fit for an all-female residence.
Cardarelli: We would have been nervous no matter where we ended up, but McCormick felt like home. It suited us in ways we didn’t anticipate. Raul, for instance, discovered he had a real rapport with the students, telling goofy jokes, making karaoke playlists, and learning about Taylor Swift and Nicki Minaj.
Radovitzky: It’s true! I never knew I’d become an expert at picking karaoke playlists. But we found our rhythm here, and it’s been deeply rewarding.
Q: What makes the McCormick community special?
Radovitzky: McCormick has a unique spirit. I can step out of our apartment and be greeted by 10 smiling faces. That energy is contagious. It’s not just about events or programming — it’s about building trust. We’ve built traditions around that, like our “make your own pizza” nights in our apartment, a wonderful McCormick event we inherited from our predecessors. We host four sessions each spring in which students roll out dough, choose toppings, and we chat as we cook and eat together. Everyone remembers the pizza nights — they’re mentioned in every testimonial.
Cardarelli: We’ve been lucky to have amazing graduate resident assistants and area directors every year. They’re essential partners in building community. They play a key role in creating community and supporting the students on their floors. They help with everything — from tutoring to events to walking students to urgent care if needed.
Radovitzky: In the fall, we take our residents to Crane Beach and host a welcome brunch. Karaoke in our apartment is a big hit too, and a unique way to make them comfortable coming to our apartment from day one. We do it three times a year — during orientation, and again each semester.
Cardarelli: We also host monthly barbecues open to all dorms and run McFast, our first-year tutoring program. Raul started by tutoring physics and math, four hours a week. Now, upperclass students lead most of the sessions. It’s great for both academic support and social connection.
Radovitzky: We also have an Independent Activities Period pasta night tradition. We cook for around 100 students, using four sauces that Flavia makes from scratch — bolognese, creamy mushroom, marinara, and pesto. Students love it.
Q: What’s unique about working in an all-female residence hall?
Cardarelli: I’ve helped students hem dresses, bake, and even apply makeup. It’s like having hundreds of daughters.
Radovitzky: The students here are incredibly mature and engaged. They show real interest in us as people. Many of the activities and connections we’ve built wouldn’t be possible in a different setting. Every year during “de-stress night,” I get my nails painted every color and have a face mask on. During “Are You Smarter Than an MIT Professor,” they dunk me in a water tank.
Engineering fantasy into reality
Growing up in the suburban town of Spring, Texas, just outside of Houston, Erik Ballesteros couldn’t help but be drawn in by the possibilities for humans in space.
It was the early 2000s, and NASA’s space shuttle program was the main transport for astronauts to the International Space Station (ISS). Ballesteros’ hometown was less than an hour from Johnson Space Center (JSC), where NASA’s mission control center and astronaut training facility are based. And as often as they could, he and his family would drive to JSC to check out the center’s public exhibits and presentations on human space exploration.
For Ballesteros, the highlight of these visits was always the tram tour, which brings visitors to JSC’s Astronaut Training Facility. There, the public can watch astronauts test out spaceflight prototypes and practice various operations in preparation for living and working on the International Space Station.
“It was a really inspiring place to be, and sometimes we would meet astronauts when they were doing signings,” he recalls. “I’d always see the gates where the astronauts would go back into the training facility, and I would think: One day I’ll be on the other side of that gate.”
Today, Ballesteros is a PhD student in mechanical engineering at MIT, and has already made good on his childhood goal. Before coming to MIT, he interned on multiple projects at JSC, working in the training facility to help test new spacesuit materials, portable life support systems, and a propulsion system for a prototype Mars rocket. He also helped train astronauts to operate the ISS’ emergency response systems.
Those early experiences steered him to MIT, where he hopes to make a more direct impact on human spaceflight. He and his advisor, Harry Asada, are building a system that will quite literally provide helping hands to future astronauts. The system, dubbed SuperLimbs, consists of a pair of wearable robotic arms that extend out from a backpack, similar to the fictional Inspector Gadget, or Doctor Octopus (“Doc Ock,” to comic book fans). Ballesteros and Asada are designing the robotic arms to be strong enough to lift an astronaut back up if they fall. The arms could also crab-walk around a spacecraft’s exterior as an astronaut inspects or makes repairs.
Ballesteros is collaborating with engineers at the NASA Jet Propulsion Laboratory to refine the design, which he plans to introduce to astronauts at JSC in the next year or two, for practical testing and user feedback. He says his time at MIT has helped him make connections across academia and in industry that have fueled his life and work.
“Success isn’t built by the actions of one, but rather it’s built on the shoulders of many,” Ballesteros says. “Connections — ones that you not just have, but maintain — are so vital to being able to open new doors and keep great ones open.”
Getting a jumpstart
Ballesteros didn’t always seek out those connections. As a kid, he counted down the minutes until the end of school, when he could go home to play video games and watch movies, “Star Wars” being a favorite. He also loved to create and had a talent for cosplay, tailoring intricate, life-like costumes inspired by cartoon and movie characters.
In high school, he took an introductory class in engineering that challenged students to build robots from kits that they would then pit against each other, BattleBots-style. Ballesteros built a robotic ball that moved by shifting an internal weight, similar to Star Wars’ fictional, sphere-shaped BB-8.
“It was a good introduction, and I remember thinking, this engineering thing could be fun,” he says.
After graduating high school, Ballesteros attended the University of Texas at Austin, where he pursued a bachelor’s degree in aerospace engineering. What would typically be a four-year degree stretched into an eight-year period during which Ballesteros combined college with multiple work experiences, taking on internships at NASA and elsewhere.
In 2013, he interned at Lockheed Martin, where he contributed to various aspects of jet engine development. That experience unlocked a number of other aerospace opportunities. After a stint at NASA’s Kennedy Space Center, he went on to Johnson Space Center, where, as part of a co-op program called Pathways, he returned every spring or summer over the next five years, to intern in various departments across the center.
While the time at JSC gave him a huge amount of practical engineering experience, Ballesteros still wasn’t sure if it was the right fit. Along with his childhood fascination with astronauts and space, he had always loved cinema and the special effects that bring films to life. In 2018, he took a year off from the NASA Pathways program to intern at Disney, where he spent the spring semester working as a safety engineer, performing safety checks on Disney rides and attractions.
During this time, he got to know a few people in Imagineering — the research and development group that creates, designs, and builds rides, theme parks, and attractions. That summer, the group took him on as an intern, and he worked on the animatronics for upcoming rides, which involved translating certain scenes in a Disney movie into practical, safe, and functional scenes in an attraction.
“In animation, a lot of things they do are fantastical, and it was our job to find a way to make them real,” says Ballesteros, who loved every moment of the experience and hoped to be hired as an Imagineer after the internship came to an end. But he had one year left in his undergraduate degree and had to move on.
After graduating from UT Austin in December 2019, Ballesteros accepted a position at NASA’s Jet Propulsion Laboratory in Pasadena, California. He started at JPL in February of 2020, working on some last adjustments to the Mars Perseverance rover. After a few months, during which JPL shifted to remote work amid the Covid pandemic, Ballesteros was assigned to a project to develop a self-diagnosing spacecraft monitoring system. While working with that team, he met an engineer who was a former lecturer at MIT. As a practical suggestion, she nudged Ballesteros to consider pursuing a master’s degree, to add more value to his CV.
“She opened up the idea of going to grad school, which I hadn’t ever considered,” he says.
Full circle
In 2021, Ballesteros arrived at MIT to begin a master’s program in mechanical engineering. In interviewing with potential advisors, he immediately hit it off with Harry Asada, the Ford Professor of Engineering and director of the d'Arbeloff Laboratory for Information Systems and Technology. Years ago, Asada had pitched JPL an idea for wearable robotic arms to aid astronauts, which they quickly turned down. But Asada held onto the idea, and proposed that Ballesteros take it on as a feasibility study for his master’s thesis.
The project would require bringing a seemingly sci-fi idea into practical, functional form, for use by astronauts in future space missions. For Ballesteros, it was the perfect challenge. SuperLimbs became the focus of his master’s degree, which he earned in 2023. His initial plan was to return to industry, degree in hand. But he chose to stay at MIT to pursue a PhD, so that he could continue his work with SuperLimbs in an environment where he felt free to explore and try new things.
“MIT is like nerd Hogwarts,” he says. “One of the dreams I had as a kid was about the first day of school, and being able to build and be creative, and it was the happiest day of my life. And at MIT, I felt like that dream became reality.”
Ballesteros and Asada are now further developing SuperLimbs. The team recently re-pitched the idea to engineers at JPL, who reconsidered, and have since struck up a partnership to help test and refine the robot. In the next year or two, Ballesteros hopes to bring a fully functional, wearable design to Johnson Space Center, where astronauts can test it out in space-simulated settings.
In addition to his formal graduate work, Ballesteros has found a way to have a bit of Imagineer-like fun. He is a member of the MIT Robotics Team, which designs, builds, and runs robots in various competitions and challenges. Within this club, Ballesteros has formed a sub-club of sorts, called the Droid Builders, that aims to build animatronic droids from popular movies and franchises.
“I thought I could use what I learned from Imagineering and teach undergrads how to build robots from the ground up,” he says. “Now we’re building a full-scale WALL-E that could be fully autonomous. It’s cool to see everything come full circle.”
New technologies tackle brain health assessment for the military
Cognitive readiness denotes a person's ability to respond and adapt to the changes around them. This includes functions like keeping balance after tripping, or making the right decision in a challenging situation based on knowledge and past experiences. For military service members, cognitive readiness is crucial for their health and safety, as well as mission success. Injury to the brain is a major contributor to cognitive impairment, and between 2000 and 2024, more than 500,000 military service members were diagnosed with traumatic brain injury (TBI) — caused by anything from a fall during training to blast exposure on the battlefield. While impairment from factors like sleep deprivation can be treated through rest and recovery, others caused by injury may require more intense and prolonged medical attention.
"Current cognitive readiness tests administered to service members lack the sensitivity to detect subtle shifts in cognitive performance that may occur in individuals exposed to operational hazards," says Christopher Smalt, a researcher in the laboratory's Human Health and Performance Systems Group. "Unfortunately, the cumulative effects of these exposures are often not well-documented during military service or after transition to Veterans Affairs, making it challenging to provide effective support."
Smalt is part of a team at the laboratory developing a suite of portable diagnostic tests that provide near-real-time screening for brain injury and cognitive health. One of these tools, called READY, is a smartphone or tablet app that helps identify a potential change in cognitive performance in less than 90 seconds. Another tool, called MINDSCAPE — which is being developed in collaboration with Richard Fletcher, a visiting scientist in the Rapid Prototyping Group who leads the Mobile Technology Lab at the MIT Auto-ID Laboratory, and his students — uses virtual reality (VR) technology for a more in-depth analysis to pinpoint specific conditions such as TBI, post-traumatic stress disorder, or sleep deprivation. Using these tests, medical personnel on the battlefield can make quick and effective decisions for treatment triage.
Both READY and MINDSCAPE are a response to a series of congressional legislative mandates, military program requirements, and mission-driven health needs to improve brain health screening capabilities for service members.
Cognitive readiness biomarkers
The READY and MINDSCAPE platforms incorporate more than a decade of laboratory research on finding the right indicators of cognitive readiness to build into rapid testing applications. Thomas Quatieri oversaw this work and identified balance, eye movement, and speech as three reliable biomarkers. He is leading the effort at Lincoln Laboratory to develop READY.
"READY stands for Rapid Evaluation of Attention for DutY, and is built on the premise that attention is the key to being 'ready' for a mission," he says. "In one view, we can think of attention as the mental state that allows you to focus on a task."
For someone to be attentive, their brain must continuously anticipate and process incoming sensory information and then instruct the body to respond appropriately. For example, if a friend yells "catch" and then throws a ball in your direction, in order to catch that ball, your brain must process the incoming auditory and visual data, predict in advance what may happen in the next few moments, and then direct your body to respond with an action that synchronizes those sensory data. The result? You realize from hearing the word "catch" and seeing the moving ball that your friend is throwing the ball to you, and you reach out a hand to catch it just in time.
"An unhealthy or fatigued brain — caused by TBI or sleep deprivation, for example — may have challenges within a neurosensory feed-forward [prediction] or feedback [error] system, thus hampering the person's ability to attend," Quatieri says.
READY's three tests measure a person’s ability to track a moving dot with their eyes, maintain balance, and hold a vowel at a fixed pitch. The app then uses the data to calculate a variability or "wobble" indicator, which represents changes from the test taker's baseline or from expected results based on others with similar demographics, or the general population. The results are displayed to the user as an indication of the patient's level of attention.
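The article does not spell out how the wobble indicator is computed; one simple, hypothetical way to express such a score is as a standardized deviation of a session's measured variability from a baseline distribution:

```python
# Illustrative sketch only: READY's actual formula is not given in the article.
# All names here are hypothetical.
import numpy as np

def wobble_score(samples, baseline_mean, baseline_std):
    """Deviation of this session's variability from the expected variability.

    samples: e.g., eye-tracking error, sway amplitude, or pitch deviation
             recorded during one short READY test (hypothetical input).
    baseline_mean, baseline_std: variability statistics from the person's own
             prior tests or from a demographically matched reference group.
    """
    session_variability = np.std(samples)
    return (session_variability - baseline_mean) / baseline_std

# A large positive score would flag the session for follow-up testing.
```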
If the READY screen shows an impairment, the administrator can then direct the subject to follow up with MINDSCAPE, which stands for Mobile Interface for Neurological Diagnostic Situational Cognitive Assessment and Psychological Evaluation. MINDSCAPE uses VR technology to administer additional, in-depth tests to measure cognitive functions such as reaction time and working memory. These standard neurocognitive tests are recorded with multimodal physiological sensors, such as electroencephalography (EEG), photoplethysmography, and pupillometry, to better pinpoint diagnosis.
Holistic and adaptable
A key advantage of READY and MINDSCAPE is their ability to leverage existing technologies, allowing for rapid deployment in the field. By utilizing sensors and capabilities already integrated into smartphones, tablets, and VR devices, these assessment tools can be easily adapted for use in operational settings at a significantly reduced cost.
"We can immediately apply our advanced algorithms to the data collected from these devices, without the need for costly and time-consuming hardware development," Smalt says. "By harnessing the capabilities of commercially available technologies, we can quickly provide valuable insights and improve upon traditional assessment methods."
Bringing new capabilities and AI for brain-health sensing into operational environments is a theme across several projects at the laboratory. Another example is EYEBOOM (Electrooculography and Balance Blast Overpressure Monitoring System), a wearable technology developed for the U.S. Special Forces to monitor blast exposure. EYEBOOM continuously monitors a wearer's eye and body movements as they experience blast energy, and warns of potential harm. For this program, the laboratory developed an algorithm that could identify a potential change in physiology resulting from blast exposure during operations, rather than waiting for a check-in.
All three technologies are being developed to be versatile, so they can be adapted for other relevant uses. For example, a workflow could pair EYEBOOM's monitoring capabilities with the READY and MINDSCAPE tests: EYEBOOM would continuously monitor for exposure risk and then prompt the wearer to seek additional assessment.
"A lot of times, research focuses on one specific modality, whereas what we do at the laboratory is search for a holistic solution that can be applied for many different purposes," Smalt says.
MINDSCAPE is undergoing testing at the Walter Reed National Military Medical Center this year. READY will be tested with the U.S. Army Research Institute of Environmental Medicine (USARIEM) in 2026 in the context of sleep deprivation. Smalt and Quatieri also see the technologies finding use in civilian settings — on sporting event sidelines, in doctors' offices, or wherever else there is a need to assess brain readiness.
MINDSCAPE is being developed with clinical validation and support from Stefanie Kuchinsky at the Walter Reed National Military Medical Center. Quatieri and his team are developing the READY tests in collaboration with Jun Maruta and Jam Ghajar from the Brain Trauma Foundation (BTF), and Kristin Heaton from USARIEM. The tests are supported by concurrent evidence-based guidelines led by the BTF and the Military TBI Initiative at the Uniformed Services University.
Can large language models figure out the real world?
Back in the 17th century, German astronomer Johannes Kepler figured out the laws of motion that made it possible to accurately predict where our solar system’s planets would appear in the sky as they orbit the sun. But it wasn’t until decades later, when Isaac Newton formulated the universal laws of gravitation, that the underlying principles were understood. Although Newton’s laws were inspired by Kepler’s, they went much further, making it possible to apply the same formulas to everything from the trajectory of a cannonball to the way the moon’s pull controls the tides on Earth — or how to launch a satellite from Earth to the surface of the moon or planets.
Today’s sophisticated artificial intelligence systems have gotten very good at making the kind of specific predictions that resemble Kepler’s orbit predictions. But do they know why these predictions work, with the kind of deep understanding that comes from basic principles like Newton’s laws? As the world grows ever-more dependent on these kinds of AI systems, researchers are struggling to measure just how they do what they do, and how deep their understanding of the real world actually is.
Now, researchers in MIT’s Laboratory for Information and Decision Systems (LIDS) and at Harvard University have devised a new approach to assessing how deeply these predictive systems understand their subject matter, and whether they can apply knowledge from one domain to a slightly different one. And by and large the answer at this point, in the examples they studied, is — not so much.
The findings were presented at the International Conference on Machine Learning, in Vancouver, British Columbia, last month by Harvard postdoc Keyon Vafa, MIT graduate student in electrical engineering and computer science and LIDS affiliate Peter G. Chang, MIT assistant professor and LIDS principal investigator Ashesh Rambachan, and MIT professor, LIDS principal investigator, and senior author Sendhil Mullainathan.
“Humans all the time have been able to make this transition from good predictions to world models,” says Vafa, the study’s lead author. So the question their team was addressing was, “have foundation models — has AI — been able to make that leap from predictions to world models? And we’re not asking are they capable, or can they, or will they. It’s just, have they done it so far?” he says.
“We know how to test whether an algorithm predicts well. But what we need is a way to test for whether it has understood well,” says Mullainathan, the Peter de Florez Professor with dual appointments in the MIT departments of Economics and Electrical Engineering and Computer Science and the senior author on the study. “Even defining what understanding means was a challenge.”
In the Kepler versus Newton analogy, Vafa says, “they both had models that worked really well on one task, and that worked essentially the same way on that task. What Newton offered was ideas that were able to generalize to new tasks.” That capability, when applied to the predictions made by various AI systems, would entail having it develop a world model so it can “transcend the task that you’re working on and be able to generalize to new kinds of problems and paradigms.”
Another analogy that helps to illustrate the point is the difference between centuries of accumulated knowledge of how to selectively breed crops and animals, versus Gregor Mendel’s insight into the underlying laws of genetic inheritance.
“There is a lot of excitement in the field about using foundation models to not just perform tasks, but to learn something about the world,” for example in the natural sciences, he says. “It would need to adapt, have a world model to adapt to any possible task.”
Are AI systems anywhere near the ability to reach such generalizations? To test the question, the team looked at different examples of predictive AI systems, at different levels of complexity. On the very simplest of examples, the systems succeeded in creating a realistic model of the simulated system, but as the examples got more complex that ability faded fast.
The team developed a new metric, a way of measuring quantitatively how well a system approximates real-world conditions. They call the measurement inductive bias — that is, a tendency or bias toward responses that reflect reality, based on inferences developed from looking at vast amounts of data on specific cases.
The simplest level of examples they looked at was known as a lattice model. In a one-dimensional lattice, something can move only along a line. Vafa compares it to a frog jumping between lily pads in a row. As the frog jumps or sits, it calls out what it’s doing — right, left, or stay. If it reaches the last lily pad in the row, it can only stay or go back. If someone, or an AI system, can just hear the calls, without knowing anything about the number of lily pads, can it figure out the configuration? The answer is yes: Predictive models do well at reconstructing the “world” in such a simple case. But even with lattices, as you increase the number of dimensions, the systems can no longer make that leap.
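To make the lily-pad world concrete, here is a toy simulation of the one-dimensional case, assuming the frog's calls are the only observations. A listener who simply integrates the calls can recover the frog's relative position, which is the kind of "world recovery" the benchmark asks of predictive models. This is an illustrative sketch for intuition, not the paper's evaluation code.

```python
# Toy version of the one-dimensional lattice "frog" world described above.
import random

def simulate_frog(n_pads, n_steps, seed=0):
    """Emit a sequence of calls ('left', 'right', 'stay') from a random walk on a
    row of n_pads lily pads; the listener never sees n_pads or the frog's position."""
    rng = random.Random(seed)
    pos, calls = 0, []
    for _ in range(n_steps):
        options = ['stay']
        if pos > 0:
            options.append('left')          # can't go left off the first pad
        if pos < n_pads - 1:
            options.append('right')         # can't go right off the last pad
        move = rng.choice(options)
        pos += {'left': -1, 'right': 1, 'stay': 0}[move]
        calls.append(move)
    return calls

def recover_positions(calls):
    """Integrate the calls to reconstruct the frog's relative trajectory; the
    range of values reached bounds how many pads the frog has visited."""
    pos, trajectory = 0, [0]
    for move in calls:
        pos += {'left': -1, 'right': 1, 'stay': 0}[move]
        trajectory.append(pos)
    return trajectory
```

In this toy world the reconstruction is trivial, which matches the finding that predictive models handle the one-dimensional case well; the paper's point is that this recovery breaks down as the number of dimensions or states grows.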
“For example, in a two-state or three-state lattice, we showed that the model does have a pretty good inductive bias toward the actual state,” says Chang. “But as we increase the number of states, then it starts to have a divergence from real-world models.”
A more complex problem is a system that can play the board game Othello, which involves players alternately placing black or white disks on a grid. The AI models can accurately predict what moves are allowable at a given point, but it turns out they do badly at inferring what the overall arrangement of pieces on the board is, including ones that are currently blocked from play.
The team then looked at five different categories of predictive models actually in use, and again, the more complex the systems involved, the more poorly the predictive models performed at matching the true underlying world model.
With this new metric of inductive bias, “our hope is to provide a kind of test bed where you can evaluate different models, different training approaches, on problems where we know what the true world model is,” Vafa says. If it performs well on these cases where we already know the underlying reality, then we can have greater faith that its predictions may be useful even in cases “where we don’t really know what the truth is,” he says.
People are already trying to use these kinds of predictive AI systems to aid in scientific discovery, including such things as properties of chemical compounds that have never actually been created, or of potential pharmaceutical compounds, or for predicting the folding behavior and properties of unknown protein molecules. “For the more realistic problems,” Vafa says, “even for something like basic mechanics, we found that there seems to be a long way to go.”
Chang says, “There’s been a lot of hype around foundation models, where people are trying to build domain-specific foundation models — biology-based foundation models, physics-based foundation models, robotics foundation models, foundation models for other types of domains where people have been collecting a ton of data” and training these models to make predictions, “and then hoping that it acquires some knowledge of the domain itself, to be used for other downstream tasks.”
This work shows there’s a long way to go, but it also helps to show a path forward. “Our paper suggests that we can apply our metrics to evaluate how much the representation is learning, so that we can come up with better ways of training foundation models, or at least evaluate the models that we’re training currently,” Chang says. “As an engineering field, once we have a metric for something, people are really, really good at optimizing that metric.”
At convocation, President Kornbluth greets the Class of 2029
In welcoming the undergraduate Class of 2029 to campus in Cambridge, Massachusetts, MIT President Sally Kornbluth began the Institute’s convocation on Sunday with a greeting that underscored MIT’s confidence in its new students.
“We believe in all of you, in the learning, making, discovering, and inventing that you all have come here to do,” Kornbluth said. “And in your boundless potential as future leaders who will help solve real problems that people face in their daily lives.”
She added: “If you’re out there feeling really lucky to be joining this incredible community, I want you to know that we feel even more lucky. We’re delighted and grateful that you chose to bring your talent, your energy, your curiosity, creativity, and drive here to MIT. And we’re thrilled to be starting this new year with all of you.”
The event, officially called the President’s Convocation for First-years and Families, was held at the Johnson Ice Rink on campus.
While recognizing that academic life can be “intense” at MIT, Kornbluth highlighted the many opportunities available to students outside the classroom, too. A biologist and cancer researcher herself, Kornbluth observed that students can participate in the Undergraduate Research Opportunities Program (UROP), which Kornbluth called “an unmissable opportunity to work side by side with MIT faculty at the front lines of research.” She also noted that MIT offers abundant opportunities for entrepreneurship, as well as 450 official student organizations.
“It’s okay to be a beginner,” Kornbluth said. “Join a group you wouldn’t have had time for in high school. Explore a new skill. Volunteer in the neighborhoods around campus.”
And if the transition to college feels daunting at any point, she added, MIT provides considerable resources to students for well-being and academic help.
“Sometimes the only way to succeed in facing a big challenge or solving a tough problem is to admit there’s no way you can do it all yourself,” Kornbluth observed. “You’re surrounded by a community of caring people. So please don’t be shy about asking for guidance and help.”
The large audience heard additional remarks from two faculty members who themselves have MIT degrees, reflecting on student life at the Institute.
As a student, “The most important things I had were a willingness to take risks and put hard work into the things I cared about,” said Ankur Moitra SM ’09, PhD ’11, the Norbert Wiener Professor of Mathematics.
He emphasized to students the importance of staying grounded and being true to themselves, especially in the face of, say, social media pressures.
“These are the things that make it harder to find your own way and what you really care about,” Moitra said. “Because the rest of the world’s opinion is right there staring you in the face, and it’s impossible to avoid it. And how will you discover what’s important to you, what’s worth pouring yourself into?”
Moitra also advised students to be wary of the tech tools “that want to do the thinking for you, but take away your agency” in the process. He added: “I worry about this because it’s going to become too easy to rely on these tools, and there are going to be many times you’re going to be tempted, especially late at night, with looming p-set deadlines. As educators, we don’t always have fixes for these kinds of things, and all we can do is open the door and hope you walk through it.”
Beyond that, he suggested, “Periodically remind yourself about what’s been important to you all along, what brought you here. For your next four years, you’re going to be surrounded by creative, clever, passionate people every day, who are going to challenge you. Rise to that challenge.”
Christopher Palmer PhD ’14, an associate professor of finance in the MIT Sloan School of Management, began his remarks by revealing that his MIT undergraduate application was not accepted — although he later received his doctorate at the Institute and is now a tenured professor at MIT.
“I played the long game,” he quipped, drawing laughs.
Indeed, Palmer’s remarks focused on cultivating the resilience, focus, and concentration needed to flourish in the long run.
While being at MIT is “thrilling,” Palmer advised students to “build enough slack into your system to handle both the stress and take advantage of the opportunities” on campus. Much like a bank conducts a “stress test” to see if it can withstand changes, Palmer suggested, we can try the same with our workloads: “If you build a schedule that passes the stress test, that means time for curiosity and meaningful creativity.”
Students should also avoid the “false equivalency that your worth is determined by your achievements,” he added. “You have inherent, immutable, intrinsic, eternal value. Be discerning with your commitments. Future you will be so grateful that you have built in the capacity to sleep, to catch up, to say ‘Yes’ to cool invitations, and to attend to your mental health.”
Additionally, Palmer recommended that students pursue “deep work,” involving “the hard thinking where progress actually happens” — a concept, he noted, that has been elevated by computer scientist Cal Newport SM ’06, PhD ’09. As research shows, Palmer explained, “We can’t actually multitask. What we’re really doing is switching tasks at high frequency and incurring a small cost every single time we switch our focus.”
It might help students, he added, to try some structural changes: Put the phone away, turn off alerts, pause notifications, and cultivate sleep. A healthy blend of academic work, activities, and community fun can emerge.
Concluding her own remarks, Kornbluth also emphasized that attending MIT means being part of a community that is respectful of varying viewpoints and all people, and sustains an ethos of fair-minded understanding.
“I know you have extremely high expectations for yourselves,” Kornbluth said, adding: “We have high expectations for you, too, in all kinds of ways. But I want to emphasize one that’s more important than all the others — and that’s an expectation for how we treat each other. At MIT, the work we do is so important, and so hard, that it’s essential we treat each other with empathy, understanding and compassion. That we take care to express our own ideas with clarity and respect, and make room for sharply different points of view. And above all, that we keep engaging in conversation, even when it’s difficult, frustrating or painful.”
