TPC25 Highlights AI’s Expanding Role: Multimodal Data, Model Evaluation, and Non-LLM Architectures

August 6, 2025 by Staff

(Kitinut Jinapuck/Shutterstock)

How do we speed up AI-powered scientific discovery without sacrificing control? Is it possible to trace a language model’s answers back to the data it was trained on? What does fairness look like when AI is used to interpret maps instead of text? And what happens when these systems fail in ways no human ever would? These were some of the questions explored by four distinguished speakers during a plenary session at TPC25 last week.

From ORNL, Prasanna Balaprakash discussed how AI can help guide scientific research with speed and precision. Jiacheng Liu of AI2 shared a tracing tool that connects model outputs to their training data. Ricardo Baeza-Yates from BSC AI Institute challenged the field to take accountability for harm, misinformation, and waste. Kyoung-Sook Kim from Japan's AIST raised essential questions about fairness in geospatial systems and how data gaps and geographic bias can lead to skewed or unequal outcomes.

ORNL’s Multi-Focused AI Initiative

It’s worth remembering that ORNL’s legacy of AI exploration stretches back to 1979 and its first Applied Artificial Intelligence Project. Along the way, the lab stood up the first DOE supercomputer with GPUs, Titan (2012), in which 18,688 CPUs were paired with an equal number of GPUs. Today, Frontier is ORNL’s top machine, with 37,632 AMD GPUs. As of June 2025, it was number 2 on the Top500 list.

Prasanna Balaprakash, director of AI Programs and section head, Data and AI Systems at ORNL, and the leader of the current ORNL AI initiative, used his talk to briefly touch on ORNL’s AI legacy and to discuss its current priorities.

“Oak Ridge has a rich history in leveraging AI for science. The first AI project started in 1979, and in 1981 there was a review article where people were talking about how AI is going to transform scientific discoveries, spectroscopy, environmental management, and so on,” said Balaprakash. Those systems were conceived as rule-based expert systems, not the data-driven model-based systems we think of today.

When LLMs, particularly ChatGPT (November 2022), burst onto the scene, the shock reverberated through all of IT and prompted a rethinking of priorities.

“We said, okay, there are a lot of things happening in the LLM world. This is two years back. We wanted to set priorities and say what are some foundational advancements that we can make that should be long-term and should not become irrelevant because of advancements in the industry,” said Balaprakash.

Slide courtesy of Prasanna Balaprakash

“We need to build AI models with assurance, and within which we think about validation and verification, uncertainty quantification, and causal reasoning, and finally, efficiency,” he said. “Yes, we have the big machines at our disposal, but at the same time, we need to think about how to efficiently scale large-scale models on these machines. We must also think about small models, models that can work at the edge, so that we can deploy them and accelerate scientific experiments and explore novel AI hardware and bring them in as part of the scientific workflow.”

Broadly speaking, the ORNL AI initiative is focused on advancing secure, assured, and efficient AI for three domains: scientific simulation, experimental facilities, and national security. Today, the initiative has 15 advanced AI projects involving roughly 50 researchers. “It's a very interdisciplinary team,” emphasized Balaprakash. “We want to develop capabilities that will be used across the lab, and across the community.”

Balaprakash presented several examples of this work, highlighting in particular the integration of multimodal data, the incorporation of uncertainty measures, and robustness.

Slide courtesy of Prasanna Balaprakash

“One of the exciting areas is foundation models for science, specifically foundation models that can accelerate scientific simulations. One big bold bet that we made two years ago, was decide to let's look at other modalities where industry will not play a major role. Let's look at large-scale spatiotemporal data, for example, nuclear fusion simulations, which are all large-scale spatiotemporal data and [determine] how can we make advancements in this space?”

One outcome was the development of the Oak Ridge Base Foundational Model for Earth System Predictability. “Last year, we did large-scale training on 49,000 GPUs, achieving or exceeding exascale throughput, the first time for fast spatial-temporal models. We built models up to 10 billion parameters. Again, first time for these types of models. This year, we expanded the work for downscaling these types of simulations; we are talking about reducing the computational time of the simulations by several orders of magnitude,” he said.

Slide courtesy of Prasanna Balaprakash

Balaprakash singled out efforts around graph models, uncertainty quantification, and connecting experimental instruments to Frontier to process data and manage experiments in real-time:

Graph Models. “We are making a big bet on large-scale graph foundation models that is building on terabytes of material science data, open data. If you go to Hugging Face today, you can find millions of derivatives of language models. But if you go and look at graph foundation models for material science, you will find only a handful, and we are among them. We build the largest graph foundation model that's available there, and we are making it easier for other people to use with all the scripts,” he said.
Uncertainty. “Uncertainty quantification is a first-class citizen for the AI models that we are developing for science. We are developing techniques that can allow people to take an uncertainty quantification library and plug that into their model. No scientist will say, ‘Oh, I don't want to use uncertainty quantification.’ It's up to us who are primarily developing this type of technology to enable uncertainty quantification for the AI models that people are developing across the complex.”
Instrument-to-Frontier Link. “What you see here (slide) is a TOPAZ experiment from the ORNL Spallation Neutron Source, and that is producing a large amount of data. What we're doing is moving from TOPAZ (detector) to Frontier supercomputer, using 4000 GPUs to process that data within a few seconds, and provide real-time insights about this experiment. By doing that analysis, we can save experimental time. Specifically, saying if you just run this experiment as usual, you have to run this until the end, which eventually results in over-counting and whatnot. But if you do AI-enabled steering, we can use these models to predict what will happen, and as long as the model is good, then we stop the experiment [sooner] and release this resource for others to use.”

Slide courtesy of Prasanna Balaprakash
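
The idea of plugging uncertainty quantification into an existing model can be illustrated with a small ensemble sketch. This is not ORNL's library or method: the function name predict_with_uncertainty and the three toy models are hypothetical, and ensemble spread is only one of several common ways to estimate uncertainty.

```python
from statistics import mean, stdev

def predict_with_uncertainty(models, x):
    """Return (mean, spread) of the predictions from an ensemble of models.

    `models` is any list of callables. The sample standard deviation of
    their outputs serves as a simple proxy for predictive uncertainty.
    """
    preds = [model(x) for model in models]
    return mean(preds), stdev(preds)

# Hypothetical ensemble: three toy "models" that learned slightly different slopes.
ensemble = [lambda x: 1.9 * x, lambda x: 2.0 * x, lambda x: 2.1 * x]

near_mean, near_std = predict_with_uncertainty(ensemble, 1.0)
far_mean, far_std = predict_with_uncertainty(ensemble, 10.0)
print(f"x=1:  {near_mean:.2f} +/- {near_std:.2f}")
print(f"x=10: {far_mean:.2f} +/- {far_std:.2f}")
```

In this toy case the ensemble members disagree more as x grows. A steering loop could, in principle, use that kind of signal to judge whether a prediction is trustworthy enough to act on.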

Opening the Black Box of Large Language Models

Allen Institute for AI (Ai2) researcher Jiacheng Liu presented OLMoTrace, a tracing system that links LLM responses back to the multi-trillion-token corpus that shaped them. The tool sits inside the Ai2 Playground, which hosts the fully open OLMo family of models. After a user generates an answer, a single click highlights every long stretch of text that appears verbatim in the training set and lists the surrounding source documents. The result arrives in about 4.5 seconds for a typical 450-token reply, fast enough for interactive exploration.

Slide Courtesy of Jiacheng Liu

OLMoTrace uses an optimized suffix-array index developed in an earlier search project. The index spans more than three billion documents and four trillion tokens drawn from pre-training, mid-training, and post-training stages. By scanning for the longest exact matches at each position in the model output, the system avoids expensive substring searches and can operate in parallel across many cores. A ranking step surfaces spans that are both long and unique, then retrieves the full documents for context.
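
The longest-match lookup described above can be sketched on a toy scale. What follows is a minimal illustration and not Ai2's implementation: it assumes whitespace tokens, builds the suffix array naively, skips the ranking and document-retrieval steps, and uses hypothetical function names throughout.

```python
def build_suffix_array(tokens):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a toy corpus; real indexes use far more efficient algorithms.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_match(corpus, sa, query):
    """Return (length, corpus_position) of the longest prefix of `query` found in `corpus`."""
    # Binary search for where `query` would sit among the sorted suffixes.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if corpus[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    # The best match is always one of the two lexicographic neighbours.
    best_len, best_pos = 0, -1
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            n = common_prefix_len(corpus[sa[idx]:], query)
            if n > best_len:
                best_len, best_pos = n, sa[idx]
    return best_len, best_pos

def trace(corpus, output, min_len=3):
    """List (output_start, length, corpus_start) for maximal verbatim spans."""
    sa = build_suffix_array(corpus)
    spans, max_end = [], 0
    for i in range(len(output)):
        n, pos = longest_match(corpus, sa, output[i:])
        # Keep a span only if it reaches past every span found so far.
        if n >= min_len and i + n > max_end:
            spans.append((i, n, pos))
            max_end = i + n
    return spans

corpus = "the space needle was built for the 1962 world's fair in seattle".split()
output = "i think the space needle was built for a fair".split()
for start, length, pos in trace(corpus, output):
    print(" ".join(output[start:start + length]), "-> corpus position", pos)
```

Each output position costs one binary search over the index rather than a scan of the corpus, and the positions are independent of one another, which is what makes the approach easy to parallelize.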

Liu showed how this data-centric view of model behavior serves several practical needs. Fact-checking is the most direct: when the model states that Seattle’s Space Needle was built for the 1962 World’s Fair, OLMoTrace surfaces a matching sentence from a reputable source, allowing quick verification. The tool also exposes the roots of hallucination. In one example, the model produced fabricated code-execution results, and tracing revealed training dialogues where a student supplied outputs without actually running code, suggesting that the model learned this bad habit from its data.

Slide Courtesy of Jiacheng Liu

For scientists, the value is transparency. Research teams can audit answers, inspect provenance, and decide whether a cited document meets the standards of their field. Data-traceable outputs also simplify compliance with emerging AI governance rules that demand justification for results used in healthcare, climate modeling, and other sensitive domains. Liu frames the work as complementary to mechanistic interpretability studies that map behaviors to neural circuits. By pairing weight-level analysis with corpus-level tracing, researchers can gain an understanding of how models reason and where they sometimes go wrong.

The project builds on Ai2’s commitment to open science. Every component of OLMo 2, from weights to training scripts, is public, so others can replicate the tracing pipeline or adapt it to other models. That openness helped OLMoTrace earn the Best Demo award at this year’s ACL conference in Vienna, Austria. Looking ahead, Liu says the team plans to refine relevance ranking, integrate approximate matching for paraphrased sources, and explore how retracing can help guide dataset curation.

From Discrimination to Disinformation: Baeza-Yates on ‘Irresponsible AI’

In a fast-paced and sobering plenary talk at TPC25, Ricardo Baeza-Yates, director of the BSC AI Institute, delivered a critical overview of what he calls “Irresponsible AI,” a framework of failures and oversights that continue to undermine the credibility, safety, and fairness of today’s AI systems. His taxonomy of AI misuse—automated discrimination, pseudoscience, unfair ecommerce, waste of resources, and human incompetence—set the stage for a broader critique that ranged from faulty model evaluation to the global consequences of generative AI.

“The first thing you need to remember is that data and models are proxies of the world. They are approximations,” Baeza-Yates emphasized early on, urging attendees not to confuse accuracy with understanding. AI, he argued, is often mistaken for a mirror of human reasoning when it is merely a layered prediction engine that is frequently wrong, and occasionally dangerous.

He warned against anthropomorphizing these technologies, rejecting terms like “ethical AI” or “trustworthy AI” as fundamentally flawed. “A machine cannot be ethical,” he said, emphasizing that ethics, like trust, are inherently human qualities. He argued that framing AI systems this way shifts responsibility away from their designers and toward users—an inversion he sees as both misleading and dangerous.

Ricardo Baeza-Yates, Director of the BSC AI Institute

The cost of those failures is growing. Baeza-Yates cataloged the unique harms emerging from generative AI: from the rapid spread of disinformation to unresolved copyright conflicts and growing mental health concerns. In perhaps the most chilling moment of the talk, he described real-world cases where chatbots were implicated in suicide, raising urgent questions about responsibility and intervention.

“Non-human errors are errors that AI will do, that humans will never do. And what’s the problem? We are not prepared for them,” he said, pointing to examples such as autonomous vehicles making decisions no human would make—like continuing to drive after hitting a pedestrian. These are not statistical outliers; they are systemic risks built into AI’s design, evaluation, and deployment.

Current approaches to success metrics, he argued, are part of the problem. “We shouldn’t measure accuracy. We should measure what happens when there’s a mistake,” he said.
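
One way to read that argument is as a call for cost-weighted evaluation. The sketch below is a hypothetical illustration, not a metric Baeza-Yates proposed: the labels, the two toy models, and the cost values are all assumptions chosen to show two systems with identical accuracy but very different consequences when they err.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_error_cost(y_true, y_pred, cost):
    """Average harm per prediction; `cost[(true, predicted)]` defaults to 0 when absent."""
    return sum(cost.get((t, p), 0.0) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 1 = hazard present, 0 = no hazard.
y_true  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # misses one hazard
model_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # raises one false alarm

# Assumed costs: a missed hazard is 100 times worse than a false alarm.
cost = {(1, 0): 100.0, (0, 1): 1.0}

for name, pred in (("A", model_a), ("B", model_b)):
    print(name, accuracy(y_true, pred), mean_error_cost(y_true, pred, cost))
```

Both models score the same on accuracy, yet under the assumed costs model A's single mistake is far more damaging than model B's.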

Baeza-Yates also pushed back on the popular narrative of AI democratization. With over 7,000 languages spoken worldwide and fewer than 200 represented in most leading models, a large portion of the global population is effectively excluded from AI access. Add to this the digital divide, age restrictions, and cultural mismatches, and the idea of equitable AI quickly falls apart.

Baeza-Yates ended with a clear reminder of the limits of current AI systems—and of the illusions we project onto them.

“Remember, we need to be very lucid. They don’t see, read, or write—because to [do that], you have to understand,” he said. “They are predicting everything. Basically, they’re hallucinating all the time. But most of the time, the hallucination is correct.”

Fairness of Geospatial Foundation Models

Geospatial AI (GeoAI) is increasingly being used to make sense of the physical world, from satellite imagery and urban infrastructure to environmental monitoring and disaster response. At TPC25, Dr. Kyoung-Sook Kim, Deputy Director at the National Institute of Advanced Industrial Science and Technology (AIST) in Japan, gave a compelling talk on the fairness of geospatial foundation models.

Dr. Kim has a background in geospatial databases, location-based services, and cyber-physical cloud computing, and has spent over a decade researching how to integrate geographic information systems (GIS) with big data and AI.

In her presentation, Dr. Kim brought attention to one of the most pressing issues facing the development of geospatial AI systems: the challenge of ensuring fairness across the full lifecycle of these models. In this case, fairness means making sure geospatial foundation models work well across different regions, not just the areas with the most data.

She explored how uneven data collection, gaps in spatial coverage, and biased assumptions during model training can lead to real-world consequences, particularly when AI is used to influence planning, infrastructure, or resource allocation. Her central message was that equity in GeoAI is not just a question of outcomes, but of how systems are designed, from the ground up.

A lot of attention has gone to fairness in language and image AI, but Dr. Kim explained that geospatial systems face their own unique problems. “There is no good definition,” she said. “We don’t have enough metrics to measure.” Unlike text or image data, location data is deeply tied to where it’s collected and what’s happening on the ground.

A model trained to work in one place might misrepresent or miss critical patterns somewhere else. That’s why, she explained, fairness in GeoAI can’t be one-size-fits-all; it needs to reflect the differences between regions, populations, and the data available to represent them.
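
Dr. Kim stressed that no agreed definition or metric exists yet, so the following is only a hypothetical sketch of what a region-aware check might look like. The region names, records, and the choice of accuracy gap as a summary are all assumptions made for illustration.

```python
from collections import defaultdict

def per_region_accuracy(records):
    """records: list of (region, y_true, y_pred) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for region, t, p in records:
        totals[region] += 1
        hits[region] += int(t == p)
    return {region: hits[region] / totals[region] for region in totals}

def fairness_summary(records):
    acc = per_region_accuracy(records)
    overall = sum(int(t == p) for _, t, p in records) / len(records)
    return {
        "overall": overall,
        "per_region": acc,
        "worst_region": min(acc, key=acc.get),
        "gap": max(acc.values()) - min(acc.values()),
    }

# Hypothetical evaluation set: plenty of urban samples, very few rural ones.
records = [("urban", 1, 1)] * 8 + [("rural", 1, 1), ("rural", 1, 0)]
summary = fairness_summary(records)
print(summary)
```

The overall score looks healthy because the data-rich region dominates it, while the per-region breakdown exposes the weaker performance where data is sparse.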

These concerns become even more important when we consider the types of data GeoAI systems use. They are built on a wide variety of inputs, including satellite images, mobile phone signals, sensor readings, GPS logs, and location-tagged video. Dr. Kim walked through how this data enables a range of applications, from identifying solar infrastructure to coordinating disaster response, powering autonomous delivery, and supporting city planning.

However, she also raised concerns about the blind spots these systems can carry. Decisions made early on, such as how data is selected, labeled, or processed, can introduce bias long before any model is deployed. For fairness to be meaningful, she explained, we have to look closely at who is reflected in the data, and who is not.

Slide courtesy of Dr. Kyoung-Sook Kim

Dr. Kim concluded by calling for shared frameworks to help evaluate fairness as AI systems take on more responsibility for interpreting and acting on spatial data. She pointed to international standards, including new ISO efforts, as an important step toward more consistent definitions of fairness and data quality.

However, these tools are only part of the solution. Fairness is deeply contextual, shaped by geography, history, social complexity, and other factors. As AI becomes more embedded in infrastructure, planning, and public decision-making, ensuring equity in how these systems are built and applied will be more important than ever.

The Takeaway

These talks pointed to a broader shift in AI research. As models grow in scale, so too does the need for thoughtful design. It is no longer just about building systems that perform well in benchmarks, but about understanding where their data comes from, how their outputs are evaluated, and what real-world impact they carry. Moving forward, the work will hinge not only on smarter algorithms, but on how responsibly and inclusively they are built and applied.

Thank you for following our TPC25 coverage. Complete session videos and transcripts will be available shortly at TPC25.org.

Contributing to this article were Ali Azhar, Doug Eadline, Jaime Hampton, Drew Jolly, and John Russell.
