Major AI transparency breakthrough: Ai2 model displays training data sources that may be linked to output
Ai2, the nonprofit research group, earlier today released OLMoTrace, a breakthrough tool that allows users to view and interact with the training data sources connected to specific AI chatbot outputs.
In a major step for transparency in artificial intelligence, the research institute Ai2 today released OLMoTrace, a feature in its Ai2 Playground system that lets the user trace the outputs of language models back to their training data in real time.
OLMoTrace shows documents from the training data that contain exact text matches with the model's response. Users can select a highlighted span to view the documents that contain it.
The new feature also lets users link to the original source material, a significant tool for researchers, academics, and journalists.
Some retrieved documents may be used to fact-check parts of the model's response, if the response contains simple facts. However, creative generations (e.g., writing a poem) or novel generations (e.g., writing code) likely cannot be fact-checked by looking at these retrieved documents.
Working and operating transparently
OLMo and OLMoTrace are part of an effort by Ai2 (the Allen Institute for AI) to influence how models are developed, released and made transparent. By letting users drill down into a model’s outputs and find the training data connected to them, OLMoTrace demonstrates that an unprecedented level of transparency is feasible with tools such as those the institute makes available, contrary to what others in the industry have maintained.
Ai2 is a nonprofit AI research group founded in 2014 by the late Microsoft co-founder Paul Allen. In a release announcing the new feature, Ai2 wrote: “OLMoTrace is a manifestation of Ai2’s commitment to an open ecosystem – open models, open data, and beyond. OLMoTrace is available today with our flagship models, including OLMo 2 32B Instruct.”
Check out how it works
Ai2 posted this video explainer:
How to access it
OLMoTrace is available starting today with Ai2’s three flagship models: OLMo 2 32B Instruct, OLMo 2 13B Instruct, and OLMoE 1B 7B Instruct, and it can be applied to any language model for which you have access to the training data.
What it looks like
Here’s a screen grab from OLMoTrace in action:
OLMoTrace activated on an Ai2 chatbot output.
To activate OLMoTrace for a model response, click on the “Show OLMoTrace” button. After a few seconds, several spans of the model response will be highlighted, and a document panel will appear on the right. The highlights indicate long and unique spans that appear verbatim at least once in the training data of this model. Ai2’s developers picked “long and unique” spans so that they are likely interesting enough to warrant further inspection.
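To make the matching idea concrete, here is a minimal Python sketch of finding long spans of a response that appear verbatim in a set of documents. It is an illustration only, not Ai2's implementation: the function name `find_verbatim_spans`, the whitespace tokenization, the `min_tokens` threshold, and the toy documents are all assumptions, and the "unique" criterion is not modeled. A brute-force scan like this would not scale to a real training corpus.

```python
from typing import List, Tuple


def find_verbatim_spans(
    response: str, documents: List[str], min_tokens: int = 4
) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges of maximal spans of `response`
    that appear verbatim in at least one document (hypothetical helper)."""
    tokens = response.split()
    # Pad with spaces so that substring checks only match whole tokens.
    padded_docs = [" " + " ".join(doc.split()) + " " for doc in documents]

    spans: List[Tuple[int, int]] = []
    furthest_end = 0
    for start in range(len(tokens)):
        end = start
        # Extend the span while it still occurs verbatim in some document.
        while end < len(tokens) and any(
            " " + " ".join(tokens[start:end + 1]) + " " in doc
            for doc in padded_docs
        ):
            end += 1
        # Keep the span only if it is long enough and not nested inside
        # a span that started earlier.
        if end - start >= min_tokens and end > furthest_end:
            spans.append((start, end))
        furthest_end = max(furthest_end, end)
    return spans


# Toy stand-ins for training documents and a model response.
documents = [
    "the quick brown fox jumps over the lazy dog",
    "a journey of a thousand miles begins with a single step",
]
response = "she said the quick brown fox jumps high and a thousand miles begins here"

tokens = response.split()
highlighted = [
    " ".join(tokens[start:end])
    for start, end in find_verbatim_spans(response, documents)
]
print(highlighted)
```

In this toy example, the two highlighted spans are "the quick brown fox jumps" and "a thousand miles begins"; shorter matches such as "quick brown fox" are dropped because they sit inside a longer span or fall below the length threshold.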
The side panel shows a collection of documents from the training data in which the highlighted spans appear. Sometimes a highlighted span does not appear contiguously in any single document; instead, different parts of the span appear (possibly in different documents) and together cover the entire span.
If you click on a highlight in the model response, the side panel will show only documents containing the selected span. Similarly, if you click “Locate span” on a document in the side panel, the span highlights will narrow down to those that appear in the selected document. Clicking on the highlight or the document again cancels the selection.
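The click behavior described above can be sketched as a small piece of selection state. This is a hypothetical model written for illustration, not code from the Ai2 Playground: the class `TraceView`, its method names, and the span and document IDs are invented, and it assumes that selecting a span clears any document selection and vice versa.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class TraceView:
    """Hypothetical model of the highlight / side-panel filtering."""

    # Maps each highlighted span ID to the IDs of documents containing it.
    span_to_docs: Dict[str, Set[str]]
    selected_span: Optional[str] = None
    selected_doc: Optional[str] = None

    def click_span(self, span_id: str) -> None:
        # Clicking the already-selected highlight cancels the selection.
        self.selected_span = None if self.selected_span == span_id else span_id
        self.selected_doc = None  # assumption: only one filter at a time

    def click_locate_span(self, doc_id: str) -> None:
        # Clicking the already-selected document cancels the selection.
        self.selected_doc = None if self.selected_doc == doc_id else doc_id
        self.selected_span = None  # assumption: only one filter at a time

    def visible_docs(self) -> List[str]:
        # With a span selected, show only documents containing that span.
        if self.selected_span is not None:
            return sorted(self.span_to_docs[self.selected_span])
        return sorted(set().union(*self.span_to_docs.values()))

    def visible_spans(self) -> List[str]:
        # With a document selected, show only spans found in that document.
        if self.selected_doc is not None:
            return sorted(
                span
                for span, docs in self.span_to_docs.items()
                if self.selected_doc in docs
            )
        return sorted(self.span_to_docs)


view = TraceView({"s1": {"d1", "d2"}, "s2": {"d2"}, "s3": {"d3"}})
view.click_span("s2")
print(view.visible_docs())   # documents containing span s2
view.click_span("s2")        # second click cancels the selection
print(view.visible_docs())   # all documents again
```

Keeping a single span-to-documents mapping means both filters (by span and by document) are derived from the same data, so the highlights and the side panel cannot drift out of sync.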
Learn more about OLMoTrace
Link to Ai2’s announcement of OLMoTrace’s release.
Download the Technical Paper regarding OLMoTrace.
Explore the Source Code for OLMoTrace.
In its announcement, Ai2 wrote: “We developed OLMoTrace to enable researchers, developers, and the general public to inspect where and how language models may have learned to generate certain word sequences. OLMoTrace is a one-of-a-kind feature and is made possible by Ai2’s commitment to making large pretraining and post-training datasets open in the interest of advancing scientific research in AI and public understanding of AI systems.”