The Problem Has a Name: Semantic Drift
Ask ten analysts at a large enterprise what "monthly active users" means, and you will likely get ten different answers. Not because they are careless, but because each tool they work in — the BI platform, the data warehouse, the marketing cloud, the CRM — defines it slightly differently. One system counts logins. Another counts session initiations. A third filters out internal users. A fourth applies a 28-day rolling window instead of a calendar month. The metric carries the same name across the organization but arrives with different DNA in every context.
This phenomenon — metrics and business logic silently diverging as they travel across tools and teams — has a name: semantic drift. It has existed as long as organizations have had more than one data system. What changed in 2024 and 2025 is who got hurt by it.
For decades, semantic drift was a human problem. Analysts would reconcile discrepancies in spreadsheets, add footnotes to dashboards, and hold "data alignment" meetings to agree on which number was the "real" number. Frustrating, but manageable. The humans understood the context behind each definition even when the systems did not.
AI agents do not have that luxury. When a conversational analytics system or an agentic AI receives a natural language question — "what is our MoM revenue growth in the northeast region?" — it must resolve that question into precise computation against a specific data source using a specific metric definition. It cannot call a meeting. It cannot read the footnote. It either has a canonical, machine-readable definition of "revenue" and "northeast region" to reference, or it guesses. And when it guesses, it hallucinates — returning a confident, plausible-looking number that means something subtly different than what the person asked for.
The data industry recognized this problem and attempted to solve it, repeatedly, through semantic layers. Tools like LookML (Looker), MetricFlow (dbt), Cube, AtScale, and others gave organizations a way to define business metrics centrally and serve them consistently. These tools worked — inside their own ecosystems. The problem was portability. A metric defined in LookML lives in LookML. A metric defined in MetricFlow lives in dbt. If an organization used both tools, or wanted to feed the same definition into a different AI platform, it had to redefine the metric from scratch. Every new tool in the stack meant another surface where drift could occur.
That is the gap the Open Semantic Interchange Initiative (OSI) was created to close.
What Is the Open Semantic Interchange?
The Open Semantic Interchange is an industry coalition and open specification designed to create a universal interchange format for semantic metadata — a common language that analytics tools, AI platforms, BI systems, and data catalogs can all read, write, and exchange without translation loss.
Snowflake announced the initiative on September 23, 2025. The framing was deliberately broad: this was not a Snowflake product, a Snowflake format, or a Snowflake-controlled standard. It was positioned as an open, vendor-neutral effort led by a coalition of companies that collectively recognized that no single vendor could solve the interoperability problem on its own, and that trying to do so would simply reproduce the fragmentation in a new form.
The founding coalition was notable for its range. It included not just Snowflake, but Salesforce (and by extension Tableau and Einstein Analytics), dbt Labs (the de facto backbone of the modern analytics engineering stack), BlackRock (representing large-scale institutional data consumers), RelationalAI, ThoughtSpot, Sigma, Omni, Alation, Atlan, Cube, Hex, Honeydew, Mistral AI, and others. The inclusion of Mistral AI was a deliberate signal: this specification was being designed from the ground up to be AI-legible, not just BI-legible.
The Core Proposition
OSI's proposition is simple to state and genuinely hard to execute: define a vendor-neutral, machine-readable format for representing semantic layer constructs — datasets, metrics, dimensions, relationships, and contextual metadata — such that any compliant tool can ingest a model defined in any other compliant tool without redefining anything.
Think of it as the USB-C of semantic layers. The connector is standardized. The devices on either end can be anything. A metric defined in dbt's MetricFlow and exported as OSI-formatted YAML should be importable into Snowflake's Cortex Analyst, or ThoughtSpot Spotter, or a custom agentic AI application, with no semantic loss. The business logic travels with the data definition, not just the data schema.
The specification lives on GitHub at github.com/open-semantic-interchange/OSI under an Apache 2.0 license. A dedicated project website launched at open-semantic-interchange.org in January 2026.
Founding and Expanded Coalition
The working group has grown substantially since the September 2025 announcement.
Since the January 2026 spec release, the working group has expanded to include AtScale, Coalesce, Collate, Credible, Databricks, JetBrains, Lightdash, Qlik, Collibra, DataHub, Domo, Firebolt, Informatica, Instacart, Preset, and Starburst. The addition of Databricks is particularly significant — it brings both the Databricks Lakehouse platform and Unity Catalog into the OSI orbit, meaning the two dominant cloud data platform ecosystems (Snowflake and Databricks) are now both participating in the same semantic standard.
Inside the Specification
The OSI specification uses a declarative YAML format. Its design lineage is visible: it draws heavily on dbt Labs' MetricFlow framework, which itself inherited ideas from LookML and earlier semantic layer traditions. But OSI adds two critical layers that prior formats lacked — dialect-aware expression support and first-class AI context.
Top-Level Structure
An OSI document has two top-level keys: a version identifier and a semantic_model array. Each entry in the array is a named semantic model — analogous to a data domain or subject area — that contains datasets, metrics, relationships, and contextual metadata. A single OSI file can define multiple models, though in practice most organizations will maintain one model per domain.
version: "0.1.1"
semantic_model:
  - name: retail_analytics_model
    description: Core semantic model for retail sales and customer analytics
    ai_context:
      instructions: "Use this model for retail analytics. It provides
        comprehensive sales, customer, product, and store data. Supports
        time-based analysis, customer segmentation, and store performance."
    datasets: [...]
    metrics: [...]
    relationships: [...]
    custom_extensions: [...]
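As a rough illustration of the shape described above, the following Python sketch checks a loaded document for the two top-level keys and for array-valued sections inside each model. It assumes the YAML has already been parsed into plain dictionaries and lists. The function name check_top_level and the decision to treat the four model sections as optional arrays are assumptions made for this example, not rules taken from the specification.

```python
# Sketch only: the OSI document is represented as the plain dict a YAML loader would return.
REQUIRED_TOP_LEVEL = ("version", "semantic_model")
# Assumption for this example: each section is optional, but must be an array if present.
MODEL_SECTIONS = ("datasets", "metrics", "relationships", "custom_extensions")


def check_top_level(doc):
    """Return a list of problems; an empty list means the basic shape looks right."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in doc:
            problems.append(f"missing top-level key: {key}")

    models = doc.get("semantic_model", [])
    if not isinstance(models, list):
        problems.append("semantic_model must be an array")
        return problems

    seen = set()
    for i, model in enumerate(models):
        name = model.get("name")
        if not name:
            problems.append(f"semantic_model[{i}] has no name")
        elif name in seen:
            problems.append(f"duplicate model name: {name}")
        seen.add(name)
        for section in MODEL_SECTIONS:
            if not isinstance(model.get(section, []), list):
                problems.append(f"{name}: {section} must be an array")
    return problems


doc = {
    "version": "0.1.1",
    "semantic_model": [
        {
            "name": "retail_analytics_model",
            "datasets": [],
            "metrics": [],
            "relationships": [],
            "custom_extensions": [],
        }
    ],
}
print(check_top_level(doc))  # []
```

A check like this only covers structure. It says nothing about whether the definitions inside each section are correct.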
Datasets: The Logical Business Entity Layer
Datasets in OSI represent logical business entities — typically corresponding to fact tables and dimension tables in a star schema, though the abstraction is not limited to that physical shape. Each dataset declares a source (the physical table or view), a primary key, and an array of fields. Critically, each field carries an ai_context block that provides synonyms — alternative natural language terms that map to this technical field name.
datasets:
  - name: store_sales
    source: tpcds.public.store_sales
    primary_key: [ss_item_sk, ss_ticket_number]
    description: Fact table containing all store sales transactions
    ai_context:
      synonyms:
        - "sales transactions"
        - "store purchases"
        - "retail sales"
        - "POS data"
    fields:
      - name: ss_ext_sales_price
        expression:
          dialects:
            - dialect: ANSI_SQL
              expression: ss_ext_sales_price
        description: Extended sales price (quantity x unit price)
        ai_context:
          synonyms:
            - "total price"
            - "line total"
            - "revenue"
The ai_context.synonyms array is where OSI does something genuinely new. Prior semantic layer formats were written for BI tools — systems where a user navigated a data model visually and selected fields by their technical names or curated display labels. AI agents work differently: they receive a natural language question and must resolve business terms — "revenue," "units sold," "store purchases" — to technical field identifiers. Without a structured synonym registry, that resolution happens probabilistically inside the LLM, which means it can be wrong. OSI externalizes that mapping into the data model itself, making it deterministic and auditable.
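To make the idea of a deterministic synonym lookup concrete, here is a minimal Python sketch that builds an index from field-level synonyms and resolves a business term to a qualified field name. The helper names build_synonym_index and resolve_term are hypothetical, and the choice to raise an error on a missing or ambiguous term (rather than pick a best guess) is a design assumption of this example, not behavior defined by OSI.

```python
from collections import defaultdict


def build_synonym_index(dataset):
    """Map each lowercased synonym to the set of qualified field names declaring it."""
    index = defaultdict(set)
    for field in dataset.get("fields", []):
        qualified = f"{dataset['name']}.{field['name']}"
        for term in field.get("ai_context", {}).get("synonyms", []):
            index[term.strip().lower()].add(qualified)
    return index


def resolve_term(index, term):
    """Return the single matching field, or raise instead of guessing."""
    matches = index.get(term.strip().lower(), set())
    if len(matches) != 1:
        raise LookupError(f"{term!r} matched {len(matches)} fields: {sorted(matches)}")
    return next(iter(matches))


# Mirrors the field shown in the dataset example, reduced to what the lookup needs.
store_sales = {
    "name": "store_sales",
    "fields": [
        {
            "name": "ss_ext_sales_price",
            "ai_context": {"synonyms": ["total price", "line total", "revenue"]},
        }
    ],
}

index = build_synonym_index(store_sales)
print(resolve_term(index, "Revenue"))  # store_sales.ss_ext_sales_price
```

Because the mapping is ordinary data, it can be reviewed and tested like any other part of the model. A term claimed by two fields also surfaces as an explicit error instead of a silent choice.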
Metrics: Dialect-Aware Computation
Metrics in OSI are defined as named, reusable calculations. Each metric carries an expression block that supports multiple SQL dialects — ANSI SQL, Snowflake SQL, BigQuery Standard SQL, Spark SQL, and others can all be represented within a single metric definition. This is OSI's answer to the cross-platform execution problem: you do not have to choose a target platform at definition time.
metrics:
  - name: customer_lifetime_value
    expression:
      dialects:
        - dialect: ANSI_SQL
          expression: >
            SUM(store_sales.ss_ext_sales_price)
            / COUNT(DISTINCT customer.c_customer_sk)
    description: Average lifetime sales value per unique customer
    ai_context:
      synonyms:
        - "CLV"
        - "LTV"
        - "customer value"
        - "lifetime revenue"
        - "average customer worth"
  - name: store_productivity
    expression:
      dialects:
        - dialect: ANSI_SQL
          expression: >
            SUM(store_sales.ss_ext_sales_price)
            / NULLIF(SUM(store.s_number_employees), 0)
    description: Sales revenue per employee across stores
    ai_context:
      synonyms:
        - "sales per employee"
        - "employee productivity"
        - "revenue per headcount"
Notice that customer_lifetime_value is defined as a ratio — a sum divided by a distinct count — with its own ai_context.synonyms. When an AI agent encounters a user asking "what is our CLV by region?", it can resolve "CLV" to this metric definition, retrieve the precise SQL expression for the target platform's dialect, and execute it correctly without interpretation. The business logic travels with the question.
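The dialect lookup described here can be sketched in a few lines of Python. The function expression_for is a hypothetical helper, the "SNOWFLAKE" dialect identifier is a placeholder (only ANSI_SQL appears in the sample above), and falling back to the ANSI_SQL expression when the target dialect is absent is an assumption of this example, not a rule stated by the specification.

```python
def expression_for(metric, target_dialect, fallback="ANSI_SQL"):
    """Pick the expression for the target dialect, else the fallback dialect."""
    by_dialect = {
        entry["dialect"]: entry["expression"]
        for entry in metric["expression"]["dialects"]
    }
    if target_dialect in by_dialect:
        return by_dialect[target_dialect].strip()
    if fallback in by_dialect:
        return by_dialect[fallback].strip()
    raise KeyError(
        f"{metric['name']} has no expression for {target_dialect} or {fallback}"
    )


# The metric from the sample, as a YAML loader would return it (folded scalar
# joined into one line with a trailing newline).
clv = {
    "name": "customer_lifetime_value",
    "expression": {
        "dialects": [
            {
                "dialect": "ANSI_SQL",
                "expression": "SUM(store_sales.ss_ext_sales_price) "
                "/ COUNT(DISTINCT customer.c_customer_sk)\n",
            }
        ]
    },
}

# "SNOWFLAKE" is a placeholder dialect name; no such entry exists, so the
# ANSI_SQL expression is returned.
print(expression_for(clv, "SNOWFLAKE"))
```

A consumer that cannot safely run ANSI SQL on its platform would want to disable the fallback and fail loudly instead.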
Custom Extensions: Vendor-Specific Metadata
One of OSI's design concessions to reality is the custom_extensions block. Rather than demanding that every possible vendor capability be encoded in the core spec, OSI provides a structured escape hatch: any vendor can attach arbitrary JSON metadata to a semantic model under their own namespace, and compliant tools that do not understand that namespace simply ignore it.
custom_extensions:
  - vendor_name: SALESFORCE
    data: |
      {
        "tableau_workbook_id": "retail_dashboard",
        "einstein_enabled": true,
        "crm_sync": {
          "enabled": true,
          "sync_frequency": "daily",
          "customer_mapping": "customer.c_customer_id -> Account.AccountNumber"
        }
      }
  - vendor_name: DBT
    data: '{"project_name": "retail_analytics", "models_path": "models/semantic"}'
This design choice reflects a pragmatic understanding of how standards succeed. The core spec needs to be stable and minimal — covering only what every participant agrees belongs in the shared layer. Vendor-specific capabilities, integrations, and enrichments go in the extensions block. It is the same architectural decision that made HTTP extensible through headers and HTML extensible through custom attributes: reserve the core for consensus, allow the periphery to evolve.
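The "ignore what you do not understand" behavior can be illustrated with a short Python sketch. It assumes, as in the sample above, that each extension entry carries a vendor_name and a data string containing JSON. The function name extensions_for is hypothetical, and the SALESFORCE payload is shortened for the example.

```python
import json


def extensions_for(model, vendor_name):
    """Return parsed extension payloads for one vendor; other namespaces are skipped."""
    payloads = []
    for ext in model.get("custom_extensions", []):
        if ext.get("vendor_name") != vendor_name:
            continue  # not our namespace: ignore it without raising
        payloads.append(json.loads(ext["data"]))
    return payloads


model = {
    "name": "retail_analytics_model",
    "custom_extensions": [
        {
            "vendor_name": "SALESFORCE",
            "data": '{"tableau_workbook_id": "retail_dashboard", "einstein_enabled": true}',
        },
        {
            "vendor_name": "DBT",
            "data": '{"project_name": "retail_analytics", "models_path": "models/semantic"}',
        },
    ],
}

print(extensions_for(model, "DBT"))
# [{'project_name': 'retail_analytics', 'models_path': 'models/semantic'}]
```

A tool that reads only its own namespace never has to parse, or even see, another vendor's payload, so new extensions do not break existing consumers.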
Relationships: The Join Layer
OSI represents join relationships between datasets explicitly, specifying foreign key mappings, join type (left, inner, full outer), and cardinality (one-to-many, many-to-one, many-to-many). This is the information that a query engine needs to correctly traverse a star schema — or any other physical layout — at runtime. By encoding it in the semantic model rather than the query layer, OSI ensures that joining logic is defined once and applied consistently whether the consuming system is a traditional BI tool, a NL2SQL pipeline, or an agentic AI workflow.
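The article does not reproduce the relationship syntax, so the following Python sketch uses hypothetical key names (from_dataset, to_dataset, from_columns, to_columns, join_type, cardinality) purely to show how a consumer might turn a declared relationship into a JOIN clause. Only the concepts (foreign key mappings, join type, cardinality) come from the description above; the key names and the render_join helper are assumptions.

```python
# Join types named in the text, mapped to SQL keywords.
JOIN_KEYWORDS = {
    "left": "LEFT JOIN",
    "inner": "INNER JOIN",
    "full outer": "FULL OUTER JOIN",
}


def render_join(rel):
    """Render one relationship as a SQL JOIN clause (hypothetical key names)."""
    if len(rel["from_columns"]) != len(rel["to_columns"]):
        raise ValueError(f"{rel['name']}: key column lists differ in length")
    keyword = JOIN_KEYWORDS[rel["join_type"]]
    condition = " AND ".join(
        f"{rel['from_dataset']}.{a} = {rel['to_dataset']}.{b}"
        for a, b in zip(rel["from_columns"], rel["to_columns"])
    )
    return f"{keyword} {rel['to_dataset']} ON {condition}"


# Illustrative relationship between two TPC-DS tables; the structure is assumed.
rel = {
    "name": "store_sales_to_customer",
    "from_dataset": "store_sales",
    "to_dataset": "customer",
    "from_columns": ["ss_customer_sk"],
    "to_columns": ["c_customer_sk"],
    "join_type": "left",
    "cardinality": "many-to-one",
}

print(render_join(rel))
# LEFT JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
```

The cardinality value is not used when rendering the clause. A query planner would more plausibly use it to decide whether an aggregation risks double counting after the join.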
The Interoperability Stack
OSI does not replace existing semantic layers — it connects them. Understanding where it fits requires a clear picture of the tools it is designed to interoperate with.
dbt Semantic Layer + MetricFlow
dbt Labs introduced MetricFlow as the query engine behind the dbt Semantic Layer, providing a structured way to define metrics in YAML that could be executed consistently across data warehouses. MetricFlow was already the closest thing to a de facto standard for analytics engineering teams. OSI's relationship with MetricFlow is additive: dbt has committed to using the OSI spec as the interchange format for MetricFlow definitions, meaning metrics authored in dbt can be exported as OSI-compliant YAML and consumed by any other OSI-compatible tool. The workflow is: author in dbt, export as OSI, import anywhere.
Snowflake Cortex Analyst
Snowflake's Cortex Analyst is a native NL2SQL capability built into the Snowflake platform, allowing users to ask natural language questions against Snowflake data. As the OSI lead sponsor, Snowflake has built Cortex Analyst around the OSI format — semantic models defined in OSI YAML are the configuration layer that tells Cortex Analyst how to interpret user questions, map natural language terms to fields, and generate accurate SQL. OSI is not incidental to Cortex Analyst; it is the semantic substrate the system runs on.
ThoughtSpot Spotter
ThoughtSpot joined OSI as a founding member and has committed to OSI compatibility in its Spotter conversational analytics product. For ThoughtSpot, OSI represents a path to consume semantic context defined elsewhere in an organization's stack — rather than requiring all semantic definitions to live in ThoughtSpot's own modeling layer, organizations can maintain a single OSI-formatted model and have Spotter read from it directly.
Salesforce / Tableau Einstein
Salesforce's participation covers both Tableau (its enterprise BI platform) and Einstein (its AI analytics layer). The custom_extensions example in the official OSI sample file references Tableau workbook IDs and Einstein enablement flags explicitly, reflecting how Salesforce intends to use the extension mechanism: core metric definitions live in the shared OSI layer, Tableau/Einstein-specific configuration lives in the extension block, and the two pieces travel together.
The Converter Layer
The OSI GitHub repository includes a converters directory — a growing collection of tools for translating existing semantic layer definitions into OSI-compliant YAML. Converters for LookML (Looker/Google) and MetricFlow are in active development. This converter layer is essential for adoption: organizations with existing semantic layer investments cannot start from scratch, and OSI's interoperability value only materializes if existing definitions can be migrated into the shared format without manual reauthoring.
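Conversion is not always lossless. Constructs in a source format that have no direct OSI equivalent may need to be carried in a custom_extensions block or be simplified. Organizations should validate converter output against expected query behavior before treating it as production-ready.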
What OSI Means for Conversational Analytics
Conversational analytics — the ability for users to ask natural language questions and receive trustworthy, computed answers from enterprise data — has been technically feasible for several years. The constraint was never the language model's ability to generate SQL. Models like GPT-4 and Claude were generating syntactically correct SQL from natural language prompts as early as 2023. The constraint was the accuracy of that SQL given the semantic complexity of enterprise data models.
A language model asked to answer "what is our revenue this quarter compared to last quarter?" will generate SQL that looks correct. Whether it is actually correct depends entirely on how "revenue" is defined in the specific data environment the query executes against. If "revenue" should exclude returns, or should be net of discounts, or should use recognition date rather than order date, or should exclude a specific business unit that is in the middle of an acquisition — none of that information is in the schema. It exists in the semantic layer, or in human heads, or nowhere.
This is precisely the gap OSI addresses for conversational analytics workflows.
How OSI Changes the NL2SQL Pipeline
A traditional NL2SQL pipeline works roughly as follows: receive a natural language question, construct a prompt that includes database schema information, send the prompt to a language model, receive SQL output, execute it. The schema information gives the model column names and data types, but not business logic. The model must infer metric definitions from context clues in column names and table names — which is exactly where hallucination occurs.
An OSI-augmented pipeline changes the input layer fundamentally. Instead of raw schema, the model receives an OSI semantic model: structured definitions of what each metric means, how it is calculated, what natural language terms map to which technical fields, and what contextual instructions apply to this domain. The question "what is our CLV by region?" no longer requires the model to infer what CLV means — the OSI model has already defined it, mapped its synonyms, and provided the SQL expression that computes it correctly for the target platform dialect.
The ai_context blocks are, in effect, structured retrieval augmentation for business logic.
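One way to picture the changed input layer is a function that flattens a semantic model into text for the prompt, in place of raw schema. The Python sketch below is illustrative only: render_context is a hypothetical helper, the output format is invented for this example, and it simply takes the first listed dialect instead of choosing one for a target platform.

```python
def render_context(model):
    """Flatten a semantic model into prompt text: instructions, metrics, synonyms."""
    lines = [f"Model: {model['name']}"]
    instructions = model.get("ai_context", {}).get("instructions")
    if instructions:
        lines.append(f"Instructions: {instructions}")
    for metric in model.get("metrics", []):
        synonyms = ", ".join(metric.get("ai_context", {}).get("synonyms", []))
        # Simplification: use the first dialect entry as the canonical expression.
        expr = metric["expression"]["dialects"][0]["expression"].strip()
        lines.append(f"Metric {metric['name']} (also called: {synonyms}) = {expr}")
    return "\n".join(lines)


model = {
    "name": "retail_analytics_model",
    "ai_context": {"instructions": "Use this model for retail analytics."},
    "metrics": [
        {
            "name": "customer_lifetime_value",
            "expression": {
                "dialects": [
                    {
                        "dialect": "ANSI_SQL",
                        "expression": "SUM(store_sales.ss_ext_sales_price) "
                        "/ COUNT(DISTINCT customer.c_customer_sk)",
                    }
                ]
            },
            "ai_context": {"synonyms": ["CLV", "LTV"]},
        }
    ],
}

question = "What is our CLV by region?"
prompt = render_context(model) + "\n\nQuestion: " + question
print(prompt)
```

With this context in the prompt, the model no longer has to infer what "CLV" means from column names, because the definition and its expression are stated outright.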
Agentic AI Workflows
The implications extend beyond single-turn question answering. Agentic AI systems — multi-step workflows where an AI agent plans, executes, observes, and iterates — increasingly need to query enterprise data as part of complex reasoning chains. A procurement agent that identifies cost reduction opportunities, a financial agent that monitors for margin compression, a clinical operations agent that tracks patient throughput — all of these workflows require the agent to issue data queries against enterprise systems with precision.
Without semantic grounding, each of those agents must either embed metric definitions in its system prompt (fragile, duplicative, version-uncontrolled) or risk semantic drift in its results. With an OSI-formatted semantic model available at query time, the agent can retrieve the canonical definition of any metric it needs, issue a correctly-formed query, and trust that the result means what it expects.
The model-level ai_context.instructions field is particularly relevant here. It allows model authors to provide guidance specifically to AI agents consuming the model — context that a human BI user would not need but an autonomous agent does, such as which metrics are appropriate for which types of analysis, how to interpret null values in certain fields, or what business rules apply to edge cases. This is semantic documentation designed for machine consumption from the ground up.
The Single Source of Truth Problem
One of the persistent challenges in enterprise analytics is maintaining a single source of truth for metric definitions as organizations grow and their tool stacks proliferate. A business intelligence team defines "active user" in Looker. A data science team defines it differently in a Python notebook. A product team embeds a third definition in an event tracking system. An executive dashboard pulls from a fourth source. OSI does not eliminate this problem by itself — organizations still have to choose a canonical definition. But it gives them a format in which to encode that canonical definition once and have it flow, consistently, into every tool that participates in the OSI ecosystem. The single source of truth becomes portable in a way it has never been before.
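Before choosing a canonical definition, a team first has to see where its definitions disagree. The Python sketch below groups metric expressions collected from several tools and reports any metric with more than one distinct form. The tool names, the expressions, and the find_drift helper are all made up for illustration. The comparison is purely textual (whitespace and case are normalized), so it would miss expressions that are written differently but logically equivalent.

```python
import re
from collections import defaultdict


def normalize(expr):
    """Collapse whitespace and case so cosmetic differences do not count as drift."""
    return re.sub(r"\s+", " ", expr).strip().lower()


def find_drift(definitions):
    """Group (tool, metric, expression) triples; return metrics with >1 distinct form."""
    variants = defaultdict(lambda: defaultdict(list))
    for tool, name, expr in definitions:
        variants[name][normalize(expr)].append(tool)
    return {name: dict(forms) for name, forms in variants.items() if len(forms) > 1}


# Hypothetical definitions of the same metric gathered from three places.
definitions = [
    ("looker", "active_users", "COUNT(DISTINCT user_id)"),
    ("notebook", "active_users", "count(distinct  user_id)"),
    ("tracking", "active_users", "COUNT(DISTINCT session_id)"),
]

for metric, forms in find_drift(definitions).items():
    print(metric)
    for expr, tools in forms.items():
        print(f"  {expr}  <- {', '.join(tools)}")
```

In this example the first two definitions collapse to the same normalized form, while the third is reported as a divergent variant of active_users.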
Where Things Stand Now
Seven months have passed since the September 2025 announcement. The project has moved through a predictable but meaningful arc: from vision to specification to early implementation, with governance still evolving.
What the Databricks Addition Signals
Databricks joining the OSI working group in January 2026 deserves specific attention. Snowflake and Databricks have competed intensely for the enterprise data platform market for several years — they are each other's most direct competitor in the cloud lakehouse and data warehouse space. For Databricks to participate in an initiative led by Snowflake is a meaningful signal. It suggests that the participants believe the interoperability problem is large enough, and the business case for solving it strong enough, that competitive considerations are secondary.
It also materially changes OSI's coverage. Databricks brings Unity Catalog — its unified data governance and metadata layer — into the conversation. Unity Catalog already manages semantic metadata for Databricks workloads. OSI compatibility with Unity Catalog would mean that semantics defined in Databricks' ecosystem can participate in the shared interchange format, dramatically expanding the pool of organizations for whom OSI is a viable path.
Governance: The Unresolved Question
The most significant structural question OSI has not yet resolved is governance. The project has operated under Snowflake's informal leadership since the September 2025 announcement. Snowflake has stated explicitly that it intends to transition OSI to a "neutral, foundation-led governance model" — the same pattern that successful open standards like OpenAPI and CNCF-hosted projects have followed. What that foundation will be, when the transition will occur, and how voting rights and spec modification rights will be structured remain open questions as of April 2026.
This matters for adoption. Organizations evaluating whether to build their semantic layer tooling around OSI are asking whether the spec will remain stable, whether a single vendor can unilaterally change it, and whether their interests will be represented in future versions. A foundation structure answers those questions. Until it exists, some organizations will watch rather than commit.
The Comparison to Other Standards Efforts
It is worth situating OSI in the broader history of data interoperability efforts, most of which did not succeed. PMML, the Predictive Model Markup Language, tried to standardize model exchange across statistical tools in 1999 and achieved limited adoption. MDX, Microsoft's multidimensional query language, became a de facto standard but only within the OLAP cube ecosystem. The Semantic Web's RDF and OWL specifications produced a rich ontology framework that never achieved mainstream enterprise data adoption.
OSI's structural advantages over prior efforts are real. It has buy-in from organizations that represent the actual working stack of modern analytics engineering — dbt Labs, Snowflake, Databricks, Salesforce/Tableau — not just standards bodies and academic contributors. The YAML format is developer-friendly and already familiar to dbt users. The Apache 2.0 license removes commercial friction. And the problem it is solving is acutely felt right now, not hypothetically in the future.
The risks are also real. The working group spans competitors with different incentives. Vendor-specific extensions could dilute the core's interoperability value if they become the primary vehicle for differentiation. And the governance gap creates uncertainty that may slow adoption at exactly the moment when momentum matters.
What to Watch in the Next 12 Months
For practitioners tracking OSI's progress, three developments will be most telling. First, whether native OSI import/export ships in dbt Cloud and Snowflake Cortex Analyst — these are the two products with the largest install bases in the working group, and their implementations will create the first real-world validation of the spec under production workloads. Second, whether Databricks ships OSI support in Unity Catalog — that would effectively mean every major cloud data platform supports the format. Third, whether the governance foundation is announced. If Snowflake is serious about OSI being a community standard rather than a Snowflake standard, the foundation transition will happen. If it does not happen by mid-2026, the organizations currently watching from the sidelines will take note.
The specification itself is a starting point, not an endpoint. The OSI team has been explicit about this. Version 0.1.1 — the current specification version — is intentionally lean. Future versions will address row-level security context, materialization hints, time-series grain specifications, and richer AI instruction formats as real-world implementation surfaces edge cases. That evolution should be expected and welcomed. The value of a shared standard is not that it is perfect at v1.0; it is that it exists, and that the industry agrees to make it better together.
Sources
Snowflake Unites Industry Leaders to Unlock AI's Potential with the Open Semantic Interchange Initiative
The original announcement from Snowflake introducing OSI, its mission, and the founding coalition of partners.
Open Semantic Interchange (OSI) Specification Finalized
Announcement of the v1.0 spec release, new working group members, and the launch of the OSI project website.
What the Open Semantic Interchange (OSI) Spec Means for Metrics, Semantics, and AI
dbt Labs' perspective on OSI, MetricFlow's role in the initiative, and the commitment to operationalizing semantics through the interchange format.
Ending Semantic Drift: The First Unified Business Logic Foundation for AI and BI
Salesforce's articulation of the semantic drift problem and how OSI provides the foundation for consistent business logic across AI and BI systems.
OSI Specification Repository
The Apache 2.0-licensed repository containing the core specification, examples (including the TPC-DS retail semantic model used in this article), converters, and validation tooling.
Open Semantic Interchange (OSI) Further Expands Partner Ecosystem and Holds First Working Group Meeting
Update on the first OSI working group meeting and the expansion of the partner ecosystem.
Snowflake-Led Coalition Targets Data Fragmentation with Vendor-Neutral Semantic Standard
Independent analysis of the OSI announcement and its implications for the broader data ecosystem.
Open Semantic Interchange — Project Website
The official OSI project site with the current specification, working group directory, and community resources.