The Chatbot Didn't Get It Wrong. Your Data Did.

The Setup

I'm new to conversational analytics. Twenty-five years in data taught me it doesn't matter.

I haven't spent my career in conversational analytics. I've spent it in data — building pipelines, standing up warehouses, sitting in rooms where someone points at a number on a dashboard and asks why it doesn't match what the source system shows.

That question is 25 years old for me. And when I started looking seriously at conversational analytics, what struck me wasn't how different the problem was. It was how familiar it felt.

The fundamental question in data quality has never changed. Does the number we're displaying match the source system? If it does, the data is right. If it doesn't, the data is wrong. That's true whether the number lives on a report, a dashboard, or gets spoken back to a user by a chatbot.

The delivery mechanism has changed. The question hasn't.

That's the central idea in this brief. The fear around AI chatbots giving wrong answers is real — but most of it is pointed at the wrong thing. The risk isn't the AI. It's the data foundation the AI is sitting on top of. And that's a problem every CDO already knows how to think about.

The Real Question

The question you're actually asking

When a CDO approaches conversational analytics, there's usually one underlying concern they're trying to work through:

"If my source system shows $1M in Q1 Southwest revenue, and a user asks the chatbot that same question and gets $1M back — great. But is there something about the AI itself that makes that number less trustworthy than if it showed up on a dashboard?"

It's a fair question. The answer, based on how these systems actually work, is: not really. And the evidence from how NL2SQL and conversational analytics systems are built bears that out.

But there are real differences in what happens when things go wrong. That's where it gets nuanced — and where your team needs to be prepared. I'll get to that.

First, let me explain why the core hypothesis holds.

Same Problem

It's the same problem. Most of it, anyway.

Here's a simple test. Look at this list of root causes for wrong numbers in analytics. For each one, ask yourself: does this break a dashboard? Does it break a chatbot?

Root Cause	Breaks a Dashboard?	Breaks a Chatbot?
Data loaded late — ETL job failed	Yes	Yes
Duplicate records inflating aggregations	Yes	Yes
"Revenue" defined differently in two models	Yes	Yes
AVG() skewed by nulls	Yes	Yes
Orders not linked to a region (orphaned FK)	Yes	Yes
Fiscal calendar applied inconsistently	Yes	Yes
Row-level security misconfigured	Yes	Yes
Dynamic query picks the wrong column for "revenue"	No (static query)	Yes (without a semantic layer)
Same question phrased differently hits different logic	No	Yes (without synonym governance)

Seven of nine root causes are identical. The two that are genuinely new to conversational analytics are both addressed by the semantic layer — the part of the stack that maps business terms to verified database objects. If "revenue" is formally defined as SUM(revenue_amount) from FACT_ORDERS where returns are excluded, the chatbot can't improvise. It works from a controlled vocabulary, the same way a well-built dashboard metric does.
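To make the controlled-vocabulary idea concrete, here is a minimal Python sketch of what a semantic layer lookup might look like. Treat it as an illustration, not as how any particular platform implements it. The registry, the synonym map, and the resolve_metric and build_sql helpers are hypothetical names. Only SUM(revenue_amount) and FACT_ORDERS come from the example above, and the is_return column is my assumption about how returns might be flagged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str  # certified SQL aggregate
    table: str       # verified source table
    filters: str     # business rules baked into the definition


# Hypothetical certified registry: one formal definition per business term.
CERTIFIED_METRICS = {
    "revenue": Metric(
        name="revenue",
        expression="SUM(revenue_amount)",
        table="FACT_ORDERS",
        filters="is_return = 0",  # assumed flag column for excluding returns
    ),
}

# Hypothetical synonym governance: phrasings that map to a certified term.
SYNONYMS = {"sales": "revenue", "total revenue": "revenue"}


def resolve_metric(term: str) -> Metric:
    """Map a user's term to a certified metric, or refuse rather than guess."""
    key = term.strip().lower()
    key = SYNONYMS.get(key, key)
    if key not in CERTIFIED_METRICS:
        raise KeyError(f"'{term}' is not a certified metric")
    return CERTIFIED_METRICS[key]


def build_sql(term: str) -> str:
    """Build the query from the certified definition only."""
    m = resolve_metric(term)
    return f"SELECT {m.expression} AS {m.name} FROM {m.table} WHERE {m.filters}"


print(build_sql("Sales"))
```

The design choice that matters is the refusal path: a term that is not in the registry raises an error instead of letting the system pick a plausible-looking column. "Sales" and "revenue" resolve to the same definition, which is what synonym governance means in practice.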

What researchers are actually saying about "hallucination"

The term "hallucination" has been doing a lot of heavy lifting in this conversation — and most of it is imprecise. There's an important distinction that tends to get lost.

When a general-purpose language model invents a fact from its training data — fabricating a statistic, attributing a quote, making up a case study — that's a model hallucination. That's a real model-level risk.

When a conversational analytics system queries a governed data warehouse and returns a wrong number, that's almost never the model. It's a data governance failure. The model reported faithfully what it found. What it found was wrong.

"Most so-called LLM hallucinations inside companies stem from outdated, inconsistent, or poorly retrieved enterprise data — not from defective models."

B-EYE, "LLMs Aren't Hallucinating — Your Enterprise Data Is Gaslighting Them" (2025)

insightsoftware says it differently, but lands in the same place. They trace analytics AI errors back to: no live connection to actual systems, missing business logic, governance gaps, and enterprise complexity without semantic documentation. Not one of those is a model problem. Every one of them is something your data team has been managing — or not managing — for years.

What is NL2SQL — and where did it come from?

NL2SQL (Natural Language to SQL) is the core technology behind conversational analytics. A user types a question in plain English — "What was my Q1 Southwest revenue?" — and the system translates it into a SQL query that runs against the data warehouse. The answer comes back as a number, a table, or a chart.

The concept goes back to research in the 1970s, when early systems like LUNAR tried to let scientists query databases in plain language. For decades it stayed largely academic — the models weren't capable enough and schemas weren't documented well enough for it to be reliable at scale.

That changed around 2022 and 2023. Large language models became capable enough to generate syntactically correct SQL from natural language with reasonable consistency, and the major data platforms — Snowflake, Databricks, Google, Microsoft — began building NL2SQL into their analytics products. By 2025 it was a standard feature in most enterprise BI platforms.

The important thing to understand: NL2SQL doesn't invent data. It queries your existing warehouse. When the answer is wrong, the question is almost always about what it queried — not how it translated the question.
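A small sketch shows the point, using Python's built-in sqlite3 module as a stand-in for a warehouse. The table, the rows, and the generated query are invented for illustration. The translation from question to SQL is exactly what you would want, and the answer is still wrong because one order was loaded twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_orders ("
    "order_id INTEGER, region TEXT, quarter TEXT, revenue_amount REAL)"
)
rows = [
    (1, "Southwest", "Q1", 600_000.0),
    (2, "Southwest", "Q1", 400_000.0),
    (2, "Southwest", "Q1", 400_000.0),  # duplicate load of order 2
]
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)

# The SQL an NL2SQL layer might reasonably generate for
# "What was my Q1 Southwest revenue?" (hypothetical output).
GENERATED_SQL = """
    SELECT SUM(revenue_amount) FROM fact_orders
    WHERE region = 'Southwest' AND quarter = 'Q1'
"""
reported = conn.execute(GENERATED_SQL).fetchone()[0]

# The same question asked of de-duplicated rows.
DEDUPED_SQL = """
    SELECT SUM(revenue_amount) FROM (
        SELECT DISTINCT order_id, region, quarter, revenue_amount
        FROM fact_orders
    ) AS deduped
    WHERE region = 'Southwest' AND quarter = 'Q1'
"""
expected = conn.execute(DEDUPED_SQL).fetchone()[0]

print(f"reported: {reported:,.0f}  expected: {expected:,.0f}")
```

The generated query is a faithful translation of the question. The $400,000 overstatement comes entirely from the duplicate row, and a dashboard tile built on the same table would show the same inflated figure.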

Researchers building NL2SQL systems have measured this directly. Systems querying a well-governed semantic layer achieve accuracy above 85%. The same systems against raw, undocumented schemas fall to 40–60%. The model hasn't changed. The governance has. That gap of roughly 25 to 45 points comes from how well the data is governed and documented, not from the model.

The key insight

The fastest path to reducing wrong answers from your conversational analytics system is not a better model. It is better data — specifically: tested, documented, certified data with formal metric definitions. B-EYE puts it plainly: "Ensure data quality for LLMs with the same rigor as data in analytics dashboards. If it's outdated or wrong, it leads to bad outputs."

There's a fair argument that a chatbot is actually more transparent

Here's something that doesn't get said enough. A well-built conversational analytics system can show the user the exact SQL query it ran, the tables it touched, and a timestamp for when the data was last refreshed. Most dashboards don't do that. The underlying metric formula is often buried three levels deep in a calculated field that most users never see.
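As a rough sketch of what surfacing provenance could look like, the Python below attaches the SQL, the tables, and a refresh timestamp to every answer. The Answer class, the answer_with_provenance helper, the toy table, and the refresh time are all hypothetical. A real platform would pull the table list and freshness from its own metadata rather than take them as arguments.

```python
import sqlite3
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Answer:
    value: float
    sql: str                     # exact query that produced the value
    tables: list[str]            # tables the query touched
    data_refreshed_at: datetime  # when the underlying data was last loaded

    def render(self) -> str:
        """Format the answer together with its provenance."""
        return (
            f"Answer: {self.value:,.0f}\n"
            f"SQL: {self.sql}\n"
            f"Tables: {', '.join(self.tables)}\n"
            f"Data refreshed: {self.data_refreshed_at.isoformat()}"
        )


def answer_with_provenance(conn, sql, tables, refreshed_at) -> Answer:
    """Run the query and keep everything needed to audit the result."""
    value = conn.execute(sql).fetchone()[0]
    return Answer(value, sql, tables, refreshed_at)


# Toy warehouse so the sketch runs end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (region TEXT, revenue_amount REAL)")
conn.execute("INSERT INTO fact_orders VALUES ('Southwest', 1000000.0)")

answer = answer_with_provenance(
    conn,
    "SELECT SUM(revenue_amount) FROM fact_orders WHERE region = 'Southwest'",
    ["fact_orders"],
    datetime(2025, 4, 1, 6, 0, tzinfo=timezone.utc),  # illustrative load time
)
print(answer.render())
```

The number never travels without the query and the refresh time that produced it, so a user who doubts the figure has something concrete to check.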

Done right, a conversational interface is more auditable, not less. That's not the norm today. But it's where the better platforms are heading, and it should inform how you build yours.

What Is Different

Where it is actually different

I want to be honest about this part. The hypothesis — that the QA discipline is the same — holds up. But there are three things that genuinely change when you move from dashboards to conversational analytics, and you need to understand them before your team deploys anything to production.

1. The wrong number sounds authoritative

A broken dashboard looks broken. A value of zero where you'd expect $1M is visually jarring. The chart is empty. The filter looks odd. The user senses something is off before they act on it.

A conversational system that returns a wrong number says: "Revenue in Q1 Southwest was $850,000." In full English. Formatted as a confident assertion. With no visual cue that anything is wrong.

Research from Master of Code found that users are about 30% more likely to trust incorrect information when it's presented as AI-generated output — compared to the same information in a traditional format. And trust drops approximately 20% after a user discovers the AI got it wrong. The error itself may be identical to a bad dashboard cell. The organizational damage when it surfaces is not.

2. There is no human buffer

Most BI environments have an implicit quality gate. A dashboard is built by an analyst who has, ideally, validated the numbers against source. The BI team reviews before publishing. The VP of Finance questions the Q3 figure before the board meeting. These are imperfect gates, but they exist.

Conversational analytics is designed to remove that buffer. That's the value proposition. Any employee can ask any question and get an answer without routing through an analyst. That directness is what makes it powerful. It's also why the data foundation needs to be better than what underpins the average dashboard — not the same.

3. The blast radius is wider

A broken dashboard is typically discovered by the people who use that dashboard. A known population, a predictable failure mode, and usually someone in the chain who can contextualize what went wrong.

A broken conversational analytics answer can surface to any user, on any question, with no advance warning about which queries are fragile. The first time you find out there's a problem might be when an executive asks a question in a board meeting and the chatbot returns a number that doesn't match the slide deck.

The uncomfortable implication

Conversational analytics doesn't create new data quality problems. It reveals the ones you already have. The chatbot reports faithfully from a data layer that was never as trustworthy as your dashboards made it look. The scrutiny is appropriate — just redirect it toward your data foundation, not toward the AI.

Go-Live Checklist

A checklist before you go live

This covers both the data layer and the conversational-specific additions. The tag on each item indicates whether it is primarily a team responsibility, a CDO decision point, or both.

All fact tables have uniqueness + not-null tests on surrogate keys [Team]
All foreign key relationships are tested and passing [Team]
Data freshness SLAs are defined and monitored for every exposed table [Both]
All quantitative measures have range validation (no negative revenue, etc.) [Team]
Every column exposed to the conversational system has a plain-English description [Team]
Every metric has a formal definition: calculation, grain, dimensions, owner [Both]
Non-additive metrics (ratios, percentages) are protected from incorrect SUM() [Team]
Ambiguous synonyms are explicitly mapped (revenue vs. ARR vs. MRR) [Team]
A certification process exists, so only approved metrics are exposed [CDO Decision]
Business owners have reviewed and signed off on metric definitions [CDO Decision]
Row-level security is configured correctly and tested by role [Both]
Query validation is in place: the system checks that SQL ran and returned results [Team]
Provenance is surfaced in every answer (SQL used, tables queried, data freshness) [Both]
A user feedback mechanism exists to flag incorrect answers [CDO Decision]
A response protocol is defined for when data quality failures are confirmed [CDO Decision]
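
For the data-layer items on this list, here is a minimal sketch of what the tests check, written as plain SQL run through Python's sqlite3 module. Teams on dbt would normally express these as data tests (unique, not_null, relationships) rather than hand-written queries. The table names, column names, and sample rows below are invented so the example has something to fail on.

```python
import sqlite3

# Each check returns the number of failing rows; zero means the test passes.
CHECKS = {
    "order_key is unique": """
        SELECT COUNT(*) FROM (
            SELECT order_key FROM fact_orders
            GROUP BY order_key HAVING COUNT(*) > 1
        ) AS dupes
    """,
    "order_key is not null": """
        SELECT COUNT(*) FROM fact_orders WHERE order_key IS NULL
    """,
    "every order links to a region": """
        SELECT COUNT(*) FROM fact_orders AS o
        LEFT JOIN dim_region AS r ON o.region_key = r.region_key
        WHERE r.region_key IS NULL
    """,
    "revenue is non-negative": """
        SELECT COUNT(*) FROM fact_orders WHERE revenue_amount < 0
    """,
}


def run_checks(conn: sqlite3.Connection) -> dict[str, int]:
    """Run every check and return failing-row counts keyed by check name."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


# Deliberately flawed sample data (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_region (region_key INTEGER, region_name TEXT)")
conn.execute(
    "CREATE TABLE fact_orders ("
    "order_key INTEGER, region_key INTEGER, revenue_amount REAL)"
)
conn.execute("INSERT INTO dim_region VALUES (10, 'Southwest')")
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?, ?)",
    [
        (1, 10, 500.0),
        (2, 10, 250.0),
        (2, 99, -40.0),     # duplicate key, orphaned region, negative revenue
        (None, 10, 100.0),  # missing surrogate key
    ],
)

results = run_checks(conn)
for name, failures in results.items():
    print(f"{'PASS' if failures == 0 else 'FAIL'}  {name}  ({failures} failing)")
```

Against this sample data every check fails. None of these tests knows or cares whether the table feeds a dashboard or a chatbot.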

A note on the CDO decisions

The items tagged as CDO decisions are not technical; they're governance decisions. Who owns a metric? What does certification mean at your organization? Who decides when a metric is ready for broad access? These require your judgment, not just your team's execution. They're also the decisions organizations most often skip, and skipping them is a common reason conversational analytics systems struggle.

Org Reality

The organizational reality

There's one more thing worth naming, because it shapes everything else.

The reason conversational analytics faces so much more scrutiny than dashboards isn't really about the technology. It's about accountability and attribution.

When a dashboard shows a wrong number, accountability is distributed. The data engineer built the pipeline. The analyst built the report. The BI team published it. By the time someone finds the error, it's a "data issue" — an expected, occasional occurrence that everyone has learned to tolerate.

When a chatbot gives a wrong answer, the attribution is immediate: "the AI got it wrong." There's no distribution of blame. The technology becomes the story, not the data quality failure that caused it. Organizations that tolerated broken dashboards for years are suddenly demanding perfection from the chatbot. The standard hasn't changed. The attribution has.

I think this is actually useful information for a CDO. It tells you where the political risk lives. It also tells you that the investment in data quality you make for conversational analytics will be visible and credited in a way that the same investment in dashboards never was.

Use that. The justification for finally getting your semantic layer right, finally certifying your metrics, finally documenting what "revenue" actually means across your organization — that justification is much easier to make when a chatbot is on the line than when another dashboard is.

The framing that works internally

"We need to get the data right before we expose it through a conversational interface" is an easier ask to leadership than "we need to get the data right before we build the next dashboard." It's the same work. It gets funded differently. If you're building conversational analytics, use the moment to fix the foundation your whole analytics stack sits on.

Bottom Line

Where I land on this

The root cause of a wrong number is almost always in the data. It doesn't matter whether that number shows up on a bar chart or gets spoken by a chatbot. Stale data, undefined metrics, duplicate records, and inconsistent fiscal calendars break both delivery mechanisms equally. The QA discipline required is the same.

What's different is what happens when it goes wrong. A chatbot speaks with authority. There's no analyst in the loop. The blast radius is wider. The trust damage is faster and harder to recover from. None of that changes the root cause — it changes the cost of the failure.

So the answer to the CDO's core question is: no, you don't need a fundamentally different quality framework for conversational analytics. You need the quality framework you should have had all along — applied with more urgency, a formal semantic layer, and a user feedback loop that closes the gap between when something goes wrong and when your team finds out.

The chatbot didn't get it wrong. Your data did. The good news is that means you already know how to fix it.

Sources

Industry LLMs Aren't Hallucinating — Your Enterprise Data Is Gaslighting Them — B-EYE. Central argument: enterprise analytics hallucinations are data governance failures, not model failures.
Industry What Is Causing AI Hallucinations With Analytics? — insightsoftware. Traces analytics AI errors to data governance root causes; draws parallel to "Excel hell" data quality problems.
Industry Data Quality Is Not a Checkbox — Anblicks. Argues that every data quality defect becomes a trust defect when AI is the delivery mechanism.
Research NL2SQL is a solved problem... Not! — CIDR 2024. Documents the 40-point accuracy gap between NL2SQL on governed semantic layers vs. raw schemas.
Research Boundary-Aware NL2SQL: Integrating Reliability through Hybrid — arxiv 2025. Reliability architecture for production NL2SQL systems, including abstention and clarification patterns.
Platform Enterprise NL2SQL with Semantic Enrichment — Oracle Cloud Infrastructure. Semantic layer as a shared quality propagation mechanism across all NL2SQL pipeline stages.
Platform Conversational Analytics vs Traditional BI Dashboards — Lumi AI. Notes that inline SQL provenance in conversational systems can exceed the auditability of traditional dashboards.
Framework dbt Data Tests Documentation — dbt Labs. Standard reference for model-level data quality testing applicable to both dashboard and conversational analytics data layers.
Framework MetricFlow Metrics Overview — dbt Labs. Semantic layer metric definitions that function as quality contracts for any downstream delivery mechanism.