I'm new to conversational analytics. Twenty-five years in data taught me it doesn't matter.
I haven't spent my career in conversational analytics. I've spent it in data — building pipelines, standing up warehouses, sitting in rooms where someone points at a number on a dashboard and asks why it doesn't match what the source system shows.
That question is 25 years old for me. And when I started looking seriously at conversational analytics, what struck me wasn't how different the problem was. It was how familiar it felt.
The fundamental question in data quality has never changed. Does the number we're displaying match the source system? If it does, the data is right. If it doesn't, the data is wrong. That's true whether the number lives on a report, a dashboard, or gets spoken back to a user by a chatbot.
The delivery mechanism has changed. The question hasn't.
That's the central idea in this brief. The fear around AI chatbots giving wrong answers is real — but most of it is pointed at the wrong thing. The risk isn't the AI. It's the data foundation the AI is sitting on top of. And that's a problem every CDO already knows how to think about.
The question you're actually asking
When a CDO approaches conversational analytics, there's usually one underlying concern they're trying to work through:
"If my source system shows $1M in Q1 Southwest revenue, and a user asks the chatbot that same question and gets $1M back — great. But is there something about the AI itself that makes that number less trustworthy than if it showed up on a dashboard?"
It's a fair question. The answer, based on how these systems actually work, is: not really. And the evidence from how NL2SQL (natural language to SQL) and conversational analytics systems are built bears that out.
But there are real differences in what happens when things go wrong. That's where it gets nuanced — and where your team needs to be prepared. I'll get to that.
First, let me explain why the core hypothesis holds.
It's the same problem. Most of it, anyway.
Here's a simple test. Look at this list of root causes for wrong numbers in analytics. For each one, ask yourself: does this break a dashboard? Does it break a chatbot?
| Root Cause | Breaks a Dashboard? | Breaks a Chatbot? |
|---|---|---|
| Data loaded late — ETL job failed | Yes | Yes |
| Duplicate records inflating aggregations | Yes | Yes |
| "Revenue" defined differently in two models | Yes | Yes |
| AVG() skewed by nulls | Yes | Yes |
| Orders not linked to a region (orphaned FK) | Yes | Yes |
| Fiscal calendar applied inconsistently | Yes | Yes |
| Row-level security misconfigured | Yes | Yes |
| Dynamic query picks the wrong column for "revenue" | No (static query) | Yes (without a semantic layer) |
| Same question phrased differently hits different logic | No | Yes (without synonym governance) |
Seven of nine root causes are identical. The two that are genuinely new to conversational analytics are both addressed by the semantic layer — the part of the stack that maps business terms to verified database objects. If "revenue" is formally defined as SUM(revenue_amount) from FACT_ORDERS where returns are excluded, the chatbot can't improvise. It works from a controlled vocabulary, the same way a well-built dashboard metric does.
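To make the "can't improvise" point concrete, here is a minimal sketch of a semantic layer in Python. It is an illustration under assumptions, not how any particular platform implements it: the `Metric` class, the `SYNONYMS` map, and the `is_return` column are hypothetical names I've introduced. Only SUM(revenue_amount) and FACT_ORDERS come from the definition above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A governed metric: one business term, one verified SQL definition."""
    name: str
    expression: str  # aggregate expression over verified columns
    table: str
    filters: tuple[str, ...] = ()


# Hypothetical catalogue. The is_return column is assumed for illustration.
METRICS = {
    "revenue": Metric(
        name="revenue",
        expression="SUM(revenue_amount)",
        table="FACT_ORDERS",
        filters=("is_return = 0",),
    ),
}

# Synonym governance: every accepted phrasing maps to exactly one metric.
SYNONYMS = {
    "revenue": "revenue",
    "sales": "revenue",
    "net sales": "revenue",
}


def resolve_metric(term: str) -> Metric:
    """Return the governed metric for a business term, or refuse to guess."""
    key = SYNONYMS.get(term.strip().lower())
    if key is None:
        raise KeyError(f"'{term}' is not a governed business term")
    return METRICS[key]


def compile_sql(term: str) -> str:
    """Build SQL from the governed definition; the model never picks columns."""
    metric = resolve_metric(term)
    sql = f"SELECT {metric.expression} AS {metric.name} FROM {metric.table}"
    if metric.filters:
        sql += " WHERE " + " AND ".join(metric.filters)
    return sql


print(compile_sql("Sales"))
```

In this sketch, "sales" and "revenue" compile to the same statement, and an ungoverned term such as "turnover" raises an error instead of producing a best guess. Those are the two behaviors the last two rows of the table depend on.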
What researchers are actually saying about "hallucination"
The term "hallucination" has been doing a lot of heavy lifting in this conversation, and most uses of it are imprecise. There's an important distinction that tends to get lost.
When a general-purpose language model invents a fact from its training data — fabricating a statistic, attributing a quote, making up a case study — that's a model hallucination. That's a real model-level risk.
When a conversational analytics system queries a governed data warehouse and returns a wrong number, that's almost never the model. It's a data governance failure. The model reported faithfully what it found. What it found was wrong.
"Most so-called LLM hallucinations inside companies stem from outdated, inconsistent, or poorly retrieved enterprise data — not from defective models."
insightsoftware says it differently, but lands in the same place. They trace analytics AI errors back to four causes: no live connection to actual systems, missing business logic, governance gaps, and enterprise complexity without semantic documentation. Not one of those is a model problem. Every one of them is something your data team has been managing, or not managing, for years.
NL2SQL (Natural Language to SQL) is the core technology behind conversational analytics. A user types a question in plain English — "What was my Q1 Southwest revenue?" — and the system translates it into a SQL query that runs against the data warehouse. The answer comes back as a number, a table, or a chart.
The concept goes back to research in the 1970s, when early systems like LUNAR tried to let scientists query databases in plain language. For decades it stayed largely academic — the models weren't capable enough and schemas weren't documented well enough for it to be reliable at scale.
That changed around 2022 and 2023. Large language models became capable enough to generate syntactically correct SQL from natural language with reasonable consistency, and the major data platforms — Snowflake, Databricks, Google, Microsoft — began building NL2SQL into their analytics products. By 2025 it was a standard feature in most enterprise BI platforms.
The important thing to understand: NL2SQL doesn't invent data. It queries your existing warehouse. When the answer is wrong, the cause is almost always what it queried, not how it translated the question.
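A small sketch shows what "what it queried, not how it translated" looks like in practice. It uses Python's built-in sqlite3 module as a stand-in for the warehouse. The table layout, the column names, and the SQL string are assumptions made for illustration. The translation step is stubbed out as a fixed query, because the point is that the query never changes between the two runs.

```python
import sqlite3

# Hypothetical output of the translation step for the question
# "What was my Q1 Southwest revenue?" (table and column names are assumed).
GENERATED_SQL = """
    SELECT SUM(revenue_amount)
    FROM fact_orders
    WHERE region = 'Southwest' AND fiscal_quarter = 'Q1'
"""


def build_warehouse(rows):
    """Create an in-memory stand-in for the warehouse and load order rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_orders ("
        "order_id INTEGER, region TEXT, fiscal_quarter TEXT, revenue_amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    return conn


def answer(conn):
    """Run the generated SQL and return the single number it produces."""
    return conn.execute(GENERATED_SQL).fetchone()[0]


clean_rows = [
    (1, "Southwest", "Q1", 600_000.0),
    (2, "Southwest", "Q1", 400_000.0),
    (3, "Northeast", "Q1", 250_000.0),
]
# Same data, except a failed load retry wrote order 2 twice.
duplicated_rows = clean_rows + [(2, "Southwest", "Q1", 400_000.0)]

clean_total = answer(build_warehouse(clean_rows))
inflated_total = answer(build_warehouse(duplicated_rows))
print(clean_total, inflated_total)
```

The identical query returns 1,000,000 against the clean table and 1,400,000 against the one with the duplicate row. Nothing about the translation changed. A dashboard tile running the same SQL would show the same inflated figure.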
Researchers building NL2SQL systems have measured this directly. Systems querying a well-governed semantic layer achieve accuracy above 85%. The same systems against raw, undocumented schemas fall to 40–60%. The model hasn't changed. The governance has. That gap, somewhere between 25 and 45 points, comes from governance and documentation, not from the model.
The fastest path to reducing wrong answers from your conversational analytics system is not a better model. It is better data: tested, documented, certified data with formal metric definitions. B-EYE puts it plainly: "Ensure data quality for LLMs with the same rigor as data in analytics dashboards. If it's outdated or wrong, it leads to bad outputs."
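As a rough idea of what "tested" can mean, the sketch below runs three checks that correspond to root causes from the earlier table: duplicate records, nulls that AVG() skips, and orders with no matching region. It again uses sqlite3 as a stand-in warehouse. The schema, the check names, and the zero-offenders certification rule are all assumptions for illustration. Real teams typically express checks like these in whatever data-testing tool they already use.

```python
import sqlite3


def run_checks(conn):
    """Return a dict of check name -> number of offending rows."""
    checks = {
        # Duplicate order_ids inflate any SUM or COUNT built on the table.
        "duplicate_orders": """
            SELECT COUNT(*) FROM (
                SELECT order_id FROM fact_orders
                GROUP BY order_id HAVING COUNT(*) > 1
            )
        """,
        # NULL amounts are skipped by AVG(), so the mean ignores those orders.
        "null_revenue": """
            SELECT COUNT(*) FROM fact_orders WHERE revenue_amount IS NULL
        """,
        # Orders whose region_id has no match in dim_region drop out of joins.
        "orphaned_region": """
            SELECT COUNT(*) FROM fact_orders o
            LEFT JOIN dim_region r ON r.region_id = o.region_id
            WHERE r.region_id IS NULL
        """,
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


def is_certified(results):
    """A table is certified only when every check finds zero offending rows."""
    return all(count == 0 for count in results.values())


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region (region_id INTEGER, name TEXT);
    CREATE TABLE fact_orders (order_id INTEGER, region_id INTEGER, revenue_amount REAL);
    INSERT INTO dim_region VALUES (1, 'Southwest');
    INSERT INTO fact_orders VALUES (1, 1, 600000), (2, 1, 400000),
                                   (2, 1, 400000), (3, 9, NULL);
""")
results = run_checks(conn)
print(results, is_certified(results))
```

With the sample rows above, each check finds one offending row, so the table would not be certified for the chatbot or for a dashboard.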
There's a fair argument that a chatbot is actually more transparent
Here's something that doesn't get said enough. A well-built conversational analytics system can show the user the exact SQL query it ran, the tables it touched, and a timestamp for when the data was last refreshed. Most dashboards don't do that. The underlying metric formula is often buried three levels deep in a calculated field that most users never see.
Done right, a conversational interface is more auditable, not less. That's not the norm today. But it's where the better platforms are heading, and it should inform how you build yours.
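One way to picture that auditability is an answer object that always carries its own evidence. The sketch below is hypothetical: the `AuditedAnswer` class, its field names, and the sample values (including the refresh timestamp) are made up for illustration and are not taken from any product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditedAnswer:
    """An answer bundled with the evidence a user needs to verify it."""
    value: float
    sql: str
    tables: tuple[str, ...]
    refreshed_at: datetime  # when the underlying data was last loaded

    def render(self) -> str:
        """Format the number together with its query, sources, and freshness."""
        return (
            f"Answer: ${self.value:,.0f}\n"
            f"Query: {self.sql}\n"
            f"Tables: {', '.join(self.tables)}\n"
            f"Data refreshed: {self.refreshed_at.isoformat()}"
        )


# Illustrative values only.
example = AuditedAnswer(
    value=1_000_000,
    sql="SELECT SUM(revenue_amount) FROM FACT_ORDERS WHERE region = 'Southwest'",
    tables=("FACT_ORDERS",),
    refreshed_at=datetime(2025, 1, 15, 6, 0, tzinfo=timezone.utc),
)
print(example.render())
```

The design choice worth noting is that the query, the tables, and the refresh time are required fields. An answer cannot be constructed without them, so the interface has no way to return a bare number.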
Where it is actually different
I want to be honest about this part. The hypothesis — that the QA discipline is the same — holds up. But there are three things that genuinely change when you move from dashboards to conversational analytics, and you need to understand them before your team deploys anything to production.
1. The wrong number sounds authoritative
A broken dashboard looks broken. A value of zero where you'd expect $1M is visually jarring. The chart is empty. The filter looks odd. The user senses something is off before they act on it.
A conversational system that returns a wrong number says: "Revenue in Q1 Southwest was $850,000." In full English. Formatted as a confident assertion. With no visual cue that anything is wrong.
Research from Master of Code found that users are about 30% more likely to trust incorrect information when it's presented as AI-generated output — compared to the same information in a traditional format. And trust drops approximately 20% after a user discovers the AI got it wrong. The error itself may be identical to a bad dashboard cell. The organizational damage when it surfaces is not.
2. There is no human buffer
Most BI environments have an implicit quality gate. A dashboard is built by an analyst who has, ideally, validated the numbers against the source. The BI team reviews before publishing. The VP of Finance questions the Q3 figure before the board meeting. These are imperfect gates, but they exist.
Conversational analytics is designed to remove that buffer. That's the value proposition. Any employee can ask any question and get an answer without routing through an analyst. That directness is what makes it powerful. It's also why the data foundation needs to be better than what underpins the average dashboard — not the same.
3. The blast radius is wider
A broken dashboard is typically discovered by the people who use that dashboard. A known population, a predictable failure mode, and usually someone in the chain who can contextualize what went wrong.
A broken conversational analytics answer can surface to any user, on any question, with no advance warning about which queries are fragile. The first time you find out there's a problem might be when an executive asks a question in a board meeting and the chatbot returns a number that doesn't match the slide deck.
Conversational analytics doesn't create new data quality problems. It reveals the ones you already have. The chatbot reports faithfully from a data layer that was never as trustworthy as your dashboards made it look. The scrutiny is appropriate — just redirect it toward your data foundation, not toward the AI.