
I'm new to conversational analytics. Twenty-five years in data taught me it doesn't matter.

I haven't spent my career in conversational analytics. I've spent it in data — building pipelines, standing up warehouses, sitting in rooms where someone points at a number on a dashboard and asks why it doesn't match what the source system shows.

That question is 25 years old for me. And when I started looking seriously at conversational analytics, what struck me wasn't how different the problem was. It was how familiar it felt.

The fundamental question in data quality has never changed. Does the number we're displaying match the source system? If it does, the data is right. If it doesn't, the data is wrong. That's true whether the number lives on a report, a dashboard, or gets spoken back to a user by a chatbot.

The delivery mechanism has changed. The question hasn't.

That's the central idea in this brief. The fear around AI chatbots giving wrong answers is real — but most of it is pointed at the wrong thing. The risk isn't the AI. It's the data foundation the AI is sitting on top of. And that's a problem every CDO already knows how to think about.

The question you're actually asking

When a CDO approaches conversational analytics, there's usually one underlying concern they're trying to work through:

"If my source system shows $1M in Q1 Southwest revenue, and a user asks the chatbot that same question and gets $1M back — great. But is there something about the AI itself that makes that number less trustworthy than if it showed up on a dashboard?"

It's a fair question. The answer, based on how these systems actually work, is: not really. And the evidence from how NL2SQL and conversational analytics systems are built bears that out.

But there are real differences in what happens when things go wrong. That's where it gets nuanced — and where your team needs to be prepared. I'll get to that.

First, let me explain why the core hypothesis holds.

It's the same problem. Most of it, anyway.

Here's a simple test. Look at this list of root causes for wrong numbers in analytics. For each one, ask yourself: does this break a dashboard? Does it break a chatbot?

Root Cause | Breaks a Dashboard? | Breaks a Chatbot?
Data loaded late (ETL job failed) | Yes | Yes
Duplicate records inflating aggregations | Yes | Yes
"Revenue" defined differently in two models | Yes | Yes
AVG() skewed by nulls | Yes | Yes
Orders not linked to a region (orphaned FK) | Yes | Yes
Fiscal calendar applied inconsistently | Yes | Yes
Row-level security misconfigured | Yes | Yes
Dynamic query picks the wrong column for "revenue" | No (static query) | Yes (without a semantic layer)
Same question phrased differently hits different logic | No | Yes (without synonym governance)

Seven of nine root causes are identical. The two that are genuinely new to conversational analytics are both addressed by the semantic layer — the part of the stack that maps business terms to verified database objects. If "revenue" is formally defined as SUM(revenue_amount) from FACT_ORDERS where returns are excluded, the chatbot can't improvise. It works from a controlled vocabulary, the same way a well-built dashboard metric does.
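
To make the controlled-vocabulary idea concrete, here is a minimal sketch in Python. It is not how dbt, Cortex Analyst, or LookML implement a semantic layer. The catalog, the synonym map, the `is_return` column, and the `finance-data` owner are all hypothetical names I made up for illustration. The point is only that every phrasing resolves to one governed definition, and an ungoverned term is refused rather than guessed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str            # aggregation over a verified column
    table: str                 # verified database object
    filters: Tuple[str, ...]   # conditions that are part of the definition
    owner: str


# Hypothetical governed catalog: one formal definition per business term.
METRICS = {
    "revenue": Metric(
        name="revenue",
        expression="SUM(revenue_amount)",
        table="FACT_ORDERS",
        filters=("is_return = 0",),  # assumed column for "returns are excluded"
        owner="finance-data",
    ),
}

# Hypothetical synonym governance: different phrasings, same metric.
SYNONYMS = {"revenue": "revenue", "sales": "revenue", "turnover": "revenue"}


def resolve_metric(term: str) -> Metric:
    """Map a user's word to a governed metric, or refuse."""
    canonical = SYNONYMS.get(term.strip().lower())
    if canonical is None:
        raise KeyError(f"'{term}' is not a governed business term")
    return METRICS[canonical]


def compile_sql(term: str, extra_filters: Tuple[str, ...] = ()) -> str:
    """Build SQL from the definition; the caller cannot change the formula."""
    metric = resolve_metric(term)
    conditions = metric.filters + tuple(extra_filters)
    where = f" WHERE {' AND '.join(conditions)}" if conditions else ""
    return f"SELECT {metric.expression} AS {metric.name} FROM {metric.table}{where}"


sql_a = compile_sql("Sales", ("region = 'Southwest'",))
sql_b = compile_sql("revenue", ("region = 'Southwest'",))
print(sql_a)
print(sql_a == sql_b)
```

"Sales" and "revenue" compile to the same query, and a term like "profit" that nobody has defined raises an error instead of producing a plausible-looking number. A production system would pass filter values as bound parameters rather than raw strings.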

What researchers are actually saying about "hallucination"

The term "hallucination" has been doing a lot of heavy lifting in this conversation — and most of it is imprecise. There's an important distinction that tends to get lost.

When a general-purpose language model invents a fact from its training data — fabricating a statistic, attributing a quote, making up a case study — that's a model hallucination. That's a real model-level risk.

When a conversational analytics system queries a governed data warehouse and returns a wrong number, that's almost never the model. It's a data governance failure. The model reported faithfully what it found. What it found was wrong.

"Most so-called LLM hallucinations inside companies stem from outdated, inconsistent, or poorly retrieved enterprise data — not from defective models."

B-EYE, "LLMs Aren't Hallucinating — Your Enterprise Data Is Gaslighting Them" (2025)

insightsoftware says it differently, but lands in the same place. They trace analytics AI errors back to: no live connection to actual systems, missing business logic, governance gaps, and enterprise complexity without semantic documentation. Not one of those is a model problem. Every one of them is something your data team has been managing — or not managing — for years.

What is NL2SQL — and where did it come from?

NL2SQL (Natural Language to SQL) is the core technology behind conversational analytics. A user types a question in plain English — "What was my Q1 Southwest revenue?" — and the system translates it into a SQL query that runs against the data warehouse. The answer comes back as a number, a table, or a chart.

The concept goes back to research in the 1970s, when early systems like LUNAR tried to let scientists query databases in plain language. For decades it stayed largely academic — the models weren't capable enough and schemas weren't documented well enough for it to be reliable at scale.

That changed around 2022 and 2023. Large language models became capable enough to generate syntactically correct SQL from natural language with reasonable consistency, and the major data platforms — Snowflake, Databricks, Google, Microsoft — began building NL2SQL into their analytics products. By 2025 it was a standard feature in most enterprise BI platforms.

The important thing to understand: NL2SQL doesn't invent data. It queries your existing warehouse. When the answer is wrong, the question is almost always about what it queried — not how it translated the question.
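
A small illustration of that point, using Python's built-in `sqlite3` module and a made-up `fact_orders` table with one order loaded twice. The table, the figures, and the two function names are assumptions for the sketch. Both the "dashboard" and the "chatbot" run the same SQL, so both report the same inflated number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, region TEXT, revenue_amount REAL)"
)
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?, ?)",
    [
        (1, "Southwest", 400_000.0),
        (2, "Southwest", 600_000.0),
        (2, "Southwest", 600_000.0),  # order 2 loaded twice by a faulty job
    ],
)

REVENUE_SQL = "SELECT SUM(revenue_amount) FROM fact_orders WHERE region = ?"


def dashboard_tile(region: str) -> float:
    """What a static dashboard query would display."""
    return conn.execute(REVENUE_SQL, (region,)).fetchone()[0]


def chatbot_answer(region: str) -> str:
    """What a conversational layer would say after running the same SQL."""
    value = conn.execute(REVENUE_SQL, (region,)).fetchone()[0]
    return f"Revenue in {region} was ${value:,.0f}."


def deduplicated_revenue(region: str) -> float:
    """The figure the source system would show, with the duplicate removed."""
    sql = (
        "SELECT SUM(revenue_amount) FROM "
        "(SELECT DISTINCT order_id, region, revenue_amount FROM fact_orders) "
        "WHERE region = ?"
    )
    return conn.execute(sql, (region,)).fetchone()[0]


print(dashboard_tile("Southwest"))
print(chatbot_answer("Southwest"))
print(deduplicated_revenue("Southwest"))
```

The translation step is not where the error enters. The duplicate row is, and it inflates both delivery mechanisms by the same amount.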

Researchers building NL2SQL systems have measured this directly. Systems querying a well-governed semantic layer achieve accuracy above 85%. The same systems against raw, undocumented schemas fall to 40–60%. The model hasn't changed. The governance has. That gap of roughly 25 to 45 points comes from governance and documentation, not from the model.

The key insight

The fastest path to reducing wrong answers from your conversational analytics system is not a better model. It is better data — specifically: tested, documented, certified data with formal metric definitions. B-EYE puts it plainly: "Ensure data quality for LLMs with the same rigor as data in analytics dashboards. If it's outdated or wrong, it leads to bad outputs."

There's a fair argument that a chatbot is actually more transparent

Here's something that doesn't get said enough. A well-built conversational analytics system can show the user the exact SQL query it ran, the tables it touched, and a timestamp for when the data was last refreshed. Most dashboards don't do that. The underlying metric formula is often buried three levels deep in a calculated field that most users never see.

Done right, a conversational interface is more auditable, not less. That's not the norm today. But it's where the better platforms are heading, and it should inform how you build yours.
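
Here is one way that could look, as a rough sketch rather than any platform's actual design. The `Answer` class, the `load_audit` table, and its timestamps are hypothetical. The list of tables touched is supplied by the caller (in practice it would come from the metric definition) instead of being parsed out of the SQL.

```python
import sqlite3
from dataclasses import dataclass
from datetime import datetime
from typing import Sequence, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact_orders (order_id INTEGER, region TEXT, revenue_amount REAL);
    INSERT INTO fact_orders VALUES (1, 'Southwest', 400000), (2, 'Southwest', 600000);
    -- Hypothetical audit table written by the load job.
    CREATE TABLE load_audit (table_name TEXT, loaded_at TEXT);
    INSERT INTO load_audit VALUES ('fact_orders', '2025-04-01T06:00:00+00:00');
    """
)


@dataclass(frozen=True)
class Answer:
    value: float
    sql: str
    tables: Tuple[str, ...]
    data_refreshed_at: datetime

    def render(self) -> str:
        """The number plus everything a user needs to verify it."""
        return (
            f"{self.value:,.0f}\n"
            f"Query: {self.sql}\n"
            f"Tables: {', '.join(self.tables)}\n"
            f"Data as of: {self.data_refreshed_at.isoformat()}"
        )


def answer_with_provenance(sql: str, params: Sequence, tables: Sequence[str]) -> Answer:
    value = conn.execute(sql, tuple(params)).fetchone()[0]
    placeholders = ", ".join("?" for _ in tables)
    # Report the oldest load among the tables touched: the answer is only
    # as fresh as its stalest input.
    oldest = conn.execute(
        f"SELECT MIN(loaded_at) FROM load_audit WHERE table_name IN ({placeholders})",
        tuple(tables),
    ).fetchone()[0]
    return Answer(value, sql, tuple(tables), datetime.fromisoformat(oldest))


answer = answer_with_provenance(
    "SELECT SUM(revenue_amount) FROM fact_orders WHERE region = ?",
    ("Southwest",),
    ("fact_orders",),
)
print(answer.render())
```

The design choice worth noting is that the provenance travels with the value in a single object, so the interface cannot show the number without also having the query and the refresh time on hand.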

Where it is actually different

I want to be honest about this part. The hypothesis — that the QA discipline is the same — holds up. But there are three things that genuinely change when you move from dashboards to conversational analytics, and you need to understand them before your team deploys anything to production.

1. The wrong number sounds authoritative

A broken dashboard looks broken. A value of zero where you'd expect $1M is visually jarring. The chart is empty. The filter looks odd. The user senses something is off before they act on it.

A conversational system that returns a wrong number says: "Revenue in Q1 Southwest was $850,000." In full English. Formatted as a confident assertion. With no visual cue that anything is wrong.

Research from Master of Code found that users are about 30% more likely to trust incorrect information when it's presented as AI-generated output — compared to the same information in a traditional format. And trust drops approximately 20% after a user discovers the AI got it wrong. The error itself may be identical to a bad dashboard cell. The organizational damage when it surfaces is not.

2. There is no human buffer

Most BI environments have an implicit quality gate. A dashboard is built by an analyst who has, ideally, validated the numbers against source. The BI team reviews before publishing. The VP of Finance questions the Q3 figure before the board meeting. These are imperfect gates, but they exist.

Conversational analytics is designed to remove that buffer. That's the value proposition. Any employee can ask any question and get an answer without routing through an analyst. That directness is what makes it powerful. It's also why the data foundation needs to be better than what underpins the average dashboard — not the same.

3. The blast radius is wider

A broken dashboard is typically discovered by the people who use that dashboard. A known population, a predictable failure mode, and usually someone in the chain who can contextualize what went wrong.

A broken conversational analytics answer can surface to any user, on any question, with no advance warning about which queries are fragile. The first time you find out there's a problem might be when an executive asks a question in a board meeting and the chatbot returns a number that doesn't match the slide deck.

The uncomfortable implication

Conversational analytics doesn't create new data quality problems. It reveals the ones you already have. The chatbot reports faithfully from a data layer that was never as trustworthy as your dashboards made it look. The scrutiny is appropriate — just redirect it toward your data foundation, not toward the AI.

What your team needs to do

This is not a new quality framework. It's applying the discipline that should have already existed — with higher urgency, and a few additional steps specific to how conversational systems work.

Before you deploy anything

Run an honest data readiness assessment

The single most important thing your team can do before going live is audit what's actually in your warehouse. Not what you think is there — what is. Most organizations are surprised.

  • For every table you plan to expose to conversational queries: do all fact tables have uniqueness tests on their surrogate keys? Are foreign keys validated? Are there freshness checks with defined SLAs?
  • Do you have a metric definition document — not just field names, but what each metric means in plain English, how it's calculated, what its grain is, and who owns it?
  • Is there a "certified" flag — some mechanism that distinguishes metrics your data team has reviewed and approved from raw tables that shouldn't be publicly queried?
  • Are ambiguous synonyms documented? Does "revenue" mean gross, net, ARR, or something else in your organization? Does the semantic layer enforce one answer?

If the answer to most of these is "not really," that tells you where to invest before the chatbot ever shows up in production.
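
The first bullet above translates into checks your team can automate. The following is a minimal sketch using Python's built-in `sqlite3` module. The schema, the sample rows, and the 24-hour SLA are all assumptions for illustration; in practice these would more likely be tests in whatever framework your warehouse already uses.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_region (region_id INTEGER, region_name TEXT);
    CREATE TABLE fact_orders (
        order_key INTEGER, region_id INTEGER, revenue_amount REAL, loaded_at TEXT
    );
    INSERT INTO dim_region VALUES (1, 'Southwest');
    INSERT INTO fact_orders VALUES
        (100, 1, 400000, '2025-04-01T06:00:00+00:00'),
        (101, 1, 600000, '2025-04-01T06:00:00+00:00'),
        (101, 1, 600000, '2025-04-01T06:00:00+00:00'),  -- duplicate surrogate key
        (102, 9,  50000, '2025-04-01T06:00:00+00:00');  -- region 9 does not exist
    """
)


def duplicate_keys() -> int:
    """Uniqueness test: number of surrogate keys that appear more than once."""
    sql = (
        "SELECT COUNT(*) FROM (SELECT order_key FROM fact_orders "
        "GROUP BY order_key HAVING COUNT(*) > 1)"
    )
    return conn.execute(sql).fetchone()[0]


def orphaned_rows() -> int:
    """Foreign-key test: fact rows whose region is missing from the dimension."""
    sql = (
        "SELECT COUNT(*) FROM fact_orders f "
        "LEFT JOIN dim_region d ON f.region_id = d.region_id "
        "WHERE d.region_id IS NULL"
    )
    return conn.execute(sql).fetchone()[0]


def is_fresh(now: datetime, sla: timedelta) -> bool:
    """Freshness test: was the most recent load within the agreed SLA?"""
    latest = conn.execute("SELECT MAX(loaded_at) FROM fact_orders").fetchone()[0]
    return now - datetime.fromisoformat(latest) <= sla


now = datetime(2025, 4, 1, 12, 0, tzinfo=timezone.utc)
print("duplicate keys:", duplicate_keys())
print("orphaned rows:", orphaned_rows())
print("fresh within 24h:", is_fresh(now, timedelta(hours=24)))
```

A table that fails any of these is not ready to sit behind a chatbot, and arguably was never ready to sit behind a dashboard either.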

The three investments that matter most

Build the semantic layer. Document the metrics. Show the provenance.

Most of the conversational analytics quality work reduces to these three investments. Everything else is downstream of them.

  • A governed semantic layer. Every table and column exposed to the conversational system has a description. Every metric is formally defined with a calculation, grain, dimension list, and owner. Platforms like dbt Semantic Layer, Snowflake Cortex Analyst, and Looker LookML all give you a mechanism to enforce this. Without it, NL2SQL accuracy can fall from above 85% to 40–60%.
  • Metric certification gates. Not everything in your warehouse should be queryable by a chatbot. Create a process where metrics have to be reviewed and approved by a data owner before they're exposed to conversational tools. This is not bureaucracy — it's the equivalent of a BI team reviewing a dashboard before it goes to the business. Same principle, new interface.
  • Provenance displayed to users. Every answer the system returns should show what query it ran, what tables it touched, and when the data was last updated. This is the trust mechanism that makes conversational analytics defensible. When a user can see the SQL, they can verify it. That's a better audit trail than most dashboards provide.
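
The certification gate in the second bullet can be sketched in a few lines of Python. The class names, the `net_revenue` and `pipeline_value` metrics, and the rule that only the named owner may certify are assumptions for the example; your organization's approval rule is one of the governance decisions you will need to make yourself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MetricEntry:
    name: str
    sql_expression: str
    owner: str
    certified_by: Optional[str] = None  # None means "not yet reviewed"

    @property
    def certified(self) -> bool:
        return self.certified_by is not None


class MetricRegistry:
    def __init__(self) -> None:
        self._entries: Dict[str, MetricEntry] = {}

    def register(self, entry: MetricEntry) -> None:
        """New metrics start uncertified and invisible to the chatbot."""
        self._entries[entry.name] = entry

    def certify(self, name: str, reviewer: str) -> None:
        """Assumed rule for this sketch: only the metric's owner can approve it."""
        entry = self._entries[name]
        if reviewer != entry.owner:
            raise PermissionError(f"{reviewer} does not own metric '{name}'")
        entry.certified_by = reviewer

    def exposed_to_chatbot(self) -> List[str]:
        """The only metrics the conversational layer is allowed to see."""
        return sorted(n for n, e in self._entries.items() if e.certified)


registry = MetricRegistry()
registry.register(MetricEntry("net_revenue", "SUM(revenue_amount)", owner="finance-data"))
registry.register(MetricEntry("pipeline_value", "SUM(deal_amount)", owner="sales-ops"))
registry.certify("net_revenue", reviewer="finance-data")
print(registry.exposed_to_chatbot())
```

Only `net_revenue` is exposed. `pipeline_value` exists in the warehouse but stays out of reach until its owner signs off.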

Ongoing operations

Monitor, respond, and create a feedback loop

Data quality isn't a one-time project. It's a practice. A few things that need to be in place from day one:

  • Anomaly detection on key metrics. Automated alerts when aggregated values deviate significantly from prior periods without a known cause. This catches data loading failures before users do.
  • A user feedback mechanism. Users should be able to flag an answer as incorrect directly from the conversational interface. That flag routes to the data owner, who investigates the source. Without this loop, data issues go underground and surface at the worst possible moments.
  • A protocol for when quality fails. When a data quality issue is confirmed, remove the affected metric from the system's schema visibility immediately. Surface a message to users that data for that metric is under review. Fix, re-test, re-certify, re-expose. This is the same process you'd use for a broken dashboard — the difference is that with a chatbot, the response needs to happen faster because the blast radius is larger.
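
The first and third bullets can be wired together: an anomaly check that, when it fires, pulls the metric out of view until someone has looked at it. This is a rough sketch under simple assumptions. The z-score threshold of 3, the sample history, and the `MetricVisibility` class are illustrative choices of mine, and a real deployment would need to handle seasonality and known events.

```python
from statistics import mean, stdev
from typing import Iterable, Sequence, Set


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag a value that sits far outside the spread of prior periods."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


class MetricVisibility:
    """Tracks which metrics the conversational layer may answer from."""

    def __init__(self, metrics: Iterable[str]) -> None:
        self._visible: Set[str] = set(metrics)
        self._under_review: Set[str] = set()

    def quarantine(self, name: str) -> None:
        self._visible.discard(name)
        self._under_review.add(name)

    def restore(self, name: str) -> None:
        """Call only after the fix has been re-tested and re-certified."""
        self._under_review.discard(name)
        self._visible.add(name)

    def status(self, name: str) -> str:
        if name in self._visible:
            return "available"
        if name in self._under_review:
            return "Data for this metric is under review."
        return "unknown metric"


visibility = MetricVisibility(["revenue"])
history = [980_000, 1_010_000, 995_000, 1_005_000]  # prior periods
latest = 1_600_000                                   # e.g. a duplicated load

if is_anomalous(history, latest):
    visibility.quarantine("revenue")

print(visibility.status("revenue"))
```

The user who asks about revenue during the investigation gets an explicit "under review" message instead of a confident wrong number.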

A checklist before you go live

This covers both the data layer and the conversational-specific additions. Tags indicate whether this is primarily a team responsibility or a CDO decision point.

A note on the CDO decisions

The items tagged as CDO decisions are not technical — they're governance decisions. Who owns a metric? What does certification mean at your organization? Who decides when a metric is ready for broad access? These require your judgment, not just your team's execution. They're also the decisions that most organizations skip, which is why their conversational analytics systems struggle.

The organizational reality

There's one more thing worth naming, because it shapes everything else.

The reason conversational analytics faces so much more scrutiny than dashboards isn't really about the technology. It's about accountability and attribution.

When a dashboard shows a wrong number, accountability is distributed. The data engineer built the pipeline. The analyst built the report. The BI team published it. By the time someone finds the error, it's a "data issue" — an expected, occasional occurrence that everyone has learned to tolerate.

When a chatbot gives a wrong answer, the attribution is immediate: "the AI got it wrong." There's no distribution of blame. The technology becomes the story, not the data quality failure that caused it. Organizations that tolerated broken dashboards for years are suddenly demanding perfection from the chatbot. The standard hasn't changed. The attribution has.

I think this is actually useful information for a CDO. It tells you where the political risk lives. It also tells you that the investment in data quality you make for conversational analytics will be visible and credited in a way that the same investment in dashboards never was.

Use that. The justification for finally getting your semantic layer right, finally certifying your metrics, finally documenting what "revenue" actually means across your organization — that justification is much easier to make when a chatbot is on the line than when another dashboard is.

The framing that works internally

"We need to get the data right before we expose it through a conversational interface" is an easier ask to leadership than "we need to get the data right before we build the next dashboard." It's the same work. It gets funded differently. If you're building conversational analytics, use the moment to fix the foundation your whole analytics stack sits on.

Where I land on this

The root cause of a wrong number is almost always in the data. It doesn't matter whether that number shows up on a bar chart or gets spoken by a chatbot. Stale data, undefined metrics, duplicate records, inconsistent fiscal calendars: these break both delivery mechanisms equally. The two failure modes that are new to conversational analytics are closed by the semantic layer. The QA discipline required is the same.

What's different is what happens when it goes wrong. A chatbot speaks with authority. There's no analyst in the loop. The blast radius is wider. The trust damage is faster and harder to recover from. None of that changes the root cause — it changes the cost of the failure.

So the answer to the CDO's core question is: no, you don't need a fundamentally different quality framework for conversational analytics. You need the quality framework you should have had all along — applied with more urgency, a formal semantic layer, and a user feedback loop that closes the gap between when something goes wrong and when your team finds out.

The chatbot didn't get it wrong. Your data did. The good news is that means you already know how to fix it.

Sources