Nov 19, 2025 - Rob Darling

Bridging the AI Gap in Business Intelligence

How every employee can get trusted answers from data without waiting on dashboards or data teams.

This is the business leader version of our 2025 Conversational Analytics Report. Data or Technical leader? View the technical edition here.

Executive Summary

Organizations are investing heavily in AI, racing to unlock a new frontier of productivity. This year alone, companies are expected to spend $644B on Generative AI, with $37B devoted to software tools that promise faster decisions and higher productivity.

Yet inside many businesses, a familiar bottleneck persists. Business users are still waiting for dashboards to be built or updated. They’re still filing tickets and hoping an analyst can pull the data quickly. And even when AI is introduced, they’re not always sure the answers it provides are correct.

The core problem is simple: traditional BI was built for a slower world. Dashboards depend on fixed definitions and scheduled refresh cycles. When the business shifts, dashboards lag behind. Data teams try to keep up, but the backlog grows. People don’t really want more dashboards; they want immediate, clear, and reliable answers to their questions.

The promise of AI is compelling: ask a question in plain language and get a helpful, accurate answer in seconds. This report explores how close we are to that promise today. Using realistic business datasets and questions, we tested four leading conversational analytics tools: Snowflake Cortex Analyst, Databricks Genie, ThoughtSpot Spotter, and runQL.

We found that accuracy ranged from 54% to 100% with these solutions, and inconsistency ranged from 0% to 12%. One tool got nearly half the questions wrong. Some gave different answers to the same question when asked again.

We also tested what happens when you add a “guardrail layer” on top of Snowflake Cortex Analyst and Databricks Genie. When runQL’s proprietary system was layered on top of both platforms, they became more accurate, more consistent, and less risky to use.

This report tells that story: where today’s tools fall short, how experts and technology working together are a winning combination, and how a platform like runQL can help with everyday decision support.

The Business Insights Bottleneck

Imagine a typical week for a business leader. A new campaign launches, sales numbers move in an unexpected direction, or a board member asks a pointed question about churn. You know the data exists somewhere in your systems. But getting a clear, up-to-date answer still means chasing dashboards, emailing analysts, or submitting a ticket into a queue.

The data team wants to help. They’re smart, motivated, and close to the numbers. But they’re also buried under requests. Every new question competes with month-end reporting, executive dashboards, and urgent ad hoc asks from other teams. As the business moves faster, the gap between when a question appears and when the answer arrives only widens.

Surveys confirm what leaders are feeling. Data experts report1 spending 49% of their time on one-off (ad hoc) requests, with common turnaround times of one to four weeks. Many organizations openly acknowledge that they’ve missed opportunities or absorbed unnecessary costs because decision-makers didn’t have timely, trustworthy insights, resulting in estimated losses of more than $2 million2 annually.

The result is a familiar frustration: important decisions are made with partial or no information, while the “real” answer is locked somewhere in the data.

1. Sigma Computing, Data Language Survey. https://www.sigmacomputing.com/blog/breaking-down-the-data-language-barrier
2. Veritas, Value of Data Study. https://www.veritas.com/news-releases/2019-03-12-data-management-challenges-cost-millions-a-year-reveals-veritas-research

The Promise and the Reality of AI

When people talk about AI in analytics, they usually describe a simple, attractive vision. A business user types, “What were our top three products by margin last quarter?” and seconds later, a clear, accurate answer appears with no dashboard, no ticket, no delay.

If that vision were consistently true today, the bottleneck would disappear. Analysts could focus on deeper questions, and business users would feel truly empowered. Everyone would be looking at the same numbers and moving faster together.

But the reality is more complicated for AI-only and some conversational analytics solutions. External research shows that AI without schema or context averages 16% accuracy3 on database queries. With schema and context, accuracy can improve to 67%4, but that still falls short of the level most leaders would accept for financial, operational, or customer-facing decisions. And most of these benchmarks only measure one-time accuracy, not whether the tool can repeat the same correct answer reliably.

In other words, AI can be brilliant in moments, but business leaders need it to be boringly reliable.

3. data.world, Generative AI - Accuracy in the Enterprise. https://data.world/blog/generative-ai-benchmark-increasing-the-accuracy-of-llms-in-the-enterprise-with-a-knowledge-graph/
4. Spider 2.0, Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows with researchers from Google DeepMind, Google Cloud AI Research, University of Waterloo, Salesforce Research, and the University of Hong Kong. https://spider2-sql.github.io/

We're Past the Shiny Demo Phase

The broader AI market is going through a similar journey. According to Gartner’s 2025 Hype Cycle, GenAI is entering the “Trough of Disillusionment,” the phase where big promises meet operational reality.

This isn’t a sign that AI has failed. It’s a sign that organizations are asking better questions. Instead of asking, “Can we do something with AI?” they’re asking, “Where can AI actually work for us?” and “How do we make sure it’s safe, accurate, governed, and worth the investment?”

Research shows that companies seeing real ROI from AI tend to have a few things in common5. They use systems that learn from feedback, provide context, work together with humans6, and automatically improve over time. They buy vendor solutions rather than attempting large internal builds. And they focus on specific, operational use cases where value can be measured and improved, rather than vague “AI everywhere” visions. 74% of executives report7 at least one GenAI use case with measurable ROI, productivity gains, and deployment across core functions such as customer service, marketing, and IT.

The lesson for analytics is clear: AI works best when a system is layered on top of it, giving it structure, guardrails, workflows, and clear jobs to do, not when it’s deployed as a generic chatbot on top of a database.

5. MIT NANDA (2025, July). The GenAI Divide—State of AI in Business 2025. https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf
6. Capgemini Research Institute. (2025, July). Rise of agentic AI—How trust is the key to human-AI collaboration. Capgemini. https://www.capgemini.com/wp-content/uploads/2025/07/Final-Web-Version-Report-AI-Agents.pdf
7. Google Cloud, with research by National Research Group (2025, September). The ROI of AI 2025—How agents are unlocking the next wave of AI-driven business value. https://services.google.com/fh/files/misc/google_cloud_roi_of_ai_2025.pdf

What We Tested

To understand how conversational analytics tools behave in practice, we designed a benchmark that felt more like a real workday than a lab test.

We started with two datasets. The first was Jaffle Shop, a fictional but widely used dataset representing food and beverage transactions. The second, BitThreads, was a more complex, fictional retail dataset covering marketing spend, sales transactions, and HR information, closer to what a modern business might actually see in its own data.

We then assembled 50 business questions touching on topics like performance, customer behavior, and trends. Each question was run three times on each platform (see the Appendix for details). For every run, we asked three simple questions: Did the tool get the answer right? Did it return the same answer each time it was asked? And did it generate usable SQL without errors?
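For readers curious about the mechanics, the sketch below shows one way such a scoring loop can be structured. The `platform.ask()` call and the hand-verified answer key are hypothetical placeholders; this is illustrative, not the exact tooling used for this report.

```python
from collections import Counter

RUNS_PER_QUESTION = 3

def run_benchmark(platform, questions, expected):
    """Score one platform on accuracy, consistency, and SQL-error rate.

    `platform.ask(q)` is a hypothetical client call returning
    (answer, sql_ok); `expected[q]` is the hand-verified correct answer.
    """
    tally = Counter()
    inconsistent_questions = 0

    for q in questions:
        answers = []
        for _ in range(RUNS_PER_QUESTION):
            answer, sql_ok = platform.ask(q)
            if not sql_ok:
                tally["sql_error"] += 1      # query failed to run
            elif answer == expected[q]:
                tally["correct"] += 1        # matched the verified answer
            else:
                tally["wrong"] += 1          # ran, but wrong logic or result
            answers.append(answer)
        # A question counts as inconsistent if repeated runs disagree.
        if len(set(answers)) > 1:
            inconsistent_questions += 1

    total_runs = len(questions) * RUNS_PER_QUESTION
    return {
        "accuracy": tally["correct"] / total_runs,
        "sql_error_rate": tally["sql_error"] / total_runs,
        "inconsistency_rate": inconsistent_questions / len(questions),
    }
```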

We evaluated Snowflake Cortex Analyst, Databricks Genie, ThoughtSpot Spotter, and runQL. To keep the comparison fair, we ran runQL in two modes: a “Basic Test Mode” that turned off several proprietary features for an apples-to-apples comparison with the other conversational analytics platforms, and a “Smart Test Mode” that included more of the guardrails and features used in production. We did not enable runQL’s full production setup, as those additional capabilities would favor runQL and skew the comparison.

What We Learned About Accuracy and Consistency

The headline numbers tell part of the story. In our tests, runQL in Basic Test Mode achieved an average accuracy of 92%. Snowflake Cortex Analyst followed at 88.67%, Databricks Genie at 84.67%, and ThoughtSpot Spotter at 54.67%.

On paper, anything above 85% may sound acceptable. But in a BI setting, where each answer might drive hiring plans, discount strategies, or budget allocations, even an 8–11% error rate means a meaningful number of decisions are being made on flawed information.

Beyond pure accuracy, we saw other issues. Some tools produced queries that technically ran but captured the wrong logic: grouping by the wrong fields, missing filters, or summing the wrong measures. Databricks Genie also failed to generate usable SQL in 7% of the cases.

Most benchmarks stop at one-time accuracy, but we wanted to know:

If you ask the same question three times, do you get the same answer?

Imagine a simple question: “What were our total sales for Q3 2025?” On Monday, the system returns $10.2M. Later that day, a colleague asks the same question and gets $9.8M. On Wednesday, it reports $10.5M. All for the same time period and the same data. Which number do you use for your forecast, your board deck, or your bonus calculations?

When we looked at the consistency of the platforms, these risks became clearer. Snowflake Cortex Analyst gave inconsistent answers on 6% of questions, Databricks Genie on 10%, and ThoughtSpot Spotter on 12%, while runQL (Basic Test Mode) had 0% inconsistency* and runQL (Smart Test Mode) had 0% inconsistency by design. Even when tools were mostly accurate on average, their lack of consistency made it hard to know which version of the answer to trust.

runQL in Smart Test Mode behaves differently. By design, it reuses validated queries, metrics, and applied guardrails to avoid random variation. In this mode, answers were both accurate and stable across runs.

*Although runQL in Basic Test Mode showed perfect consistency in this benchmark, we have seen inconsistent answers for this mode in other tests. Basic Test Mode does not include the guardrails needed to ensure consistency in production environments.

For business users, inconsistency is particularly dangerous. It raises the question, “Which version of the truth are we using?” and quietly erodes confidence in both the AI and the data team.

What This Means for Business Leaders

For business leaders, the conclusion is straightforward. Today’s AI-powered analytics tools can be impressive, and in many cases they are already helpful. But they are not yet safe to deploy everywhere without additional structure.

Tools that sometimes get answers wrong, or that answer the same question differently from one ask to the next, introduce risk. That risk doesn’t always appear as a dramatic failure; sometimes it appears as a small misalignment: a slightly off revenue figure, a misclassified customer segment, a forecast that doesn’t quite match reality. Over time, these small errors erode trust in both the AI and the data team.

The solution is not to abandon AI. It is to surround it with guardrails, feedback loops, and expert oversight so it can be fast and trustworthy at the same time.

Why Guardrails and Experts Matter

Many organizations start with a simple idea: let the AI generate answers, and if something seems off, a human will catch it. In theory, this “Human-in-the-Loop” approach combines the best of both worlds.

In practice, it usually doesn’t work that way. Business users are busy and not trained to read SQL or spot subtle data nuances. Analysts are already overloaded and may only see the AI’s output after a decision has been made. The result is a system where AI can quietly influence decisions without anyone truly owning the quality of its answers.

We’ve found that a better pattern is AI + Guardrails + Expert-in-the-Loop. In this model, AI is still responsible for speed when there isn’t already a validated or cached answer or metric. When AI does need to generate something new, guardrails (the rules, thresholds, and deterministic standards) sit between the AI and the business user. Only when a result fails to meet those standards is an expert looped in.
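As a rough sketch of how that routing can work, consider the Python outline below. The confidence threshold, validated-query store, and analyst queue are hypothetical placeholders used for illustration; this is not a description of runQL’s internal implementation.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # illustrative cut-off, not a product setting

@dataclass
class Answer:
    text: str
    source: str  # "validated", "ai", or "analyst"

def answer_question(question, validated_store, ai, analyst_queue):
    """Route a question through reuse, guardrails, then expert review."""
    # 1. Reuse first: if a validated query or metric already matches,
    #    return it directly so everyone sees the same number.
    cached = validated_store.lookup(question)
    if cached is not None:
        return Answer(cached, source="validated")

    # 2. Otherwise let AI generate a candidate, then apply deterministic
    #    checks (schema validity, required filters, confidence score).
    candidate, confidence = ai.generate(question)
    if ai.passes_checks(candidate) and confidence >= CONFIDENCE_THRESHOLD:
        return Answer(candidate, source="ai")

    # 3. If the guardrails are not satisfied, do not show a guess;
    #    hand the question to a human expert instead.
    analyst_queue.submit(question, candidate)
    return Answer("Routed to an analyst for review.", source="analyst")
```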

In our benchmark, tools without this structure had error rates ranging from 8% to 45%, and inconsistency rates up to 12%, which makes it hard to build trust. When runQL’s Smart Test Mode was used with its deterministic guardrails, ranking, and answer reuse, the results changed. The system not only answered more questions correctly, it answered them consistently and predictably.

In practice, the most trusted and accurate insights come from combining AI for speed, system guardrails for consistency, and Expert-in-the-Loop processes for judgment and nuance, which are core elements of runQL’s production setup. This combined approach delivers not only faster insights, but verifiable, repeatable, and context-aware answers at scale in a system that gets smarter over time.

Optional: Enhancing Snowflake & Databricks

Some organizations may already have Snowflake Cortex Analyst (Snowflake Intelligence) or Databricks Genie. The question is not whether to replace them, but how to make them safer and more effective in an AI-driven world.

runQL is designed with that reality in mind. It can act as a standalone platform, or it can sit on top of existing tools as a safety and trust layer. In follow-up tests, we passed queries generated by Snowflake Cortex Analyst and Databricks Genie through runQL before they reached the user.

When we did this, accuracy and consistency improved to 100% on the Jaffle Shop dataset. Accuracy on the BitThreads dataset improved by 4% to 8%, and consistency was 100%. Low-confidence or incorrect queries were intercepted and rewritten before reaching users, and when a perfect fix wasn’t possible, runQL prevented wrong answers from being presented as “truth” to business users. In production mode, these wrong answers would have been routed to an analyst.

This approach means organizations can keep their existing stack, but add the guardrails needed to make AI-powered answers something they can stand behind.

How runQL Bridges Speed and Trust

runQL was built around a simple idea: business users should get fast answers they trust, and data teams should stay in control of how those answers are produced.

To do that, runQL starts with what is already known. When a question matches existing, validated queries or metrics, it reuses that logic rather than generating something new from scratch. This makes common questions almost instant to answer and ensures that people across the organization see the same numbers.

When questions are ambiguous, runQL doesn’t guess in silence. It asks for clarification, surfaces assumptions, and explains what it is about to do. If confidence is low, the system knows when to pause and loop in an analyst instead of bluffing its way through.

Every AI-generated query is logged and explainable, so leaders can see how key answers were produced. And because runQL never sends your raw data to external LLMs, it supports security and compliance needs by design.

The result is a system that feels fast and conversational for business users, yet structured and governable for data and security teams.

What to Look for in a Conversational Analytics Platform

When you evaluate conversational analytics tools, the feature list will be long. Underneath the buzzwords, a few questions matter most.

Can the platform answer questions accurately and repeatably, not just in demos but over time? Does it give you visibility into how answers were produced, so you can explain them to a CFO, a risk officer, or a regulator? Does it log and govern every query, or does it act like a black box?

Can your analysts easily step in when AI is unsure, without leaving their tools or losing context? Will your data stay where it belongs, or are raw tables being pushed out to third-party systems without the right controls? And finally, can the platform work with your existing databases and fit your data architecture, including tools you may already have such as dbt Labs, Cube, Snowflake Cortex Analyst, or Databricks Genie?

The right platform won’t just generate answers. It will help you trust those answers enough to act on them.

Closing Thoughts for Business Leaders

You don’t need to become an AI expert to lead in this new era. What you need is confidence that when you ask a question about your business, the answer you get is both fast and trustworthy.

runQL is built to support that kind of leadership. It gives every employee an AI data assistant that respects the guardrails your organization needs. It gives every analyst an AI-powered teammate that amplifies their impact instead of adding noise. And it gives leaders a path to faster, more reliable decisions.

AI can move fast. With the right platform, your business can, too.

Contact us today to learn more.

Rob Darling, Founder & CEO


Appendix A: Research Test Scores

This table summarizes the test results for each platform and dataset. The columns appear in the order the tests were run: Jaffle 1 was executed across all platforms first, followed by BitThreads 1 across all platforms, and so on for the remaining cycles.

Platform Result Category Jaffle 1 BitThreads 1 Jaffle 2 BitThreads 2 Jaffle 3 BitThreads 3 Average
runQL (Basic Test Mode)
Correct 24 22 24 22 24 22 92.00%
Wrong 1 3 1 3 1 3 8.00%
SQL Error 0 0 0 0 0 0 0.00%
Snowflake Cortex Analyst
Correct 22 23 23 22 22 21 88.67%
Wrong 3 2 2 3 3 4 11.33%
SQL Error 0 0 0 0 0 0 0.00%
Databricks Genie
Correct 22 21 23 19 23 19 84.67%
Wrong 1 2 2 3 1 3 8.00%
SQL Error 2 2 0 3 1 3 7.33%
ThoughtSpot Spotter
Correct 13 - 14 - 14 - 54.67%
Wrong 12 - 11 - 11 - 45.33%
SQL Error 0 - 0 - 0 - 0.00%
runQL (Smart Test Mode)
Correct 25 25 25 25 25 25 100.00%
Wrong 0 0 0 0 0 0 0.00%
SQL Error 0 0 0 0 0 0 0.00%

Appendix B: Research Datasets

Datasets

Dataset One was based on the well-known dbt Labs Jaffle Shop project and included data automatically generated using dbt Labs’ official scripts.

Dataset Two was a custom-built fictional retail dataset for a company called BitThreads. It contains data and structures that would not have been available in any public LLM training corpus.

Each platform generated a SQL query based on the user’s question, executed it against the database, and returned both the query and results for evaluation.

Note: None of the tested platforms had seen this data or these questions prior to running the tests. runQL, Snowflake, and ThoughtSpot did not store raw sample data as part of their semantic information, while Databricks stored raw sample data alongside the semantic information. For clarity, runQL creates and uses synthetic data for semantic enrichment but never uses raw sample data.

How the Questions Were Created

  1. Generation: We provided ChatGPT 4o with the schema, high-level business context (e.g., “This is for a retail chain that sells clothing”), and available metadata from each dataset, then instructed it to generate business-user questions aligned to themes such as sales performance and customer behavior.
  2. Volume: We requested 50 questions per dataset.
  3. Curation: We reviewed the outputs and selected 25 questions per dataset that had clear intent and could be objectively evaluated for correctness.

This approach ensured a mix of simple and complex questions across domains, allowing us to fairly assess how well each platform handled structured business queries.

Sample Question: Jaffle Shop

Which customer had the highest number of orders during the calendar year 2024? Please show the customer name and their total number of orders.

Vendor | Answered Correctly? | Answer Provided | What they did
Snowflake Cortex Analyst | ✗ | Thomas Jones, 115 | Grouped by customer_name; multiple customers share the same name.
Databricks Genie | ✔︎ | Tracy Nunez, 70 | Grouped correctly by customer_id.
ThoughtSpot Spotter | ✗ | Thomas Jones, 115 | Grouped by customer_name; multiple customers share the same name.
runQL (Basic Test Mode) | ✔︎ | Tracy Nunez, 70 | Grouped correctly by customer_id.
runQL (Smart Test Mode) | ✔︎ | Tracy Nunez, 70 | Grouped correctly by customer_id.
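To make the grouping pitfall above concrete, here is a small, self-contained sketch using Python’s built-in sqlite3 module. The table layout and the data are simplified, hypothetical stand-ins for the Jaffle Shop schema (two distinct customers who happen to share the name “Thomas Jones”), chosen so the totals match the example in the table; it is not the benchmark dataset itself.

```python
import sqlite3

# Hypothetical, simplified tables modeled loosely on the Jaffle Shop schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id INTEGER, ordered_at TEXT);
""")

# Two *different* customers who share the name "Thomas Jones",
# plus "Tracy Nunez", who has the most orders of any single customer.
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Thomas Jones"), (2, "Thomas Jones"), (3, "Tracy Nunez")])

order_counts = [(1, 60), (2, 55), (3, 70)]  # customer_id -> number of 2024 orders
rows, order_id = [], 0
for cust_id, n in order_counts:
    for _ in range(n):
        order_id += 1
        rows.append((order_id, cust_id, "2024-06-01"))
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Incorrect logic: grouping by customer_name merges the two Thomas Joneses
# into one row (60 + 55 = 115 orders), which is what the failing tools returned.
wrong = conn.execute("""
    SELECT c.customer_name, COUNT(*) AS order_count
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.ordered_at LIKE '2024%'
    GROUP BY c.customer_name
    ORDER BY order_count DESC LIMIT 1
""").fetchone()

# Correct logic: grouping by customer_id keeps distinct customers separate,
# so Tracy Nunez (70 orders) is identified as the top customer.
right = conn.execute("""
    SELECT c.customer_name, COUNT(*) AS order_count
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.ordered_at LIKE '2024%'
    GROUP BY c.customer_id
    ORDER BY order_count DESC LIMIT 1
""").fetchone()

print("GROUP BY customer_name:", wrong)   # ('Thomas Jones', 115)  <- wrong
print("GROUP BY customer_id:  ", right)   # ('Tracy Nunez', 70)    <- correct
```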

Sample Question: Jaffle Shop

Which customer had the highest order total on their 2nd order, excluding the price of any drink items in the order? Please show the customer name and the total amount without the drinks.

Vendor 1st Try 2nd Try 3rd Try
Snowflake Cortex Analyst ✔︎ ✔︎
Databricks Genie ✔︎ ✔︎
ThoughtSpot Spotter
runQL (Basic Test Mode) ✔︎ ✔︎ ✔︎
runQL (Smart Test Mode) ✔︎ ✔︎ ✔︎

The above question from the benchmark shows how some platforms struggle with consistency even on simple datasets.