Nov 19, 2025 - Rob Darling

Bridging the AI Gap in Business Intelligence

Everything leaders need to know about Conversational Analytics and how to get the most out of AI to ensure faster, trusted decisions.

This is the data and technical leader version of our 2025 Conversational Analytics Report. Business leader? View the business edition here.

Executive Summary

Businesses everywhere are racing to unlock a new frontier of productivity, empowering every employee to make faster, more accurate decisions. Lost productivity costs U.S. businesses an estimated $1.8 trillion annually, and knowledge workers waste 5.3 hours every week waiting for information that should be immediately accessible. To solve this, organizations are projected to spend $644 billion this year on Generative AI, with $37 billion allocated to software tools alone.

Yet a major challenge persists: BI dashboards and data teams can’t keep pace with the speed of modern business. Traditional dashboards were built for a slower era. They rely on static definitions and rigid reports, leaving employees waiting for insights and data teams buried in requests. In reality, employees don’t want dashboards; they want answers. Meanwhile, new Conversational Analytics solutions often look impressive in demos but fall short in production because they lack the system-level architecture needed for accuracy and consistency.

This report asks two practical questions:

  • How accurate and consistent are today’s leading AI Conversational Analytics tools?
  • What’s actually required to make these tools reliable enough for real-world, self-serve analytics?

To evaluate the current state of the space, we first tested Snowflake Cortex Analyst, Databricks Genie, ThoughtSpot Spotter, and runQL (in Basic Test Mode) across 50 business questions, two realistic datasets, and three independent test cycles. In an "apples-to-apples" configuration, using each platform’s recommended setup, accuracy ranged from 54.67% to 92%, and inconsistency rates ranged from 0% to 12%. These findings, consistent with Snowflake's published accuracy results of 90%, demonstrate that base AI Conversational Analytics architectures are not yet reliable enough for self-serve analytics at scale.

Next, we evaluated what happens when additional system components like intent classification, deterministic checks, ranking, and answer caching are enabled in runQL. With these components active (runQL in Smart Test Mode), accuracy and consistency reached 100% across all datasets and all test runs.

Finally, as a third evaluation, we layered runQL in Smart Test Mode on top of Snowflake Cortex Analyst and Databricks Genie. In this augmented configuration, their accuracy improved (in some cases to 100%), inconsistencies were eliminated, and incorrect or low-confidence queries were intercepted, rewritten, or blocked before reaching end users.

The key finding is architectural: achieving reliable Conversational Analytics requires architecture and process layers that provide deterministic guardrails, semantic grounding, reusable answers, ranking, and Expert-in-the-Loop workflows.

This report breaks down the current capabilities and limitations of AI Conversational Analytics, examining accuracy, consistency, user expectations, and the system design required to make AI-driven analytics production-ready.

The Business Insights Bottleneck

Today's business users expect fast, self-serve answers, yet business decisions now move faster than traditional analytics can support. Most organizations face a growing backlog between the questions the business asks and the answers locked in their databases. Traditional BI dashboards were built for a slower era, relying on predefined questions and rigid report cycles. As priorities shift, employees wait for new reports while data teams become overextended. As a result, most organizations find themselves constrained by:

  • BI teams overwhelmed by the backlog of requests from the business
  • Exploding data volumes (168 billion terabytes, or 168 zettabytes, expected this year)
  • Business needs that move faster than traditional BI tools
  • Dashboards that can’t keep pace with changing business user needs and questions

A Sigma Computing survey1 found that 76 percent of data experts spend 49 percent of their time on ad hoc data reports and requests, with turnaround times ranging from one to four weeks. Respondents cited the shortage of available data analysts as the primary cause of the backlog. Similarly, a Veritas survey2 of 1,500 IT leaders found that 97 percent of the global organizations surveyed believe they have missed valuable opportunities as a result of ineffective data management. In fact, 35 percent admit to losing out on new revenue opportunities, 39 percent say their data challenges have increased operating costs, and organizations estimate more than $2 million in annual losses tied to these challenges.

The result? Slower decision-making, missed revenue opportunities, higher operational costs, and difficulty keeping pace with evolving business demands.

1. Sigma Computing, Data Language Survey. https://www.sigmacomputing.com/blog/breaking-down-the-data-language-barrier
2. Veritas, Value of Data Study. https://www.veritas.com/news-releases/2019-03-12-data-management-challenges-cost-millions-a-year-reveals-veritas-research

The Promise of AI for Business Intelligence

AI promises to revolutionize BI by enabling:

  • Natural language questions
  • Instant insights for business users
  • Democratized data access

But in real-world use, there is a gap for both standalone AI and most AI Conversational Analytics systems. These solutions struggle with:

  • Inconsistent answers to the same repeated questions
  • Wrong answers that business users often can’t identify as wrong
  • Maintaining trust due to wrong and inconsistent answers

In BI, wrong answers are more than a tech issue; they are a trust issue.

External studies highlight this gap clearly:

  • Without schema or context, AI averages 16 percent accuracy on structured database queries, according to a study by data.world3.
  • With schema and context, accuracy can rise to 67 percent on the Spider 2.0 Enterprise Data Evaluation4, a benchmark developed by researchers from Google DeepMind, Google Cloud AI Research, University of Waterloo, Salesforce Research, and the University of Hong Kong.

However, like most accuracy studies, these benchmarks only measure one-time accuracy, not the ability to consistently produce the same correct result. Consistency is critical for business users and data teams alike.

To address these gaps, leading AI Conversational Analytics platforms now combine AI with deterministic processes, semantic layers, context grounding, and expert oversight to deliver faster, more accurate, and more reliable insights. Before diving deeper, it’s important to understand where organizations are in the process of adopting AI.

3. data.world, Generative AI - Accuracy in the Enterprise. https://data.world/blog/generative-ai-benchmark-increasing-the-accuracy-of-llms-in-the-enterprise-with-a-knowledge-graph/
4. Spider 2.0, Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows with researchers from Google DeepMind, Google Cloud AI Research, University of Waterloo, Salesforce Research, and the University of Hong Kong. https://spider2-sql.github.io/

Where Are We in the AI Adoption Curve?

According to Gartner’s 2025 Hype Cycle for Emerging Technologies, GenAI is entering the "Trough of Disillusionment" where early excitement gives way to practical results. For many organizations, this is the most exciting phase, when experimentation transitions into measurable impact. As adoption matures, leaders are shifting their focus towards accuracy, governance, and ROI.

Across industries, leaders are shifting from asking if they should adopt AI to asking how they can use it responsibly and effectively. This shift signals a growing maturity in AI adoption, one that centers on delivering reliable outcomes and real business value.

 
[Image: gartner-hype-cycle]

A recent MIT report5 found that organizations that successfully achieve ROI on their AI implementations share four common practices:

  • Selecting systems that learn automatically by retaining feedback, providing context, and improving over time
  • Leveraging vendor solutions, which show twice the success rate (66%) over internal builds (33%)
  • Focusing on operational use cases, which deliver stronger ROI
  • Targeting specific, high-value, well-defined use cases

Complementing this, research by Google and the National Research Group6 has found that:

  • 74% of executives see ROI on at least one GenAI use case
  • 70% of executives have seen measurable productivity gains from AI solutions

Armed with this new data, smart leaders are shifting their thinking to:

  • Prioritize oversight, governance, and measurable ROI rather than one-off demos or generic LLM integrations.
  • Choose platforms that retain feedback, maintain up-to-date context, and improve over time, characteristics that consistently outperform LLM wrappers or conversational tools without these system-level processes.
  • Adopt architectures designed to learn continuously and incorporate Expert-in-the-Loop workflows where needed, an approach shown to drive higher success rates in AI-human collaboration7.

5. MIT NANDA (2025, July). The GenAI Divide—State of AI in Business 2025. https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf
6. Google Cloud, with research by National Research Group (2025, September). The ROI of AI 2025—How agents are unlocking the next wave of AI-driven business value. https://services.google.com/fh/files/misc/google_cloud_roi_of_ai_2025.pdf
7. Capgemini Research Institute. (2025, July). Rise of agentic AI—How trust is the key to human-AI collaboration. Capgemini. https://www.capgemini.com/wp-content/uploads/2025/07/Final-Web-Version-Report-AI-Agents.pdf

Platform Configuration

In this benchmark we evaluated Snowflake Cortex Analyst (also used by Snowflake Intelligence), Databricks Genie, ThoughtSpot Spotter, and runQL. For each vendor we followed the official setup documentation to configure the required data connections and semantic information, including data models and data relationships.
See Appendix B for the datasets and ERD diagrams.

Vendor | Semantic & Relationship Configuration | LLM Model Provider
Snowflake Cortex Analyst | ✔︎ Manually Created | Anthropic
Databricks Genie | ✔︎ Manually Created | OpenAI
ThoughtSpot Spotter | ✔︎ Manually Created | OpenAI
runQL | ✔︎ Automatically Generated | Google & OpenAI

A variety of Large Language Models (LLMs) are used by these solutions, including OpenAI’s GPT-4.1 and 4o, Anthropic’s Claude Sonnet 4 and 3.5, and Google’s Gemini 2.5 Flash. All of these AI Conversational Analytics solutions pair a semantic layer (models, metadata, etc.) and supporting processes with the LLM to produce more accurate results than a standalone LLM.

While each platform offers different features and database backends, we designed this as an equal "apples-to-apples" test for the first evaluation. For runQL, this meant deliberately disabling several proprietary capabilities that extend beyond what the other platforms provide. As a result, runQL was not tested in its normal production setup, but instead in a Basic Test Mode configuration:

  1. Basic Test Mode (baseline used for this evaluation)
    • Provides automatically generated semantic information, schema metadata, synthetic data, and the deterministic guardrail engine to support SQL answer generation
    • Intentionally excludes the intent engine, ranking engine, answer caching, answers context engine, and Expert-in-the-Loop review

We did not create certified queries, saved queries, or define metrics for any of the platforms.

Benchmark Methodology

To evaluate accuracy, consistency, and reliability, we designed a benchmark using 50 unique business questions across two separate datasets.

The first dataset is Jaffle Shop from dbt Labs, which contains fictional food and beverage transactions. The second, more complex dataset, BitThreads, includes marketing spend, sales transactions, and human resources data for a fictional clothing retail chain. (See Appendix B for more details.)

For each dataset, we selected unique business questions aligned to themes such as sales performance and customer behavior. (See "How The Questions Were Created" in Appendix B.)

  • Batching: The 50 questions were split into two batches of 25, one batch per dataset.
  • Cycling: For each cycle, we submitted the 25 Jaffle Shop questions, followed by the 25 BitThreads questions. This process was repeated twice more (Cycle 1 → Cycle 2 → Cycle 3).
  • Total Volume: This resulted in 150 questions per platform, with each of the 50 questions submitted three separate times across multiple cycles.

This approach ensured that each question was re-evaluated freshly in every round, minimizing the risk of cached responses and delivering a more accurate measure of both first-time accuracy and consistency across sessions.
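For readers who want to run a similar protocol, the loop below is a minimal sketch of the batching and cycling described above. The ask_platform and grade_answer callables are hypothetical stand-ins for a platform's API call and the grading step; the actual harness used for this report is not published.

```python
# Minimal sketch of the batching/cycling protocol described above.
# ask_platform and grade_answer are hypothetical stand-ins supplied by the caller.

from collections import defaultdict
from typing import Callable

def run_benchmark(ask_platform: Callable[[str, str], str],
                  grade_answer: Callable[[str, str, str], bool],
                  jaffle_questions: list[str],
                  bitthreads_questions: list[str],
                  cycles: int = 3) -> dict:
    """Submit both 25-question batches once per cycle and record per-run correctness."""
    results: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for _ in range(cycles):
        for dataset, questions in (("jaffle_shop", jaffle_questions),
                                   ("bitthreads", bitthreads_questions)):
            for question in questions:
                answer = ask_platform(dataset, question)   # fresh request each cycle
                results[(dataset, question)].append(grade_answer(dataset, question, answer))
    return results
```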

This study was designed to explore the practical limits of AI Conversational Analytics, not to compare platforms competitively, but to highlight current capabilities and common challenges. It is intended to inform business and technology leaders (CEOs, CIOs, COOs, and CTOs) on where these tools deliver value today, where gaps remain, and what solutions are necessary to ensure responsible and effective use.

Results: Evaluation 1 - Accuracy, Inaccuracy, and Error Rates

After running 50 structured business questions across the two datasets and repeating this process three times, clear patterns emerged in the accuracy and consistency of these solutions. While Snowflake Cortex Analyst (also used by Snowflake Intelligence), ThoughtSpot Spotter, Databricks Genie, and runQL in Basic Test Mode all showed promise, they also revealed notable limitations in accuracy and consistency. (For the full test scores, see Appendix A.)

Average Accuracy

  • runQL in Basic Test Mode achieved the highest average accuracy at 92%, followed by Snowflake Cortex Analyst at 88.67%, Databricks Genie at 84.67%, and ThoughtSpot Spotter at 54.67%.
  • While a 92% average accuracy rate is strong, it still means that 8% of the answers were incorrect, a significant error rate in decision-making environments and one that breaks trust.

Average Inaccuracy (Wrong Answers)

  • ThoughtSpot Spotter had the highest average inaccuracy rate at 45.33%, followed by Snowflake Cortex Analyst at 11.33%, Databricks Genie at 8% (plus 7% in SQL errors), and runQL in Basic Test Mode at 8%.
  • Databricks Genie also failed to generate executable SQL in 7% of cases.
  • Across platforms, inaccurate answers ranged from incorrect aggregations to misleading or incomplete outputs.

Sample Question: Jaffle Shop

Which customer had the highest number of orders during the calendar year 2024? Please show the customer name and their total number of orders.

Vendor | Answered Correctly? | Answer Provided | What they did
Snowflake Cortex Analyst | ✘ | Thomas Jones, 115 | Grouped by customer_name and there are multiple customers with the same name.
Databricks Genie | ✔︎ | Tracy Nunez, 70 | Grouped correctly by customer_id
ThoughtSpot Spotter | ✘ | Thomas Jones, 115 | Grouped by customer_name and there are multiple customers with the same name.
runQL (Basic Test Mode) | ✔︎ | Tracy Nunez, 70 | Grouped correctly by customer_id
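The failure mode above comes down to grouping on a non-unique display name. As an illustration only (not the exact SQL any platform produced), the sketch below contrasts the two query shapes using the Jaffle Shop columns shown in Appendix B, where two distinct customers named "Thomas Jones" exist.

```python
# Sketch of the grouping difference behind the sample question above, using the
# Jaffle Shop columns from Appendix B (customers.customer_id, customers.customer_name,
# orders.order_id, orders.customer_id, orders.ordered_at).

# Fails when names collide: order counts for every "Thomas Jones" are summed together.
wrong_sql = """
SELECT c.customer_name, COUNT(o.order_id) AS total_orders
FROM orders o JOIN customers c ON c.customer_id = o.customer_id
WHERE o.ordered_at >= '2024-01-01' AND o.ordered_at < '2025-01-01'
GROUP BY c.customer_name
ORDER BY total_orders DESC
LIMIT 1;
"""

# Correct: group by the unique key, then carry the name along for display.
right_sql = """
SELECT c.customer_id, c.customer_name, COUNT(o.order_id) AS total_orders
FROM orders o JOIN customers c ON c.customer_id = o.customer_id
WHERE o.ordered_at >= '2024-01-01' AND o.ordered_at < '2025-01-01'
GROUP BY c.customer_id, c.customer_name
ORDER BY total_orders DESC
LIMIT 1;
"""
```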

Chart: Jaffle Shop Test Results (Grouped by test run)

In the following chart we show the results of the three Jaffle Shop dataset test runs. Each bar represents the % of questions answered correctly and incorrectly by each platform for the test run.

[Chart: jaffle-shop-chart-by-test-run]

Chart: Jaffle Shop Test Results (Grouped by platform)

In the following chart we see the performance of each platform across test runs. Snowflake, Databricks, and ThoughtSpot show variability in the % of accurate answers even though the questions are the same in each test.

[Chart: jaffle-shop-chart-by-platform]

Chart: BitThreads Test Results (Grouped by test run)

In the following chart we show the results of the three BitThreads dataset test runs. Each bar represents the % of questions answered correctly and incorrectly by each platform for the test run.

[Chart: bitThreads-chart-by-test-run]

Chart: BitThreads Test Results (Grouped by platform)

In the following chart we see the performance of each platform across test runs. Snowflake, Databricks, and ThoughtSpot show variability in the % of accurate answers even though the questions are the same in each test.

[Chart: bitThreads-chart-by-platform]

Chart: Combined Average Inaccuracy & Error Rates

Across all questions in the test cycles, runQL in Basic Test Mode averaged an 8% incorrect answer rate, Snowflake averaged 11%, Databricks averaged 8%, and ThoughtSpot averaged 45%. Databricks was the only platform that generated SQL syntax errors, which occurred in 7% of cases on average.

[Chart: chart-average-inaccuracy]

Results: Evaluation 1 - Consistency & Inconsistency

Most accuracy benchmarks only measure one-time performance. Because LLMs are probabilistic, the same question can produce different answers (queries) across runs, even when the underlying data, schema, and context remain unchanged. To understand how Conversational Analytics systems behave under real-world conditions, we tested repeatability (consistency), not just one-time accuracy.

In this benchmark, we asked the same 50 questions three times on each platform to measure how often they produced the same answer versus a different one. This consistency test reveals whether a system is stable enough for real-world decision-making, not just one-time demo-level performance. (For the full test scores, see Appendix A.)
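As a rough illustration of how this repeatability signal can be computed, the sketch below assumes per-run correctness grades (for example, the output of the harness sketch earlier in this report) and counts a question as inconsistent when its runs mix correct and incorrect answers.

```python
# Minimal sketch of the consistency metric: a question counts as inconsistent when
# the same question is answered correctly on some runs and incorrectly on others.

def inconsistency_rate(graded: dict[str, list[bool]]) -> float:
    """Share of questions whose repeated runs mix correct and incorrect answers."""
    inconsistent = sum(1 for runs in graded.values() if len(set(runs)) > 1)
    return inconsistent / len(graded)

# Example: one of four questions flips between correct and incorrect -> 25%.
example = {"q1": [True, True, True],
           "q2": [True, False, True],
           "q3": [False, False, False],
           "q4": [True, True, True]}
print(f"{inconsistency_rate(example):.0%}")  # 25%
```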

Inconsistency Rates

Across the 150 questions in the evaluation:

  • Snowflake Cortex Analyst: 6% of questions had an inconsistent answer
  • Databricks Genie: 10% of questions had an inconsistent answer
  • ThoughtSpot Spotter: 12% of questions had an inconsistent answer
  • runQL (Basic Test Mode): 0% (perfect consistency*)

*Although runQL in Basic Test Mode showed perfect consistency in this benchmark, we have seen inconsistent answers for this mode in other tests. Basic Test Mode does not include the guardrails needed to ensure consistency in production environments.

These inconsistency rates reflect the non-deterministic behavior of the underlying LLMs and highlight the challenge of finding Conversational Analytics solutions that deliver both consistency and accuracy.

Sample Question: Jaffle Shop

Which customer had the highest order total on their 2nd order, excluding the price of any drink items in the order? Please show the customer name and the total amount without the drinks.

Vendor | Correct Answers (out of 3 tries)
Snowflake Cortex Analyst | 2 of 3
Databricks Genie | 2 of 3
ThoughtSpot Spotter | 0 of 3
runQL (Basic Test Mode) | 3 of 3

The above question from the benchmark shows how some platforms struggle with consistency even on simple datasets.

Chart: Inconsistency Rates (Grouped by platform)

In the following chart we see the inconsistency rate for each platform. This is the rate at which a platform answers the same question differently, answering it correctly sometimes and incorrectly other times.

[Chart: chart-inconsistency-rate]

Results: Evaluation 1 - Summary

  • As the test results show, despite the progress made in AI and these solutions, the rate of incorrect answers and inconsistent results remains a significant business risk, even with smaller schemas and controlled datasets.
  • Most business users are not equipped to identify or validate incorrect answers, which means they may unknowingly rely on flawed insights. (see the sample questions above)
  • This underscores the need for safeguards such as deterministic guardrails, systems that learn and improve over time, Expert-in-the-Loop processes when needed, and transparency in how AI-generated answers are produced.

Platforms in the first evaluation:

  • Failed to provide correct answers 8% to 45% of the time.
  • Produced different answers to the same question across runs, correct sometimes and incorrect other times.
  • Left users with no clear way to identify incorrect answers unless they were experts in the data. (see sample questions above)

Evaluation 2: runQL in Smart Test Mode

For the second evaluation, we ran the same benchmark again: 50 questions, two datasets, and three independent runs using runQL in Smart Test Mode. This configuration enables additional system components including the intent engine, ranking engine, and deterministic answer caching. As with the first evaluation, we started from a blank slate and did not pre-populate the answer cache.

Results: Evaluation 2 - Summary

runQL in Smart Test Mode achieved 100% accuracy and 100% consistency across all questions, datasets, and runs.

While this configuration performed well in the benchmark conditions, real enterprise environments can involve more complex schemas, edge cases, and ambiguous business logic. For this reason, we recommend systems that extend beyond these components when deployed in production.

Note: We did not enable runQL’s full production setup for any evaluation in this report. In production, runQL's proprietary system includes all Smart Test Mode features plus answer feedback, certified answer caching, ranking rerouting, Expert-in-the-Loop workflows, and more. Organizations may also optionally use the dbt Metrics Layer or Cube Semantic Layer with runQL. The table below summarizes the differences between production and the configurations used in these evaluations.

Capability | Basic Test Mode | Smart Test Mode | Production (not used in these tests)
Automatic Semantic Descriptions & Schema Metadata | Y | Y | Y
Automatic Synthetic Data | Y | Y | Y
Deterministic Guardrail Engine | Y | Y | Y
Intent Engine | N | Y | Y
Ranking Engine | N | Y | Y
Answer Caching Engine* | N | Y | Y
Answer Feedback | N | N | Y
Certified Answer Caching Engine | N | N | Y
Ranking Rerouting Engine | N | N | Y
Automatic Data Catalog | N | N | Y
Cube/dbt Labs Semantic Layer Integration (optional) | N | N | Y
Answers Context Engine (existing answers + query + query metadata) | N | N | Y
Expert-in-the-Loop (ticketing workflow, when needed) | N | N | Y

*The Answer Caching Engine was never pre-populated for any evaluations.

Evaluation 3: Enhancing Snowflake & Databricks with runQL

While runQL can be used as a standalone product, its proprietary system can also operate as a layer on top of Snowflake Cortex Analyst or Databricks Genie, allowing organizations to improve their existing investments. In follow-up testing, we evaluated how runQL in Smart Test Mode performs in this augmentation role by assessing and improving AI-generated queries before they reach business users.
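Conceptually, this augmentation role is a gate between the upstream generator and the business user. The sketch below is illustrative only; the scoring and rewrite functions are hypothetical placeholders, not runQL's actual API.

```python
# Minimal sketch of the augmentation pattern described above: SQL proposed by an
# upstream engine (e.g., Cortex Analyst or Genie) is scored by a validation layer
# before it ever reaches a business user. Scoring/rewrite callables are hypothetical.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class GateResult:
    action: str           # "pass", "rewritten", or "blocked"
    sql: Optional[str]    # SQL to execute, or None if blocked
    reason: str

def gate_generated_sql(question: str,
                       proposed_sql: str,
                       score: Callable[[str, str], float],
                       rewrite: Callable[[str, str], Optional[str]],
                       threshold: float = 0.9) -> GateResult:
    """Pass, rewrite, or block a generated query based on a confidence score."""
    confidence = score(question, proposed_sql)
    if confidence >= threshold:
        return GateResult("pass", proposed_sql, f"confidence {confidence:.2f}")
    fixed = rewrite(question, proposed_sql)
    if fixed is not None and score(question, fixed) >= threshold:
        return GateResult("rewritten", fixed, "low-confidence query rewritten")
    return GateResult("blocked", None, "routed to an analyst instead of the user")
```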

These follow-up tests were not part of the original study, and the questions and schemas had changed slightly since the first and second evaluations. For clarity, the BitThreads schema has been further normalized and the questions have been refined as we continue to update our evaluation suite. Still, the results were clear:

Results: Evaluation 3 - Improved Accuracy & Consistency

When SQL generated by Snowflake Cortex Analyst or Databricks Genie was passed through runQL’s proprietary system, the following improvements were observed:

  • Databricks Genie (Jaffle Shop Dataset):
    • Accuracy improved to 100%
    • Consistency improved to 100%
    • Low-confidence or incorrect queries were intercepted and rewritten before reaching users.
  • Databricks Genie (BitThreads Dataset):
    • Accuracy improved by 4%
    • Consistency improved to 100%
    • Incorrect answers were caught and blocked, preventing user-facing errors even when a perfect fix was not possible
  • Snowflake Cortex Analyst (Jaffle Shop Dataset):
    • Accuracy improved to 100%
    • Consistency improved to 100%
    • Low-confidence or incorrect queries were intercepted and rewritten before reaching users.
  • Snowflake Cortex Analyst (BitThreads Dataset):
    • Accuracy improved by 8%
    • Consistency improved to 100%
    • All but one of the incorrect answers were stopped before reaching business users.

Why This Matters

Many organizations have already invested in platforms like Snowflake Cortex Analyst and Databricks Genie. For teams looking to bridge the gap between AI-generated speed and analyst-level trust, runQL offers a complementary layer that:

  • Intercepts low-confidence answers (queries) before they cause harm
  • Improves answer accuracy using semantic context and judging workflows
  • Provides transparency around why queries were accepted, modified, or blocked
  • Reduces analyst burden by helping AI get closer to the right answer, more often
  • Enables organizations to easily build a library of approved queries, making the system smarter while increasing trust and saving on inference costs

Why Guardrails + Expert-in-the-Loop > AI-Only + Human-in-the-Loop

In a production system, Human-in-the-Loop alone is not enough for structured data. A combination of guardrails and Expert-in-the-Loop oversight (when needed) is required, for the same reason business users (humans) already turn to data analysts (experts) today: data can be nuanced, and analysts have the deep context needed to interpret that nuance, both in the data itself and in the surrounding business context.

  • Context Matters: AI by itself lacks a deep understanding of your business and data model.
  • Oversight Reduces Risk: Expert review catches what AI misses.
  • Consistency Builds Trust: Business teams need reliable answers every time.

Our testing showed that AI-only approaches, even those with semantic information, still produced incorrect answers ranging from 8% to 45% of the time and inconsistent results up to 12% of the time. These incorrect and inconsistent answers will increase as usage scales, exposing your organization to higher risk.

In contrast, runQL Smart Test Mode improved performance through proprietary capabilities such as the Deterministic Guardrail Engine, Ranking Engine, and Answer Caching Engine. However, Smart Test Mode still excluded Expert-in-the-Loop workflows, Certified Answers, and the Context Engine used in full production. These tools introduce additional layers of assurance, strengthening both accuracy and consistency in real-world production.

In practice, the most trusted and accurate insights come from combining AI capabilities with guardrails and Expert-in-the-Loop processes, which are core elements of runQL’s production setup. This combined approach delivers not only faster insights, but verifiable, repeatable, and context-aware answers at scale in a system that gets smarter over time.

How runQL Bridges the Gap Between Speed and Trust

1. Start with Deterministic Queries and Processes

Leverage existing, validated answers first.

runQL begins with deterministic answers from the certified answer library and defined metrics before generating anything new with AI. These deterministic answers, vetted by data experts and grounded in known business logic, provide fast, reliable insights that users can trust. This reduces the need to "reinvent the wheel" with every question, speeds up time-to-answer, and minimizes the risk of AI-generated mistakes.

  • Speed: Reusing validated queries (answers) or defined metrics wherever possible, business users get instant responses to common questions, no need to wait for query generation (inference), debugging, or analysts.
  • Trust: Because these answers and metrics are approved and auditable, stakeholders can rely on the results with confidence, knowing the logic has already been vetted and aligned with the business context.
  • Scalability: Your certified answer library grows over time, becoming smarter and more comprehensive as more questions are answered.
  • Governance: Questions that fall outside the answer library or return low-confidence results are automatically routed to analysts, ensuring expert oversight where and when needed.

With this approach, runQL doesn’t just promise fast answers; it ensures they’re the right answers. That’s how it delivers both the speed business users demand and the trust data leaders require.
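For illustration, here is a minimal sketch of this certified-answers-first flow. The names are hypothetical and the lookup is deliberately simplified; a production system would match questions by intent rather than by string normalization.

```python
# Minimal sketch of the "certified answers first, AI generation second" flow
# described above. The library lookup and generator are hypothetical stand-ins.

from typing import Callable

def answer_question(question: str,
                    certified_library: dict[str, str],
                    generate_sql: Callable[[str], str],
                    normalize: Callable[[str], str] = str.lower) -> tuple[str, str]:
    """Return (source, sql): a certified query when one exists, else generated SQL."""
    key = normalize(question)
    if key in certified_library:
        return "certified", certified_library[key]   # vetted, deterministic, auditable
    return "generated", generate_sql(question)        # falls through to the AI path

# Usage: the first question hits the library; anything else triggers generation.
library = {"monthly revenue by store": "SELECT ... /* expert-approved SQL */"}
print(answer_question("Monthly revenue by store", library, lambda q: "SELECT ..."))
```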

2. Clarify Ambiguity

AI asks follow-up questions instead of guessing.

runQL prompts for clarification when a request is vague rather than guessing. If an assumption must be made, it is clearly stated. This prevents misleading queries and ensures the output aligns with the user's actual intent.

3. Control AI Confidence Levels

Set thresholds for when AI can answer and when a human should step in.

Not every query is suited for AI to handle alone. runQL allows teams to set confidence thresholds that determine when AI can answer and when a human expert should step in. Only AI answers that meet or exceed the confidence threshold are delivered to business users; all others are routed to analysts for review.
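A minimal sketch of such a threshold policy is shown below, with illustrative names and values rather than runQL's actual configuration: answers at or above the threshold are delivered, everything else opens a review task for an analyst.

```python
# Minimal sketch of a configurable confidence threshold, as described above.

def route_answer(confidence: float, sql: str, threshold: float = 0.85) -> dict:
    """Decide whether an AI-generated answer is delivered or escalated for review."""
    if confidence >= threshold:
        return {"route": "deliver_to_user", "sql": sql}
    return {"route": "analyst_review", "sql": sql,
            "note": f"confidence {confidence:.2f} below threshold {threshold:.2f}"}

# Teams can tune the threshold per workspace or per dataset:
print(route_answer(0.95, "SELECT ..."))        # delivered to the business user
print(route_answer(0.60, "SELECT ...", 0.9))   # escalated to an analyst
```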

4. Governance Built In

Audit, log, and document every query.

runQL automatically audits, logs, and documents every AI-generated query, whether answered automatically or reviewed by an expert. This ensures full traceability and supports compliance, accountability, and ongoing improvement to your data governance processes.

5. Data Privacy First

Ensure your data stays secure.

runQL ensures your data stays secure: raw data is never sent to external LLMs, and sensitive information never leaves your environment. This protects against privacy risks, supports regulatory compliance, and builds organizational trust in AI use.

Ready to Close the Gap?

We’d love to show you how we’re approaching this challenge differently.

If you’re tired of data-request backlogs, AI solutions that erode trust, and lost productivity, runQL enables fast, accurate, and trusted insights at scale.

Contact us today to learn more.

Rob Darling, Founder & CEO

Chat With Us

Appendix A: Research Test Scores

This table summarizes the test results for each platform and dataset. The columns appear in the order the tests were run: Jaffle 1 was executed across all platforms first, followed by BitThreads 1 across all platforms, and so on for the remaining cycles.

Platform | Result Category | Jaffle 1 | BitThreads 1 | Jaffle 2 | BitThreads 2 | Jaffle 3 | BitThreads 3 | Average
runQL (Basic Test Mode) | Correct | 24 | 22 | 24 | 22 | 24 | 22 | 92.0%
runQL (Basic Test Mode) | Wrong | 1 | 3 | 1 | 3 | 1 | 3 | 8.00%
runQL (Basic Test Mode) | SQL Error | 0 | 0 | 0 | 0 | 0 | 0 | 0%
Snowflake Cortex Analyst | Correct | 22 | 23 | 23 | 22 | 22 | 21 | 88.67%
Snowflake Cortex Analyst | Wrong | 3 | 2 | 2 | 3 | 3 | 4 | 11.33%
Snowflake Cortex Analyst | SQL Error | 0 | 0 | 0 | 0 | 0 | 0 | 0%
Databricks Genie | Correct | 22 | 21 | 23 | 19 | 23 | 19 | 84.67%
Databricks Genie | Wrong | 1 | 2 | 2 | 3 | 1 | 3 | 8.00%
Databricks Genie | SQL Error | 2 | 2 | 0 | 3 | 1 | 3 | 7.33%
ThoughtSpot Spotter | Correct | 13 | - | 14 | - | 14 | - | 54.67%
ThoughtSpot Spotter | Wrong | 12 | - | 11 | - | 11 | - | 45.33%
ThoughtSpot Spotter | SQL Error | 0 | - | 0 | - | 0 | - | 0%
runQL (Smart Test Mode) | Correct | 25 | 25 | 25 | 25 | 25 | 25 | 100%
runQL (Smart Test Mode) | Wrong | 0 | 0 | 0 | 0 | 0 | 0 | 0%
runQL (Smart Test Mode) | SQL Error | 0 | 0 | 0 | 0 | 0 | 0 | 0%
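The Average column follows directly from the per-run counts. The short sketch below reproduces it, noting that ThoughtSpot Spotter was only evaluated on the Jaffle Shop dataset, so its denominator is 75 questions (3 runs x 25) rather than 150.

```python
# Reproducing the "Average" accuracy column from the per-run Correct counts above.

counts = {
    "runQL (Basic Test Mode)":  {"correct": [24, 22, 24, 22, 24, 22], "total": 150},
    "Snowflake Cortex Analyst": {"correct": [22, 23, 23, 22, 22, 21], "total": 150},
    "Databricks Genie":         {"correct": [22, 21, 23, 19, 23, 19], "total": 150},
    "ThoughtSpot Spotter":      {"correct": [13, 14, 14],             "total": 75},
    "runQL (Smart Test Mode)":  {"correct": [25, 25, 25, 25, 25, 25], "total": 150},
}

for platform, c in counts.items():
    accuracy = sum(c["correct"]) / c["total"]
    print(f"{platform}: {accuracy:.2%}")
# runQL (Basic Test Mode): 92.00%
# Snowflake Cortex Analyst: 88.67%
# Databricks Genie: 84.67%
# ThoughtSpot Spotter: 54.67%
# runQL (Smart Test Mode): 100.00%
```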

Appendix B: Research Datasets

Datasets

Dataset One was based on the well-known dbt Labs Jaffle Shop project and included data automatically generated using dbt Labs’ official scripts.

Dataset Two was a custom-built fictional retail dataset for a company called BitThreads. It contains data and structures that would not have been available in any public LLM training corpus.

Each platform generated a SQL query based on the user’s question, executed it against the database, and returned both the query and results for evaluation.

Note: None of the tested platforms had seen this data or these questions prior to running the tests. runQL, Snowflake, and ThoughtSpot did not store raw sample data as part of their semantic information, while Databricks stored raw sample data alongside the semantic information. For clarity, runQL creates and uses synthetic data for semantic enrichment but never uses raw sample data.

How the Questions Were Created

  1. Generation: We provided ChatGPT 4o with the schema, high-level business context (e.g., “This is for a retail chain that sells clothing”), and available metadata from each dataset, then instructed it to generate business-user questions aligned to themes such as sales performance and customer behavior.
  2. Volume: We requested 50 questions per dataset.
  3. Curation: We reviewed the outputs and selected 25 questions per dataset that had clear intent and could be objectively evaluated for correctness.

This approach ensured a mix of simple and complex questions across domains, allowing us to fairly assess how well each platform handled structured business queries.

Jaffle Shop Entity Relationship Diagram (ERD)

The Jaffle Shop database tracks food and drink sales using six tables and 258,637 rows of data.

[Image: Jaffle Shop ERD Diagram]

Data: Jaffle Shop Sample

This is a small sample of Jaffle Shop data used for Sample Question 1 in the report.

Customers Table

customer_id | customer_name | count_lifetime_orders | first_ordered_at | last_ordered_at | lifetime_spend | customer_type
0f66b255-1b57-4e5f-a78e-9ff612859c94 | Thomas Garcia | 92 | 2022-10-06 0:00:00 | 2022-10-06 0:00:00 | 1207 | returning
b237d5b9-56e9-4710-b7cd-8b472425f2a4 | Thomas Hamilton | 5 | 2024-03-01 0:00:00 | 2025-01-03 0:00:00 | 31 | returning
006f4fdc-8d23-4269-bddf-6d05fa886347 | Thomas Johnson | 9 | 2024-08-15 0:00:00 | 2025-04-22 0:00:00 | 95 | returning
1d2f11a3-83ae-44c3-98b2-4527e4ad1e75 | Thomas Johnson | 84 | 2023-10-24 0:00:00 | 2025-05-12 0:00:00 | 1156 | returning
8b667f5a-e991-4096-a057-7eb506928c07 | Thomas Jones | 90 | 2023-09-20 0:00:00 | 2025-05-02 0:00:00 | 1153 | returning
9537ddf2-9408-4324-a9bb-e60bfbe6fb8b | Thomas Jones | 25 | 2023-12-05 0:00:00 | 2025-05-09 0:00:00 | 1135 | returning
ee99c556-a6cd-41c5-bdfb-9f17038509af | Thomas Morris | 1 | 2024-07-12 0:00:00 | 2024-07-12 0:00:00 | 10 | new

Orders Table

order_id | location_id | customer_id | order_total | ordered_at
4369ee22-289c-4e48-b2b2-8cca472d3a7 | 25241dcc-6646-48d2-a012-d97dc9e43b43 | 006f4fdc-8d23-4269-bddf-6d05fa886347 | 8 | 2024-08-15 0:00:00
541f0d24-a763-42f3-b15c-f0e3e103cf54 | 25241dcc-6646-48d2-a012-d97dc9e43b43 | 006f4fdc-8d23-4269-bddf-6d05fa886347 | 23 | 2024-08-20 0:00:00
6a45995d-daef-4f95-b8c5-47d358e80841 | 25241dcc-6646-48d2-a012-d97dc9e43b43 | 006f4fdc-8d23-4269-bddf-6d05fa886347 | 5 | 2024-09-10 0:00:00
7f902677-0497-41a9-bffe-77d01443feec | 25241dcc-6646-48d2-a012-d97dc9e43b43 | 006f4fdc-8d23-4269-bddf-6d05fa886347 | 8 | 2024-09-25 0:00:00
0c164a94-60a2-4ce4-9e82-8ad8f8f8c531 | 25241dcc-6646-48d2-a012-d97dc9e43b43 | 006f4fdc-8d23-4269-bddf-6d05fa886347 | 6 | 2024-11-15 0:00:00
d87e4419-219f-40b5-a630-671dbe6c4187 | 25241dcc-6646-48d2-a012-d97dc9e43b43 | 006f4fdc-8d23-4269-bddf-6d05fa886347 | 25 | 2025-01-03 0:00:00
aa806a8d-14ee-485d-a72f-d35a7020b8f5 | 25241dcc-6646-48d2-a012-d97dc9e43b43 | 006f4fdc-8d23-4269-bddf-6d05fa886347 | 5 | 2025-02-07 0:00:00

BitThreads Entity Relationship Diagram (ERD)

The BitThreads database supports the operations of a fictional clothing retail chain and contains twenty-six tables and 27,160 rows of data.

[Image: BitThreads ERD Diagram]