Building a Reliable AI Analytics Agent with BigQuery, a Semantic Layer, and Google ADK

How to prevent your LLM from confidently giving you the wrong numbers

You’ve probably seen it: an LLM generates impressively complex SQL queries in seconds, executes them flawlessly, and returns results that look completely reasonable. Your stakeholders are amazed. Your team is excited. And then someone checks the numbers manually.

They’re wrong.

Not obviously wrong. Subtly, dangerously wrong. The kind of wrong that makes it into a board presentation before anyone notices.

Here’s the uncomfortable truth: Large Language Models are exceptionally good at generating SQL. They are exceptionally bad at understanding your business logic.

If you connect an LLM directly to BigQuery and ask it to answer business questions, you’re playing Russian roulette with your metrics. Eventually, you’ll get incorrect joins, drifting metric definitions, and KPIs that look plausible but won’t be trusted by anyone who actually understands the data.

This article shows you how to build analytics agents that are both powerful and trustworthy, using a semantic layer and Google’s Agent Development Kit (ADK), while keeping everything lightweight and cloud-native.

The Core Problem: Schema ≠ Meaning

Let me show you why this is harder than it looks.

Your BigQuery schema might have these columns:

orders.net_amount
orders.total_amount
orders.tax_amount
orders.refunded_amount
orders.discount_amount

Now, quick question: which one is “revenue”?

  • Is it net of tax?
  • Is it net of refunds?
  • Is it net of discounts?
  • At order grain or line-item grain?
  • Does it include canceled orders?
  • What about orders in the current billing period that haven’t shipped yet?

Humans in your organization know the answer. There’s probably a Slack thread with 47 messages and three Excel attachments that definitively settled this question back in Q2. Your finance team has it documented somewhere. Your data team has it in their tribal knowledge.

LLMs have to guess. And they’ll guess differently every time, depending on how the question is phrased, what context is in the conversation, and which training examples most closely matched the query pattern.

A reliable analytics agent must never guess business logic. Full stop.

The Solution: Separate Meaning from Storage

The architecture we’re building has three distinct layers, each with a clear responsibility:

Natural language question
          ↓
[ADK Agent]           ← Reasoning & orchestration
          ↓
[Semantic Registry]   ← Metrics & definitions
          ↓
[BigQuery]            ← Raw storage & computation

Here’s the key insight that makes this work:

The agent selects definitions, not tables.

Instead of giving your LLM free rein to explore your database schema and construct queries, you give it a curated menu of pre-defined metrics. It’s the difference between:

“Figure out how to calculate revenue from these 47 tables”
“Here are 12 approved metrics you can query, including net_revenue”

Let’s build this layer by layer.

Layer 1: BigQuery Computation Layer (Views)

Your raw tables stay raw. Messy, denormalized, full of edge cases — exactly as data warehouses tend to be.

But you don’t expose raw tables to your semantic layer. Instead, you create analytics views that encode all your complex business logic once, in SQL, where it belongs.

Here’s a concrete example:

-- analytics.fact_orders
CREATE VIEW analytics.fact_orders AS
SELECT
  o.order_id,
  DATE(o.order_timestamp) AS order_date,
  o.customer_id,
  c.country_code AS country,
  o.total_amount,
  o.tax_amount,
  COALESCE(r.refunded_amount, 0) AS refunded_amount,
  o.discount_amount,
  -- Here's where business logic lives
  (o.total_amount - o.tax_amount - COALESCE(r.refunded_amount, 0)) AS net_amount
FROM raw.orders o
LEFT JOIN raw.customers c
  ON o.customer_id = c.customer_id
LEFT JOIN (
  SELECT
    order_id,
    SUM(refund_amount) AS refunded_amount
  FROM raw.refunds
  WHERE refund_status = 'COMPLETED'
  GROUP BY order_id
) r ON o.order_id = r.order_id
WHERE o.order_status != 'CANCELLED';

Notice what this view does:

  • Encodes joins: The relationship between orders, customers, and refunds is handled once
  • Normalizes grain: Everything is at order-level grain
  • Implements business rules: Cancelled orders excluded, only completed refunds counted
  • Derives metrics: net_amount calculation is explicit and consistent
  • Hides complexity: Downstream consumers don’t see the raw mess

This is just SQL, running in BigQuery, using standard views. No new infrastructure, no vendor lock-in, no black boxes. But it’s the foundation that makes everything else possible.
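
To see the payoff, here’s a minimal sketch of what a downstream consumer writes, assuming the google-cloud-bigquery client and the fact_orders view above. No knowledge of refund joins or cancellation rules is required:

from google.cloud import bigquery

client = bigquery.Client()

# Consumers query the view like any table; the refund join,
# the cancellation filter, and the net_amount logic are baked in.
sql = """
SELECT order_date, SUM(net_amount) AS net_revenue
FROM analytics.fact_orders
GROUP BY order_date
ORDER BY order_date DESC
LIMIT 7
"""
for row in client.query(sql).result():
    print(row.order_date, row.net_revenue)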

Layer 2: Semantic Registry (Machine-Readable Metadata)

This is where the magic happens. The semantic registry is a machine-readable catalog that tells your AI agent exactly what metrics exist, how they’re calculated, and what constraints apply.

Here’s what a real metrics registry looks like:

metrics.yaml:

metrics:
  net_revenue:
    description: Revenue excluding tax, refunds, and cancelled orders
    sql: SUM(net_amount)
    source: analytics.fact_orders
    grain: day
    dimensions:
      - order_date
      - country
      - customer_segment
    constraints:
      max_date_range_days: 366

  order_count:
    description: Total number of completed orders
    sql: COUNT(DISTINCT order_id)
    source: analytics.fact_orders
    grain: day
    dimensions:
      - order_date
      - country

  average_order_value:
    description: Average net revenue per order
    sql: SUM(net_amount) / COUNT(DISTINCT order_id)
    source: analytics.fact_orders
    grain: day
    dimensions:
      - order_date
      - country

dimensions.yaml:

dimensions:
  order_date:
    column: order_date
    type: date
    description: Date the order was placed

  country:
    column: country
    type: string
    description: Customer country code (ISO 3166-1 alpha-2)
    allowed_values:
      - US
      - GB
      - FR
      - DE
      - CA

  customer_segment:
    column: customer_segment
    type: string
    description: Customer segmentation tier
    allowed_values:
      - enterprise
      - mid_market
      - smb
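
Because the registry is plain YAML, keeping it internally consistent is cheap. Here’s a minimal sketch, assuming the two files above live in a semantic-layer/ directory, that loads both and verifies every dimension a metric references is actually defined:

import yaml

with open("semantic-layer/metrics.yaml") as f:
    metrics = yaml.safe_load(f)["metrics"]
with open("semantic-layer/dimensions.yaml") as f:
    dimensions = yaml.safe_load(f)["dimensions"]

# Every dimension a metric references must exist in dimensions.yaml
for name, metric in metrics.items():
    for dim in metric.get("dimensions", []):
        assert dim in dimensions, f"{name} references undefined dimension '{dim}'"

print(f"Registry OK: {len(metrics)} metrics, {len(dimensions)} dimensions")

A check like this runs well as a CI step, so a broken registry never reaches the agent.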

Why This Registry Matters for LLMs

Let’s compare what an agent has to do with and without a semantic registry:

Without a registry, the agent must:

  1. Infer which columns represent metrics
  2. Infer how to join tables
  3. Infer the correct grain for aggregation
  4. Infer what filters are valid
  5. Infer which edge cases to handle
  6. Hope all these inferences are correct

With a registry, the agent:

  1. Selects from a finite, approved set of metrics
  2. Reuses pre-validated SQL expressions
  3. Respects explicit constraints
  4. Returns results with documented definitions
  5. Can’t accidentally create new “interpretations” of your metrics

You’re not “prompting better.” You’re removing ambiguity from the system architecture.

This isn’t a prompt engineering trick. This is software engineering.

Layer 3: Google ADK Agent (Reasoning Layer)

Now we get to the AI part. But notice: we built two entire layers before we even touched the LLM.

Google’s Agent Development Kit provides the structured reasoning layer that sits on top of our semantic foundation. ADK gives us:

  • Structured tool calling: The agent invokes functions with typed parameters, not free-form SQL
  • Deterministic execution: Same question → same query → same result
  • Grounding in trusted systems: The agent reasons about metrics, not tables
  • Observability: You can trace exactly how the agent reached its conclusion

Here’s the key constraint: The agent is not allowed to write free SQL.

Instead, it has access to a tool that looks like this:

Tool Contract:

from typing import Any

def query_metric(
    metric_name: str,
    filters: dict[str, Any],
    start_date: str,
    end_date: str,
    group_by: list[str] | None = None,
) -> dict:
    """
    Query a predefined metric from the semantic registry.

    Args:
        metric_name: Name of the metric (e.g., 'net_revenue')
        filters: Dimension filters (e.g., {'country': 'US'})
        start_date: Start date in YYYY-MM-DD format
        end_date: End date in YYYY-MM-DD format
        group_by: Optional list of dimensions to group by

    Returns:
        Query results with metadata
    """
    # Implementation details below
    ...

The agent’s job is to:

  1. Map user intent → metric name
  2. Extract filters from the question
  3. Determine the time window
  4. Validate everything against the registry
  5. Invoke the tool with correct parameters

The tool’s job is to:

  1. Look up the metric definition
  2. Validate filters against allowed dimensions
  3. Enforce constraints (like max date range)
  4. Generate SQL using approved templates
  5. Execute on BigQuery
  6. Return results with full lineage

End-to-End Example: From Question to Query

Let’s watch this system in action.

User Question

“What was net revenue in France last month?”

Agent Reasoning (ADK trace)

Step 1: Parse intent
- Metric needed: net_revenue
- Filter: country = 'FR'
- Time window: last month (2025-12-01 to 2025-12-31)

Step 2: Validate against registry
✓ Metric 'net_revenue' exists
✓ Dimension 'country' is allowed
✓ Value 'FR' is in allowed_values
✓ Date range: 31 days (within max_date_range_days: 366)

Step 3: Generate SQL from template
- Source: analytics.fact_orders
- Aggregation: SUM(net_amount)
- Filters: country = 'FR' AND order_date BETWEEN ...

Step 4: Execute query

Generated SQL (Bounded and Safe)

SELECT
  SUM(net_amount) AS net_revenue
FROM analytics.fact_orders
WHERE country = 'FR'
  AND order_date BETWEEN '2025-12-01' AND '2025-12-31';

Result

{
  "metric": "net_revenue",
  "value": 1247832.45,
  "currency": "USD",
  "filters": {
    "country": "FR",
    "start_date": "2025-12-01",
    "end_date": "2025-12-31"
  },
  "definition": "Revenue excluding tax, refunds, and cancelled orders",
  "source": "analytics.fact_orders",
  "execution_time_ms": 342
}

Notice what didn’t happen:

  • ❌ No guessing about which columns to use
  • ❌ No hallucinated joins
  • ❌ No ambiguity about what “revenue” means
  • ❌ No metric drift over time

The agent followed a deterministic path from question to answer, bounded by explicit rules at every step.

Adding Constraints (Critical for Production)

In a real system, you need guardrails to prevent expensive mistakes. The semantic registry supports this natively:

global_constraints:
  max_date_range_days: 366
  max_results: 10000
  allowed_aggregations:
    - SUM
    - COUNT
    - AVG
    - MAX
    - MIN
  require_date_filter: true

metric_constraints:
  net_revenue:
    required_dimensions:
      - order_date
    allowed_group_by:
      - country
      - customer_segment
      - order_date
    max_date_range_days: 90  # Override global for this metric

These constraints prevent:

  • Accidental full-table scans that cost thousands of dollars
  • Invalid dimension combinations that return nonsense
  • Queries that would time out or hit BigQuery limits
  • Metric misuse (like averaging a sum, or summing an average)

LLMs respect rules if rules are explicit. The key is making those rules machine-readable and enforced at runtime, not buried in documentation.
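
Enforcement is just a lookup and a comparison before any SQL is generated. Here’s a minimal sketch, assuming the constraints.yaml above has been loaded with yaml.safe_load:

from datetime import date

import yaml

with open("semantic-layer/constraints.yaml") as f:
    constraints = yaml.safe_load(f)

def check_date_range(metric_name: str, start: date, end: date) -> None:
    """Reject over-budget requests before any SQL is generated."""
    per_metric = constraints.get("metric_constraints", {}).get(metric_name, {})
    max_days = per_metric.get(
        "max_date_range_days",
        constraints["global_constraints"]["max_date_range_days"],
    )
    if (end - start).days > max_days:
        raise ValueError(
            f"{metric_name}: {(end - start).days}-day range exceeds {max_days}-day limit"
        )

try:
    # net_revenue's 90-day override beats the 366-day global limit
    check_date_range("net_revenue", date(2025, 7, 1), date(2025, 12, 31))
except ValueError as e:
    print(e)  # net_revenue: 183-day range exceeds 90-day limit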

Implementation: ADK Agent Code

Here’s what the actual implementation looks like using Google ADK:

from datetime import datetime

import yaml
from google.adk.agents import Agent
from google.cloud import bigquery


class SemanticLayerAgent:
    def __init__(self, registry_path: str):
        self.bq_client = bigquery.Client()
        self.load_registry(registry_path)

    def load_registry(self, path: str):
        """Load semantic registry from YAML files"""
        with open(f"{path}/metrics.yaml") as f:
            self.metrics = yaml.safe_load(f)["metrics"]
        with open(f"{path}/dimensions.yaml") as f:
            self.dimensions = yaml.safe_load(f)["dimensions"]
        with open(f"{path}/constraints.yaml") as f:
            self.constraints = yaml.safe_load(f)

    def query_metric(
        self,
        metric_name: str,
        filters: dict,
        start_date: str,
        end_date: str,
        group_by: list[str] | None = None,
    ) -> dict:
        """Query a predefined metric from the semantic registry"""

        # Validate metric exists
        if metric_name not in self.metrics:
            return {
                "error": f"Unknown metric: {metric_name}",
                "available_metrics": list(self.metrics.keys()),
            }

        metric = self.metrics[metric_name]

        # Validate date range (per-metric constraint overrides global)
        days = (datetime.fromisoformat(end_date)
                - datetime.fromisoformat(start_date)).days
        max_days = metric.get("constraints", {}).get(
            "max_date_range_days",
            self.constraints["global_constraints"]["max_date_range_days"],
        )
        if days > max_days:
            return {
                "error": f"Date range too large: {days} days (max: {max_days})"
            }

        # Validate filter dimensions and, where defined, allowed values
        for dim, value in filters.items():
            if dim not in metric["dimensions"]:
                return {
                    "error": f"Invalid dimension '{dim}' for metric '{metric_name}'",
                    "allowed_dimensions": metric["dimensions"],
                }
            allowed = self.dimensions[dim].get("allowed_values")
            if allowed and value not in allowed:
                return {
                    "error": f"Invalid value '{value}' for dimension '{dim}'",
                    "allowed_values": allowed,
                }

        # Validate group-by dimensions
        for dim in group_by or []:
            if dim not in metric["dimensions"]:
                return {
                    "error": f"Cannot group '{metric_name}' by '{dim}'",
                    "allowed_dimensions": metric["dimensions"],
                }

        # Build SQL from template
        sql = self._build_sql(metric_name, metric, filters,
                              start_date, end_date, group_by)

        # Execute query
        try:
            query_job = self.bq_client.query(sql)
            results = [dict(row) for row in query_job.result()]
            elapsed_ms = int(
                (query_job.ended - query_job.started).total_seconds() * 1000
            )

            return {
                "metric": metric_name,
                "description": metric["description"],
                "results": results,
                "filters": filters,
                "date_range": {"start": start_date, "end": end_date},
                "sql": sql,
                "execution_time_ms": elapsed_ms,
                "bytes_processed": query_job.total_bytes_processed,
            }
        except Exception as e:
            return {"error": f"Query failed: {e}"}

    def _build_sql(self, metric_name, metric, filters,
                   start_date, end_date, group_by):
        """Generate SQL from metric definition"""
        select_exprs = list(group_by or [])
        select_exprs.append(f"{metric['sql']} AS {metric_name}")

        # Build WHERE clause
        where_conditions = [
            f"order_date BETWEEN '{start_date}' AND '{end_date}'"
        ]
        for dim, value in filters.items():
            # Values were validated against allowed_values above; for
            # free-form dimensions, prefer BigQuery query parameters
            if self.dimensions[dim]["type"] == "string":
                where_conditions.append(f"{dim} = '{value}'")
            else:
                where_conditions.append(f"{dim} = {value}")

        sql_parts = [
            "SELECT " + ", ".join(select_exprs),
            f"FROM {metric['source']}",
            "WHERE " + " AND ".join(where_conditions),
        ]
        if group_by:
            sql_parts.append("GROUP BY " + ", ".join(group_by))

        return "\n".join(sql_parts)


# Create the agent; ADK wraps plain Python callables as function tools,
# so the bound method can be passed directly
semantic_agent = SemanticLayerAgent(registry_path="./semantic-layer")

agent = Agent(
    name="AnalyticsAgent",
    model="gemini-2.0-flash-exp",
    instruction="""You are an analytics expert that helps users query business metrics.

Always use the query_metric tool to answer questions about metrics.
Never generate SQL directly - always use the predefined metrics.

When you receive a question:
1. Identify which metric(s) the user is asking about
2. Extract any filters (country, segment, etc.)
3. Determine the time window
4. Call query_metric with the correct parameters
5. Present the results clearly, including the metric definition

If a metric doesn't exist, tell the user what metrics are available.
If filters are invalid, explain what dimensions are allowed for that metric.
""",
    tools=[semantic_agent.query_metric],
)
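
Before wiring this into a UI, exercise the tool directly. It’s a plain Python method, so the whole validation-and-SQL path is unit-testable without involving the model at all (the values below reuse the running example):

# Call the tool exactly as the agent would
result = semantic_agent.query_metric(
    metric_name="net_revenue",
    filters={"country": "FR"},
    start_date="2025-12-01",
    end_date="2025-12-31",
)

if "error" in result:
    print("Rejected:", result["error"])
else:
    print(result["sql"])      # the exact SQL that ran
    print(result["results"])  # e.g. [{'net_revenue': 1247832.45}]

For interactive testing, the ADK CLI’s adk run and adk web commands give you a local chat loop and a dev UI with full trace visibility.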

Where to Store the Registry

Keep it simple. The semantic registry doesn’t need fancy infrastructure:

Option 1: Git Repository (Recommended)

semantic-layer/
├── metrics.yaml
├── dimensions.yaml
├── constraints.yaml
├── README.md
└── CHANGELOG.md

Benefits:

  • Version control built-in
  • Code review process for changes
  • Easy to audit and rollback
  • Works with existing CI/CD
  • No service dependencies

Option 2: GCS Bucket

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-semantic-layer')
blob = bucket.blob('metrics.yaml')
registry = yaml.safe_load(blob.download_as_text())

Why This Works So Well with BigQuery

This architecture plays to BigQuery’s strengths:

BigQuery gives you:

  • Serverless scale (no capacity planning)
  • Sub-second query performance
  • Columnar storage (efficient aggregations)
  • Cost controls (slot reservations, query quotas)
  • SQL expressiveness (standard SQL, window functions, UDFs)

The semantic layer gives you:

  • Consistent metric definitions
  • Governed data access
  • Self-documenting analytics
  • Protection against drift

ADK gives you:

  • Structured reasoning
  • Explainable AI
  • Safe orchestration
  • Production-grade reliability

Together, you get analytics agents you can actually trust in production.

Real Use Cases

This pattern excels in scenarios where correctness matters more than speed:

1. AI Analytics Copilots

Internal tools where employees ask questions about business metrics. The agent can explain trends, compare periods, and drill into dimensions — all while guaranteeing consistent definitions.

2. KPI Explanation Bots

Automated systems that explain why metrics moved. “Revenue dropped 12% in EMEA last week” becomes “Revenue dropped 12% in EMEA last week, driven by a 23% decrease in Germany partially offset by 8% growth in France.”

3. Log and Metric Summarizers

Agents that analyze application logs and metrics to identify anomalies. The semantic layer ensures “error rate” means the same thing across all services.

4. Embedded Analytics in SaaS

Customer-facing analytics where users ask questions about their data. You can’t afford to show them wrong numbers — the semantic layer prevents that.

5. Internal Data Q&A Tools

Slack bots, CLI tools, or web interfaces where teams query data. Instead of writing SQL or waiting for analysts, they ask natural language questions and get trusted answers.

The common thread: anywhere the answer must be correct, not just fluent.

Start Small, Scale Incrementally

You don’t need a perfect semantic layer on day one. Start minimal:

Week 1: Foundation

  • Create one analytics view in BigQuery
  • Define three core metrics in YAML
  • Build one simple ADK tool
  • Test with your most common questions

Week 2: Validation

  • Run queries side-by-side (agent vs. manual SQL)
  • Verify results match your BI dashboards
  • Add constraints based on what breaks
  • Document edge cases you discover

Week 3: Expansion

  • Add more metrics incrementally
  • Introduce dimension hierarchies
  • Implement query caching (see the sketch after this list)
  • Add monitoring and observability
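
On caching: registry-driven queries are highly repeatable, because the same (metric, filters, date range) tuple always produces the same SQL. Here’s a minimal in-memory sketch; a production system might use Redis or lean on BigQuery’s own query cache:

import json
import time

_cache: dict[str, tuple[float, dict]] = {}
CACHE_TTL_SECONDS = 300

def cached_query_metric(agent: SemanticLayerAgent, **params) -> dict:
    """Memoize tool calls so identical requests skip BigQuery entirely."""
    key = json.dumps(params, sort_keys=True)  # deterministic cache key
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    result = agent.query_metric(**params)
    if "error" not in result:  # never cache failures
        _cache[key] = (time.time(), result)
    return result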

Month 2+: Production Hardening

  • Add cost controls and quota management
  • Implement audit logging
  • Create metric approval workflow
  • Build self-service metric registration

The beauty of this architecture: you can evolve incrementally without re-architecture. Every metric you add makes the system more valuable without making it more complex.

Common Pitfalls (and How to Avoid Them)

Pitfall 1: Making the Semantic Layer Too Complex

Don’t: Try to model every possible query pattern upfront
Do: Start with your top 10 most-asked questions and build from there

Pitfall 2: Bypassing the Semantic Layer

Don’t: Give the agent access to raw BigQuery for “flexibility”
Do: Make it impossible to execute queries outside the registry

Pitfall 3: Forgetting About Cost Controls

Don’t: Let agents run unbounded date range queries
Do: Enforce max date ranges, require date filters, and cap bytes billed at the BigQuery level (see the sketch below)
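
Beyond the registry’s date-range limits, BigQuery itself can enforce a hard per-query cost ceiling. A sketch using the standard client’s job config (the 1 GB cap is an illustrative value):

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT SUM(net_amount) AS net_revenue
FROM analytics.fact_orders
WHERE country = 'FR'
  AND order_date BETWEEN '2025-12-01' AND '2025-12-31'
"""

# Optional pre-flight: a dry run prices the query without executing it
dry_job = client.query(
    sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
)
print(f"Would scan {dry_job.total_bytes_processed / 1e9:.2f} GB")

# Hard cap: BigQuery refuses to run the query (and bills nothing)
# if it would scan more than maximum_bytes_billed
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=1 * 1024**3)
rows = client.query(sql, job_config=job_config).result()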

Pitfall 4: No Monitoring

Don’t: Deploy and forget
Do: Track query patterns, costs, accuracy, and user satisfaction

Pitfall 5: Metric Sprawl

Don’t: Let anyone add metrics without review
Do: Implement an approval process for new metrics (even a PR review works)
