Production Observability for Spring AI Agents on Amazon Bedrock Without Writing Tracing Code
You shipped your first AI agent to production last quarter. It works. Customers love it. Then your finance lead pings you on Slack:
“Our Bedrock bill went up 4x last week. Which feature is burning the tokens?”
- You stare at CloudWatch. You stare at your code. You realize you have no idea, because you never instrumented per-request token usage.
- A week later, a customer complaint lands on your desk: “the assistant gave me wrong information at 2:47 PM yesterday.” You need to find that exact request, see the prompt, see the model response, and figure out what went sideways. You can’t, because your logs don’t carry session IDs and your APM doesn’t know what a Bedrock request ID is.
- A month later, your security team runs a routine scan and finds customer emails and SSNs in your Datadog traces, copy-pasted there because somebody decided to log the prompt for debugging. Your CISO is now in your one-on-one.
Three problems. One root cause: AI agents are not normal HTTP services, and the standard observability stack doesn’t know what to do with them.
This tutorial walks through a Spring Boot starter that fixes all three at once: spring-ai-agentcore-observability. You add two dependencies, set three properties, and every Bedrock call ships fully enriched OpenTelemetry spans with PII already redacted. We’re going to build a working agent, validate it against live Amazon Bedrock, and inspect the span output byte-for-byte.
Everything you read below has been verified against amazon.nova-lite-v1:0 running in us-east-1. The full validation report is open-source.
The five problems you hit in production
If you’re running a Spring AI agent on Amazon Bedrock at any scale, you will eventually hit all five of these:

The thing the diagram is hinting at: these aren’t five separate engineering tickets. They share the same answer. You need a single, drop-in instrumentation layer that knows what an LLM request looks like, where the AWS-side correlation IDs come from, and how to scrub strings before they hit the wire. Hand-instrumenting each agent is a tax that compounds over time.
That’s what this starter is.
What you actually get

That’s the deal. Two dependencies and three properties for five capabilities you’d otherwise wire by hand in every microservice.
Quick refresher: what is observability with OpenTelemetry?
If you’ve been writing services for a while, skim this. If LLM observability is new to you, this section is the reason the rest of the article makes sense.
Observability is the practice of inferring what a running system is doing from the data it emits. Three signals carry that data: traces, metrics, and logs.
A trace is the story of one request as it travels through your system. It’s made of one or more spans, where each span is a unit of work (an HTTP call, a database query, a Bedrock invocation) with a start time, end time, attributes, and events.
OpenTelemetry (OTel) is the vendor-neutral standard for emitting these signals. It defines:
- A wire protocol (OTLP) so any backend can receive your data
- A set of SDKs (Java, Python, Go, Node, …) for producing it
- Semantic conventions: agreed-upon attribute names so a span from one service looks the same as a span from another. The HTTP convention says use http.request.method. The database convention says db.system. The GenAI convention says gen_ai.usage.input_tokens.
Once your service speaks OTel, you can change backends without changing code. Move from Jaeger to Datadog to X-Ray to Grafana Tempo with a config flip. That’s the value proposition.
What semantic conventions mean for LLM apps
Until recently, OTel didn’t have a story for AI. Every team rolled their own attribute names: tokens_used, prompt_size, model_name, input_count. Dashboards didn’t survive a refactor. Alerts on “tokens spent” couldn’t aggregate across services because they used different keys.
In 2024 the GenAI semantic conventions landed. Now there’s a standard set of names:
| Attribute | Meaning |
|---|---|
| gen_ai.system | Provider system (aws.bedrock, openai, anthropic, …) |
| gen_ai.request.model | Model the caller asked for |
| gen_ai.response.model | Model the provider actually used |
| gen_ai.usage.input_tokens | Prompt tokens billed |
| gen_ai.usage.output_tokens | Completion tokens billed |
| gen_ai.response.finish_reasons | Why the model stopped (stop, length, tool_use, …) |
| gen_ai.operation.name | What kind of call (chat, execute_tool, embeddings) |
The starter we’re building with emits exactly these names. That’s not an implementation detail; it’s the difference between “my dashboard works” and “my dashboard works on every backend, forever, even if I switch providers.”
Why every Spring Boot LLM app needs this
A traditional Spring Boot REST service has well-understood failure modes. The DB is slow. The cache is cold. The downstream service is throttling. You instrument once, build standard dashboards, move on.
LLM-backed services break that pattern in five ways at once. Each one creates a kind of blindness:
Concretely, here is what each blindness costs you in production, and what an OTel-instrumented LLM app gives you instead:
| Blindness | What you actually need to see |
|---|---|
| Cost is invisible per endpoint | Token histogram with gen_ai.request.model and gen_ai.token.type so you can group spend by feature, model, or customer tier |
| Latency mixes inference time with HTTP overhead | Span hierarchy showing the Bedrock call as its own child span with provider-side latency |
| Errors look like generic 5xx | error.type classified to rate_limit, timeout, authentication_failure, invalid_request, server_error so alerts route to the right team |
| Prompts disappear into logs | Opt-in span events for prompt and completion, masked before export, queryable from your APM |
| No way to reproduce a bad answer | gen_ai.response.finish_reasons plus the captured completion plus a request-id pivot to provider-side logs |
The defenses you get from a properly instrumented LLM service
When this is wired up, the operational defenses you gain look like this:
Each one of those defenses is the difference between “my LLM app works” and “I can run my LLM app in regulated production at scale and sleep at night.”
Where this starter fits in the OTel ecosystem
The starter does one thing: takes Spring AI’s response metadata and turns it into OTel signals that follow the GenAI semantic conventions. Everything downstream (the SDK, OTLP, the collector, the backend) is standard OTel infrastructure that already exists in most production stacks. You’re not buying into a new tool; you’re filling in the LLM-shaped hole in the one you already use.
Architecture: how the magic happens
Before we write code, let’s understand what’s about to run inside your JVM. The starter is built around three moving parts: an AOP aspect, a PII masker, and an exporter wrapper. They cooperate without you ever seeing them.
End-to-end request flow
Plain English: a client posts a prompt to /invocations. The AgentCore HTTP controller dispatches to your @AgentCoreInvocation method. The aspect (the dotted line) is wrapping the controller transparently, so you never write tracing code. Your method calls Spring AI, which calls Bedrock, and the response flows back. The aspect reads token counts from the response metadata, builds the span, and hands it to the OTel SDK. Just before bytes leave the JVM, the masker scrubs sensitive strings. Whatever sits at the right edge (Datadog, Jaeger, X-Ray) only ever sees redacted content.
What the aspect actually does, step by step
Notice step 13 versus step 14. The aspect ends the span with the raw prompt content on it. The masker doesn’t run until the SDK hands the span to the exporter. That’s deliberate: masking is a transformation on the way out the door, not a mutation on the way in. It means the in-memory span is always inspectable for debugging, but nothing crosses the network boundary unredacted.
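The export-time idea can be sketched with nothing but the JDK. This is a simplified illustration, not the starter’s code: `SketchSpan`, `SketchExporter`, `RecordingExporter`, and `MaskingExporter` are hypothetical stand-ins for the OTel SDK types, and the SSN regex is an assumption chosen only to make the demo concrete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-ins for illustration; not the starter's or the OTel SDK's types.
record SketchSpan(String name, String prompt) {}

interface SketchExporter {
    void export(SketchSpan span);
}

// Collects whatever "crosses the wire" so the demo can inspect it.
final class RecordingExporter implements SketchExporter {
    final List<SketchSpan> sent = new ArrayList<>();

    @Override
    public void export(SketchSpan span) {
        sent.add(span);
    }
}

// Decorator: builds a masked copy on the way out and never mutates the original span.
final class MaskingExporter implements SketchExporter {
    private final SketchExporter delegate;
    private final UnaryOperator<String> masker;

    MaskingExporter(SketchExporter delegate, UnaryOperator<String> masker) {
        this.delegate = delegate;
        this.masker = masker;
    }

    @Override
    public void export(SketchSpan span) {
        delegate.export(new SketchSpan(span.name(), masker.apply(span.prompt())));
    }
}

final class ExportTimeMaskingDemo {
    public static void main(String[] args) {
        RecordingExporter wire = new RecordingExporter();
        SketchExporter exporter = new MaskingExporter(wire,
                text -> text.replaceAll("\\b\\d{3}-\\d{2}-\\d{4}\\b", "###-##-####"));

        SketchSpan inMemory = new SketchSpan("chat", "SSN: 123-45-6789");
        exporter.export(inMemory);

        System.out.println(inMemory.prompt());         // the in-memory span keeps the raw text
        System.out.println(wire.sent.get(0).prompt()); // the exported copy is masked
    }
}
```

The design choice to notice is that the wrapper returns a new value instead of editing the span it was given, which is what keeps the in-process view intact for debugging.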
The class topology
Two bits worth pointing out:
- MaskingSpanData extends DelegatingSpanData and lazy-masks. It only computes the masked attributes the first time the exporter accesses them. Cheap.
- PiiMasker is a @ConditionalOnMissingBean: define your own bean and the auto-config steps aside. You can swap in healthcare-specific (HIPAA) or non-US patterns without forking the library.
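To make the lazy-masking point concrete, here is a minimal JDK-only sketch of compute-on-first-access. `LazyMaskedText` is a hypothetical class written for this article; the starter’s MaskingSpanData may be implemented quite differently, and this version is not thread-safe.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch of lazy masking; not the starter's MaskingSpanData.
final class LazyMaskedText {
    private final String raw;
    private final UnaryOperator<String> masker;
    private String masked;   // computed on first access, then reused
    int maskerCalls = 0;     // exposed only so the demo can show the laziness

    LazyMaskedText(String raw, UnaryOperator<String> masker) {
        this.raw = raw;
        this.masker = masker;
    }

    // Runs the masker at most once, no matter how often the exporter reads the value.
    String get() {
        if (masked == null) {
            masked = masker.apply(raw);
            maskerCalls++;
        }
        return masked;
    }

    public static void main(String[] args) {
        LazyMaskedText text = new LazyMaskedText("key AKIAIOSFODNN7EXAMPLE",
                s -> s.replaceAll("AKIA[0-9A-Z]{16}", "[REDACTED]"));
        System.out.println(text.maskerCalls); // nothing computed yet
        System.out.println(text.get());
        System.out.println(text.get());
        System.out.println(text.maskerCalls); // still a single masker run
    }
}
```

If a span is sampled out or dropped before export, the masking cost is never paid, which is the saving the bullet above describes.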
Hands-on: building the agent
Enough theory. Let’s wire one up.
Project layout
my-agent/
├── pom.xml
└── src/
    └── main/
        ├── java/
        │   └── com/example/demo/
        │       ├── DemoApplication.java
        │       └── BedrockAgentService.java
        └── resources/
            └── application.properties
That’s the whole project. Two Java files, one properties file. Watch.
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>3.5.9</version>
    <relativePath/>
  </parent>

  <groupId>com.example</groupId>
  <artifactId>my-agent</artifactId>
  <version>1.0.0-SNAPSHOT</version>

  <properties>
    <java.version>17</java.version>
    <spring-ai.version>1.1.2</spring-ai.version>
    <opentelemetry-instrumentation.version>2.14.0</opentelemetry-instrumentation.version>
  </properties>

  <repositories>
    <repository>
      <id>spring-snapshots</id>
      <url>https://repo.spring.io/snapshot</url>
      <snapshots><enabled>true</enabled></snapshots>
    </repository>
    <repository>
      <id>central-portal-snapshots</id>
      <url>https://central.sonatype.com/repository/maven-snapshots/</url>
      <snapshots><enabled>true</enabled></snapshots>
    </repository>
  </repositories>

  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-bom</artifactId>
        <version>${spring-ai.version}</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
      <dependency>
        <groupId>io.opentelemetry.instrumentation</groupId>
        <artifactId>opentelemetry-instrumentation-bom</artifactId>
        <version>${opentelemetry-instrumentation.version}</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>

  <dependencies>
    <!-- The observability starter -->
    <dependency>
      <groupId>org.springaicommunity</groupId>
      <artifactId>spring-ai-agentcore-observability</artifactId>
      <version>1.1.0-SNAPSHOT</version>
    </dependency>
    <!-- Spring AI -> Amazon Bedrock Converse -->
    <dependency>
      <groupId>org.springframework.ai</groupId>
      <artifactId>spring-ai-starter-model-bedrock-converse</artifactId>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-maven-plugin</artifactId>
      </plugin>
    </plugins>
  </build>
</project>
Two <dependency> blocks. One is the observability starter, one is Spring AI’s Bedrock starter. Spring Boot pulls in the rest: embedded Tomcat, AOP, OTel SDK, the AgentCore HTTP controller, the AWS SDK.
application.properties
spring.application.name=my-agent
# --- Spring AI Bedrock ---
spring.ai.bedrock.converse.chat.options.model=amazon.nova-lite-v1:0
spring.ai.bedrock.aws.region=${AWS_REGION:us-east-1}
# --- Observability: opt in to prompt/completion capture ---
spring.ai.agentcore.observability.capture-content=true
spring.ai.agentcore.observability.masking.enabled=true
# Custom regex patterns for things only your org cares about
# (backslashes are doubled because .properties files treat a single backslash as an escape)
spring.ai.agentcore.observability.masking.custom-regex[0]=\\bAKIA[0-9A-Z]{16}\\b
spring.ai.agentcore.observability.masking.custom-regex[1]=\\bsk-[A-Za-z0-9]{20,}\\b
# --- OpenTelemetry SDK ---
otel.traces.exporter=logging
otel.metrics.exporter=logging
otel.logs.exporter=none
The custom regex examples are real. The first redacts AWS access keys (anything starting with AKIA followed by 16 uppercase letters or digits). The second redacts OpenAI secret keys (sk- followed by 20+ alphanumerics). If somebody pastes a credential into a prompt by accident, your APM never sees it.
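If you want to sanity-check a pattern before putting it in configuration, a few lines of plain `java.util.regex` are enough. The sketch below is an assumption about how such patterns could be applied, not the starter’s implementation: `CustomRegexSketch` and its `redact` helper are hypothetical, and the `[REDACTED]` placeholder simply mirrors the output shown later in this article.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for trying out custom patterns; not part of the starter.
final class CustomRegexSketch {
    // The two expressions from the text, written as Java string literals
    // (each regex backslash is doubled inside a Java string).
    static final List<Pattern> CUSTOM = List.of(
            Pattern.compile("\\bAKIA[0-9A-Z]{16}\\b"),
            Pattern.compile("\\bsk-[A-Za-z0-9]{20,}\\b"));

    // Replaces every match of every pattern with a fixed placeholder.
    static String redact(String text) {
        String result = text;
        for (Pattern pattern : CUSTOM) {
            result = pattern.matcher(result).replaceAll("[REDACTED]");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(redact("AWS key: AKIAIOSFODNN7EXAMPLE"));
        System.out.println(redact("OpenAI key: sk-abc123def456ghi789jkl012mno345"));
        System.out.println(redact("Order AKIA123 is fine")); // too short to match, left alone
    }
}
```

Running candidate patterns against both a known secret and a near miss like this is a cheap way to catch over-matching before it reaches production traces.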
Gotcha number one: never set otel.traces.exporter=none. That disables the OTel SDK entirely. The starter’s auto-configuration runs through the SDK’s customizer chain, and if the SDK is off, our wiring never happens. For local development use logging. For production use otlp.
The handler
package com.example.demo;

import org.springaicommunity.agentcore.annotation.AgentCoreInvocation;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.stereotype.Service;

@Service
public class BedrockAgentService {

    private final ChatModel chatModel;

    public BedrockAgentService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    // Registered with the AgentCore HTTP controller; POST /invocations lands here.
    @AgentCoreInvocation
    public ChatResponse handle(String prompt) {
        return this.chatModel.call(new Prompt(new UserMessage(prompt)));
    }
}
That’s it. The @AgentCoreInvocation annotation registers your method with the AgentCore HTTP controller, so POST /invocations lands here. The aspect wraps the controller’s dispatch path, not your method directly. That’s a deliberate design choice so AOP proxies don’t strip the annotation off your bean.
Main class
package com.example.demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}
Build and run:
mvn -B package
export AWS_REGION=us-east-1
java -jar target/my-agent-1.0.0-SNAPSHOT.jar
The smoke test that proves everything
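This is the test I wrote against real Amazon Bedrock to validate every claim in the docs. Save it as RealBedrockEndToEndTest.java under src/test/java/com/example/demo/. It needs spring-boot-starter-test and opentelemetry-sdk-testing on the test classpath; the pom.xml above does not declare them, so add them with test scope if they are not already pulled in transitively.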
package com.example.demo;

import static org.assertj.core.api.Assertions.assertThat;

import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.sdk.trace.data.SpanData;
import java.util.List;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.Test;
import org.springaicommunity.agentcore.observability.telemetry.GenAiTelemetrySupport;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.context.annotation.Import;
import org.springframework.http.MediaType;
import org.springframework.test.web.servlet.MockMvc;
import org.springframework.test.web.servlet.request.MockMvcRequestBuilders;
import org.springframework.test.web.servlet.result.MockMvcResultMatchers;

@SpringBootTest(classes = DemoApplication.class)
@AutoConfigureMockMvc
@Import(InMemoryExporterConfig.class)
class RealBedrockEndToEndTest {

    @Autowired
    private MockMvc mockMvc;

    @AfterEach
    void resetExporter() {
        InMemoryExporterConfig.SPAN_EXPORTER.reset();
    }

    @Test
    void everything() throws Exception {
        // One prompt carrying every PII type the masker should catch.
        String body = "Customer profile:\n"
                + "Email: jane.doe@acme-corp.com\n"
                + "SSN: 123-45-6789\n"
                + "Visa: 4532-0151-1283-0366\n"
                + "Phone: 555-234-5678\n"
                + "AWS key: AKIAIOSFODNN7EXAMPLE\n"
                + "OpenAI key: sk-abc123def456ghi789jkl012mno345\n"
                + "Reply with just the word OK.";

        mockMvc.perform(MockMvcRequestBuilders.post("/invocations")
                        .contentType(MediaType.TEXT_PLAIN)
                        .header("x-amzn-bedrock-agentcore-session-id", "session-real-bedrock-1")
                        .header("x-amzn-request-id", "req-real-bedrock-1")
                        .content(body))
                .andExpect(MockMvcResultMatchers.status().isOk());

        // Pick the span that carries GenAI attributes.
        List<SpanData> spans = InMemoryExporterConfig.SPAN_EXPORTER.getFinishedSpanItems();
        SpanData span = spans.stream()
                .filter(s -> s.getAttributes().get(GenAiTelemetrySupport.GEN_AI_PROVIDER_NAME) != null)
                .findFirst().orElseThrow();

        // GenAI semantic conventions
        assertThat(span.getAttributes().get(GenAiTelemetrySupport.GEN_AI_PROVIDER_NAME)).isEqualTo("aws.bedrock");
        assertThat(span.getAttributes().get(GenAiTelemetrySupport.GEN_AI_REQUEST_MODEL)).isEqualTo("amazon.nova-lite-v1:0");
        assertThat(span.getAttributes().get(GenAiTelemetrySupport.GEN_AI_USAGE_INPUT_TOKENS)).isGreaterThan(0L);
        assertThat(span.getAttributes().get(GenAiTelemetrySupport.GEN_AI_USAGE_OUTPUT_TOKENS)).isGreaterThan(0L);

        // AWS correlation
        assertThat(span.getAttributes().get(GenAiTelemetrySupport.AWS_BEDROCK_AGENTCORE_SESSION_ID))
                .isEqualTo("session-real-bedrock-1");
        assertThat(span.getAttributes().get(GenAiTelemetrySupport.AWS_REQUEST_ID))
                .isEqualTo("req-real-bedrock-1");

        // Masking: read the prompt event as the exporter saw it.
        String prompt = span.getEvents().stream()
                .filter(e -> e.getName().equals(GenAiTelemetrySupport.EVENT_GEN_AI_CONTENT_PROMPT))
                .map(e -> e.getAttributes().get(AttributeKey.stringKey("gen_ai.prompt")))
                .findFirst().orElseThrow();
        assertThat(prompt).doesNotContain("jane.doe@acme-corp.com");
        assertThat(prompt).doesNotContain("123-45-6789");
        assertThat(prompt).doesNotContain("4532-0151-1283-0366");
        assertThat(prompt).doesNotContain("AKIAIOSFODNN7EXAMPLE");
        assertThat(prompt).contains("j***@***.com");
        assertThat(prompt).contains("###-##-####");
        assertThat(prompt).contains("4532-****-****-0366");
        assertThat(prompt).containsPattern("\\[REDACTED]");
    }
}
Plus a tiny test config that reroutes the OTel exporter to an in-memory one (still wrapped with the masker, so we assert on what would have hit the wire):
package com.example.demo;

import io.opentelemetry.sdk.autoconfigure.spi.AutoConfigurationCustomizerProvider;
import io.opentelemetry.sdk.testing.exporter.InMemorySpanExporter;
import org.springaicommunity.agentcore.observability.masking.PiiMasker;
import org.springaicommunity.agentcore.observability.masking.PiiMaskingSpanExporter;
import org.springframework.boot.test.context.TestConfiguration;
import org.springframework.context.annotation.Bean;

@TestConfiguration
public class InMemoryExporterConfig {

    public static final InMemorySpanExporter SPAN_EXPORTER = InMemorySpanExporter.create();

    // Replace the configured exporter with the in-memory one, still wrapped by the masker.
    @Bean
    AutoConfigurationCustomizerProvider routeToInMemory(PiiMasker piiMasker) {
        return customizer -> customizer.addSpanExporterCustomizer(
                (delegate, unused) -> new PiiMaskingSpanExporter(SPAN_EXPORTER, piiMasker));
    }
}
Run it:
export AWS_REGION=us-east-1
mvn -B test
What real Bedrock returned
Here’s the actual span captured from the live Bedrock call (amazon.nova-lite-v1:0, us-east-1, May 2026):
=== REAL BEDROCK SPAN SUMMARY ===
model = amazon.nova-lite-v1:0
input_tokens = 121
output_tokens = 51
finish_reason = end_turn
session_id = session-real-bedrock-1
request_id = req-real-bedrock-1
masked prompt =
Customer profile:
Email: j***@***.com
SSN: ###-##-####
Visa: 4532-****-****-0366
Phone: ###-###-####
AWS key: [REDACTED]
OpenAI key: [REDACTED]
Reply with just the word OK.
masked completion = Sorry, but I cannot respond to a request that might
involve sharing personal information about an individual. ...
================================
Read that output again, slowly. Six pieces of sensitive data went in (email, SSN, Luhn-valid card, phone, AWS key, OpenAI key). All six came out masked. Token counts, finish reason, model name, both AWS correlation IDs are present and correct. Zero hand-instrumentation code.
The completion is interesting too. Bedrock’s safety stack saw the unmasked prompt (because, per the architecture, masking is export-time, not request-time) and refused. That tells us two things at once: the prompt actually reached Bedrock, and Bedrock acted on its real content. Then the masker did its job before the span left the JVM for the telemetry backend.
What’s behind each masked value
The PII patterns aren’t naive regexes. Here’s what’s running for the credit card mask, which is the trickiest one:
That’s why 1234567890123456 doesn’t get masked: it fails Luhn. Order numbers, tracking IDs, random hashes survive intact. Your false positive rate stays low. I tested this explicitly:
String body = "Tracking number: 1234567890123456. Reply OK.";
// Assertion: prompt event still contains "1234567890123456" verbatim
Passed. The masker doesn’t touch fake-looking 16-digit strings.
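The Luhn gate itself is a few lines of arithmetic. The sketch below is a minimal JDK-only version under stated assumptions: `LuhnSketch`, `passesLuhn`, and `maskCard` are hypothetical names, the check is limited to 16-digit candidates for simplicity, and the masked shape (first four and last four digits kept) is copied from the span output shown earlier rather than from the starter’s source.

```java
// Hypothetical sketch of a Luhn-gated card mask; not the starter's implementation.
final class LuhnSketch {

    // Returns true when the digit string passes the Luhn checksum.
    static boolean passesLuhn(String digits) {
        if (digits.isEmpty()) {
            return false;
        }
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit is not doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (d < 0 || d > 9) {
                return false; // not a digit
            }
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as adding the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    // Masks only 16-digit candidates that pass Luhn; everything else is returned unchanged.
    static String maskCard(String candidate) {
        String digits = candidate.replaceAll("[ -]", "");
        if (digits.length() != 16 || !passesLuhn(digits)) {
            return candidate;
        }
        return digits.substring(0, 4) + "-****-****-" + digits.substring(12);
    }

    public static void main(String[] args) {
        System.out.println(maskCard("4532-0151-1283-0366")); // passes Luhn, gets masked
        System.out.println(maskCard("1234567890123456"));    // fails Luhn, left as-is
    }
}
```

For the tracking number the checksum total is 64, which is not a multiple of 10, so the candidate is rejected before any masking happens.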
For emails the strategy is different: keep the first character of the local part, drop the rest, keep the TLD. jane.doe@acme-corp.com becomes j***@***.com. You preserve aggregate analytics (.com vs .gov vs .mil) while making individual identity unrecoverable. Phones are even simpler: every US format collapses to ###-###-####.
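Both strategies are easy to picture as regex replacements. What follows is a rough sketch, not the starter’s patterns: `EmailPhoneMaskSketch` is hypothetical, and the two regexes are deliberately simple assumptions that will miss edge cases a production masker should handle.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the email and US-phone strategies described above.
final class EmailPhoneMaskSketch {
    // Group 1 = first character of the local part, group 2 = top-level domain.
    private static final Pattern EMAIL = Pattern.compile(
            "\\b([A-Za-z0-9])[A-Za-z0-9._%+-]*@[A-Za-z0-9.-]+\\.([A-Za-z]{2,})\\b");

    // Accepts 555-234-5678, 555.234.5678, 555 234 5678 and (555) 234-5678.
    private static final Pattern US_PHONE = Pattern.compile(
            "\\(?\\b\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b");

    static String mask(String text) {
        // Keep the first local-part character and the TLD, drop everything in between.
        String result = EMAIL.matcher(text).replaceAll("$1***@***.$2");
        // Collapse every recognized phone format to one fixed shape.
        return US_PHONE.matcher(result).replaceAll("###-###-####");
    }

    public static void main(String[] args) {
        System.out.println(mask("Email: jane.doe@acme-corp.com"));
        System.out.println(mask("Phone: 555-234-5678"));
        System.out.println(mask("Phone: (555) 234-5678"));
    }
}
```

Keeping the TLD while discarding the domain is what preserves the aggregate view (.com vs .gov) without leaving enough to identify a person.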
How to think about this in production
Once the starter is in place, you have building blocks. Here are the four most common ways teams actually use them:
1. Per-model token cost dashboards (no custom code)
The aspect records the gen_ai.client.token.usage histogram with gen_ai.token.type=input|output and gen_ai.request.model as dimensions. Point it at OTLP, build a Grafana dashboard, slice by model. Suddenly the question “which model is eating our budget” has a one-query answer.
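To see what “slice by model” means in terms of data, here is a small JDK-only sketch of the grouping a dashboard query performs over those two dimensions. `TokenUsageRollup` and `Measurement` are hypothetical names, and the sample numbers in `main` are illustrative only (the first pair reuses the token counts from the span shown earlier; `model-b` is a made-up model name).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of grouping token measurements by model and token type.
final class TokenUsageRollup {
    // One histogram measurement: a token count plus the two dimensions named in the text.
    record Measurement(String requestModel, String tokenType, long tokens) {}

    // Sums tokens per "model/tokenType" key, sorted for stable output.
    static Map<String, Long> byModelAndType(List<Measurement> measurements) {
        Map<String, Long> totals = new TreeMap<>();
        for (Measurement m : measurements) {
            totals.merge(m.requestModel() + "/" + m.tokenType(), m.tokens(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Measurement> sample = List.of(
                new Measurement("amazon.nova-lite-v1:0", "input", 121),
                new Measurement("amazon.nova-lite-v1:0", "output", 51),
                new Measurement("amazon.nova-lite-v1:0", "input", 100),
                new Measurement("model-b", "input", 40)); // made-up model name
        byModelAndType(sample).forEach((key, total) -> System.out.println(key + " = " + total));
    }
}
```

In a real backend the same grouping is a one-line query over the histogram; the point is only that the dimensions are already on the metric, so no application code has to compute them.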
2. Cross-system request correlation
Both AWS headers (x-amzn-bedrock-agentcore-session-id and x-amzn-request-id) get copied onto every span. When a customer complains, you can:
This is the workflow you want at 2 AM during an incident.
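As a rough picture of the correlation step, the sketch below copies the two headers onto a span-attribute map and then looks a span up by request ID. It is JDK-only and hypothetical throughout: `CorrelationSketch` is not a starter class, spans are modeled as plain maps, header names are assumed to be already lower-cased, and the session attribute key is an assumption (check the starter’s constants for the real name).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of header-to-attribute correlation; spans are modeled as plain maps.
final class CorrelationSketch {
    static final String SESSION_HEADER = "x-amzn-bedrock-agentcore-session-id";
    static final String REQUEST_HEADER = "x-amzn-request-id";
    // Assumed attribute keys for this sketch only.
    static final String SESSION_ATTR = "aws.bedrock.agentcore.session_id";
    static final String REQUEST_ATTR = "aws.request_id";

    // Copies the two correlation headers onto span attributes when they are present.
    static Map<String, String> toSpanAttributes(Map<String, String> headers) {
        Map<String, String> attributes = new LinkedHashMap<>();
        String session = headers.get(SESSION_HEADER);
        String request = headers.get(REQUEST_HEADER);
        if (session != null) {
            attributes.put(SESSION_ATTR, session);
        }
        if (request != null) {
            attributes.put(REQUEST_ATTR, request);
        }
        return attributes;
    }

    // The incident-time lookup: find the span that belongs to a given request ID.
    static Optional<Map<String, String>> findByRequestId(
            List<Map<String, String>> spans, String requestId) {
        return spans.stream()
                .filter(span -> requestId.equals(span.get(REQUEST_ATTR)))
                .findFirst();
    }

    public static void main(String[] args) {
        Map<String, String> span = toSpanAttributes(Map.of(
                SESSION_HEADER, "session-real-bedrock-1",
                REQUEST_HEADER, "req-real-bedrock-1"));
        System.out.println(findByRequestId(List.of(span), "req-real-bedrock-1").isPresent());
    }
}
```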
3. Compliance posture, by default
The crucial property: third parties (Datadog, New Relic, your collector) never see unredacted prompts. The architecture makes the bad thing hard to do by accident.
4. Error classification for alerting
The aspect maps exception class names to one of five error.type values:
| error.type | Triggers when class name contains |
|---|---|
| rate_limit | Throttling, ThrottlingException |
| invalid_request | ValidationException, InvalidParameter, IllegalArgument |
| timeout | Timeout, RequestTimeout, ReadTimeout |
| authentication_failure | Auth, Security, AccessDenied |
| server_error | AWS SDK ServiceException, default |
Now your alerts can fire differently for “Bedrock is throttling us” (capacity issue, escalate to AWS) vs “we have a code bug” (page the on-call dev). One dashboard query, no manual exception classification anywhere.
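The table translates almost directly into substring checks on the exception’s class name. The sketch below is one possible reading of it, not the aspect’s actual code: `ErrorTypeSketch` is hypothetical, the order of the checks is an assumption, and the nested `ThrottlingException` is a local stand-in rather than the AWS SDK class.

```java
// Hypothetical sketch of class-name based error classification.
final class ErrorTypeSketch {

    // Local stand-in so the demo needs no AWS SDK on the classpath.
    static final class ThrottlingException extends RuntimeException {}

    static String classify(Throwable error) {
        return classifyName(error.getClass().getSimpleName());
    }

    // First matching rule wins; anything unmatched falls through to server_error.
    static String classifyName(String className) {
        if (className.contains("Throttling")) {
            return "rate_limit";
        }
        if (className.contains("ValidationException")
                || className.contains("InvalidParameter")
                || className.contains("IllegalArgument")) {
            return "invalid_request";
        }
        if (className.contains("Timeout")) {
            return "timeout";
        }
        if (className.contains("Auth")
                || className.contains("Security")
                || className.contains("AccessDenied")) {
            return "authentication_failure";
        }
        return "server_error";
    }

    public static void main(String[] args) {
        System.out.println(classify(new ThrottlingException()));
        System.out.println(classify(new IllegalArgumentException("bad input")));
        System.out.println(classify(new java.net.SocketTimeoutException("slow")));
        System.out.println(classify(new SecurityException("denied")));
        System.out.println(classify(new RuntimeException("boom")));
    }
}
```

Matching on names rather than types keeps the classifier free of compile-time dependencies on any particular SDK, at the cost of being sensitive to how exceptions happen to be named.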
What it looks like in CI
The library itself ships with an 81-test suite plus JaCoCo coverage gates (96.5% line, 83% branch). I ran mvn verify on a clean clone:
Tests run: 81, Failures: 0, Errors: 0, Skipped: 2
[INFO] All coverage checks have been met.
[INFO] BUILD SUCCESS
The two skipped tests are the live-Bedrock ones, gated behind RUN_REAL_BEDROCK_TESTS=true so CI doesn’t accidentally bill anyone’s account. They cover the same ground as the test I wrote above.
Your own CI strategy probably wants:
Mock the ChatModel in your fast tests. Keep one nightly job that hits live Bedrock to catch behavior drift in the model itself (token counts, finish reasons, PII-handling). Cap that job’s spend with a model + token budget so it can never run away.
Production export targets
Swap otel.traces.exporter=logging for otlp and you’re done:
otel.traces.exporter=otlp
otel.metrics.exporter=otlp
otel.exporter.otlp.endpoint=http://otel-collector:4317
otel.exporter.otlp.protocol=grpc
otel.service.name=my-agent
The masking wrapper still applies. Common targets:
| Backend | Endpoint |
|---|---|
| OpenTelemetry Collector | http://otel-collector:4317 |
| Jaeger (1.35+, native OTLP) | http://jaeger:4317 |
| Grafana Tempo | http://tempo:4317 |
| Datadog agent | http://datadog-agent:4317 |
| New Relic | https://otlp.nr-data.net:4317 (with API key header) |
| AWS X-Ray via ADOT collector | http://adot-collector:4317 |
For X-Ray specifically: the gen_ai.* attributes show up as regular span attributes in the X-Ray UI, and the aws.request_id lets you pivot directly to CloudWatch Logs and Bedrock model invocation logging.
Real-world adoption playbook
If you’re rolling this out across multiple agents in an organization, here’s the order I’d recommend:
Step 3 is the one teams skip and regret. The default patterns cover US PII and major card networks. Your business probably has internal account numbers, customer IDs, regional formats (UK NI numbers, EU VAT IDs, Indian Aadhaar) that nobody outside your org has ever heard of. Add them once, in the central starter properties, and you’re done.
Conclusion
Let’s tie this together.
Where we started
Three real production headaches, all caused by the same gap: standard observability tooling was built for HTTP services and databases, not for LLM workloads. Per-request token cost is invisible. Customer complaints can’t be traced back to specific provider calls. Prompts and completions, full of PII, leak into third-party APMs because nobody put a redaction layer in the way. Errors all look the same, so a throttling incident and a code bug page the same on-call engineer.
What we did about it
We walked through spring-ai-agentcore-observability end to end:
- Set the foundation. Refreshed what observability with OpenTelemetry actually means – traces, metrics, logs, and the GenAI semantic conventions that finally give LLM workloads a standard vocabulary.
- Mapped the architecture. Saw how a single Spring Boot AOP aspect wraps the AgentCore HTTP boundary, enriches spans with the GenAI attributes Spring AI already returns, and hands them to a masking exporter that scrubs PII just before bytes leave the JVM.
- Built a working agent. Two dependencies in pom.xml. Three properties in application.properties. One annotated method. That is the whole footprint a developer has to remember.
- Validated against real Amazon Bedrock. Posted a prompt loaded with email, SSN, Luhn-valid Visa, US phone, AWS access key, and OpenAI secret key. Confirmed the exported span carries every promised attribute, both AWS correlation IDs, positive token counts from the live model, and that every PII type was redacted before export. The Luhn-failing tracking number survived intact, proving the false-positive rejection works.
- Looked at production patterns. Per-model cost dashboards, complaint-to-CloudWatch pivots, compliance-by-default, error classification routing, and a six-step adoption playbook for rolling this across an org.
What you walk away with
- Day one: clone the starter, add deps, see spans in stdout
- Week one: point OTLP at your APM, build a token cost dashboard
- Month one: alerts on error type and rate per model, custom regex for org PII
- Quarter one: all agents standardized, security signs off on prompt capture
- Year one: drift detection on token deltas, model behavior changes caught in hours not weeks
That trajectory is what good infrastructure looks like – cheap to start, compounding returns over time, no rewrite when requirements grow.
The bigger picture
LLM-backed services are not going to get less complex. Multi-model routing, tool calling, agent-to-agent workflows, fine-tuned variants, on-device inference – all of it is showing up in production this year. Every one of those moves makes the cost-correlation-compliance-reliability problem worse, not better.
The only sustainable answer is observability that speaks the same language as the workload. OpenTelemetry’s GenAI conventions are that language. A starter that emits them automatically, redacts on the way out, and stays out of your way is the right place to spend a Friday afternoon if you have one Spring AI service in production.
What to do next
If you take one thing away, take this: spans are cheap, dashboards compound, and PII leaks are forever. Wire this in before you need it.
A few concrete moves:
- Today. Clone the example in this article. Point it at a low-traffic dev environment. Watch a span land in your logs.
- This week. Switch the exporter to OTLP and point it at whatever APM your team already uses. Build a per-model token-spend dashboard. Show it to your finance lead.
- This month. Add your org-specific custom regex patterns – internal account numbers, regional ID formats, anything your security team would flag in a trace.
- This quarter. Roll the same three properties to every Spring AI service you run. Define alerts on error type. Stop writing tracing code by hand.
Resources
- Code and docs. github.com/vaquarkhan/spring-ai-agentcore-observability
- The original tutorial that this article validated end to end: OBSERVABILITY-TUTORIAL.md
- Validation receipts. Every assertion in this article is reproduced in VALIDATION-REPORT.md alongside the source repo – environment, commands, exact span output, and the test that proved each claim.
- OpenTelemetry GenAI semantic conventions. opentelemetry.io/docs/specs/semconv/gen-ai/
- Spring AI reference. docs.spring.io/spring-ai/reference/
- Amazon Bedrock Converse API. docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference.html
Closing thought
Three years ago, “observability for AI” mostly meant “we logged the prompt.” That era is over. The standards are here, the tooling is here, the GenAI semantic conventions are stable enough to bet on. There is no longer a good reason to ship an LLM service to production without proper telemetry.
The fix is two dependencies and three properties. The downside of skipping it is a finance Slack message you can’t answer, a customer complaint you can’t trace, and a security review you can’t pass. Pick the cheaper option.
Now go check what your Bedrock bill is doing.
If you’re working on agent observability, GenAI semantic conventions, or Bedrock at scale, I’d love to hear what’s broken in your stack. Drop a comment below.
Code: github.com/vaquarkhan/spring-ai-agentcore-observability