OpenTelemetry Reaches CNCF Graduation and Brings Standardized LLM Observability with GenAI Conventions

The CNCF announced OpenTelemetry's graduation on May 21, 2026, formalizing its status as the default observability standard for production infrastructure. In the past twelve months, the OTel JavaScript API package crossed 1.36 billion downloads; Python surpassed 1.3 billion. The question isn't whether to adopt OpenTelemetry — it's whether your LLM and agent workloads are instrumented yet.

The graduation coincides with meaningful progress on GenAI Semantic Conventions, a standardized attribute schema and metric definitions for AI workloads. Here's what they cover and how to put them to work.

Why LLM Workloads Need Different Conventions

Standard HTTP and database instrumentation captures request duration, status codes, and error rates. That's insufficient for LLM pipelines, where the relevant signals are:

Input, output, and reasoning token counts (for cost attribution)
Model identity and provider (for performance comparison across model versions)
Time to first token (TTFT) versus end-to-end latency (different failure modes)
Subtask delegation between agents (to trace the full work tree)
Tool call invocations and results (to debug agent decisions)

GenAI Semantic Conventions defines a consistent attribute schema for each of these dimensions, so tools across vendors speak the same language and you don't re-instrument when you switch backends.

GenAI Semantic Conventions: What's Covered

Four primary areas are standardized as of early 2026.

LLM Client Spans

Covers individual calls to an LLM provider. Key attributes:

span.set_attribute("gen_ai.system", "anthropic")
span.set_attribute("gen_ai.request.model", "claude-opus-4-8")
span.set_attribute("gen_ai.usage.input_tokens", 1234)
span.set_attribute("gen_ai.usage.output_tokens", 512)
span.set_attribute("gen_ai.request.max_tokens", 2048)
span.set_attribute("gen_ai.response.finish_reasons", ["end_turn"])

Agent Spans

Covers orchestration and delegation in multi-agent systems:

span.set_attribute("gen_ai.agent.id", "review-agent-001")
span.set_attribute("gen_ai.agent.name", "CodeReviewAgent")
span.set_attribute("gen_ai.operation.name", "invoke_agent")

Events

Conversation turns and tool invocations are captured as span events with structured payloads:

gen_ai.user.message
gen_ai.assistant.message
gen_ai.tool.message

Metrics

Standard metric definitions that observability backends can query consistently:

gen_ai.client.token.usage: input and output tokens used per operation (Histogram)
gen_ai.client.operation.duration: total call duration (Histogram)
gen_ai.server.time_to_first_token: TTFT per call, measured by the model server (Histogram)
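To make the token usage metric concrete, here is a minimal sketch of the measurements one LLM call might contribute: one value per token type, distinguished by the gen_ai.token.type attribute. The helper name token_usage_measurements is hypothetical, and the sketch only builds plain tuples; handing them to a real metrics instrument is left out.

```python
def token_usage_measurements(model, input_tokens, output_tokens):
    """Return (value, attributes) pairs for gen_ai.client.token.usage.

    Hypothetical helper: one measurement per token type, so a backend
    can split usage into input and output for the same model.
    """
    base = {"gen_ai.request.model": model}
    return [
        (input_tokens, {**base, "gen_ai.token.type": "input"}),
        (output_tokens, {**base, "gen_ai.token.type": "output"}),
    ]
```

Keeping input and output as separate measurements matters for cost attribution, since providers typically price the two differently.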

A Minimal Instrumentation Pattern

import anthropic
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export spans in batches to an OTLP collector over gRPC.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
# Register the provider globally so other instrumentation shares it.
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-service")

# Assumes the Anthropic Python SDK, which reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

def call_llm(prompt: str, model: str = "claude-opus-4-8") -> str:
    with tracer.start_as_current_span("gen_ai.request") as span:
        # Request attributes are known before the call is made.
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)

        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        # Usage and finish reason are only available on the response.
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        span.set_attribute("gen_ai.response.finish_reasons", [response.stop_reason])

        return response.content[0].text

Trace context propagates through your service graph automatically. A frontend trace that triggers an LLM call produces a single trace spanning both, without manual correlation.

Traces vs. Logs: The Resolution Time Gap

The CNCF graduation announcement includes data showing that teams adopting distributed tracing reduce incident resolution time from over four hours to approximately fifteen minutes. The mechanism is straightforward: logs require manually correlating timestamps across services; traces record causality explicitly.

For AI workloads this gap is larger. An agent calling three sub-agents, each invoking an LLM, generates layered causality that's unworkable to debug from logs alone. A single distributed trace shows the full tree — which subtask took longest, which LLM call returned an unexpected stop reason, where the latency spike originated.
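The "which subtask took longest" question can be sketched in a few lines once spans carry parent links. The example below walks a flat list of span records as a tree and follows the slowest child at each level. The record fields (span_id, parent_id, name, duration_ms) are assumptions made for illustration, not an exporter format.

```python
from collections import defaultdict

def build_children(spans):
    """Index flat span records by parent id so the trace can be walked as a tree."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent_id"]].append(s)
    return children

def slowest_path(spans):
    """Follow the longest-running child at each level, starting from the root span."""
    children = build_children(spans)
    path = []
    level = children[None]  # root spans have no parent
    while level:
        slowest = max(level, key=lambda s: s["duration_ms"])
        path.append(slowest["name"])
        level = children[slowest["span_id"]]
    return path

# Hypothetical trace: a planner agent delegating to two sub-agents.
spans = [
    {"span_id": "a", "parent_id": None, "name": "planner", "duration_ms": 900},
    {"span_id": "b", "parent_id": "a", "name": "research", "duration_ms": 600},
    {"span_id": "c", "parent_id": "a", "name": "review", "duration_ms": 200},
    {"span_id": "d", "parent_id": "b", "name": "research llm call", "duration_ms": 550},
    {"span_id": "e", "parent_id": "c", "name": "review llm call", "duration_ms": 180},
]
print(slowest_path(spans))
```

Doing the same from logs would mean reconstructing the parent links by hand from timestamps, which is exactly the step tracing removes.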

TTFT is a specific example: you can see from output token counts that a request completed, but only TTFT tracking reveals whether the model is slow to start generating, which indicates a different infrastructure problem than slow total duration.
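A minimal sketch of separating the two measurements on the client side, assuming the response arrives as an iterable of text chunks. The measure_stream helper and its clock parameter are hypothetical; a client-side TTFT also includes network latency, so it will not match a server-reported value exactly.

```python
import time

def measure_stream(chunks, clock=time.perf_counter):
    """Consume a streaming response; return (text, ttft_seconds, total_seconds).

    ttft_seconds is None if the stream produced no chunks at all.
    """
    start = clock()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = clock()  # first chunk arrived
        parts.append(chunk)
    end = clock()
    ttft = None if first_token_at is None else first_token_at - start
    return "".join(parts), ttft, end - start
```

A high TTFT with a normal gap between TTFT and total duration points at queuing or cold starts before generation begins, while a normal TTFT with a long total duration points at slow generation or a long output.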

Vendor Support

The major observability backends already support GenAI Semantic Conventions:

Datadog: LLM Observability with token cost dashboards and per-model latency histograms
Honeycomb: Query UI built around AI span attributes, with TTFT as a first-class column
New Relic: AI Monitoring with trace-level LLM span inspection

Because conventions standardize the attribute names, switching backends only requires changing the exporter — the instrumentation code stays unchanged.
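One way to keep the backend-specific parts out of the instrumentation code is to read them from the standard OTLP environment variables. The sketch below parses OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS by hand purely for illustration; the otlp_settings helper is hypothetical, and the default endpoint is the collector address used earlier in this article.

```python
import os

def otlp_settings(env=os.environ):
    """Read the OTLP endpoint and headers from standard OpenTelemetry env vars."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector:4317")
    headers = {}
    # Headers are a comma-separated list of key=value pairs.
    for pair in env.get("OTEL_EXPORTER_OTLP_HEADERS", "").split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            headers[key.strip()] = value.strip()
    return endpoint, headers
```

With this split, moving to a different backend is a deployment change (a new endpoint and API key header) rather than a code change.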

Where to Start

Priority 1: Instrument your LLM client calls with gen_ai.system, gen_ai.request.model, and both token count attributes (gen_ai.usage.input_tokens and gen_ai.usage.output_tokens). These four attributes power cost attribution dashboards immediately and require minimal code change.
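As a rough sketch of what a cost attribution query does with those attributes, the example below sums an estimated cost per model from span attribute dictionaries. The cost_by_model helper, the model name, and the price table are all hypothetical placeholders; substitute your provider's actual rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; not real provider rates.
PRICE_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}

def cost_by_model(spans, prices=PRICE_PER_MTOK):
    """Sum estimated cost per model from a list of span attribute dicts."""
    totals = defaultdict(float)
    for attrs in spans:
        model = attrs.get("gen_ai.request.model")
        rate = prices.get(model)
        if rate is None:
            continue  # unknown model: skip rather than guess a price
        totals[model] += attrs.get("gen_ai.usage.input_tokens", 0) / 1e6 * rate["input"]
        totals[model] += attrs.get("gen_ai.usage.output_tokens", 0) / 1e6 * rate["output"]
    return dict(totals)
```

Without the model attribute the token counts cannot be priced, and without the token counts the model attribute says nothing about spend, which is why the attributes are worth adding together.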

Priority 2: Add TTFT tracking per model and environment. TTFT degrades under load in ways that output token counts don't reveal — it's the first signal that your LLM service is queuing requests.

Priority 3: Propagate trace context into agent subtasks. Multi-step workflows should produce a single trace tree rather than disconnected spans. This requires passing the trace context explicitly if your agent framework doesn't do it automatically.
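To show what "passing the trace context explicitly" amounts to, here is a stdlib-only sketch of the W3C traceparent header that carries it: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte. The helper names are hypothetical, and in practice the propagation helpers in the OpenTelemetry API would handle this rather than hand-rolled parsing.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def make_traceparent(trace_id, span_id, sampled=True):
    """Format a W3C traceparent header value (version 00)."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def child_context(traceparent):
    """Parse the parent's header; return (trace_id, parent_span_id, new_span_id)."""
    match = _TRACEPARENT.match(traceparent)
    if match is None:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    trace_id, parent_span_id, _flags = match.groups()
    # The subtask keeps the trace ID and gets a fresh 8-byte span ID.
    return trace_id, parent_span_id, secrets.token_hex(8)
```

The parent agent includes the header value in whatever message it uses to delegate work; the subtask reuses the trace ID and records the parent span ID, which is what joins the spans into one tree.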

Summary

OpenTelemetry's CNCF graduation marks the end of the observability standardization debate. GenAI Semantic Conventions brings the same discipline to LLM and agent workloads — consistent attribute names, standardized metrics, and a backend-agnostic instrumentation layer. If you're running AI services in production without LLM span instrumentation, you're operating without the signals needed to diagnose performance problems or attribute costs accurately. The conventions are stable enough to build on now, and the instrumentation overhead is low.