After shipping the first production LLM feature, the team had no visibility into what was happening inside the model. Failures were invisible until users complained.
An observability dashboard for LLM apps using Arize Phoenix — traces, evaluations, and drift detection — that became the team's default debugging surface.
Instrumented all LLM calls with Arize Phoenix's OpenTelemetry-compatible tracing. Built a unified dashboard showing span-level traces, per-query LLM-as-judge evaluations, retrieval chunk quality scores, and cost/latency trends. Added drift detection alerts that fire when eval scores drop below a threshold.
All LLM calls are wrapped in OpenTelemetry spans by a middleware layer, so no application code changes are needed.
Arize Phoenix ingests spans in real time. Each span carries input, output, model, latency, cost, and retrieval chunks.
Python eval pipeline runs correctness, relevance, and toxicity checks on sampled outputs using a judge prompt.
Grafana watches rolling eval score averages. A drop below the threshold triggers a PagerDuty alert.
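The alerting step above can be illustrated with a small sketch. This is not the team's Grafana configuration; it is a plain-Python stand-in for the same idea, a rolling mean of eval scores compared against a threshold. The class name RollingScoreMonitor, the window size, and the on_alert callback are all hypothetical, and in the real setup the callback's role would be played by the Grafana rule that pages PagerDuty.

```python
from collections import deque
from typing import Callable, Deque


class RollingScoreMonitor:
    """Tracks a rolling mean of eval scores and calls on_alert when it
    falls below a threshold. Hypothetical helper for illustration only."""

    def __init__(self, window: int, threshold: float,
                 on_alert: Callable[[float], None]) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.scores: Deque[float] = deque(maxlen=window)
        self.threshold = threshold
        self.on_alert = on_alert
        self.alerting = False  # suppress repeat alerts while still degraded

    def record(self, score: float) -> float:
        """Add one eval score and return the current rolling mean."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        # Only judge drift once the window is full, so a single bad
        # early sample does not page anyone.
        window_full = len(self.scores) == self.scores.maxlen
        if window_full and mean < self.threshold:
            if not self.alerting:
                self.alerting = True
                self.on_alert(mean)
        else:
            self.alerting = False
        return mean
```

The alerting flag means a sustained drop produces one alert rather than one per sample; the monitor re-arms once the rolling mean recovers above the threshold.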
LLM-as-judge evaluation at scale requires careful prompt design — the judge prompt is just as important as the feature prompt.
OpenTelemetry-compatible tracing means you can swap the backend without re-instrumenting app code.
Retrieval quality is the most common root cause of LLM output failures — eval dashboards that ignore the retrieval layer miss 60% of bugs.
Showing developers traces (not just scores) is what drives actual debugging; aggregate metrics alone don't change behaviour.
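The point about swapping backends without re-instrumenting can be sketched in a few lines. The example below deliberately avoids the real OpenTelemetry SDK and uses only the standard library; SpanExporter, InMemoryExporter, and traced_llm are hypothetical names that mimic the pattern, where call sites are wrapped once and the destination for spans is a pluggable object. The span fields shown (model, input, output, latency) are a subset of those listed in the pipeline description.

```python
import time
from typing import Any, Callable, Dict, List, Protocol


class SpanExporter(Protocol):
    """Anything that can receive a finished span (hypothetical interface)."""

    def export(self, span: Dict[str, Any]) -> None: ...


class InMemoryExporter:
    """Stand-in backend that keeps spans in a list, e.g. for tests."""

    def __init__(self) -> None:
        self.spans: List[Dict[str, Any]] = []

    def export(self, span: Dict[str, Any]) -> None:
        self.spans.append(span)


def traced_llm(call: Callable[[str], str], model: str,
               exporter: SpanExporter) -> Callable[[str], str]:
    """Wrap an LLM call so every invocation emits one span to the exporter."""

    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        span: Dict[str, Any] = {"model": model, "input": prompt}
        try:
            output = call(prompt)
            span["output"] = output
            span["status"] = "ok"
            return output
        except Exception as exc:
            span["status"] = "error"
            span["error"] = repr(exc)
            raise
        finally:
            # Latency is recorded and the span exported on success and failure.
            span["latency_s"] = time.perf_counter() - start
            exporter.export(span)

    return wrapper
```

Because the wrapper only depends on the export method, pointing spans at a different backend means passing a different exporter object; the wrapped call sites stay untouched.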
Core observability platform — traces, evals, dashboards
Vendor-agnostic instrumentation for LLM spans
Application framework being monitored
Eval pipeline and custom metric computation
Alerting on eval score drift