Komal Vardhan.
Build like an engineer. Teach like a friend.

© 2026 Komal Vardhan Lolugu

Designed & built by Komal. Made in India.
2023 · Observability · Internal tool · Arize Phoenix

LLM Monitoring Dashboard

An observability dashboard for LLM apps using Arize Phoenix — traces, evaluations, and drift detection — that became the team's default debugging surface.

3 · Eval dimensions: correctness, relevance, toxicity
OTel · Vendor-agnostic: swap backend without re-instrumentation
Real-time · Span-level traces visible within seconds of request
#1 · Team's default debugging surface after launch
§ 01

The Problem

After shipping the first production LLM feature, the team had no visibility into what was happening inside the model. Failures were invisible until users complained.

§ 02

The Solution

Instrumented all LLM calls with Arize Phoenix's OpenTelemetry-compatible tracing. Built a unified dashboard showing span-level traces, per-query LLM-as-judge evaluations, retrieval chunk quality scores, and cost/latency trends. Added drift detection alerts when eval scores dropped below threshold.

§ 02b

How it works

01
Instrumentation

All LLM calls are wrapped in OpenTelemetry spans by a middleware layer, so no application code changes are needed.
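A minimal sketch of the middleware idea, assuming the LLM client is a plain callable. It uses only the standard library as a stand-in for an OpenTelemetry tracer; `traced`, `SPANS`, and `fake_llm` are hypothetical names for illustration, not part of Phoenix or OpenTelemetry.

```python
import functools
import time

# In-memory list standing in for a span exporter (assumption: a real setup
# would hand spans to an OpenTelemetry exporter instead).
SPANS = []

def traced(name):
    """Wrap a callable so every call records a span-like dict."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            span = {"name": name, "input": {"args": args, "kwargs": kwargs}}
            try:
                result = fn(*args, **kwargs)
                span["output"] = result
                span["status"] = "ok"
                return result
            except Exception as exc:
                span["status"] = "error"
                span["error"] = repr(exc)
                raise
            finally:
                # Runs on both success and failure, so every call is recorded.
                span["latency_s"] = time.perf_counter() - start
                SPANS.append(span)
        return wrapper
    return decorator

def fake_llm(prompt):
    """Hypothetical LLM client used only for illustration."""
    return f"echo: {prompt}"

# The middleware layer wraps the client once; call sites stay unchanged.
llm = traced("llm.call")(fake_llm)
answer = llm("What is drift?")
```

Wrapping the client in one place is what keeps the call sites free of tracing code: the function signature and return value are unchanged, and failed calls are still recorded before the exception propagates.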

02
Trace collection

Arize Phoenix ingests spans in real-time. Each span carries input, output, model, latency, cost, and retrieval chunks.
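The span payload described above could be modelled roughly as follows. This is an illustrative sketch: the `LLMSpan` class, the `estimate_cost` helper, the per-1,000-token pricing units, and all values are assumptions, not the schema Phoenix actually uses.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class LLMSpan:
    """Hypothetical span payload with the fields listed above."""
    input: str
    output: str
    model: str
    latency_ms: float
    cost_usd: float
    retrieval_chunks: list = field(default_factory=list)

def estimate_cost(prompt_tokens, completion_tokens, price_in, price_out):
    """Cost from token counts; prices are per 1,000 tokens (assumed units)."""
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1000

span = LLMSpan(
    input="What is drift?",
    output="A shift in score distribution over time.",
    model="example-model",  # placeholder, not a real model name
    latency_ms=412.0,
    cost_usd=estimate_cost(1000, 500, price_in=0.5, price_out=1.5),
    retrieval_chunks=["chunk-a", "chunk-b"],
)

# Serialise to JSON, as one might before sending the span to a collector.
payload = json.dumps(asdict(span))
```

Keeping the retrieval chunks on the same record as the input and output is what lets a single trace answer both "what did the model say" and "what was it given".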

03
LLM-as-judge eval

Python eval pipeline runs correctness, relevance, and toxicity checks on sampled outputs using a judge prompt.
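A rough sketch of what such a pipeline might look like, assuming the judge model replies with JSON. The prompt wording, the 1 to 5 scale, and the names `JUDGE_PROMPT`, `sample`, `judge`, and `stub_judge` are all hypothetical; `stub_judge` stands in for a real model call.

```python
import json
import random

JUDGE_PROMPT = (
    "Rate the answer for the question below. Reply with JSON containing "
    "integer scores from 1 to 5 for correctness, relevance, and toxicity.\n"
    "Question: {question}\nAnswer: {answer}"
)
DIMENSIONS = ("correctness", "relevance", "toxicity")

def sample(records, rate, seed=0):
    """Keep roughly `rate` of the records, reproducibly."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]

def judge(record, call_judge):
    """Format the prompt, call the judge model, and validate its JSON reply."""
    reply = call_judge(JUDGE_PROMPT.format(**record))
    scores = json.loads(reply)
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"judge reply missing: {missing}")
    return {d: int(scores[d]) for d in DIMENSIONS}

def stub_judge(prompt):
    """Stand-in for a real judge model call; always returns fixed scores."""
    return json.dumps({"correctness": 4, "relevance": 5, "toxicity": 1})

records = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
results = [judge(r, stub_judge) for r in sample(records, rate=1.0)]
```

Passing the judge call in as a parameter keeps the pipeline testable without a live model, and validating the reply catches the case where the judge ignores the requested format.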

04
Drift alerting

Grafana watches rolling averages of eval scores. A drop below the threshold triggers a PagerDuty alert.
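The alert rule itself amounts to a rolling mean compared against a threshold. The sketch below shows that logic in plain Python; it is not Grafana configuration, and the `DriftMonitor` class, window size, and threshold value are assumptions chosen for illustration.

```python
from collections import deque

class DriftMonitor:
    """Track a rolling mean of eval scores and flag drops below a threshold."""

    def __init__(self, window, threshold):
        self.scores = deque(maxlen=window)  # oldest score is evicted first
        self.threshold = threshold

    def add(self, score):
        """Record a score; return True when the full window's mean is too low."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge drift
        return sum(self.scores) / len(self.scores) < self.threshold

monitor = DriftMonitor(window=3, threshold=0.7)
alerts = [monitor.add(s) for s in [0.9, 0.8, 0.9, 0.5, 0.4]]
# alerts → [False, False, False, False, True]
```

Averaging over a window means a single bad score (the 0.5 here) does not page anyone; the alert fires only once the low scores persist long enough to pull the mean under the threshold.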

§ 03

What I Learnt

  • 01

    LLM-as-judge evaluation at scale requires careful prompt design — the judge prompt is just as important as the feature prompt.

  • 02

    OpenTelemetry-compatible tracing means you can swap the backend without re-instrumenting app code.

  • 03

    Retrieval quality is the most common root cause of LLM output failures — eval dashboards that ignore the retrieval layer miss 60% of bugs.

  • 04

    Showing developers traces (not just scores) is what drives actual debugging; aggregate metrics alone don't change behaviour.

§ 04

Technologies Used

Arize Phoenix

Core observability platform — traces, evals, dashboards

OpenTelemetry

Vendor-agnostic instrumentation for LLM spans

LangChain

Application framework being monitored

Python

Eval pipeline and custom metric computation

Grafana

Alerting on eval score drift
