LangSmith
LangSmith is the observability and evaluation layer for LLM applications. When your AI system gives a wrong answer or behaves unexpectedly, its traces show you where to look: which step failed, what the model saw, and what it returned.
The Problem LangSmith Solves
Traditional debugging:
Function fails → check the stack trace → fix the bug
LLM system "fails":
User says "the answer was wrong"
You have no idea why. Was it:
- The wrong chunks retrieved from the vector store?
- The LLM ignored the context?
- The prompt was ambiguous?
- The output parser failed?
- A tool returned bad data?
Without observability: you're guessing.
With LangSmith: you see every step, every input, every output.
Core Concepts Mindmap
mindmap
  root((LangSmith))
    Tracing
      Runs
        Each LLM call tool call chain
        Input output latency tokens cost
      Traces
        Full execution tree
        Parent and child runs
        Nested structure
      Projects
        Group traces by application
        Filter tag search
    Evaluation
      Datasets
        Curated input-output examples
        Ground truth for testing
      Evaluators
        LLM-as-judge
        Exact match
        Custom Python evaluator
      Experiments
        Run dataset through chain
        Compare before and after
        Track metrics over time
    Playground
      Test prompts interactively
      Compare model outputs side by side
      Edit and replay any trace
    Monitoring
      Production metrics dashboard
      Latency p50 p95 p99
      Token usage and cost
      Error rate
      Feedback collection
    Annotation
      Human labelling interface
      Thumbs up down on runs
      Add to datasets from production
Setup (One-Time)
import os
# Set these environment variables
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__your_api_key"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app" # groups traces in UI
# That's it. LangChain + LangGraph automatically send traces.
# No code changes to your chains required.
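# Or use the @traceable decorator to trace your own functions:
from langsmith import traceable

@traceable(name="my_custom_function")
def process_document(doc: str) -> str:
    # This function call will appear in LangSmith traces.
    # `llm` is assumed to be a chat model you created earlier (e.g. ChatOpenAI),
    # so .content extracts the text of the returned message.
    return llm.invoke(f"Summarise: {doc}").content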
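If `LANGCHAIN_TRACING_V2` is unset, tracing is simply off, which is easy to miss. A startup check can catch that early. The following is a minimal sketch using only the standard library; `missing_tracing_vars` is a hypothetical helper (not part of LangSmith), and it checks only the three variables named above.

```python
import os

# The three variables named in the setup above.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT")


def missing_tracing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping, so the result does not depend on the real environment.
example_env = {"LANGCHAIN_TRACING_V2": "true", "LANGCHAIN_PROJECT": "my-rag-app"}
print(missing_tracing_vars(example_env))  # ['LANGCHAIN_API_KEY']
```

Calling `missing_tracing_vars()` with no argument checks the real process environment.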
Traces & Runs
A trace is the full execution record of one user request — the entire tree of calls. A run is a single node in that tree (one LLM call, one tool call, one chain step).
Trace: "User asked: What is RAG?"
│
├── Run: RAGChain (chain)               [200ms total]
│   │
│   ├── Run: Retriever (retriever)      [45ms]
│   │   └── Inputs:  "What is RAG?"
│   │       Outputs: [chunk1, chunk2, chunk3]
│   │
│   ├── Run: ChatOpenAI (llm)           [140ms]
│   │   └── Inputs:  [system_msg, context, question]
│   │       Outputs: "RAG stands for Retrieval-Augmented..."
│   │       Tokens:  prompt=842, completion=156, total=998
│   │
│   └── Run: StrOutputParser (parser)   [<1ms]
│       └── Outputs: "RAG stands for Retrieval-Augmented..."
In the LangSmith UI, you can click into any run and see the exact prompt the model received — formatted with all variables substituted. This is invaluable for debugging prompt issues.
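The parent/child structure of a trace can be modelled as a small tree. The sketch below rebuilds the example trace with a hypothetical `Run` dataclass (not the LangSmith SDK's own run type) and shows two questions a trace answers: how many tokens the request used in total, and which leaf step was slowest. The 0.5 ms parser latency is an assumed stand-in for "<1ms".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    """One node in a trace: a chain, retriever, LLM, or parser step."""
    name: str
    run_type: str
    latency_ms: float
    total_tokens: int = 0
    children: List["Run"] = field(default_factory=list)


def total_tokens(run: Run) -> int:
    """Sum token usage over a run and all of its descendants."""
    return run.total_tokens + sum(total_tokens(child) for child in run.children)


def slowest_leaf(run: Run) -> Run:
    """Return the leaf run with the highest latency (the likely bottleneck)."""
    if not run.children:
        return run
    return max((slowest_leaf(child) for child in run.children),
               key=lambda r: r.latency_ms)


# The example trace from above, rebuilt as a tree.
trace = Run("RAGChain", "chain", 200, children=[
    Run("Retriever", "retriever", 45),
    Run("ChatOpenAI", "llm", 140, total_tokens=998),
    Run("StrOutputParser", "parser", 0.5),  # assumed value for "<1ms"
])

print(total_tokens(trace))       # 998
print(slowest_leaf(trace).name)  # ChatOpenAI
```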
Datasets & Evaluation
Datasets are ground-truth input/output pairs. You run your chain against them and measure how well it performs — before and after changes.
Creating a Dataset
from langsmith import Client
client = Client()
# Create dataset
dataset = client.create_dataset(
    dataset_name="RAG Q&A Eval",
    description="Test cases for our documentation RAG system",
)

# Add examples (input + expected output)
client.create_examples(
    inputs=[
        {"question": "What is RAG?"},
        {"question": "How does chunking work?"},
        {"question": "What is cosine similarity?"},
    ],
    outputs=[
        {"answer": "RAG stands for Retrieval-Augmented Generation..."},
        {"answer": "Chunking splits documents into smaller pieces..."},
        {"answer": "Cosine similarity measures the angle between two vectors..."},
    ],
    dataset_id=dataset.id,
)
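Because `inputs` and `outputs` are parallel lists, each answer is matched to its question purely by position. One way to keep them aligned is to store test cases as pairs and split them just before upload. This is a minimal sketch; `split_examples` is a hypothetical helper, and the `question`/`answer` keys simply mirror the example above.

```python
def split_examples(pairs):
    """Turn (question, answer) pairs into the parallel lists used above."""
    inputs, outputs = [], []
    for question, answer in pairs:
        if not question.strip() or not answer.strip():
            raise ValueError(f"Empty question or answer in pair: {(question, answer)!r}")
        inputs.append({"question": question})
        outputs.append({"answer": answer})
    return inputs, outputs


pairs = [
    ("What is RAG?", "RAG stands for Retrieval-Augmented Generation..."),
    ("How does chunking work?", "Chunking splits documents into smaller pieces..."),
]
inputs, outputs = split_examples(pairs)
print(inputs[1], outputs[1])
```

The resulting `inputs` and `outputs` can then be passed to `client.create_examples(...)` as in the block above.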
Running an Evaluation Experiment
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Define your chain to evaluate
def rag_chain(inputs: dict) -> dict:
    answer = my_rag_chain.invoke(inputs["question"])
    return {"answer": answer}

# Choose evaluators
evaluators = [
    LangChainStringEvaluator("qa"),      # LLM judges correctness
    LangChainStringEvaluator("cot_qa"),  # CoT reasoning before judging
    LangChainStringEvaluator(            # Custom criteria, given as {name: rubric}
        "criteria",
        config={"criteria": {"concise_factual": "Is the answer concise and factual?"}},
    ),
]

# Run evaluation
results = evaluate(
    rag_chain,
    data="RAG Q&A Eval",                      # dataset name
    evaluators=evaluators,
    experiment_prefix="v2-hybrid-retrieval",  # label for this run
)
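Conceptually, an experiment runs each dataset example through the chain, applies each evaluator to the result, and aggregates the scores. The loop below is a stdlib-only sketch of that idea, not LangSmith's implementation; `run_experiment`, `exact_match`, and `toy_chain` are hypothetical names, and the two examples are made up for illustration.

```python
def run_experiment(target, examples, evaluators):
    """Run every example through target and average each evaluator's score."""
    scores_by_key = {}
    for example in examples:
        outputs = target(example["inputs"])
        for evaluator in evaluators:
            result = evaluator(outputs, example["outputs"])
            scores_by_key.setdefault(result["key"], []).append(result["score"])
    return {key: sum(scores) / len(scores) for key, scores in scores_by_key.items()}


def exact_match(outputs, reference):
    """Score 1.0 when the answer equals the reference exactly, else 0.0."""
    return {"key": "exact_match", "score": float(outputs["answer"] == reference["answer"])}


def toy_chain(inputs):
    """Stand-in for a real chain: it knows the answer to exactly one question."""
    canned = {"What is RAG?": "Retrieval-Augmented Generation"}
    return {"answer": canned.get(inputs["question"], "I don't know")}


examples = [
    {"inputs": {"question": "What is RAG?"},
     "outputs": {"answer": "Retrieval-Augmented Generation"}},
    {"inputs": {"question": "What is BM25?"},
     "outputs": {"answer": "A ranking function"}},
]

print(run_experiment(toy_chain, examples, [exact_match]))  # {'exact_match': 0.5}
```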
Evaluator Types
| Evaluator | How it works | Good for |
|---|---|---|
| Exact match | String equality | Structured output, classification |
| QA evaluator | LLM grades answer vs reference | Open-ended Q&A |
| CoT QA | LLM reasons step-by-step before grading | Complex answers |
| Criteria | LLM judges against custom rubric | Tone, conciseness, factuality |
| Custom Python | Your own scoring function | Domain-specific metrics |
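As an illustration of the "Custom Python" row, here is a sketch of a scoring function that measures token-overlap F1 between an answer and its reference. `token_f1` is a hypothetical name, and the `key`/`score` return shape mirrors the feedback fields used elsewhere in this document. How such a function is registered with `evaluate` depends on your SDK version, so treat the wiring as an assumption to check against the LangSmith docs.

```python
import re
from collections import Counter


def _tokens(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(answer, reference):
    """Score an answer by token-overlap F1 against the reference (0.0 to 1.0)."""
    pred, ref = Counter(_tokens(answer)), Counter(_tokens(reference))
    overlap = sum((pred & ref).values())  # tokens shared by both, with multiplicity
    if overlap == 0:
        return {"key": "token_f1", "score": 0.0}
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return {"key": "token_f1", "score": 2 * precision * recall / (precision + recall)}


result = token_f1("RAG means retrieval-augmented generation",
                  "RAG stands for Retrieval-Augmented Generation")
print(round(result["score"], 3))  # 0.727
```

Unlike exact match, this gives partial credit for paraphrased answers, but it cannot tell whether the overlapping words are used correctly.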
Comparing Experiments
Experiment A: v1-basic-retrieval → QA score: 0.72, latency: 320ms
Experiment B: v2-hybrid-retrieval → QA score: 0.89, latency: 380ms ← best quality/latency trade-off
Experiment C: v3-reranking → QA score: 0.91, latency: 520ms ← highest quality, slowest
LangSmith shows these side by side in a table, so you can see which examples improved, which regressed, and where the chain is still failing.
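Which experiment to ship depends on how much latency you can afford. The sketch below picks the highest-scoring experiment within a latency budget, using the three results listed above. `best_within_budget` is a hypothetical helper, and the budgets are assumed values for illustration.

```python
experiments = [
    {"name": "v1-basic-retrieval", "qa_score": 0.72, "latency_ms": 320},
    {"name": "v2-hybrid-retrieval", "qa_score": 0.89, "latency_ms": 380},
    {"name": "v3-reranking", "qa_score": 0.91, "latency_ms": 520},
]


def best_within_budget(rows, max_latency_ms):
    """Return the highest-scoring experiment whose latency fits the budget, or None."""
    eligible = [row for row in rows if row["latency_ms"] <= max_latency_ms]
    return max(eligible, key=lambda row: row["qa_score"], default=None)


print(best_within_budget(experiments, 400)["name"])  # v2-hybrid-retrieval
print(best_within_budget(experiments, 600)["name"])  # v3-reranking
```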
Production Monitoring
Metrics available in LangSmith dashboard:
Latency:
p50 (median), p95, p99 response times
Breakdown by step (retrieval vs LLM vs parsing)
Cost:
Token usage per run, per project
Cost estimates by model
Quality:
User feedback (thumbs up/down)
Evaluator scores from online evaluation
Errors:
Error rate over time
Stack traces for failed runs
Which inputs tend to cause failures
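To make the latency percentiles concrete, the sketch below computes p50, p95, and p99 with the nearest-rank method on a hypothetical list of request latencies. The dashboard may use a different percentile definition, so this illustrates the idea rather than reproducing LangSmith's exact numbers.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-request latencies in milliseconds, with one slow outlier.
latencies_ms = [180, 200, 210, 220, 230, 250, 260, 300, 420, 1500]

for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
# p50: 230 ms
# p95: 1500 ms
# p99: 1500 ms
```

The median stays at 230 ms while the tail percentiles expose the 1500 ms outlier, which is why p95 and p99 are tracked alongside p50.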
Collecting User Feedback
from langsmith import Client
client = Client()
# Log feedback after the user rates the answer.
# `run` is the traced run being rated; keep its ID when you serve the answer.
client.create_feedback(
    run_id=run.id,       # ID from the trace
    key="user_rating",
    score=1,             # 1 = positive, 0 = negative
    comment="This answer was very helpful!",
)
# View in LangSmith → filter runs by feedback score
# Low-rated runs → add to dataset → improve the system
The Flywheel: Production → Dataset → Improvement
┌─────────────────────────────────────────────────────────────┐
│ The Improvement Loop │
│ │
│ Production traces ──▶ Review bad answers ──▶ Add to dataset│
│ ▲ │ │
│ │ ▼ │
│ Deploy improved chain ◀── Evaluate experiment ◀── Fix chain│
└─────────────────────────────────────────────────────────────┘
1. Users interact with your app → traces captured in LangSmith
2. Review traces with low ratings or wrong answers
3. Add those cases to your evaluation dataset
4. Improve your chain (better prompt, better retrieval, etc.)
5. Run the dataset through the new chain → compare experiment scores
6. Deploy the improved version
7. Repeat
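The "review low-rated traces, then add them to the dataset" part of the loop can be sketched as a filter over exported runs. Everything here is illustrative: `low_rated_examples`, the record layout, and the 0.5 threshold are assumptions, not LangSmith APIs. The answer slot is left empty on purpose, since a bad production answer should not become ground truth without review.

```python
def low_rated_examples(records, threshold=0.5):
    """Select runs whose average feedback score is below the threshold.

    Returns dataset-style rows: the original input plus an empty answer slot
    for a human reviewer to fill in.
    """
    rows = []
    for record in records:
        scores = record["scores"]
        if scores and sum(scores) / len(scores) < threshold:
            rows.append({
                "inputs": {"question": record["question"]},
                "outputs": {"answer": None},  # to be written by a reviewer
                "source_run_id": record["run_id"],
            })
    return rows


# Hypothetical export of production runs with user_rating scores (1 = positive, 0 = negative).
records = [
    {"run_id": "run-a", "question": "What is RAG?", "scores": [1, 1]},
    {"run_id": "run-b", "question": "How does chunking work?", "scores": [0, 0, 1]},
    {"run_id": "run-c", "question": "What is cosine similarity?", "scores": []},
]

for row in low_rated_examples(records):
    print(row["source_run_id"], row["inputs"])
# run-b {'question': 'How does chunking work?'}
```

Runs without any feedback (such as `run-c`) are skipped rather than treated as bad, so unrated traffic does not flood the dataset.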
LangSmith vs Alternatives
┌─────────────────┬───────────────────┬────────────────────────────────────┐
│ Tool │ Type │ Notes │
├─────────────────┼───────────────────┼────────────────────────────────────┤
│ LangSmith │ Full platform │ Native LangChain integration │
│ Langfuse │ Open-source │ Self-hostable, LangChain support │
│ Weights&Biases │ ML experiment mgmt│ Great if already using W&B │
│ Helicone │ LLM proxy │ Lightweight, any LLM client │
│ Arize Phoenix │ Open-source │ Strong on evaluation, local-first │
└─────────────────┴───────────────────┴────────────────────────────────────┘
Quick Reference
Setup:
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=ls__...
LANGCHAIN_PROJECT=my-project
→ All LangChain/LangGraph calls auto-traced
Key URLs in LangSmith:
/projects → list of projects (each is an app/env)
/projects/<id> → traces for that project
/datasets → your evaluation datasets
/experiments → evaluation run results
Evaluation workflow:
1. Create dataset (curated Q&A pairs)
2. evaluate(chain, data="dataset", evaluators=[...])
3. Compare experiments side-by-side
4. Fix what's failing, repeat
Production loop:
Bad trace → add to dataset → improve chain → run experiment → deploy