Talki Academy · Technical · 22 min read

LangChain & LangGraph: Advanced Production Patterns for AI Engineers

Deep-dive into advanced LangGraph production patterns: persistent state management, streaming with context windows, cost optimization, and deploying graph agents to AWS Lambda via Tailscale. Includes real code and a case study cutting costs by 60%.

By Talki Academy · Updated April 28, 2026

Running a LangGraph agent in a Jupyter notebook is easy. Running one in production at scale — with persistent state, streaming, and cost controls — is where most teams hit a wall. This guide covers the advanced patterns that separate demo agents from production-grade systems.

Why LangGraph for Production Agents?

LangGraph models agent logic as a stateful directed graph where nodes are Python functions and edges are conditional routing rules. Unlike a simple LLM chain, a graph can loop, branch, and remember — exactly what you need for agents that plan, execute tools, observe results, and adapt.

As of April 2026, LangGraph 0.2.x ships with first-class support for async streaming, pluggable checkpointers (SQLite, Postgres, DynamoDB), and a visual debugger, LangGraph Studio, for inspecting state transitions locally.

1. Persistent State Management

The most common production mistake is treating each API call as stateless. In LangGraph, state is a typed dictionary that flows through every node. Persisting it lets you resume interrupted workflows, implement human-in-the-loop approvals, and debug production failures by replaying checkpoints.

Defining a typed state schema

# state.py
from typing import Annotated, TypedDict
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    # add_messages merges lists instead of overwriting (required for chat history)
    messages: Annotated[list, add_messages]
    # Custom fields for your domain
    task_id: str
    tool_calls_remaining: int
    cost_usd: float  # track spend per graph run

DynamoDB checkpointer (production-grade)

SQLite works for local dev. For Lambda, use the DynamoDB checkpointer from langgraph-checkpoint-dynamodb (community package, compatible with LangGraph 0.2.x):

import boto3
from langgraph.checkpoint.dynamodb import DynamoDBSaver
from langgraph.graph import StateGraph, END
from state import AgentState

# One table, two GSIs: thread_id + checkpoint_id
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
checkpointer = DynamoDBSaver(
    table_name="langgraph-checkpoints",
    dynamodb_resource=dynamodb,
)

builder = StateGraph(AgentState)
builder.add_node("planner", planner_node)     # node functions defined elsewhere
builder.add_node("executor", executor_node)
builder.add_node("critic", critic_node)
builder.set_entry_point("planner")
builder.add_edge("planner", "executor")       # planner hands its plan to the executor
builder.add_conditional_edges(
    "executor",
    route_after_execution,        # returns "critic" or END
    {"critic": "critic", END: END},
)
builder.add_edge("critic", "planner")  # feedback loop

graph = builder.compile(checkpointer=checkpointer)

# Resume a conversation: just pass the same thread_id
config = {"configurable": {"thread_id": "user-session-abc123"}}
result = await graph.ainvoke({"messages": [user_message]}, config=config)
Pro tip: Set a TTL attribute on your DynamoDB checkpoint table (e.g., 7 days) to avoid runaway storage costs. The checkpointer writes a new checkpoint for every graph step on every thread, so storage grows quickly without expiry.
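
Checkpointing is also what makes the human-in-the-loop approvals mentioned above possible. A minimal sketch, reusing the builder, checkpointer, and thread config from this section: interrupt the run before the executor node, let a reviewer inspect (and optionally patch) the state, then resume by passing None as the input.

# Compile with an interruption point: the graph pauses (and checkpoints) before "executor"
graph = builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["executor"],
)

config = {"configurable": {"thread_id": "user-session-abc123"}}
await graph.ainvoke({"messages": [user_message]}, config=config)

# The run is now paused at the checkpoint. Inspect what the planner decided:
snapshot = await graph.aget_state(config)
print(snapshot.next)                      # -> ('executor',)
print(snapshot.values["messages"][-1])    # the plan awaiting approval

# Optionally patch the state after human review, then resume from the checkpoint
await graph.aupdate_state(config, {"tool_calls_remaining": 3})
result = await graph.ainvoke(None, config=config)   # None = resume, don't add a new message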

2. Streaming with Context Window Management

Users abandon interfaces that feel frozen. Streaming token-by-token is no longer optional for conversational agents. LangGraph provides two streaming APIs: astream() for state snapshots and astream_events() for granular token-level events.
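
The first API is handy for progress logging without token-level plumbing. A short sketch, reusing the compiled graph and thread config from section 1: with stream_mode="updates", each yielded item is the state delta returned by the node that just finished.

config = {"configurable": {"thread_id": "user-session-abc123"}}

async for update in graph.astream(
    {"messages": [user_message]},
    config=config,
    stream_mode="updates",   # one item per completed node: {node_name: state_delta}
):
    for node_name, delta in update.items():
        print(f"[{node_name}] updated keys: {list(delta or {})}")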

Streaming a graph response via FastAPI

# api.py
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import json

from your_graph import graph  # the compiled graph from section 1

app = FastAPI()

@app.post("/chat/{thread_id}")
async def chat(thread_id: str, body: dict):
    config = {"configurable": {"thread_id": thread_id}}
    message = {"role": "user", "content": body["content"]}

    async def token_stream():
        async for event in graph.astream_events(
            {"messages": [message]},
            config=config,
            version="v2",
        ):
            kind = event["event"]
            # Stream tokens from any LLM node
            if kind == "on_chat_model_stream":
                chunk = event["data"]["chunk"]
                if chunk.content:
                    # Server-Sent Events format
                    yield f"data: {json.dumps({'token': chunk.content})}\n\n"
            # Signal node transitions for UI progress indicators
            elif kind == "on_chain_start":
                node = event.get("name", "")
                if node in ("planner", "executor", "critic"):
                    yield f"data: {json.dumps({'node': node})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(token_stream(), media_type="text/event-stream")

Context window budgeting

Long-running agents accumulate message history until they exceed the model's context window — and costs spike. Implement a sliding window node that trims history before any LLM call:

from langchain_core.messages import RemoveMessage, trim_messages
from langchain_anthropic import ChatAnthropic

model = ChatAnthropic(model="claude-sonnet-4-5", max_tokens=4096)

def trim_context_node(state: AgentState) -> dict:
    """Keep the system message + last N tokens of history."""
    trimmed = trim_messages(
        state["messages"],
        strategy="last",
        token_counter=model,
        max_tokens=60_000,   # prompt budget, leaving room for the 4k-token completion
        start_on="human",    # always start with a human turn
        include_system=True,
    )
    # "messages" uses the add_messages reducer, so returning the trimmed list alone
    # would merge by ID and keep the old entries. Delete what was trimmed away instead.
    kept_ids = {m.id for m in trimmed}
    return {
        "messages": [
            RemoveMessage(id=m.id) for m in state["messages"] if m.id not in kept_ids
        ]
    }

# Insert before every LLM call node
builder.add_node("trim_context", trim_context_node)
builder.add_edge("trim_context", "planner")

3. Cost Optimization: Tiered LLM Routing

Not every graph node needs Claude or GPT-4o. A three-tier routing strategy consistently cuts per-request costs by 50–70%:

Task Type | Model | Cost / 1M tokens (input / output) | Latency
Classification, routing decisions | Ollama qwen3:8b (local) | $0.00 | ~200ms
Tool selection, structured extraction | claude-haiku-4-5 | $0.80 / $4.00 | ~600ms
Final synthesis, complex reasoning | claude-sonnet-4-5 | $3.00 / $15.00 | ~1.5s
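
To make the gap concrete, here is a back-of-envelope calculation using the prices from the table above. The per-request token counts (3,000 input, 500 output) are illustrative assumptions, not measurements; actual savings depend on your routing mix and on the overhead of the classification call itself.

# Rough per-request cost by tier, using the table's prices (USD per 1M tokens)
PRICES = {
    "qwen3:8b (local)":   (0.00, 0.00),
    "claude-haiku-4-5":   (0.80, 4.00),
    "claude-sonnet-4-5":  (3.00, 15.00),
}

# Hypothetical request size: 3,000 input tokens, 500 output tokens
INPUT_TOKENS, OUTPUT_TOKENS = 3_000, 500

for model_name, (price_in, price_out) in PRICES.items():
    cost = (INPUT_TOKENS * price_in + OUTPUT_TOKENS * price_out) / 1_000_000
    print(f"{model_name:<20} ${cost:.4f} per request")

# qwen3:8b (local)     $0.0000 per request
# claude-haiku-4-5     $0.0044 per request
# claude-sonnet-4-5    $0.0165 per request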

Implementing tiered routing in LangGraph

from langchain_anthropic import ChatAnthropic
from langchain_ollama import ChatOllama

# Models instantiated once at module level (Lambda container reuse)
cheap_model = ChatOllama(
    model="qwen3:8b",
    base_url="http://100.x.x.x:11434",  # Tailscale IP of home GPU
)
mid_model = ChatAnthropic(model="claude-haiku-4-5-20251001")
strong_model = ChatAnthropic(model="claude-sonnet-4-5-20251001")

def route_by_complexity(state: AgentState) -> str:
    """Use cheap model to classify, then route to appropriate tier."""
    last_msg = state["messages"][-1].content
    classification = cheap_model.invoke(
        f"Classify this task (one word): SIMPLE | MODERATE | COMPLEX\n{last_msg}"
    ).content.strip()

    if "SIMPLE" in classification:
        return "planner_cheap"
    elif "MODERATE" in classification:
        return "planner_mid"
    else:
        return "planner_strong"

builder.add_conditional_edges("router", route_by_complexity, {
    "planner_cheap": "planner_cheap",
    "planner_mid": "planner_mid",
    "planner_strong": "planner_strong",
})
Warning: Ollama via Tailscale adds ~5–15ms network latency compared to localhost. For sub-100ms SLA requirements, run the small model inside the Lambda container itself (use ollama-python with a bundled GGUF file) — but expect a 2–3 GB container image.
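
One more cost control worth wiring in early: the cost_usd field declared in AgentState only pays off if every LLM node updates it. A minimal sketch, assuming the response carries usage_metadata (langchain-anthropic populates it); the track_cost helper and its price table are illustrative, not part of any library.

# (input, output) USD per 1M tokens, mirroring the tier table above
PRICE_PER_MTOK = {
    "claude-haiku-4-5-20251001": (0.80, 4.00),
    "claude-sonnet-4-5-20251001": (3.00, 15.00),
}

def track_cost(state: AgentState, response, model_name: str) -> dict:
    """Return a partial state update adding this call's spend to the running total."""
    usage = getattr(response, "usage_metadata", None) or {}
    price_in, price_out = PRICE_PER_MTOK.get(model_name, (0.0, 0.0))
    call_cost = (
        usage.get("input_tokens", 0) * price_in
        + usage.get("output_tokens", 0) * price_out
    ) / 1_000_000
    return {"cost_usd": state.get("cost_usd", 0.0) + call_cost}

def planner_strong(state: AgentState) -> dict:
    """Example LLM node that records its own cost alongside its message."""
    response = strong_model.invoke(state["messages"])
    return {
        "messages": [response],
        **track_cost(state, response, "claude-sonnet-4-5-20251001"),
    }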

4. Deploying to AWS Lambda via Tailscale

The canonical production setup: a Lambda function runs the graph logic while a Tailscale userspace network daemon connects it to your private inference servers without exposing any public endpoints. This removes the need for a VPC, NAT gateway, and the associated costs (~$45/month saved per VPC endpoint).

Dockerfile (arm64, Lambda Web Adapter)

FROM public.ecr.aws/lambda/python:3.12-arm64

# Lambda Web Adapter for streaming support
COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.8.4 /lambda-adapter /opt/extensions/lambda-adapter

WORKDIR /var/task

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Tailscale userspace binary (no kernel module needed)
RUN curl -fsSL https://pkgs.tailscale.com/stable/tailscale_1.78.0_arm64.tgz | tar -xz
RUN mv tailscale_1.78.0_arm64/tailscale tailscale_1.78.0_arm64/tailscaled /usr/local/bin/

COPY app/ .
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh

# Lambda Web Adapter expects PORT env var
ENV PORT=8080
EXPOSE 8080

# Override the base image's Runtime Interface Client entrypoint so the web
# server is the main process; the Web Adapter extension bridges it to the
# Lambda Runtime API
ENTRYPOINT ["./entrypoint.sh"]

entrypoint.sh — Tailscale bootstrap

#!/bin/bash
set -e

# Start the Tailscale daemon in userspace mode (no TUN device in Lambda).
# Only /tmp is writable in the Lambda filesystem, so the state file and the
# control socket must both live there.
tailscaled \
  --state=/tmp/tailscaled.state \
  --socket=/tmp/tailscaled.sock \
  --tun=userspace-networking \
  --socks5-server=localhost:1055 &

# Authenticate (auth key injected as an env var, e.g. from SSM or Secrets Manager)
tailscale --socket=/tmp/tailscaled.sock up \
  --authkey="${TS_AUTHKEY}" \
  --hostname="lambda-agent-${AWS_LAMBDA_FUNCTION_NAME}" \
  --advertise-tags=tag:lambda \
  --accept-routes \
  --timeout=15s

echo "Tailscale connected: $(tailscale --socket=/tmp/tailscaled.sock ip -4)"

# In userspace mode, traffic to tailnet IPs must go through the local SOCKS5
# proxy exposed by tailscaled
export ALL_PROXY=socks5://localhost:1055

# Now start FastAPI (reachable from Lambda Web Adapter)
exec uvicorn api:app --host 0.0.0.0 --port 8080

serverless.yml fragment

functions:
  graph-agent:
    image:
      uri: 874735685088.dkr.ecr.eu-west-1.amazonaws.com/langgraph-agent:latest
    architecture: arm64
    memorySize: 1024    # Tailscale + Python fits in 1 GB
    timeout: 120        # allow long graph runs
    environment:
      TS_AUTHKEY: ${ssm:/talki/tailscale/lambda-authkey}
      LANGGRAPH_CHECKPOINT_TABLE: langgraph-checkpoints
    events:
      - http:
          path: /chat/{thread_id}
          method: post
          cors: true
    provisionedConcurrency: 1   # keep one warm for <2s cold start

Case Study: 60% Cost Reduction at a SaaS Startup

A B2B SaaS startup used a LangGraph agent to power their customer support automation. Their initial setup: every message went to claude-sonnet-4-5, state was stored in a Redis-compatible ElastiCache cluster, and the stack ran on a dedicated EC2 instance.

Monthly bill before: $2,340/month (EC2 + ElastiCache + Anthropic API calls averaging 12,000 input tokens per conversation).

After migrating to LangGraph on Lambda with tiered routing and DynamoDB checkpoints:

  • 73% of requests now route to Ollama qwen3:8b on a home server (connected via Tailscale) — cost: $0.
  • Context trimming cut average input tokens from 12,000 to 4,800 per conversation (60% reduction).
  • Lambda arm64 + Graviton2 cut compute cost 34% vs x86.
  • DynamoDB On-Demand replaced ElastiCache — $12/month vs $180/month.

Monthly bill after: $920/month. A 61% reduction with no degradation in answer quality (measured by CSAT scores before and after).

5. Graph Debugging in Production

When an agent produces a wrong answer in production, you need to reproduce the exact state at failure time. LangGraph's checkpointer makes this straightforward:

import asyncio
from your_graph import graph

async def replay_failed_session(thread_id: str, checkpoint_id: str | None = None):
    """Replay a production failure locally for debugging."""
    config = {
        "configurable": {
            "thread_id": thread_id,
            # Omit checkpoint_id to start from latest, or pin to a specific one
            **({"checkpoint_id": checkpoint_id} if checkpoint_id else {}),
        }
    }

    # Get state snapshot at that checkpoint
    state = await graph.aget_state(config)
    print("State at checkpoint:")
    print(f"  messages: {len(state.values['messages'])} items")
    print(f"  next nodes: {state.next}")
    print(f"  cost so far: ${state.values.get('cost_usd', 0):.4f}")

    # Re-run from that checkpoint
    result = await graph.ainvoke(None, config)  # None = resume, don't add new message
    return result

asyncio.run(replay_failed_session("user-session-abc123"))
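
To find which checkpoint_id is worth pinning, walk the thread's checkpoint history (newest first) with the same compiled graph. A short sketch:

async def list_checkpoints(thread_id: str, limit: int = 10):
    """Print the most recent checkpoints for a thread, newest first."""
    config = {"configurable": {"thread_id": thread_id}}
    shown = 0
    async for snapshot in graph.aget_state_history(config):
        checkpoint_id = snapshot.config["configurable"]["checkpoint_id"]
        print(
            f"{checkpoint_id}  next={snapshot.next}  "
            f"messages={len(snapshot.values.get('messages', []))}"
        )
        shown += 1
        if shown >= limit:
            break

asyncio.run(list_checkpoints("user-session-abc123"))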

Implementation Checklist

  • State schema: typed with TypedDict, include a cost tracker field from day one.
  • Checkpointer: DynamoDB for Lambda deployments (serverless, no connection pooling needed), Postgres for dedicated servers.
  • Context trimming: insert a trim node before every LLM call, targeting 80% of the model's context window.
  • Tiered routing: classify task complexity with a local model before invoking expensive frontier models.
  • Streaming: expose astream_events() via SSE for any user-facing endpoint.
  • Tailscale networking: use userspace mode in Lambda — no VPC, no NAT, no public IP for your inference server.
  • Replay tooling: always verify you can replay a checkpoint before deploying to production.

FAQ

What is the difference between LangChain and LangGraph?

LangChain is a toolkit for building LLM-powered applications with chains and retrieval. LangGraph is a layer on top that models agent logic as a stateful directed graph, enabling cycles, branching, and persistent checkpoints — essential for complex multi-step agents.

Does LangGraph support streaming responses?

Yes. LangGraph exposes astream() and astream_events() for token-level streaming. You can stream both the final response and intermediate node outputs, which is critical for low-latency user experiences.

How do I persist graph state across Lambda invocations?

Use LangGraph's built-in checkpointer with a DynamoDB or PostgreSQL backend. Each invocation loads the checkpoint by thread_id, executes the next graph node, and writes the updated state back before returning.

What is the cheapest way to run LangGraph agents in production?

Route short reasoning tasks to a local Ollama model (via Tailscale), reserve Claude or GPT-4o only for final synthesis, and cache repeated sub-graph results. Teams using this tiered approach typically cut token costs by 50–70%.

Can LangGraph agents run on AWS Lambda?

Yes. Package the graph as a Docker container (arm64), connect to your home inference server via Tailscale userspace networking, and expose the Lambda behind API Gateway. Cold starts are typically under 2 seconds, and provisioned concurrency keeps one instance warm.

Voir les formationsVerifier eligibilite OPCO