Introduction
In this blog post, we’ll explore how developers and teams can speed up development, debugging, and performance analysis of AI-powered applications by running models locally—using tools like Docker Model Runner, MCP (Model Context Protocol), and an observability stack.
Running everything locally not only removes the need for costly cloud calls during development, but also gives you production-like visibility into your system—so you can catch issues early, understand latency, analyze errors, and optimize performance before shipping anything.
One key part of this setup is MCP, a simple but powerful middleware layer that connects your frontend or APIs to local AI models. For example, in a document analysis app, the MCP server handles incoming requests, extracts content from files (like PDFs), and sends prompts to the local model running inside a Docker container. Combined with observability tools (like OpenTelemetry, Jaeger, and Prometheus), this creates a self-contained environment that feels like production—just without the cost or complexity.
Why Are Traces and Metrics Important for LLM Applications?
| Challenge | Explanation |
|---|---|
| Non-determinism | The same input can produce different outputs due to randomness in LLMs. |
| Subjective Quality | Quality is not just about being correct; it includes tone, relevance, and coherence, which are harder to measure. |
| Multiple Processing Steps | LLM apps often involve several steps (e.g., input processing → model call → post-processing), making it harder to track what’s slow or broken. |
| Resource Usage | LLMs can be very heavy on CPU, GPU, memory, and storage, especially when running locally. |
| Cost | Token usage costs for cloud models or hardware/infrastructure costs for local models can add up quickly. |
| Concurrency | As user volume increases, it becomes important to monitor how well the system handles multiple requests at once without degrading performance. |
| Observability Value | Traces and metrics help developers understand performance, detect errors, control costs, and manage scalability in a reliable and informed way. |
Traces
| Trace Element | What It Shows | Why It’s Valuable |
|---|---|---|
| Full Request Trace | Tracks the journey of a user request through different parts of the system. | Helps measure total latency and identify which step (e.g., input handling, model processing) is slow. |
| Backend Processing Span | Measures time spent handling the logic in the backend service. | Shows how the backend handles concurrent requests. |
| Input Processing Span | Tracks time taken for tasks like parsing, formatting, or validation before sending to the model. | Useful for optimizing under high concurrency when pre-processing queues build up. |
| Model Inference Span | Measures how long it takes the model to respond to a given prompt or input. | Useful for tuning batching or managing queueing when concurrency is high. |
| Output Handling Span | Measures time for post-processing (e.g., formatting output). | Ensures that final steps are efficient. |
| Input/Output Attributes | Stores prompt, response, token count, etc. for each request span. | Useful for correlating long inputs or outputs with performance drops. |
| Error Traces | Captures when and where errors occur (e.g., failed model call, input error). | Helps diagnose issues that might only occur under concurrency stress (e.g., timeouts, rate limits). |
Metrics
| Metric | What It Measures | Why It’s Valuable (Includes Concurrency Aspects) |
|---|---|---|
| Request Latency (p50/p90/p99) | Time taken to complete a request at different percentiles. | Tracks how fast the system is and how speed degrades under load. |
| Throughput (Requests/sec) | Number of requests the system can handle per second. | Critical for understanding how concurrency affects system load. |
| Error Rate (%) | Percentage of requests that fail or return errors. | Helps detect instability or bugs. |
| Resource Usage (CPU, GPU, Memory) | How many system resources are being consumed. | Helps with scaling decisions and resource optimization. |
| Token Usage | Number of tokens processed in requests (input and output). | Useful for cost tracking and understanding load. |
| Quality Scores | Metrics that measure relevance, accuracy, or usefulness of responses. | Helps ensure output quality stays high under different loads. |
| User Feedback | Ratings or other direct user opinions. | Detects satisfaction trends and helps build production datasets for training or fine-tuning. |
| Safety/Compliance Scores | Presence of sensitive data or policy violations in outputs. | Ensures safe operation. |
Concurrency
- In Traces: Concurrency issues can show up as overlapping spans, delayed model responses, or backend queuing delays.
- In Metrics: Look for increased p99 latency, rising error rates, or CPU/GPU spikes when traffic increases.
- Why It Matters: As LLM apps scale, tracking how multiple simultaneous users affect performance, quality, and stability becomes critical.
Let’s Build!!
We’re creating a document analysis web application that:
- Allows users to upload PDF documents
- Extracts text from these documents
- Uses a locally running LLM to analyze the content
- Provides insights and summaries about the document
- Enables chat-based interaction with the document content
- Includes comprehensive monitoring and observability
The Technology Stack
Local AI Development Workflow (with MCP + Docker + Observability)
[ User Frontend / API ]
|
v
┌────────────────────┐
│ MCP Server │ ◄── Observability: Traces, Logs, Metrics
└────────────────────┘
|
v
┌────────────────────────────┐
│ Local Docker Model Runner │ ◄── LLM (e.g., LLaMA, Mistral)
└────────────────────────────┘
|
v
┌────────────────────────────┐
│ Observability Tools │ ◄── Jaeger, Prometheus, Grafana
└────────────────────────────┘
Key components:
- Frontend/API: Triggers an analysis request.
- MCP Server: Extracts, formats, and sends data to the model; acts as a smart controller.
- Docker Model Runner: Hosts the LLM locally and responds to prompts.
- Observability Layer: Collects performance data, traces, and error logs across all steps.
Setting Up the Environment
1. Docker Model Runner
Docker Desktop now includes Model Runner, which allows you to run AI models locally without depending on external API services.
# Enable Docker Model Runner
$ docker desktop enable model-runner
# Pull the Llama 3 model
$ docker model pull ai/llama3.2:1B-Q8_0
# Verify the model is available
$ docker model list
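Once the model is pulled, you can sanity-check the OpenAI-compatible API from inside any container on the Docker network. The base URL below matches the one used later in this post; adjust it if your Docker setup exposes Model Runner differently.
# List the models served by Model Runner (run from inside a container)
$ curl http://model-runner.docker.internal/engines/v1/models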
2. Project Structure
Project GitHub skeleton:
https://github.com/kubetoolsca/docker-model-runner-observability
├── backend/ # Express backend
│ ├── routes/ # API routes
│ │ └── document.js # Document processing endpoints
│ ├── observability.js # OpenTelemetry setup
│ ├── server.js # Express server setup
│ └── Dockerfile # Backend container config
├── src/ # React frontend
│ ├── components/ # UI components
│ └── App.tsx # Main application component
├── observability/ # Observability configuration
│ ├── otel-collector-config.yaml # OpenTelemetry Collector config
│ ├── prometheus.yml # Prometheus config
│ └── grafana/ # Grafana dashboards
└── docker-compose.yml # Multi-container orchestration
Backend
The backend service handles document uploads, text extraction, and communication with the local LLM via Docker Model Runner.
Document Routes
The document.js routes file handles two main operations:
- /analyze – Upload and analyze a document
- /chat – Chat with a document that has already been analyzed
The document analysis flow works like this (a simplified sketch follows the list):
- Receive the uploaded PDF file
- Store it temporarily
- Extract text using pdf-parse
- Send extracted text to the local LLM for analysis
- Return results to the user
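Here is that flow as a sketch of the /analyze route, assuming multer for uploads and a helper analyzeWithModelRunner() that wraps the Model Runner call shown in the next section (both names are illustrative, not the exact repository code):
// Sketch of the /analyze route: upload -> temp storage -> pdf-parse -> LLM -> response.
// multer and analyzeWithModelRunner() are illustrative, not the exact repo code.
const express = require("express");
const multer = require("multer");
const pdfParse = require("pdf-parse");
const fs = require("fs");

const router = express.Router();
const upload = multer({ dest: "uploads/" }); // store the PDF temporarily on disk

router.post("/analyze", upload.single("document"), async (req, res) => {
  try {
    // Extract text from the uploaded PDF
    const fileBuffer = fs.readFileSync(req.file.path);
    const { text } = await pdfParse(fileBuffer);

    // Send the extracted text to the local LLM via Model Runner (next section)
    const analysis = await analyzeWithModelRunner(text);

    // Return the results to the user
    res.json({ extractedText: text, analysis });
  } catch (err) {
    res.status(500).json({ error: err.message });
  } finally {
    // Remove the temporary file
    if (req.file) fs.unlinkSync(req.file.path);
  }
});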
Docker Model Runner Integration
The challenging part of this implementation was connecting to Docker Model Runner correctly.
Model Runner exposes an OpenAI-compatible API, which means we need to format our requests accordingly:
// Example of calling the Model Runner API
const response = await axios.post(
  `${baseUrl}/chat/completions`,
  {
    model: targetModel,
    messages: [
      {
        role: "system",
        content: "You are a helpful document analysis assistant."
      },
      {
        role: "user",
        content: `Analyze this document: ${extractedText}`
      }
    ],
    temperature: 0.7,
    max_tokens: 1024
  },
  {
    headers: { "Content-Type": "application/json" }
  }
);
The keys to making this work are:
- Using the correct hostname: model-runner.docker.internal
- Using the correct endpoint path: /engines/v1/chat/completions
- Formatting the request body to match the OpenAI chat completions API
- Properly handling the response structure (see the snippet below)
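For the last point: the OpenAI-compatible response nests the generated text under choices, so a minimal (and slightly defensive) way to read it looks like this:
// Pull the model's reply out of the OpenAI-style response shape
const analysis =
  response.data?.choices?.[0]?.message?.content ??
  "No analysis returned by the model.";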
Additionally, we implemented multiple fallback mechanisms to ensure our application stays responsive even if the Model Runner service is unavailable (a sketch follows the list):
- Multiple endpoint URLs to try (for different Docker configurations)
- Graceful error handling with useful feedback
- Extraction-only mode when AI services are unavailable
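A sketch of how those fallbacks can fit together; the endpoint list and function name are illustrative rather than the exact repository code:
const axios = require("axios");

// Candidate base URLs for different Docker configurations (illustrative list)
const CANDIDATE_ENDPOINTS = [
  process.env.DMR_API_ENDPOINT,
  "http://model-runner.docker.internal/engines/v1",
].filter(Boolean);

async function callModelWithFallback(messages) {
  for (const baseUrl of CANDIDATE_ENDPOINTS) {
    try {
      const response = await axios.post(
        `${baseUrl}/chat/completions`,
        { model: process.env.TARGET_MODEL, messages, temperature: 0.7, max_tokens: 1024 },
        { headers: { "Content-Type": "application/json" }, timeout: 60000 }
      );
      return response.data.choices[0].message.content;
    } catch (err) {
      console.warn(`Model Runner unreachable at ${baseUrl}: ${err.message}`);
    }
  }
  // Every endpoint failed: fall back to extraction-only mode with a clear message
  return null;
}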
Setting Up the Observability Stack
A key aspect of our sample application is the observability stack, which helps monitor the system’s performance and identify issues.
OpenTelemetry Configuration
The observability.js file sets up OpenTelemetry in the Node.js backend:
function setupObservability(serviceName = 'document-analysis-service') {
  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
  });

  // Configure OTel exporter
  const traceExporter = process.env.OTEL_EXPORTER_OTLP_ENDPOINT
    ? new OTLPTraceExporter({
        url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces`,
      })
    : undefined;

  const sdk = new NodeSDK({
    resource,
    traceExporter,
    instrumentations: [
      getNodeAutoInstrumentations({
        '@opentelemetry/instrumentation-http': { enabled: true },
        '@opentelemetry/instrumentation-express': { enabled: true },
      }),
    ],
  });

  sdk.start();

  // Setup process exit handlers
}
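Auto-instrumentation covers HTTP and Express, but the LLM-specific steps (text extraction, model inference) are worth wrapping in custom spans so they show up as distinct steps in Jaeger. A sketch using the OpenTelemetry API; the span and attribute names, and the analyzeWithModelRunner() helper, are illustrative:
const { trace, SpanStatusCode } = require("@opentelemetry/api");

const tracer = trace.getTracer("document-analysis-service");

async function analyzeWithTracing(extractedText) {
  // Wrap the model call in a dedicated span so inference time shows up in traces
  return tracer.startActiveSpan("model.inference", async (span) => {
    try {
      span.setAttribute("document.characters", extractedText.length);
      const analysis = await analyzeWithModelRunner(extractedText); // hypothetical helper
      span.setAttribute("analysis.characters", analysis.length);
      return analysis;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}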
OpenTelemetry Collector
The OpenTelemetry Collector acts as a central aggregation point for our observability data. Our configuration in otel-collector-config.yaml defines:
- Receivers: How data enters the collector (OTLP over HTTP)
- Processors: How data is processed (batching)
- Exporters: Where data is sent (Prometheus and Jaeger)
- Pipelines: How data flows through the collector
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: document_analysis
  logging:
    verbosity: detailed
  otlp:
    endpoint: jaeger:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, prometheus]
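The prometheus exporter above serves metrics on port 8889, so Prometheus needs a scrape job pointing at the collector. A minimal prometheus.yml along these lines works (the service name otel-collector comes from the docker-compose.yml shown later; the scrape interval is just a sensible default):
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]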
Visualization with Grafana
Grafana provides dashboards to visualize the metrics and traces.
The pre-configured dashboard includes:
- Document analysis request metrics
- Response time statistics
This gives visibility into:
- How many documents are being processed
- How long analysis takes
- Success and failure rates
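Panels like these are typically built from PromQL queries over the request-duration histogram exported by the backend; the metric name below is illustrative (check Prometheus for the exact name, which will carry the document_analysis namespace configured earlier):
# p99 request latency over the last 5 minutes (illustrative metric name)
histogram_quantile(0.99, sum by (le) (
  rate(document_analysis_http_server_duration_milliseconds_bucket[5m])
))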
Containerizing the Application
The docker-compose.yml file orchestrates all services:
services:
  # Frontend Application
  frontend:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"

  # MCP Server (Backend)
  mcp-server:
    build: ./backend
    ports:
      - "3000:3000"
    environment:
      - DMR_API_ENDPOINT=http://model-runner.docker.internal/engines/v1
      - TARGET_MODEL=ai/llama3.2:1B-Q8_0
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
      - ./observability/otel-collector-config.yaml:/etc/otel-collector-config.yaml

  # Jaeger for trace visualization
  jaeger:
    image: jaegertracing/all-in-one

  # Prometheus for metrics storage
  prometheus:
    image: prom/prometheus

  # Grafana for dashboards
  grafana:
    image: grafana/grafana
The Frontend Components
Key components:
- DocumentUploader: Handles file uploads with drag-and-drop
- DocumentAnalysisResult: Displays analysis results
- ChatInterface: Allows users to chat about the document (a simplified sketch of its backend call follows)
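As a sketch of how ChatInterface could call the backend /chat route (the URL and payload field names are illustrative; adjust them to match how the routes are mounted):
// Send a chat message about the analyzed document to the backend
// (illustrative endpoint and payload shape)
async function sendChatMessage(documentId, message) {
  const response = await fetch("http://localhost:3000/api/document/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ documentId, message }),
  });
  if (!response.ok) {
    throw new Error(`Chat request failed: ${response.status}`);
  }
  return response.json();
}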
Challenges and Solutions
1. Docker Model Runner Connectivity
Challenge: The backend container couldn’t connect to the Docker Model Runner service.
Solution: Use the hostname Docker Desktop exposes to containers for reaching Model Runner:
- model-runner.docker.internal
2. Error Handling and Fallbacks
Challenge: AI services can be unreliable or unavailable.
Solution: Implement graceful degradation:
- Always provide basic text extraction even if analysis fails
- Clear error messages for debugging
- Multiple endpoint fallbacks
3. Observability Integration
Challenge: Tracking performance across multiple services.
Solution: OpenTelemetry integration with:
- Automatic instrumentation for HTTP and Express
- Custom span creation for important operations
- Centralized collection and visualization
Performance Observations
With this implementation, you can observe, for example:
- Processing Speed: PDFs under 10MB typically process in 1-3 seconds for text extraction
- AI Analysis Time: The local Llama 3 model analysis takes 3-8 seconds depending on document length
- Memory: The backend uses approximately 150-250MB RAM
- Response Size: Analysis results average 1-5KB of text
Conclusion
This implementation demonstrates how an AI application can be built and tested using locally run models via Docker Model Runner and MCP. Adding OpenTelemetry, Jaeger, Prometheus, and Grafana provides comprehensive visibility into the application’s performance and into how the local model behaves.
By running LLMs locally with Docker Model Runner, we get:
- Privacy: Document data never leaves your infrastructure
- Cost efficiency: No per-token API charges
- Reliability: No dependency on external API availability
- Control: Choose appropriate models for your use case
& valuable insights into:
- Performance bottlenecks: Identify slow components
- Error patterns: Detect recurring issues
- Resource utilization: Optimize container resources
- User behavior: Understand how the application is used
Next Steps
To extend this project, consider:
- Supporting more file formats (DOC, DOCX, TXT)
- Fine-tuning the LLM for specific document types
- Using a load testing tool like Locust or k6 (a minimal k6 sketch follows)
- The sky is the limit 🙂
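As a starting point for the k6 idea above, a minimal smoke test against the analysis endpoint could look like this (the URL, form field name, and threshold are illustrative and assume a sample PDF next to the script):
// k6 smoke test sketch: upload a sample PDF to the /analyze endpoint
// (URL, field name, and thresholds are illustrative)
import http from "k6/http";
import { check } from "k6";

const pdf = open("./sample.pdf", "b"); // load the file in binary mode

export const options = {
  vus: 5,
  duration: "1m",
  thresholds: {
    http_req_duration: ["p(99)<10000"], // keep p99 under 10 seconds
  },
};

export default function () {
  const res = http.post("http://localhost:3000/api/document/analyze", {
    document: http.file(pdf, "sample.pdf", "application/pdf"),
  });
  check(res, { "status is 200": (r) => r.status === 200 });
}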