Introduction
In this blog post, we’ll explore how developers and teams can speed up development, debugging, and performance analysis of AI-powered applications by running models locally—using tools like Docker Model Runner, MCP (Model Context Protocol), and an observability stack.
Running everything locally not only removes the need for costly cloud calls during development, but also gives you production-like visibility into your system—so you can catch issues early, understand latency, analyze errors, and optimize performance before shipping anything.
One key part of this setup is MCP, a simple but powerful middleware layer that connects your frontend or APIs to local AI models. For example, in a document analysis app, the MCP server handles incoming requests, extracts content from files (like PDFs), and sends prompts to the local model running inside a Docker container. Combined with observability tools (like OpenTelemetry, Jaeger, and Prometheus), this creates a self-contained environment that feels like production—just without the cost or complexity.
Why Are Traces and Metrics Important for LLM Applications?
| Challenge | Explanation |
|---|---|
| Non-determinism | The same input can produce different outputs due to randomness in LLMs. |
| Subjective Quality | Quality is not just about being correct; it includes tone, relevance, and coherence, which are harder to measure. |
| Multiple Processing Steps | LLM apps often involve several steps (e.g., input processing → model call → post-processing), making it harder to track what’s slow or broken. |
| Resource Usage | LLMs can be very heavy on CPU, GPU, memory, and storage, especially when running locally. |
| Cost | Token usage costs for cloud models or hardware/infrastructure costs for local models can add up quickly. |
| Concurrency | As user volume increases, it becomes important to monitor how well the system handles multiple requests at once without degrading performance. |
| Observability Value | Traces and metrics help developers understand performance, detect errors, control costs, and manage scalability in a reliable and informed way. |
Traces
| Trace Element | What It Shows | Why It’s Valuable |
|---|---|---|
| Full Request Trace | Tracks the journey of a user request through different parts of the system. | Helps measure total latency and identify which step (e.g., input handling, model processing) is slow. |
| Backend Processing Span | Measures time spent handling the logic in the backend service. | Shows how the backend handles concurrent requests. |
| Input Processing Span | Tracks time taken for tasks like parsing, formatting, or validation before sending to the model. | Useful for optimizing under high concurrency when pre-processing queues build up. |
| Model Inference Span | Measures how long it takes the model to respond to a given prompt or input. | Useful for tuning batching or managing queueing when concurrency is high. |
| Output Handling Span | Measures time for post-processing (e.g., formatting output). | Ensures that final steps are efficient. |
| Input/Output Attributes | Stores prompt, response, token count, etc. for each request span. | Useful for correlating long inputs or outputs with performance drops. |
| Error Traces | Captures when and where errors occur (e.g., failed model call, input error). | Helps diagnose issues that might only occur under concurrency stress (e.g., timeouts, rate limits). |
Metrics
| Metric | What It Measures | Why It’s Valuable (Includes Concurrency Aspects) |
|---|---|---|
| Request Latency (p50/p90/p99) | Time taken to complete a request at different percentiles. | Tracks how fast the system is and how speed degrades under load. |
| Throughput (Requests/sec) | Number of requests the system can handle per second. | Critical for understanding how concurrency affects system load. |
| Error Rate (%) | Percentage of requests that fail or return errors. | Helps detect instability or bugs. |
| Resource Usage (CPU, GPU, Memory) | How many system resources are being consumed. | Helps with scaling decisions and resource optimization. |
| Token Usage | Number of tokens processed in requests (input and output). | Useful for cost tracking and understanding load. |
| Quality Scores | Metrics that measure relevance, accuracy, or usefulness of responses. | Helps ensure output quality stays high under different loads. |
| User Feedback | Ratings or other direct user opinions. | Detects satisfaction trends and helps build production datasets for training or fine-tuning. |
| Safety/Compliance Scores | Presence of sensitive data or policy violations in outputs. | Ensures safe operation. |
Concurrency
- In Traces: Concurrency issues can show up as overlapping spans, delayed model responses, or backend queuing delays.
- In Metrics: Look for increased p99 latency, rising error rates, or CPU/GPU spikes when traffic increases.
- Why It Matters: As LLM apps scale, tracking how multiple simultaneous users affect performance, quality, and stability becomes critical.
Let’s Build!!
We’re creating a document analysis web application that:
- Allows users to upload PDF documents
- Extracts text from these documents
- Uses a locally running LLM to analyze the content
- Provides insights and summaries about the document
- Enables chat-based interaction with the document content
- Includes comprehensive monitoring and observability
The Technology Stack
Local AI Development Workflow (with MCP + Docker + Observability)
[ User Frontend / API ]
|
v
┌────────────────────┐
│ MCP Server │ ◄── Observability: Traces, Logs, Metrics
└────────────────────┘
|
v
┌────────────────────────────┐
│ Local Docker Model Runner │ ◄── LLM (e.g., LLaMA, Mistral)
└────────────────────────────┘
|
v
┌────────────────────────────┐
│ Observability Tools │ ◄── Jaeger, Prometheus, Grafana
└────────────────────────────┘
Key components:
- Frontend/API: Triggers an analysis request.
- MCP Server: Extracts, formats, and sends data to the model; acts as a smart controller.
- Docker Model Runner: Hosts the LLM locally and responds to prompts.
- Observability Layer: Collects performance data, traces, and error logs across all steps.
Setting Up the Environment
1. Docker Model Runner
Docker Desktop now includes Model Runner, which allows you to run AI models locally without depending on external API services.
# Enable Docker Model Runner
$ docker desktop enable model-runner
# Pull the Llama 3 model
$ docker model pull ai/llama3.2:1B-Q8_0
# Verify the model is available
$ docker model list
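Once the model is pulled, you can sanity-check the OpenAI-compatible API from inside any container on the Docker network. The base URL below matches the one used later in this post; adjust it if your Docker setup exposes Model Runner differently.
# List the models served by Model Runner (run from inside a container)
$ curl http://model-runner.docker.internal/engines/v1/models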
2. Project Structure
Project GitHub skeleton:
https://github.com/kubetoolsca/docker-model-runner-observability
├── backend/ # Express backend
│ ├── routes/ # API routes
│ │ └── document.js # Document processing endpoints
│ ├── observability.js # OpenTelemetry setup
│ ├── server.js # Express server setup
│ └── Dockerfile # Backend container config
├── src/ # React frontend
│ ├── components/ # UI components
│ └── App.tsx # Main application component
├── observability/ # Observability configuration
│ ├── otel-collector-config.yaml # OpenTelemetry Collector config
│ ├── prometheus.yml # Prometheus config
│ └── grafana/ # Grafana dashboards
└── docker-compose.yml # Multi-container orchestration
Backend
The backend service handles document uploads, text extraction, and communication with the local LLM via Docker Model Runner.
Document Routes
The document.js routes file handles two main operations:
- /analyze – Upload and analyze a document
- /chat – Chat with a document that has already been analyzed
The document analysis flow works like this (a simplified sketch follows the list):
- Receive the uploaded PDF file
- Store it temporarily
- Extract text using pdf-parse
- Send extracted text to the local LLM for analysis
- Return results to the user
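Here is that flow as a sketch of the /analyze route, assuming multer for uploads and a helper analyzeWithModelRunner() that wraps the Model Runner call shown in the next section (both names are illustrative, not the exact repository code):
// Sketch of the /analyze route: upload -> temp storage -> pdf-parse -> LLM -> response.
// multer and analyzeWithModelRunner() are illustrative, not the exact repo code.
const express = require("express");
const multer = require("multer");
const pdfParse = require("pdf-parse");
const fs = require("fs");

const router = express.Router();
const upload = multer({ dest: "uploads/" }); // store the PDF temporarily on disk

router.post("/analyze", upload.single("document"), async (req, res) => {
  try {
    // Extract text from the uploaded PDF
    const fileBuffer = fs.readFileSync(req.file.path);
    const { text } = await pdfParse(fileBuffer);

    // Send the extracted text to the local LLM via Model Runner (next section)
    const analysis = await analyzeWithModelRunner(text);

    // Return the results to the user
    res.json({ extractedText: text, analysis });
  } catch (err) {
    res.status(500).json({ error: err.message });
  } finally {
    // Remove the temporary file
    if (req.file) fs.unlinkSync(req.file.path);
  }
});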
Docker Model Runner Integration
The challenging part of this implementation was connecting to Docker Model Runner correctly.
Model Runner exposes an OpenAI-compatible API, which means we need to format our requests accordingly:
// Example of calling the Model Runner API
const response = await axios.post(
  `${baseUrl}/chat/completions`,
  {
    model: targetModel,
    messages: [
      {
        role: "system",
        content: "You are a helpful document analysis assistant."
      },
      {
        role: "user",
        content: `Analyze this document: ${extractedText}`
      }
    ],
    temperature: 0.7,
    max_tokens: 1024
  },
  {
    headers: { "Content-Type": "application/json" }
  }
);
The keys to making this work are:
- Using the correct hostname: model-runner.docker.internal
- Using the correct endpoint path: /engines/v1/chat/completions
- Formatting the request body to match the OpenAI chat completions API
- Properly handling the response structure (see the snippet below)
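For the last point: the OpenAI-compatible response nests the generated text under choices, so a minimal (and slightly defensive) way to read it looks like this:
// Pull the model's reply out of the OpenAI-style response shape
const analysis =
  response.data?.choices?.[0]?.message?.content ??
  "No analysis returned by the model.";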
Additionally, we implemented multiple fallback mechanisms to ensure our application stays responsive even if the Model Runner service is unavailable (a sketch follows the list):
- Multiple endpoint URLs to try (for different Docker configurations)
- Graceful error handling with useful feedback
- Extraction-only mode when AI services are unavailable
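A sketch of how those fallbacks can fit together; the endpoint list and function name are illustrative rather than the exact repository code:
const axios = require("axios");

// Candidate base URLs for different Docker configurations (illustrative list)
const CANDIDATE_ENDPOINTS = [
  process.env.DMR_API_ENDPOINT,
  "http://model-runner.docker.internal/engines/v1",
].filter(Boolean);

async function callModelWithFallback(messages) {
  for (const baseUrl of CANDIDATE_ENDPOINTS) {
    try {
      const response = await axios.post(
        `${baseUrl}/chat/completions`,
        { model: process.env.TARGET_MODEL, messages, temperature: 0.7, max_tokens: 1024 },
        { headers: { "Content-Type": "application/json" }, timeout: 60000 }
      );
      return response.data.choices[0].message.content;
    } catch (err) {
      console.warn(`Model Runner unreachable at ${baseUrl}: ${err.message}`);
    }
  }
  // Every endpoint failed: fall back to extraction-only mode with a clear message
  return null;
}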
Setting Up the Observability Stack
A key aspect of our sample application is the observability stack, which helps monitor the system’s performance and identify issues.
OpenTelemetry Configuration
The observability.js file sets up OpenTelemetry in the Node.js backend:
function setupObservability(serviceName = 'document-analysis-service') {
  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
  });

  // Configure OTel exporter
  const traceExporter = process.env.OTEL_EXPORTER_OTLP_ENDPOINT
    ? new OTLPTraceExporter({
        url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces`,
      })
    : undefined;

  const sdk = new NodeSDK({
    resource,
    traceExporter,
    instrumentations: [
      getNodeAutoInstrumentations({
        '@opentelemetry/instrumentation-http': { enabled: true },
        '@opentelemetry/instrumentation-express': { enabled: true },
      }),
    ],
  });

  sdk.start();

  // Setup process exit handlers
}
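Auto-instrumentation covers HTTP and Express, but the LLM-specific steps (text extraction, model inference) are worth wrapping in custom spans so they show up as distinct steps in Jaeger. A sketch using the OpenTelemetry API; the span and attribute names, and the analyzeWithModelRunner() helper, are illustrative:
const { trace, SpanStatusCode } = require("@opentelemetry/api");

const tracer = trace.getTracer("document-analysis-service");

async function analyzeWithTracing(extractedText) {
  // Wrap the model call in a dedicated span so inference time shows up in traces
  return tracer.startActiveSpan("model.inference", async (span) => {
    try {
      span.setAttribute("document.characters", extractedText.length);
      const analysis = await analyzeWithModelRunner(extractedText); // hypothetical helper
      span.setAttribute("analysis.characters", analysis.length);
      return analysis;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}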
OpenTelemetry Collector
The OpenTelemetry Collector acts as a central aggregation point for our observability data. Our configuration in otel-collector-config.yaml defines:
- Receivers: How data enters the collector (OTLP over HTTP)
- Processors: How data is processed (batching)
- Exporters: Where data is sent (Prometheus and Jaeger)
- Pipelines: How data flows through the collector
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: document_analysis
  logging:
    verbosity: detailed
  otlp:
    endpoint: jaeger:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, prometheus]
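The prometheus exporter above serves metrics on port 8889, so Prometheus needs a scrape job pointing at the collector. A minimal prometheus.yml along these lines works (the service name otel-collector comes from the docker-compose.yml shown later; the scrape interval is just a sensible default):
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]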
Visualization with Grafana
Grafana provides dashboards to visualize the metrics and traces.
The pre-configured dashboard includes:
- Document analysis request metrics
- Response time statistics
This gives visibility into:
- How many documents are being processed
- How long analysis takes
- Success and failure rates
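Panels like these are typically built from PromQL queries over the request-duration histogram exported by the backend; the metric name below is illustrative (check Prometheus for the exact name, which will carry the document_analysis namespace configured earlier):
# p99 request latency over the last 5 minutes (illustrative metric name)
histogram_quantile(0.99, sum by (le) (
  rate(document_analysis_http_server_duration_milliseconds_bucket[5m])
))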
Containerizing the Application
The docker-compose.yml file orchestrates all services:
services:
  # Frontend Application
  frontend:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"

  # MCP Server (Backend)
  mcp-server:
    build: ./backend
    ports:
      - "3000:3000"
    environment:
      - DMR_API_ENDPOINT=http://model-runner.docker.internal/engines/v1
      - TARGET_MODEL=ai/llama3.2:1B-Q8_0
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
      - ./observability/otel-collector-config.yaml:/etc/otel-collector-config.yaml

  # Jaeger for trace visualization
  jaeger:
    image: jaegertracing/all-in-one

  # Prometheus for metrics storage
  prometheus:
    image: prom/prometheus

  # Grafana for dashboards
  grafana:
    image: grafana/grafana
The Frontend Components
Key components:
- DocumentUploader: Handles file uploads with drag-and-drop
- DocumentAnalysisResult: Displays analysis results
- ChatInterface: Allows users to chat about the document (a simplified sketch of its backend call follows)
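As a sketch of how ChatInterface could call the backend /chat route (the URL and payload field names are illustrative; adjust them to match how the routes are mounted):
// Send a chat message about the analyzed document to the backend
// (illustrative endpoint and payload shape)
async function sendChatMessage(documentId, message) {
  const response = await fetch("http://localhost:3000/api/document/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ documentId, message }),
  });
  if (!response.ok) {
    throw new Error(`Chat request failed: ${response.status}`);
  }
  return response.json();
}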
Challenges and Solutions
1. Docker Model Runner Connectivity
Challenge: The backend container couldn’t connect to the Docker Model Runner service.
Solution: Use the hostname Docker Desktop exposes to containers for reaching Model Runner:
- model-runner.docker.internal
2. Error Handling and Fallbacks
Challenge: AI services can be unreliable or unavailable.
Solution: Implement graceful degradation:
- Always provide basic text extraction even if analysis fails
- Clear error messages for debugging
- Multiple endpoint fallbacks
3. Observability Integration
Challenge: Tracking performance across multiple services.
Solution: OpenTelemetry integration with:
- Automatic instrumentation for HTTP and Express
- Custom span creation for important operations
- Centralized collection and visualization
Performance Observations
With this implementation, you can observe, for example:
- Processing Speed: PDFs under 10MB typically process in 1-3 seconds for text extraction
- AI Analysis Time: The local Llama 3 model analysis takes 3-8 seconds depending on document length
- Memory: The backend uses approximately 150-250MB RAM
- Response Size: Analysis results average 1-5KB of text
Conclusion
This implementation demonstrates how an AI application can be built and tested using locally run models via Docker Model Runner and MCP. Adding OpenTelemetry, Jaeger, Prometheus, and Grafana provides comprehensive visibility into the application’s performance and into how the local model behaves.
By running LLMs locally with Docker Model Runner, we get:
- Privacy: Document data never leaves your infrastructure
- Cost efficiency: No per-token API charges
- Reliability: No dependency on external API availability
- Control: Choose appropriate models for your use case
& valuable insights into:
- Performance bottlenecks: Identify slow components
- Error patterns: Detect recurring issues
- Resource utilization: Optimize container resources
- User behavior: Understand how the application is used
Next Steps
To extend this project, consider:
- Supporting more file formats (DOC, DOCX, TXT)
- Fine-tuning the LLM for specific document types
- Using a load testing tool like Locust or k6 (a minimal k6 sketch follows)
- The sky is the limit 🙂
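As a starting point for the k6 idea above, a minimal smoke test against the analysis endpoint could look like this (the URL, form field name, and threshold are illustrative and assume a sample PDF next to the script):
// k6 smoke test sketch: upload a sample PDF to the /analyze endpoint
// (URL, field name, and thresholds are illustrative)
import http from "k6/http";
import { check } from "k6";

const pdf = open("./sample.pdf", "b"); // load the file in binary mode

export const options = {
  vus: 5,
  duration: "1m",
  thresholds: {
    http_req_duration: ["p(99)<10000"], // keep p99 under 10 seconds
  },
};

export default function () {
  const res = http.post("http://localhost:3000/api/document/analyze", {
    document: http.file(pdf, "sample.pdf", "application/pdf"),
  });
  check(res, { "status is 200": (r) => r.status === 200 });
}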