Running Local LLMs with Docker Model Runner: A Deep Dive with Full Observability and Sample Application


Introduction

 

In this blog post, we’ll explore how developers and teams can speed up development, debugging, and performance analysis of AI-powered applications by running models locally—using tools like Docker Model Runner, MCP (Model Context Protocol), and an observability stack.

 

Running everything locally not only removes the need for costly cloud calls during development, but also gives you production-like visibility into your system—so you can catch issues early, understand latency, analyze errors, and optimize performance before shipping anything.

 

One key part of this setup is MCP, a simple but powerful middleware layer that connects your frontend or APIs to local AI models. For example, in a document analysis app, the MCP server handles incoming requests, extracts content from files (like PDFs), and sends prompts to the local model running inside a Docker container. Combined with observability tools (like OpenTelemetry, Jaeger, and Prometheus), this creates a self-contained environment that feels like production—just without the cost or complexity.

 


 

Why Are Traces and Metrics Important for LLM Applications?

| Challenge | Explanation |
| --- | --- |
| Non-determinism | The same input can produce different outputs due to randomness in LLMs. |
| Subjective Quality | Quality is not just about being correct; it also includes tone, relevance, and coherence, which are harder to measure. |
| Multiple Processing Steps | LLM apps often involve several steps (e.g., input processing → model call → post-processing), making it harder to track what's slow or broken. |
| Resource Usage | LLMs can be very heavy on CPU, GPU, memory, and storage, especially when running locally. |
| Cost | Token usage costs for cloud models or hardware/infrastructure costs for local models can add up quickly. |
| Concurrency | As user volume increases, it becomes important to monitor how well the system handles multiple requests at once without degrading performance. |
| Observability Value | Traces and metrics help developers understand performance, detect errors, control costs, and manage scalability in a reliable and informed way. |

 


 

Traces

| Trace Element | What It Shows | Why It's Valuable |
| --- | --- | --- |
| Full Request Trace | Tracks the journey of a user request through different parts of the system. | Helps measure total latency and identify which step (e.g., input handling, model processing) is slow. |
| Backend Processing Span | Measures time spent handling the logic in the backend service. | Shows how the backend handles concurrent requests. |
| Input Processing Span | Tracks time taken for tasks like parsing, formatting, or validation before sending to the model. | Useful for optimizing under high concurrency when pre-processing queues build up. |
| Model Inference Span | Measures how long the model takes to respond to a given prompt or input. | Useful for tuning batching or managing queueing when concurrency is high. |
| Output Handling Span | Measures time for post-processing (e.g., formatting output). | Ensures that final steps are efficient. |
| Input/Output Attributes | Stores the prompt, response, token count, etc. for each request span. | Useful for correlating long inputs or outputs with performance drops. |
| Error Traces | Captures when and where errors occur (e.g., failed model call, input error). | Helps diagnose issues that might only occur under concurrency stress (e.g., timeouts, rate limits). |
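
To make these trace elements concrete, here is a minimal sketch of how such spans could be created with the OpenTelemetry API in the Node.js backend. The tracer name, span names, and the extractPdfText / callModelRunner helpers are illustrative placeholders, not code taken from the sample repository.

// A minimal sketch of custom spans for the analysis pipeline.
// Tracer/span names and the extractPdfText / callModelRunner helpers
// are illustrative placeholders.
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('document-analysis-service');

async function analyzeDocument(file) {
  // Full request trace: parent span for the whole analysis
  return tracer.startActiveSpan('document.analyze', async (span) => {
    try {
      span.setAttribute('document.size_bytes', file.size);

      // Input processing span: PDF text extraction
      const text = await withSpan('document.extract_text', () => extractPdfText(file));

      // Model inference span: call to the local LLM
      const analysis = await withSpan('llm.inference', () => callModelRunner(text));

      return analysis;
    } catch (err) {
      // Error trace: record where and why the request failed
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Small helper that wraps an async step in its own child span
function withSpan(name, fn) {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      return await fn();
    } finally {
      span.end();
    }
  });
}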

 


 

Metrics

| Metric | What It Measures | Why It's Valuable (Includes Concurrency Aspects) |
| --- | --- | --- |
| Request Latency (p50/p90/p99) | Time taken to complete a request at different percentiles. | Tracks how fast the system is and how speed degrades under load. |
| Throughput (Requests/sec) | Number of requests the system can handle per second. | Critical for understanding how concurrency affects system load. |
| Error Rate (%) | Percentage of requests that fail or return errors. | Helps detect instability or bugs. |
| Resource Usage (CPU, GPU, Memory) | How much of the system's resources is being consumed. | Helps with scaling decisions and resource optimization. |
| Token Usage | Number of tokens processed per request (input and output). | Useful for cost tracking and understanding load. |
| Quality Scores | Metrics that measure the relevance, accuracy, or usefulness of responses. | Helps ensure output quality stays high under different loads. |
| User Feedback | Ratings or other direct user opinions. | Detects satisfaction trends and helps build production datasets for training or fine-tuning. |
| Safety/Compliance Scores | Measures exposure of sensitive data or policy violations. | Ensures safe operation. |
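
As a companion sketch (metric and attribute names are assumptions, not from the sample repository), this is roughly how such metrics could be recorded in the Node.js backend with the OpenTelemetry metrics API:

// A minimal sketch of recording request metrics with the OpenTelemetry
// metrics API; meter, metric, and attribute names are illustrative.
const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('document-analysis-service');

// Throughput and error rate: count requests, labelled by route and status
const requestCounter = meter.createCounter('document_requests_total', {
  description: 'Total document analysis requests',
});

// Latency percentiles (p50/p90/p99) are derived from a histogram
const latencyHistogram = meter.createHistogram('document_request_duration_ms', {
  description: 'End-to-end request duration in milliseconds',
  unit: 'ms',
});

// Token usage per request, useful for cost and load tracking
const tokenCounter = meter.createCounter('llm_tokens_total', {
  description: 'Tokens processed by the local model',
});

function recordRequest({ route, status, durationMs, promptTokens = 0, completionTokens = 0 }) {
  requestCounter.add(1, { route, status });
  latencyHistogram.record(durationMs, { route });
  tokenCounter.add(promptTokens + completionTokens, { route });
}

Note that these instruments only export data once a metric reader is registered with the OpenTelemetry SDK; the trace-focused setup shown later in this post would need that small addition.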

 

Concurrency 

  • In Traces: Concurrency issues can show up as overlapping spans, delayed model responses, or backend queuing delays.

  • In Metrics: Look for increased p99 latency, rising error rates, or CPU/GPU spikes when traffic increases.

  • Why It Matters: As LLM apps scale, tracking how multiple simultaneous users affect performance, quality, and stability becomes critical.

 


 

Let's Build!

 

We're creating a document analysis web application that:

  • Allows users to upload PDF documents

  • Extracts text from these documents

  • Uses a locally running LLM to analyze the content

  • Provides insights and summaries about the document

  • Enables chat-based interaction with the document content

  • Includes comprehensive monitoring and observability

 

The Technology Stack

 

Local AI Development Workflow (with MCP + Docker + Observability)

 

[ User Frontend / API ]
           |
           v
┌────────────────────┐
│     MCP Server     │ ◄── Observability: Traces, Logs, Metrics
└────────────────────┘
           |
           v
┌────────────────────────────┐
│ Local Docker Model Runner  │ ◄── LLM (e.g., LLaMA, Mistral)
└────────────────────────────┘
           |
           v
┌────────────────────────────┐
│    Observability Tools     │ ◄── Jaeger, Prometheus, Grafana
└────────────────────────────┘

Key components:

  • Frontend/API: Triggers an analysis request.

  • MCP Server: Extracts, formats, and sends data to the model; acts as a smart controller.

  • Docker Model Runner: Hosts the LLM locally and responds to prompts.

  • Observability Layer: Collects performance data, traces, and error logs across all steps.

 

Setting Up the Environment

 

1. Docker Model Runner

 

Docker Desktop now includes Model Runner, which allows you to run AI models locally without depending on external API services.

 

# Enable Docker Model Runner
$ docker desktop enable model-runner

# Pull the Llama 3 model
$ docker model pull ai/llama3.2:1B-Q8_0

# Verify the model is available
$ docker model list
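
If you also want to confirm that the OpenAI-compatible API is reachable from code, a quick Node.js check along these lines should work. This is a hypothetical smoke test: the base URL assumes you are calling from inside a container (or have set DMR_API_ENDPOINT, as in the compose file later in this post); adjust it for your setup.

// Hypothetical smoke test for the Model Runner OpenAI-compatible API.
// From inside a container, Model Runner is reachable via
// model-runner.docker.internal; adjust the base URL for your environment.
const BASE_URL =
  process.env.DMR_API_ENDPOINT || 'http://model-runner.docker.internal/engines/v1';

async function listLocalModels() {
  const res = await fetch(`${BASE_URL}/models`); // Node 18+ global fetch
  if (!res.ok) {
    throw new Error(`Model Runner responded with HTTP ${res.status}`);
  }
  console.log('Models available to Model Runner:', await res.json());
}

listLocalModels().catch((err) => console.error('Smoke test failed:', err.message));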

2. Project Structure

 

Project GitHub skeleton:

https://github.com/kubetoolsca/docker-model-runner-observability

├── backend/                # Express backend
│   ├── routes/             # API routes
│   │   └── document.js     # Document processing endpoints
│   ├── observability.js    # OpenTelemetry setup
│   ├── server.js           # Express server setup
│   └── Dockerfile          # Backend container config
├── src/                    # React frontend
│   ├── components/         # UI components
│   └── App.tsx             # Main application component
├── observability/          # Observability configuration
│   ├── otel-collector-config.yaml  # OpenTelemetry Collector config
│   ├── prometheus.yml      # Prometheus config
│   └── grafana/            # Grafana dashboards
└── docker-compose.yml      # Multi-container orchestration


 

Backend

 

The backend service handles document uploads, text extraction, and communication with the local LLM via Docker Model Runner.

 

Document Routes

 

The document.js routes file handles two main operations:

  • /analyze – Upload and analyze a document

  • /chat – Chat with a document that has already been analyzed

The document analysis flow works like this (a simplified route sketch follows the list):

  • Receive the uploaded PDF file

  • Store it temporarily

  • Extract text using pdf-parse

  • Send extracted text to the local LLM for analysis

  • Return results to the user
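
Here is a minimal sketch of what that flow could look like as an Express route. It is a simplified illustration (multer for uploads, pdf-parse for extraction, and an assumed analyzeWithModel helper), not the exact code from the repository.

// Simplified /analyze route: upload -> extract text -> send to local LLM.
// analyzeWithModel is an assumed helper that wraps the Model Runner call
// shown in the next section.
const express = require('express');
const multer = require('multer');
const fs = require('fs/promises');
const pdfParse = require('pdf-parse');

const router = express.Router();
const upload = multer({ dest: 'uploads/' }); // temporary storage

router.post('/analyze', upload.single('document'), async (req, res) => {
  try {
    // 1. Read the temporarily stored PDF
    const buffer = await fs.readFile(req.file.path);

    // 2. Extract text with pdf-parse
    const { text } = await pdfParse(buffer);

    // 3. Send the extracted text to the local LLM
    const analysis = await analyzeWithModel(text);

    // 4. Return results to the user
    res.json({ extractedText: text, analysis });
  } catch (err) {
    res.status(500).json({ error: err.message });
  } finally {
    // Clean up the temporary file
    if (req.file) await fs.unlink(req.file.path).catch(() => {});
  }
});

module.exports = router;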


 

Docker Model Runner Integration

The challenging part of this implementation was connecting to Docker Model Runner correctly.

Model Runner exposes an OpenAI-compatible API, which means we need to format our requests accordingly:

// Example of calling the Model Runner API
// (axios import shown for completeness; baseUrl, targetModel, and
// extractedText are assumed to be defined earlier in the route handler)
const axios = require('axios');

const response = await axios.post(
  `${baseUrl}/chat/completions`,
  {
    model: targetModel,
    messages: [
      {
        role: "system",
        content: "You are a helpful document analysis assistant."
      },
      {
        role: "user",
        content: `Analyze this document: ${extractedText}`
      }
    ],
    temperature: 0.7,
    max_tokens: 1024
  },
  {
    headers: { "Content-Type": "application/json" }
  }
);

 

The key to making this work is:

  • Using the correct hostname: model-runner.docker.internal

  • Using the correct endpoint path: /engines/v1/chat/completions

  • Formatting the request body to match the OpenAI chat completions API

  • Properly handling the response structure (see the sketch below)
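
On the last point, the body follows the OpenAI chat completions shape, so reading it looks roughly like this (a trimmed sketch, continuing from the axios call above):

// Reading the OpenAI-compatible response returned by Model Runner.
// Field names follow the OpenAI chat completions API.
const completion = response.data;

// The generated analysis lives in choices[0].message.content
const analysisText = completion?.choices?.[0]?.message?.content ?? '';

// Token usage (when reported) feeds the cost and load metrics discussed earlier
const { prompt_tokens, completion_tokens } = completion?.usage ?? {};
console.log('tokens used:', prompt_tokens, completion_tokens);

if (!analysisText) {
  throw new Error('Model Runner returned an empty response');
}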

 

Additionally, we implemented multiple fallback mechanisms (sketched after the list below) to ensure the application stays responsive even if the Model Runner service is unavailable:

  • Multiple endpoint URLs to try (for different Docker configurations)

  • Graceful error handling with useful feedback

  • Extraction-only mode when AI services are unavailable
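
A simplified sketch of that fallback idea is shown below; the candidate URLs and helper name are illustrative, and the environment variables match the docker-compose.yml shown later in this post.

// Simplified endpoint-fallback sketch: try candidate base URLs in order,
// and degrade to extraction-only mode if none of them respond.
const axios = require('axios');

const CANDIDATE_BASE_URLS = [
  process.env.DMR_API_ENDPOINT,                     // set in docker-compose.yml
  'http://model-runner.docker.internal/engines/v1', // Docker Desktop internal hostname
].filter(Boolean);

async function analyzeWithFallback(prompt, extractedText) {
  for (const baseUrl of CANDIDATE_BASE_URLS) {
    try {
      const response = await axios.post(
        `${baseUrl}/chat/completions`,
        {
          model: process.env.TARGET_MODEL || 'ai/llama3.2:1B-Q8_0',
          messages: [{ role: 'user', content: prompt }],
        },
        { timeout: 30000 }
      );
      return { mode: 'analysis', result: response.data.choices[0].message.content };
    } catch (err) {
      console.warn(`Model Runner not reachable at ${baseUrl}: ${err.message}`);
    }
  }

  // Graceful degradation: return only the extracted text
  return { mode: 'extraction-only', result: extractedText };
}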
 

Setting Up the Observability Stack

 

A key aspect of our sample application is the observability stack, which helps monitor the system’s performance and identify issues.

 

OpenTelemetry Configuration

 

The observability.js file sets up OpenTelemetry in the Node.js backend:

// Imports shown for context; package names follow the standard OpenTelemetry Node.js SDK
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

function setupObservability(serviceName = 'document-analysis-service') {
  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
  });

  // Configure the OTLP trace exporter only when an endpoint is provided
  const traceExporter = process.env.OTEL_EXPORTER_OTLP_ENDPOINT
    ? new OTLPTraceExporter({
        url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces`,
      })
    : undefined;

  const sdk = new NodeSDK({
    resource,
    traceExporter,
    instrumentations: [
      getNodeAutoInstrumentations({
        '@opentelemetry/instrumentation-http': { enabled: true },
        '@opentelemetry/instrumentation-express': { enabled: true },
      }),
    ],
  });

  sdk.start();

  // Setup process exit handlers (flush and shut down the SDK on exit)
}
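
One detail worth noting: auto-instrumentation can only patch modules that are loaded after the SDK starts, so the entry point should initialize observability before requiring Express and the routes. A sketch, assuming observability.js exports setupObservability:

// index.js (sketch): start tracing before loading Express/routes so the
// http and express auto-instrumentations can patch those modules.
const { setupObservability } = require('./observability'); // assumed export
setupObservability('document-analysis-service');

// Only now load the rest of the application
require('./server');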


 

OpenTelemetry Collector

 

The OpenTelemetry Collector acts as a central aggregation point for our observability data. Our configuration in otel-collector-config.yaml defines:

  • Receivers: How data enters the collector (OTLP over HTTP)

  • Processors: How data is processed (batching)

  • Exporters: Where data is sent (Prometheus and Jaeger)

  • Pipelines: How data flows through the collector

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: document_analysis
  logging:
    verbosity: detailed
  otlp:
    endpoint: jaeger:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, prometheus]


 

Visualization with Grafana

Grafana provides dashboards to visualize the metrics and traces.

The pre-configured dashboard includes:

  • Document analysis request metrics

  • Response time statistics

This gives visibility into:

  • How many documents are being processed

  • How long analysis takes

  • Success and failure rates
 

Containerizing the Application

The docker-compose.yml file orchestrates all services:

services:
  # Frontend Application
  frontend:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"

  # MCP Server (Backend)
  mcp-server:
    build: ./backend
    ports:
      - "3000:3000"
    environment:
      - DMR_API_ENDPOINT=http://model-runner.docker.internal/engines/v1
      - TARGET_MODEL=ai/llama3.2:1B-Q8_0
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
      - ./observability/otel-collector-config.yaml:/etc/otel-collector-config.yaml

  # Jaeger for trace visualization
  jaeger:
    image: jaegertracing/all-in-one

  # Prometheus for metrics storage
  prometheus:
    image: prom/prometheus

  # Grafana for dashboards
  grafana:
    image: grafana/grafana
 

The Frontend Components

 

 Key components:

  • DocumentUploader: Handles file uploads with drag-and-drop

  • DocumentAnalysisResult: Displays analysis results

  • ChatInterface: Allows users to chat about the document

 

Challenges and Solutions

 

1. Docker Model Runner Connectivity

 

Challenge: The backend container couldn't connect to the Docker Model Runner service.

Solution: Use the hostname that Docker Desktop exposes to containers:

  • model-runner.docker.internal

2. Error Handling and Fallbacks

 

Challenge: AI services can be unreliable or unavailable.

Solution: Implement graceful degradation:

  • Always provide basic text extraction even if analysis fails
  • Clear error messages for debugging
  • Multiple endpoint fallbacks

3. Observability Integration

 

Challenge: Tracking performance across multiple services.

Solution: OpenTelemetry integration with:

  • Automatic instrumentation for HTTP and Express

  • Custom span creation for important operations

  • Centralized collection and visualization

 

Performance Observations

 

With this implementation, you can observe, for example:

  • Processing Speed: PDFs under 10MB typically process in 1-3 seconds for text extraction

  • AI Analysis Time: The local Llama 3 model analysis takes 3-8 seconds depending on document length

  • Memory: The backend uses approximately 150-250MB RAM

  • Response Size: Analysis results average 1-5KB of text


 

Conclusion

 

This implementation demonstrates how an AI application can be built and tested using locally run models via Docker Model Runner and MCP. The addition of OpenTelemetry, Jaeger, Prometheus, and Grafana provides comprehensive visibility into the application's performance and insight into the behavior of the local model.

 

By running LLMs locally with Docker Model Runner, we get:

  • Privacy: Document data never leaves your infrastructure

  • Cost efficiency: No per-token API charges

  • Reliability: No dependency on external API availability

  • Control: Choose appropriate models for your use case

 

We also gain valuable insights into:

  • Performance bottlenecks: Identify slow components

  • Error patterns: Detect recurring issues

  • Resource utilization: Optimize container resources

  • User behavior: Understand how the application is used


 

Next Steps

To extend this project, consider:

  • Supporting more file formats (DOC, DOCX, TXT)
  • Fine-tuning the LLM for specific document types
  • Load-testing with a tool like Locust or k6
  • The sky is the limit 🙂
