Building a Full RAG Stack on Docker with Ollama, Open WebUI, Qdrant, and VectorAdmin

TL;DR (Executive Summary)

This project implements a local, privacy-focused Retrieval-Augmented Generation (RAG) stack using Docker, combining Ollama (TinyLlama and nomic-embed-text), Open WebUI, Qdrant, and VectorAdmin. The system delivers end-to-end capabilities for LLM chat, document-aware question answering, vector storage, and visual management of embeddings, while also exposing a Python/LangChain integration and a custom document loader that pushes content directly into Qdrant for flexible and repeatable workflows.

Project Goals and Learning Objectives

The primary goals of this project were:

  • To design and deploy a fully local RAG system without external cloud dependencies
  • To understand how LLMs, embeddings, vector databases, and UI layers interact in a production-style pipeline
  • To compare UI-based ingestion with programmatic ingestion using Python and LangChain
  • To explore short-term and long-term memory concepts using Redis
  • To document the system in a way that supports reproducibility and future extension

1. Introduction

This project documents the design and implementation of a local, private Large Language Model (LLM) environment using Docker.
The core objectives are:

  • Run all components locally in a containerized setup.
  • Use Ollama to serve the tinyllama model for chat.
  • Use Open WebUI as the user-facing chat interface.
  • Integrate Qdrant as the vector database for Retrieval-Augmented Generation (RAG).
  • Add Redis to enable conversation memory.
  • Connect the stack to Python (via LangChain) and provide a custom document loader for direct ingestion into Qdrant.

2. Prerequisites and Project Setup

To support this architecture, the following tools and components are required:

  • Docker Desktop (latest release)
  • Visual Studio Code (VS Code) as the primary code editor
  • A terminal for running commands (e.g., zsh on macOS)
  • Homebrew & pyenv for Python version management
  • Python 3.10 for compatibility with LangChain and supporting libraries

2.1 Architecture Overview

Below is the system architecture of the complete RAG Stack, showing how all components interact:

Figure 1 — Full RAG Stack Architecture (Ollama, OpenWebUI, Qdrant, Redis, PostgreSQL, VectorAdmin, LangChain Scripts)

3. Docker Infrastructure

The core of this setup relies on docker-compose.yml to manage the Ollama LLM server and the Open WebUI frontend.

3.1 Design Rationale

  1. Ollama Service: Runs the Ollama server, exposing its API on port 11434. It uses a named volume (ollama) to persistently store downloaded models.
  2. Open WebUI Service: Runs the web interface, accessible on localhost:3000. It connects to Ollama using Docker’s internal network (http://ollama:11434). It also uses a named volume (openwebui) to save user accounts, chat history, and settings.
  3. Service Dependency: The openwebui service includes a depends_on block. This ensures that the ollama container is started before the openwebui container attempts to start.

The Configuration: The docker-compose.yml file is used to deploy the stack. Note the use of depends_on to ensure each dependency, such as the database, is started before the service that relies on it.
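A sketch of the core of that compose file (image tags, ports, and volume paths are typical defaults rather than the exact original; the Qdrant, Redis, PostgreSQL, and VectorAdmin services follow the same pattern and are trimmed here for brevity):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama        # persists downloaded models

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434   # Docker-internal network name
    volumes:
      - openwebui:/app/backend/data # persists accounts, chats, settings
    depends_on:
      - ollama                      # start Ollama first

  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant:/qdrant/storage

volumes:
  ollama:
  openwebui:
  qdrant:
```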

3.2 Starting the Stack

The full stack is started with docker compose up -d.

After initialization, service status is verified using docker compose ps. All containers should appear as healthy before proceeding with further configuration.
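As shell commands (standard Docker Compose usage):

```shell
docker compose up -d    # start all services in the background
docker compose ps       # list services and verify their status
```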

4. Loading the Models

Once the stack is running, the LLM and embedding models are loaded into Ollama.

The stack uses:

  • Chat model: tinyllama
  • Embedding model: nomic-embed-text

These models serve distinct roles:

  • tinyllama handles text generation and conversation.
  • nomic-embed-text produces vector embeddings used by Qdrant and Open WebUI for RAG.

Models are pulled by running ollama pull for each model inside the Ollama container.

Model installation can be verified with ollama list.
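Concretely, from the host:

```shell
docker compose exec ollama ollama pull tinyllama
docker compose exec ollama ollama pull nomic-embed-text
docker compose exec ollama ollama list    # both models should appear
```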

5. RAG Ingestion Approaches Used in This Project

This project demonstrates two complementary RAG ingestion approaches:

1. UI-Based Ingestion (Open WebUI)
Used for interactive document uploads and rapid experimentation through the browser interface.

2. Script-Based Ingestion (Python + LangChain)
Used for repeatable, automated ingestion workflows where documents are processed, chunked, embedded, and inserted directly into Qdrant.

This dual approach mirrors real-world systems, where both UI-driven exploration and automated pipelines coexist.

5.1 RAG Setup via Open WebUI

With the models loaded, RAG capabilities are configured through Open WebUI (http://localhost:3000).

  • Access the UI: Navigate to http://localhost:3000 and create an admin account.
  • Configure the embedding model: Go to Settings (by clicking your name in the bottom-left), then Knowledge Base, and set the Default Embedding Model to nomic-embed-text:latest. This step is critical to ensure documents are processed correctly.
  • Upload documents: Go to Workspace, then Knowledge, add New Knowledge, and upload files (e.g., “LLMS and RAGS.pdf”).
  • Test RAG: Start a new chat, select tinyllama:latest as the chat model, click the # button to select the “LLMS and RAGS.pdf” collection, and ask a question about it. The UI successfully retrieved a source and answered based on the document.

6. Setting Up the Viewer Tools

To validate that embeddings are stored correctly and to inspect collections, two viewer tools are used:

  • The native Qdrant dashboard
  • VectorAdmin for enhanced visual management

6.1 Qdrant Dashboard (Primary Viewer)

The Qdrant container itself includes a simple dashboard.

  • URL: http://localhost:6333/dashboard
  • Results: After uploading documents via OpenWebUI, refreshing this page shows the new collection (“Machine Learning”) listed with 81 points, confirming the data was saved.

6.2 VectorAdmin (Advanced Viewer)

To verify that the data was actually being stored, VectorAdmin (http://localhost:3001) was set up as a management dashboard.

  • Navigated to http://localhost:3001 and completed the on-screen setup (creating an admin user and organization).
  • When prompted to connect a vector database, selected Qdrant and entered the URL http://qdrant:6333.
  • After setup, navigated to Settings > Data Sources, selected the Qdrant connection, and clicked “Sync”.
  • This successfully pulled the collection list (“Machine Learning”, etc.) from Qdrant into the VectorAdmin UI, which now shows the synced “Machine Learning” workspace.

6.3 Redis Integration for Caching and Future Memory

To extend the capabilities of OpenWebUI, Redis was added to the Docker stack. Redis functions as:

  • A caching backend for OpenWebUI
  • A conversation-memory store
  • Storage for:
    • recent_messages (short-term chat memory using LPUSH)
    • summaries (long-term memory using RPUSH)

6.3.1 Installing the Python Redis Client

To run the memory demo script:

  • pip install redis

This installs the official Redis Python client, allowing Python to connect to the Redis container and store conversation memory.

6.3.2 Adding Redis to the Docker Stack

A new Redis service was added to docker-compose.yml, and OpenWebUI was configured to point at it through an environment variable in its service definition.

This enables OpenWebUI to store internal runtime data in Redis automatically.
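A sketch of the additions (the REDIS_URL variable name reflects OpenWebUI's documented Redis configuration, but verify it against the version in use):

```yaml
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"

  # added under the existing openwebui service definition:
  openwebui:
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
```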

6.3.3 Verifying Redis is Running

Check Redis status from inside the container:
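For example, assuming the service is named redis in docker-compose.yml:

```shell
docker compose exec redis redis-cli ping
```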

Expected output:

  • PONG

This confirms Redis is up and responding.

6.3.4 Exploring Redis Keys

Inside the Redis CLI:
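For example, listing every key in one shot:

```shell
docker compose exec redis redis-cli KEYS '*'
```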

Typical keys you’ll see:

  • tool_servers → used internally by OpenWebUI
  • user:123:recent_messages → short-term memory
  • user:123:summaries → long-term memory

6.3.5 Inspecting Key Contents

Short-term memory (the most recent 20 messages) and long-term summaries can both be read back with LRANGE; each list should return structured, timestamped entries.
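Using the example user ID from the key listing above:

```shell
docker compose exec redis redis-cli LRANGE user:123:recent_messages 0 19
docker compose exec redis redis-cli LRANGE user:123:summaries 0 -1
```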

6.4 Python Memory Prototype (Short-term + Long-term)

A Python demo script (redis_memory_demo.py) was created to simulate AI memory:

  • Short-term memory → LPUSH
  • Long-term memory → RPUSH

Short-term: LPUSH inserts the newest message at index 0, and the list is trimmed to the most recent entries.

Long-term: RPUSH appends to the end of the list, preserving chronological order.
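A sketch of the script's core logic (key names match the earlier sections; `r` is any Redis-compatible client, e.g. redis.Redis(host="localhost", port=6379, decode_responses=True)):

```python
# Sketch of redis_memory_demo.py's core logic.
import json
import time

SHORT_TERM_LIMIT = 20  # keep only the most recent 20 messages

def add_recent_message(r, user_id, role, content):
    """Short-term memory: LPUSH puts the newest message at index 0."""
    key = f"user:{user_id}:recent_messages"
    r.lpush(key, json.dumps({"ts": time.time(), "role": role, "content": content}))
    r.ltrim(key, 0, SHORT_TERM_LIMIT - 1)  # drop anything older than the last 20

def add_summary(r, user_id, summary):
    """Long-term memory: RPUSH appends, preserving chronological order."""
    key = f"user:{user_id}:summaries"
    r.rpush(key, json.dumps({"ts": time.time(), "summary": summary}))
```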

6.4.1 Running the Script

Running the script with python redis_memory_demo.py prints:

  • Recent messages
  • Long-term summaries

6.4.2 Redis Confirmation

Verify the script saved the data correctly by reading both lists back from Redis with LRANGE; the output should show the newly stored messages and summaries.

7. OpenWebUI Redis Memory Filter (Core Feature)

A custom Redis memory filter was implemented inside OpenWebUI to manage memory ingestion and recall.

The filter was added through Open WebUI's admin Functions panel, where custom filter code can be registered and enabled.

The filter performs:

  • Fact extraction
  • Recent message tracking
  • Long-term summarization
  • Context injection on every request

Once enabled, the filter runs automatically for all chat sessions.
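A hedged sketch of the filter's core logic: the inlet/outlet method names follow Open WebUI's filter convention, the key names match the earlier sections, and the injected client is anything Redis-compatible (e.g. redis.Redis(host="redis", decode_responses=True)). The real filter also performs fact extraction and summarization, which are omitted here.

```python
import json

SHORT_TERM_LIMIT = 20

class RedisMemoryFilter:
    def __init__(self, client, user_id="123"):
        self.r = client
        self.key = f"user:{user_id}"

    def inlet(self, body):
        """Inject stored memory as a system message before the model sees the prompt."""
        summaries = self.r.lrange(f"{self.key}:summaries", 0, -1)
        recent = self.r.lrange(f"{self.key}:recent_messages", 0, 4)
        context = "Known context:\n" + "\n".join(summaries + recent)
        body.setdefault("messages", []).insert(
            0, {"role": "system", "content": context})
        return body

    def outlet(self, body):
        """Record the latest user/assistant exchange after the model responds."""
        for msg in body.get("messages", [])[-2:]:
            self.r.lpush(f"{self.key}:recent_messages", json.dumps(msg))
        self.r.ltrim(f"{self.key}:recent_messages", 0, SHORT_TERM_LIMIT - 1)
        return body
```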

7.1 Verifying Memory Inside OpenWebUI

After restarting OpenWebUI:

A test conversation was conducted.

Test Input:

Follow-up Question:

The assistant correctly recalled:

  • The user’s name
  • The project context
  • Without re-entering the information

7.2 Verifying Stored Memory in Redis

To confirm memory persistence beyond the UI, Redis was queried directly.

The output confirmed:

  • Facts stored correctly
  • Recent messages appended
  • Long-term summaries preserved

This verifies that OpenWebUI memory is backed by Redis and persists independently of the UI.

8. Connecting via Python (LangChain Integration)

The stack also exposes programmatic access via Python and LangChain, enabling integration with custom applications and workflows.

8.1 Upgrading Python with pyenv:

To ensure compatibility with LangChain and associated libraries:

  • Installed pyenv using Homebrew (brew install pyenv).
  • Configured pyenv in ~/.zshrc and restarted the terminal.
  • Installed Python 3.10.13 (pyenv install 3.10.13).
  • Set Python 3.10.13 as the local version (pyenv local 3.10.13).

8.2 Setting Up the Virtual Environment

A dedicated virtual environment isolates dependencies for the project:
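For example, using the standard venv workflow:

```shell
python3 -m venv venv         # create the environment (uses the pyenv-selected Python)
. venv/bin/activate          # activate it; the prompt should change to (venv)
```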

8.3 Installing LangChain Ollama Integration

Additional adapters or integrations required for Ollama can be installed using pip, enabling LangChain to:

  • Create LLM instances bound to the Ollama API
  • Send prompts through standardized interfaces like .invoke()
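For example (package names per current LangChain packaging; older tutorials use langchain-community instead):

```shell
pip install langchain langchain-ollama
```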

8.4 Test Script:

A Python script named simple_langchain_ollama_original.py provides a smoke test for LangChain and Ollama connectivity. Key responsibilities of the script:

  • Initialize an LLM via LangChain using the Ollama backend.
  • Submit a sample prompt using .invoke().
  • Print the response to validate end-to-end interaction.
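A minimal sketch of what such a script might look like, assuming the langchain-ollama integration package (older releases expose the same model as langchain_community.llms.Ollama):

```python
# Hedged sketch of simple_langchain_ollama_original.py: connect LangChain
# to the local Ollama API and run one prompt through tinyllama.
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="tinyllama", base_url="http://localhost:11434")
response = llm.invoke("In one sentence, what is Retrieval-Augmented Generation?")
print(response)
```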

8.5 Executing the Script

The script is executed with python simple_langchain_ollama_original.py from within the activated virtual environment.

Expected outcome:

  • Successful connection to Ollama.
  • A textual response generated by the tinyllama model, confirming that LangChain is correctly wired to the stack.

9. Creating a Python Document Loader (Manual RAG)

To complement the UI-based ingestion, a custom Python document loader is used to directly push vectors into Qdrant.

9.1 Installing Additional Libraries

Within the existing virtual environment, text and document processing libraries are installed:

These support reading PDFs and DOCX files and preparing text chunks for embedding.
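For example (an assumed but typical set for this task):

```shell
pip install pypdf python-docx langchain-text-splitters qdrant-client
```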

9.2 The Loader Script

A script named load_all_files.py automates the process of:

  • Reading all supported files (PDF, DOCX, TXT, etc.) from a designated folder.
  • Splitting documents into manageable text chunks.
  • Generating embeddings for each chunk.
  • Upserting the embeddings into a Qdrant collection named “Machine Learning”.
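A sketch of that pipeline. The chunk size/overlap are illustrative, the Ollama embeddings endpoint shape follows its REST API, and the Qdrant calls require the qdrant-client package (the real script also reads PDF/DOCX files, omitted here):

```python
# Sketch of load_all_files.py's core: chunk -> embed via nomic-embed-text
# -> upsert into the "Machine Learning" collection in Qdrant.
import json
import urllib.request

def chunk_text(text, size=500, overlap=50):
    """Split text into fixed-size chunks with a small overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(text, url="http://localhost:11434/api/embeddings"):
    """Get an embedding vector for one chunk from Ollama."""
    payload = json.dumps({"model": "nomic-embed-text", "prompt": text}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def upsert_chunks(chunks, collection="Machine Learning"):
    """Upsert one point per chunk (assumes the collection already exists)."""
    from qdrant_client import QdrantClient
    from qdrant_client.models import PointStruct
    client = QdrantClient(url="http://localhost:6333")
    client.upsert(collection_name=collection,
                  points=[PointStruct(id=i, vector=embed(c), payload={"text": c})
                          for i, c in enumerate(chunks)])
```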

9.3 Execution & Verification

The loader is executed using python load_all_files.py.

During execution, the script:

  • Processes all documents found in the target directory.
  • Produces 81 text chunks in this example.
  • Inserts 81 vectors into the “Machine Learning” collection in Qdrant.

10. Health Checks and Useful Commands

  • Check Docker Containers: docker compose ps
  • Check Ollama API Version: curl -s http://localhost:11434/api/version && echo
  • List Models Inside Ollama: docker compose exec ollama ollama list
  • View Container Logs: docker compose logs -f ollama (or qdrant, openwebui)
  • Stop Stack (Keep Data): docker compose down
  • Stop Stack (Delete ALL Data): docker compose down -v

11. Operational Validation

The system was validated through multiple operational checks:

  • Container health verification using docker compose ps
  • Model availability checks via Ollama CLI
  • Successful RAG responses retrieved from indexed documents
  • Verification of vectors inside Qdrant collections
  • Redis connectivity confirmed via CLI commands

These checks ensured that each layer of the stack was operational before proceeding.

12. Project Structure
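A plausible layout, reconstructed from the files referenced in this write-up (the project and documents folder names are assumptions):

```
rag-stack/
├── docker-compose.yml
├── documents/                          # source files for ingestion
├── redis_memory_demo.py
├── simple_langchain_ollama_original.py
└── load_all_files.py
```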

13. Conclusion

The implemented architecture delivers a complete local RAG stack built around Docker, Ollama, Open WebUI, Qdrant, and VectorAdmin. The solution provides:

  • A containerized LLM backend with persistent model storage.
  • A user-friendly UI for chat and RAG-based question answering.
  • A robust vector database with both native and advanced visualization interfaces.
  • A Python/LangChain integration path for programmatic interaction.
  • Redis-backed caching and conversation memory that persists across sessions.
  • A custom document loader that enables repeatable, script-driven ingestion of documents into Qdrant.
