
TL;DR (Executive Summary)
This project implements a local, privacy-focused Retrieval-Augmented Generation (RAG) stack using Docker, combining Ollama (TinyLlama and nomic-embed-text), Open WebUI, Qdrant, Redis, and VectorAdmin. The system delivers end-to-end capabilities for LLM chat, document-aware question answering, vector storage, conversation memory, and visual management of embeddings, while also exposing a Python/LangChain integration and a custom document loader that pushes content directly into Qdrant for flexible and repeatable workflows.
Project Goals and Learning Objectives
The primary goals of this project were:
- To design and deploy a fully local RAG system without external cloud dependencies
- To understand how LLMs, embeddings, vector databases, and UI layers interact in a production-style pipeline
- To compare UI-based ingestion with programmatic ingestion using Python and LangChain
- To explore short-term and long-term memory concepts using Redis
- To document the system in a way that supports reproducibility and future extension
1. Introduction
This project documents the design and implementation of a local, private Large Language Model (LLM) environment using Docker.
The core objectives are:
- Run all components locally in a containerized setup.
- Use Ollama to serve the tinyllama model for chat.
- Use Open WebUI as the user-facing chat interface.
- Integrate Qdrant as the vector database for Retrieval-Augmented Generation (RAG).
- Add Redis to enable conversation memory.
- Connect the stack to Python (via LangChain) and provide a custom document loader for direct ingestion into Qdrant.
2. Prerequisites and Project Setup
To support this architecture, the following tools and components are required:
- Docker Desktop (latest release)
- Visual Studio Code (VS Code) as the primary code editor
- A terminal for running commands (e.g., zsh on macOS)
- Homebrew & pyenv for Python version management
- Python 3.10 for compatibility with LangChain and supporting libraries
2.1 Architecture Overview
Below is the system architecture of the complete RAG Stack, showing how all components interact:

Figure 1 — Full RAG Stack Architecture (Ollama, OpenWebUI, Qdrant, Redis, PostgreSQL, VectorAdmin, LangChain Scripts)
3. Docker Infrastructure
The core of this setup relies on docker-compose.yml to manage the Ollama LLM server and the Open WebUI frontend.
3.1 Design Rationale
- Ollama Service: Runs the Ollama server, exposing its API on port 11434. It uses a named volume (ollama) to persistently store downloaded models.
- Open WebUI Service: Runs the web interface, accessible on localhost:3000. It connects to Ollama using Docker’s internal network (http://ollama:11434). It also uses a named volume (openwebui) to save user accounts, chat history, and settings.
- Service Dependency: The openwebui service includes a depends_on block. This ensures that the ollama container is started before the openwebui container attempts to start.
The configuration: The docker-compose.yml file below is used to deploy the stack. Note the use of depends_on to control start order (e.g., the database starts before the admin panel).

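The full file is not reproduced here; the snippet below is a minimal sketch of the core services described above (image tags and volume mount paths are assumptions; the Redis, PostgreSQL, and VectorAdmin services discussed later are omitted for brevity):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama            # persists downloaded models

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434   # Docker-internal network
    volumes:
      - openwebui:/app/backend/data     # accounts, chat history, settings
    depends_on:
      - ollama

  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant:/qdrant/storage

volumes:
  ollama:
  openwebui:
  qdrant:
```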
3.2 Starting the Stack
The full stack is started with:

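Run from the directory containing docker-compose.yml:

```bash
docker compose up -d
```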
After initialization:
- Service status is verified using:

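This is the same health check listed in section 11:

```bash
docker compose ps
```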
- All containers should appear as healthy before proceeding with further configuration.

4. Loading the Models
Once the stack is running, the LLM and embedding models are loaded into Ollama.
The stack uses:
- Chat model: tinyllama
- Embedding model: nomic-embed-text
These models serve distinct roles: tinyllama handles text generation and conversation, while nomic-embed-text produces vector embeddings used by Qdrant and Open WebUI for RAG.
Models are pulled using:

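The exact commands are not shown in the original; the standard way to pull both models into the running Ollama container is:

```bash
docker compose exec ollama ollama pull tinyllama
docker compose exec ollama ollama pull nomic-embed-text
```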
Model installation can be verified with:

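Using the same command listed later in section 11:

```bash
docker compose exec ollama ollama list
```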
5. RAG Ingestion Approaches Used in This Project
This project demonstrates two complementary RAG ingestion approaches:
1. UI-Based Ingestion (Open WebUI)
Used for interactive document uploads and rapid experimentation through the browser interface.
2. Script-Based Ingestion (Python + LangChain)
Used for repeatable, automated ingestion workflows where documents are processed, chunked, embedded, and inserted directly into Qdrant.
This dual approach mirrors real-world systems, where both UI-driven exploration and automated pipelines coexist.
6. RAG Setup via Open WebUI
With the models loaded, RAG capabilities are configured through Open WebUI (http://localhost:3000).
- Access the UI: Navigate to http://localhost:3000 and create an admin account.
- Configure the embedding model: Go to Settings (via the account name in the bottom-left), then Knowledge Base, and set the Default Embedding Model to nomic-embed-text:latest. This step is critical to ensure documents are processed correctly.
- Upload documents: Go to Workspace, then Knowledge, add a New Knowledge collection, and upload files (e.g., "LLMS and RAGS.pdf").
- Test RAG: Start a new chat, select tinyllama:latest as the chat model, click the # button to select the "LLMS and RAGS.pdf" collection, and ask a question about it. The UI successfully retrieved a source and answered based on the document.

7. Setting Up the Viewer Tools
To validate that embeddings are stored correctly and to inspect collections, two viewer tools are used:
- The native Qdrant dashboard
- VectorAdmin for enhanced visual management
7.1 Qdrant Dashboard (Primary Viewer)
The Qdrant container itself includes a simple dashboard.
- URL: http://localhost:6333/dashboard
- Results: After uploading documents via Open WebUI, refreshing this page shows the new "Machine Learning" collection, confirming the data was saved (81 points).

7.2 VectorAdmin (Advanced Viewer)
To verify that the data was actually being stored, VectorAdmin (http://localhost:3001) was set up as a management dashboard:
- Navigate to http://localhost:3001 and complete the on-screen setup (create an admin user and organization).
- When prompted to connect a vector database, select Qdrant and enter the URL http://qdrant:6333.
- After setup, go to Settings > Data Sources, select the Qdrant connection, and click "Sync".
- This pulls the collection list ("Machine Learning", etc.) from Qdrant into the VectorAdmin UI, which then shows the synced "Machine Learning" workspace.

7.3 Redis Integration for Caching and Future Memory
To extend the capabilities of OpenWebUI, Redis was added to the Docker stack. Redis functions as:
- A caching backend for OpenWebUI
- A conversation-memory store
- Storage for:
  - recent_messages (short-term chat memory using LPUSH)
  - summaries (long-term memory using RPUSH)
7.3.1 Installing the Python Redis Client
To run the memory demo script:
- pip install redis
This installs the official Redis Python client, allowing Python to connect to the Redis container and store conversation memory.
7.3.2 Adding Redis to the Docker Stack
A new Redis service was added to docker-compose.yml:

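A minimal sketch of that service (the image tag and volume name are assumptions):

```yaml
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis:/data        # persists keys across restarts
```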
OpenWebUI was configured to use Redis via:

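Presumably via an environment variable on the openwebui service; Open WebUI reads REDIS_URL for its Redis-backed state (the exact variable used here, and the database index, are assumptions):

```yaml
  openwebui:
    environment:
      - REDIS_URL=redis://redis:6379/0
```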
This enables OpenWebUI to store internal runtime data in Redis automatically.
7.3.3 Verifying Redis is Running
Check Redis status from inside the container:

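```bash
docker compose exec redis redis-cli ping
```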
Expected output:
- PONG
This confirms Redis is up and responding.
7.3.4 Exploring Redis Keys
Inside the Redis CLI:

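Open an interactive session and list all keys:

```bash
docker compose exec redis redis-cli
KEYS *
```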
Typical keys you’ll see:
- tool_servers → used internally by OpenWebUI
- user:123:recent_messages → short-term memory
- user:123:summaries → long-term memory
7.3.5 Inspecting Key Contents
Short-term memory (most recent 20 messages):

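Using the key layout above (the 0–19 range returns the 20 newest entries, since LPUSH writes to the head of the list):

```bash
LRANGE user:123:recent_messages 0 19
```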
Long-term summaries:

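The full summary list, oldest first:

```bash
LRANGE user:123:summaries 0 -1
```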
Both lists should return structured timestamped entries.
7.4 Python Memory Prototype (Short-term + Long-term)
A Python demo script (redis_memory_demo.py) was created to simulate AI memory:
- Short-term memory → LPUSH
- Long-term memory → RPUSH
Short-term:

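A minimal sketch of the short-term write path (key name from section 7.3.4; the 20-message cap is an assumption based on the LRANGE window above):

```python
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# LPUSH puts the newest message at index 0.
entry = json.dumps({"ts": time.time(), "role": "user", "text": "Hello!"})
r.lpush("user:123:recent_messages", entry)
r.ltrim("user:123:recent_messages", 0, 19)  # keep only the 20 most recent
```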
This keeps the newest message at index 0.
Long-term:

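Continuing the sketch above, the long-term path appends to the tail so earlier summaries stay first:

```python
summary = json.dumps({"ts": time.time(), "summary": "User is building a local RAG stack."})
r.rpush("user:123:summaries", summary)
```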
This preserves chronological order.
7.4.1 Running the Script

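From the project's virtual environment:

```bash
python redis_memory_demo.py
```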
It prints:
- Recent messages
- Long-term summaries
7.4.2 Redis Confirmation
Verify the script saved the data correctly in Redis:

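Assuming the same key names as above:

```bash
docker compose exec redis redis-cli LRANGE user:123:recent_messages 0 -1
docker compose exec redis redis-cli LRANGE user:123:summaries 0 -1
```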
Both commands should list the timestamped entries written by the script.
8. OpenWebUI Redis Memory Filter (Core Feature)
A custom Redis memory filter was implemented inside OpenWebUI to manage memory ingestion and recall.
The filter was added through Open WebUI's Functions interface, where custom filter code is registered and enabled.
The filter performs:
- Fact extraction
- Recent message tracking
- Long-term summarization
- Context injection on every request

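The filter code itself is project-specific and not reproduced here; the sketch below shows the general shape of an Open WebUI filter class with Redis-backed inlet/outlet hooks (the class layout follows Open WebUI's filter convention, the key scheme reuses section 7.3.4, and fact extraction/summarization are stubbed out):

```python
# A sketch of an Open WebUI filter backed by Redis. Key names follow
# section 7.3.4; fact extraction and summarization are stubbed out.
import json

import redis


class Filter:
    def __init__(self):
        # "redis" resolves to the Redis container on the Docker network.
        self.r = redis.Redis(host="redis", port=6379, decode_responses=True)

    def inlet(self, body: dict, __user__: dict | None = None) -> dict:
        """Before the request reaches the model: inject remembered context."""
        user_id = (__user__ or {}).get("id", "123")
        summaries = self.r.lrange(f"user:{user_id}:summaries", 0, -1)
        if summaries:
            context = "Known context about this user:\n" + "\n".join(summaries)
            body.setdefault("messages", []).insert(
                0, {"role": "system", "content": context}
            )
        return body

    def outlet(self, body: dict, __user__: dict | None = None) -> dict:
        """After the model responds: record the exchange for future recall."""
        user_id = (__user__ or {}).get("id", "123")
        for msg in body.get("messages", [])[-2:]:  # last user/assistant turn
            self.r.lpush(f"user:{user_id}:recent_messages", json.dumps(msg))
        self.r.ltrim(f"user:{user_id}:recent_messages", 0, 19)
        return body
```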
Once enabled, the filter runs automatically for all chat sessions.
8.1 Verifying Memory Inside OpenWebUI
After restarting OpenWebUI:

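```bash
docker compose restart openwebui
```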
A test conversation was conducted: the user first introduced themselves and described the project, then asked a follow-up question that depended on that earlier context.
The assistant correctly recalled:
- The user's name
- The project context
without the information being re-entered.

8.2 Verifying Stored Memory in Redis
To confirm memory persistence beyond the UI, Redis was queried directly.

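For example (key names as in section 7.3.4):

```bash
docker compose exec redis redis-cli KEYS 'user:*'
docker compose exec redis redis-cli LRANGE user:123:summaries 0 -1
```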
The output confirmed:
- Facts stored correctly
- Recent messages appended
- Long-term summaries preserved
This verifies that OpenWebUI memory is backed by Redis and persists independently of the UI.

9. Connecting via Python (LangChain Integration)
The stack also exposes programmatic access via Python and LangChain, enabling integration with custom applications and workflows.
9.1 Upgrading Python with pyenv
To ensure compatibility with LangChain and associated libraries:
- Installed pyenv using Homebrew (brew install pyenv)
- Configured pyenv in ~/.zshrc and restarted the terminal.
- Installed Python 3.10.13 (pyenv install 3.10.13).
- Set Python 3.10 as the local version (pyenv local 3.10.13).
9.2 Setting Up the Virtual Environment
A dedicated virtual environment isolates dependencies for the project:

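A typical setup with the built-in venv module (the environment name is arbitrary):

```bash
python -m venv .venv
source .venv/bin/activate
```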
9.3 Installing the LangChain Ollama Integration
Additional adapters or integrations required for Ollama can be installed using pip, enabling LangChain to:
- Create LLM instances bound to the Ollama API
- Send prompts through standardized interfaces like .invoke()

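A plausible install command; langchain-ollama is the current adapter package, though the exact set installed for this project is not shown:

```bash
pip install langchain langchain-ollama
```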
9.4 Test Script
A Python script named simple_langchain_ollama_original.py provides a smoke test for LangChain and Ollama connectivity. Key responsibilities of the script:
- Initialize an LLM via LangChain using the Ollama backend.
- Submit a sample prompt using .invoke().
- Print the response to validate end-to-end interaction.

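The original script is not reproduced here; the sketch below covers the same responsibilities using the langchain-ollama adapter (the prompt text is illustrative):

```python
# simple_langchain_ollama_original.py (sketch)
from langchain_ollama import OllamaLLM

# Initialize an LLM bound to the local Ollama API.
llm = OllamaLLM(model="tinyllama", base_url="http://localhost:11434")

# Submit a sample prompt through the standardized .invoke() interface.
response = llm.invoke("In one sentence, what is Retrieval-Augmented Generation?")

# Print the response to validate end-to-end connectivity.
print(response)
```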
9.5 Executing the Script
The script is executed with:

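From the activated virtual environment:

```bash
python simple_langchain_ollama_original.py
```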
Expected outcome:
- Successful connection to Ollama.
- A textual response generated by the tinyllama model, confirming that LangChain is correctly wired to the stack.

10. Creating a Python Document Loader (Manual RAG)
To complement the UI-based ingestion, a custom Python document loader is used to directly push vectors into Qdrant.
10.1 Installing Additional Libraries
Within the existing virtual environment, text and document processing libraries are installed:

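The exact package list is not shown; a plausible set covering PDF/DOCX parsing, chunking, and the Qdrant client would be:

```bash
pip install pypdf docx2txt langchain-community langchain-text-splitters qdrant-client
```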
These support reading PDFs and DOCX files and preparing text chunks for embedding.
10.2 The Loader Script
A script named load_all_files.py automates the process of:
- Reading all supported files (PDF, DOCX, TXT, etc.) from a designated folder.
- Splitting documents into manageable text chunks.
- Generating embeddings for each chunk.
- Upserting the embeddings into a Qdrant collection named “Machine Learning”.

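The script itself is not reproduced here; the sketch below implements the four steps above (the folder name, chunk sizes, and payload fields are assumptions; the collection name and embedding model come from the document):

```python
# load_all_files.py (sketch): read files, chunk, embed, upsert into Qdrant.
from pathlib import Path

from langchain_community.document_loaders import Docx2txtLoader, PyPDFLoader, TextLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

LOADERS = {".pdf": PyPDFLoader, ".docx": Docx2txtLoader, ".txt": TextLoader}

# 1. Read all supported files from the designated folder.
docs = []
for path in Path("documents").iterdir():
    loader_cls = LOADERS.get(path.suffix.lower())
    if loader_cls:
        docs.extend(loader_cls(str(path)).load())

# 2. Split documents into manageable text chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# 3. Generate embeddings for each chunk with nomic-embed-text via Ollama.
embedder = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
vectors = embedder.embed_documents([c.page_content for c in chunks])

# 4. Upsert the embeddings into the "Machine Learning" collection.
client = QdrantClient(url="http://localhost:6333")
if not client.collection_exists("Machine Learning"):
    client.create_collection(
        collection_name="Machine Learning",
        vectors_config=VectorParams(size=len(vectors[0]), distance=Distance.COSINE),
    )
client.upsert(
    collection_name="Machine Learning",
    points=[
        PointStruct(id=i, vector=v,
                    payload={"text": c.page_content, "source": c.metadata.get("source")})
        for i, (c, v) in enumerate(zip(chunks, vectors))
    ],
)
print(f"Inserted {len(chunks)} chunks into 'Machine Learning'")
```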
10.3 Execution & Verification
The loader is executed using:

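Again from the virtual environment:

```bash
python load_all_files.py
```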
During execution, the script:
- Processes all documents found in the target directory.
- Produces 81 text chunks in this example.
- Inserts 81 vectors into the “Machine Learning” collection in Qdrant.

11. Health Checks and Useful Commands
- Check Docker Containers: docker compose ps
- Check Ollama API Version: curl -s http://localhost:11434/api/version && echo
- List Models Inside Ollama: docker compose exec ollama ollama list
- View Container Logs: docker compose logs -f ollama (or qdrant, openwebui)
- Stop Stack (Keep Data): docker compose down
- Stop Stack (Delete ALL Data): docker compose down -v
12. Operational Validation
The system was validated through multiple operational checks:
- Container health verification using docker compose ps
- Model availability checks via Ollama CLI
- Successful RAG responses retrieved from indexed documents
- Verification of vectors inside Qdrant collections
- Redis connectivity confirmed via CLI commands
These checks ensured that each layer of the stack was operational before proceeding.
13. Project Structure

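The original tree is not shown; the layout below is reconstructed from the files mentioned in this document (the documents/ folder name is an assumption):

```
.
├── docker-compose.yml
├── redis_memory_demo.py
├── simple_langchain_ollama_original.py
├── load_all_files.py
└── documents/        # source files ingested by load_all_files.py
```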
14. Conclusion
The implemented architecture delivers a complete local RAG stack built around Docker, Ollama, Open WebUI, Qdrant, and VectorAdmin. The solution provides:
- A containerized LLM backend with persistent model storage.
- A user-friendly UI for chat and RAG-based question answering.
- A robust vector database with both native and advanced visualization interfaces.
- A Python/LangChain integration path for programmatic interaction.
- Redis-backed caching and persistent conversation memory.
- A custom document loader that enables repeatable, script-driven ingestion of documents into Qdrant.
15. References & Credits
- Ollama. (n.d.). Ollama documentation. https://ollama.com/docs
- Open WebUI. (n.d.). Open WebUI documentation. https://docs.openwebui.com
- Qdrant. (n.d.). Qdrant documentation. https://qdrant.tech/documentation/
- VectorAdmin. (n.d.). VectorAdmin – vector database UI. https://vectoradmin.com
- LangChain. (n.d.). LangChain documentation. https://docs.langchain.com
