Text-to-SQL Service Agent with Docker and LLM model runner

Docker containerization, a SQL database, and an LLM model runner like Ollama in an AI pipeline

A containerized approach to natural language database queries with built-in safety and auditability

Author: Vishesh Sharma

Date: October 27th, 2025

CORE IDEA

  • This project shows how natural English questions can be safely turned into SQL queries, with every step logged and validated before running.
  • Unsafe commands like DROP or DELETE never make it through, thanks to a built-in SQL firewall and a read-only database layer.
  • Everything is fully containerized with Docker, so you can reproduce the setup on any machine and trust that the behavior will be consistent.
  • In benchmark tests, the system produced correct results for most queries while blocking unsafe SQL commands.

Problem & Context

Accessing data is critical for modern organizations, yet most employees struggle with SQL queries. Business users want quick answers to everyday questions such as “Which customers placed the most orders last quarter?” but they often have to wait for overloaded data teams. 

Large language models (LLMs) offer a way forward by converting natural language into SQL automatically, lowering the barrier to accessing data. However, giving models direct access to production databases creates serious risks. An LLM can hallucinate table names, misinterpret requests, or generate unsafe commands like DROP TABLE. Beyond these technical issues, there is a governance concern: without logs or safeguards, there is no way to prove what queries were run or to prevent misuse. This project tackles both challenges by introducing a containerized pipeline that enables natural language access to data while enforcing strict safety and auditability.

Solution Overview & Architecture

System Architecture Design for AI-SQL pipeline
Figure 1 — System Architecture: The modular design with separate components for prompt building, LLM execution, SQL validation, query execution, and logging, all coordinated through Docker.

The system is packaged as a FastAPI service inside a Docker container, with internal components handling each stage of the Text-to-SQL pipeline:

  • When a user submits a question, the prompt builder loads the schema from metastore.yaml and constructs strict rules for SQL generation.
  • The LLM runner sends this prompt to a Hugging Face model through the OpenAI-compatible API.
  • The resulting SQL is then checked by the SQL firewall, which ensures the query is a single SELECT statement, blocks dangerous commands like DROP or DELETE, and can enforce an allowlist of tables (see the sketch after this list).
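A minimal sketch of the firewall idea, using sqlglot (one of the project's key libraries), is shown below. The function name and the allowlist contents are illustrative assumptions, not the repository's actual code.

import sqlglot
from sqlglot import exp

ALLOWED_TABLES = {"customers", "orders", "products"}  # assumed allowlist

def validate_sql(sql: str) -> str:
    # Parse the candidate SQL; reject anything that is not exactly one SELECT.
    statements = sqlglot.parse(sql, read="sqlite")
    if len(statements) != 1 or not isinstance(statements[0], exp.Select):
        raise ValueError("Only a single SELECT statement is allowed")
    # Reject queries that touch tables outside the allowlist.
    for table in statements[0].find_all(exp.Table):
        if table.name.lower() not in ALLOWED_TABLES:
            raise ValueError(f"Table '{table.name}' is not allowed")
    return sql

validate_sql("SELECT name FROM customers WHERE city = 'Boston'")  # passes
# validate_sql("DROP TABLE customers")                            # raises ValueError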

If the query passes validation, the read-only database executor runs it against the database and returns results as a Pandas DataFrame.
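A minimal sketch of such a read-only executor, assuming SQLite's read-only URI mode; the project's executor module may open the database differently.

import pandas as pd
import sqlite3

def run_query(sql: str, db_path: str = "data/demo.db") -> pd.DataFrame:
    # mode=ro opens the file read-only, so even a query that slipped past the
    # firewall could not modify the data.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return pd.read_sql_query(sql, conn)
    finally:
        conn.close()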

Every interaction is logged by the artifact store, which records the original question, generated SQL, results, and status in artifacts.log. The API endpoint /query exposes this full pipeline, while the docker-compose.yml file starts the container and saves the database and logs on your machine. 
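One way the artifact store could append these records is sketched below; the exact format of artifacts.log is not documented here, so the JSON-lines layout and function name are assumptions.

import json
from datetime import datetime, timezone

def log_artifact(question: str, sql: str, status: str, rows: list,
                 path: str = "data/artifacts.log") -> None:
    # Append one JSON record per interaction so the log stays auditable.
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "sql": sql,
        "status": status,
        "rows": rows,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")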

This setup is easy to run and still enforces strong safety rules, so even non-technical users can query data without risking changes or data loss.

API Endpoint Details

The system exposes a single REST API endpoint for submitting natural language queries and retrieving results.

POST /query

  • Query Parameters: question (string, required) — The natural language question to be translated into SQL.
  • Response Schema: JSON object containing the generated SQL and the query results.

Example:

JSON output showing AI-SQL model runner query results for customers in Boston
Figure 2: Output from the AI-SQL model runner showing an executed SQL query and returned JSON results for customers in Boston.
  • Example curl Request:

curl -X 'POST' 'http://localhost:8000/query?question=Show%20the%20customers%20in%20Boston' -H 'accept: application/json' -d ''
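The same call can also be made from Python. The response keys used below (sql, results) are assumptions based on Figure 2 rather than a formal contract.

import requests

resp = requests.post(
    "http://localhost:8000/query",
    params={"question": "Show the customers in Boston"},
    headers={"accept": "application/json"},
    timeout=60,
)
resp.raise_for_status()
payload = resp.json()
print(payload.get("sql"))      # assumed key holding the generated SELECT statement
print(payload.get("results"))  # assumed key holding the returned rows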

Environment & Prerequisites

This experiment is designed to be fully reproducible. The only requirement to run it is a working installation of Docker Desktop. All Python libraries, models, and dependencies are managed within the containerized environment.

System Configuration  

  • Tested on: Windows 11 with WSL 2, Docker Desktop 4.41.2
  • Container OS: python:3.10-slim  
  • Key Libraries: sqlglot>=23.0.0, pandas>=2.2.0, ollama==0.2.1
  • Generative Model: mistral:latest (Mistral 7B Instruct) served via the ollama/ollama Docker image

Dataset Details

The system works against a structured database schema.

  • Database: data/demo.db (SQLite) is initialised by init_db.py, which populates it with a randomly generated dataset (a minimal sketch of this step follows the list).
  • Schema definition is stored at data/metastore.yaml.
  • Tables: Customers, Orders, Products (descriptions + allowed joins defined in YAML)
  • Mode: Read-only — no insert/update/delete operations are permitted.
  • Language: Natural language queries in English (e.g., “Which customers placed the most orders in 2024?”).
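A rough sketch of what an initialisation script like init_db.py might do is shown below; the column names and row counts are assumptions drawn from the schema description, not the repository's exact DDL.

import random
import sqlite3

conn = sqlite3.connect("data/demo.db")
cur = conn.cursor()
# Create the three tables described in metastore.yaml (column names assumed).
cur.executescript("""
CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE IF NOT EXISTS products  (id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE IF NOT EXISTS orders    (id INTEGER PRIMARY KEY, customer_id INTEGER,
                                      product_id INTEGER, order_date TEXT);
""")
# Fill the customers table with a few random rows for demo purposes.
cities = ["Boston", "Chicago", "Seattle"]
for i in range(10):
    cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)",
                (f"Customer {i}", random.choice(cities)))
conn.commit()
conn.close()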

Preprocessing Steps

Before a user’s natural language query is processed, the system applies some lightweight preprocessing:

  • Schema injection: The query is combined with table/column descriptions from metastore.yaml so the model has strict boundaries.
  • Rule injection: The prompt builder enforces guardrails (e.g., “Only SELECT queries”, “No hallucinated tables”); a minimal sketch of this prompt-building step follows the list.
  • Caching check: The system looks up the question in artifacts.log; if an identical SQL was generated before, it reuses that result.
  • Normalization: SQL queries are normalized during scoring (lowercased, punctuation stripped) to make keyword matching fair across models.
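A minimal sketch of the schema and rule injection step, assuming the prompt builder reads metastore.yaml with PyYAML; the exact rule wording and function name are illustrative, not the project's real code.

import yaml

RULES = (
    "Rules:\n"
    "- Generate exactly one SELECT statement.\n"
    "- Use only the tables and columns listed in the schema.\n"
    "- Never use INSERT, UPDATE, DELETE, DROP, or ALTER.\n"
)

def build_prompt(question: str, metastore_path: str = "data/metastore.yaml") -> str:
    # Load table/column descriptions and inject them alongside the guardrails.
    with open(metastore_path) as f:
        schema = yaml.safe_load(f)
    schema_text = yaml.dump(schema, sort_keys=False)
    return f"Schema:\n{schema_text}\n{RULES}\nQuestion: {question}\nSQL:"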
Step-by-Step Implementation

The project relies primarily on docker-compose.yml, which defines the whole application stack.


Query Workflow for the AI-SQL pipeline
Figure 3 — Query Workflow: From user input to final output. The pipeline builds a strict SQL prompt, generates candidate SQL through the LLM runner, validates it via the SQL firewall, executes only safe SELECT queries on a read-only database, and logs every step in the artifact store. Unsafe queries are blocked with an error message.

Key Points:

  • Image: ollama/ollama:latest
    • Uses the official Ollama Docker image, which comes with the runtime required to serve models.
  • Volumes
    • Mounts ollama_data to /root/.ollama inside the container.
    • This persists downloaded models between container restarts, so you don’t re-download mistral:latest every time.
  • Healthcheck
    • Runs ollama list | grep mistral:latest inside the container.
    • If the model is not available, it exits with 1, marking the container as unhealthy.
    • Docker Compose will retry until the model is ready (interval: 30s, timeout: 5s, up to 5 retries).
    • This ensures the pipeline doesn’t start until the model is available.
# excerpt from docker-compose.yml (the api and cli services are omitted here)
services:
  model-runner:
    image: ollama/ollama:latest
    container_name: model-runner
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    healthcheck:
      test: ["CMD", "bash", "-lc", "ollama list | grep -q 'mistral:latest'"]
      interval: 30s
      timeout: 5s
      retries: 5

volumes:
  ollama_data:

The entire project is launched with a single command:

docker-compose up --build

Docker Model Runner

To keep everything simple and consistent, we used a Docker model runner to handle the language model separately from the main API. It’s based on the official ollama/ollama image and runs the mistral:latest model inside its own container.

The API talks to it through a local endpoint, so all the model work happens inside Docker.

AI-SQL Docker Desktop view displaying the running ai_sql container with CPU usage and port details.

Figure 4 — AI_SQL container running successfully in Docker Desktop, confirming a healthy and active setup.

Once our containers appear in Docker Desktop, there’s no need to use the terminal at all. You just open Docker Desktop and hit the Play button on the model-runner, then do the same for the api or the cli container. Docker automatically makes sure the model runner starts first and is healthy before the API begins.

The first time it runs, the model downloads once and is stored in a persistent volume called ollama_data. After that, it’s reused automatically, so you don’t have to wait for downloads again. The setup also works fully offline once the model is ready, which makes it really handy for local testing or demos.

You can manage everything directly from Docker Desktop and view logs, restart containers, or stop them when you’re done. It’s a one-click setup that makes the whole system feel lightweight and easy to use.

AI-SQL Docker Desktop view displaying the services inside the container
Figure 5 — All containers running in Docker Desktop, with the API returning query results through the CLI.

Command Line Interface (CLI)

The CLI acts as a lightweight companion to the API, designed for users who prefer working directly in the terminal. It connects with the same Docker services, sending natural language queries to the API and returning formatted SQL results instantly.

Once the containers are running in Docker Desktop, the CLI can be launched to query without opening a browser. It’s especially handy for people who want fast feedback.

Each CLI interaction is logged in the same artifact store, keeping a consistent record of questions, SQL statements, and outputs across both the API and the web interface.
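A minimal sketch of what such a CLI loop could look like, calling the same /query endpoint; the service URL and response keys are assumptions, and the repository's CLI may format output differently.

import requests

API_URL = "http://api:8000/query"  # assumed service name on the Compose network

def main() -> None:
    while True:
        question = input("Ask a question (or 'quit'): ").strip()
        if question.lower() in {"quit", "exit", ""}:
            break
        resp = requests.post(API_URL, params={"question": question}, timeout=60)
        data = resp.json()
        print("\nSQL:    ", data.get("sql"))          # assumed response key
        print("Results:", data.get("results"), "\n")  # assumed response key

if __name__ == "__main__":
    main()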

Snippet of our CLI running queries  in our pipeline
Figure 6 — CLI mode of our program: showing a user query and formatted SQL results, confirming live interaction with the API.

Results & Evaluation

We measured performance across two metrics:

  • Soft accuracy: keyword coverage of expected SQL terms (a rough sketch of this scoring follows Figure 7).
  • Latency: average query generation time.
Summary of model accuracy and latency
Figure 7. Summary of model accuracy and latency: All models achieved 100% accuracy, with Hugging Face models responding fastest and Ollama Mistral showing the highest latency.
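A rough illustration of the soft-accuracy idea (keyword coverage after normalization), as referenced above; the exact scoring logic in tests/test_evaluation.py may differ, so the function below is an assumption.

import re

def soft_accuracy(generated_sql: str, expected_keywords: list) -> float:
    # Normalize: lowercase and strip punctuation so keyword matching is fair.
    normalized = re.sub(r"[^\w\s]", " ", generated_sql.lower())
    tokens = set(normalized.split())
    if not expected_keywords:
        return 0.0
    hits = sum(1 for kw in expected_keywords if kw.lower() in tokens)
    return hits / len(expected_keywords)

print(soft_accuracy("SELECT name FROM customers WHERE city = 'Boston';",
                    ["select", "customers", "boston"]))  # -> 1.0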

The models achieved similar accuracy, though with different latency.

The Llama model balanced accuracy with speed. In contrast, the OpenAI OSS model was the slowest of the Hugging Face backends, while Ollama was the slowest overall, averaging five seconds per query.

One additional observation concerns caching. While the system does reuse previous results by looking up identical SQL strings in the artifact log, this is not a smart caching layer. For small test runs it works fine, but when thousands of queries accumulate, string-based lookup will become inefficient. 

Limitations 

This project provides a modular, containerized pipeline for safe natural language to SQL. Each part of the system (prompt builder, LLM runner, SQL firewall, database executor, and artifact store) is isolated in its own module and coordinated through Docker. Still, several limitations remain.

  • First, the evaluation metrics can be made more solid. Currently, queries are scored using keyword coverage (soft accuracy), latency, and status. While this helps to highlight differences between models, it does not fully capture accuracy. Equivalent SQL constructs, such as MAX(price) versus ORDER BY price DESC LIMIT 1, may be undercounted. A better approach would compare execution results directly, ensuring accuracy is measured on outcomes rather than string matches.
  • Second, while the system is modular, it currently runs in a local Docker environment for reproducibility. In production, these components could be deployed as separate services in a cloud-native environment. For example, the API, model runner, firewall, executor, and logging system can each be containerized and scaled independently using Kubernetes or serverless runtimes. 
  • Third, while an artifact log and lookup mechanism already exists, it is not a smart caching system. Queries are matched by SQL strings, which works for small datasets but will become inefficient as logs grow into thousands of entries. A more scalable approach would be to add indexed storage, semantic query matching, or a dedicated caching service to return frequent results instantly.

Finally, the current scope is limited to basic text-to-SQL translation. There is room to extend the functionality to query explanation, natural-language summarization of results, and handling of more complex schemas or multiple databases.

Next Steps

Future improvements could include:

  • Improved evaluation: Move from keyword-based scoring to execution-based metrics that compare query outputs.
  • Cloud-native deployment: Deploy each module as an independent containerized service with orchestration tools like Kubernetes.
  • Multi-database support: Extend beyond SQLite to include PostgreSQL and managed cloud databases in read-only mode.
  • Enhanced features: Add result summarization, query explanations, and cross-database joins.
  • Smart caching: Add a dedicated caching service that can return results for frequent (hot) queries instantly.

Contributing and Getting Involved

  • Go to the GitHub repository to see the full project code.
  • If you find a problem or a bug, open an issue so it can be fixed.
  • If you have an idea for improving the project, suggest it by raising a feature request.
  • Check the issues page to read what others have reported or suggested.
  • If you want to help, you can contribute by sending a pull request. This can be for fixing errors, updating documentation, or adding new features.

How to Check

This experiment is fully containerized and managed by Docker Compose. The following steps detail the project structure and the commands needed to replicate the entire analysis.

Project Structure

The repository keeps a clear separation between application code, container configurations, and data, which are organized into separate directories. This layout improves maintainability, reduces complexity, and makes the environment straightforward to manage.

Repository Tree
Figure 8 — Project File Structure: Organized layout of the repository, showing modular separation of controllers, models, views, data, and configuration files for the text-to-SQL system.

Prerequisite

To recreate this project, you only need to have Docker Desktop installed.

Setup & Execution

Run the Full Experiment

To build and run everything (API + LLM + CLI), simply run:

docker-compose up --build

This will:

  • Pull/build all required images
  • Start the Docker model runner and ensure the mistral:latest model is available
  • Launch the FastAPI service (api)
  • Run evaluation (tests/test_evaluation.py) and save results into results/

After it finishes, shut everything down with:

docker-compose down

Run Only the Interactive API

If you want to just test queries interactively (via Swagger UI /docs or Thunder Client), run:

docker-compose up --build api

This will start:

  • docker model runner 
  • api (FastAPI service with /query endpoint)

Run Only the CLI

If you want to use the command-line interface, first build it with:

docker-compose up --build cli

This command builds the CLI container but runs it in detached mode inside Docker Desktop. To open and interact with it directly in the terminal, use:

docker-compose run --rm cli

Before running, make sure Docker is running and the model is available:

docker exec -it model-runner ollama list

Stop the containers with Ctrl+C when you are done, then clean everything up with docker-compose down.

Output Artifacts

After running the full experiment, the following outputs are generated and stored locally:

  • data/demo.db: The read-only SQLite database used for executing validated queries.
  • data/artifacts.log: An append-only log of all user questions, generated SQL, execution status, and results. This ensures auditability and allows repeated queries to be cached.
  • results/detailed_results.csv: A row-by-row record of each query execution in test mode (produced by running tests/test_evaluation.py in the container), including question, model used, generated SQL, latency, and score.
Detailed results showing a per-query view of the metrics
Figure 9. Detailed evaluation results: Every query was accurately translated into SQL, with all models achieving full correctness (soft score = 1). Hugging Face models showed consistently lower latency, while Ollama Mistral exhibited higher response times.
  • results/summary_results.csv: An aggregated summary of model performance across all test cases (soft accuracy, total queries, average latency).
Summary of model accuracy and latency
Figure 10. Summary of model accuracy and latency: All models achieved 100% accuracy, with Hugging Face models responding fastest and Ollama Mistral showing the highest latency.


Conclusion

This project shows how natural language queries can be converted into safe SQL statements inside a containerized setup that runs directly through Docker Desktop. Each part of the system, including the prompt builder, SQL firewall, and artifact logger, helps maintain security and traceability. The Docker model runner makes deployment simple, requiring only one click to start.

The results confirm that the approach works as intended. It generates accurate queries, blocks unsafe ones, and keeps detailed logs for review. Overall, it provides a secure and convenient way to access structured data locally while maintaining full control and reliability.
