Post

Dockerizing a RAG Application with FastAPI, LlamaIndex, Qdrant and Ollama

Introduction

In the previous article, we built a local RAG api using FastAPI, LlamaIndex and Qdrant to query documents with the help of a local LLM running via Ollama.

While the application worked locally, setting it up each time, installing dependencies, ensuring the right versions in the right environment, running background services..quickly became tedious and error-prone.

In this post, we’ll walk through how we Dockerized the entire setup to make it fully containerized, reproducible, and easy to deploy with a single command. By the end, we’ll have a clean multi-service Docker Compose setup running this RAG app in isolated containers.

Project Structure Overview

So far this is the project structure of our project:

1
2
3
4
5
6
7
8
9
10
├── api/                     # FastAPI app
│   ├── main.py               # FastAPI entry point
│   └── rag.py                # RAG logic, used by the api
├── data/                    # Data ingestion / document handling
│   ├── data.py               # Script to download, load and process documents
├── ollama/                  # Local model setup with Ollama
│   ├── Modelfile             # Ollama model definition (e.g., Zephyr, llama2, mistral)
│   └── run_ollama.sh         # Script to start Ollama and serve the model
├── pyproject.toml           # Project metadata and dependencies
└── uv.lock                  # Lockfile with pinned dependency versions

Creating Dockerfiles

Dockerfiles contains the instructions on how to build the image, such as: base image, dependencies and commands to run.

To dockerize our entire application, we need to create a Dockerfile for each service.

In our case, we need to build two custom images:

  • API image
  • Data ingestion tool image

For Qdrant & Ollama, we only need to pull the official images and run them directly.

Let’s first ensure our dependencies are defined correctly. Since we are using uv and pyproject.toml file to manage dependencies, the typical setup looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
[project]
name = "local-rag-llamaindex"
version = "0.1.0"
description = "Local llamaindex RAG api to assist researchers quickly navigate research papers"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "arxiv>=2.2.0",
    "fastapi[standard]>=0.115.12",
    "langchain>=0.3.24",
    "langchain-community>=0.3.22",
    "llama-index>=0.12.33",
    "llama-index-embeddings-huggingface>=0.5.3",
    "llama-index-embeddings-langchain>=0.3.0",
    "llama-index-llms-ollama>=0.5.4",
    "llama-index-vector-stores-qdrant>=0.6.0",
    "ollama>=0.4.8",
    "pydantic>=2.11.3",
    "pyyaml>=6.0.2",
    "qdrant-client>=1.14.2",
    "sentence-transformers>=4.1.0",
    "structlog>=25.3.0",
    "tqdm>=4.67.1",
]

One challenge is that a typical Dockerfile requires installing the necessary dependencies, which means copying the pyproject.toml file and installing everything listed in it.

However, since we only have a single dependencies file, and each service uses both common and unique libraries, it’s not a good practice to install all dependencies, including unused ones into every Docker image we build.

To address this, we’ll separate the dependencies as follows:

  • Shared dependencies: Used by both services, e.g., llama-index, qdrant-client, etc. These will go into a base image.
  • API specific dependencies: e.g., FastAPI, ollama, etc.
  • Data ingestion tool dependencies: e.g., arxiv, etc.

To do that, we’ll keep the shared depedencies in the main dependencies section, and the service-specific ones under [project.optional-dependencies] in pyproject.toml:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
[project]
name = "local-rag-llamaindex"
version = "0.1.0"
description = "Local llamaindex RAG api to assist researchers quickly navigate research papers"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "hf-xet>=1.0.5",
    "llama-index>=0.12.33",
    "llama-index-embeddings-huggingface>=0.5.3",
    "llama-index-llms-ollama>=0.5.4",
    "llama-index-vector-stores-qdrant>=0.6.0",
    "python-dotenv>=1.1.0",
    "pyyaml>=6.0.2",
    "qdrant-client>=1.14.2",
    "sentence-transformers>=4.1.0",
    "structlog>=25.3.0",
]

[project.optional-dependencies]
data_ingest = [
    "arxiv>=2.2.0",
]
api = [
    "fastapi[standard]>=0.115.12",
    "pydantic>=2.11.3",
    "ollama>=0.4.8",
]

Base Image

This image will include all shared dependencies, mainly llama-index packages, qdrant and sentence-transformers, and it will be used as the base for both the API and data ingestion images.

1
2
3
4
5
6
7
FROM ghcr.io/astral-sh/uv:python3.13-bookworm-slim

WORKDIR /app

COPY pyproject.toml uv.lock ./

RUN uv sync

We’re starting from the official uv image, which comes with python3.13 and uv pre-installed. It is a good starting point since we’ll be using pyproject.toml and uv for managing dependencies.

Next, we setup a working directly, in this case we called it app/ and then we copy our .toml and .lock files into it

Finally, we run the command uv sync to install all the dependencies inside our container with the versions matching the ones from the lock file.

API Image

1
2
3
4
5
6
7
8
9
10
11
FROM rag-base-image

WORKDIR /api

COPY pyproject.toml uv.lock ./

RUN uv pip install --system ".[api]"

COPY api/ .

CMD ["fastapi", "run", "main.py", "--port", "80", "--host", "0.0.0.0"]

We start by using the base image we just created, meaning we won’t have to reinstall all the comon dependencies. Then we set /api as our working directory, so any relative paths after (e.g., in COPY instructions) will be based on that directory.

We then install the api specific dependencies, found in pyproject.toml file under [project.optional-dependencies] , in the “api” section

After that we copy the local api/ folder to the api/ directory inside the container.

Finally, we start the fastapi server on port 80.

Docker builds each instruction into an intermediate layer. If a layer hasn’t changed since the last build, Docker caches it, meaning it won’t re-run that step

If we only change the app code in api/, Docker won’t re-install dependencies, it’ll reuse the cached layer above.

That’s why we install dependencies before copying the project code: to avoid reinstalling packages every time we modify the code.

Data ingestion image

1
2
3
4
5
6
7
8
9
10
11
FROM rag-base-image

WORKDIR /app

COPY pyproject.toml uv.lock ./

RUN uv pip install --system ".[data_ingest]"

COPY data/ .

ENTRYPOINT ["uv", "run", "data.py"]

This one is a bit similar to the api image, but this time we install the “data_ingest” section dependencies, copy the data/ folder to the app/ directory in the container, then run the script with uv run data.py

Ollama image

For Ollama, we’ll directly use a prebuilt image ollama/ollama:latest, and since we want it to use a local llm with our created Modelfile, we need to prodive some instructions to copy the correct file and run the LLM we created.

We’ll do that with the help of a simple bash script that will be used later in our docker compose file.

Qdrant image

Same as Ollama, we’ll use the prebuilt image qdrant/qdrant

Now that we have all our Dockerfiles ready, let’s go ahead and create a Docker Compose file that will orchestrate our multi-container setup.

Docker Compose

Now that each component of our RAG system has its own Dockerfile and image, it’s time to bring them all together.

Docker Compose

In this section, we’ll define a docker-compose.yml file that orchestrates the entire setup, allowing us to spin up the FastAPI API, Ollama model server, Qdrant vector database, and our data ingestion tool all at once with a single command.

Building the Base image

1
2
3
4
5
6
services:
  base:
    build:
      context: .
      dockerfile: Dockerfile.base
    image: rag-base-image*

This service builds a base image that contains all shared dependencies across our project. The api and data_ingestion services will reference this image in their own builds.

This approach keeps the builds fast by avoiding repeated dependency installations across services

Api service

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
  api:
    build:
      context: .
      dockerfile: api/Dockerfile
    ports:
      - "8000:80"
    depends_on:
      - ollama
      - qdrant
      - base
    environment:
      - LLM_URL=http://ollama:11434
      - LLM_NAME=research_assistant
      - EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
      - QDRANT_URL=http://qdrant:6333
      - COLLECTION_NAME=researchpapers
      - CHUNK_SIZE=1024
    restart: always
    networks:
      - rag-network
    volumes:
      - ./api:/api

This runs our main RAG API built with FastAPI, exposing it on localhost:8000. It depends on the base image, as well as the ollama and qdrant services, meaning docker compose won’t start this service unless the other services are ready.

Data ingestion tool

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
  data_ingestion:
    build:
      context: .
      dockerfile: data/Dockerfile
    profiles: ["data-ingest"]
    depends_on:
      - ollama
      - qdrant
    entrypoint: ["python", "data.py"]
    environment:
      - DATA_PATH=papers/
      - LLM_URL=http://ollama:11434
      - LLM_NAME=research_assistant
      - EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
      - QDRANT_URL=http://qdrant:6333
      - COLLECTION_NAME=researchpapers
      - CHUNK_SIZE=1024
    networks:
      - rag-network
    volumes:
      - ./papers:/app/papers

This is a one-off job or service used to load and embed documents into Qdrant, we don’t need this container to be continously runnning unless we start it, load and embed the documents into our vector database and then turned off.

Qdrant service

1
2
3
4
5
6
7
8
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - ~/qdrant_storage:/qdrant/storage:z
    networks:
      - rag-network

Runs the Qdrant vector DB, which stores the chunked and embedded documents locally. It exposes port 6333 and persists data using the specified volume on our machine.

Ollama service

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
  ollama:
    image: ollama/ollama:latest
    pull_policy: always
    container_name: ollama
    volumes:
      - ./ollama:/model_files
      - ./ollama:/root/.ollama
    tty: true
    entrypoint: ["/bin/sh", "./model_files/run_ollama.sh"]
    environment:
      - LLM_NAME=research_assistant
    restart: always
    extra_hosts:
      - host.docker.internal:host-gateway

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "11434:11434"
    networks:
      - rag-network

This runs the Ollama LLM server locally and exposes it on port 11434. It runs a custom run_ollama.sh script to load the desired model.

1
2
3
4
5
6
7
8
9
10
11
12
13
#!/bin/bash

echo "Starting Ollama server..."
ollama serve &  # Start Ollama in the background

echo "Ollama is ready, creating the model..."

if [ -z "$LLM_NAME" ]; then
  echo "Error: LLM_NAME environment variable is not set."
  exit 1
fi
ollama create $LLM_NAME -f ./model_files/Modelfile
ollama run $LLM_NAME

We added gpu configuration so Ollama can access it, For more details, see the official guide: Enable GPU access with Docker Compose

Network and Volumes

1
2
3
4
5
6
7
networks:
  rag-network:
    driver: bridge

volumes:
  qdrant_storage:
  ollama:
  • rag-network: A shared bridge network so services can talk to each other by name, eg: ollama:11434
  • volumes: Declares named volumes used by services in order to persist data between runs.

You can find the full docker-compose.yml file here: docker-compose.yml

What’s left to do now is build the images and run the containers:

1
docker compose up ollama qdrant api

That’s it, the command will build the base image, start Qdrant and Ollama services, then start the API service.

Then if we want to ingest more documents into our vector database:

1
docker compose run data_ingestion --query "LLMs" --max 10 --ingest

Docker containers

Api is now running on: http://localhost:8000/docs

Ollama: http://localhost:11434/

Qdrant ui: http://localhost:6333/dashboard#/collections

Finally, to stop all running containers and networks started by Docker Compose:

1
docker compose down

If we want to reset everything, including removing volumes (like Qdrant’s stored vectors) and networks, run:

1
docker compose down --volumes --remove-orphans

Conclusion

In this article, we learned how to dockerize a RAG application built with FastAPI, Llamaindex, Qdrant and locally served model via Ollama.

Explore the full project on Github repository

This post is licensed under CC BY 4.0 by the author.