The Most Beautiful RAG: Starring ColPali, Qdrant, Minio and Friends

Author: Athos Georgiou
Vision RAG Template

A Follow‑up: From Little Scripts to a Full Template

A few weeks ago I shared: The Most Beautiful RAG: Starring Colnomic, Qdrant, Minio and Friends. That project explored late‑interaction retrieval with ColPali‑style embeddings, Qdrant, and MinIO, plus a bunch of performance tricks.

This post levels that up into a reusable template you can clone, run, and build on.

If you want a minimal, API‑first Vision RAG that does page‑level retrieval over PDFs, you're in the right place.

What It Is

  • Page‑level, multimodal RAG using a ColPali‑like approach
  • FastAPI backend with routes for index, search, chat (streaming), and maintenance
  • Optional Next.js frontend included (simple scaffold) on port 3000
  • Docker Compose spins up Qdrant, MinIO, the backend API, and (optionally) the frontend
  • ColPali Embedding API as a separate service (CPU/GPU modes or explicit base URL)
  • Optional binary quantization in Qdrant for memory/speed trade‑offs
  • Configuration via .env/config.py
  • Optional local Gradio UI for quick testing

High‑Level Architecture

The core pieces:

  • api/app.py + api/routers/*: Modular FastAPI app (meta, retrieval, chat, indexing, maintenance)
  • backend.py: Thin entrypoint that boots api.app.create_app()
  • local.py + ui.py: Optional local Gradio UI separate from the FastAPI server
  • frontend/ (Next.js): Simple UI scaffold (optional service)
  • colpali/: Embedding API service (CPU/GPU) used by the backend
  • clients/colpali.py: HTTP client to a ColPali‑style embedding API (queries, images, patch metadata)
  • clients/qdrant.py: Multivector prefetch (rows/cols) + re‑ranking with using="original"
  • clients/minio.py: Object storage for page images with public URLs
  • clients/openai.py: Thin wrapper for streaming completions
  • api/utils.py: PDF → image via pdf2image
  • config.py: All the knobs in one place
[Figure: architecture diagram]

Next.js Frontend (optional)

  • Runs at http://localhost:3000 when the frontend service is enabled.
  • Basic upload/search/chat UI; intended as a scaffold you can extend (no auth by default).
  • API docs remain at http://localhost:8000/docs.

Gradio UI

Main Screen

[Screenshots: Gradio main screen]

LLM Powered Q/A

[Screenshot: LLM-powered Q/A]

Validate Results

[Screenshot: validating results against the retrieved pages]

Indexing Flow

  1. PDF → images (pdf2image.convert_from_path)
  2. Images → embeddings (external ColPali API)
  3. Save images to MinIO (public URLs)
  4. Upsert embeddings (original + mean‑pooled rows/cols) to Qdrant with payload metadata
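To make step 4 a bit more concrete, here is a minimal sketch of the row/column mean pooling in plain Python. It assumes the image patch embeddings have already been sliced out of the full token sequence (for example with image_patch_start and image_patch_len) and that they are laid out row by row on an n_patches_x by n_patches_y grid. The function name mean_pool_grid is hypothetical, and the template's own implementation may differ.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool_grid(
    patches: List[Vector], n_patches_x: int, n_patches_y: int
) -> Tuple[List[Vector], List[Vector]]:
    """Return (row_means, col_means) for a row-major grid of patch embeddings."""
    if n_patches_x <= 0 or n_patches_y <= 0:
        raise ValueError("grid dimensions must be positive")
    if len(patches) != n_patches_x * n_patches_y:
        raise ValueError("patch count does not match grid size")
    dim = len(patches[0])

    def mean(vectors: List[Vector]) -> Vector:
        # Component-wise average of a list of equally sized vectors.
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

    # Assumption: patches are ordered row by row, n_patches_x per row.
    grid = [
        patches[r * n_patches_x:(r + 1) * n_patches_x]
        for r in range(n_patches_y)
    ]
    row_means = [mean(row) for row in grid]  # one vector per grid row
    col_means = [
        mean([row[c] for row in grid]) for c in range(n_patches_x)
    ]  # one vector per grid column
    return row_means, col_means
```

The pooled vectors are much shorter sequences than the original one, which is what makes them cheap enough to use for the prefetch stage.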

Retrieval Flow

  1. Query → embedding (ColPali API)
  2. Qdrant multivector prefetch (rows/cols), then rerank with using="original"
  3. Fetch top‑k page images from MinIO
  4. Stream an OpenAI‑backed answer conditioned on user text + page images
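Step 2 can be sketched as a request body for Qdrant's REST Query API (POST /collections/{collection_name}/points/query). This is an illustration rather than the template's actual client code: the helper name build_query_body and the default limits are placeholders, and in practice the limits would presumably come from QDRANT_PREFETCH_LIMIT and QDRANT_SEARCH_LIMIT.

```python
from typing import Any, Dict, List

MultiVector = List[List[float]]


def build_query_body(
    query_embedding: MultiVector,
    prefetch_limit: int = 200,  # placeholder value, not taken from the template
    search_limit: int = 5,      # placeholder value, not taken from the template
) -> Dict[str, Any]:
    """Build a two-stage query: cheap pooled prefetch, then full rerank."""
    # Stage 1: gather candidates using the mean-pooled row/column vectors.
    prefetch = [
        {"query": query_embedding, "using": name, "limit": prefetch_limit}
        for name in ("mean_pooling_rows", "mean_pooling_columns")
    ]
    # Stage 2: rerank only those candidates against the full token sequence.
    return {
        "prefetch": prefetch,
        "query": query_embedding,
        "using": "original",
        "limit": search_limit,
        "with_payload": True,
    }
```

Keeping the prefetch limit well above the final limit gives the reranker enough candidates to recover pages that the pooled vectors ranked poorly.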

Quickstart (Docker Compose)

# 1) Configure env
cp .env.example .env
# Set OPENAI_API_KEY / OPENAI_MODEL
# Choose COLPALI_MODE=cpu|gpu (or set COLPALI_API_BASE_URL to override)

# 2) Start the ColPali Embedding API (separate compose, from colpali/)
# CPU at http://localhost:7001 or GPU at http://localhost:7002
docker compose -f colpali/docker-compose.yml up -d api-cpu  # or api-gpu

# 3) Start backend services (and optional frontend)
docker compose up -d

# Services
# Qdrant:   http://localhost:6333  (Dashboard at /dashboard)
# MinIO:    http://localhost:9000  (Console: http://localhost:9001, user/pass: minioadmin/minioadmin)
# API:      http://localhost:8000  (OpenAPI: http://localhost:8000/docs)
# Frontend: http://localhost:3000   (if enabled)

Open the docs at http://localhost:8000/docs and try the endpoints.

Local Development (without Compose)

  • Install Poppler (needed by pdf2image). Ensure pdftoppm/pdftocairo are in PATH.
  • Create a venv, install requirements, run Qdrant/MinIO (Docker is fine), then:

cp .env.example .env
# set OPENAI_API_KEY, OPENAI_MODEL, QDRANT_URL, MINIO_URL, COLPALI_API_BASE_URL
uvicorn backend:app --host 0.0.0.0 --port 8000 --reload

Optional Local Gradio UI

python local.py
# Defaults to HOST=0.0.0.0, PORT=7860

Environment Variables (high‑value ones)

  • Core: LOG_LEVEL, HOST, PORT, ALLOWED_ORIGINS
  • OpenAI: OPENAI_API_KEY, OPENAI_MODEL
  • ColPali: COLPALI_MODE (cpu|gpu), COLPALI_CPU_URL, COLPALI_GPU_URL, COLPALI_API_BASE_URL (overrides), COLPALI_API_TIMEOUT
  • Qdrant: QDRANT_URL, QDRANT_COLLECTION_NAME, QDRANT_SEARCH_LIMIT, QDRANT_PREFETCH_LIMIT
  • Qdrant (storage/quantization): QDRANT_ON_DISK, QDRANT_ON_DISK_PAYLOAD, QDRANT_USE_BINARY, QDRANT_BINARY_ALWAYS_RAM, QDRANT_SEARCH_RESCORE, QDRANT_SEARCH_OVERSAMPLING, QDRANT_SEARCH_IGNORE_QUANT
  • MinIO: MINIO_URL, MINIO_PUBLIC_URL, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, MINIO_BUCKET_NAME, MINIO_WORKERS
  • Processing: DEFAULT_TOP_K, BATCH_SIZE, WORKER_THREADS, MAX_TOKENS

See .env.example for a minimal starting point.
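As an illustration of how the ColPali variables interact, here is a small sketch of the URL resolution: COLPALI_API_BASE_URL wins if set, otherwise COLPALI_MODE picks between the CPU and GPU URLs. The function name, the default mode of cpu, and the localhost fallbacks (taken from the Quickstart ports) are assumptions, and config.py may resolve these differently, for example with service hostnames inside Compose.

```python
from typing import Mapping


def resolve_colpali_url(env: Mapping[str, str]) -> str:
    """Pick the embedding API base URL from environment-style settings."""
    # An explicit base URL overrides the mode-based selection.
    override = env.get("COLPALI_API_BASE_URL", "").strip()
    if override:
        return override.rstrip("/")

    mode = env.get("COLPALI_MODE", "cpu").strip().lower()  # assumed default
    if mode == "cpu":
        return env.get("COLPALI_CPU_URL", "http://localhost:7001").rstrip("/")
    if mode == "gpu":
        return env.get("COLPALI_GPU_URL", "http://localhost:7002").rstrip("/")
    raise ValueError(f"COLPALI_MODE must be 'cpu' or 'gpu', got {mode!r}")
```

Calling it with os.environ gives the effective URL for the current process, which is handy to log at startup when debugging connectivity.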

ColPali API contract (expected)

The backend expects a ColPali‑style embedding API with endpoints:

  • GET /health → 200 when healthy
  • GET /info → JSON including { "dim": <int> }
  • POST /patches with { "dimensions": [{"width": W, "height": H}, ...] } → { "results": [{"n_patches_x": int, "n_patches_y": int}, ...] }
  • POST /embed/queries with { "queries": ["...", ...] } → { "embeddings": [[[...], ...]] }
  • POST /embed/images (multipart) → objects per image including embedding, image_patch_start, image_patch_len

Ensure your embedding server matches this contract to avoid client/runtime errors.
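A quick way to check a server against this contract is a tiny stdlib client for /embed/queries. The sketch below only builds the request and validates the response shape; the helper names are hypothetical and this is not the code in clients/colpali.py. The expected dimension would come from GET /info.

```python
import json
from typing import Any, Dict, List
from urllib.request import Request


def build_queries_request(base_url: str, queries: List[str]) -> Request:
    """Build the POST /embed/queries request described in the contract."""
    body = json.dumps({"queries": queries}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/embed/queries",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_query_embeddings(
    payload: Dict[str, Any], expected_dim: int, n_queries: int
) -> List[List[List[float]]]:
    """Validate {"embeddings": [[[...], ...]]}: one token list per query."""
    embeddings = payload["embeddings"]
    if len(embeddings) != n_queries:
        raise ValueError("expected one embedding sequence per query")
    for tokens in embeddings:
        for vector in tokens:
            if len(vector) != expected_dim:
                raise ValueError("token vector does not match the reported dim")
    return embeddings
```

Sending the request is then a matter of passing it to urllib.request.urlopen and feeding the decoded JSON to parse_query_embeddings.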

Data model in Qdrant

Each point stores three vectors (multivector):

  • original: full token sequence
  • mean_pooling_rows: pooled by rows
  • mean_pooling_columns: pooled by columns

Payload example:

{
  "index": 12,
  "page": "Page 3",
  "image_url": "http://localhost:9000/documents/images/<id>.png",
  "document_id": "<id>",
  "filename": "file.pdf",
  "file_size_bytes": 123456,
  "pdf_page_index": 3,
  "total_pages": 10,
  "page_width_px": 1654,
  "page_height_px": 2339,
  "indexed_at": "2025-01-01T00:00:00Z"
}
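For reference, a collection with these three named multivectors could be declared with a body like the one below, sent to Qdrant's REST endpoint PUT /collections/{collection_name}. This is a sketch under assumptions: the helper name is made up, the Cosine distance is a guess rather than something the template states, and the vector size should be the dim reported by GET /info.

```python
from typing import Any, Dict

VECTOR_NAMES = ("original", "mean_pooling_rows", "mean_pooling_columns")


def build_collection_body(dim: int) -> Dict[str, Any]:
    """Declare three named multivectors compared with MaxSim."""

    def multivector(size: int) -> Dict[str, Any]:
        return {
            "size": size,
            "distance": "Cosine",  # assumption, not confirmed by the template
            "multivector_config": {"comparator": "max_sim"},
        }

    return {"vectors": {name: multivector(dim) for name in VECTOR_NAMES}}
```

All three vectors share the same per-token dimension; they differ only in how many token vectors each point stores.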

Binary quantization (optional)

Enable Qdrant binary quantization to cut memory use and speed up search; rescoring with oversampling recovers most of the retrieval quality lost to quantization.

  • Set in .env: QDRANT_USE_BINARY=True, QDRANT_BINARY_ALWAYS_RAM=True (optionally QDRANT_ON_DISK=True, QDRANT_ON_DISK_PAYLOAD=True)
  • Tune search: QDRANT_SEARCH_RESCORE=True, QDRANT_SEARCH_OVERSAMPLING=2.0, QDRANT_SEARCH_IGNORE_QUANT=False
  • Apply changes: clear the collection (POST /clear/qdrant) and re‑index
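The sketch below shows roughly how these flags could map onto Qdrant's REST payloads: a quantization_config fragment for collection creation and a params fragment for query requests. The helper names, the boolean parsing, and the fallback defaults are assumptions, not the template's config.py.

```python
from typing import Any, Dict, Mapping, Tuple


def env_bool(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Parse values such as "True", "true", or "1" (parsing rules assumed)."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def quantization_settings(
    env: Mapping[str, str],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (collection_fragment, search_params) derived from the flags."""
    if not env_bool(env, "QDRANT_USE_BINARY"):
        return {}, {}
    # Merged into the collection creation body.
    collection_fragment = {
        "quantization_config": {
            "binary": {
                "always_ram": env_bool(env, "QDRANT_BINARY_ALWAYS_RAM", True)
            }
        }
    }
    # Placed under the "params" key of a query request.
    search_params = {
        "quantization": {
            "ignore": env_bool(env, "QDRANT_SEARCH_IGNORE_QUANT", False),
            "rescore": env_bool(env, "QDRANT_SEARCH_RESCORE", True),
            "oversampling": float(env.get("QDRANT_SEARCH_OVERSAMPLING", "2.0")),
        }
    }
    return collection_fragment, search_params
```

Because the quantization config is part of the collection definition, changing QDRANT_USE_BINARY only takes effect once the collection is recreated, which is why the re-index step is needed.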

Using the API

  • GET /health — check dependencies
  • GET /search?q=...&k=5 — top‑k results with payload metadata
  • POST /index (multipart files[]) — upload and index PDFs
  • POST /chat — JSON body with query/options; returns full text and retrieved pages
  • POST /chat/stream — same body; streams text/plain tokens
  • POST /clear/qdrant | /clear/minio | /clear/all — maintenance
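Since /chat/stream returns text/plain chunks, a client has to decode bytes incrementally. Here is a minimal sketch of that, assuming the stream is UTF-8 (the encoding is not stated above) and that the response object exposes a file-like read(). The function name is hypothetical.

```python
import codecs
from typing import BinaryIO, Iterator


def iter_text_chunks(stream: BinaryIO, chunk_size: int = 1024) -> Iterator[str]:
    """Yield decoded text from a byte stream as it arrives."""
    # An incremental decoder copes with multi-byte characters that are
    # split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; this raises if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

With urllib, the object returned by urlopen can be passed in directly. A smaller chunk_size makes tokens appear sooner, at the cost of more read calls.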

API Examples

# Search
curl "http://localhost:8000/search?q=What%20is%20the%20booking%20reference%3F&k=5"

# Chat (non‑streaming)
curl -X POST http://localhost:8000/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "message": "What is the booking reference for case 002?",
    "k": 5,
    "ai_enabled": true
  }'
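The same two calls can be written with Python's standard library. This sketch mirrors the curl examples above and uses the same body fields (message, k, ai_enabled). The helper names are hypothetical, and nothing is sent until the URL or request is passed to urllib.request.urlopen.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

API_BASE = "http://localhost:8000"


def search_url(question: str, k: int = 5, base: str = API_BASE) -> str:
    """Build the GET /search URL with percent-encoded parameters."""
    params = urlencode({"q": question, "k": k}, quote_via=quote)
    return f"{base}/search?{params}"


def chat_request(
    message: str, k: int = 5, ai_enabled: bool = True, base: str = API_BASE
) -> Request:
    """Build the POST /chat request with a JSON body."""
    body = json.dumps({"message": message, "k": k, "ai_enabled": ai_enabled})
    return Request(
        url=f"{base}/chat",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

For example, urlopen(chat_request("What is the booking reference for case 002?")) should return the JSON response once the API is running.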

Why ColPali‑style Retrieval Here?

  • Handles interleaved text+images directly — no lossy OCR pipeline
  • Preserves layout structure (tables, charts, code blocks, equations)
  • Plays nicely with multivector late‑interaction search
  • Pairs well with Qdrant’s prefetch+rerank pattern

If you read my previous post, you’ll also recognize mean‑pooled vectors for fast prefetch and final reranking with full‑res embeddings — the same spirit is here.
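For readers new to late interaction, the score behind the final rerank is usually MaxSim: each query token takes its best-matching document token, and those maxima are summed. A minimal pure-Python sketch, with plain dot products standing in for whatever similarity the collection is configured with:

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(
        max(dot(q_vec, d_vec) for d_vec in doc_tokens)
        for q_vec in query_tokens
    )
```

The cost grows with the product of query and document token counts, which is why running it only on a prefetched candidate set pays off.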

Troubleshooting

  • OpenAI: Verify OPENAI_API_KEY and OPENAI_MODEL if responses error.
  • ColPali API: Ensure the service is up and reachable (GET /health) at COLPALI_API_BASE_URL or via mode URLs.
  • Patch metadata mismatch: Ensure image_patch_start/image_patch_len are returned by /embed/images.
  • Qdrant/MinIO reachability: Check docker compose ps and URLs.
  • Binary quantization toggles: Recreate the collection (e.g., POST /clear/qdrant) and re‑index after changing flags.
  • Poppler on Windows: Install Poppler and add bin/ to PATH so pdf2image can find pdftoppm.
  • Large PDFs on low VRAM: Reduce BATCH_SIZE in config.py.

I hope you find this useful! Let me know if you have any questions or run into any issues.

Just kidding, nobody makes it this far, lel