The Most Beautiful RAG: Starring Colnomic, Qdrant, Minio and Friends

By Athos Georgiou

Hey There!

Welcome to the very first post of my little-scripts series! This is where I'll share smaller, more focused experiments and implementations that don't quite warrant their own full repositories but are too interesting to keep to myself. It's something I've wanted to do for a long time, so here we go!

The first entry in this collection is called colnomic_qdrant_rag, inspired by Advanced Retrieval with ColPali & Qdrant Vector Database, which Qdrant published last year. I've been fascinated by how ColPali solves many of the problems of traditional RAG systems ever since, and with the introduction of Colnomic, I felt it was the right time to give back.

Why Colnomic?

Colnomic Embed Multimodal 3B represents a significant leap forward in multimodal document retrieval. Here's why it's the perfect choice for this RAG implementation:

State-of-the-Art Performance

With an impressive 61.2 NDCG@5 on the Vidore-v2 benchmark, Colnomic outperforms most existing multimodal embedding models, only trailing behind its larger 7B sibling. This puts it ahead of established models like Nomic Embed Multimodal, Voyage Multimodal 3, and others in the space.

Unified Text-Image Processing

Unlike traditional RAG systems that require complex OCR preprocessing and separate text extraction pipelines, Colnomic directly encodes interleaved text and images without any preprocessing. This means:

  • No more lossy OCR conversion steps
  • Preserved visual context and layout information
  • Faster processing by eliminating preprocessing bottlenecks
  • More complete information capture from documents
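
To make that concrete, here's a minimal sketch of OCR-free page embedding. It assumes the colpali-engine package and its Qwen2.5-VL classes (which ColNomic builds on); exact class names may differ between versions:

import torch
from pdf2image import convert_from_path
from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

# Load the model and its processor (assumed class names - check your
# colpali-engine version)
model = ColQwen2_5.from_pretrained(
    "nomic-ai/colnomic-embed-multimodal-3b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
processor = ColQwen2_5_Processor.from_pretrained(
    "nomic-ai/colnomic-embed-multimodal-3b"
)

# Render PDF pages straight to images - no OCR step anywhere
pages = convert_from_path("paper.pdf", dpi=150)
batch = processor.process_images(pages[:4]).to(model.device)
with torch.no_grad():
    embeddings = model(**batch)  # one multi-vector per page
print(embeddings.shape)  # (pages, tokens, 128)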

Perfect for Visual Documents

Colnomic excels at handling the types of documents that challenge traditional text-only systems:

  • Research papers with equations, diagrams, and complex tables
  • Technical documentation with code blocks, flowcharts, and screenshots
  • Financial reports with charts, graphs, and numerical data
  • Product catalogs with images, specifications, and visual elements

Open and Accessible

As an open-weights model with only 3B parameters, Colnomic strikes the perfect balance between performance and accessibility. It's powerful enough for production use while being lightweight enough to run on consumer hardware.

Multi-Vector Architecture

The model's advanced multi-vector architecture allows it to capture nuanced relationships between text and visual elements, making it particularly effective for late-interaction retrieval patterns that we'll be implementing with Qdrant.
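
If you haven't met late interaction before: every query token is compared against every document patch, each token keeps its best match, and the sum of those maxima is the relevance score (MaxSim). A toy version in PyTorch:

import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    # query_vecs: (query_tokens, 128); doc_vecs: (doc_patches, 128)
    sim = query_vecs @ doc_vecs.T       # every token against every patch
    return sim.max(dim=1).values.sum()  # best patch per token, summed

query = torch.randn(20, 128)   # a 20-token query
page = torch.randn(1030, 128)  # ~1,030 patch vectors per page
print(maxsim_score(query, page))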

This combination of performance, efficiency, and ease of use makes Colnomic an ideal foundation for building beautiful, effective RAG systems that can handle the complexity of real-world documents.

Performance Benchmarks

Here's how Colnomic stacks up against other multimodal embedding models on the Vidore-v2 benchmark:

| Model | Avg. | ESG Restaurant Human | Econ Macro Multi. | AXA Multi. | MIT Bio | ESG Restaurant Synth. | ESG Restaurant Synth. Multi. | MIT Bio Multi. | AXA | Econ. Macro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ColNomic Embed Multimodal 7B | 62.7 | 73.9 | 54.7 | 61.3 | 66.1 | 57.3 | 56.7 | 64.2 | 68.3 | 61.6 |
| ColNomic Embed Multimodal 3B | 61.2 | 65.8 | 55.4 | 61.0 | 63.5 | 56.6 | 57.2 | 62.5 | 68.8 | 60.2 |
| T-Systems ColQwen2.5-3B | 59.9 | 72.1 | 51.2 | 60.0 | 65.3 | 51.7 | 53.3 | 61.7 | 69.3 | 54.8 |
| Nomic Embed Multimodal 7B | 59.7 | 65.7 | 57.7 | 59.3 | 64.0 | 49.2 | 51.9 | 61.2 | 66.3 | 63.1 |
| GME Qwen2 7B | 59.0 | 65.8 | 56.2 | 55.4 | 64.0 | 54.3 | 56.7 | 55.1 | 60.7 | 62.9 |
| Nomic Embed Multimodal 3B | 58.8 | 59.8 | 57.5 | 58.8 | 62.5 | 49.4 | 49.4 | 58.6 | 69.6 | 63.5 |
| Llama Index vdr-2b-multi-v1 | 58.4 | 63.1 | 52.8 | 61.0 | 60.6 | 50.3 | 51.2 | 56.9 | 68.8 | 61.2 |
| Voyage Multimodal 3 | 55.0 | 56.1 | 55.0 | 59.5 | 56.4 | 47.2 | 46.2 | 51.5 | 64.1 | 58.8 |

As you can see, ColNomic 3B achieves remarkable performance at 61.2 NDCG@5, positioning it as the second-best model overall while being significantly more efficient than its 7B counterpart.

Colnomic, Qdrant, and MinIO walk into a bar...

...Neon signs flickering "𝔼mbeddings & Bitrates" overhead.

Bartender (eyeing them cautiously): "What'll it be tonight?"

Colnomic leans forward, voice low and syntactically precise:

"I'll have a 128-dimensional multimodal NeRF on the rocks, shaken with Faiss-infused HNSW foam - please skip the L2, I'm in a cosine mood."

Qdrant cracks a grin under its LED display:

"Make mine quantized - double quantized, actually. Start with 8-bit product quantization, then cascade with residual 4-bit PQ. And can you top it off with a sprinkle of IVF-PQ indices? I'm feeling ultra-low-latency tonight."

MinIO raises a single eyebrow and holds up an empty tumbler:

"Nothing for me, thanks - I've already got my object store full and I need to orchestrate these two across a geo-replicated S3 gateway before their throughput degrades."

The bartender nods appreciatively, polishing a hexagonally-shaped glass with an SSD cloth, and quips:

"Distributed consistency, vector compression, AND high-availability failover? Coming right up!"


Okay, enough with the terrible bar jokes. Let's talk about what this little script actually does and why I think it's pretty neat.

What I've Scraped Together

This is a straightforward multimodal RAG system that can:

  • Index PDF documents and images using natural language queries
  • Search through visual content without losing context from complex layouts
  • Provide AI-powered responses about what it finds
  • Run efficiently with binary quantization (90%+ storage reduction!)
  • Scale to large collections with mean pooling and reranking optimization (13x faster search!)
  • Process documents quickly with background image processing (2-3x faster indexing)
  • Optimize for performance with configurable image formats and quality settings

The whole thing is built around a simple CLI interface because, let's be honest, sometimes you just want to ask questions about your documents without dealing with complicated UIs.

The Architecture

As you can see, the architecture is pretty straightforward:

Architecture diagram

Mermaid Diagram (Because why not?)

---
config:
  theme: neo
  layout: dagre
  look: handDrawn
---
flowchart TD
    A["πŸ“„ PDF Documents<br>&amp; Images"] --> B["πŸ” ColPali<br>Multimodal Processing"] & L["πŸ“¦ MinIO<br>Object Storage"]
    B --> C["πŸ“Š Vector Embeddings<br>128-dimensional"]
    C --> D["πŸ—„οΈ Qdrant<br>Binary Quantization"]
    E["πŸ‘€ User Query"] --> F["πŸ€– CLI Interface<br>Interactive/Command"]
    F --> G["πŸ”Ž Semantic Search<br>Qdrant Vector DB"]
    G --> H["πŸ“‹ Retrieved Documents"]
    H --> I["🧠 OpenAI GPT<br>Conversational AI"] & K["πŸ“„ Direct Results"]
    I --> J["πŸ’¬ AI Response"]
    D --> G
    L --> A

    style A fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000
    style B fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#000000
    style D fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px,color:#000000
    style F fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000000
    style I fill:#fce4ec,stroke:#880e4f,stroke-width:2px,color:#000000
    linkStyle 0,1,2,3,4,5,6,7,8,9,10,11 stroke:#888,stroke-width:2px

ColPali handles the heavy lifting of understanding both text and images in documents. Qdrant stores our vectors with binary quantization for speed and efficiency. MinIO manages our document storage. And OpenAI provides the conversational layer when you want more than just search results.

Getting Started

First, you'll need the usual suspects:

  • Python 3.10+
  • Docker & Docker Compose
  • Poppler for PDF processing (the README has platform-specific instructions)

Then it's just:

# Clone and set up
git clone https://github.com/athrael-soju/little-scripts.git
cd little-scripts/colnomic_qdrant_rag
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Start the infrastructure
docker-compose up -d

# Verify services are running
docker-compose ps

# Optional: Add OpenAI key for conversational features
echo "OPENAI_API_KEY=your_key_here" > .env

Two Ways to Use It

Interactive Mode

This is where the script shines. Just run:

python main.py interactive

And you get a nice little interface where you can:

πŸ” colpali[Basic]> What are some interesting UFO sightings?
πŸ” colpali[Basic]> set-mode conversational
πŸ€– colpali[Conversational]> Analyze the visual patterns in these documents
πŸ” colpali[Basic]> upload --file my_document.pdf

Command Line Mode

For when you want to script things:

# Basic search (fast, just returns relevant docs)
python main.py ask "UFO sightings in Texas"

# AI-powered analysis (includes conversational response)
python main.py analyze "What do these UFO images reveal?"

# Upload documents
python main.py upload --file path/to/document.pdf
python main.py upload --file path/to/documents.txt
python main.py upload  # Uses default UFO dataset for testing

# Management commands
python main.py clear-collection  # Clear all documents
python main.py show-status       # Check system status

Using Binary Quantization

Instead of storing full-precision vectors (which can be huge), we use binary quantization in Qdrant. This gives us:

Storage & Memory Benefits

  • 32x storage reduction (96.9% compression) by converting float32 to 1-bit representations
  • Dramatically reduced RAM usage - essential for scaling to millions of vectors
  • Lower memory bandwidth requirements for faster data access

Performance Improvements

  • Up to 40x faster similarity search through optimized bitwise operations
  • Accelerated indexing times since binary vectors are much faster to process
  • SIMD CPU optimizations for blazingly fast Hamming distance calculations
  • Better scaling characteristics as dataset size grows
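
For intuition, binary quantization boils down to sign-thresholding each float into a single bit and comparing vectors with XOR plus a popcount. A toy illustration (Qdrant's internals are SIMD-optimized; this is just the idea):

import numpy as np

def binarize(v: np.ndarray) -> np.ndarray:
    # 1 bit per dimension: 128 float32s (512 bytes) become 16 bytes
    return np.packbits((v > 0).astype(np.uint8))

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    # XOR, then count the differing bits
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

q, d = np.random.randn(128), np.random.randn(128)
print(hamming(binarize(q), binarize(d)))  # 0..128; lower = more similar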

Economic & Operational Benefits

  • Significant cost savings from reduced infrastructure requirements
  • Ability to handle larger datasets on the same hardware
  • Linear performance scaling - benefits multiply as you add more vectors
  • Reduced network transfer costs for distributed deployments

Quality Preservation

  • Minimal impact on search quality thanks to ColPali's robust high-dimensional embeddings
  • Oversampling and rescoring techniques maintain accuracy while preserving speed benefits
  • Particularly effective for embeddings with 1024+ dimensions where redundancy can be exploited

The configuration is pretty straightforward:

# In config.py
MODEL_NAME = "nomic-ai/colnomic-embed-multimodal-3b"
VECTOR_SIZE = 128
SEARCH_LIMIT = 3          # Number of results to return
OVERSAMPLING = 2.0        # Improve recall with oversampling
BATCH_SIZE = 4            # Batch size for indexing

# Mean Pooling and Reranking Optimization
ENABLE_RERANKING_OPTIMIZATION = False  # Enable for large collections
RERANKING_PREFETCH_LIMIT = 200        # Candidates for reranking
RERANKING_SEARCH_LIMIT = 20           # Final results

# Performance Optimization
MINIO_UPLOAD_WORKERS = 4              # Concurrent upload threads
OPTIMIZE_COLLECTION = False           # Enable collection optimization

# Image Configuration
IMAGE_FORMAT = "JPEG"                 # "PNG" or "JPEG"
IMAGE_QUALITY = 85                    # JPEG quality (1-100)
MAX_SAVE_IMAGES = 3                   # Max images saved per query

You can learn more about binary quantization in this article.
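
For reference, the collection setup with the qdrant-client looks roughly like this - a sketch with illustrative names, not the script's exact code:

import numpy as np
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Multi-vector MaxSim comparison plus binary quantization
# ("colnomic_pages" is an illustrative collection name)
client.create_collection(
    collection_name="colnomic_pages",
    vectors_config=models.VectorParams(
        size=128,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
    # Keep the 1-bit vectors in RAM; originals are used for rescoring
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True)
    ),
)

# At query time, oversample against the binary index and rescore with
# the original vectors (random stand-in query for illustration)
query_multivector = np.random.randn(20, 128).tolist()
hits = client.query_points(
    collection_name="colnomic_pages",
    query=query_multivector,
    limit=3,  # SEARCH_LIMIT
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            rescore=True, oversampling=2.0  # OVERSAMPLING
        )
    ),
)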

⚑ Mean Pooling and Reranking Optimization

One of the most exciting additions to this little script is the mean pooling and reranking optimization - a two-stage retrieval system that delivers 13x faster search performance for large document collections.

The Problem with Scale

Traditional ColPali implementations store around 1,030 vectors per document page. While this provides incredible accuracy, it becomes a performance bottleneck when you're dealing with thousands of documents. Imagine searching through 20,000 pages - that's over 20 million vectors to compare!

Inspired by Qdrant's ColPali optimization research, I've implemented a clever two-stage approach:

Stage 1: Fast Prefetch with Mean Pooling

  • Compress ~1,030 vectors per page into just 38 lightweight mean-pooled vectors
  • Quickly search through these compressed representations
  • Retrieve the top 200 candidates using efficient operations
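
The pooling itself is simple: average the page's patch grid row by row. A simplified sketch (the real pipeline also keeps a handful of unpooled special tokens, which is how you land at ~38 vectors per page):

import torch

def mean_pool_rows(page_vecs: torch.Tensor, grid: int = 32) -> torch.Tensor:
    # page_vecs: (patches, 128) image-patch embeddings in row-major order
    rows = page_vecs[: grid * grid].reshape(grid, grid, -1)
    return rows.mean(dim=1)  # one pooled vector per patch row

page = torch.randn(1030, 128)
print(mean_pool_rows(page).shape)  # torch.Size([32, 128])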

Stage 2: Precise Reranking

  • Rerank candidates using the original full-resolution embeddings
  • Deliver the final top results with maximum accuracy
  • Maintain 95.2% of original ColPali accuracy
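
In Qdrant, the two stages map onto a single query with a prefetch clause. A sketch assuming two named vectors per point, "mean_pooled" and "full" (illustrative names, not necessarily the script's):

import numpy as np
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
query_multivector = np.random.randn(20, 128).tolist()  # stand-in query

results = client.query_points(
    collection_name="colnomic_pages",  # illustrative name
    prefetch=models.Prefetch(
        query=query_multivector,
        using="mean_pooled",           # Stage 1: fast, compressed vectors
        limit=200,                     # RERANKING_PREFETCH_LIMIT
    ),
    query=query_multivector,
    using="full",                      # Stage 2: full-resolution rerank
    limit=20,                          # RERANKING_SEARCH_LIMIT
)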

Performance Benefits

The results are pretty impressive:

| Collection Size | Standard Search | Reranking Search | Speedup |
| --- | --- | --- | --- |
| 50 pages | 0.05s | 0.08s | ❌ Slower (overhead) |
| 500 pages | 0.15s | 0.12s | βœ… 1.3x faster |
| 1,200 pages | 1.2s | 0.09s | βœ… 13x faster |
| 20,000 pages | 8.5s | 0.3s | βœ… 28x faster |

When to Use It

The optimization really shines with larger collections (500+ pages). For smaller collections, the overhead actually makes it slower, so the script intelligently defaults to standard search for small document sets.

# Enable in config.py for large collections
ENABLE_RERANKING_OPTIMIZATION = True
RERANKING_PREFETCH_LIMIT = 200  # Candidates for reranking
RERANKING_SEARCH_LIMIT = 20     # Final results

The Trade-offs

Like any optimization, there are trade-offs:

  • Indexing: ~40% slower (more vectors to generate)
  • Memory: ~3x more vector storage required
  • Search: Up to 28x faster for large collections

For most use cases with substantial document collections, the search performance gains far outweigh the indexing costs.

Background Processing and Performance Improvements

I've also added several performance improvements that make the whole system snappier:

πŸš€ 2-3x Faster Indexing

  • Background image processing decouples PDF rendering from embedding generation
  • Concurrent MinIO uploads with configurable worker threads
  • Streamlined metrics with clean, minimal logging
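
The concurrent-upload part might look something like this with the minio client (illustrative, not the script's exact code; the "pages" bucket name is made up):

from concurrent.futures import ThreadPoolExecutor
from io import BytesIO
from minio import Minio

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

def upload_page(name: str, jpeg_bytes: bytes) -> None:
    # "pages" is an illustrative bucket name
    client.put_object("pages", name, BytesIO(jpeg_bytes),
                      length=len(jpeg_bytes), content_type="image/jpeg")

jpeg_pages: list[bytes] = []  # placeholder: rendered page images

# Fan uploads out across worker threads (cf. MINIO_UPLOAD_WORKERS)
with ThreadPoolExecutor(max_workers=4) as pool:
    for i, img in enumerate(jpeg_pages):
        pool.submit(upload_page, f"page_{i}.jpg", img)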

🎨 Configurable Image Formats

Choose your trade-off between quality and performance:

| Format | File Size | Speed | Quality | Best For |
| --- | --- | --- | --- | --- |
| JPEG | 60-80% smaller | Fast | Good | General use, performance |
| PNG | Larger | Slower | Perfect | High-quality documents |

# In config.py
IMAGE_FORMAT = "JPEG"     # or "PNG"
IMAGE_QUALITY = 85        # JPEG quality (1-100)
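
Under the hood, that maps to a Pillow save call along these lines (a sketch; the function name is made up):

from io import BytesIO
from PIL import Image

def encode_page(img: Image.Image, fmt: str = "JPEG", quality: int = 85) -> bytes:
    # JPEG trades a little quality for much smaller, faster-to-upload files
    buf = BytesIO()
    if fmt == "JPEG":
        img.convert("RGB").save(buf, format="JPEG", quality=quality)
    else:
        img.save(buf, format="PNG")
    return buf.getvalue()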

πŸ“Š Clean Performance Metrics

No more verbose debug logs! The system now provides:

  • Clean progress bars during processing
  • Useful summary statistics after completion
  • Real-time processing rates without noise

πŸ“Š Processing Summary
   β€’ Documents: 1,250
   β€’ Time: 45.2s (27.6 docs/sec)
   β€’ Batches: 312 successful, 0 failed
   β€’ Images: 1,245/1,250 uploaded

Service Management

Once you have the Docker services running, you can monitor them:

# Check service status
docker-compose ps

# View logs
docker-compose logs qdrant
docker-compose logs minio

# Stop everything when you're done
docker-compose down

You can also access the service dashboards: with the default Docker Compose setup, Qdrant's web UI is typically at http://localhost:6333/dashboard and the MinIO console at http://localhost:9001 (default credentials: minioadmin / minioadmin).

Environment Configuration

The script supports several environment variables for customization:

# .env file example
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-4.1-mini

# MinIO Configuration (defaults work with Docker Compose)
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin

# Qdrant Configuration
QDRANT_URL=http://localhost:6333
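
Reading these in Python typically looks something like this (a sketch using python-dotenv):

import os
from dotenv import load_dotenv

load_dotenv()  # pulls .env into the process environment
openai_key = os.getenv("OPENAI_API_KEY")
qdrant_url = os.getenv("QDRANT_URL", "http://localhost:6333")
minio_endpoint = os.getenv("MINIO_ENDPOINT", "localhost:9000")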

This Was Super Fun to Put Together!

Working with multimodal embeddings is fascinating. Traditional RAG systems struggle with documents that have complex layouts, images, charts, or equations. They rely on OCR, which often loses context or fails entirely.

ColPali changes this by directly processing the visual representation of documents. It understands that a table isn't just text in rows and columns - it's a visual structure with meaning. Same goes for diagrams, charts, and even the layout of text on a page.

The binary quantization was a pleasant surprise. I expected more quality loss, but ColPali's embeddings are robust enough that the compressed vectors still work beautifully for retrieval.

But the real game-changer has been the mean pooling and reranking optimization. Watching search times drop from 8+ seconds to 0.3 seconds for large collections is genuinely exciting. It shows how clever algorithmic improvements can make such a dramatic difference in user experience.

The background processing improvements were also fun to implement. There's something deeply satisfying about watching clean progress bars zip through thousands of documents without drowning in debug logs.

Troubleshooting

A few common issues you might run into:

# If CUDA runs out of memory, reduce batch size
# Edit config.py and set:
BATCH_SIZE = 2

# If Qdrant connection fails, make sure Docker is running
docker-compose ps
docker-compose up -d

# If PDF processing fails, install Poppler
# Windows: Download from Poppler Windows releases
# macOS: brew install poppler
# Linux: sudo apt-get install poppler-utils

# Check if model downloaded correctly
python -c "from transformers import AutoProcessor; AutoProcessor.from_pretrained('nomic-ai/colnomic-embed-multimodal-3b')"

# For performance issues with large collections:
# Enable reranking optimization but clear existing collection first
python main.py clear-collection
# Then edit config.py:
ENABLE_RERANKING_OPTIMIZATION = True
# And rebuild:
python main.py upload

# If MinIO uploads are slow, increase workers:
MINIO_UPLOAD_WORKERS = 8

# For faster indexing with lower quality:
IMAGE_FORMAT = "JPEG"
IMAGE_QUALITY = 70

There's Definitely Room for Improvement

Even with all the recent improvements, this is still a little script with room for enhancement:

  • Better error handling for edge cases
  • Support for more document formats beyond PDF (Word, PowerPoint, etc.)
  • Fine-tuning capabilities for domain-specific documents
  • Web interface for less technical users
  • Distributed processing for massive document collections
  • Caching layers for frequently accessed documents
  • Advanced filtering by document metadata and timestamps

Try It Out

If you're curious about multimodal RAG or just want to play with some interesting technology, give it a shot. The default UFO dataset makes for some fun testing, and you can easily swap in your own documents.

The whole thing runs in Docker containers, so cleanup is as simple as docker-compose down when you're done experimenting.

That's all Folks!

This little script scratches an itch I've had for a while - how do you build a RAG system that actually understands visual documents? ColPali provides an elegant solution, and pairing it with Qdrant's binary quantization makes it practical for real-world use.

With the recent additions - mean pooling optimization, background processing, and configurable image formats - this script has evolved from a simple proof-of-concept into something that can handle serious workloads. The 13x search performance improvement and 2-3x faster indexing make it genuinely useful for large document collections.

This script is not production ready. But it's a solid foundation for anyone looking to experiment with multimodal retrieval, and the code is simple enough to understand and modify. The performance optimizations show how algorithmic improvements can make dramatic differences in user experience.

Plus, watching it correctly identify specific charts or equations in research papers - now at blazing speed - never gets old.


The complete code and setup instructions are available in the little-scripts repository. Feel free to fork it, break it, or make it better!