The Most Beautiful RAG: Starring Colnomic, Qdrant, Minio and Friends

Authors
  • Athos Georgiou

Hey There!

Welcome to the very first post of my little-scripts series! This is where I'll be building smaller, more focused experiments that don't quite warrant their own full repositories but that I find too interesting to keep to myself. It's something I've wanted to do for a long time, so here we go!

The first entry in this collection is called colnomic_qdrant_rag, inspired by Advanced Retrieval with ColPali & Qdrant Vector Database, which Qdrant published last year. I've been fascinated by how ColPali solves many of the issues of traditional RAG systems ever since, and with the introduction of Colnomic, I felt it was the right time to give back.

Why Colnomic?

Colnomic Embed Multimodal 3B represents a significant leap forward in multimodal document retrieval. Here's why it's the perfect choice for this RAG implementation:

State-of-the-Art Performance

With an impressive 61.2 NDCG@5 on the Vidore-v2 benchmark, Colnomic outperforms most existing multimodal embedding models, only trailing behind its larger 7B sibling. This puts it ahead of established models like Nomic Embed Multimodal, Voyage Multimodal 3, and others in the space.

Unified Text-Image Processing

Unlike traditional RAG systems that require complex OCR preprocessing and separate text extraction pipelines, Colnomic directly encodes interleaved text and images without any preprocessing (see the sketch after this list). This means:

  • No more lossy OCR conversion steps
  • Preserved visual context and layout information
  • Faster processing by eliminating preprocessing bottlenecks
  • More complete information capture from documents
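To make the no-preprocessing claim concrete, here's a minimal sketch of encoding a page image directly, assuming the colpali-engine classes that Colnomic's model card points to (treat the exact class names and output shapes as assumptions rather than gospel):

from PIL import Image
import torch
from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "nomic-ai/colnomic-embed-multimodal-3b"
model = ColQwen2_5.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColQwen2_5_Processor.from_pretrained(model_name)

# The rendered page goes straight into the model - no OCR, no text extraction
page = Image.open("page_1.png")
batch = processor.process_images([page]).to(model.device)
with torch.no_grad():
    page_embeddings = model(**batch)  # one ~128-dim vector per image patch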

Perfect for Visual Documents

Colnomic excels at handling the types of documents that challenge traditional text-only systems:

  • Research papers with equations, diagrams, and complex tables
  • Technical documentation with code blocks, flowcharts, and screenshots
  • Financial reports with charts, graphs, and numerical data
  • Product catalogs with images, specifications, and visual elements

Open and Accessible

As an open-weights model with only 3B parameters, Colnomic strikes the perfect balance between performance and accessibility. It's powerful enough for production use while being lightweight enough to run on consumer hardware.

Multi-Vector Architecture

The model's advanced multi-vector architecture allows it to capture nuanced relationships between text and visual elements, making it particularly effective for late-interaction retrieval patterns that we'll be implementing with Qdrant.
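Late interaction is simple to state in code: each page becomes a bag of patch vectors, each query a bag of token vectors, and the score sums the best patch match per query token (MaxSim). A self-contained sketch with toy tensors:

import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """query_vecs: (query_tokens, dim); doc_vecs: (patches, dim)."""
    sim = query_vecs @ doc_vecs.T          # dot product of every token/patch pair
    return sim.max(dim=1).values.sum()     # best patch per token, summed

query = torch.randn(20, 128)   # ~20 query token vectors, 128 dims each
page = torch.randn(700, 128)   # ~700 patch vectors for one page
print(maxsim_score(query, page))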

This combination of performance, efficiency, and ease of use makes Colnomic an ideal foundation for building beautiful, effective RAG systems that can handle the complexity of real-world documents.

Performance Benchmarks

Here's how Colnomic stacks up against other multimodal embedding models on the Vidore-v2 benchmark:

| Model | Avg. | ESG Restaurant Human | Econ Macro Multi. | AXA Multi. | MIT Bio | ESG Restaurant Synth. | ESG Restaurant Synth. Multi. | MIT Bio Multi. | AXA | Econ. Macro |
|---|---|---|---|---|---|---|---|---|---|---|
| ColNomic Embed Multimodal 7B | 62.7 | 73.9 | 54.7 | 61.3 | 66.1 | 57.3 | 56.7 | 64.2 | 68.3 | 61.6 |
| ColNomic Embed Multimodal 3B | 61.2 | 65.8 | 55.4 | 61.0 | 63.5 | 56.6 | 57.2 | 62.5 | 68.8 | 60.2 |
| T-Systems ColQwen2.5-3B | 59.9 | 72.1 | 51.2 | 60.0 | 65.3 | 51.7 | 53.3 | 61.7 | 69.3 | 54.8 |
| Nomic Embed Multimodal 7B | 59.7 | 65.7 | 57.7 | 59.3 | 64.0 | 49.2 | 51.9 | 61.2 | 66.3 | 63.1 |
| GME Qwen2 7B | 59.0 | 65.8 | 56.2 | 55.4 | 64.0 | 54.3 | 56.7 | 55.1 | 60.7 | 62.9 |
| Nomic Embed Multimodal 3B | 58.8 | 59.8 | 57.5 | 58.8 | 62.5 | 49.4 | 49.4 | 58.6 | 69.6 | 63.5 |
| Llama Index vdr-2b-multi-v1 | 58.4 | 63.1 | 52.8 | 61.0 | 60.6 | 50.3 | 51.2 | 56.9 | 68.8 | 61.2 |
| Voyage Multimodal 3 | 55.0 | 56.1 | 55.0 | 59.5 | 56.4 | 47.2 | 46.2 | 51.5 | 64.1 | 58.8 |

As you can see, ColNomic 3B achieves remarkable performance at 61.2 NDCG@5, positioning it as the second-best model overall while being significantly more efficient than its 7B counterpart.

Colnomic, Qdrant, and MinIO walk into a bar...

...Neon signs flickering "𝔼mbeddings & Bitrates" overhead.

Bartender (eyeing them cautiously): "What'll it be tonight?"

Colnomic leans forward, voice low and syntactically precise:

"I'll have a 128-dimensional multimodal NeRF on the rocks, shaken with Faiss-infused HNSW foam - please skip the L2, I'm in a cosine mood."

Qdrant cracks a grin under its LED display:

"Make mine quantized - double quantized, actually. Start with 8-bit product quantization, then cascade with residual 4-bit PQ. And can you top it off with a sprinkle of IVF-PQ indices? I'm feeling ultra-low-latency tonight."

MinIO raises a single eyebrow and holds up an empty tumbler:

"Nothing for me, thanks - I've already got my object store full and I need to orchestrate these two across a geo-replicated S3 gateway before their throughput degrades."

The bartender nods appreciatively, polishing a hexagon-shaped glass with an SSD cloth, and quips:

"Distributed consistency, vector compression, AND high-availability failover? Coming right up!"


Okay, enough with the terrible bar jokes. Let's talk about what this little script actually does and why I think it's pretty neat.

What I've Scraped Together

This is a straightforward multimodal RAG system that can:

  • Index PDF documents and images for multimodal retrieval
  • Search through visual content with natural language queries, without losing context from complex layouts
  • Provide AI-powered responses about what it finds
  • Run efficiently with binary quantization (90%+ storage reduction!)

The whole thing is built around a simple CLI interface because, let's be honest, sometimes you just want to ask questions about your documents without dealing with complicated UIs.

The Architecture

As you can see, the architecture is pretty straightforward:

Architecture diagram

Mermaid Diagram (Because why not?)

---
config:
  theme: neo
  layout: dagre
  look: handDrawn
---
flowchart TD
    A["πŸ“„ PDF Documents<br>&amp; Images"] --> B["πŸ” ColPali<br>Multimodal Processing"] & L["πŸ“¦ MinIO<br>Object Storage"]
    B --> C["πŸ“Š Vector Embeddings<br>128-dimensional"]
    C --> D["πŸ—„οΈ Qdrant<br>Binary Quantization"]
    E["πŸ‘€ User Query"] --> F["πŸ€– CLI Interface<br>Interactive/Command"]
    F --> G["πŸ”Ž Semantic Search<br>Qdrant Vector DB"]
    G --> H["πŸ“‹ Retrieved Documents"]
    H --> I["🧠 OpenAI GPT<br>Conversational AI"] & K["πŸ“„ Direct Results"]
    I --> J["πŸ’¬ AI Response"]
    D --> G
    L --> A

    style A fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000
    style B fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#000000
    style D fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px,color:#000000
    style F fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000000
    style I fill:#fce4ec,stroke:#880e4f,stroke-width:2px,color:#000000
    linkStyle 0,1,2,3,4,5,6,7,8,9,10,11 stroke:#888,stroke-width:2px

ColPali handles the heavy lifting of understanding both text and images in documents. Qdrant stores our vectors with binary quantization for speed and efficiency. MinIO manages our document storage. And OpenAI provides the conversational layer when you want more than just search results.
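Wiring those pieces together is mostly client code. Here's a hedged sketch of the indexing path: the bucket name, collection name, and payload fields are illustrative, the embedding is a toy stand-in for real Colnomic output, and the Qdrant collection is assumed to already exist with a multi-vector config (see the quantization section below):

import io
from minio import Minio
from qdrant_client import QdrantClient, models

# Toy stand-ins; in the real flow these come from the Colnomic encoder
page_embedding = [[0.1] * 128 for _ in range(700)]  # ~700 patch vectors per page
page_png = open("page_1.png", "rb").read()

qdrant = QdrantClient(url="http://localhost:6333")
minio = Minio("localhost:9000", access_key="minioadmin",
              secret_key="minioadmin", secure=False)

# The page image lands in MinIO object storage...
minio.put_object("documents", "doc1/page_1.png",
                 io.BytesIO(page_png), length=len(page_png))

# ...and its multi-vector embedding lands in Qdrant, with a payload linking back
qdrant.upsert(
    collection_name="documents",
    points=[models.PointStruct(
        id=1,
        vector=page_embedding,
        payload={"object_key": "doc1/page_1.png", "page": 1},
    )],
)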

Getting Started

First, you'll need the usual suspects:

  • Python 3.10+
  • Docker & Docker Compose
  • Poppler for PDF processing (the README has platform-specific instructions)

Then it's just:

# Clone and set up
git clone https://github.com/athrael-soju/little-scripts.git
cd little-scripts/colnomic_qdrant_rag
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Start the infrastructure
docker-compose up -d

# Verify services are running
docker-compose ps

# Optional: Add OpenAI key for conversational features
echo "OPENAI_API_KEY=your_key_here" > .env

Two Ways to Use It

This is where the script shines. Just run:

python main.py interactive

And you get a nice little interface where you can:

πŸ” colpali[Basic]> What are some interesting UFO sightings?
πŸ€– colpali[Conversational]> set-mode conversational
πŸ€– colpali[Conversational]> Analyze the visual patterns in these documents
πŸ” colpali[Basic]> upload --file my_document.pdf

Command Line Mode

For when you want to script things:

# Basic search (fast, just returns relevant docs)
python main.py ask "UFO sightings in Texas"

# AI-powered analysis (includes conversational response)
python main.py analyze "What do these UFO images reveal?"

# Upload documents
python main.py upload --file path/to/document.pdf
python main.py upload --file path/to/documents.txt
python main.py upload  # Uses default UFO dataset for testing

# Management commands
python main.py clear-collection  # Clear all documents
python main.py show-status       # Check system status

Using Binary Quantization

Instead of storing full-precision vectors (which can be huge), we use binary quantization in Qdrant. This gives us:

Storage & Memory Benefits

  • 32x storage reduction (96.9% compression) by converting float32 vectors to 1-bit representations - quick arithmetic below
  • Dramatically reduced RAM usage - essential for scaling to millions of vectors
  • Lower memory bandwidth requirements for faster data access
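That 32x figure is just bit arithmetic:

# Storage for one 128-dimensional vector
float32_bytes = 128 * 4   # 512 bytes at full precision
binary_bytes = 128 // 8   # 16 bytes at 1 bit per dimension
print(f"{float32_bytes // binary_bytes}x smaller")  # 32x (96.9% reduction)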

Performance Improvements

  • Up to 40x faster similarity search through optimized bitwise operations
  • Accelerated indexing times since binary vectors are much faster to process
  • SIMD CPU optimizations for blazingly fast Hamming distance calculations
  • Better scaling characteristics as dataset size grows

Economic & Operational Benefits

  • Significant cost savings from reduced infrastructure requirements
  • Ability to handle larger datasets on the same hardware
  • Linear performance scaling - benefits multiply as you add more vectors
  • Reduced network transfer costs for distributed deployments

Quality Preservation

  • Minimal impact on search quality thanks to ColPali's robust high-dimensional embeddings
  • Oversampling and rescoring techniques maintain accuracy while preserving speed benefits
  • Particularly effective for embeddings with 1024+ dimensions where redundancy can be exploited

The configuration is pretty straightforward:

# In config.py
MODEL_NAME = "nomic-ai/colnomic-embed-multimodal-3b"
VECTOR_SIZE = 128
SEARCH_LIMIT = 3          # Number of results to return
OVERSAMPLING = 2.0        # Improve recall with oversampling
BATCH_SIZE = 4            # Batch size for indexing
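Under the hood, those settings plausibly map onto a Qdrant collection with a MaxSim multi-vector config plus binary quantization, and onto rescored queries with oversampling. The parameter names below are the qdrant-client's; how main.py actually wires them together is my assumption:

import random
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Multi-vector collection (late interaction) with binary quantization
client.create_collection(
    collection_name="documents",
    vectors_config=models.VectorParams(
        size=128,                      # VECTOR_SIZE
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True)
    ),
)

# Query with oversampling + rescoring to claw back accuracy
query_multivector = [[random.random() for _ in range(128)]
                     for _ in range(20)]  # toy stand-in for a real query embedding
results = client.query_points(
    collection_name="documents",
    query=query_multivector,
    limit=3,                           # SEARCH_LIMIT
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            rescore=True,
            oversampling=2.0,          # OVERSAMPLING
        )
    ),
)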

You can learn more about binary quantization in this article.

Service Management

Once you have the Docker services running, you can monitor them:

# Check service status
docker-compose ps

# View logs
docker-compose logs qdrant
docker-compose logs minio

# Stop everything when you're done
docker-compose down

You can also access the service dashboards (the standard defaults for this Docker Compose setup):

  • Qdrant Web UI: http://localhost:6333/dashboard
  • MinIO Console: http://localhost:9001 (default credentials: minioadmin / minioadmin)

Environment Configuration

The script supports several environment variables for customization:

# .env file example
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-4.1-mini

# MinIO Configuration (defaults work with Docker Compose)
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin

# Qdrant Configuration
QDRANT_URL=http://localhost:6333
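Reading these is one load_dotenv() away. A sketch of how such a script might pick them up (not necessarily how main.py does it):

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the working directory

openai_key = os.getenv("OPENAI_API_KEY")                       # optional feature
qdrant_url = os.getenv("QDRANT_URL", "http://localhost:6333")
minio_endpoint = os.getenv("MINIO_ENDPOINT", "localhost:9000")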

This Was Super Fun to Put Together!

Working with multimodal embeddings is fascinating. Traditional RAG systems struggle with documents that have complex layouts, images, charts, or equations. They rely on OCR, which often loses context or fails entirely.

ColPali changes this by directly processing the visual representation of documents. It understands that a table isn't just text in rows and columns - it's a visual structure with meaning. Same goes for diagrams, charts, and even the layout of text on a page.

The binary quantization was a pleasant surprise. I expected more quality loss, but ColPali's embeddings are robust enough that the compressed vectors still work beautifully for retrieval.

Troubleshooting

A few common issues you might run into:

# If CUDA runs out of memory, reduce batch size
# Edit config.py and set:
BATCH_SIZE = 2

# If Qdrant connection fails, make sure Docker is running
docker-compose ps
docker-compose up -d

# If PDF processing fails, install Poppler
# Windows: Download from Poppler Windows releases
# macOS: brew install poppler
# Linux: sudo apt-get install poppler-utils

# Check if model downloaded correctly
python -c "from transformers import AutoProcessor; AutoProcessor.from_pretrained('nomic-ai/colnomic-embed-multimodal-3b')"

There's Definitely Room for Improvement

This is just a little script, so there's plenty of room for enhancement:

  • Better error handling for edge cases
  • Support for more document formats beyond PDF
  • Fine-tuning capabilities for domain-specific documents
  • Web interface for less technical users
  • Batch processing for large document collections

Try It Out

If you're curious about multimodal RAG or just want to play with some interesting technology, give it a shot. The default UFO dataset makes for some fun testing, and you can easily swap in your own documents.

The whole thing runs in Docker containers, so cleanup is as simple as docker-compose down when you're done experimenting.

That's all Folks!

This little script scratches an itch I've had for a while - how do you build a RAG system that actually understands visual documents? ColPali provides an elegant solution, and pairing it with Qdrant's binary quantization makes it practical for real-world use.

This script is not production ready. But it's a solid foundation for anyone looking to experiment with multimodal retrieval, and the code is simple enough to understand and modify.

Plus, watching it correctly identify specific charts or equations in research papers never gets old.


The complete code and setup instructions are available in the little-scripts repository. Feel free to fork it, break it, or make it better!