ColQwen2.5 FastAPI Integration

By Athos Georgiou

Tired of ColPali? Too bad!

I'm an avid ColPali fan, perhaps even an obsessive one. But if there's one thing I could have done better when developing applications with it, it's decoupling the model-specific implementations from the rest of the application. Since inference servers such as Ollama, vLLM, LMStudio, and Hugging Face don't officially support ColPali, I decided to build a FastAPI server that does exactly that.

Motivation

Getting ColPali to work as a proof of concept isn't particularly difficult—resource-intensive, yes, but you can find numerous implementations on Hugging Face. The real challenge lies in finding production-ready solutions that properly separate the model from the application logic. Even the best AI code assistants struggle with this. Another pain point is that every code change triggers a reload of the checkpoint shards, causing significant and unnecessary delays during development.

One example I'm particularly fond of is Optimizing ColPali for Retrieval at Scale, 13x Faster Results by Qdrant, which demonstrates how to scale ColPali by combining pooled first-stage retrieval with full ColPali reranking in Qdrant. I implemented this approach in my colnomic_qdrant_rag little-script. However, when I attempted to extract the model-related logic from the application, I found it surprisingly challenging. After numerous attempts to separate the model logic from the vector database (especially the Mean Pooling implementation), I decided to start fresh and build a FastAPI server that handles ColPali requests over HTTP.
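To make the sticking point concrete, the row-wise mean pooling from the Qdrant recipe boils down to something like the following PyTorch sketch. The function name and the row-major layout assumption are mine, not the project's API, and it ignores the model's special tokens for simplicity:

import torch

def mean_pool_rows(patch_embeddings: torch.Tensor, n_rows: int, n_cols: int) -> torch.Tensor:
    # patch_embeddings: (n_rows * n_cols, dim), patch vectors in row-major order
    dim = patch_embeddings.shape[-1]
    grid = patch_embeddings.view(n_rows, n_cols, dim)
    # One pooled vector per image row -> (n_rows, dim), used for the fast first-stage search
    return grid.mean(dim=1)

The catch is that the grid shape depends on model-specific details like patch size and spatial merge size, which is exactly the kind of logic that kept leaking into my application code.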

The little-script

The project, called colqwen_fastapi, is a FastAPI server that handles ColPali requests with some additional features. Beyond the standard query and image embedding endpoints, it exposes model and processor-related endpoints to support optimizations like Pooled Retrieval and Reranking. You can find it in my little-scripts repository on GitHub.


Endpoints

  • /info returns a JSON object with helpful parameters for Pooled Retrieval:
    • version: The version of the model
    • device: The device the model is running on
    • dtype: The data type of the model
    • flash_attn: Whether flash attention is enabled
    • spatial_merge_size: The spatial merge size
    • parameters: The number of parameters in the model
    • dim: The embedding dimension of the model's output vectors
    • image_token_id: The image token ID
  • /patches calculates the number of patches for a given image size and spatial merge size (see the client sketch below)
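
As a rough illustration, here is how a client might consume these two endpoints with Python's requests library. The query parameter names for /patches are illustrative guesses, not the actual schema; check the generated /docs for the real one:

import requests

BASE_URL = "http://localhost:8000"

# Model parameters needed for pooled retrieval
info = requests.get(f"{BASE_URL}/info").json()
print(info["dim"], info["spatial_merge_size"])

# Patch count for an image of a given size
# (parameter names are illustrative; see /docs for the actual schema)
patches = requests.get(
    f"{BASE_URL}/patches",
    params={"image_width": 1024, "image_height": 768},
).json()
print(patches)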

Usage

You can run this server locally or on a remote server as a backend for your application. Important considerations:

  • Security: Secure all endpoints, as they expose model functionality (a minimal sketch follows this list)
  • Hardware: ColPali models benefit significantly from GPU acceleration
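
On the security point, one lightweight option is FastAPI's built-in API-key header dependency. This isn't part of the project, just a sketch of the general approach:

from fastapi import FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader

API_KEY = "change-me"  # load from an environment variable in practice
api_key_header = APIKeyHeader(name="X-API-Key")

def require_api_key(key: str = Security(api_key_header)) -> None:
    # Reject any request that doesn't present the expected key
    if key != API_KEY:
        raise HTTPException(status_code=403, detail="Invalid API key")

app = FastAPI(dependencies=[Security(require_api_key)])

With that out of the way, here's how to get the server up and running: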
# Clone the repository
git clone https://github.com/athrael-soju/little-scripts.git
cd colqwen_fastapi

# Set up your environment
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Start the server
python app.py

# Verify the server is running
http GET http://localhost:8000/info

# Check that the interactive docs are up (open /docs in a browser to execute requests)
http GET http://localhost:8000/docs

# Verify the OpenAPI schema is served
http GET http://localhost:8000/openapi.json

# Generate the OpenAPI schema
python generate_openapi.py
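
Once the server hands you multivector embeddings for a query and an image, ranking is the standard ColPali late-interaction (MaxSim) score, which you can compute client-side. A minimal PyTorch version, assuming you've already converted the server's JSON responses into tensors:

import torch

def maxsim_score(query: torch.Tensor, doc: torch.Tensor) -> torch.Tensor:
    # query: (n_query_tokens, dim); doc: (n_doc_tokens, dim)
    # Each query token is matched to its most similar document token,
    # and the per-token maxima are summed into one relevance score.
    sims = query @ doc.T  # (n_query_tokens, n_doc_tokens)
    return sims.max(dim=1).values.sum()

This is the same scoring used in the full-reranking stage of the Qdrant recipe mentioned above.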

That's all for now!

Feel free to grab the code from GitHub and let me know if you have any questions, feedback, or suggestions for additional features/endpoints!