You too can run the Vidore Benchmark with less than 32GB of GPU VRAM

Author: Athos Georgiou

TL;DR
Yes, you too can run the ViDoRe benchmark on a Poorware GPU. Use half‑precision, small batches, and keep preprocessing lean.
Recommended settings
- Precision: use `float16` (or `bfloat16` if supported).
- Batch size: start at `1`; bump to `2` only if you have headroom.
- Resolution: cap the long edge (e.g., 1024px) if the benchmark config allows; large images spike VRAM.
- Workers: keep dataloader workers modest (`num_workers=2–4`) to avoid host↔device stalls.
- Pinned memory: enable if your pipeline benefits; disable if you see host RAM pressure.
- CUDA allocator: `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` helps with fragmentation on long runs.
- Determinism: turn off cuDNN autotune if you need reproducibility.
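If your harness lets you control resolution yourself, the resize math behind the long-edge cap is simple. A minimal sketch in plain Python (the `cap_long_edge` helper is hypothetical, not part of vidore-benchmark or MTEB):

```python
def cap_long_edge(width: int, height: int, max_edge: int = 1024) -> tuple[int, int]:
    """Return (new_width, new_height) with the long edge capped at max_edge,
    preserving aspect ratio. Images already within the cap are unchanged."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale)
```

You would feed the result to whatever resize your image loader uses (e.g., PIL's `Image.resize`); halving a page's long edge roughly quarters its activation memory.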
Minimal Python runner (MTEB)
Install deps and run this minimal script to execute ViDoRe via MTEB with tiny batches.
```shell
pip install mteb colpali-engine
pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu129
```
```python
# mteb_run_vidore.py
import os

import torch
import mteb

# --- (optional but helpful) allocator + precision tweaks ---
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:128"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = os.environ["PYTORCH_ALLOC_CONF"]  # backward compat
torch.set_float32_matmul_precision("medium")  # slightly lighter kernels

# --- load the pre-defined model from MTEB ---
model_name = "vidore/colqwen2.5-v0.2"
model = mteb.get_model(model_name)  # uses the correct wrapper internally

# --- select the ViDoRe benchmarks ---
benchmarks = mteb.get_benchmarks(names=["ViDoRe(v1)", "ViDoRe(v2)"])
evaluator = mteb.MTEB(tasks=benchmarks)

# --- run with small batches to stay under ~32 GB VRAM ---
results = evaluator.run(
    model,
    encode_kwargs={"batch_size": 1},  # This number measures your GPU poorness. A 1 means you're pretty darn poor, but better off than most. Remember, glass is half full.
    verbosity=2,
)
print(results)
```
Grab the Code
You can find the code here.
Troubleshooting
Still OOM at `batch_size=1`?
- Lower the image resolution cap.
- Ensure models/embedders load in half‑precision.
- Free unused CUDA memory between phases (`torch.cuda.empty_cache()` in long loops).
- Close background viewers/notebooks that reserve VRAM.
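The defensive version of "start small, only grow if you can" can also run the other way: start at your preferred batch size and halve it when CUDA throws an OOM. A plain-Python sketch (`encode_with_backoff` is a hypothetical helper, not an MTEB API; a real version would also call `torch.cuda.empty_cache()` after catching the error):

```python
def encode_with_backoff(encode, items, batch_size=4, min_batch=1):
    """Encode items in batches; on an out-of-memory RuntimeError, halve the
    batch size and retry from scratch. Re-raises once it can't shrink further."""
    while True:
        try:
            out = []
            for i in range(0, len(items), batch_size):
                out.extend(encode(items[i:i + batch_size]))
            return out
        except RuntimeError as e:
            if "out of memory" not in str(e).lower() or batch_size <= min_batch:
                raise
            batch_size //= 2  # e.g., 4 -> 2 -> 1
```

Restarting from scratch is wasteful but simple; on a benchmark run you only pay the cost once, on the first oversized batch.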
Throughput too low?
- Increase `num_workers` gradually (watch host RAM).
- Try `bfloat16` on newer GPUs for stability.
- Profile I/O: decode/resize is often the bottleneck, not the model.
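To see whether decode/resize or the model is eating your wall clock, a few stage timers are enough. A minimal sketch with stdlib only (`timed` and `stage_totals` are hypothetical names, not part of any library here):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per pipeline stage.
stage_totals = defaultdict(float)

@contextmanager
def timed(stage: str):
    """Context manager that adds the elapsed time of its body to stage_totals."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start

def report():
    """Print stages sorted by total time, slowest first."""
    for stage, total in sorted(stage_totals.items(), key=lambda kv: -kv[1]):
        print(f"{stage:>10}: {total:.3f}s")
```

Wrap each phase of your loop (`with timed("decode"): ...`, `with timed("encode"): ...`) and call `report()` at the end; if decode dominates, more `num_workers` or a lower resolution cap will help more than model-side tweaks.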
Still facing issues?
- Check the vidore-benchmark and mteb repos for more details.
Interested in helping?
- Quite curious to see how it performs on your GPU, or even on MPS.
That’s it, have fun, bye bye!