
You too can run the ViDoRe Benchmark with less than 32 GB of GPU VRAM

Author: Athos Georgiou

TL;DR

Yes, you too can run the ViDoRe benchmark on a modest ("poorware") GPU. Use half‑precision, small batches, and keep preprocessing lean.

  • Precision: use float16 (or bfloat16 if supported).
  • Batch size: start at 1; bump to 2 only if you have headroom.
  • Resolution: cap the long edge (e.g., 1024px) if the benchmark config allows; large images spike VRAM.
  • Workers: keep dataloader workers modest (num_workers=2–4) to avoid host↔device stalls.
  • Pinned memory: enable if your pipeline benefits; disable if you see host RAM pressure.
  • CUDA allocator: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 helps fragmentation on long runs.
  • Determinism: turn off extra cudnn autotune if you need reproducibility.
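The resolution cap above is easy to get wrong; here is a minimal sketch of the idea. The `cap_long_edge` helper is hypothetical (not part of MTEB or ViDoRe), it just computes resized dimensions while preserving aspect ratio:

```python
def cap_long_edge(width: int, height: int, max_edge: int = 1024) -> tuple[int, int]:
    """Scale (width, height) down so the longer edge is at most max_edge.

    Hypothetical helper for pre-resizing page images before encoding;
    aspect ratio is preserved, and images already small enough pass through.
    """
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale)
```

For example, `cap_long_edge(2048, 1536)` returns `(1024, 768)`, while `cap_long_edge(800, 600)` is left untouched.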

Minimal Python runner (MTEB)

Install deps and run this minimal script to execute ViDoRe via MTEB with tiny batches.

pip install mteb colpali-engine
pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu129
# mteb_run_vidore.py
import os, torch, mteb

# --- (optional but helpful) allocator + precision tweaks ---
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:128"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = os.environ["PYTORCH_ALLOC_CONF"]  # backward compat
torch.set_float32_matmul_precision("medium")  # slightly lighter kernels

# --- load the pre-defined model from MTEB ---
model_name = "vidore/colqwen2.5-v0.2"
model = mteb.get_model(model_name)  # uses the correct wrapper internally

# --- select the ViDoRe benchmarks ---
benchmarks = mteb.get_benchmarks(names=["ViDoRe(v1)", "ViDoRe(v2)"])
evaluator = mteb.MTEB(tasks=benchmarks)

# --- run with small batches to stay under ~32 GB VRAM ---
results = evaluator.run(
    model,
    encode_kwargs={"batch_size": 1},     # This number measures your GPU poorness. A 1 means you're pretty darn poor, but better off than most. Remember, glass is half full.
    verbosity=2,
)
print(results)
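If you want to try batch_size=2 first and drop back to 1 on OOM, a small fallback wrapper works; this is my own sketch, not an MTEB feature. It relies on the fact that PyTorch surfaces CUDA OOM as a RuntimeError whose message contains "out of memory":

```python
def run_with_batch_fallback(run_fn, batch_sizes=(2, 1)):
    """Call run_fn(batch_size) with each size in turn, retrying on CUDA OOM.

    run_fn is a hypothetical callable, e.g.
    lambda bs: evaluator.run(model, encode_kwargs={"batch_size": bs}).
    """
    last_err = None
    for bs in batch_sizes:
        try:
            return run_fn(bs)
        except RuntimeError as err:
            # CUDA OOM surfaces as a RuntimeError mentioning "out of memory";
            # anything else is a real bug and should propagate.
            if "out of memory" not in str(err).lower():
                raise
            last_err = err
    raise last_err
```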

Grab the Code

You can find the code here.

Troubleshooting

  • Still OOM at batch_size=1?

    • Lower image resolution cap.
    • Ensure models/embedders load in half‑precision.
    • Free unused CUDA memory between phases (torch.cuda.empty_cache() in long loops).
    • Close background viewers/notebooks that reserve VRAM.
  • Throughput too low?

    • Increase num_workers gradually (watch host RAM).
    • Try bfloat16 on newer GPUs for stability.
    • Profile I/O—decode/resize often bottlenecks, not the model.
  • Still facing issues?
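The bfloat16 tip above can be sketched as a tiny dtype-selection helper. It is pure logic so it runs anywhere; on a real machine you would feed it `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()` (the function itself is hypothetical):

```python
def pick_dtype(has_cuda: bool, bf16_supported: bool) -> str:
    """Pick an evaluation dtype name.

    bfloat16 on newer GPUs (same memory footprint as float16 but a wider
    exponent range, so fewer overflow surprises), float16 otherwise, and
    float32 on CPU-only hosts. Hypothetical helper, not part of MTEB.
    """
    if not has_cuda:
        return "float32"
    return "bfloat16" if bf16_supported else "float16"
```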

Interested in helping?

  • I'm quite curious to see how it performs on your GPU, or even on Apple Silicon (MPS)

That’s it, have fun, bye bye!