Mine the Way Your Model Scores: MaxSim Hard-Negative Mining for a Late-Interaction Student

Here’s a mismatch baked into the conventional hard-negative mining pipeline. I mine hard negatives with a single-vector bi-encoder, scoring candidates by cosine similarity. But the model I’m actually training is a late-interaction retriever, and it scores documents with MaxSim, a completely different geometry. The thing that picks my hard negatives and the thing that learns from them disagree about what “similar” means.

It’s just inherited convention. So I went and questioned it.

[Illustration: 1950s comic-style drawing of two scientists measuring the same document with two completely different devices, one a simple ruler labeled COSINE and the other an elaborate multi-armed caliper labeled MAXSIM, each getting a different number and looking confused at each other.]

TLDR: The standard approach mines hard negatives with single-vector cosine, but my model scores documents with multi-vector MaxSim, so the miner and the model disagree about what “hard” means. I rebuilt the miner to use MaxSim, and the two miners’ final picks now overlap at a mean Jaccard of only 0.17, almost entirely because MaxSim surfaces negatives cosine had buried. Matched MaxSim mining then beat the no-mining floor by +0.008 to +0.009 nDCG@10, while cosine mining barely cleared it (+0.003 to +0.0045). The lift, in my runs, came from mining with the strongest same-geometry checkpoint.

The two scorers

A quick refresher on the two ways a retriever can score a query against a document.

A bi-encoder crushes the whole query into one vector and the whole document into one vector, then takes a single dot product. It’s fast and cheap, and it’s the default for mining hard negatives, because you can pre-compute one embedding per document and run a nearest-neighbor scan.

A late-interaction model (ColBERT, and in the document-image world, the ColPali family) keeps a vector per token (or per image patch) and scores with MaxSim: each query token finds its best-matching document token, and those maxima are summed. It’s heavier, but it’s far more expressive about where a match happens, which is exactly why it wins on dense, visually-structured documents.
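To make the difference concrete, here is a toy sketch in plain Python (no torch, so it runs anywhere). Mean pooling stands in for the bi-encoder’s single vector, which is my simplification; a real bi-encoder learns its own pooling. The 2-d token vectors are invented for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_pool(tokens):
    """Collapse a list of token vectors into one vector (stand-in for a bi-encoder)."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def maxsim(query_tokens, doc_tokens):
    """Each query token takes its best-matching doc token; the maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]                 # two query tokens
doc_a = [[0.7071, 0.7071]]                       # one "average-looking" token
doc_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # exact token matches plus a distractor

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b)]:
    cos = cosine(mean_pool(query), mean_pool(doc))
    ms = maxsim(query, doc)
    print(f"{name}: cosine={cos:.3f}  maxsim={ms:.3f}")
```

On these toy vectors the pooled cosine ranks doc_a first, while MaxSim ranks doc_b first, because doc_b matches each query token exactly even though its pooled vector is dragged off by the distractor token.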

My student is the second kind. My miner was the first kind. That’s the mismatch.

It matters because hard-negative mining is a similarity judgment: “find me the documents that look relevant but aren’t.” If the thing making that judgment uses a different similarity function than the thing being trained, you can get two failure modes at once:

The bi-encoder flags negatives that a MaxSim model finds trivially separable: easy negatives dressed up as hard ones, wasting a training slot.
The bi-encoder misses negatives that are genuinely confusable token-for-token under MaxSim. Those are exactly the hard cases the student most needs to see.

The published approaches don’t close this gap either. The strongest open late-interaction recipe I know of (Nemotron ColEmbed V2) mines its hard negatives with a separate embedding model and never reports matching the miner to the student’s geometry. The paper describes that model producing a single dense vector per page, which points to a single-vector miner rather than a MaxSim one. Either way, mining a late-interaction student with an architecture-matched late-interaction scorer isn’t something I’ve seen published. So the question was sitting there unanswered: does it actually produce better negatives?

What stays frozen

Before I touch the scorer, here’s what stays frozen, because the experiment is worthless if more than one thing moves.

The positive-aware mining substrate comes straight from NV-Retriever: its TopK-PercPos recipe, the same one Nemotron ColEmbed V2 uses. For each query, score a pool of candidates, then keep a candidate as a negative only if its score is below 95% of the positive’s score (PercPos@0.95). The point is to throw out “negatives” that are actually unlabeled positives, the false negatives that quietly poison contrastive training. I’m applying it to a late-interaction student.
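As a worked example of the cutoff, the sketch below labels candidate scores relative to the positive. The band names and the `percpos_band` helper are mine, not NV-Retriever’s; the 0.97 upper edge is the relaxed backfill band used in this setup, the scores are invented MaxSim-scale numbers, and the logic assumes positive scores.

```python
STRICT, RELAXED = 0.95, 0.97  # fractions of the positive's score

def percpos_band(score, pos_score):
    """Label one candidate score relative to the positive (hypothetical helper)."""
    if score < STRICT * pos_score:
        return "keep"        # clearly below the positive: usable hard negative
    if score < RELAXED * pos_score:
        return "backfill"    # borderline: only used if too few "keep" candidates survive
    return "reject"          # too close to the positive: likely an unlabeled positive

pos_score = 19.0             # strict cutoff is about 18.05, relaxed about 18.43
for s in [18.8, 18.2, 17.0, 15.0]:
    print(s, percpos_band(s, pos_score))
```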

Everything in this table is held identical between the two arms. The only thing that changes is one row.

Knob	Value	Where it comes from
Candidate net depth	top-100 per query	First-stage recall saturates by ~100 (recall@10/50/100 ≈ 0.95/0.97/0.97); deeper just adds false negatives, per NV-Retriever / SFR mining-depth findings
Positive-aware cutoff	PercPos@0.95	NV-Retriever
Relaxed backfill band	[0.95, 0.97]	For the rare query left with fewer than K survivors after the hard cutoff
Negatives per query	4 mined, 2 at train time	Tuned; small K is standard practice, the published recipes don’t pin an exact number
Loss	single-denominator InfoNCE over [positive, mined negs, in-batch negs]	ColPali / Nemotron ColEmbed V2
Candidate scorer	cosine to MaxSim	the one variable I changed

Both arms draw their candidates from the same top-100 net mined by the same bi-encoder. The only difference is who re-ranks that net and selects the final K: single-vector cosine, or multi-vector MaxSim. If a difference shows up downstream, the scorer is the only thing it can be attributed to.
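For readers who want the loss row spelled out, here is a minimal single-query sketch of a single-denominator InfoNCE: the positive, the mined negatives, and the in-batch negatives all share one softmax denominator. The function name and the temperature `tau` are my placeholders (leave `tau` at 1.0 for no scaling), and the scores are invented; this is not the ColPali or Nemotron ColEmbed V2 training code.

```python
import math

def info_nce(pos_score, mined_neg_scores, in_batch_neg_scores, tau=1.0):
    """-log softmax of the positive over [positive, mined negs, in-batch negs]."""
    logits = [pos_score] + list(mined_neg_scores) + list(in_batch_neg_scores)
    logits = [x / tau for x in logits]
    m = max(logits)                                   # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

easy = info_nce(19.0, [12.0, 11.0], [8.0, 7.5])       # mined negatives far below the positive
hard = info_nce(19.0, [18.0, 17.5], [8.0, 7.5])       # mined negatives close to the positive
print(f"loss with easy mined negatives: {easy:.4f}")
print(f"loss with hard mined negatives: {hard:.4f}")
```

Harder mined negatives inflate the shared denominator and therefore the loss, which is why the choice of negatives changes the training signal even when everything else is frozen.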

Gate 0: do they even disagree?

The cheapest way to kill this idea is to check, before training anything, whether MaxSim and cosine actually pick different negatives. If they pick basically the same ones, the paradigm can’t possibly move the model, and I’ve saved myself a full train-and-eval cycle.

So I took the bi-encoder’s top-100 candidate net, re-ranked it with my strongest late-interaction checkpoint’s MaxSim, applied the identical PercPos@0.95 / K=4 selection, and compared the two final negative sets over a sample of 300 mined queries, sharded across 8 GPUs. I pre-registered the bar up front: proceed only if mean Jaccard ≤ 0.6, abort if ≥ 0.8. (Jaccard is the shared picks over the union of the two sets, so for two full K=4 sets a value of 0.6 means three of the four picks are shared.)

Metric	VT 768	VT 1280
Mean Jaccard (cosine vs MaxSim, final K=4)	0.17	0.17
Discovery (MaxSim picks outside the cosine top-K)	0.73	0.73
Demotion (cosine picks that MaxSim calls easy/false)	0.005	0.007
Verdict (bar: Jaccard ≤ 0.6)	GREEN	GREEN

The two miners’ final negative sets overlap at a mean Jaccard of only ~0.17. That’s a big disagreement, but the shape of it is the interesting part. It’s almost entirely discovery: 73% of MaxSim’s picks are negatives that the bi-encoder had buried deep in its ranking. Almost none of it is demotion (0.5 to 0.7%): MaxSim rarely throws out a negative that cosine picked. PercPos already rejects most of cosine’s false negatives; what MaxSim adds is confusable documents cosine never ranked highly in the first place.
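The Gate 0 comparison is a few lines of set arithmetic. This sketch assumes each miner’s output is a list of negative row indices per query, and it reads “discovery” as MaxSim picks absent from the cosine miner’s final picks; the function names and example indices are made up.

```python
def jaccard(a, b):
    """Shared picks over the union of the two pick sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def discovery(maxsim_picks, cosine_picks):
    """Fraction of MaxSim picks that the cosine miner did not pick."""
    m = set(maxsim_picks)
    return len(m - set(cosine_picks)) / len(m) if m else 0.0

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical final K=4 picks (row indices) for two queries.
cosine_sets = [[11, 12, 13, 14], [30, 31, 32, 33]]
maxsim_sets = [[14, 21, 22, 23], [30, 31, 40, 41]]

print("mean Jaccard:", mean([jaccard(c, m) for c, m in zip(cosine_sets, maxsim_sets)]))
print("mean discovery:", mean([discovery(m, c) for c, m in zip(cosine_sets, maxsim_sets)]))
```

For two full K=4 sets, a single shared pick gives a Jaccard of 1/7 (about 0.14) and a discovery of 0.75, which is roughly the regime of the measured numbers.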

It was also reassuringly resolution-insensitive: the disagreement looks the same whether I embed candidates at 768 or 1280 visual tokens, which meant I could mine at the cheaper resolution.

One caveat: disagreement is necessary, not sufficient. It means the two miners pick different negatives, not better ones, and a new selection can redistribute across tasks and net flat. The real test is training.

Does it actually move the model?

Two arms, same everything except the scorer. Each arm: a fresh late-interaction student, 2 seeds, evaluated at three visual-token resolutions. And, critically, a third reference point: a no-hard-negative floor, the same student trained with in-batch negatives only. That floor is what tells you whether your fancy mining is doing anything at all.

Arm (V3 nDCG@10, 2-seed mean)	@768	@1280	@1792
No-HN floor (in-batch negatives only)	0.6133	0.6170	0.6191
Cosine bi-encoder mining	0.6162	0.6212	0.6236
MaxSim late-interaction mining	0.6221	0.6254	0.6269

And the deltas:

Comparison	@768	@1280	@1792
MaxSim minus cosine	+0.0059	+0.0042	+0.0033
MaxSim minus floor	+0.0088	+0.0084	+0.0078
cosine minus floor	+0.0029	+0.0042	+0.0045

MaxSim mining clears the no-HN floor by +0.008 to 0.009, consistently. Cosine mining barely clears it, +0.003 to 0.0045, right at the edge of noise. On this setup, MaxSim is the hard-negative method that earns its keep; cosine barely does.

The direct head-to-head is more modest: MaxSim beats cosine at every seed and every resolution, but by +0.003 to +0.006, clearing my pre-registered +0.005 bar only at 768. A real edge over cosine, but a modest one.
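The delta table is plain subtraction over the 2-seed means. This snippet reproduces it from the scores above and checks the head-to-head against the +0.005 bar; the variable names are mine.

```python
RES = [768, 1280, 1792]
ndcg = {
    "floor":  [0.6133, 0.6170, 0.6191],
    "cosine": [0.6162, 0.6212, 0.6236],
    "maxsim": [0.6221, 0.6254, 0.6269],
}
BAR = 0.005  # pre-registered bar for MaxSim minus cosine

def delta(a, b):
    """Per-resolution difference a - b, rounded to the table's 4 decimals."""
    return [round(x - y, 4) for x, y in zip(ndcg[a], ndcg[b])]

for a, b in [("maxsim", "cosine"), ("maxsim", "floor"), ("cosine", "floor")]:
    print(f"{a} minus {b}:", dict(zip(RES, delta(a, b))))

cleared = [r for r, d in zip(RES, delta("maxsim", "cosine")) if d >= BAR]
print("resolutions clearing the bar:", cleared)
```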

Per task, the gain is partly redistributive, exactly the Gate 0 warning coming true. At the highest resolution, versus cosine:

Task	Δ (MaxSim minus cosine)
FinanceFr	+0.015
Physics	+0.009
FinanceEn	+0.008
Industrial	−0.004
Hr	−0.009

Net positive overall, with a reshuffle underneath.

A coverage surprise

Swapping the scorer also changed something I wasn’t watching: how many queries survived PercPos at all.

PercPos@0.95 exists to drop false negatives, candidates scoring suspiciously close to the positive. Under cosine it dropped 4.2% of my queries (about 19,000 of them): the bi-encoder couldn’t cleanly separate the true positive from the candidate pool, so almost everything scored near the positive and got filtered out. Under MaxSim, the same cutoff dropped 0.06%, just 276 queries.

The reason is geometric. Under MaxSim the true positive stands out sharply (median positive score ~19 vs. selected-negative ~15), so a 95%-of-positive cutoff sits comfortably above the negatives and below the positive. The false-negative problem that PercPos was designed to fix is much milder under late interaction than under bi-encoder cosine. Late interaction is simply better at knowing which document is the right one, which is the whole reason we use it, and that property shows up at mining time, not just at inference.
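A small sketch of why the same cutoff bites so differently. The MaxSim-scale numbers follow the medians quoted above; the cosine-scale numbers are invented to show a crowded score distribution, and `survivors` is a hypothetical helper.

```python
def survivors(pos_score, cand_scores, frac=0.95):
    """Candidates that pass the strict positive-aware cutoff."""
    return [s for s in cand_scores if s < frac * pos_score]

# MaxSim-like: positive ~19, hard candidates ~15, cutoff about 18.05.
maxsim_like = survivors(19.0, [15.6, 15.2, 14.9, 14.1])

# Cosine-like (invented): positive 0.80, candidates bunched just under it, cutoff 0.76.
cosine_like = survivors(0.80, [0.79, 0.785, 0.78, 0.77])

print(len(maxsim_like), "of 4 survive under the MaxSim-like scores")
print(len(cosine_like), "of 4 survive under the cosine-like scores")
```

When every candidate sits within 5% of the positive, nothing passes the filter, and a query in that state contributes no mined negatives.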

A practical consequence: the MaxSim arm trains on ~19,000 more (genuinely hard, low-positive) queries that cosine threw away. That’s part of what each method is, so I let the difference stand. Worth knowing the scorer swap changes your coverage, too.

Is it the geometry, or the teacher?

The head-to-head can’t fully separate the two. My MaxSim miner was bigger than the bi-encoder (9B vs. 2B) and was my benchmark leader, so some of its edge is a fresh student inheriting a strong champion’s behavior, rather than geometry alone. Two things bound that: the only thing crossing from miner to student is which real documents get picked as negatives, never weights or test data; and scaling the bi-encoder miner from 2B to 8B did nothing for this student, so raw strength isn’t the driver. The clean test, a strong non-champion MaxSim miner, is next on my list.

How to actually do this in colpali-engine

Now the implementation, because the colpali-engine docs don’t cover end-to-end hard-negative mining for a late-interaction model, and that gap cost me time. Here’s the shape of it.

The consumer contract. colpali-engine’s hard-negative dataset wants, per row, a query, its positive, and K negatives, referenced by row index into the dataset split it’s handed. A “sidecar” parquet alongside your dataset is a clean way to carry this:

query_row_idx:   int64            # the query/positive row
neg_row_idxs:    list[int64]      # K negative row indices
neg_source:      string           # provenance tag, e.g. "mined_maxsim"
positive_score:  float32          # scorer's score for the positive (drives PercPos)
neg_scores:      list[float32]    # scorer's score per negative

Two invariants save you from silent corruption: query_row_idx ∉ neg_row_idxs (no self-negatives), and after you subset/reindex, both the query index and the negative indices must be local to the subset you actually train on. A sidecar carrying parent-absolute indices will train happily and quietly produce garbage. Ask me how I know.
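Both invariants are cheap to enforce before training. The sketch below remaps parent-absolute indices to subset-local ones and then validates the result; `localize` and `check` are hypothetical helpers of mine, not part of colpali-engine, and rows are simplified to `(query_row_idx, neg_row_idxs)` tuples.

```python
def localize(rows, kept_parent_rows):
    """Remap parent-absolute indices to positions within the training subset.

    rows: list of (query_row_idx, neg_row_idxs) using parent indices.
    kept_parent_rows: parent row indices retained in the subset, in subset order.
    Rows whose query was not kept are dropped; negatives that were not kept are
    dropped too, so a row can end up with fewer than K negatives.
    """
    to_local = {p: i for i, p in enumerate(kept_parent_rows)}
    out = []
    for q, negs in rows:
        if q in to_local:
            out.append((to_local[q], [to_local[n] for n in negs if n in to_local]))
    return out

def check(rows, n_local):
    """Raise if a row has a self-negative or an index outside the subset."""
    for q, negs in rows:
        if q in negs:
            raise ValueError(f"self-negative at query row {q}")
        for i in [q, *negs]:
            if not 0 <= i < n_local:
                raise ValueError(f"index {i} is not local to a subset of {n_local} rows")

parent_rows = [(10, [42, 57]), (42, [10, 99])]
kept = [10, 42, 57]                    # row 99 was filtered out of the subset
local_rows = localize(parent_rows, kept)
check(local_rows, n_local=len(kept))   # passes silently
print(local_rows)
```

Running `check` on the un-remapped `parent_rows` raises immediately, which is the failure you want instead of a quiet bad training run.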

The pipeline, in stages:

Build the candidate net once, cheaply. Embed the corpus with a fast bi-encoder and take the top-100 per query. You re-use this net for both scorers; it’s just the candidate pool, not the final selection. Top-100 is deliberate: first-stage recall saturates around there, and going deeper (top-1000) mostly imports false negatives.
Re-rank with MaxSim. For each query, score its 100 candidates with your late-interaction checkpoint’s MaxSim instead of the bi-encoder’s dot product. This is the whole point. Embed candidates at a visual-token resolution consistent with how the scorer is evaluated, or its MaxSim grid won’t match its real behavior.
Select with the frozen recipe. Apply PercPos@0.95, take K=4 (backfill from the [0.95, 0.97] band if a query is short), write the sidecar.
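Stage one above can be sketched as a brute-force top-k over single-vector scores. This is an illustration only: at corpus scale you would use a batched GPU matmul or an ANN index rather than a Python loop, and excluding the labeled positive from the net is my choice here, not something the recipe dictates. The helper name and toy vectors are made up.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def candidate_net(query_vec, doc_vecs, positive_row, depth=100):
    """Top-`depth` document rows by single-vector dot product, excluding the labeled positive."""
    scored = ((dot(query_vec, d), row) for row, d in enumerate(doc_vecs) if row != positive_row)
    return [row for _, row in heapq.nlargest(depth, scored)]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
net = candidate_net([1.0, 0.0], docs, positive_row=0, depth=3)
print(net)  # candidate rows, highest bi-encoder score first
```

The returned row indices are what both scorers re-rank; only the re-ranking function differs between the two arms.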

Here is the exact MaxSim I mine with, the same one my evaluator uses, so negatives get selected in the metric the model is trained and graded on, down to the dtype:

import torch

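def maxsim(q, docs):
    """ColBERT MaxSim. q is one query [s_q, h]; docs is B candidates [B, s_d, h] -> [B] scores.
    fp32 to match eval exactly. Zero-padded doc tokens score 0 against every query token, so they
    change a per-query-token max only if all of that token's real matches are negative."""
    sim = torch.einsum("sh,bth->bst", q.float(), docs.float())  # [B, s_q, s_d] token-pair similarities
    return sim.max(dim=-1).values.sum(dim=-1)                   # best doc token per query token, summed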

Re-ranking the candidate net is just swapping which function scores the candidates.

Selection is then NV-Retriever’s PercPos, applied to whichever scores you hand it: rank candidates hardest-first, keep those scoring below 95% of the positive, and backfill from the [0.95, 0.97] band only if fewer than K survive.

STRICT, RELAXED, KFINAL = 0.95, 0.97, 4

def select(cands, scores, pos_score):
    """cands[i] has score scores[i]; pos_score is the query-positive similarity in the SAME metric.
    Keep the hardest negatives below STRICT*pos; backfill from [STRICT, RELAXED)*pos if too few pass."""
    paired = sorted(zip(scores, cands), key=lambda t: t[0], reverse=True)  # hardest (highest) first
    keep = [(c, s) for s, c in paired if s < STRICT * pos_score]
    if len(keep) < KFINAL:
        keep += [(c, s) for s, c in paired if STRICT * pos_score <= s < RELAXED * pos_score]
    keep = keep[:KFINAL]
    return [int(c) for c, s in keep], [float(s) for c, s in keep]

The same select runs on both arms’ scores; only the scorer that produced them differs. Per query: embed the query and its positive, MaxSim-score the candidate net the bi-encoder already retrieved, select, and append a sidecar row. (The encode and multi-GPU memmap scaffolding is elided here; the scoring is the maxsim above, one einsum over all ~100 candidates.)

import pyarrow as pa
import pyarrow.parquet as pq

out_q, out_neg, out_ns, out_pos = [], [], [], []
for q_row, q_emb, pos_emb, cand_rows, cand_emb in mined_queries:   # embeddings are [s, h] multi-vector
    pos_ms  = float(maxsim(q_emb, pos_emb.unsqueeze(0))[0])
    cand_ms = maxsim(q_emb, cand_emb).tolist()                     # one einsum over all ~100 candidates
    sel, sel_scores = select(cand_rows, cand_ms, pos_ms)
    out_q.append(q_row); out_neg.append(sel); out_ns.append(sel_scores); out_pos.append(pos_ms)

pq.write_table(pa.table({
    "query_row_idx":  pa.array(out_q,   pa.int64()),
    "neg_row_idxs":   pa.array(out_neg,  pa.list_(pa.int64())),
    "neg_source":     pa.array(["mined_maxsim"] * len(out_q), pa.string()),
    "positive_score": pa.array(out_pos, pa.float32()),
    "neg_scores":     pa.array(out_ns,  pa.list_(pa.float32())),
}), "sidecar_maxsim.parquet", compression="zstd")

The thing that will actually bottleneck you is not the GPU. Embedding document images at scale is CPU-bound: image decode and preprocessing ran ~110 ms/img on CPU versus ~18 ms/img for the GPU forward pass at batch size 32. Naively, your B200s sit idle waiting for PIL. The fix that took my aggregate throughput from ~70 to ~250 img/s on an 8×B200 box was a producer pool: a couple of CPU subprocesses per GPU worker feeding a shared-memory queue (maxsize capped so it can’t balloon), so preprocessing for the next batch overlaps the current forward pass. It’s a classic move. Because the producers are real subprocesses rather than threads, they sidestep the Python GIL, and the overlap is the difference between a 6-hour mine and a day-long one.
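A back-of-envelope model makes the bottleneck visible. It assumes perfect overlap between preprocessing and the forward pass and ignores queue and transfer overhead, so treat it as a rough bound rather than a prediction; the function names are mine and the timings are the ones quoted above.

```python
import math

def aggregate_img_per_s(n_gpus, producers_per_gpu, cpu_ms_per_img, gpu_ms_per_img):
    """Idealized pipeline: each GPU is limited by the slower of its producers and its forward pass."""
    cpu_rate = producers_per_gpu * 1000.0 / cpu_ms_per_img   # images/s the producers can decode
    gpu_rate = 1000.0 / gpu_ms_per_img                       # images/s the forward pass can absorb
    return n_gpus * min(cpu_rate, gpu_rate)

def producers_to_saturate(cpu_ms_per_img, gpu_ms_per_img):
    """Smallest producer count per GPU at which the GPU becomes the bottleneck."""
    return math.ceil(cpu_ms_per_img / gpu_ms_per_img)

single = aggregate_img_per_s(8, 1, 110, 18)
ceiling = aggregate_img_per_s(8, producers_to_saturate(110, 18), 110, 18)
print(f"one preprocessing process per GPU: {single:.0f} img/s")
print(f"GPU-bound ceiling: {ceiling:.0f} img/s")
```

In this model a single preprocessing process per GPU lands near the ~70 img/s baseline, and the GPU-bound ceiling is about 444 img/s, which suggests the measured ~250 img/s still leaves some headroom on the preprocessing side.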

The producer pool fixes throughput; the other optimization avoids redundant work entirely. A neg-pool document shows up as a candidate for many different queries, so re-embedding it each time would be enormous waste. Instead I embed the whole neg-pool (~150k unique docs) once into a disk-backed fp16 memmap of shape [n_docs, max_tokens, dim], GPU-sharded, then open it read-only during mining and gather each query’s candidate rows by index. Embed-once versus embed-per-appearance is the difference between ~150k document forward passes and tens of millions. The file is large (~80 GB at a few hundred tokens × 320 dims in fp16) but it’s pre-allocated sparse, the OS page cache keeps hot rows in RAM, and zero-padded rows contribute 0 to MaxSim so they cost nothing at score time. Positives and query text are embedded on the fly, since each is used once.
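The embed-once pattern looks roughly like this with `numpy.memmap`. The shapes are tiny stand-ins, the embeddings are random, the file name is made up, and the GPU sharding is left out; the 768-token budget in the size estimate is only an illustrative figure.

```python
import os
import tempfile
import numpy as np

def memmap_bytes(n_docs, max_tokens, dim, bytes_per_value=2):
    """On-disk size of an [n_docs, max_tokens, dim] array (fp16 is 2 bytes per value)."""
    return n_docs * max_tokens * dim * bytes_per_value

n_docs, max_tokens, dim = 6, 4, 8           # tiny stand-in for the real neg-pool
path = os.path.join(tempfile.mkdtemp(), "neg_pool.f16")

# Embed once: create the zero-filled file and write one row per document.
pool = np.memmap(path, dtype=np.float16, mode="w+", shape=(n_docs, max_tokens, dim))
rng = np.random.default_rng(0)
for row in range(n_docs):
    n_tokens = 2 + row % 3                   # documents have different lengths
    pool[row, :n_tokens] = rng.standard_normal((n_tokens, dim)).astype(np.float16)
    # token slots beyond n_tokens stay zero: that is the padding
pool.flush()
del pool

# Mining time: reopen read-only and gather one query's candidate rows by index.
pool = np.memmap(path, dtype=np.float16, mode="r", shape=(n_docs, max_tokens, dim))
cand_rows = np.array([4, 1, 5])
cand_emb = np.asarray(pool[cand_rows])       # [3, max_tokens, dim], copied into RAM
print(cand_emb.shape, cand_emb.dtype)
print(memmap_bytes(150_000, 768, 320) / 1e9, "GB for 150k docs x 768 tokens x 320 dims")
```

The size helper is worth running before allocating anything: 150k documents at 768 token slots and 320 dims in fp16 already comes to about 74 GB, the same ballpark as the file described above.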

One MaxSim-specific gotcha beyond that memory budget: hard-fail on token-count overflow rather than silently truncate a document’s grid, or you’ll corrupt scores for your longest (often most informative) documents.
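A minimal guard for that gotcha, in plain Python: pad a document’s token vectors up to the grid size, and raise rather than truncate when it does not fit. `pad_or_fail` is a hypothetical helper name, and the vectors are toy values.

```python
def pad_or_fail(token_vecs, max_tokens, dim):
    """Zero-pad a document's token vectors to max_tokens, or raise if it would not fit."""
    if len(token_vecs) > max_tokens:
        raise ValueError(
            f"document has {len(token_vecs)} tokens but the grid holds {max_tokens}; "
            "raise max_tokens instead of truncating"
        )
    padding = [[0.0] * dim for _ in range(max_tokens - len(token_vecs))]
    return list(token_vecs) + padding

grid = pad_or_fail([[0.1, 0.2], [0.3, 0.4]], max_tokens=3, dim=2)
print(grid)
```

Failing loudly costs a restart with a larger grid; silent truncation costs wrong scores on exactly the documents you cannot easily spot-check.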

Where this paid off, and where it might not

In my setup, MaxSim mining looked most worth it when the student is late-interaction, the cosine hard negatives were already barely beating the no-mining floor, and there was room for the heavier scoring pass. The Gate 0 disagreement probe is the cheap insurance policy: I’d run it first, and if the MaxSim scorer picks essentially the same negatives as the bi-encoder, there’s probably little to gain, for the cost of one embedding pass rather than a full training run.

It looked less worth it when the scoring cost dwarfed the benefit, or when the candidate embedding resolution couldn’t be matched to the scorer’s eval resolution. And I wouldn’t lean on it to rescue a weak miner; that teacher-strength angle is a separate post.

The one line I’d keep: mine the way your model scores. If the student judges similarity with MaxSim, it’s worth mining the hard negatives with a strong MaxSim scorer too, ideally the best same-geometry checkpoint available.

If you train late-interaction retrievers, and especially if you’ve ever checked your own hard-negative mining against a no-mining floor, I’d like to compare notes. My models are on Hugging Face, and you can reach me on GitHub, LinkedIn, or by email.