Diminishing Returns and the Art of Knowing When to Stop
By Athos Georgiou
So, it looks like I can't help falling into rabbit holes. It started with the release of Qwen3.5 and the thought of building my first ColPali model. Three versions later, I had a question I couldn't let go of.
The Question

I trained three generations of ColQwen3.5, each with more sophisticated optimization than the last. The third used automated hyperparameter search across multiple nodes, followed by Bayesian per-layer model soup optimization. Substantially more compute and sophistication than the first two combined. And the scores barely moved.
What caught my attention wasn't just that it failed to beat V2: the additional optimization didn't simply plateau, it reshuffled which tasks improved and which regressed. Gains and losses nearly cancelled out, with a net difference smaller than the variance between individual training seeds.
Diminishing returns aren't news. What surprised me was that past the wall, my scores didn't plateau; they reshuffled.
ColQwen3.5 is a visual document retrieval model I've been developing as part of my work on Snappy and the broader ColPali ecosystem. It takes document images (PDFs, scanned pages, slides) and retrieves the most relevant ones for a given text query, without needing OCR or text extraction first. It's a 4-billion-parameter model built on Qwen3.5 that uses late interaction for scoring: each query token scores against each document token independently, then the scores are aggregated.
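The late-interaction scoring just described can be sketched in a few lines. This is a minimal illustration in plain Python with toy 3-dimensional embeddings, not the model's actual scoring code: each query token takes its best dot product over all document tokens, and the per-token maxima are summed (the MaxSim operator).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    # For each query token, take its best match among document tokens,
    # then sum the per-token maxima.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy multi-vector embeddings; the real model produces these per token.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_a = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2]]  # matches both query tokens
doc_b = [[0.0, 0.0, 1.0], [0.1, 0.1, 0.9]]  # matches neither

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a > score_b)  # True: doc_a ranks higher
```

Because each query token scores independently, the document can be retrieved on partial matches without collapsing everything into a single vector first.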
The ViDoRe Ecosystem
The model is evaluated on the ViDoRe benchmark suites, the standard evaluation sets for visual document retrieval:
- ViDoRe V1: 10 tasks scored by nDCG@5 (a standard relevance ranking metric where 1.0 is perfect), near-saturated (best models score above 0.91)
- ViDoRe V2: 4 English tasks (nDCG@5), domain-specialized
- ViDoRe V3: 8 enterprise tasks (nDCG@5 and nDCG@10), covering finance, HR, pharma, energy, and more. This is the primary benchmark
All evaluation runs through MTEB v2.10.8 (deterministic, no query sampling), and every evaluation produces a JSON result file. I've released all of these publicly as part of the optimization trail.
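For readers unfamiliar with the metric, nDCG@k can be computed as follows. This is a minimal sketch with binary relevance labels; MTEB's actual implementation handles graded relevance and edge cases.

```python
import math

def dcg(rels):
    # Discounted cumulative gain: later ranks contribute less.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg_at_k(relevances, k=5):
    # relevances: relevance of each retrieved doc, in ranked order.
    # Normalize by the DCG of an ideal (best-possible) ranking.
    ideal = sorted(relevances, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(relevances[:k]) / denom if denom > 0 else 0.0

# A perfect ranking scores 1.0; burying the relevant doc scores less.
print(ndcg_at_k([1, 0, 0, 0, 0]))  # 1.0
print(ndcg_at_k([0, 0, 0, 0, 1]))  # ~0.387
```

This is why small nDCG@5 deltas matter so much near saturation: when the best models are above 0.91, most rankings are already near-ideal and the metric has little room left to move.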

Three Generations of Optimization
All three versions share the same base model, training data, and loss function. The only thing that changes is the optimization pipeline.
Shared training foundation
Qwen3.5 4B base model with:

- LoRA (Low-Rank Adaptation, a parameter-efficient fine-tuning method) applied to attention and MLP layers plus the projection head
- ColBERT negative cross-entropy loss with hard negatives mined from the base checkpoint
- Same-source batch sampling
- Approximately 776K query-document pairs from six datasets (ColPali training set, VisRAG synthetic and in-domain, VDR multilingual, TAT-QA, TabFQuAD)
- AdamW optimizer, batch size 32, BF16 mixed precision (all versions)
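The contrastive objective can be sketched as follows. This is a simplified stand-in for the ColBERT-style loss, not the training code: scores for the positive document and its hard negatives go through a softmax, and the loss is the negative log-likelihood of the positive.

```python
import math

def contrastive_loss(pos_score, neg_scores):
    # Softmax over [positive, negatives], then NLL of the positive.
    # In training, scores come from MaxSim over query/doc embeddings.
    scores = [pos_score] + neg_scores
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[0] / sum(exps))

# The loss falls as the positive pulls ahead of the hard negatives.
print(contrastive_loss(5.0, [1.0, 0.5]) < contrastive_loss(2.0, [1.0, 0.5]))  # True
```

Hard negatives matter here because near-miss documents dominate the denominator: easy negatives contribute almost nothing to the gradient once the model separates them.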
| Stage | V1 | V2 | V3 |
|---|---|---|---|
| HPO | Manual | Manual | Optuna (16 trials) |
| LoRA config | r=32, α=32 | r=32, α=32 | r=16, α=64 |
| α/r ratio | 1.0 | 1.0 | 4.0 |
| Scheduler | Linear | Cosine | Cosine |
| Dropout | 0.1 | 0.1 | 0.197 |
| Seeds | 4 | 3 | 3 |
| Seed merge | Linear avg | Linear avg | Full state dict avg |
| Soup | None | 3-ratio grid | 11-trial Bayesian |
| Soup search space | — | 3 ratios (manual) | 14 params (Optuna) |
| Selection decisions | ~20 | ~30 | ~50 |
Here's what each generation involved:
V1: Manual Optimization
Four sequential training phases with ablation studies, four seeds per phase, merged by linear weight averaging. No model soup.
V2: Adding Model Soups
Same training recipe as V1 with hard negatives, three seeds, linear merge. Added a model soup with V1: three blend ratios tested, best selected by V3@10.
V3: Automated Search and Bayesian Soups
V3 added automated hyperparameter search and per-layer Bayesian model soup optimization to the pipeline.
Automated HPO. I ran Optuna with 16 trials, optimizing V1 and V3 nDCG@5 jointly (equal weight). A caveat worth stating upfront: MTEB's maintainers recommend against using MTEB scores as an HPO objective, because even process-level optimization against benchmark scores can introduce selection bias toward the evaluation set. My HPO did exactly that. The transferable findings I discuss later (α/r ratio, cosine scheduling) are architecture-level insights that should hold independently, but they were surfaced by a search scored against MTEB, and that caveat applies. In hindsight, equal weighting was also a questionable choice: V1 is near-saturated (>0.91), so its compressed dynamic range likely biased the sampler toward V1-favoring configs. I revisit this in What the Optimization Actually Bought.
HPO search details
The search used a Tree-structured Parzen Estimator (TPE) sampler with successive halving pruner. The search space covered learning rate, LoRA rank and alpha, scheduler type, dropout, warmup ratio, and weight decay.
Selected config: lr=4.57e-5, r=16, α=64, cosine scheduler, dropout=0.197.
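The shape of the search can be sketched dependency-free. Note the deliberate swaps: random sampling stands in for Optuna's TPE sampler, and `score()` is a toy stand-in for "train briefly, evaluate nDCG@5 on V1 and V3"; the toy objective simply encodes the two findings reported later (higher α/r and cosine scheduling help). None of this is the actual pipeline.

```python
import random

random.seed(0)

# Search space mirroring the one described above (illustrative).
SPACE = {
    "rank": [8, 16, 32],
    "alpha": [16, 32, 64, 128],
    "scheduler": ["linear", "cosine"],
}

def sample_config():
    cfg = {k: random.choice(v) for k, v in SPACE.items()}
    cfg["lr"] = 10 ** random.uniform(-5, -4)      # [1e-5, 1e-4], log scale
    cfg["dropout"] = random.uniform(0.0, 0.3)
    return cfg

def score(cfg):
    # Toy objective: rewards higher alpha/rank and cosine scheduling.
    return 0.01 * cfg["alpha"] / cfg["rank"] + 0.01 * (cfg["scheduler"] == "cosine")

trials = [sample_config() for _ in range(16)]  # 16 trials, as in the real search
best = max(trials, key=score)
print(best["rank"], best["alpha"], best["scheduler"])
```

In the real search, Optuna's TPE sampler concentrates later trials near promising regions and the successive halving pruner kills weak trials early, which is what makes 16 trials at all informative.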
Seed training and merging. Three seeds trained for 3,031 steps each using the HPO-selected config. I used full state dict averaging for seed merging (each adapter merged into the base via merge_and_unload, then averaged).
Why state dict averaging instead of PEFT's add_weighted_adapter
V1 and V2 were trained with PEFT versions before v0.7.0, where add_weighted_adapter produced incorrect results due to independent A/B matrix processing (Issue #1155). The bug is attenuated when seed-trained adapters are similar (as mine are, since they share architecture, data, and training recipe). Still, I adopted state dict averaging as the more principled approach for V3.
Quality gate. Before proceeding to the soup stage, I required the merged V3 model to match or exceed V2's V1@5 score (0.913) to ensure V1 performance wasn't regressing. V3 Merge scored 0.9193. Gate passed. So far, so good.
Per-layer Bayesian soup. Model soup with V2 used per-layer Bayesian optimization: Optuna searched over 6 attention block weights, 6 MLP block weights, and embedding/visual projection weights (14 parameters total) across 11 trials, evaluated on a fast subset of 3 V1 + 3 V3 tasks. This is also where things got complicated.
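The blend step inside each soup trial can be sketched like this. Optuna proposes the 14 weights; this sketch shows only the per-component interpolation, with hypothetical parameter names and a made-up name-to-component mapping. Following the convention in the soup-weights table below, weight 1.0 keeps V2 and 0.0 keeps V3.

```python
def component_of(name):
    # Map a parameter name to its soup-weight key, e.g.
    # "blocks.2.attn.q_proj" -> "block2.attn". Purely illustrative.
    parts = name.split(".")
    return f"block{parts[1]}.{parts[2]}"

def per_layer_soup(sd_v2, sd_v3, layer_weights):
    # Interpolate two checkpoints with a separate weight per component.
    blended = {}
    for name in sd_v2:
        w = layer_weights[component_of(name)]
        blended[name] = [w * a + (1 - w) * b
                         for a, b in zip(sd_v2[name], sd_v3[name])]
    return blended

sd_v2 = {"blocks.0.attn.q_proj": [1.0, 1.0], "blocks.0.mlp.up": [2.0, 0.0]}
sd_v3 = {"blocks.0.attn.q_proj": [0.0, 0.0], "blocks.0.mlp.up": [0.0, 2.0]}
weights = {"block0.attn": 0.61, "block0.mlp": 0.17}  # block-0 weights from the search

soup = per_layer_soup(sd_v2, sd_v3, weights)
print(soup["blocks.0.attn.q_proj"])  # [0.61, 0.61]
```

Each Optuna trial would wrap this blend in an objective that evaluates the resulting model on the fast task subset, which is precisely the step that optimizes merge weights directly against benchmark scores.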
I also explored a DARE-TIES merge variant (V3@5: 0.5855), but it didn't improve over the Bayesian soup.

The Results
To isolate what the optimization actually contributed, I trained a control adapter using V2's default config (lr=5e-5, r=32, α=32, cosine scheduler) for 500 steps. No HPO, no seed merging, no soup. Just a single training run with default settings.
One caveat: the control trains for 500 steps vs. 3,031 for V3, so the Control → V1 gain conflates training duration with optimization sophistication. A single-seed run at 3,031 steps (without merging) would isolate the duration effect, but I didn't run one. Without that, I can't decompose how much of the +0.033 V3@5 gain comes from longer training vs. multi-seed merging.
Scores for all model variants across ViDoRe benchmarks, ordered by optimization intensity (higher is better; differences below ~0.002 are likely noise):
| Model | Optimization | V1@5 | V3@5 | V3@10 | V2@5 |
|---|---|---|---|---|---|
| V3 Control | None | 0.8928 | 0.5498 | — | — |
| V1 | Seeds + merge | 0.9166 | 0.5830 | 0.6086 | 0.6035 |
| V2 | + soup (3 ratios) | 0.9172 | 0.5913 | 0.6177 | 0.6131 |
| V3 Full Merge | + HPO | 0.9193 | 0.5857 | 0.6140 | 0.6417 |
| V3 DARE-TIES | + HPO + DARE-TIES | 0.9166 | 0.5855 | 0.6129 | 0.6356 |
| V3 Soup | + HPO + soup | 0.9156 | 0.5905 | 0.6180 | 0.6350 |
For the time-pressed: The first round of optimization captured ~90% of the V3@5 gains. Automated hyperparameter search found a few transferable insights (higher LoRA scaling ratios, cosine scheduling). The final model soup stage, which blends two models layer by layer, mostly reshuffled which tasks improved and which regressed, with a net change smaller than seed variance.
The data breaks into three stages with sharply different returns:
High marginal returns (Control → V1, confounded with training duration)
Moving from a short baseline run to multi-seed training with merging yields the largest gains: +0.0238 on V1@5 and +0.0332 on V3@5. This stage conflates longer training (500 → 3,031 steps) with optimization sophistication, so the gain isn't purely from smarter optimization. Still, standard training practices clearly get you most of the way there.
Moderate returns (V1 → V2)
Adding a 3-ratio model soup with V1 yields +0.0006 on V1@5 and +0.0083 on V3@5. The gains are real but an order of magnitude smaller than those of the previous stage.
Diminishing and mixed returns (V2 → V3)
HPO and Bayesian soups yield mixed results. Compared to V2, V3 Soup gains on V2@5 (+0.0219) and V3@10 (+0.0003), but regresses on the primary benchmark, V3@5 (-0.0008), and on V1 (-0.0016).
Compared to V3 Full Merge (before the soup), the soup stage itself hurt V2@5 by 0.0067. More optimization compute did not translate to uniform improvement.
The V3 variants don't follow a monotonic curve on V3@5. They scatter around a plateau (0.5855–0.5905), consistent with noise rather than a systematic trend.
Notice something else: V3 Full Merge (before the soup) achieves the highest V1@5 (0.9193) and V2@5 (0.6417) of any variant. The HPO stage found real value. It's the subsequent soup stage that traded those gains away on some benchmarks while adding them on others.

The Reshuffle
When I broke down V3 Soup's performance against V2 task by task on the 8 ViDoRe V3 tasks, the pattern became clear:
| Task | V2 | V3 Merge | V3 Soup | Δ (Soup - V2) |
|---|---|---|---|---|
| ComputerScience | 0.7538 | 0.7440 | 0.7543 | +0.0005 |
| Energy | 0.6918 | 0.6653 | 0.6692 | -0.0226 |
| FinanceEn | 0.5923 | 0.5897 | 0.5925 | +0.0002 |
| FinanceFr | 0.4782 | 0.4677 | 0.4759 | -0.0023 |
| HR | 0.5858 | 0.5907 | 0.5971 | +0.0113 |
| Industrial | 0.5111 | 0.5162 | 0.5268 | +0.0157 |
| Pharmaceuticals | 0.6324 | 0.6342 | 0.6350 | +0.0026 |
| Physics | 0.4853 | 0.4779 | 0.4734 | -0.0119 |
| Average | 0.5913 | 0.5857 | 0.5905 | -0.0008 |
V3 wins on 5 tasks and loses on 3. The largest improvement (+0.0157 on Industrial) and the largest regression (-0.0226 on Energy) are both an order of magnitude larger than the mean difference. The gains and losses nearly cancel, producing a mean difference of just -0.0008.
Even though the average barely moves, the individual tasks change substantially. That's what I call the reshuffling effect. The reshuffling claim doesn't rest on the mean difference; it rests on the variance of per-task deltas.
A paired t-test on the 8 per-task deltas yields t(7) = -0.19, p = 0.86. The mean difference is not statistically significant. But with only n=8 tasks, the test has very low power, so the non-significant result is consistent with both "no real difference" and "an effect too small to detect."
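The t statistic is reproducible from the per-task deltas in the table above, with no statistics library needed (the p-value requires the t-distribution CDF, which is omitted here):

```python
import math

# Per-task deltas (V3 Soup - V2) from the table above.
deltas = [0.0005, -0.0226, 0.0002, -0.0023, 0.0113, 0.0157, 0.0026, -0.0119]

n = len(deltas)
mean = sum(deltas) / n
# Sample standard deviation (n - 1 in the denominator).
sd = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (n - 1))
# Paired t-test reduces to a one-sample t-test on the deltas.
t = mean / (sd / math.sqrt(n))

print(round(mean, 4))  # -0.0008
print(round(t, 2))     # -0.19
```

The standard deviation of the deltas (~0.012) dwarfs their mean (~0.0008), which is the reshuffling pattern in one line: large per-task movement, near-zero net effect.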
One important caveat here: the soup was optimized against a fast subset of only 3 V1 + 3 V3 tasks, and I didn't record which ones. The reshuffling could partly be the soup winning on the tasks it was optimized for and regressing on the rest. I can't rule that out. If anything, that would make the point stronger: optimizing merge weights against a subset of your benchmarks buys you gains on those benchmarks at the expense of others.
For context, the V3@5 gap between V2 and V3 Soup (0.0008) is smaller than individual V1 seed variance: the four V1 Phase 4 seeds scored 0.9184, 0.9185, 0.9193, and 0.9199 on V1@5, a span of 0.0015. That's a different, near-saturated benchmark, so the comparison isn't perfect, but it gives a sense of scale. Individual V3 seed scores on V3@5 were not recorded prior to merging. Without per-seed V3@5 scores, I cannot construct a confidence interval for the V3@5 mean, which means the reshuffling claim rests on task-level delta variance rather than a formal significance test against seed noise.
Inside the soup weights
The Bayesian soup search optimized 14 per-layer V2/V3 interpolation weights. The resulting weights are highly non-uniform: Block 2's attention is 97% V2, while Block 1's attention is 93% V3. This could indicate that V2 and V3 contribute different capabilities at different layers. But given the small soup evaluation subset (discussed in The Reshuffle), the non-uniformity may partly or entirely reflect per-layer task-specific fitting rather than a meaningful decomposition of retrieval capability. Without held-out evaluation, I can't distinguish the two.
| Component | Block 0 | Block 1 | Block 2 | Block 3 | Block 4 | Block 5 |
|---|---|---|---|---|---|---|
| Attention (V2 wt) | 0.61 | 0.07 | 0.97 | 0.30 | 0.68 | 0.12 |
| MLP (V2 wt) | 0.17 | 0.95 | 0.81 | 0.10 | 0.44 | 0.50 |
Weight 1.0 = entirely V2; weight 0.0 = entirely V3. The remaining 2 of 14 parameters are omitted as they don't follow the block structure.

What the Optimization Actually Bought
So was it all wasted effort? Not exactly. The diminishing returns pattern is metric-dependent. V3 Soup achieves a +0.0219 improvement on V2@5, and V3 Full Merge achieves an even larger +0.0286 before soup blending. The additional optimization did find substantial value for certain evaluation sets. The plateau is specific to V3@5.
Not all stages are equally suspect
The strongest signal from the 16 Optuna trials was the α/r ratio. Every config with α/r ≥ 4.0 outperformed every config with α/r ≤ 1.0 on both V1 and V3 short-run evaluations. The consistency across both evaluation sets suggests that stronger per-parameter scaling is genuinely beneficial for this architecture, though the HPO's equal-weight V1+V3 objective means I can't fully rule out V1-bias in the sampler influencing which configs surfaced. Full-length training runs on held-out tasks would be the real test. Cosine scheduling also consistently beat linear.
| Parameter | Search range | Selected | V2 default |
|---|---|---|---|
| Learning rate | [1e-5, 1e-4] | 4.57e-5 | 5e-5 |
| LoRA rank | {8, 16, 32} | 16 | 32 |
| LoRA alpha | {16, 32, 64, 128} | 64 | 32 |
| Scheduler | {linear, cosine} | cosine | cosine |
| Dropout | [0.0, 0.3] | 0.197 | 0.1 |
| Warmup ratio | [0.02, 0.15] | 0.08 | 0.03 |
| Weight decay | [0.0, 0.1] | 0.02 | 0.01 |
These findings are about training dynamics, not about fitting to specific benchmarks.
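The α/r finding has a direct mechanical reading: LoRA's adapter update is ΔW = (α/r)·B·A, so with fixed learned factors, a higher ratio multiplies the update's magnitude. A minimal numeric sketch (toy rank-2 factors, not real weights):

```python
def lora_scale(alpha, rank):
    # Standard LoRA scaling: the adapter update is (alpha / rank) * (B @ A).
    return alpha / rank

# V1/V2 config vs. the HPO-selected V3 config.
print(lora_scale(32, 32))  # 1.0  (V1/V2: r=32, alpha=32)
print(lora_scale(64, 16))  # 4.0  (V3:    r=16, alpha=64)

# With the same learned factors, the V3 config scales the update 4x.
b_row, a_col = [0.5, 0.3], [0.1, -0.2]            # toy rank-2 factors
entry = sum(x * y for x, y in zip(b_row, a_col))  # one entry of B @ A
print(lora_scale(64, 16) * entry)                 # 4x the unit-ratio contribution
```

Note that V3 also halves the rank, so it expresses a lower-dimensional update more forcefully rather than a bigger one in every direction.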
Seed merging also does what the model soups literature says it should. Individual V1 seed scores on V1@5 span just 0.0015 (0.9184 to 0.9199); the merged model scores 0.9193, near the top of that range.
The per-layer Bayesian soup is the most suspect stage. It directly optimizes merge weights against benchmark scores on a small evaluation subset. This is where the reshuffling shows up.

Goodhart's Law and Leaderboard Gaps
This matters for anyone evaluating model quality from leaderboard rankings: the gaps between top models may partly reflect which benchmarks each team optimized against, not which model retrieves documents better.
Goodhart's Law applied to model merging: optimize merge weights against a specific set of benchmark scores, and improvements on targeted metrics come at the expense of others.
I wanted to measure where exactly this shows up in a realistic optimization pipeline. The answer is specific: at the model soup stage, not at HPO.
HPO found transferable insights (the α/r ratio, cosine scheduling) that hold across evaluation sets, with the caveats discussed in Not all stages are equally suspect. The redistribution happens when you start optimizing merge weights directly against benchmark scores. The soup's small evaluation subset (see The Reshuffle) means the final V3@5 average covers tasks the soup never saw, and gains on optimized tasks come at the expense of the rest.
Process-level vs. test-level optimization
My pipeline never accesses test documents. All optimization is over hyperparameters, weight combinations, and merge ratios. This is process-level optimization: selection pressure acts on training recipes, not on test data.
Compare this with the NeurIPS 2024 LLM Merging Competition, where participants optimized merge configurations against specific task outputs. Their generalization gap was severe: 0.83 on public validation but 0.41 on hidden final tasks. My diminishing returns pattern is milder (a plateau with redistribution, not a public/private collapse), but it's real, and it sets a ceiling.
Leaderboard gaps
When the gap between V2 and V3 on V3@5 (0.0008) is smaller than individual seed variance on V1@5 (0.0015), a substantial portion of the differences between competitive models may reflect optimization pressure rather than genuine retrieval quality. Most competing models undergo their own HPO campaigns and model souping, but the specific evaluation sets driving their decisions are rarely disclosed.
The broad rankings still hold. All optimized models clearly outperform the unoptimized control. But when the gap between competitive models is smaller than seed variance, the last few decimal places on a leaderboard may be measuring optimization pressure, not retrieval quality. Private hold-out evaluation sets are the right tool for calibrating these gaps. ViDoRe V3's maintainers have been proactive about this: the benchmark includes 2 private hold-out datasets out of 10 total, and the ViDoRe V2 paper explicitly called out models exhibiting overfitting.
So, When Should You Stop?
If you're going to invest compute beyond standard training practices, HPO is where to spend it. The transferable insights (LoRA scaling ratios, cosine scheduling) carry forward to future models. The soup stage is where to be skeptical: it optimizes directly against benchmark scores, and the gains it produces on targeted metrics come at the cost of others.
This is a single pipeline (n=1); a different HPO campaign or soup search might yield different results. But now I have exact numbers for the trajectory instead of a vague sense of where the ceiling is.
Full Per-Task Scores
The ViDoRe V3 per-task breakdown is in The Reshuffle above; the V1 and V2 breakdowns follow below.
ViDoRe V1 per-task scores (nDCG@5)
| Task | V2 | V3 Merge | V3 Soup |
|---|---|---|---|
| ArxivQA | 0.9155 | 0.9151 | 0.9179 |
| DocVQA | 0.6610 | 0.6663 | 0.6658 |
| InfoVQA | 0.9356 | 0.9331 | 0.9359 |
| ShiftProject | 0.9404 | 0.9196 | 0.9039 |
| SynthDocQA AI | 1.0000 | 1.0000 | 1.0000 |
| SynthDocQA Energy | 0.9739 | 0.9769 | 0.9712 |
| SynthDocQA Gov | 0.9742 | 0.9729 | 0.9729 |
| SynthDocQA Health | 0.9889 | 0.9926 | 0.9889 |
| Tabfquad | 0.9453 | 0.9726 | 0.9599 |
| Tatdqa | 0.8377 | 0.8438 | 0.8394 |
| Average | 0.9172 | 0.9193 | 0.9156 |
ViDoRe V2 per-task scores (English, nDCG@5)
| Task | V2 | V3 Merge | V3 Soup |
|---|---|---|---|
| BioMedicalLectures | 0.6466 | 0.6621 | 0.6621 |
| ESGReportsHL | 0.6957 | 0.7671 | 0.7332 |
| ESGReports | 0.5753 | 0.5992 | 0.5765 |
| EconomicsReports | 0.5348 | 0.5385 | 0.5682 |
| Average | 0.6131 | 0.6417 | 0.6350 |
Reproducibility
All model weights and evaluation data are publicly available:
- Models: athrael-soju/colqwen3.5-4.5B-{v1,v2,v3} on HuggingFace
- Evaluation trail: athrael-soju/colqwen-optimization-trail, 776+ MTEB result files, training configs, and all evaluation JSON files documenting selection decisions across the optimization trail
- Training code is not publicly released; the evaluation trail and model cards document all hyperparameters and training configurations
If you're training retrieval models and have run into similar dynamics, or if you think I'm reading too much into the noise, I'd love to hear about it. Reach out on GitHub, LinkedIn, or via email.
