The Price of Anarchy in Disaggregated Inference

Choose a version of the post that suits you best.

1950s comic-style illustration of an alien invasion over a burning city: a vast armada of flying saucers and a ridged mothership stamped DECODE, crewed by big-headed black-eyed grey aliens, battling human fighter planes and anti-aircraft flak fire overhead, while a riveted atomic rocket battery stamped PREFILL launches a salvo of finned rockets skyward; both sides draw power through cables from a single shared glowing reactor core stamped GPU BUDGET at the center, a large round gauge labeled PRICE OF ANARCHY has its needle swung into a red SATURATION zone, and a wind-up tin robot labeled CONTROLLER yanks a red-handled switch on the power-routing panel between the two feeds

I’ve been fascinated by the Nash equilibrium since I can remember, or perhaps since I watched the movie A Beautiful Mind back in 2001. Back then I’d probably confused game theory with video games, as I’ve been an avid gamer since birth, but the basic principles of the theory stuck with me.

Part of what I do these days is inference optimization. I discovered NVIDIA’s Dynamo a few months ago through a conversation with Kevin Cochrane, Vultr’s CMO, who opened the door for our ongoing collaboration. Had that conversation not happened, this paper would likely not exist.

If you’d like, you can skip to the TLDR. Otherwise, the director’s cut is below.

Disaggregated inference splits prefill and decode onto separate GPU pools that share one hardware budget, which makes them competing agents. NVIDIA Dynamo is modelled as three coupled games (a prefill/decode resource game, a selfish caching game over the KV hierarchy, and a routing congestion game with cache externalities), with an empirical Price of Anarchy index measured on a 3-node B200 cluster. Below GPU saturation the index plateaus at 7 to 19 and is invariant to router parameters (CV under 0.5%). Above the knee it grows an order of magnitude. A 270-line Python controller detects the transition from 5s Prometheus polling and flips routing parameters per request through Dynamo’s existing router_config_override hook, no Rust changes. It cuts the index 3.1x on the 70B 1P/5D topology and drops TTFT P99 up to 7.6x at the knee. The full paper is on arXiv: 2606.17081.

Disaggregated serving as a multi-agent system

LLM inference at datacenter scale is a resource allocation problem. Every request has to be assigned to a GPU, batched, and granted KV cache memory, all under millisecond latency budgets. When demand exceeds capacity, those allocations decide whether the system degrades gracefully or the queues blow up.

One architecture that performs at scale is disaggregated inference, which physically separates the compute-bound prefill phase (reading the prompt) from the memory-bandwidth-bound decode phase (generating tokens) onto distinct GPU pools. NVIDIA’s Dynamo is one of the most complete production implementations and includes: a Planner that reallocates GPUs between the pools, a KV-aware Smart Router, and a hierarchical KV Block Manager spanning HBM, DRAM, SSD, and networked storage.

Splitting the phases creates competing agents on a shared budget. Prefill and decode contend for GPUs while optimizing different objectives (time-to-first-token versus inter-token latency). Requests compete for placement: routing one to a cache-warm worker helps that request and congests the worker for the next arrival. KV blocks share memory tiers, and evicting one block imposes a recomputation cost on the request it belonged to.

Congestion games, the Price of Anarchy, and selfish caching each map onto one of those contests. I pointed all three at Dynamo.

The three games

I modelled disaggregated serving as three games, each running at its own timescale and granularity.

Game	Players	Mechanism	What’s contested	Timescale
Γ_PD (resource)	prefill pool, decode pool	Planner	GPU budget G	~30s
Γ_KV (caching)	GPU workers	KV Block Manager	tiered memory G1 to G4	continuous
Γ_R (routing)	concurrent requests	Smart Router	decode workers	sub-ms

Γ_PD is a two-player resource game with a shared constraint, which makes it a Generalized Nash Equilibrium Problem. The coupling is asymmetric: prefill utility depends only on its own allocation and the arrival rate, but decode utility also depends on prefill, because a starved prefill pool idles decode workers regardless of decode GPU count. Under diminishing returns it has a unique variational equilibrium where the marginal SLO improvement from one more GPU is equal across pools. I treated this one analytically, since the topologies I ran used fixed P/D splits.

Γ_KV is selfish caching over a four-tier hierarchy, with the KV Block Manager as a Stackelberg leader setting eviction policy. Frequency-based eviction (frequency initialized at 1, doubled on hit, decremented on decay) approximates a threshold strategy. On complete-graph topologies (workers equidistant, the NVLink-within-a-node case) selfish caching is socially optimal, PoA = 1. There’s a Braess-like trap worth flagging: adding cache nodes can worsen the PoA, so naively bolting more GPU memory onto a cache pool could in theory degrade the system. Homogeneous request patterns (popular shared prefixes) bound the paradox.

Γ_R is the routing game, the one I measured directly. The Smart Router’s per-request worker selection is a congestion game, except the KV overlap term adds positive externalities on top. The cost to a request of landing on worker j is

C_i = f_j(n_j)        congestion cost (load on worker j)
    - ω · o_ij        cache overlap benefit (prefix reuse)

The overlap term ties cost to request identity (which prefix it shares) on top of raw co-user count, which breaks the anonymous-cost assumption and the clean potential-game structure. In Dynamo’s actual code the router scores each worker as ω · (blocks needing prefill) + (active decode blocks), cache-miss cost plus load; that’s the same argmin with the overlap term’s sign flipped. Dynamo’s router is centralized, but its sequential greedy assignment (route each arrival to the lowest-cost worker given current loads) is exactly best-response dynamics in this game, so the Price of Anarchy is still the right thing to measure.

The three games are coupled. The Planner sets the routing game’s worker pool. Routing then decides where blocks get created and concentrates cache pressure; the evictions that follow feed back into the overlap scores that redirect later requests. Under load this closes into a loop: prefill saturates, decode idles, the Planner shifts GPUs toward prefill, and the worker set moves under the router mid-flight. Saturation in one game changes the inputs to the other two. That coupling under load is what I wanted to measure.

Measuring via the PoA index

I wrote the empirical index as PoA-hat, to keep it honest and distinct from a classical Price of Anarchy bound. For a sliding window of completed requests, it’s the ratio of observed total latency to a hindsight-optimal reassignment of those same requests to workers, computed with the Hungarian algorithm:

PoA-hat = Σ actual per-request latency  /  OPT(window)

Structurally that’s a competitive ratio against an offline optimum. Three key properties:

The cost matrix is an uncalibrated parametric model (a·n + b + d/(C−n)^β − w_c·o) whose parameters are fixed a priori. That makes PoA-hat a relative index on an internal scale: a value of 19 does not translate to “19x worse than optimal.”
OPT freezes the observed loads and ignores how reassignment would shift them, which makes OPT an underestimate and PoA-hat an upper bound on the true PoA.
At very low load (a handful of in-flight requests) the denominator goes trivially small and the index inflates, so the C ≤ 4 rows are dropped from any plateau claim.

For spotting the transition I only need the rate of change. A rising index is enough.

The regime transition

I swept the 340B (TP=8, 1P/2D) across concurrency and watched two columns: TTFT P99 and PoA-hat.

C	TTFT P99	ITL P99	PoA-hat	rps
8	170 ms	21.9 ms	19.0	1.3
64	739 ms	22.7 ms	18.7	10.2
96	544 ms	22.5 ms	18.7	15.3
128	16.2 s	22.6 ms	27.9	16.5
256	83.2 s	22.3 ms	176.9	17.6
384	109 s	23.3 ms	283.6	18.0

TTFT P99 went from 74 ms at C=1 to 113 s at C=512, a 1,500x blowup, while ITL P99 sat at 21.7 ± 1.3 ms the whole time. Decode never noticed the prefill bottleneck. PoA-hat held flat near 19 until C=128, then climbed an order of magnitude.

That climb came from the latency function’s shape. The function is roughly linear below saturation, so the index is stable, but near capacity it grows a singular term, d/(n_sat − n)^β, a pole at the saturation point. That’s the same shape that makes the Price of Anarchy of an M/M/1 queue diverge as utilization ρ → 1. The driver is sensitivity. On identical workers the Nash allocations stay near-balanced, so load imbalance can’t be the cause. Near the pole, even a near-balanced equilibrium costs disproportionately more than an optimum that leaves headroom below the singularity.

Past the knee a second effect takes over: the single prefill worker queues no matter how cleanly you balance decode, so a PoA-hat of 284 at C=384 really means the system is in overload and no routing decision can manufacture prefill capacity.

Both models landed in the same three regimes. Same first post-knee grid point (C=128), same finite-difference magnitude across the knee (≈ 0.46 to 0.55 on the [64, 128] interval), and that held across a 4.9x parameter gap. The 340B plateau (18.7) sat 2.5x above the 70B 1P/2D plateau (7.47), which makes sense: a misroute on a bigger model costs more compute. The transition itself landed in the same place on both.

Tuning only matters at the knee

I went into the parameter sweep expecting the two knobs to shift the index everywhere, the way tuning usually does. Below saturation they were inert. PoA-hat held flat regardless of either: on the 340B at C=64 all 16 (τ, ω) cells landed at 18.7 ± 0.10, on the 70B 1P/2D at 7.47 ± 0.08. Coefficient of variation under 0.5%.

At the knee that flatness was gone. Here’s the 70B 1P/2D at C=128, every cell a PoA-hat for one (τ, ω) setting:

τ \ ω	0.0	0.3	0.7	1.0
0.0	19.6	18.4	25.4	19.1
0.3	27.5	20.6	14.6	15.6
0.7	28.2	26.4	24.6	20.5
1.0	21.2	25.7	27.2	15.4

The same sixteen settings that had been indistinguishable below the knee now spread 1.9x, from 14.6 to 28.2. More decode workers widened it further: the 70B 1P/5D spread hit 2.0x.

If no knob moved the steady 7-to-19x below the knee, where did it come from? I thought the router was making bad worker choices. It wasn’t. On the same cost model, round-robin, random, and power-of-two-choices all landed within 0.3 to 10% of the Hungarian optimum at C ≥ 8, the same as Dynamo’s greedy router. Static assignment is a solved problem here, and a head-to-head of routing algorithms would call this system efficient. The slowness was temporal: queuing delay, prefill/decode contention, batch scheduling, the cost of how requests stack up over time. Which worker each request landed on was almost a sideshow. The gap between the static-routing index (~1.02 to 1.08) and the measured one (~7 to 19) was the system-level inefficiency PoA-hat exists to surface, and no per-request policy closed it. Closing it would take coordinated batch assignment or faster P/D rebalancing.

At the knee those same knobs decided everything and that split is what I built the controller around.

The dynamics behind those two tables are easier to feel than to read off them. Here’s an interactive companion, The Orbits of Anarchy, embedded live:

Interactive companion. Open in a new tab for the full-size version.

The controller

Computing a Nash equilibrium directly was never on the table. Finding one is PPAD-complete, and computing an ε-approximate one stays PPAD-complete for constant ε; Dynamo’s router has under a millisecond per decision. The whole design leans onto the cheap side.

Every game-theory-inspired system that ships does some version of this. DRF uses progressive filling. Shockwave solves a Fisher market with convex optimization. Themis runs simple multi-round auctions. The theory names the property and supplies the vocabulary; a tractable mechanism approximates it. Mine is no different: the games tell you what to measure and when it matters, and a cheap detector does the runtime work.

Detect the transition from aggregate metrics, switch to parameters already known to work for that regime, and leave the system alone the rest of the time. That turns the PPAD-hard equilibrium problem into a cheap classification, run every few seconds.

I wrapped ~270 lines of Python around Dynamo’s KvPushRouter, no changes to the Rust core. Detection runs on an EWMA of TTFT P99 (α = 0.3), polled from Prometheus every 5 seconds. The theory points at the second derivative of latency with respect to load as the hardware-agnostic signal, but at 5s polling that estimate drowned in sampling noise, so I settled on a simpler trigger: a smoothed TTFT P99 crossing a threshold θ1. Three regimes, three settings of the router’s temperature τ and overlap weight ω:

Regime	τ	ω	Why
Below	0.0	1.0	exploit cache locality (PoA-hat bounded)
Transition	0.7	1.0	calibrated optimum from the 70B 1P/5D sweep
Saturated	0.8	0.1	load-balance, suppress hotspots (conjectural)

That Saturated row is conjectural, and I flag it as such. It never fired in any reported run, because the load spike didn’t hold past θ2 long enough under the EWMA hysteresis. The Below and Transition rows are the ones my experiments actually exercised.

I built the switch as a zero-downtime port redirect through the existing router_config_override hook: two frontends run side by side (default params on 8000, optimal on 8001) and traffic moves on detection with no restart. Two notes worth flagging:

θ1 has to scale with the model’s baseline TTFT. An absolute threshold breaks across models: the 300 ms value tuned on the 70B (baseline ~55 ms) trips early on the 340B (baseline ~150 to 200 ms). Set it at roughly 3 to 5x baseline and it transfers; 1.0 s works for the 340B.
The switch fired at 54.8 ± 0.2s into the saturated phase across all three 340B iterations. A 0.2s spread across runs told me the detector was locked onto the real transition.

What it buys

I ran a three-phase load spike (C = 32 → 128 → 32, durations 120/180/120s), three iterations per strategy, static (τ=0, ω=1 throughout) versus adaptive. Two models: Nemotron-4-340B at TP=8 (full-node workers, cross-InfiniBand KV transfers) and Llama-3.1-70B at TP=4. All numbers below are the saturated phase.

Config	Metric	Static	Adaptive	Δ
70B 1P/5D	PoA-hat	66.4 ± 12.2	21.5 ± 0.17	3.1x
70B 1P/5D	TTFT P99	8.2 s	4.2 s	1.9x
70B 1P/2D	PoA-hat	23.1 ± 1.6	10.7 ± 0.5	2.2x
70B 1P/2D	TTFT P99	25.9 s	3.4 s	7.6x
340B 1P/2D	PoA-hat	33.2 ± 1.1	25.8 ± 0.07	22%
340B 1P/2D	TTFT P99 (aggregate)	28.3 ± 5.9 s	5.9 ± 0.06 s	4.8x
340B 1P/2D	TTFT P99 (post-switch steady state)	28.3 s	0.97 s	~29x

The flagship number is the 3.1x PoA-hat cut on the 70B 1P/5D, and it’s the statistically clean one: the static baseline has an 18% coefficient of variation, and the ratio doesn’t straddle zero in any credible interval. The 70B TTFT ratios are demoted on purpose: their static baselines are wild (the 1P/5D static iterations were 2.6, 13.8, and 8.2s, CV 68%), so a TTFT ratio there would need n ≥ 5 before it can be trusted.

The 340B aggregate hides the real shape. The switch fired ~55s into a 180s saturated phase, so the aggregate mixed ~55s at the bad operating point with ~125s at the good one. Post-switch steady state was ~0.97s against a static ~28.3s, roughly 29x. It costs throughput: 11.6 rps adaptive versus 18.2 static in the saturated phase, a 36% reduction. On the 70B 1P/5D the trade is gentler, 44.3 rps versus 50.9, a 13% cut. You’re moving the operating point toward latency, at the one moment it pays off. Same model, same silicon, only the assignment changes.

There’s a second payoff in the variance. Static saturated-phase TTFT P99 swung from 21.5 to 31.8s across iterations on the 340B; adaptive held 5.9 ± 0.06s. A lot of what the controller buys is killing the queue explosion that drives that spread.

Some uncertainties

The measurement caveats matter as much as the headline, so here they are flat:

Most cells are single runs. Experiments 1, 2, and the Pareto sweeps are one trial per configuration. The ±0.10 spread across the C=64 grid measures parameter sensitivity. It is not a measurement-error bar, so an individual value like the 18.7 plateau could move on rerun. Only the adaptive comparison has repeats (n=3).
Saturation measurements drift. The same nominal 340B config at C=128 reported PoA-hat of 27.9, 36.0, and 33.2 across three experiments, a ~29% spread. Saturation-regime parameter-sensitivity claims on the 340B should be read with that in mind.
The 1P/2D adaptive gains are lower bounds. The single (τ=0.7, ω=1.0) setting tuned on the 70B 1P/5D was reused across every topology, to test whether one regime-gated config generalizes. A per-topology native sweep would do at least as well.
The workload is deliberately narrow. 5 prompt templates, 128-token inputs, 256-token outputs, temperature 0. Clean to measure, but it leaves the cache-placement game (Game 2) nearly degenerate: 5 prefixes across 2 to 5 workers partition trivially, and 128-token blocks never come close to filling 192 GB of HBM, so the multi-tier cache cost is never exercised. The parameter invariance below saturation may partly reflect that easy cache problem. Whether the regime structure holds on variable-length, multi-turn, or production-trace traffic is the biggest open question.
Game 1 (P/D resource) is analytical only, since the topologies here use fixed P/D splits, and its GPU-utilization numbers are synthetic. The hierarchical KV PoA is a conjecture (closest analog is Chun et al.’s O(√n) line-topology bound under uniform cost, which the tier-heterogeneous model here relaxes). Existence of pure Nash equilibria for the routing game with ω > 0 under general prefix structures is open.

And the n = 1 pipeline caveat applies throughout: one cluster, two models, three topologies.

What it changes for operators

The three-game model names what operators already trade off by hand, and the regime structure shows that tuning has no measurable effect until the knee. The controller is a drop-in that acts on both of those facts without touching the Rust core. As disaggregated serving moves into general production (NVIDIA reports 30x throughput for DeepSeek-R1 on a disaggregated GB200 NVL72 rack), knowing when tuning matters, and shipping something that acts on it, is worth having before the next traffic spike.

Where this goes next: MoE serving turns routing into a nested congestion game (requests to GPUs, then tokens to experts), where the inefficiency could compound layer over layer. Multi-tenant clusters add genuine strategic agents, the case where the game theory stops being only a lens and the mechanism-design machinery (strategy-proof auctions, incentive-compatible allocation) starts to earn its keep. And a Planner that watched the PoA signal directly could drive P/D rebalancing harder than the current one-worker-per-30-seconds cadence, going after the saturation bottleneck at its source.

1950s comic-style illustration of a relaxed engineer in a lab coat leaning back with his hands behind his head under a sign reading ENGINEER DON'T WORRY IT'S AUTOMATED, a tank stamped GPU BUDGET pouring into a boiler stamped DYNAMO that feeds two pipes labeled PREFILL and DECODE through a PLANNER gauge, a SMART ROUTER panel sorting incoming requests, a wall chart labeled PRICE OF ANARCHY showing a flat line that spikes sharply upward at a mark labeled SATURATION, and a small wind-up tin robot labeled CONTROLLER reaching for a wall switch labeled SATURATION OVERRIDE

When you send a prompt to a large model, the work splits into two phases. One reads your whole prompt. The other writes the reply one token at a time. Those two phases want different things from the hardware, so the fastest systems now run them on two separate pools of GPUs. NVIDIA’s Dynamo is the leading example, and it already ships the parts that run the split: a smart router that decides which GPU each request lands on, and a planner that divides the GPU budget between the two pools.

Splitting the work buys speed. It also starts a fight. The two pools draw from one fixed pile of GPUs, so every chip the prefill side grabs is one the decode side doesn’t get. Each new request needs a GPU too, and they pile into whatever slots are open. So does the cache that holds earlier work, crowding against the scarce fast memory. Three fights at once, every piece chasing what’s best for itself, and I modelled each as its own game. There’s a name for the gap between all that grabbing and one perfectly coordinated plan: the Price of Anarchy. The wider the gap, the more of your hardware goes to waste.

I demonstrated this using a 3-node B200 cluster, running a 70B and a 340B model, and measured how far its selfish, request-by-request decisions fall short of the best possible assignment, and how that gap moves as load climbs. The answer came out cleaner than I expected.

While the GPUs had headroom, the router’s settings barely mattered. Sixteen combinations of the two main tuning knobs all landed within a hair of each other. If you’ve been hand-tuning your router during normal traffic, the measurements say you’ve been polishing something that doesn’t move.

I assumed the router was picking the wrong GPUs. The numbers said otherwise. The obvious alternatives, round-robin, random, pick-the-emptier-of-two, all land within a few percent of perfect. Below the knee, the slowness lives somewhere else. It’s the cost of requests piling up and waiting on each other over time, and no routing rule erases that.

Then the system saturates, and those same settings suddenly decide everything. Past a sharp threshold (the “knee”), the inefficiency jumps by roughly ten times. The right setting keeps things responsive; the wrong one tips the whole thing over. The threshold showed up at the same point on both models, even though one is almost five times larger than the other. That surprised me. It comes from how the system is built, so the model’s size doesn’t move it.

If tuning only pays off at the saturation point, the move is obvious. Leave the system alone until it’s about to tip, then flip the settings at exactly that moment.

That fix is about 270 lines of Python. It watches the system’s live metrics every few seconds, spots the saturation transition, and switches the router’s behavior on the fly. It needs no changes to Dynamo’s core code. Any existing deployment could bolt it on.

Same model, same hardware, with the add-on watching:

Worst-case response time at peak load drops by 4.8x to 7.6x.
On the big model, once the switch fires mid-overload, the worst-case tail falls from 28 seconds to about 1 second.
The system gets steadier, too: response times that used to swing wildly from run to run under heavy load settle into a tight, predictable band.

None of this is free. Buying that latency back costs throughput, between 13% and 36% depending on the setup. That’s a deliberate trade. The difference is that you make it on a schedule you chose, instead of discovering it mid-incident.

Disaggregated serving is moving into mainstream production right now (NVIDIA quotes a 30x throughput gain on a disaggregated rack for one popular model). Operators are out there tuning these routers by hand. This work puts names on the trade-offs they already feel, and shows that tuning has no real effect until the system nears its limit. The small add-on handles the hard part from there.

1950s comic-style illustration of a cramped diner kitchen during a dinner rush, under a DINNER RUSH banner and a ceiling ticket rail jammed full of orders. The left cook, apron labeled READER, holds one enormous order ticket unspooling to the floor; the right cook, apron labeled PLATER, plates peas one at a time. A young apprentice boy in a paper hat dashes between them with an armful of tickets, and a small wind-up tin robot labeled HELPER 3000 stands on the counter reaching for a wall switch labeled CALM RULES. Handwritten signs read HURRY UP BUT DON'T MESS UP and ONE ORDER AT A TIME, DO IT RIGHT

Picture a tiny diner. Three people run it:

The Reader grabs each ticket and reads the whole order.
The Plater builds the meal, one spoonful at a time.
The Runner, a young apprentice, carries each order to whichever cook can take it.

The Reader and the Plater are the two cooks, good at different things, so each gets their own counter and both go faster. The Runner never lets an order pick its own cook; he seats every one.

Now the problem: both cooks share one stove (the computer chips). When one cook hogs the burners, the other gets fewer. And the orders keep flying in faster than the cooks clear them. It gets frantic fast.

But the apprentice can only ever do what’s best for the order in his hand right now. Seat every order that way, grab-the-best-you-can-see, and the kitchen still runs slower than if a calm manager had planned the whole rush in advance. That gap between grab-it-yourself and a perfect plan has a fancy name, the Price of Anarchy. This work measures exactly how big it gets.

While there’s plenty of stove to go around, fussing with the kitchen rules barely changes a thing. I tried sixteen different sets of them. They all came out about the same, so the fussing was mostly for nothing.

The apprentice is already seating orders fine. Any sensible rule he follows works about as well. The slow part is just everyone stuck waiting in line behind everyone else.

But the moment the burners almost run out, the rules suddenly matter a lot. Good rules keep the kitchen calm. Bad rules turn it into a screaming mess, ten times worse. And that tipping point shows up at the same spot whether the diner is cozy or enormous.

So I built a tiny helper, and it does one job. It sits at the pass and watches the ticket rail over the apprentice’s shoulder. The moment the kitchen is about to back up, it taps him: switch to the calm rules. The rest of the time it stays quiet and lets him work.

The helper is small and simple, and you don’t have to add a cook, buy a bigger stove, or rebuild the kitchen to use it.

That’s the diner. Back in the real AI machine, when it gets slammed with questions:

The slowest answers come back 5 to 8 times faster.
On the biggest model, the worst wait drops from about 28 seconds down to 1 second.

There’s a catch, though: to get the slow answers out fast, the machine handles a few fewer questions overall. That’s a fair trade, and now you get to make it on purpose, before a rush forces it on you.

The full paper, with the proofs, the Dynamo architecture diagram, and the complete experiment tables, is on arXiv: 2606.17081. NVIDIA Dynamo, the serving stack all of this runs on, is at developer.nvidia.com/dynamo. If you work on disaggregated serving, or you’ve measured anything like this on your own cluster, I’d like to compare notes. Reach me on GitHub, LinkedIn, or by email.

Special thanks to Vultr for providing the resources to run these experiments, and especially to Kevin Cochrane for his support and guidance.