Autoscaling
Serving a model means choosing how much GPU capacity to keep running. Provision for peak load and you pay for idle GPUs whenever traffic is low; provision for average load and requests pile up when traffic spikes. Autoscaling removes that trade‑off: Emissary adds replicas when load rises and removes them when load falls, so capacity tracks demand and you pay closer to what you actually use.
This page explains when autoscaling applies, the configurations you control, how the scaling decision is made, and how to tune it.
What a replica is
A replica is one GPU‑backed copy of your engine — a single inference server, on its own GPU(s), capable of handling a share of your traffic. Autoscaling is the process of changing how many replicas are running. Two replicas serve roughly twice the traffic of one. A cluster engine always keeps at least one replica running, and scales up to your maximum as load rises.
Every replica serves the engine's base model plus all of its fine‑tuned LoRA deployments, so adding a replica adds capacity for every deployment on the engine at once.
Configurations
You configure autoscaling when you create a cluster engine. The configurations below are what you control; everything else (which GPU, how replicas are provisioned, how requests are routed) is handled for you.
| Configuration | Controls | Default |
|---|---|---|
| Min replicas | The floor — replicas never drop below this. Minimum is 1. | 1 |
| Max replicas | The ceiling — replicas never rise above this. Caps cost and protects against runaway scaling. | 3 |
| Autoscaling metric | What the scaler measures: concurrency (in‑flight requests) or requests per second (QPS). | Concurrency (batch_request) |
| Target per replica | The per‑replica load the scaler aims to hold (concurrent requests or QPS, depending on the metric). | 10 |
| Upscale delay | How long elevated load must persist before a replica is added. | 10s |
| Downscale delay | How long load must stay low before a replica is removed. | 1800s (30 min) |
Min replicas is at least 1 — cluster engines do not scale to zero, so there's always a warm
replica ready to serve. Min replicas must be ≤ max replicas. Leaving a configuration unset falls
back to the platform default for that field.
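The rules above can be sketched as a small validated settings object. This is only an illustration: the class and field names are hypothetical rather than an Emissary API, and the check that the target is positive is an added assumption. The defaults mirror the table.

```python
from dataclasses import dataclass

# Metric identifiers as they appear on this page.
VALID_METRICS = ("batch_request", "request_rate")


@dataclass(frozen=True)
class AutoscalingConfig:
    """Hypothetical container for the settings in the table above."""
    min_replicas: int = 1
    max_replicas: int = 3
    metric: str = "batch_request"      # concurrency; "request_rate" is QPS
    target_per_replica: float = 10
    upscale_delay_s: int = 10
    downscale_delay_s: int = 1800      # 30 minutes

    def __post_init__(self):
        if self.min_replicas < 1:
            raise ValueError("min_replicas must be at least 1 (no scale to zero)")
        if self.min_replicas > self.max_replicas:
            raise ValueError("min_replicas must be <= max_replicas")
        if self.metric not in VALID_METRICS:
            raise ValueError(f"metric must be one of {VALID_METRICS}")
        if self.target_per_replica <= 0:   # assumption, not stated on this page
            raise ValueError("target_per_replica must be positive")


# Unset fields fall back to the defaults listed in the table.
config = AutoscalingConfig(max_replicas=5)
print(config)
```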
How autoscaling works
A dedicated controller watches your engine's traffic and decides how many replicas it should have. The loop is straightforward:
- Measure load. The controller continuously samples traffic over a rolling 60‑second window: either the request rate or the number of concurrent in‑flight requests, depending on your chosen metric.
- Compute the desired replica count. It divides measured load by your per‑replica target and rounds up, then clamps the result between your min and max replicas:
  desired_replicas = clamp(ceil(load / target_per_replica), min_replicas, max_replicas)
- Apply hysteresis. The controller re‑evaluates roughly every 20 seconds. A change isn't applied the instant the number moves: the new desired count must hold for the upscale delay (to add replicas) or the downscale delay (to remove them). This keeps brief spikes and dips from thrashing the replica count.
- Act. The controller adds or removes replicas to reach the desired count, then waits for the next evaluation.
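The replica-count formula in the loop above translates directly into a few lines of Python. This is a minimal sketch of the arithmetic only; the function name is illustrative, and the delays described in the hysteresis step are deliberately left out.

```python
import math


def desired_replicas(load, target_per_replica, min_replicas=1, max_replicas=3):
    """Divide load by the per-replica target, round up, then clamp to [min, max]."""
    raw = math.ceil(load / target_per_replica)
    return max(min_replicas, min(raw, max_replicas))


print(desired_replicas(350, 100, 1, 5))  # 4
print(desired_replicas(700, 100, 1, 5))  # 5, clamped at max replicas
print(desired_replicas(0, 100, 1, 5))    # 1, clamped at min replicas
```

The last two calls show the clamp at work: load beyond the ceiling does not add replicas, and an idle engine still keeps its minimum.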
Worked example 1: Concurrency (batch_request)
Your engine uses the concurrency metric (the default) with a target of 100 concurrent
requests per replica, a min of 1, and a max of 5. Here load is the peak number of
simultaneous in‑flight requests over the window.
- Peak concurrency is ~80 in‑flight requests → ceil(80 / 100) = 1 replica.
- Concurrency climbs to a peak of ~350 in‑flight requests → ceil(350 / 100) = 4 replicas (once the upscale delay has elapsed).
- Concurrency settles back to ~80 → returns to 1 replica, but only after the downscale delay has elapsed.
Worked example 2: Requests per second (request_rate)
The same math applies to the request rate metric, except load is requests per second. With
a target of 10 QPS per replica, a min of 1, and a max of 5:
- Steady ~8 requests/second → ceil(8 / 10) = 1 replica.
- A climb to ~32 requests/second → ceil(32 / 10) = 4 replicas.
- Back to ~8 requests/second → returns to 1 replica after the downscale delay.
Choosing a scaling metric
Emissary offers two ways to measure load. Pick the one that matches how your workload behaves.
Concurrency (default)
Scales on the number of in‑flight requests. The controller takes the peak concurrency over the 60‑second window and divides by your target concurrency per replica. Because it provisions for the peak rather than the average, it reacts well to bursts and handles long‑running requests (such as streaming generations) where the number of requests is a poor proxy for actual load.
This is the default metric on Emissary. For most inference workloads, request rate is a weak signal — requests vary widely in how long they take, so two engines serving the same QPS can be under very different real load. Concurrency measures the work actually in flight, which tracks how busy each replica is far more reliably across the range of workloads our clients run.
Requests per second
Scales on requests per second. The controller counts completed requests over the 60‑second window, computes QPS, and divides by your target QPS per replica. It's a good fit for steady, request‑oriented traffic where each request takes a similar amount of time, so QPS is a faithful stand‑in for load.
Emissary also sets a per‑replica admission limit automatically based on task type — classification replicas accept far more concurrent requests than text‑generation replicas, because a classification call is a single fast forward pass while text generation streams tokens over a longer period. You don't configure this; it's tuned for you per task.
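As a rough illustration of the two load measurements described in this section, the sketch below computes peak concurrency and request rate over a 60-second window. The function names, the sample formats, and the exact window boundaries are assumptions made for the example, not Emissary's implementation.

```python
WINDOW_S = 60  # rolling window length described on this page


def peak_concurrency(samples, now_s):
    """samples: iterable of (timestamp_s, in_flight_count) observations."""
    recent = [count for ts, count in samples if now_s - WINDOW_S < ts <= now_s]
    return max(recent, default=0)


def request_rate(completion_times_s, now_s):
    """completion_times_s: timestamps (seconds) at which requests completed."""
    recent = [ts for ts in completion_times_s if now_s - WINDOW_S < ts <= now_s]
    return len(recent) / WINDOW_S


# Concurrency samples: only the observations at 70 s and 100 s fall in the window.
samples = [(0, 5), (30, 40), (70, 12), (100, 8)]
print(peak_concurrency(samples, now_s=100))  # 12

# 480 completions spread evenly over the last 60 seconds.
completions = [40 + 0.125 * i for i in range(1, 481)]
print(request_rate(completions, now_s=100))  # 8.0
```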
Min and max replicas
Min replicas is your warm baseline — the capacity the engine always keeps running. The
minimum is 1: a cluster engine does not scale to zero, so there's always a replica ready and
requests never wait for a GPU to come online in steady state. Raise min replicas above 1 when
you want guaranteed headroom for redundancy or to absorb sudden spikes without waiting on a
scale‑up.
Scaling all the way down to zero replicas when an engine is idle — and paying nothing for GPU
while it sits idle — is something we're actively working toward. We're holding off until we've
settled two things: how to gracefully handle requests that arrive while a replica is warming
up from zero (so those requests wait rather than fail), and shortening warm‑up time itself
so that the cold start after an idle period is short enough to be acceptable. Until then, the
minimum is 1.
Max replicas is your safety ceiling. The engine will never exceed it, even under heavy load —
excess traffic is served by the existing replicas (at higher per‑replica load) rather than by
provisioning unbounded GPUs. Set it from your expected peak: roughly
peak_load / target_per_replica, plus a little headroom.
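The sizing rule of thumb can be written out as a short helper. Treat this as a sketch: the function name is hypothetical, and expressing "a little headroom" as one extra replica is an assumption, not a platform recommendation.

```python
import math


def suggested_max_replicas(peak_load, target_per_replica, headroom_replicas=1):
    """Replicas needed at the expected peak, plus spare replicas as headroom."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    return math.ceil(peak_load / target_per_replica) + headroom_replicas


# Expected peak of ~350 in-flight requests with a target of 100 per replica.
print(suggested_max_replicas(350, 100))  # 5
```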
Scaling dynamics and oscillation
The two delays are how you balance responsiveness against stability:
- Upscale delay (default 10s): short, so the engine adds capacity quickly when load rises.
- Downscale delay (default 30 min): deliberately long, so the engine holds warm capacity through brief lulls instead of tearing down replicas that it will need again moments later. Because spinning a replica back up incurs a cold start, it's usually cheaper to keep one warm for a while than to remove it prematurely.
If you see replicas repeatedly scaling up and down (oscillation), increase the downscale delay first — it's the most effective lever for smoothing the replica count. A short upscale delay paired with a long downscale delay gives you fast scale‑up while protecting capacity during temporary dips.
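To see how the two delays shape the replica count over time, here is a small simulation of the hold-before-acting behavior. It is a sketch under stated assumptions: the class is hypothetical, and it assumes the timer restarts whenever the desired count changes, which may differ from how the real controller tracks a pending change.

```python
import math


class HysteresisScaler:
    """Illustrative model of delayed scaling decisions (not Emissary's code)."""

    def __init__(self, target, min_replicas=1, max_replicas=3,
                 upscale_delay_s=10, downscale_delay_s=1800):
        self.target = target
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.upscale_delay_s = upscale_delay_s
        self.downscale_delay_s = downscale_delay_s
        self.replicas = min_replicas
        self._pending = None  # (desired_count, time it was first observed)

    def step(self, now_s, load):
        """Evaluate once and return the replica count after this evaluation."""
        raw = math.ceil(load / self.target)
        desired = max(self.min_replicas, min(raw, self.max_replicas))
        if desired == self.replicas:
            self._pending = None  # nothing to change; forget any pending move
            return self.replicas
        # Assumption: the timer restarts whenever the desired count changes.
        if self._pending is None or self._pending[0] != desired:
            self._pending = (desired, now_s)
        delay = (self.upscale_delay_s if desired > self.replicas
                 else self.downscale_delay_s)
        if now_s - self._pending[1] >= delay:
            self.replicas = desired
            self._pending = None
        return self.replicas


scaler = HysteresisScaler(target=100, max_replicas=5)
# Evaluate every 20 s: a burst to ~350 in-flight requests, then back to ~80.
for now_s, load in [(0, 80), (20, 350), (40, 350), (60, 80), (80, 80)]:
    print(now_s, load, scaler.step(now_s, load))
```

In this run the count rises to 4 at the 40-second evaluation and is still 4 at 80 seconds, because the 30-minute downscale delay has not yet elapsed.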
Replica lifecycle
When the controller decides to add a replica, it goes through a few stages before it serves traffic:
- Provisioning: a GPU instance is requested from a cloud provider.
- Warm‑up: the replica boots, pulls the model, and starts its inference server. Emissary allows up to 20 minutes for a single‑GPU replica (25 minutes for multi‑GPU) to become healthy; this is the cost you pay on a cold start.
- Ready: once the replica passes its /health check, it joins the pool and the load balancer begins routing requests to it.
On scale‑down, a replica is drained before it's removed: it stops receiving new requests and is given a grace period (about two minutes) to finish in‑flight requests, after which its GPU instance is released. In‑flight requests are not dropped when the engine scales in.
LoRA adapters under autoscaling
On a cluster engine, your fine‑tuned deployments are LoRA adapters, and the replica set is constantly changing. Emissary keeps adapters consistent automatically:
- When a new replica comes online, the controller loads every registered adapter onto it before routing adapter traffic there, so a request for any deployment can be served by any ready replica.
- The load balancer maintains a map of which replicas have which adapters loaded and routes each request, keyed by the model you specify, to a replica that can serve it.
- When you add or remove a fine‑tuned deployment, the change is propagated across the live replicas without recreating the engine.
You don't have to think about any of this; you point requests at a deployment by name and the engine handles placement.
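For readers curious about what an adapter-aware routing map might look like, the sketch below keeps a replica-to-adapters table and picks a replica for a requested model name. Everything here is hypothetical, including the class, the method names, and the random choice among candidates; the page does not describe the actual balancing policy.

```python
import random


class AdapterRouter:
    """Toy routing table: which ready replicas have which adapters loaded."""

    def __init__(self):
        self._loaded = {}  # replica_id -> set of model/adapter names

    def replica_ready(self, replica_id, adapters):
        """Record a replica once its adapters are loaded and it is healthy."""
        self._loaded[replica_id] = set(adapters)

    def replica_removed(self, replica_id):
        """Forget a replica that is draining or gone."""
        self._loaded.pop(replica_id, None)

    def candidates(self, model):
        """Replicas that can serve the requested model name."""
        return sorted(r for r, names in self._loaded.items() if model in names)

    def route(self, model):
        options = self.candidates(model)
        if not options:
            raise LookupError(f"no ready replica has {model!r} loaded")
        return random.choice(options)  # placeholder policy, an assumption


router = AdapterRouter()
router.replica_ready("replica-a", ["base", "support-bot"])
router.replica_ready("replica-b", ["base", "support-bot"])
print(router.candidates("support-bot"))  # ['replica-a', 'replica-b']
```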
Cost
A cluster engine has two cost components: the GPU replicas, which scale with traffic, and a fixed controller.
- GPU replicas are billed for the time they run, so this part of your cost scales with the replica count:
  - Min replicas sets your GPU baseline: the capacity you always pay for (at least one replica, since the engine never scales to zero). Higher means more warm capacity and higher baseline cost.
  - Max replicas caps your worst‑case GPU spend during a spike.
  - Downscale delay trades cost for stability: a longer hold keeps replicas (and their cost) alive through lulls.
- Controller — each cluster engine runs its own dedicated controller (the component that watches traffic, makes scaling decisions, and routes requests) on CPU. This adds a flat $0.50/hour for as long as the engine is active, on top of GPU cost. On‑demand engines have no controller and so don't carry this charge.
Emissary shows an estimated hourly cost range before you create an autoscaling engine, spanning the min‑replica and max‑replica GPU cases plus the controller, so you can see both your floor and your ceiling up front.
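The floor and ceiling of that range follow from simple arithmetic, sketched below. The $0.50/hour controller charge comes from this page; the GPU hourly price in the example is a made-up figure, and the function name is illustrative.

```python
CONTROLLER_USD_PER_HOUR = 0.50  # flat controller charge stated on this page


def hourly_cost_range(gpu_usd_per_hour, min_replicas, max_replicas):
    """Return (floor, ceiling) hourly cost in USD for a cluster engine."""
    floor = min_replicas * gpu_usd_per_hour + CONTROLLER_USD_PER_HOUR
    ceiling = max_replicas * gpu_usd_per_hour + CONTROLLER_USD_PER_HOUR
    return floor, ceiling


# Hypothetical GPU price of $2.00/hour with the default min 1 and max 3.
print(hourly_cost_range(2.00, min_replicas=1, max_replicas=3))  # (2.5, 6.5)
```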
Recommended starting points
- Development / light use: min 1, max 1–2, default delays. The single warm replica keeps costs predictable while you iterate.
- Steady production traffic: min 1 (or higher for redundancy), max sized from your peak, concurrency metric. Capacity follows demand with a warm replica always ready.
- Bursty or streaming traffic: concurrency metric, a short upscale delay, and a longer downscale delay so the engine reacts fast to bursts but doesn't thrash on the way down.
When in doubt, start with the defaults and adjust max replicas to your observed peak and the downscale delay to dampen any oscillation you see.