Hot-swap Inference

Serve LoRA adapters from a running HybrIE runtime — load and unload them at runtime and route individual requests through them, with no restart and no redeploy.

Load and unload at runtime

POST /v1/adapters/:id/load
POST /v1/adapters/:id/unload

Loading places a registered adapter into the running runtime's memory so it can serve requests; unloading frees it. Both happen live — the base model keeps serving throughout, and no process restart or redeploy is involved. Only loaded adapters can be used at inference time.

Load body:

Parameter   Type       Description
backends    string[]   Optional backends to load the adapter on. Omit to load on the default backend.
pin         boolean    Pin the adapter so it is never evicted. max_active_loras caps how many adapters can be loaded at once; unpinned adapters can be evicted to make room.
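
The interaction between pin and max_active_loras can be pictured with a small toy model. This is a sketch under stated assumptions, not HybrIE's implementation: the class name AdapterSlots is hypothetical, the choice of the oldest-loaded unpinned adapter as the eviction victim is an assumption (the actual eviction order is not specified here), and so is the error raised when every slot holds a pinned adapter.

```python
from collections import OrderedDict
from typing import List


class AdapterSlots:
    """Toy model of a capped adapter pool (hypothetical, for illustration only)."""

    def __init__(self, max_active_loras: int) -> None:
        self.max_active_loras = max_active_loras
        # adapter_id -> pinned flag, ordered from oldest-loaded to newest-loaded
        self.loaded: "OrderedDict[str, bool]" = OrderedDict()

    def load(self, adapter_id: str, pin: bool = False) -> List[str]:
        """Load an adapter and return the ids that were evicted to make room."""
        evicted: List[str] = []
        # Reloading an adapter frees its old slot first and updates its pin flag.
        self.loaded.pop(adapter_id, None)
        while len(self.loaded) >= self.max_active_loras:
            # ASSUMPTION: evict the oldest-loaded adapter that is not pinned.
            victim = next((a for a, pinned in self.loaded.items() if not pinned), None)
            if victim is None:
                # ASSUMPTION: with only pinned adapters loaded, the load fails.
                raise RuntimeError("no unpinned adapter available to evict")
            del self.loaded[victim]
            evicted.append(victim)
        self.loaded[adapter_id] = pin
        return evicted

    def unload(self, adapter_id: str) -> None:
        """Free the slot held by an adapter, if it is loaded."""
        self.loaded.pop(adapter_id, None)


slots = AdapterSlots(max_active_loras=2)
slots.load("refund-policy", pin=True)   # returns []
slots.load("shipping-faq")              # returns []
slots.load("returns-faq")               # returns ["shipping-faq"]; the pinned adapter stays
```

In this model a pinned adapter can only leave through an explicit unload, which is the property the pin flag is described as providing.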

The unload body accepts backends only. The following calls load refund-policy as a pinned adapter and then unload it:

curl
curl -X POST http://localhost:8080/v1/adapters/refund-policy/load \
  -H "Content-Type: application/json" \
  -d '{"pin": true}'

curl -X POST http://localhost:8080/v1/adapters/refund-policy/unload \
  -H "Content-Type: application/json" \
  -d '{}'

Check what is currently loaded with GET /v1/adapters/status.
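
The same three calls can be issued from Python. The sketch below uses only the standard library and the endpoints, body fields, and host shown above; the helper names (adapter_request, status_request, send) are hypothetical. Because the response schema is not described here, send returns the raw response text instead of assuming any fields.

```python
import json
import urllib.request
from typing import List, Optional

BASE_URL = "http://localhost:8080"  # same host and port as the curl examples


def adapter_request(
    adapter_id: str,
    action: str,
    backends: Optional[List[str]] = None,
    pin: Optional[bool] = None,
) -> urllib.request.Request:
    """Build (but do not send) a load or unload request for one adapter."""
    if action not in ("load", "unload"):
        raise ValueError("action must be 'load' or 'unload'")
    if action == "unload" and pin is not None:
        raise ValueError("unload accepts backends only")
    body = {}
    if backends is not None:
        body["backends"] = backends  # omitted means the default backend
    if pin is not None:
        body["pin"] = pin
    return urllib.request.Request(
        f"{BASE_URL}/v1/adapters/{adapter_id}/{action}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_request() -> urllib.request.Request:
    """Build the request that lists what is currently loaded."""
    return urllib.request.Request(f"{BASE_URL}/v1/adapters/status", method="GET")


def send(req: urllib.request.Request) -> str:
    """Send a request and return the raw response text (schema not assumed)."""
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


load_req = adapter_request("refund-policy", "load", pin=True)
unload_req = adapter_request("refund-policy", "unload")
# With a runtime listening on BASE_URL: send(load_req); send(status_request())
```

Keeping request construction separate from sending makes the payloads easy to inspect or log before anything reaches the runtime.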

Per-request routing headers

Target a loaded adapter on a single request by passing the x-hybrie-adapter-id header on chat completions. The runtime serves that request through the adapter while every other request continues to use the base model:

POST /v1/chat/completions
curl
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-hybrie-adapter-id: refund-policy" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "What is our refund policy?"}]
  }'

The full set of routing headers in v0.1.65:

Parameter                    Type     Description
x-hybrie-adapter-id          string   Serve this request through the named loaded adapter. Also settable as gRPC metadata hybrie.adapter_id; the header wins when both are present.
x-hybrie-execution-mode      string   local, cloud, or hybrid: where the request executes.
x-hybrie-cloud-provider      string   Cloud provider to use when execution goes to the cloud.
x-hybrie-preferred-peer-id   string   Route the request to a specific registered compute peer.

Routing is per request, not per runtime — different requests can target different adapters concurrently on the same base model. Omit the headers and the request is served by the base model alone with default routing.
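
Since routing is decided per request, a client can assemble the headers request by request. The following is a minimal sketch using the header names and execution modes listed above; the helper names routing_headers and chat_request are hypothetical, and the client-side check of the execution mode is an added convenience, not something the runtime is documented to require.

```python
import json
import urllib.request
from typing import Dict, Optional

EXECUTION_MODES = {"local", "cloud", "hybrid"}


def routing_headers(
    adapter_id: Optional[str] = None,
    execution_mode: Optional[str] = None,
    cloud_provider: Optional[str] = None,
    preferred_peer_id: Optional[str] = None,
) -> Dict[str, str]:
    """Return only the routing headers that were explicitly requested."""
    if execution_mode is not None and execution_mode not in EXECUTION_MODES:
        raise ValueError(f"execution_mode must be one of {sorted(EXECUTION_MODES)}")
    candidates = {
        "x-hybrie-adapter-id": adapter_id,
        "x-hybrie-execution-mode": execution_mode,
        "x-hybrie-cloud-provider": cloud_provider,
        "x-hybrie-preferred-peer-id": preferred_peer_id,
    }
    # Omitted headers mean base model and default routing, so drop unset values.
    return {name: value for name, value in candidates.items() if value is not None}


def chat_request(model: str, user_message: str, **routing: Optional[str]) -> urllib.request.Request:
    """Build (but do not send) a chat completion request with optional routing."""
    headers = {"Content-Type": "application/json", **routing_headers(**routing)}
    body = {"model": model, "messages": [{"role": "user", "content": user_message}]}
    return urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Two requests against the same base model: one through an adapter, one without.
via_adapter = chat_request("qwen3-4b", "What is our refund policy?", adapter_id="refund-policy")
base_only = chat_request("qwen3-4b", "What is our refund policy?")
```

Calling routing_headers() with no arguments yields an empty dict, which corresponds to the base-model, default-routing case.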

Pinning, TTL, and sessions

Adapters produced by POST /v1/d2l/internalize can be pinned at creation so they are never evicted, can carry an optional ttl_seconds after which the adapter auto-expires and stops being served, and can be tied to a session via session_key. Use short TTLs for ephemeral, per-session internalizations, and pinned, long-lived adapters for shared knowledge.
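
The difference between a pinned, long-lived adapter and a short-TTL session adapter can be illustrated with a toy lifecycle model. This sketch is not the internalize API: the InternalizedAdapter class and still_served helper are hypothetical, only ttl_seconds and session_key are names taken from the text above, and treating an adapter as expired exactly when its age reaches ttl_seconds is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class InternalizedAdapter:
    """Toy record of an internalized adapter (hypothetical, for illustration only)."""

    adapter_id: str
    created_at: float                  # creation time in seconds
    pinned: bool = False               # pinned adapters are never evicted
    ttl_seconds: Optional[int] = None  # None means the adapter never auto-expires
    session_key: Optional[str] = None  # session the adapter is tied to, if any

    def is_expired(self, now: float) -> bool:
        """Return True once the adapter has outlived its TTL."""
        if self.ttl_seconds is None:
            return False
        # ASSUMPTION: expiry happens as soon as age reaches ttl_seconds.
        return now - self.created_at >= self.ttl_seconds


def still_served(adapters: Iterable[InternalizedAdapter], now: float) -> List[str]:
    """Return the ids of adapters that have not expired at time `now`."""
    return [a.adapter_id for a in adapters if not a.is_expired(now)]


shared = InternalizedAdapter("handbook", created_at=0.0, pinned=True)
ephemeral = InternalizedAdapter(
    "chat-123-notes", created_at=0.0, ttl_seconds=900, session_key="chat-123"
)
still_served([shared, ephemeral], now=899.0)  # both adapters
still_served([shared, ephemeral], now=900.0)  # only "handbook"
```

The model keeps pinning and TTL independent: pinning protects against eviction, while the TTL bounds how long an adapter is served at all.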

Usage metering

Adapter-served traffic shows up in workspace usage metering: usage events record the adapter_id that served each request, and the month-to-date billing snapshot reports adapter coverage — the percentage of traffic served through adapters. See Usage & Billing.
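
As a rough illustration of the adapter coverage figure, the sketch below computes the percentage of requests that carry an adapter_id. The event shape (a mapping with an optional adapter_id key) and the choice to count requests, as opposed to tokens or another unit, are assumptions; the billing snapshot may define coverage differently.

```python
from typing import Iterable, Mapping, Optional


def adapter_coverage(events: Iterable[Mapping[str, Optional[str]]]) -> float:
    """Return the percentage of events that were served through an adapter."""
    total = 0
    via_adapter = 0
    for event in events:
        total += 1
        # ASSUMPTION: base-model requests have no adapter_id (missing or None).
        if event.get("adapter_id"):
            via_adapter += 1
    if total == 0:
        return 0.0  # no traffic yet, so report zero coverage
    return 100.0 * via_adapter / total


events = [
    {"adapter_id": "refund-policy"},
    {"adapter_id": None},
    {},
    {"adapter_id": None},
]
coverage = adapter_coverage(events)  # 25.0
```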

From the CLI

bash
stimulir lab adapters load <adapter-id>
stimulir lab adapters unload <adapter-id>
