Hot-swap Inference

Serve LoRA adapters from a running HybrIE runtime — load and unload them at runtime and route individual requests through them, with no restart and no redeploy.

Load and unload at runtime

POST /v1/adapters/:id/load

POST /v1/adapters/:id/unload

Loading places a registered adapter into the running runtime's memory so it can serve requests; unloading frees it. Both happen live — the base model keeps serving throughout, and no process restart or redeploy is involved. Only loaded adapters can be used at inference time.

Load body:

Parameter	Type	Description
`backends`	string[]	Optional backends to load the adapter on. Omit to load on the default backend.
`pin`	boolean	Pin the adapter so it is never evicted. `max_active_loras` caps how many adapters can be loaded at once — unpinned adapters can be evicted to make room.

Unload accepts backends only. The examples below load refund-policy as a pinned adapter, then unload it with an empty body:

curl

curl -X POST http://localhost:8080/v1/adapters/refund-policy/load \
  -H "Content-Type: application/json" \
  -d '{"pin": true}'

curl -X POST http://localhost:8080/v1/adapters/refund-policy/unload \
  -H "Content-Type: application/json" \
  -d '{}'
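To target specific backends, the load body combines both fields from the table above. The following is a minimal sketch rather than an official client: the HYBRIE_URL variable, the function names, and the backend names gpu-0 and gpu-1 are hypothetical, and the base URL simply mirrors the examples above.

```shell
# Base URL of the runtime; localhost:8080 mirrors the curl examples above.
HYBRIE_URL="${HYBRIE_URL:-http://localhost:8080}"

# Build a load body. $1 (required) is the pin flag, true or false.
# Any further arguments are backend names, emitted as the "backends" array.
# With no backend names the field is omitted, so the default backend is used.
load_body() {
  pin="$1"; shift
  if [ "$#" -eq 0 ]; then
    printf '{"pin": %s}' "$pin"
    return
  fi
  list=""
  for b in "$@"; do
    # Append each name as a quoted JSON string, comma-separated.
    list="${list:+$list, }\"$b\""
  done
  printf '{"backends": [%s], "pin": %s}' "$list" "$pin"
}

# POST the body to /v1/adapters/:id/load.
# Usage: load_adapter refund-policy true gpu-0 gpu-1
load_adapter() {
  id="$1"; shift
  curl -X POST "$HYBRIE_URL/v1/adapters/$id/load" \
    -H "Content-Type: application/json" \
    -d "$(load_body "$@")"
}
```

The helper does no JSON escaping, so it only suits backend names without quotes or backslashes.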

Check what is currently loaded with GET /v1/adapters/status.

Per-request routing headers

Target a loaded adapter on a single request by passing the x-hybrie-adapter-id header on chat completions. The runtime serves that request through the adapter while every other request continues to use the base model:

POST /v1/chat/completions

curl

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-hybrie-adapter-id: refund-policy" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "What is our refund policy?"}]
  }'

The full set of routing headers in v0.1.65:

Header	Type	Description
`x-hybrie-adapter-id`	string	Serve this request through the named loaded adapter. Also settable as gRPC metadata `hybrie.adapter_id` — the header wins when both are present.
`x-hybrie-execution-mode`	string	`local`, `cloud`, or `hybrid` — where the request executes.
`x-hybrie-cloud-provider`	string	Cloud provider to use when execution goes to the cloud.
`x-hybrie-preferred-peer-id`	string	Route the request to a specific registered compute peer.
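The headers in the table can be combined on one request. As a sketch only, the helper below checks the execution mode against the three values listed and turns the supplied options into header arguments for curl. The function names and HYBRIE_URL are hypothetical, and the provider and peer values must come from your own deployment, since this page does not list them.

```shell
# Base URL of the runtime; localhost:8080 mirrors the curl examples on this page.
HYBRIE_URL="${HYBRIE_URL:-http://localhost:8080}"

# Print one "name: value" line per routing option that was supplied.
# $1 = execution mode, $2 = cloud provider, $3 = preferred peer id.
# Empty arguments are skipped; a mode outside the documented three is rejected.
routing_headers() {
  case "$1" in
    ""|local|cloud|hybrid) ;;
    *) echo "unknown execution mode: $1" >&2; return 1 ;;
  esac
  if [ -n "$1" ]; then printf 'x-hybrie-execution-mode: %s\n' "$1"; fi
  if [ -n "$2" ]; then printf 'x-hybrie-cloud-provider: %s\n' "$2"; fi
  if [ -n "$3" ]; then printf 'x-hybrie-preferred-peer-id: %s\n' "$3"; fi
}

# Send a chat completion with the selected routing headers.
# $1 = JSON request body; remaining arguments as for routing_headers.
routed_chat() {
  body="$1"; shift
  headers="$(routing_headers "$@")" || return 1
  set -- -s -X POST "$HYBRIE_URL/v1/chat/completions" \
    -H "Content-Type: application/json" -d "$body"
  # Turn each printed header line into its own -H argument.
  while IFS= read -r line; do
    if [ -n "$line" ]; then set -- "$@" -H "$line"; fi
  done <<EOF
$headers
EOF
  curl "$@"
}
```

For example, routed_chat "$BODY" hybrid sends only the execution-mode header, while routed_chat "$BODY" "" "" peer-7 pins the request to a peer and leaves the other options at their defaults.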

Routing is per request, not per runtime — different requests can target different adapters concurrently on the same base model. Omit the headers and the request is served by the base model alone with default routing.
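Because routing is decided per request, concurrent requests can each name a different adapter. A rough sketch using shell background jobs follows. The second adapter id shipping-faq, the output file names, and the function names are hypothetical, and both adapters are assumed to be loaded already.

```shell
# Base URL of the runtime; localhost:8080 mirrors the curl examples on this page.
HYBRIE_URL="${HYBRIE_URL:-http://localhost:8080}"

# Request body reused by every call; model and prompt come from the example above.
BODY='{"model": "qwen3-4b", "messages": [{"role": "user", "content": "What is our refund policy?"}]}'

# Send one chat completion. $1 is the adapter id; pass "" for the base model.
chat() {
  if [ -n "$1" ]; then
    curl -s -X POST "$HYBRIE_URL/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -H "x-hybrie-adapter-id: $1" \
      -d "$BODY"
  else
    # No routing header: the base model serves the request.
    curl -s -X POST "$HYBRIE_URL/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d "$BODY"
  fi
}

# Start three requests at once, each routed differently, then wait for all.
fan_out() {
  chat refund-policy > refund-policy.json &
  chat shipping-faq > shipping-faq.json &
  chat "" > base-model.json &
  wait
}
```

Calling fan_out leaves one response file per route in the current directory, which makes it easy to compare adapter output against the base model.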

Pinning, TTL, and sessions

Adapters produced by POST /v1/d2l/internalize can be pinned at creation so they are never evicted, can carry an optional ttl_seconds after which the adapter expires and stops being served, and can be tied to a session via session_key. Use short TTLs for ephemeral, per-session internalizations, and pin long-lived adapters that hold shared knowledge.
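For the ephemeral case, the two option names given above can be assembled as follows. This is only a sketch of the lifecycle options: the function name is hypothetical, the other fields that POST /v1/d2l/internalize requires are not described on this page and are left out, and placing the options at the top level of the body is an assumption. The field name for pinning at creation is not given here, so it is omitted as well.

```shell
# Print the lifecycle options for an ephemeral, per-session adapter.
# $1 = ttl_seconds (whole number of seconds), $2 = session_key.
# The output is NOT a complete internalize request; merge it with the
# fields the endpoint actually requires.
ephemeral_options() {
  ttl="$1"; session="$2"
  case "$ttl" in
    # Reject empty input, non-digits, and leading zeros (invalid in JSON).
    ""|*[!0-9]*|0?*)
      echo "ttl_seconds must be a whole number of seconds" >&2
      return 1 ;;
  esac
  printf '{"ttl_seconds": %s, "session_key": "%s"}' "$ttl" "$session"
}

# Usage: ephemeral_options 900 sess-42
```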

Usage metering

Adapter-served traffic shows up in workspace usage metering: usage events record the adapter_id that served each request, and the month-to-date billing snapshot reports adapter coverage — the percentage of traffic served through adapters. See Usage & Billing.
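As an illustration of the coverage figure, the sketch below computes the share of requests that carry an adapter_id. It assumes coverage is counted per request and that you have already extracted one adapter_id per line from your usage events, with a blank line for each base-model request. The actual event format and the exact formula behind the billing snapshot are not specified here, and the function name is hypothetical.

```shell
# Read one adapter_id per line on stdin (blank line = served by the base model)
# and print the percentage of requests that went through an adapter.
adapter_coverage() {
  awk 'NF { served++ }
       END {
         if (NR == 0) print "0.0"
         else printf "%.1f\n", 100 * served / NR
       }'
}

# Usage: two of four requests used an adapter.
#   printf 'refund-policy\n\nrefund-policy\n\n' | adapter_coverage   # prints 50.0
```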

From the CLI

bash

stimulir lab adapters load <adapter-id>
stimulir lab adapters unload <adapter-id>
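To bring several adapters up in one step, the load command can be wrapped in a loop. This is a small sketch that assumes the CLI exits with a non-zero status when a load fails, which this page does not state. The function name and the second adapter id in the usage line are hypothetical.

```shell
# Load each adapter id given as an argument, stopping at the first failure.
# Assumes stimulir exits non-zero when a load fails.
load_adapters() {
  for id in "$@"; do
    stimulir lab adapters load "$id" || return 1
  done
}

# Usage: load_adapters refund-policy shipping-faq
```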

Next

Train and register adapters in PEFT Tuning (LoRA).
See how adapter coverage affects your bill in Usage & Billing.
