Lab Workspace
Hot-swap Inference
Serve LoRA adapters from a running HybrIE runtime — load and unload them at runtime and route individual requests through them, with no restart and no redeploy.
Load and unload at runtime
POST /v1/adapters/:id/load
POST /v1/adapters/:id/unload
Loading places a registered adapter into the running runtime's memory so it can serve requests; unloading frees it. Both happen live: the base model keeps serving throughout, and no process restart or redeploy is involved. Only loaded adapters can be used at inference time.
Load body:
| Parameter | Type | Description |
|---|---|---|
| backends | string[] | Optional backends to load the adapter on. Omit to load on the default backend. |
| pin | boolean | Pin the adapter so it is never evicted. max_active_loras caps how many adapters can be loaded at once; unpinned adapters can be evicted to make room. |
Unload accepts backends only. The commands below load the adapter pinned, then unload it:
curl -X POST http://localhost:8080/v1/adapters/refund-policy/load \
-H "Content-Type: application/json" \
-d '{"pin": true}'

curl -X POST http://localhost:8080/v1/adapters/refund-policy/unload \
-H "Content-Type: application/json" \
-d '{}'
Check what is currently loaded with GET /v1/adapters/status.
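The same load and unload calls can be prepared from a script. The following is a minimal Python sketch using only the standard library; the helper name build_adapter_request is hypothetical, and the host and port are taken from the curl examples. It builds the requests without sending them, so it runs without a live runtime.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port copied from the curl examples


def build_adapter_request(adapter_id, action, backends=None, pin=None):
    """Build (but do not send) a POST request that loads or unloads an adapter.

    Hypothetical helper: only the paths and body fields come from the docs.
    """
    if action not in ("load", "unload"):
        raise ValueError("action must be 'load' or 'unload'")
    if action == "unload" and pin is not None:
        # The docs state that unload accepts `backends` only.
        raise ValueError("pin is only valid when loading")

    body = {}
    if backends is not None:
        body["backends"] = list(backends)
    if pin is not None:
        body["pin"] = bool(pin)

    return urllib.request.Request(
        url=f"{BASE_URL}/v1/adapters/{adapter_id}/{action}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


load_req = build_adapter_request("refund-policy", "load", pin=True)
unload_req = build_adapter_request("refund-policy", "unload")
# To send one of them against a running runtime:
#   urllib.request.urlopen(load_req)
```

Omitting backends leaves the field out of the body entirely, which matches the documented way to target the default backend.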
Per-request routing headers
Target a loaded adapter on a single request by passing the x-hybrie-adapter-id header on chat completions. The runtime serves that request through the adapter while every other request continues to use the base model:
POST /v1/chat/completions
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-hybrie-adapter-id: refund-policy" \
-d '{
"model": "qwen3-4b",
"messages": [{"role": "user", "content": "What is our refund policy?"}]
}'
The full set of routing headers in v0.1.65:
| Header | Type | Description |
|---|---|---|
| x-hybrie-adapter-id | string | Serve this request through the named loaded adapter. Also settable as gRPC metadata hybrie.adapter_id; the header wins when both are present. |
| x-hybrie-execution-mode | string | local, cloud, or hybrid: where the request executes. |
| x-hybrie-cloud-provider | string | Cloud provider to use when execution goes to the cloud. |
| x-hybrie-preferred-peer-id | string | Route the request to a specific registered compute peer. |
Routing is per request, not per runtime — different requests can target different adapters concurrently on the same base model. Omit the headers and the request is served by the base model alone with default routing.
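Because routing is decided per request, a client typically assembles the routing headers fresh for each call. The sketch below is one possible Python helper, not part of any HybrIE SDK: the function names routing_headers and effective_adapter_id are hypothetical, and the second function only mirrors the documented rule that the header wins over gRPC metadata.

```python
EXECUTION_MODES = {"local", "cloud", "hybrid"}  # values listed in the header table


def routing_headers(adapter_id=None, execution_mode=None,
                    cloud_provider=None, preferred_peer_id=None):
    """Return only the routing headers that were explicitly set.

    An empty dict means: base model, default routing.
    """
    if execution_mode is not None and execution_mode not in EXECUTION_MODES:
        raise ValueError(f"execution_mode must be one of {sorted(EXECUTION_MODES)}")
    candidates = {
        "x-hybrie-adapter-id": adapter_id,
        "x-hybrie-execution-mode": execution_mode,
        "x-hybrie-cloud-provider": cloud_provider,
        "x-hybrie-preferred-peer-id": preferred_peer_id,
    }
    return {name: value for name, value in candidates.items() if value is not None}


def effective_adapter_id(header_value=None, grpc_metadata_value=None):
    """Mirror the documented precedence: the HTTP header wins over gRPC metadata."""
    return header_value if header_value is not None else grpc_metadata_value


# Two requests that could be in flight at once on the same base model.
adapter_call = routing_headers(adapter_id="refund-policy", execution_mode="local")
base_call = routing_headers()  # no headers: served by the base model
```

Merge the returned dict into the request's other headers (for example Content-Type) before sending the chat completion.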
Pinning, TTL, and sessions
Adapters produced by POST /v1/d2l/internalize can be pinned at creation so they are never evicted, carry an optional ttl_seconds after which the adapter auto-expires and stops being served, and can be tied to a session via session_key. Use short TTLs for ephemeral, per-session internalizations and pinned long-lived adapters for shared knowledge.
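To reason about how pinning, TTLs, and the max_active_loras cap interact, a toy model can help. The Python sketch below is not the runtime's implementation. It assumes, purely for illustration, that the oldest unpinned adapter is evicted first and that a TTL applies whether or not an adapter is pinned; the source specifies neither. The class names AdapterSlot and ToyAdapterPool and the adapter ids session-a and session-b are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AdapterSlot:
    adapter_id: str
    loaded_at: float                     # seconds on an arbitrary clock
    pinned: bool = False
    ttl_seconds: Optional[float] = None  # None means no expiry
    session_key: Optional[str] = None

    def expired(self, now: float) -> bool:
        return self.ttl_seconds is not None and now - self.loaded_at >= self.ttl_seconds


class ToyAdapterPool:
    """Toy model of a capped adapter pool; the eviction order is an ASSUMPTION."""

    def __init__(self, max_active_loras: int):
        self.max_active_loras = max_active_loras
        self.slots: Dict[str, AdapterSlot] = {}

    def load(self, slot: AdapterSlot, now: float) -> None:
        # Expired adapters stop being served, so drop them before counting.
        self.slots = {k: s for k, s in self.slots.items() if not s.expired(now)}
        if slot.adapter_id not in self.slots and len(self.slots) >= self.max_active_loras:
            unpinned = [s for s in self.slots.values() if not s.pinned]
            if not unpinned:
                raise RuntimeError("pool is full and every adapter is pinned")
            # ASSUMPTION: evict the oldest unpinned adapter to make room.
            victim = min(unpinned, key=lambda s: s.loaded_at)
            del self.slots[victim.adapter_id]
        self.slots[slot.adapter_id] = slot

    def active(self, now: float) -> List[str]:
        return sorted(s.adapter_id for s in self.slots.values() if not s.expired(now))


pool = ToyAdapterPool(max_active_loras=2)
pool.load(AdapterSlot("refund-policy", loaded_at=0, pinned=True), now=0)
pool.load(AdapterSlot("session-a", loaded_at=1, ttl_seconds=60, session_key="a"), now=1)
pool.load(AdapterSlot("session-b", loaded_at=2, ttl_seconds=60, session_key="b"), now=2)
```

In this model the pinned refund-policy adapter survives the third load while session-a is evicted, and session-b drops out once its 60 seconds have elapsed.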
Usage metering
Adapter-served traffic shows up in workspace usage metering: usage events record the adapter_id that served each request, and the month-to-date billing snapshot reports adapter coverage — the percentage of traffic served through adapters. See Usage & Billing.
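As a rough illustration of what adapter coverage measures, the Python sketch below computes it from a list of usage events. The event shape is hypothetical apart from the adapter_id field, and it assumes coverage is counted per request; the billing snapshot may weight traffic differently (for example by tokens).

```python
def adapter_coverage(events):
    """Percentage of requests whose usage event records an adapter_id.

    ASSUMPTION: coverage is measured per request, one event per request.
    """
    if not events:
        return 0.0
    served_by_adapter = sum(1 for event in events if event.get("adapter_id"))
    return 100.0 * served_by_adapter / len(events)


# Hypothetical usage events; only `adapter_id` is named in the docs.
events = [
    {"request_id": "r1", "adapter_id": "refund-policy"},
    {"request_id": "r2", "adapter_id": None},  # served by the base model
    {"request_id": "r3", "adapter_id": "refund-policy"},
    {"request_id": "r4"},                      # field absent: base model
]
coverage = adapter_coverage(events)
print(f"adapter coverage: {coverage:.1f}%")
```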
From the CLI
stimulir lab adapters load <adapter-id>
stimulir lab adapters unload <adapter-id>
Next
- Train and register adapters in PEFT Tuning (LoRA).
- See how adapter coverage affects your bill in Usage & Billing.
