Inference API
OpenAI-compatible chat completions through the console, routed to your BYOK providers or Managed Inference — plus the HybrIE runtime endpoints for BYOC deployments.
Console inference endpoint
/api/v1/inference/chat/completions
Drop-in OpenAI-compatible chat completions at https://api.stimulir.com. Authenticate with a hyb_* API key:
curl https://api.stimulir.com/api/v1/inference/chat/completions \
  -H "Authorization: Bearer hyb_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
Requests are routed by model prefix: models matching one of your BYOK credentials are sent upstream with your own provider key; everything else is served by Managed Inference. Usage is metered per request; see Usage & Billing.
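The prefix rule can be pictured with a small sketch. This is an illustration only, not the console's actual router: the function name route_model and the example prefix are hypothetical, and the real matching logic may differ.

```python
def route_model(model: str, byok_prefixes: list[str]) -> str:
    """Return "byok" if the model name starts with a BYOK prefix, else "managed".

    Hypothetical illustration of prefix routing; not the console's real code.
    """
    for prefix in byok_prefixes:
        if model.startswith(prefix):
            return "byok"
    return "managed"


# Assumed example: a single BYOK credential that claims the "gpt-" prefix.
example_prefixes = ["gpt-"]
print(route_model("gpt-4.1-mini", example_prefixes))
print(route_model("qwen3-4b", example_prefixes))
```

With no BYOK credentials configured, every model name falls through to the managed branch in this sketch.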
Because the endpoint is OpenAI-compatible, existing OpenAI SDK clients work by pointing base_url at https://api.stimulir.com/api/v1/inference and using a hyb_* key.
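For clients that do not use an OpenAI SDK, the same call can be made with the Python standard library. The sketch below mirrors the curl example; the helper names are hypothetical, and the response parsing assumes the usual OpenAI shape (choices[0].message.content), which the console is expected but not confirmed here to return.

```python
import json
import urllib.request

BASE_URL = "https://api.stimulir.com/api/v1/inference"


def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the console chat completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat(req: urllib.request.Request) -> str:
    """Send the request and return the first reply text.

    Assumes an OpenAI-style response body.
    """
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Usage (requires a real hyb_* key and network access):
#   print(send_chat(build_chat_request("hyb_...", "qwen3-4b", "Hello")))
```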
From the CLI
stimulir infer chat "Draft a runbook for failed payouts" --model qwen3-4b --stream
stimulir models
HybrIE runtime endpoints
In BYOC deployments you run the HybrIE runtime yourself. It serves an OpenAI-compatible HTTP API on port 8080 (gRPC on 9090).
Chat completions
/v1/chat/completions
Streaming chat completions. Local models (Qwen3 / Qwen3-Coder) run via Candle on Metal or CUDA, with cloud fallback to OpenAI, Anthropic, Gemini, or Mistral when configured. Supported sampling parameters:
| Parameter | Type | Description |
|---|---|---|
| temperature | float | Sampling temperature. |
| top_p | float | Nucleus sampling probability mass. |
| top_k | integer | Top-k sampling cutoff. |
| thinking_mode | boolean | Enable model thinking/reasoning traces where the model supports it. |
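A request that sets these parameters might look like the sketch below. Several things here are assumptions rather than documented behavior: that the runtime is reachable at localhost:8080, that the parameters are passed as top-level fields of the JSON body (as temperature and top_p are in the OpenAI API), and the default values, which are illustrative and not the runtime's defaults. The helper name is hypothetical.

```python
import json
import urllib.request

# Assumption: a BYOC runtime listening locally on its HTTP port.
RUNTIME_URL = "http://localhost:8080"


def build_sampling_request(
    prompt: str,
    model: str,
    *,
    temperature: float = 0.7,   # illustrative default, not the runtime's
    top_p: float = 0.9,         # illustrative default
    top_k: int = 40,            # illustrative default
    thinking_mode: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a chat completions request with sampling settings."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumed to be top-level body fields, as in the OpenAI request shape.
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "thinking_mode": thinking_mode,
    }
    return urllib.request.Request(
        f"{RUNTIME_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```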
Routing headers
Runtime requests can be steered per request with x-hybrie-* headers (v0.1.65):
| Header | Type | Description |
|---|---|---|
| x-hybrie-adapter-id | string | Serve the request through a loaded LoRA adapter, with no restart needed. Also settable as gRPC metadata hybrie.adapter_id; if both are set, the header takes precedence. See Hot-swap Inference for loading and adapter details. |
| x-hybrie-execution-mode | string | Where the request executes: local, cloud, or hybrid. |
| x-hybrie-cloud-provider | string | Cloud provider to use when execution goes to the cloud. |
| x-hybrie-preferred-peer-id | string | Route the request to a specific registered compute peer. |
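The headers in the table can be assembled per request with a small helper. This is a sketch: the function name routing_headers is hypothetical, and only the execution-mode values are validated because they are the only ones the table enumerates. Accepted values for the adapter, provider, and peer headers depend on your deployment.

```python
# The three execution modes listed in the table above.
EXECUTION_MODES = {"local", "cloud", "hybrid"}


def routing_headers(
    execution_mode=None,
    adapter_id=None,
    cloud_provider=None,
    preferred_peer_id=None,
) -> dict:
    """Return the x-hybrie-* headers for the options that were provided."""
    if execution_mode is not None and execution_mode not in EXECUTION_MODES:
        raise ValueError(f"execution_mode must be one of {sorted(EXECUTION_MODES)}")
    candidates = {
        "x-hybrie-execution-mode": execution_mode,
        "x-hybrie-adapter-id": adapter_id,
        "x-hybrie-cloud-provider": cloud_provider,
        "x-hybrie-preferred-peer-id": preferred_peer_id,
    }
    # Omit headers that were not set so the runtime applies its own routing.
    return {name: value for name, value in candidates.items() if value is not None}


# "my-adapter" is a placeholder adapter ID, not a real one.
headers = {"Content-Type": "application/json"}
headers.update(routing_headers(execution_mode="local", adapter_id="my-adapter"))
```

The resulting dictionary can be passed as the headers of any HTTP client call to the runtime.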
Multimodal (vision)
Multimodal models accept image content parts in the OpenAI image_url shape. The runtime currently advertises three multimodal models — claude-sonnet-4-6, gemini-3-flash-preview, and gpt-4.1-mini; check GET /v1/models for the live list.
{
  "model": "gpt-4.1-mini",
  "messages": [
    { "role": "user", "content": [
      { "type": "text", "text": "What is in this image?" },
      { "type": "image_url",
        "image_url": { "url": "https://example.com/receipt.png" } }
    ] }
  ]
}
Embeddings
/v1/embeddings
OpenAI-compatible embeddings.
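Because the endpoint follows the OpenAI embeddings shape, a request and response handler can be sketched as below. The runtime address, the helper names, and the model argument are assumptions; this page does not name an embedding model, so pass one that your runtime actually lists.

```python
import json
import urllib.request

# Assumption: a BYOC runtime listening locally on its HTTP port.
RUNTIME_URL = "http://localhost:8080"


def build_embeddings_request(texts: list[str], model: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request."""
    body = {"model": model, "input": texts}
    return urllib.request.Request(
        f"{RUNTIME_URL}/v1/embeddings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vectors(response: dict) -> list[list[float]]:
    """Return embeddings in input order from an OpenAI-style response.

    Assumes {"data": [{"index": int, "embedding": [...]}, ...]}.
    """
    ordered = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]
```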
Audio
/v1/audio/transcriptions
Speech-to-text using local Whisper.
/v1/audio/speech
Text-to-speech. The local Qwen2.5-Omni voice model is experimental.
Qwen2.5-Omni also serves the bidirectional voice loop — see Realtime Voice.
Models
/v1/models
List models available on the runtime.
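A minimal sketch of reading the model list, assuming the runtime returns the OpenAI list shape ({"data": [{"id": ...}, ...]}) and is reachable at localhost:8080. Both are assumptions, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: a BYOC runtime listening locally on its HTTP port.
RUNTIME_URL = "http://localhost:8080"


def model_ids(response: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style list response."""
    return [entry["id"] for entry in response.get("data", [])]


def fetch_model_ids() -> list[str]:
    """GET /v1/models from the runtime and return the advertised model IDs."""
    with urllib.request.urlopen(f"{RUNTIME_URL}/v1/models") as resp:
        return model_ids(json.load(resp))


# Usage (requires a running runtime):
#   print(fetch_model_ids())
```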
