Inference API

OpenAI-compatible chat completions through the console, routed to your BYOK providers or Managed Inference — plus the HybrIE runtime endpoints for BYOC deployments.

Console inference endpoint

POST /api/v1/inference/chat/completions

Drop-in OpenAI-compatible chat completions at https://api.stimulir.com. Authenticate with a hyb_* API key:

curl
curl https://api.stimulir.com/api/v1/inference/chat/completions \
  -H "Authorization: Bearer hyb_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Requests are routed by model prefix: models matching one of your BYOK credentials are sent upstream with your own provider key; everything else is served by Managed Inference. Usage is metered per request — see Usage & Billing.
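
The prefix rule can be pictured with a small sketch. Everything here is illustrative: the prefixes, the provider labels, and the `route` helper are hypothetical, and the console's actual matching logic may differ.

```python
# Hypothetical prefix table; real prefixes depend on your BYOK credentials.
BYOK_PREFIXES = {"gpt-": "openai", "claude-": "anthropic"}


def route(model: str, byok_prefixes: dict) -> str:
    """Return the BYOK provider whose prefix matches, else 'managed'."""
    for prefix, provider in byok_prefixes.items():
        if model.startswith(prefix):
            return provider
    return "managed"


print(route("gpt-4.1-mini", BYOK_PREFIXES))  # matches a BYOK prefix
print(route("qwen3-4b", BYOK_PREFIXES))      # no match, Managed Inference
```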

Because the endpoint is OpenAI-compatible, existing OpenAI SDK clients work by pointing base_url at https://api.stimulir.com/api/v1/inference and using a hyb_* key.
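
As a minimal sketch of that setup with the official `openai` Python package: the SDK appends `/chat/completions` to `base_url`, which yields the endpoint above. The `STIMULIR_API_KEY` environment variable and the `ask` helper are assumptions made for this example, not names defined by the platform.

```python
import os

BASE_URL = "https://api.stimulir.com/api/v1/inference"


def client_kwargs(api_key: str) -> dict:
    """Constructor arguments for an OpenAI SDK client aimed at the console."""
    return {"base_url": BASE_URL, "api_key": api_key}


def ask(prompt: str, model: str = "qwen3-4b") -> str:
    """Send one user message and return the assistant's reply text."""
    # Imported lazily so this module loads even without the SDK installed.
    from openai import OpenAI  # pip install openai

    # STIMULIR_API_KEY is a hypothetical variable name holding a hyb_* key.
    client = OpenAI(**client_kwargs(os.environ["STIMULIR_API_KEY"]))
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Calling `ask("Hello")` with the key exported should behave like the curl request above.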

From the CLI

bash
stimulir infer chat "Draft a runbook for failed payouts" --model qwen3-4b --stream
stimulir models

HybrIE runtime endpoints

In BYOC deployments you run the HybrIE runtime yourself. It serves an OpenAI-compatible HTTP API on port 8080 (gRPC on 9090).

Chat completions

POST /v1/chat/completions

Streaming chat completions. Local models (Qwen3 / Qwen3-Coder) run via Candle on Metal or CUDA, with cloud fallback to OpenAI, Anthropic, Gemini, or Mistral when configured. Supported sampling parameters:

Parameter      Type     Description
temperature    float    Sampling temperature.
top_p          float    Nucleus sampling probability mass.
top_k          integer  Top-k sampling cutoff.
thinking_mode  boolean  Enable model thinking/reasoning traces where the model supports it.
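
A sketch of a streaming request that sets these parameters, using only the Python standard library. It assumes the runtime is reachable at `localhost:8080` without authentication, that `qwen3-4b` is a valid model name on your runtime, and that streamed chunks follow the OpenAI server-sent-events format (`data: {...}` lines ending with `data: [DONE]`). The helper names are illustrative.

```python
import json
import urllib.request

RUNTIME_URL = "http://localhost:8080"  # assumption: runtime on this host
SAMPLING_PARAMS = {"temperature", "top_p", "top_k", "thinking_mode"}


def build_chat_request(prompt, model="qwen3-4b", **sampling):
    """Build (without sending) a streaming chat request with sampling options."""
    unknown = set(sampling) - SAMPLING_PARAMS
    if unknown:
        raise ValueError(f"unsupported sampling parameters: {sorted(unknown)}")
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        **sampling,
    }
    return urllib.request.Request(
        RUNTIME_URL + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse_line(line: bytes):
    """Return the text delta carried by one SSE line, or None if there is none."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None
    payload = text[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(request):
    """Send the request and yield text deltas as they arrive."""
    with urllib.request.urlopen(request) as response:
        for line in response:
            delta = parse_sse_line(line)
            if delta:
                yield delta
```

For example, `"".join(stream_chat(build_chat_request("Hello", temperature=0.2, top_k=40)))` collects the full reply.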

Routing headers

Runtime requests can be steered per request with x-hybrie-* headers (v0.1.65):

Header                      Type    Description
x-hybrie-adapter-id         string  Serve the request through a loaded LoRA adapter, with no restart. Also settable as gRPC metadata hybrie.adapter_id; the header wins. See Hot-swap Inference for loading and adapter details.
x-hybrie-execution-mode     string  Where the request executes: local, cloud, or hybrid.
x-hybrie-cloud-provider     string  Cloud provider to use when execution goes to the cloud.
x-hybrie-preferred-peer-id  string  Route the request to a specific registered compute peer.
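
One way to assemble these headers on the client side is sketched below. The `routing_headers` helper and the adapter ID `support-lora-v2` are hypothetical; only the header names and the three execution modes come from the table above.

```python
EXECUTION_MODES = {"local", "cloud", "hybrid"}


def routing_headers(adapter_id=None, execution_mode=None,
                    cloud_provider=None, preferred_peer_id=None):
    """Build the x-hybrie-* headers for one request, omitting unset ones."""
    if execution_mode is not None and execution_mode not in EXECUTION_MODES:
        raise ValueError(
            f"execution_mode must be one of {sorted(EXECUTION_MODES)}"
        )
    candidates = {
        "x-hybrie-adapter-id": adapter_id,
        "x-hybrie-execution-mode": execution_mode,
        "x-hybrie-cloud-provider": cloud_provider,
        "x-hybrie-preferred-peer-id": preferred_peer_id,
    }
    return {name: value for name, value in candidates.items()
            if value is not None}


# "support-lora-v2" is a made-up adapter ID for illustration.
headers = routing_headers(adapter_id="support-lora-v2", execution_mode="local")
print(headers)
```

The resulting dictionary can be merged into the headers of any HTTP client call to the runtime.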

Multimodal (vision)

Multimodal models accept image content parts in the OpenAI image_url shape. The runtime currently advertises three multimodal models — claude-sonnet-4-6, gemini-3-flash-preview, and gpt-4.1-mini; check GET /v1/models for the live list.

vision request body
{
  "model": "gpt-4.1-mini",
  "messages": [
    { "role": "user", "content": [
      { "type": "text", "text": "What is in this image?" },
      { "type": "image_url",
        "image_url": { "url": "https://example.com/receipt.png" } }
    ] }
  ]
}
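
The same body can be built programmatically. This is a small sketch; `vision_message` is an illustrative helper, not part of any SDK.

```python
import json


def vision_message(question: str, image_url: str) -> dict:
    """Build a user message with one text part and one image_url part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


body = {
    "model": "gpt-4.1-mini",
    "messages": [
        vision_message("What is in this image?",
                       "https://example.com/receipt.png")
    ],
}
print(json.dumps(body, indent=2))
```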

Embeddings

POST /v1/embeddings

OpenAI-compatible embeddings.
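
A sketch of what an OpenAI-compatible embeddings call usually looks like. It assumes the runtime follows the OpenAI request shape (`model`, `input`) and response shape (`data[].embedding` with an `index`), and that it listens on `localhost:8080`. The model name is left to the caller because this page does not list embedding models.

```python
import json
import urllib.request

RUNTIME_URL = "http://localhost:8080"  # assumption: runtime on this host


def build_embeddings_request(texts, model):
    """Build (without sending) an embeddings request for a list of strings."""
    body = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        RUNTIME_URL + "/v1/embeddings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vectors(response_json: dict) -> list:
    """Return the embedding vectors ordered by their 'index' field."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, model):
    """Send the request and return one vector per input string."""
    with urllib.request.urlopen(build_embeddings_request(texts, model)) as resp:
        return extract_vectors(json.load(resp))
```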

Audio

POST /v1/audio/transcriptions

Speech-to-text using local Whisper.
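
The sketch below builds a multipart upload by hand with the standard library. It assumes the runtime mirrors the OpenAI transcription request, that is, a `multipart/form-data` body with `file` and `model` fields, and that it listens on `localhost:8080`. The field names and any model name you pass are assumptions to verify against your deployment.

```python
import urllib.request
import uuid

RUNTIME_URL = "http://localhost:8080"  # assumption: runtime on this host


def build_transcription_request(filename: str, audio: bytes, model: str):
    """Build (without sending) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        RUNTIME_URL + "/v1/audio/transcriptions",
        data=head + audio + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` sends the upload; in the OpenAI format the JSON response carries the transcript in a `text` field.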

POST /v1/audio/speech

Text-to-speech. The local Qwen2.5-Omni voice model is experimental.

Qwen2.5-Omni also serves the bidirectional voice loop — see Realtime Voice.

Models

GET /v1/models

List models available on the runtime.
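
A short sketch for reading the list, assuming the response uses the OpenAI list shape (`{"object": "list", "data": [{"id": ...}]}`) and that the runtime listens on `localhost:8080`. The helper names are illustrative.

```python
import json
import urllib.request


def model_ids(listing: dict) -> list:
    """Extract model IDs from an OpenAI-style model listing."""
    return [entry["id"] for entry in listing.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8080") -> list:
    """Query GET /v1/models on the runtime and return the model IDs."""
    with urllib.request.urlopen(base_url + "/v1/models") as response:
        return model_ids(json.load(response))
```

Checking `"gpt-4.1-mini" in fetch_model_ids()` before sending a vision request is one way to confirm a multimodal model is currently advertised.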