Lab Workspace
Evaluation
Evaluate prompts, staged data assets, inference endpoints, PEFT adapters, and Doc-to-LoRA hypernet checkpoints from one durable Lab run record.
Durable eval runs
/api/v1/lab/evals/runs
/api/v1/lab/evals/runs/{run_id}
/api/v1/lab/evals/runs/{run_id}/execute
Lab evaluations are first-class workspace resources. A run persists the suite, cases, candidates, pending results, execution routing, and lineage before compute starts. The same route covers prompt regression, curated data or trace snapshots, baseline model endpoints, hot-swapped adapters, and external endpoint candidates.
| Parameter | Type | Description |
|---|---|---|
| suite_name (required) | string | Display name for the eval suite. |
| source | string | One of data_asset, prompt, trace, manual, or mixed. |
| data_asset_id | uuid | Optional Engineering data asset staged for target=eval. |
| prompt_refs | object[] | Prompt refs by key plus optional version or label. |
| candidates | object[] | Inference candidates: managed model, prompt version, adapter hot-swap, or external endpoint. |
| baseline_provider | string | Provider for the baseline candidate. Default hybrie. |
| baseline_model | string | Model or endpoint model for the baseline candidate. |
| execute | boolean | Create the run and immediately dispatch execution. |
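As a minimal sketch of how the parameters above fit together, the following Python assembles a request body client-side and validates `source` against the values listed in the table. The helper name `build_eval_run_payload` is hypothetical and not part of any Stimulir SDK; it only builds a dictionary and sends nothing.

```python
import json

# Allowed values for `source`, copied from the parameter table.
ALLOWED_SOURCES = {"data_asset", "prompt", "trace", "manual", "mixed"}


def build_eval_run_payload(suite_name, source=None, data_asset_id=None,
                           prompt_refs=None, baseline_provider=None,
                           baseline_model=None, execute=False):
    """Assemble a JSON body for creating an eval run (hypothetical helper)."""
    if not suite_name:
        raise ValueError("suite_name is required")
    if source is not None and source not in ALLOWED_SOURCES:
        raise ValueError(f"source must be one of {sorted(ALLOWED_SOURCES)}")
    payload = {"suite_name": suite_name, "execute": bool(execute)}
    # Optional fields are only included when the caller sets them.
    optional = {
        "source": source,
        "data_asset_id": data_asset_id,
        "prompt_refs": prompt_refs,
        "baseline_provider": baseline_provider,
        "baseline_model": baseline_model,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


body = build_eval_run_payload(
    "Prompt and data regression",
    source="mixed",
    data_asset_id="<asset-id>",
    prompt_refs=[{"key": "agent_chat_browser_research", "label": "staging"}],
    baseline_provider="hybrie",
    baseline_model="hybrie-runtime-default",
    execute=True,
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the request body when creating a run.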
# Create a durable prompt/data eval run
stimulir lab eval create-run \
--name "Prompt and data regression" \
--data-asset-id <asset-id> \
--prompt agent_chat_browser_research:staging \
--model hybrie-runtime-default
# Create and execute in one call
stimulir lab eval create-run --data-asset-id <asset-id> --prompt <key>:<version> --execute
# List and execute runs
stimulir lab eval runs
stimulir lab eval get <run-id>
stimulir lab eval execute-run <run-id>

curl -X POST https://api.stimulir.com/api/v1/lab/evals/runs \
-H "Authorization: Bearer $STIMULIR_TOKEN" \
-H "X-Business-Profile-Id: $WORKSPACE_ID" \
-H "Content-Type: application/json" \
-d '{
"suite_name": "Prompt and data regression",
"source": "mixed",
"data_asset_id": "<asset-id>",
"prompt_refs": [{"key": "agent_chat_browser_research", "label": "staging"}],
"baseline_provider": "hybrie",
"baseline_model": "hybrie-runtime-default",
"execute": true
}'
Use the /v1/eval/* routes below for synchronous low-level adapter, policy, and hypernet benchmark checks.
PEFT adapter evaluation
/v1/eval/adapter
Scores a classical PEFT adapter on held-out Needle-In-A-Haystack retrieval against the zero-LoRA baseline. The adapter may come from SFT training, an RL run with policy=peft-lora, an export from the D2L pipeline, or external registration. The call is synchronous.
| Parameter | Type | Description |
|---|---|---|
| family (required) | string | Base model family: qwen3-4b, qwen3-0.6b, or mistral-7b. |
| adapter_dir (required) | string | Directory containing the adapter weights: adapter_model.safetensors and adapter_config.json. |
| examples | integer | Number of held-out examples to score. Default 200. |
| seed | integer | Seed for reproducible example generation. Default 1. |
| device | string | auto (default), metal, cuda, or cpu. |
| model_dir | string | Optional base-model directory override. |
The metrics are nested under a report key in the response. The report carries the example count n and three accuracy metrics: adapter_acc, base_acc, and lift (adapter_acc − base_acc). It also includes the adapter's configuration so the score is self-describing:
| Field | Type | Description |
|---|---|---|
| r | integer | LoRA rank of the evaluated adapter. |
| lora_alpha | integer | LoRA alpha of the evaluated adapter. |
| scaling | float | Effective scaling applied to the low-rank update: alpha / r. |
| target_modules | string[] | Modules the adapter attaches to. |
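Because the report is self-describing, a client can sanity-check it before trusting the score. The sketch below assumes the report is a flat dictionary with the field names listed above; `check_adapter_report` is a hypothetical helper, and the sample values are illustrative rather than real results.

```python
import math


def check_adapter_report(report, tol=1e-6):
    """Return a list of inconsistencies found in an adapter eval report."""
    problems = []
    # scaling is documented as alpha / r.
    expected_scaling = report["lora_alpha"] / report["r"]
    if not math.isclose(report["scaling"], expected_scaling, abs_tol=tol):
        problems.append(
            f"scaling {report['scaling']} != lora_alpha / r = {expected_scaling}"
        )
    # lift is documented as adapter_acc - base_acc.
    expected_lift = report["adapter_acc"] - report["base_acc"]
    if not math.isclose(report["lift"], expected_lift, abs_tol=tol):
        problems.append(
            f"lift {report['lift']} != adapter_acc - base_acc = {expected_lift}"
        )
    return problems


# Illustrative values only, not measured results.
sample_report = {
    "n": 200,
    "adapter_acc": 0.9,
    "base_acc": 0.25,
    "lift": 0.65,
    "r": 8,
    "lora_alpha": 16,
    "scaling": 2.0,
    "target_modules": ["q_proj", "v_proj"],
}
print(check_adapter_report(sample_report))
```

An empty list means the derived fields agree with their documented definitions.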
curl -X POST http://localhost:8080/v1/eval/adapter \
-H "Content-Type: application/json" \
-d '{
"family": "qwen3-4b",
"adapter_dir": "~/hybrie-mounts/d2l-artifacts/rl-1718000000/adapter",
"examples": 200,
"seed": 42
}'

stimulir lab eval adapter --family qwen3-4b --adapter-dir <dir>
Reward evaluation (policies)
/v1/eval/rl
Reward evaluation of a policy on a verifiable environment: the harness samples tasks from the environment, generates completions with the chosen policy, and scores each completion with the environment's reward function. The call is synchronous.
| Parameter | Type | Description |
|---|---|---|
| family (required) | string | Base model family: qwen3-4b, qwen3-0.6b, or mistral-7b. |
| environment | string | Verifiable environment to evaluate on. Default niah, the only environment today. |
| policy | string | One of base (default), hypernet, or peft-lora. |
| checkpoint_dir | string | Required unless policy=base: a hypernet checkpoint for policy=hypernet, or a PEFT adapter directory for policy=peft-lora. |
| num_tasks | integer | Number of environment tasks to evaluate. Default 50. |
| max_new_tokens | integer | Generation budget per task. Default 16. |
| temperature | float | Sampling temperature. Default 0.0 (near-greedy). |
| seed | integer | Seed for reproducible task generation. Default 1. |
| pass_threshold | float | A task passes when its reward is ≥ this threshold. Default 1.0. |
| device | string | auto (default), metal, cuda, or cpu. |
| model_dir | string | Optional base-model directory override. |
The report summarizes the reward distribution and per-task outcomes:
| Field | Type | Description |
|---|---|---|
| environment | string | The environment that was evaluated. |
| policy | string | The policy that was evaluated. |
| n | integer | Number of tasks evaluated. |
| mean_reward | float | Mean reward across tasks. |
| std_reward | float | Standard deviation of reward. |
| min_reward | float | Lowest task reward. |
| max_reward | float | Highest task reward. |
| pass_rate | float | Fraction of tasks with reward ≥ pass_threshold. |
| pass_threshold | float | The threshold the pass rate was computed against. |
| per_task | object[] | Per-task records: task, reward, passed, and completion (truncated to 200 characters). |
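The summary fields in this table can be recomputed from the per_task records, which is useful for re-scoring the same run against a different pass threshold. The sketch below is a client-side illustration, not the harness's implementation: `summarize_rewards` is a hypothetical helper, and the use of the population standard deviation is an assumption, since the page does not say which variant std_reward uses.

```python
import statistics


def summarize_rewards(per_task, pass_threshold=1.0):
    """Recompute reward summary statistics from per-task records."""
    rewards = [task["reward"] for task in per_task]
    if not rewards:
        raise ValueError("per_task must contain at least one record")
    passed = sum(r >= pass_threshold for r in rewards)
    return {
        "n": len(rewards),
        "mean_reward": statistics.fmean(rewards),
        # Assumption: population standard deviation (pstdev).
        "std_reward": statistics.pstdev(rewards),
        "min_reward": min(rewards),
        "max_reward": max(rewards),
        "pass_rate": passed / len(rewards),
        "pass_threshold": pass_threshold,
    }


# Illustrative records only; real records also carry task and completion.
records = [
    {"reward": 1.0, "passed": True},
    {"reward": 0.0, "passed": False},
    {"reward": 1.0, "passed": True},
    {"reward": 0.5, "passed": False},
]
summary = summarize_rewards(records)
print(summary["mean_reward"], summary["pass_rate"])
```

Lowering `pass_threshold` to 0.5 in this example raises the recomputed pass rate, since the partial-credit task then counts as a pass.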
To measure lift, run the eval twice with the same environment, seed, and num_tasks — once with policy=base for the floor, then with your trained policy — and compare mean_reward and pass_rate:
# Baseline
curl -X POST http://localhost:8080/v1/eval/rl \
-H "Content-Type: application/json" \
-d '{"family": "qwen3-4b", "policy": "base", "num_tasks": 50, "seed": 42}'
# Trained policy
curl -X POST http://localhost:8080/v1/eval/rl \
-H "Content-Type: application/json" \
-d '{
"family": "qwen3-4b",
"policy": "peft-lora",
"checkpoint_dir": "~/hybrie-mounts/d2l-artifacts/rl-1718000000/adapter",
"num_tasks": 50,
"seed": 42
}'

stimulir lab eval rl --family qwen3-4b --policy peft-lora --checkpoint-dir <dir> --tasks 50
Hypernet checkpoint evaluation (NIAH)
/v1/eval/niah
Scores a trained hypernet checkpoint (the artifact produced by Doc-to-LoRA hypernetwork training) on held-out Needle-In-A-Haystack retrieval: known facts ("needles") are buried in long filler context, and the model must retrieve them exactly. The harness runs the model twice, once with the trained checkpoint and once with no adapter (zero-LoRA baseline), so the report isolates what the training contributed. The call is synchronous.
| Parameter | Type | Description |
|---|---|---|
| family (required) | string | Base model family: qwen3-4b, qwen3-0.6b, or mistral-7b. |
| checkpoint_dir (required) | string | Path to the hypernet checkpoint produced by a D2L training job (metadata.json + model.safetensors). |
| examples | integer | Number of held-out examples to score. Default 200. |
| seed | integer | Seed for reproducible example generation. Default 1. |
| device | string | auto (default), metal, cuda, or cpu. |
| model_dir | string | Optional base-model directory override. |
The metrics are nested under a report key in the response:
| Field | Type | Description |
|---|---|---|
| n | integer | Number of held-out examples scored. |
| adapter_acc | float | Exact-match retrieval accuracy with the trained adapter applied (0–1). |
| base_acc | float | Accuracy of the frozen base model with no adapter (the floor). |
| lift | float | adapter_acc − base_acc: the improvement attributable to the training. |
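To make the relationship between these four fields concrete, here is a small sketch of exact-match scoring over two sets of predictions for the same held-out needles. It is an illustration under stated assumptions, not the harness's code: the function names are hypothetical, and stripping surrounding whitespace before comparison is an assumed normalization.

```python
def exact_match_accuracy(predictions, expected):
    """Fraction of predictions that equal the expected needle exactly."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    if not expected:
        raise ValueError("at least one example is required")
    # Assumption: surrounding whitespace is ignored when comparing.
    hits = sum(p.strip() == e.strip() for p, e in zip(predictions, expected))
    return hits / len(expected)


def niah_report(adapter_predictions, base_predictions, expected):
    """Build a report with the same field names as the table above."""
    adapter_acc = exact_match_accuracy(adapter_predictions, expected)
    base_acc = exact_match_accuracy(base_predictions, expected)
    return {
        "n": len(expected),
        "adapter_acc": adapter_acc,
        "base_acc": base_acc,
        "lift": adapter_acc - base_acc,
    }


# Illustrative needles and outputs only.
needles = ["7421", "blue", "Oslo", "42"]
with_adapter = ["7421", "blue", "Oslo", "41"]
without_adapter = ["7421", "red", "Paris", "0"]
print(niah_report(with_adapter, without_adapter, needles))
```

Both runs are scored against the same examples, which is what makes the difference between the two accuracies attributable to the adapter.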
curl -X POST http://localhost:8080/v1/eval/niah \
-H "Content-Type: application/json" \
-d '{
"family": "qwen3-4b",
"checkpoint_dir": "~/hybrie-mounts/d2l-artifacts/train-1718000000/checkpoint",
"examples": 200,
"seed": 42
}'

stimulir lab eval niah --family qwen3-4b --checkpoint-dir <dir> --examples 200 --seed 42
Automatic post-training eval
You usually don't need to call the NIAH endpoint by hand: every Doc-to-LoRA training job runs the same held-out eval automatically when it finishes, using a disjoint seed so eval examples never overlap the training data. The resulting report (adapter_acc, base_acc, lift) is embedded in the job's final run report alongside the per-step loss curve — see Training Jobs.
Audio evaluation
/v1/eval/audio
Signal-level quality gates for speech audio: peak/RMS level (dBFS), crest factor, clipping and flat-run detection, plus optional intelligibility checking, which compares a local Whisper transcript against an expected_text by word error rate. The response includes a gates object with pass/fail booleans per check.
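For readers unfamiliar with these measurements, the sketch below shows the standard textbook definitions of peak and RMS level in dBFS, crest factor, and word error rate. It is not the endpoint's implementation: the function names, the assumption that samples are floats in [-1.0, 1.0], the clipping level, and the lowercase whitespace tokenization are all illustrative choices.

```python
import math


def _dbfs(value):
    """Convert a linear amplitude (full scale = 1.0) to dBFS."""
    return 20.0 * math.log10(value) if value > 0 else float("-inf")


def level_metrics(samples, clip_level=1.0):
    """Peak/RMS level, crest factor, and clipped-sample count."""
    if not samples:
        raise ValueError("samples must not be empty")
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {
        "peak_dbfs": _dbfs(peak),
        "rms_dbfs": _dbfs(rms),
        # Crest factor is the peak-to-RMS ratio, expressed here in dB.
        "crest_factor_db": _dbfs(peak) - _dbfs(rms) if rms > 0 else None,
        "clipped_samples": sum(abs(s) >= clip_level for s in samples),
    }


def word_error_rate(expected_text, transcript):
    """Word-level edit distance divided by the reference word count."""
    ref = expected_text.lower().split()
    hyp = transcript.lower().split()
    if not ref:
        raise ValueError("expected_text must contain at least one word")
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(level_metrics([0.5, -0.5, 0.5, -0.5]))
print(word_error_rate("the quick brown fox", "the quick fox"))
```

A gate is then a comparison of one of these numbers against a threshold, for example requiring zero clipped samples or a word error rate below some chosen limit.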
What's measured today — and what isn't
Evaluation has two layers. The durable Lab route records prompt, data, endpoint, and adapter comparison runs with lineage and result rows. HybrIE's low-level benchmark endpoints remain deterministic and verifiable: exact-match retrieval for adapters and checkpoints, and reward evaluation of policies with pass-rate gates via /v1/eval/rl. The honest caveat is that niah is the only built-in reward environment today; the runtime's Environment trait is the extensibility hook for additional verifiable environments.
Use the same seed and the same examples / num_tasks count when comparing checkpoints or policies, so before/after numbers are directly comparable.
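The comparability rule above can be partly enforced in code. The sketch below compares a baseline and a trained-policy report from /v1/eval/rl and refuses to compute lift when the two reports disagree on environment, task count, or pass threshold. `compare_rl_reports` is a hypothetical helper and the sample values are illustrative; because the documented report does not echo the seed, the caller still has to ensure both runs used the same one.

```python
def compare_rl_reports(baseline, trained):
    """Compute reward and pass-rate lift between two comparable reports."""
    # These fields must match for the numbers to be directly comparable.
    for key in ("environment", "n", "pass_threshold"):
        if baseline[key] != trained[key]:
            raise ValueError(
                f"reports differ on {key!r}: "
                f"{baseline[key]!r} vs {trained[key]!r}"
            )
    return {
        "mean_reward_lift": trained["mean_reward"] - baseline["mean_reward"],
        "pass_rate_lift": trained["pass_rate"] - baseline["pass_rate"],
    }


# Illustrative values only, not measured results.
base_report = {
    "environment": "niah", "policy": "base", "n": 50,
    "mean_reward": 0.25, "pass_rate": 0.25, "pass_threshold": 1.0,
}
trained_report = {
    "environment": "niah", "policy": "peft-lora", "n": 50,
    "mean_reward": 0.75, "pass_rate": 0.75, "pass_threshold": 1.0,
}
print(compare_rl_reports(base_report, trained_report))
```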