Lab Workspace

Evaluation

Evaluate prompts, staged data assets, inference endpoints, PEFT adapters, and Doc-to-LoRA hypernet checkpoints from one durable Lab run record.

Durable eval runs

GET/api/v1/lab/evals/runs

POST/api/v1/lab/evals/runs

GET/api/v1/lab/evals/runs/{run_id}

POST/api/v1/lab/evals/runs/{run_id}/execute

Lab evaluations are first-class workspace resources. A run persists the suite, cases, candidates, pending results, execution routing, and lineage before compute starts. The same route covers prompt regression, curated data or trace snapshots, baseline model endpoints, hot-swapped adapters, and external endpoint candidates.

Parameter	Type	Description
`suite_name`required	string	Display name for the eval suite.
`source`	string	One of `data_asset`, `prompt`, `trace`, `manual`, or `mixed`.
`data_asset_id`	uuid	Optional Engineering data asset staged for `target=eval`.
`prompt_refs`	object[]	Prompt refs by `key` plus optional `version` or `label`.
`candidates`	object[]	Inference candidates: managed model, prompt version, adapter hot-swap, or external endpoint.
`baseline_provider`	string	Provider for the baseline candidate. Default hybrie.
`baseline_model`	string	Model or endpoint model for the baseline candidate.
`execute`	boolean	Create the run and immediately dispatch execution.

CLI

# Create a durable prompt/data eval run
stimulir lab eval create-run \
  --name "Prompt and data regression" \
  --data-asset-id <asset-id> \
  --prompt agent_chat_browser_research:staging \
  --model hybrie-runtime-default

# Create and execute in one call
stimulir lab eval create-run --data-asset-id <asset-id> --prompt <key>:<version> --execute

# List and execute runs
stimulir lab eval runs
stimulir lab eval get <run-id>
stimulir lab eval execute-run <run-id>

curl

curl -X POST https://api.stimulir.com/api/v1/lab/evals/runs \
  -H "Authorization: Bearer $STIMULIR_TOKEN" \
  -H "X-Business-Profile-Id: $WORKSPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "suite_name": "Prompt and data regression",
    "source": "mixed",
    "data_asset_id": "<asset-id>",
    "prompt_refs": [{"key": "agent_chat_browser_research", "label": "staging"}],
    "baseline_provider": "hybrie",
    "baseline_model": "hybrie-runtime-default",
    "execute": true
  }'

Use the durable Lab route for client prompt management, trace/data cleaning, endpoint comparison, and eval history. Use the direct HybrIE /v1/eval/* routes below for synchronous low-level adapter, policy, and hypernet benchmark checks.

PEFT adapter evaluation

POST/v1/eval/adapter

Scores a classical PEFT adapter — produced by SFT training, an RL run with policy=peft-lora, exported from the D2L pipeline, or registered externally — on held-out Needle-In-A-Haystack retrieval, against the zero-LoRA baseline. The call is synchronous.

Parameter	Type	Description
`family`required	string	Base model family: qwen3-4b, qwen3-0.6b, or mistral-7b.
`adapter_dir`required	string	Directory containing the adapter weights — `adapter_model.safetensors` + `adapter_config.json`.
`examples`	integer	Number of held-out examples to score. Default 200.
`seed`	integer	Seed for reproducible example generation. Default 1.
`device`	string	auto (default), metal, cuda, or cpu.
`model_dir`	string	Optional base-model directory override.

The metrics are nested under a report key in the response. It carries four accuracy metrics — n, adapter_acc, base_acc, lift (adapter_acc − base_acc) — plus the adapter's configuration so the score is self-describing:

Parameter	Type	Description
`r`	integer	LoRA rank of the evaluated adapter.
`lora_alpha`	integer	LoRA alpha of the evaluated adapter.
`scaling`	float	Effective scaling applied to the low-rank update: alpha / r.
`target_modules`	string[]	Modules the adapter attaches to.

curl

curl -X POST http://localhost:8080/v1/eval/adapter \
  -H "Content-Type: application/json" \
  -d '{
    "family": "qwen3-4b",
    "adapter_dir": "~/hybrie-mounts/d2l-artifacts/rl-1718000000/adapter",
    "examples": 200,
    "seed": 42
  }'

bash

stimulir lab eval adapter --family qwen3-4b --adapter-dir <dir>

Reward evaluation (policies)

POST/v1/eval/rl

Reward evaluation of a policy on a verifiable environment: the harness samples tasks from the environment, generates completions with the chosen policy, and scores each completion with the environment's reward function. The call is synchronous.

Parameter	Type	Description
`family`required	string	Base model family: qwen3-4b, qwen3-0.6b, or mistral-7b.
`environment`	string	Verifiable environment to evaluate on. Default `niah` — the only environment today.
`policy`	string	One of `base` (default), `hypernet`, or `peft-lora`.
`checkpoint_dir`	string	Required unless `policy=base` — a hypernet checkpoint for `policy=hypernet`, or a PEFT adapter directory for `policy=peft-lora`.
`num_tasks`	integer	Number of environment tasks to evaluate. Default 50.
`max_new_tokens`	integer	Generation budget per task. Default 16.
`temperature`	float	Sampling temperature. Default 0.0 (near-greedy).
`seed`	integer	Seed for reproducible task generation. Default 1.
`pass_threshold`	float	A task passes when its reward is ≥ this threshold. Default 1.0.
`device`	string	auto (default), metal, cuda, or cpu.
`model_dir`	string	Optional base-model directory override.

The report summarizes the reward distribution and per-task outcomes:

Parameter	Type	Description
`environment`	string	The environment that was evaluated.
`policy`	string	The policy that was evaluated.
`n`	integer	Number of tasks evaluated.
`mean_reward`	float	Mean reward across tasks.
`std_reward`	float	Standard deviation of reward.
`min_reward`	float	Lowest task reward.
`max_reward`	float	Highest task reward.
`pass_rate`	float	Fraction of tasks with reward ≥ `pass_threshold`.
`pass_threshold`	float	The threshold the pass rate was computed against.
`per_task`	object[]	Per-task records: `task`, `reward`, `passed`, and `completion` (truncated to 200 characters).

To measure lift, run the eval twice with the same environment, seed, and num_tasks — once with policy=base for the floor, then with your trained policy — and compare mean_reward and pass_rate:

curl

# Baseline
curl -X POST http://localhost:8080/v1/eval/rl \
  -H "Content-Type: application/json" \
  -d '{"family": "qwen3-4b", "policy": "base", "num_tasks": 50, "seed": 42}'

# Trained policy
curl -X POST http://localhost:8080/v1/eval/rl \
  -H "Content-Type: application/json" \
  -d '{
    "family": "qwen3-4b",
    "policy": "peft-lora",
    "checkpoint_dir": "~/hybrie-mounts/d2l-artifacts/rl-1718000000/adapter",
    "num_tasks": 50,
    "seed": 42
  }'

bash

stimulir lab eval rl --family qwen3-4b --policy peft-lora --checkpoint-dir <dir> --tasks 50

Hypernet checkpoint evaluation (NIAH)

POST/v1/eval/niah

Scores a trained hypernet checkpoint — the artifact produced by Doc-to-LoRA hypernetwork training — on held-out Needle-In-A-Haystack retrieval: known facts ("needles") are buried in long filler context, and the model must retrieve them exactly. The harness runs the model twice — with the trained checkpoint and with no adapter (zero-LoRA baseline) — so the report isolates what the training actually contributed. The call is synchronous.

Parameter	Type	Description
`family`required	string	Base model family: qwen3-4b, qwen3-0.6b, or mistral-7b.
`checkpoint_dir`required	string	Path to the hypernet checkpoint produced by a D2L training job (metadata.json + model.safetensors).
`examples`	integer	Number of held-out examples to score. Default 200.
`seed`	integer	Seed for reproducible example generation. Default 1.
`device`	string	auto (default), metal, cuda, or cpu.
`model_dir`	string	Optional base-model directory override.

The metrics are nested under a report key in the response:

Parameter	Type	Description
`n`	integer	Number of held-out examples scored.
`adapter_acc`	float	Exact-match retrieval accuracy with the trained adapter applied (0–1).
`base_acc`	float	Accuracy of the frozen base model with no adapter — the floor.
`lift`	float	adapter_acc − base_acc: the improvement attributable to the training.

curl

curl -X POST http://localhost:8080/v1/eval/niah \
  -H "Content-Type: application/json" \
  -d '{
    "family": "qwen3-4b",
    "checkpoint_dir": "~/hybrie-mounts/d2l-artifacts/train-1718000000/checkpoint",
    "examples": 200,
    "seed": 42
  }'

bash

stimulir lab eval niah --family qwen3-4b --checkpoint-dir <dir> --examples 200 --seed 42

Automatic post-training eval

You usually don't need to call the NIAH endpoint by hand: every Doc-to-LoRA training job runs the same held-out eval automatically when it finishes, using a disjoint seed so eval examples never overlap the training data. The resulting report (adapter_acc, base_acc, lift) is embedded in the job's final run report alongside the per-step loss curve — see Training Jobs.

Audio evaluation

POST/v1/eval/audio

Signal-level quality gates for speech audio: peak/RMS level (dBFS), crest factor, clipping and flat-run detection, plus optional intelligibility checking — a local Whisper transcript and word-error-rate against an expected_text. The response includes a gates object with pass/fail booleans per check.

What's measured today — and what isn't

Evaluation has two layers. The durable Lab route records prompt, data, endpoint, and adapter comparison runs with lineage and result rows. HybrIE's low-level benchmark endpoints remain deterministic and verifiable: exact-match retrieval for adapters and checkpoints, and reward evaluation of policies with pass-rate gates via /v1/eval/rl. The honest caveat is that niah is the only built-in reward environment today; the runtime's Environment trait is the extensibility hook for additional verifiable environments.

Use a fixed seed and the same examples / num_tasks count when comparing checkpoints or policies, so before/after numbers are directly comparable.

Durable eval runs#

PEFT adapter evaluation#

Reward evaluation (policies)#

Hypernet checkpoint evaluation (NIAH)#

Automatic post-training eval#

Audio evaluation#

What's measured today — and what isn't#

Durable eval runs

PEFT adapter evaluation

Reward evaluation (policies)

Hypernet checkpoint evaluation (NIAH)

Automatic post-training eval

Audio evaluation

What's measured today — and what isn't