Lab Workspace

Evaluation

Evaluate prompts, staged data assets, inference endpoints, PEFT adapters, and Doc-to-LoRA hypernet checkpoints from one durable Lab run record.

Durable eval runs

GET /api/v1/lab/evals/runs
POST /api/v1/lab/evals/runs
GET /api/v1/lab/evals/runs/{run_id}
POST /api/v1/lab/evals/runs/{run_id}/execute

Lab evaluations are first-class workspace resources. A run persists the suite, cases, candidates, pending results, execution routing, and lineage before compute starts. The same route covers prompt regression, curated data or trace snapshots, baseline model endpoints, hot-swapped adapters, and external endpoint candidates.

Parameter | Type | Description
suite_name (required) | string | Display name for the eval suite.
source | string | One of data_asset, prompt, trace, manual, or mixed.
data_asset_id | uuid | Optional Engineering data asset staged for target=eval.
prompt_refs | object[] | Prompt refs by key plus optional version or label.
candidates | object[] | Inference candidates: managed model, prompt version, adapter hot-swap, or external endpoint.
baseline_provider | string | Provider for the baseline candidate. Default hybrie.
baseline_model | string | Model or endpoint model for the baseline candidate.
execute | boolean | Create the run and immediately dispatch execution.
CLI
# Create a durable prompt/data eval run
stimulir lab eval create-run \
  --name "Prompt and data regression" \
  --data-asset-id <asset-id> \
  --prompt agent_chat_browser_research:staging \
  --model hybrie-runtime-default

# Create and execute in one call
stimulir lab eval create-run --data-asset-id <asset-id> --prompt <key>:<version> --execute

# List, inspect, and execute runs
stimulir lab eval runs
stimulir lab eval get <run-id>
stimulir lab eval execute-run <run-id>
curl
curl -X POST https://api.stimulir.com/api/v1/lab/evals/runs \
  -H "Authorization: Bearer $STIMULIR_TOKEN" \
  -H "X-Business-Profile-Id: $WORKSPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "suite_name": "Prompt and data regression",
    "source": "mixed",
    "data_asset_id": "<asset-id>",
    "prompt_refs": [{"key": "agent_chat_browser_research", "label": "staging"}],
    "baseline_provider": "hybrie",
    "baseline_model": "hybrie-runtime-default",
    "execute": true
  }'
Use the durable Lab route for client prompt management, trace/data cleaning, endpoint comparison, and eval history. Use the direct HybrIE /v1/eval/* routes below for synchronous low-level adapter, policy, and hypernet benchmark checks.
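
The create-run request shown in the curl example can also be assembled from a script. The following is a minimal Python sketch, not an official client: the helper name build_create_run_request and the source check are assumptions made for illustration, while the URL, headers, and field names are taken from the example and parameter table above. It builds the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.stimulir.com/api/v1"
# Allowed values for "source", copied from the parameter table.
ALLOWED_SOURCES = {"data_asset", "prompt", "trace", "manual", "mixed"}


def build_create_run_request(token, workspace_id, suite_name, source,
                             data_asset_id=None, prompt_refs=None,
                             baseline_model=None, execute=False):
    """Build (but do not send) a POST /lab/evals/runs request.

    Hypothetical helper: the client-side validation is this sketch's
    choice, not documented server behavior.
    """
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"source must be one of {sorted(ALLOWED_SOURCES)}")
    payload = {"suite_name": suite_name, "source": source, "execute": execute}
    # Optional fields are omitted rather than sent as null.
    if data_asset_id is not None:
        payload["data_asset_id"] = data_asset_id
    if prompt_refs:
        payload["prompt_refs"] = prompt_refs
    if baseline_model is not None:
        payload["baseline_model"] = baseline_model
    return urllib.request.Request(
        f"{API_BASE}/lab/evals/runs",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "X-Business-Profile-Id": workspace_id,
            "Content-Type": "application/json",
        },
    )


request = build_create_run_request(
    token="example-token",            # placeholder, not a real credential
    workspace_id="example-workspace",  # placeholder workspace id
    suite_name="Prompt and data regression",
    source="mixed",
    data_asset_id="<asset-id>",
    prompt_refs=[{"key": "agent_chat_browser_research", "label": "staging"}],
    baseline_model="hybrie-runtime-default",
    execute=True,
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Keeping construction separate from sending makes the payload easy to inspect or log before any compute is dispatched.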

PEFT adapter evaluation

POST /v1/eval/adapter

Scores a classical PEFT adapter on held-out Needle-In-A-Haystack retrieval against the zero-LoRA baseline. The adapter can come from SFT training, an RL run with policy=peft-lora, an export from the D2L pipeline, or external registration. The call is synchronous.

Parameter | Type | Description
family (required) | string | Base model family: qwen3-4b, qwen3-0.6b, or mistral-7b.
adapter_dir (required) | string | Directory containing the adapter weights: adapter_model.safetensors + adapter_config.json.
examples | integer | Number of held-out examples to score. Default 200.
seed | integer | Seed for reproducible example generation. Default 1.
device | string | auto (default), metal, cuda, or cpu.
model_dir | string | Optional base-model directory override.

The metrics are nested under a report key in the response. The report carries the sample count n and three accuracy metrics, adapter_acc, base_acc, and lift (adapter_acc − base_acc), plus the adapter's configuration so the score is self-describing:

Parameter | Type | Description
r | integer | LoRA rank of the evaluated adapter.
lora_alpha | integer | LoRA alpha of the evaluated adapter.
scaling | float | Effective scaling applied to the low-rank update: alpha / r.
target_modules | string[] | Modules the adapter attaches to.
curl
curl -X POST http://localhost:8080/v1/eval/adapter \
  -H "Content-Type: application/json" \
  -d '{
    "family": "qwen3-4b",
    "adapter_dir": "~/hybrie-mounts/d2l-artifacts/rl-1718000000/adapter",
    "examples": 200,
    "seed": 42
  }'
bash
stimulir lab eval adapter --family qwen3-4b --adapter-dir <dir>
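
The two derived fields in the adapter report, lift and scaling, follow directly from the other fields, so a client can sanity-check a report after receiving it. The sketch below is illustrative only: check_adapter_report is a hypothetical helper, the example values (including the target_modules names) are made up rather than real eval output, and it assumes the configuration fields sit next to the accuracy fields inside report.

```python
import math


def check_adapter_report(report, tol=1e-6):
    """Return a list of inconsistencies found in an adapter eval report."""
    problems = []
    # lift is documented as adapter_acc - base_acc.
    expected_lift = report["adapter_acc"] - report["base_acc"]
    if not math.isclose(report["lift"], expected_lift, abs_tol=tol):
        problems.append(f"lift {report['lift']} != {expected_lift}")
    # scaling is documented as alpha / r.
    expected_scaling = report["lora_alpha"] / report["r"]
    if not math.isclose(report["scaling"], expected_scaling, abs_tol=tol):
        problems.append(f"scaling {report['scaling']} != {expected_scaling}")
    return problems


# Hypothetical values for illustration only.
example_report = {
    "n": 200,
    "adapter_acc": 0.75,
    "base_acc": 0.25,
    "lift": 0.5,
    "r": 16,
    "lora_alpha": 32,
    "scaling": 2.0,
    "target_modules": ["q_proj", "v_proj"],
}

problems = check_adapter_report(example_report)
```

An empty list means the report is internally consistent; it says nothing about whether the lift is large enough to ship.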

Reward evaluation (policies)

POST /v1/eval/rl

Reward evaluation of a policy on a verifiable environment: the harness samples tasks from the environment, generates completions with the chosen policy, and scores each completion with the environment's reward function. The call is synchronous.

Parameter | Type | Description
family (required) | string | Base model family: qwen3-4b, qwen3-0.6b, or mistral-7b.
environment | string | Verifiable environment to evaluate on. Default niah, the only environment today.
policy | string | One of base (default), hypernet, or peft-lora.
checkpoint_dir | string | Required unless policy=base: a hypernet checkpoint for policy=hypernet, or a PEFT adapter directory for policy=peft-lora.
num_tasks | integer | Number of environment tasks to evaluate. Default 50.
max_new_tokens | integer | Generation budget per task. Default 16.
temperature | float | Sampling temperature. Default 0.0 (near-greedy).
seed | integer | Seed for reproducible task generation. Default 1.
pass_threshold | float | A task passes when its reward is ≥ this threshold. Default 1.0.
device | string | auto (default), metal, cuda, or cpu.
model_dir | string | Optional base-model directory override.

The report summarizes the reward distribution and per-task outcomes:

Parameter | Type | Description
environment | string | The environment that was evaluated.
policy | string | The policy that was evaluated.
n | integer | Number of tasks evaluated.
mean_reward | float | Mean reward across tasks.
std_reward | float | Standard deviation of reward.
min_reward | float | Lowest task reward.
max_reward | float | Highest task reward.
pass_rate | float | Fraction of tasks with reward ≥ pass_threshold.
pass_threshold | float | The threshold the pass rate was computed against.
per_task | object[] | Per-task records: task, reward, passed, and completion (truncated to 200 characters).

To measure lift, run the eval twice with the same environment, seed, and num_tasks — once with policy=base for the floor, then with your trained policy — and compare mean_reward and pass_rate:

curl
# Baseline
curl -X POST http://localhost:8080/v1/eval/rl \
  -H "Content-Type: application/json" \
  -d '{"family": "qwen3-4b", "policy": "base", "num_tasks": 50, "seed": 42}'

# Trained policy
curl -X POST http://localhost:8080/v1/eval/rl \
  -H "Content-Type: application/json" \
  -d '{
    "family": "qwen3-4b",
    "policy": "peft-lora",
    "checkpoint_dir": "~/hybrie-mounts/d2l-artifacts/rl-1718000000/adapter",
    "num_tasks": 50,
    "seed": 42
  }'
bash
stimulir lab eval rl --family qwen3-4b --policy peft-lora --checkpoint-dir <dir> --tasks 50
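
Once both reports are in hand, the comparison is simple arithmetic. The following Python sketch shows one way to do it; pass_rate and compare_reports are hypothetical helpers, and the report dictionaries hold made-up numbers shaped like the fields in the table above, not real results.

```python
def pass_rate(per_task, pass_threshold):
    """Fraction of tasks whose reward is >= pass_threshold."""
    if not per_task:
        return 0.0
    passed = sum(1 for task in per_task if task["reward"] >= pass_threshold)
    return passed / len(per_task)


def compare_reports(base, trained):
    """Lift of a trained policy over the base policy.

    Refuses to compare reports whose settings differ, since the
    difference would then not be attributable to the policy.
    """
    for key in ("environment", "n", "pass_threshold"):
        if base[key] != trained[key]:
            raise ValueError(f"reports differ on {key!r}; not comparable")
    return {
        "mean_reward_lift": trained["mean_reward"] - base["mean_reward"],
        "pass_rate_lift": trained["pass_rate"] - base["pass_rate"],
    }


# Made-up reports for illustration only.
base_report = {"environment": "niah", "policy": "base", "n": 50,
               "pass_threshold": 1.0, "mean_reward": 0.25, "pass_rate": 0.25}
trained_report = {"environment": "niah", "policy": "peft-lora", "n": 50,
                  "pass_threshold": 1.0, "mean_reward": 0.75, "pass_rate": 0.75}

lift = compare_reports(base_report, trained_report)
```

The report fields listed above do not include the seed, so the check cannot verify it; keeping the seed identical across the two requests remains the caller's responsibility.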

Hypernet checkpoint evaluation (NIAH)

POST /v1/eval/niah

Scores a trained hypernet checkpoint — the artifact produced by Doc-to-LoRA hypernetwork training — on held-out Needle-In-A-Haystack retrieval: known facts ("needles") are buried in long filler context, and the model must retrieve them exactly. The harness runs the model twice — with the trained checkpoint and with no adapter (zero-LoRA baseline) — so the report isolates what the training actually contributed. The call is synchronous.

Parameter | Type | Description
family (required) | string | Base model family: qwen3-4b, qwen3-0.6b, or mistral-7b.
checkpoint_dir (required) | string | Path to the hypernet checkpoint produced by a D2L training job (metadata.json + model.safetensors).
examples | integer | Number of held-out examples to score. Default 200.
seed | integer | Seed for reproducible example generation. Default 1.
device | string | auto (default), metal, cuda, or cpu.
model_dir | string | Optional base-model directory override.

The metrics are nested under a report key in the response:

Parameter | Type | Description
n | integer | Number of held-out examples scored.
adapter_acc | float | Exact-match retrieval accuracy with the trained adapter applied (0–1).
base_acc | float | Accuracy of the frozen base model with no adapter (the floor).
lift | float | adapter_acc − base_acc: the improvement attributable to the training.
curl
curl -X POST http://localhost:8080/v1/eval/niah \
  -H "Content-Type: application/json" \
  -d '{
    "family": "qwen3-4b",
    "checkpoint_dir": "~/hybrie-mounts/d2l-artifacts/train-1718000000/checkpoint",
    "examples": 200,
    "seed": 42
  }'
bash
stimulir lab eval niah --family qwen3-4b --checkpoint-dir <dir> --examples 200 --seed 42
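
To make the Needle-In-A-Haystack idea concrete, here is a toy Python sketch of seeded example generation and exact-match scoring. It is not the harness's generator: the filler sentence, needle wording, and matching rule (whitespace-stripped string equality) are all assumptions chosen for illustration.

```python
import random

# Arbitrary filler; the real harness's filler text is not documented here.
FILLER = "The sky was clear and the market was quiet that day."


def make_niah_example(seed, filler_sentences=50):
    """Bury one key/value 'needle' at a seeded random position in filler."""
    rng = random.Random(seed)
    key = f"code-{rng.randrange(1000):03d}"
    value = f"{rng.randrange(10**6):06d}"
    needle = f"The secret value for {key} is {value}."
    sentences = [FILLER] * filler_sentences
    sentences.insert(rng.randrange(filler_sentences + 1), needle)
    return {
        "context": " ".join(sentences),
        "question": f"What is the secret value for {key}?",
        "answer": value,
    }


def accuracy(predictions, examples):
    """Exact-match accuracy: a prediction counts only if it equals the answer."""
    hits = sum(pred.strip() == ex["answer"]
               for pred, ex in zip(predictions, examples))
    return hits / len(examples)


examples = [make_niah_example(seed) for seed in range(4)]
# Stand-ins for model outputs: a perfect "adapter" and a base model that fails.
adapter_acc = accuracy([ex["answer"] for ex in examples], examples)
base_acc = accuracy(["unknown"] * len(examples), examples)
lift = adapter_acc - base_acc
```

Because every random choice flows from the seed, the same seed always yields the same examples, which is what makes before/after scores comparable.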

Automatic post-training eval

You usually don't need to call the NIAH endpoint by hand: every Doc-to-LoRA training job runs the same held-out eval automatically when it finishes, using a disjoint seed so eval examples never overlap the training data. The resulting report (adapter_acc, base_acc, lift) is embedded in the job's final run report alongside the per-step loss curve — see Training Jobs.

Audio evaluation

POST /v1/eval/audio

Signal-level quality gates for speech audio: peak/RMS level (dBFS), crest factor, clipping and flat-run detection, plus optional intelligibility checking — a local Whisper transcript and word-error-rate against an expected_text. The response includes a gates object with pass/fail booleans per check.
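
For readers unfamiliar with these measurements, the sketch below computes the textbook versions in plain Python: peak and RMS level in dBFS, crest factor, a clipped-sample count, and word error rate. The function names, the clip_level of 0.999, and the lowercase/whitespace text normalization are assumptions; the endpoint's actual thresholds and normalization are not specified here.

```python
import math


def to_dbfs(amplitude):
    """Convert a linear amplitude (full scale = 1.0) to dBFS."""
    return 20 * math.log10(amplitude) if amplitude > 0 else float("-inf")


def level_stats(samples, clip_level=0.999):
    """Peak/RMS in dBFS, crest factor in dB, and clipped-sample count.

    Expects float samples in [-1.0, 1.0].
    """
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {
        "peak_dbfs": to_dbfs(peak),
        "rms_dbfs": to_dbfs(rms),
        # Crest factor is peak/RMS, i.e. a difference in dB; undefined for silence.
        "crest_db": to_dbfs(peak) - to_dbfs(rms) if rms > 0 else None,
        "clipped": sum(1 for s in samples if abs(s) >= clip_level),
    }


def word_error_rate(expected_text, transcript):
    """Word-level edit distance divided by the number of expected words."""
    ref = expected_text.lower().split()
    hyp = transcript.lower().split()
    if not ref:
        raise ValueError("expected_text must contain at least one word")
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


stats = level_stats([0.5, -0.5, 0.5, -0.5])
wer = word_error_rate("the quick brown fox", "the quick fox")
```

A square-like signal at half scale has equal peak and RMS, so its crest factor is 0 dB; dropping one of four expected words gives a word error rate of 0.25.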

What's measured today — and what isn't

Evaluation has two layers. The durable Lab route records prompt, data, endpoint, and adapter comparison runs with lineage and result rows. HybrIE's low-level benchmark endpoints remain deterministic and verifiable: exact-match retrieval for adapters and checkpoints, and reward evaluation of policies with pass-rate gates via /v1/eval/rl. The honest caveat is that niah is the only built-in reward environment today; the runtime's Environment trait is the extensibility hook for additional verifiable environments.

Use a fixed seed and the same examples / num_tasks count when comparing checkpoints or policies, so before/after numbers are directly comparable.