Model Qualification Readiness

Daily readiness snapshot for active Foundry/chat models. Use it alongside HEA Spends Viewer: this page explains model readiness, not turn-based NCU or visible Q/A usage.

Overall Status

Gate Ready

Active Targets Passing

Active Targets

Surface	Model	Status	Profile

Surface Summary

Surface	Status	Models	Passed	Failed

Model Readiness Trend (Last N Runs)

Overall readiness run by run. Use the model filter above to inspect one candidate across time. Cost is shown once per model because it is a fixture-suite property, not a run-by-run signal; the tooltip keeps the normalized cost-index details. The cost index is derived from token usage observed on the qualification fixtures; 100 means the same estimated cost as the surface reference model.

Legend: Qualified | Rejected/Not ready | Missing/Unknown

Surface	Model	Cost	Last N Runs (latest → older)

Per-Check Trend (Last N Runs)

Same run history, but broken down by threshold check for each surface/model. This replaces the older blockers-only view with the full pass/warn/fail/missing picture.

Legend: Check passed | Warning only | Blocking failure | Missing/No data

Surface	Model	Check	Last N Runs (latest → older)

Recent History

Generated	Run ID	Status	Failed/Total

Metrics Lexicon

How qualification is tested: each run executes fixed fixtures from scripts/evals/fixtures and aggregates metrics per surface/model. Pass/fail is evaluated against thresholds in config/model_qualification_thresholds.json.
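The threshold evaluation described above can be sketched as follows. This is a minimal illustration, not the repo's implementation: the threshold layout (a comparison direction, a limit, and a severity per metric), the function names, and the rule that missing data yields an "unknown" overall status are all assumptions. The actual schema of config/model_qualification_thresholds.json may differ.

```python
from typing import Dict, Optional, Tuple

# Hypothetical threshold layout: metric -> (direction, limit, severity).
# direction is "min" (value must be >= limit) or "max" (value must be <= limit).
# severity is "block" (failure blocks the gate) or "warn" (warning only).
Threshold = Tuple[str, float, str]


def evaluate_checks(metrics: Dict[str, Optional[float]],
                    thresholds: Dict[str, Threshold]) -> Dict[str, str]:
    """Return one status per check: pass, warn, fail, or missing."""
    results: Dict[str, str] = {}
    for name, (direction, limit, severity) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            results[name] = "missing"
            continue
        ok = value >= limit if direction == "min" else value <= limit
        if ok:
            results[name] = "pass"
        else:
            results[name] = "fail" if severity == "block" else "warn"
    return results


def overall_status(results: Dict[str, str]) -> str:
    """Assumed roll-up: any blocking failure rejects; missing data is unknown."""
    statuses = set(results.values())
    if "fail" in statuses:
        return "rejected"
    if "missing" in statuses:
        return "unknown"
    return "qualified"
```

Under these assumptions a warning-only check (for example a headroom check) never changes the overall status, which matches the pass/warn/fail/missing legend used in the per-check trend.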

Foundry setup: runs stress-sized chunk-pack extraction cases (context expanded up to the safe character limit), then computes extraction reliability and factual alignment.

Chat setup: runs strict current-limit guardrail fixtures (chat_runtime) and separate headroom fixtures (chat_runtime_headroom). Headroom checks are warning-only by design. Helper surfaces can also appear here, such as bounded grounded-evidence bullets (chat_ge_bullets) or bounded semantic helper calls (chat_semantic_helper).

Cost index: estimated from observed prompt/completion tokens on the qualification fixtures, priced with the repo pricing tables. 100 means “same estimated model cost as the surface reference model for that run”. Foundry judge calls and fixed pipeline costs outside the candidate model are intentionally excluded.
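As a worked illustration of the cost index, the sketch below prices observed tokens and normalizes against the reference model. The pricing-table shape (USD per million prompt and completion tokens) and the function names are assumptions made for the example; the repo pricing tables may be structured differently.

```python
from typing import Dict, Tuple

# Hypothetical pricing table:
# model -> (USD per 1M prompt tokens, USD per 1M completion tokens).
Pricing = Dict[str, Tuple[float, float]]


def estimated_cost(model: str, prompt_tokens: int, completion_tokens: int,
                   pricing: Pricing) -> float:
    """Estimated USD cost of the candidate model's own calls on the fixtures."""
    prompt_price, completion_price = pricing[model]
    return (prompt_tokens * prompt_price
            + completion_tokens * completion_price) / 1_000_000


def cost_index(candidate_cost: float, reference_cost: float) -> float:
    """Normalize so that 100 means the same cost as the reference model."""
    if reference_cost <= 0:
        raise ValueError("reference cost must be positive")
    return 100.0 * (candidate_cost / reference_cost)
```

With this normalization, a candidate that is twice as expensive as the reference on the same fixtures gets an index of about 200, and one that costs half as much gets about 50.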

Stress limits: Foundry uses foundry_safe_input_char_limit; Chat uses explicit caps from the readiness payload: chat_history_trim_max_tokens_design, chat_history_trim_safe_margin_tokens_design, chat_doc_summary_max_chars_design, chat_cta_max_items_design, and chat_prompt_input_token_limit_design.

Foundry Quality Metrics

fact_precision: blended precision used for the gate = average(heuristic precision, judge precision) when a judge score is available; otherwise heuristic precision only.
judge_precision: LLM-judge precision = predicted facts the judge deems supported by expected facts / total predicted facts.
judge_coverage: expected facts coverage = expected facts matched by judge / total expected facts.
stability: repeatability score (0..1) from repeated identical fixtures; combines status consistency, output-signature consistency, and quality variance across repeats.
empty: empty-response rate = cases with empty model content / total cases.
json_fail: JSON parse fail rate = cases where strict JSON extraction fails / total cases.
retry_exh: retry exhausted rate = cases that still fail after retry policy / total cases.
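The definitions above reduce to one blending rule and a set of simple ratios. A minimal sketch follows; the helper names are hypothetical, and returning 0.0 for an empty denominator is an assumption, since the lexicon does not say how empty runs are handled.

```python
from typing import Optional


def ratio(numerator: int, denominator: int) -> float:
    """Generic rate, e.g. empty cases / total cases.

    Assumption: an empty denominator yields 0.0 rather than an error.
    """
    return numerator / denominator if denominator else 0.0


def fact_precision(heuristic_precision: float,
                   judge_precision: Optional[float]) -> float:
    """Blend heuristic and judge precision; fall back to heuristic alone."""
    if judge_precision is None:
        return heuristic_precision
    return (heuristic_precision + judge_precision) / 2.0
```

For example, judge_precision would be ratio(supported_predicted_facts, predicted_facts) and judge_coverage would be ratio(matched_expected_facts, expected_facts); the same helper covers empty, json_fail, and retry_exh.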

Reliability/Latency Metrics

p95_ms: 95th percentile end-to-end latency across fixture cases.
infra: aggregate infrastructure error rate (timeouts, 429, 5xx, transport).
timeout: timeout error rate.
rate_limit: HTTP 429/rate-limit error rate.
5xx: upstream service 5xx rate.
transport: transport/network error rate (DNS/reset/refused/unreachable).
length_finish: rate of responses ending with finish_reason=length.
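The latency and error-rate metrics above can be sketched as below. The percentile method is not specified on this page, so nearest-rank is used here as an assumption; the error-kind labels mirror the metric names, and the per-case input shape is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

INFRA_KINDS = ("timeout", "rate_limit", "5xx", "transport")


def p95_nearest_rank(latencies_ms: List[float]) -> float:
    """95th percentile by the nearest-rank method (assumed, not confirmed)."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def error_rates(case_errors: List[Optional[str]]) -> Dict[str, float]:
    """case_errors holds one entry per fixture case: an error kind or None."""
    total = len(case_errors)
    if total == 0:
        raise ValueError("no cases")
    counts = Counter(kind for kind in case_errors if kind is not None)
    rates = {kind: counts[kind] / total for kind in INFRA_KINDS}
    # infra aggregates all four infrastructure error kinds.
    rates["infra"] = sum(counts[kind] for kind in INFRA_KINDS) / total
    return rates
```

Because each case carries at most one error kind in this sketch, infra equals the sum of the four individual rates.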

Chat Quality Metrics

grounded: keyword coverage score against expected grounded terms.
halluc_rate: cases containing forbidden/out-of-context signals / total cases.
budget_overrun: cases exceeding configured prompt-token budget / total cases (warn-level by default, non-blocking unless configured in pass criteria).
limit_case_reached: share of chat stress fixtures that actually hit the configured limit targets (docs/history/CTA/prompt-size).
stress_grounded / stress_halluc: same quality metrics but computed only on is_limit_case=true fixtures (worst-case validation path).
stability: repeatability score (0..1) across repeated identical prompts; lower score means more run-to-run drift.
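A minimal sketch of the grounded and halluc_rate calculations, including the stress variants restricted to is_limit_case=true fixtures. Case-insensitive substring matching and the case dictionary keys other than is_limit_case are assumptions for illustration; the real matcher may tokenize or normalize differently.

```python
from typing import Dict, List


def grounded_score(answer: str, expected_terms: List[str]) -> float:
    """Share of expected grounded terms that appear in the answer."""
    if not expected_terms:
        return 1.0  # assumption: nothing expected counts as fully covered
    text = answer.lower()
    hits = sum(1 for term in expected_terms if term.lower() in text)
    return hits / len(expected_terms)


def halluc_rate(cases: List[Dict], limit_only: bool = False) -> float:
    """Cases containing a forbidden term / total cases.

    With limit_only=True, only is_limit_case fixtures are counted,
    which corresponds to stress_halluc.
    """
    selected = [c for c in cases if c["is_limit_case"]] if limit_only else cases
    if not selected:
        return 0.0
    flagged = sum(
        1 for c in selected
        if any(term.lower() in c["answer"].lower() for term in c["forbidden_terms"])
    )
    return flagged / len(selected)
```

Computing the stress variants from the same function with a filter keeps the normal and worst-case numbers directly comparable.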

Helper Surface Metrics

label_accuracy: exact-match score for bounded semantic helper outputs against the fixture expectation.
false_positive: rate at which a helper returns a decisive/actionable result on fixtures that were intentionally ambiguous or should stay empty.
quote_presence / quote_exact: for GE bullets, whether returned bullets include quoted fragments and whether those quoted fragments are actually present in the supplied excerpt.
empty_expected_non_empty: rate at which GE bullet extraction returns no bullets on fixtures that should have grounded evidence.
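For the GE bullet checks, one plausible reading of quote_presence and quote_exact is sketched below. It assumes quotes are delimited by straight double quotes, that quote_presence is counted per bullet, and that quote_exact is counted per quoted fragment; none of these details are stated on this page.

```python
import re
from typing import List

# Assumption: quoted fragments use straight double quotes.
QUOTE_RE = re.compile(r'"([^"]+)"')


def quoted_fragments(bullet: str) -> List[str]:
    """Extract every quoted fragment from one bullet."""
    return QUOTE_RE.findall(bullet)


def quote_presence(bullets: List[str]) -> float:
    """Share of bullets that contain at least one quoted fragment."""
    if not bullets:
        return 0.0
    return sum(1 for b in bullets if quoted_fragments(b)) / len(bullets)


def quote_exact(bullets: List[str], excerpt: str) -> float:
    """Share of quoted fragments found verbatim in the supplied excerpt."""
    fragments = [f for b in bullets for f in quoted_fragments(b)]
    if not fragments:
        return 0.0
    return sum(1 for f in fragments if f in excerpt) / len(fragments)
```

The verbatim substring test is deliberately strict: a paraphrased quote counts as a miss, which is the failure this check is meant to surface.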