Daily readiness snapshot for active Foundry/chat models. Use it alongside HEA Spends Viewer: this page explains model readiness, not turn-based NCU or visible Q/A usage.
Overall readiness run by run. Use the model filter above to inspect one candidate across time. Cost is shown once per model because it is a fixture-suite property, not a run-by-run signal; the tooltip keeps the normalized cost-index details. The cost index is derived from the tokens observed during qualification, with 100 meaning the same cost as the surface reference model.
Legend: Qualified | Rejected/Not ready | Missing/Unknown
Surface | Model | Cost | Last N Runs (latest → older)
Per-Check Trend (Last N Runs)
Same run history, but broken down by threshold check for each surface/model. This replaces the older blockers-only view with the full pass/warn/fail/missing picture.
Legend: Check passed | Warning only | Blocking failure | Missing/No data
Surface | Model | Check | Last N Runs (latest → older)
Recent History
Generated | Run ID | Status | Failed/Total
Metrics Lexicon
How qualification is tested: each run executes fixed fixtures from scripts/evals/fixtures and aggregates metrics per surface/model. Pass/fail is evaluated against thresholds in config/model_qualification_thresholds.json.
Foundry setup: runs stress-sized chunk-pack extraction cases (context expanded up to the safe character limit), then computes extraction reliability and factual alignment.
Chat setup: runs strict current-limit guardrail fixtures (chat_runtime) and separate headroom fixtures (chat_runtime_headroom). Headroom checks are warning-only by design. Helper surfaces can also appear here, such as bounded grounded-evidence bullets (chat_ge_bullets) or bounded semantic helper calls (chat_semantic_helper).
Cost index: estimated from observed prompt/completion tokens on the qualification fixtures, priced with the repo pricing tables. 100 means “same estimated model cost as the surface reference model for that run”. Foundry judge calls and fixed pipeline costs outside the candidate model are intentionally excluded.
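The cost-index definition above can be illustrated with a minimal sketch. The `Price` structure, the per-million-token pricing unit, and the function names are assumptions made for illustration; the actual repo pricing tables and their layout are not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    # Hypothetical pricing entry: USD per one million tokens.
    prompt_per_mtok: float
    completion_per_mtok: float


def estimated_cost(prompt_tokens: int, completion_tokens: int, price: Price) -> float:
    """Estimated cost of the observed qualification tokens for one model."""
    return (
        prompt_tokens * price.prompt_per_mtok
        + completion_tokens * price.completion_per_mtok
    ) / 1_000_000


def cost_index(candidate_cost: float, reference_cost: float) -> Optional[float]:
    """100 means the candidate costs the same as the surface reference model."""
    if reference_cost <= 0:
        return None  # assumption: no index without a usable reference cost
    return 100.0 * candidate_cost / reference_cost
```

Under these assumptions, a candidate that uses more tokens than the reference but is priced lower per token can still land well below 100, which is why the index is computed from observed usage rather than from list prices alone.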
Stress limits: Foundry uses foundry_safe_input_char_limit; Chat uses explicit caps from the readiness payload: chat_history_trim_max_tokens_design, chat_history_trim_safe_margin_tokens_design, chat_doc_summary_max_chars_design, chat_cta_max_items_design, and chat_prompt_input_token_limit_design.
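The pass/warn/fail/missing statuses used on this page follow from comparing each aggregated metric with its threshold. The sketch below shows one way that comparison could work. The `direction` and `blocking` parameters and the roll-up rule in `run_status` are assumptions; the real schema of config/model_qualification_thresholds.json is not documented here.

```python
from typing import Optional, Sequence


def check_status(
    value: Optional[float],
    limit: float,
    direction: str,
    blocking: bool = True,
) -> str:
    """Classify one threshold check.

    direction "min": value must be >= limit (e.g. a quality score).
    direction "max": value must be <= limit (e.g. an error rate).
    Non-blocking checks (such as headroom checks) degrade to "warn".
    """
    if value is None:
        return "missing"
    if direction == "min":
        ok = value >= limit
    elif direction == "max":
        ok = value <= limit
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    if ok:
        return "pass"
    return "fail" if blocking else "warn"


def run_status(statuses: Sequence[str]) -> str:
    """Hypothetical roll-up of per-check statuses into a run-level status."""
    if "fail" in statuses:
        return "rejected"
    if not statuses or "missing" in statuses:
        return "unknown"
    return "qualified"  # warnings alone do not block
```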
Foundry Quality Metrics
fact_precision: blended precision used for the gate = average(heuristic precision, judge precision) when the judge is available; otherwise heuristic precision only.
judge_coverage: expected-fact coverage = expected facts matched by the judge / total expected facts.
stability: repeatability score (0..1) from repeated identical fixtures; combines status consistency, output-signature consistency, and quality variance across repeats.
empty: empty-response rate = cases with empty model content / total cases.
json_fail: JSON parse fail rate = cases where strict JSON extraction fails / total cases.
retry_exh: retry exhausted rate = cases that still fail after retry policy / total cases.
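The Foundry definitions above reduce to one blend and several simple ratios. A minimal sketch follows; the function names and the handling of empty inputs are assumptions, not the dashboard's actual code.

```python
from typing import Optional, Sequence


def fact_precision(heuristic: float, judge: Optional[float]) -> float:
    """Average of heuristic and judge precision; heuristic only without a judge."""
    if judge is None:
        return heuristic
    return (heuristic + judge) / 2.0


def judge_coverage(matched: int, total_expected: int) -> Optional[float]:
    """Expected facts matched by the judge / total expected facts."""
    if total_expected == 0:
        return None  # assumption: undefined when a fixture expects no facts
    return matched / total_expected


def rate(flags: Sequence[bool]) -> float:
    """Share of cases flagged True; usable for empty, json_fail, and retry_exh."""
    if not flags:
        return 0.0  # assumption: no cases means no failures
    return sum(flags) / len(flags)
```

For example, `rate([True, False, False, False])` gives 0.25, which would read as a 25% empty-response rate if the flags mark cases with empty model content.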
Reliability/Latency Metrics
p95_ms: 95th percentile end-to-end latency across fixture cases.
length_finish: rate of responses ending with finish_reason=length.
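The page does not state which percentile method is used, so the sketch below assumes the nearest-rank definition for p95_ms. The function names are hypothetical.

```python
from typing import Sequence


def p95_ms(latencies_ms: Sequence[float]) -> float:
    """95th percentile by nearest rank (an assumed method, not confirmed)."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def length_finish_rate(finish_reasons: Sequence[str]) -> float:
    """Share of responses that ended with finish_reason == "length"."""
    if not finish_reasons:
        return 0.0
    return sum(r == "length" for r in finish_reasons) / len(finish_reasons)
```

With few fixture cases, nearest-rank p95 is simply the slowest case, so a single outlier can dominate the metric.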
Chat Quality Metrics
grounded: keyword coverage score against expected grounded terms.
halluc_rate: cases containing forbidden/out-of-context signals / total cases.
budget_overrun: cases exceeding configured prompt-token budget / total cases (warn-level by default, non-blocking unless configured in pass criteria).
limit_case_reached: share of chat stress fixtures that actually hit the configured limit targets (docs/history/CTA/prompt-size).
stress_grounded / stress_halluc: same quality metrics but computed only on is_limit_case=true fixtures (worst-case validation path).
stability: repeatability score (0..1) across repeated identical prompts; lower score means more run-to-run drift.
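As a rough illustration of the chat quality metrics above, the sketch below scores keyword coverage and forbidden-signal hits, and optionally restricts both to limit cases. Case-insensitive substring matching, averaging per-case scores, and every dictionary key except is_limit_case are assumptions.

```python
from typing import Dict, List, Optional, Sequence


def grounded_score(answer: str, expected_terms: Sequence[str]) -> Optional[float]:
    """Fraction of expected terms found in the answer (case-insensitive)."""
    if not expected_terms:
        return None  # assumption: undefined without expected terms
    text = answer.lower()
    return sum(term.lower() in text for term in expected_terms) / len(expected_terms)


def has_forbidden(answer: str, forbidden_terms: Sequence[str]) -> bool:
    """True if any forbidden/out-of-context term appears in the answer."""
    text = answer.lower()
    return any(term.lower() in text for term in forbidden_terms)


def chat_metrics(cases: List[dict], stress_only: bool = False) -> Dict[str, float]:
    """Aggregate grounded and halluc_rate; stress_only keeps is_limit_case=true."""
    if stress_only:
        cases = [c for c in cases if c.get("is_limit_case")]
    scores = [grounded_score(c["answer"], c["expected_terms"]) for c in cases]
    scored = [s for s in scores if s is not None]
    hits = [has_forbidden(c["answer"], c["forbidden_terms"]) for c in cases]
    return {
        "grounded": sum(scored) / len(scored) if scored else 0.0,
        "halluc_rate": sum(hits) / len(hits) if hits else 0.0,
    }
```

Calling `chat_metrics(cases, stress_only=True)` would correspond to stress_grounded and stress_halluc under these assumptions.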
Helper Surface Metrics
label_accuracy: exact-match score for bounded semantic helper outputs against the fixture expectation.
false_positive: rate at which a helper returns a decisive/actionable result on fixtures that were intentionally ambiguous or should stay empty.
quote_presence / quote_exact: for GE bullets, whether returned bullets include quoted fragments and whether those quoted fragments are actually present in the supplied excerpt.
empty_expected_non_empty: rate at which GE bullet extraction returns no bullets on fixtures that should have grounded evidence.
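The quote checks for GE bullets can be sketched as follows. This assumes quoted fragments are delimited by straight double quotes and that "actually present" means a verbatim, case-sensitive substring of the excerpt; the real matching rules may differ. The function names are hypothetical.

```python
import re
from typing import Sequence, Tuple

# Assumption: fragments are wrapped in straight double quotes.
QUOTE_RE = re.compile(r'"([^"]+)"')


def quote_checks(bullets: Sequence[str], excerpt: str) -> Tuple[float, float]:
    """Return (quote_presence, quote_exact) for one fixture.

    quote_presence: share of bullets that contain at least one quoted fragment.
    quote_exact: share of quoted fragments found verbatim in the excerpt.
    """
    per_bullet = [QUOTE_RE.findall(b) for b in bullets]
    presence = (
        sum(1 for frags in per_bullet if frags) / len(bullets) if bullets else 0.0
    )
    fragments = [f for frags in per_bullet for f in frags]
    exact = (
        sum(f in excerpt for f in fragments) / len(fragments) if fragments else 0.0
    )
    return presence, exact


def label_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Exact-match share of helper labels against fixture expectations."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)
```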