Model Qualification Readiness

Daily readiness snapshot for active Foundry and Chat models. Use it alongside the HEA Spends Viewer: this page explains model readiness, not turn-based NCU or visible Q/A usage.
source: - generated: - limits: - filters: model=all | check=all

Overall Status: -
Gate Ready: -
Active Targets Passing: -

Active Targets
Surface | Model | Status | Profile
Surface Summary
Surface | Status | Models | Passed | Failed
Model Readiness Trend (Last N Runs)
Overall readiness, run by run. Use the model filter above to inspect one candidate over time. Cost is shown once per model because it is a property of the fixture suite, not a run-by-run signal; the tooltip keeps the normalized cost-index details. The cost index is derived from the token usage observed on the qualification fixtures, and 100 means the same estimated cost as the surface reference model.
Legend: Qualified | Rejected/Not ready | Missing/Unknown
Surface | Model | Cost | Last N Runs (latest → older)
Per-Check Trend (Last N Runs)
The same run history, broken down by threshold check for each surface/model. This view replaces the older blockers-only view with the full pass/warn/fail/missing picture.
Legend: Check passed | Warning only | Blocking failure | Missing/No data
Surface | Model | Check | Last N Runs (latest → older)
Recent History
Generated | Run ID | Status | Failed/Total
Metrics Lexicon

How qualification is tested: each run executes fixed fixtures from scripts/evals/fixtures and aggregates metrics per surface/model. Pass/fail is evaluated against thresholds in config/model_qualification_thresholds.json.
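
As a rough illustration of threshold evaluation, the sketch below checks aggregated metrics for one surface against min/max rules. The threshold layout and the metric names (extraction_success_rate, p95_latency_ms) are assumptions made for this example; the actual schema of config/model_qualification_thresholds.json is not shown on this page and may differ.

```python
from typing import Dict, Optional

# Hypothetical threshold layout; the real JSON file may be shaped differently.
THRESHOLDS: Dict[str, Dict[str, Dict[str, float]]] = {
    "foundry": {
        "extraction_success_rate": {"min": 0.95},  # assumed metric name
        "p95_latency_ms": {"max": 20000},          # assumed metric name
    }
}


def evaluate_check(value: Optional[float], rule: Dict[str, float]) -> str:
    """Return 'pass', 'fail', or 'missing' for one metric against one rule."""
    if value is None:
        return "missing"
    if "min" in rule and value < rule["min"]:
        return "fail"
    if "max" in rule and value > rule["max"]:
        return "fail"
    return "pass"


def evaluate_surface(surface: str, metrics: Dict[str, float]) -> Dict[str, str]:
    """Evaluate every thresholded metric for one surface/model."""
    rules = THRESHOLDS.get(surface, {})
    return {name: evaluate_check(metrics.get(name), rule)
            for name, rule in rules.items()}


results = evaluate_surface(
    "foundry",
    {"extraction_success_rate": 0.97, "p95_latency_ms": 25000},
)
print(results)
```

In this sketch a metric with no recorded value is reported as missing instead of being counted as a failure, which mirrors the separate Missing/No data state in the legend above.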

Foundry setup: runs stress-sized chunk-pack extraction cases (context expanded up to the safe character limit), then computes extraction reliability and factual alignment.
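
One way to picture expanding context up to a safe character limit is greedy packing of whole chunks. This is only a sketch under that assumption: pack_chunks, the separator, and the greedy strategy are hypothetical, and the page does not describe how the real fixtures build their context.

```python
from typing import List


def pack_chunks(chunks: List[str], safe_char_limit: int,
                separator: str = "\n\n") -> str:
    """Greedily append whole chunks until the next one would exceed the limit.

    Hypothetical helper; safe_char_limit stands in for a value such as
    foundry_safe_input_char_limit.
    """
    packed: List[str] = []
    used = 0
    for chunk in chunks:
        # The separator only costs characters once something is already packed.
        extra = len(chunk) + (len(separator) if packed else 0)
        if used + extra > safe_char_limit:
            break
        packed.append(chunk)
        used += extra
    return separator.join(packed)


context = pack_chunks(["a" * 40, "b" * 40, "c" * 40], safe_char_limit=100)
print(len(context))
```

Stopping at the first chunk that does not fit keeps chunk order intact, at the cost of leaving some of the budget unused.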

Chat setup: runs strict current-limit guardrail fixtures (chat_runtime) and separate headroom fixtures (chat_runtime_headroom). Headroom checks are warning-only by design. Helper surfaces can also appear here, such as bounded grounded-evidence bullets (chat_ge_bullets) or bounded semantic helper calls (chat_semantic_helper).
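
The warning-only behavior of headroom checks can be sketched as a small classification step. The surface name chat_runtime_headroom comes from the text above; the function names, the status strings, and the choice to treat missing data as not ready are assumptions for illustration only.

```python
from typing import List, Optional

# Surfaces whose failed checks only warn. The set is an illustrative assumption.
WARNING_ONLY_SURFACES = {"chat_runtime_headroom"}


def classify(surface: str, passed: Optional[bool]) -> str:
    """Map one check outcome to pass / warn / fail / missing."""
    if passed is None:
        return "missing"
    if passed:
        return "pass"
    return "warn" if surface in WARNING_ONLY_SURFACES else "fail"


def gate_ready(statuses: List[str]) -> bool:
    """Warnings never block the gate.

    Assumption: both blocking failures and missing data keep the gate closed.
    """
    return not any(status in ("fail", "missing") for status in statuses)


statuses = [
    classify("chat_runtime", True),
    classify("chat_runtime_headroom", False),  # failed headroom check: warn only
]
print(statuses, gate_ready(statuses))
```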

Cost index: estimated from observed prompt/completion tokens on the qualification fixtures, priced with the repo pricing tables. 100 means “same estimated model cost as the surface reference model for that run”. Foundry judge calls and fixed pipeline costs outside the candidate model are intentionally excluded.
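
The cost index described above amounts to pricing the observed tokens and normalizing against the reference model. The sketch below assumes per-million-token prices and made-up model names; the repo pricing tables and the real token counts are not shown on this page.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int


# Hypothetical USD prices per million tokens; not taken from the repo tables.
PRICES: Dict[str, Dict[str, float]] = {
    "reference-model": {"prompt": 1.00, "completion": 4.00},
    "candidate-model": {"prompt": 0.50, "completion": 2.00},
}


def estimated_cost(model: str, usage: Usage) -> float:
    """Price the observed prompt/completion tokens for one model."""
    price = PRICES[model]
    total = (usage.prompt_tokens * price["prompt"]
             + usage.completion_tokens * price["completion"])
    return total / 1_000_000


def cost_index(candidate: str, candidate_usage: Usage,
               reference: str, reference_usage: Usage) -> float:
    """Return 100 when the candidate costs the same as the reference model."""
    reference_cost = estimated_cost(reference, reference_usage)
    if reference_cost <= 0:
        raise ValueError("reference cost must be positive")
    return 100.0 * estimated_cost(candidate, candidate_usage) / reference_cost


index = cost_index(
    "candidate-model", Usage(220_000, 40_000),
    "reference-model", Usage(200_000, 50_000),
)
print(round(index, 1))  # about 47.5 with the made-up numbers above
```

Because each model is priced on its own observed usage, a cheaper per-token model that emits many more tokens can still land above 100.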

Stress limits: Foundry uses foundry_safe_input_char_limit; Chat uses explicit caps from the readiness payload: chat_history_trim_max_tokens_design, chat_history_trim_safe_margin_tokens_design, chat_doc_summary_max_chars_design, chat_cta_max_items_design, and chat_prompt_input_token_limit_design.

Foundry Quality Metrics

Reliability/Latency Metrics

Chat Quality Metrics

Helper Surface Metrics