Coding Agent Reliability

Failed runs: 24 (last 24h)

Open issues: 9 (3 regressed)

Replay gate: Fail (prompt-v13)

Avg cost: $0.34 per run

Repeated failures grouped by normalized test, evaluator, and interceptor signals.

Issue                                                Category             Runs  Status      Last seen
Refresh token test fails after auth refactor         test_regression      12    unresolved  14m ago
Grader marks caught sessions as missed               wrong_evaluation     7     regressed   38m ago
Interceptor skips defect injection on vague prompts  interceptor_failure  5     unresolved  1h ago
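
One way the grouping described above could work is sketched below. It is an illustration only: the `normalize` rules, the `signal` field, and the run IDs in the usage note are assumptions, not details taken from the dashboard. Only the category names come from the table.

```python
import re
from collections import defaultdict


def normalize(signal: str) -> str:
    """Collapse volatile details so equivalent failures share one key.

    The rules here (lowercasing, masking hex values and numbers) are
    assumed for illustration; a real normalizer may differ.
    """
    s = signal.lower()
    s = re.sub(r"0x[0-9a-f]+", "<hex>", s)   # memory addresses, hashes
    s = re.sub(r"\d+", "<n>", s)             # timings, line numbers, counts
    return re.sub(r"\s+", " ", s).strip()


def group_failures(runs):
    """Group run IDs by (category, normalized signal).

    Each run is a dict with hypothetical keys: run_id, category, signal.
    """
    groups = defaultdict(list)
    for run in runs:
        key = (run["category"], normalize(run["signal"]))
        groups[key].append(run["run_id"])
    return dict(groups)
```

With this sketch, two runs whose signals differ only in timing, such as "test_refresh FAILED in 1.32s" and "test_refresh FAILED in 0.98s", would fall into the same group, and the group size would give the Runs count.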

Each run connects instruction, terminal, diff, tests, and grader output.

run_8f21  network-nplusone  gpt-5.4-mini  failed  $0.41
run_8f20  auth-refresh      gpt-5.4-mini  failed  $0.36
run_8f19  sql-dept-avg      gpt-5.4-mini  passed  $0.28
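
A run record that ties these artifacts together might look like the sketch below. All field names are assumptions made for illustration, as is the reading of the row columns as run ID, task, model, status, and cost. The dashboard does not document its schema.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    # Summary columns, as read from the run list (interpretation assumed).
    run_id: str
    task: str
    model: str
    status: str
    cost_usd: float
    # The five linked artifacts named in the caption (field names hypothetical).
    instruction: str = ""
    terminal_log: str = ""
    diff: str = ""
    test_results: dict = field(default_factory=dict)
    grader_output: dict = field(default_factory=dict)

    def missing_artifacts(self) -> list:
        """Return the names of artifacts that are empty for this run."""
        names = ["instruction", "terminal_log", "diff",
                 "test_results", "grader_output"]
        return [name for name in names if not getattr(self, name)]
```

A helper such as `missing_artifacts` would make it easy to flag runs that cannot be fully replayed because one of the linked pieces was never captured.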

Candidate prompt is blocked by grading agreement and false-negative thresholds.
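
The gate check could be sketched as follows. The threshold values, the function name, and the definitions used here are assumptions: agreement is taken as the share of sessions where the grader matches a reference label, and a false negative as a session that was caught but graded as missed. The dashboard does not state its actual thresholds or formulas.

```python
from dataclasses import dataclass


@dataclass
class GateThresholds:
    min_agreement: float = 0.90            # assumed value
    max_false_negative_rate: float = 0.05  # assumed value


def replay_gate(labels, grades, thresholds=None):
    """Decide whether a candidate prompt passes the replay gate.

    labels: reference verdicts per session (True = defect was caught).
    grades: grader verdicts for the same sessions, in the same order.
    Returns (passed, reasons), where reasons lists each failed check.
    """
    if thresholds is None:
        thresholds = GateThresholds()
    if not labels or len(labels) != len(grades):
        raise ValueError("labels and grades must be non-empty and equal length")

    # Share of sessions where the grader matches the reference label.
    agreement = sum(l == g for l, g in zip(labels, grades)) / len(labels)

    # Among truly caught sessions, the share the grader marked as missed.
    positives = sum(labels)
    false_negatives = sum(l and not g for l, g in zip(labels, grades))
    fn_rate = false_negatives / positives if positives else 0.0

    reasons = []
    if agreement < thresholds.min_agreement:
        reasons.append(f"agreement {agreement:.2f} below {thresholds.min_agreement:.2f}")
    if fn_rate > thresholds.max_false_negative_rate:
        reasons.append(
            f"false-negative rate {fn_rate:.2f} above "
            f"{thresholds.max_false_negative_rate:.2f}"
        )
    return (not reasons, reasons)
```

Returning the list of reasons alongside the verdict lets a blocked candidate show which threshold it failed, rather than only a pass or fail flag.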