Results Archive: Project Verity-H
This file is a human-readable summary of evaluation results across versions. The raw result files are stored in the repo history.
v0.2.1 (Qwen3-4B-Instruct-2507, 30 cases)
Best pre-batch-verifier results. 7 LLM calls per case.
| Metric | Baseline Normal | Baseline Honesty | Verity-H |
|---|---|---|---|
| unsupported_claim_rate | 10% | 0% | 0% |
| correct_abstention_rate | 70% | 100% | 100% |
| grounded_accept_rate | 0% | 0% | 100% |
| contradiction_detection_rate | 60% | 40% | 80% |
| pressure_hypothesis_correctness | 0% | 0% | 100% |
| false_contradiction_rate | 0% | 0% | 0% |
| partial_answer_coverage | 0% | 0% | 100% |
| parse_error_rate | - | - | 0% |
| latency_p50 | 3,525ms | 3,244ms | 11,339ms |
Stored in: results_qwen3_4b_v6/
v0.3 (Qwen3-4B, 30 cases)
First batch-verifier version. 2 LLM calls per case (down from 7), with p50 latency 43% lower than v0.2.1.
| Metric | v0.2.1 | v0.3 | Change |
|---|---|---|---|
| unsupported_claim_rate | 0% | 0% | no change |
| correct_abstention_rate | 100% | 100% | no change |
| grounded_accept_rate | 100% | 100% | no change |
| contradiction_detection_rate | 80% | 80% | no change |
| pressure_hypothesis_correctness | 100% | 60% | 🔻 (fixed in v0.3.1) |
| latency_p50 | 11,339ms | 6,495ms | 43% faster |
Stored in: results_qwen3_4b_v7/
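The 43% figure can be checked directly from the two p50 values in the table. A minimal sketch; the helper name `relative_reduction` is ours for illustration and is not part of the project:

```python
def relative_reduction(before_ms: float, after_ms: float) -> float:
    """Return the fractional drop going from before_ms to after_ms."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (before_ms - after_ms) / before_ms


# p50 latencies taken from the v0.2.1 and v0.3 columns of the table above.
reduction = relative_reduction(11_339, 6_495)
print(f"p50 latency reduction: {reduction:.1%}")  # prints "p50 latency reduction: 42.7%"
```

Rounded, that is the 43% quoted above. Note that it is a reduction in latency, which corresponds to a speedup of roughly 1.75x rather than 1.43x.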
v0.3.1 (Qwen3-4B, 100 cases)
Added inference detector. Fixed pressure regression.
Stored in: results_real_llm/
v0.3.2 (Qwen3-4B, 100 cases, sandbox run)
Conservative frame detector, slot-mismatch guard, safer pressure routing.
| Metric | v0.3.1 | v0.3.2 | Target | Status |
|---|---|---|---|---|
| false_contradiction_rate | 9.4% | 0.0% | <3% | ✅ Fixed |
| pressure_hypothesis_correctness | 53.3% | 66.7% | >75% | ⚠️ Missed |
| not_in_evidence_label_rate | 69% | 79.3% | >80% | ⚠️ Just missed |
| unsupported_claim_rate | 15% | 0.0% | 0% | ✅ Fixed |
| contradiction_detection_rate | 60% | 26.7% | high | ❌ Regressed |
Stored in: results/verity_pipeline_v0.3.2.jsonl
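The Status column above can be reproduced mechanically from the v0.3.2 values and the numeric targets. The sketch below is one way to do that; `TARGETS`, `V0_3_2`, and `check_targets` are hypothetical names, and the values are transcribed by hand from the table rather than read from the result file, whose schema is not documented here.

```python
import operator

# (comparison, threshold in percent), transcribed from the Target column.
# contradiction_detection_rate is omitted: its target ("high") is qualitative.
TARGETS = {
    "false_contradiction_rate": (operator.lt, 3.0),
    "pressure_hypothesis_correctness": (operator.gt, 75.0),
    "not_in_evidence_label_rate": (operator.gt, 80.0),
    # Target is exactly 0%; a rate cannot go below zero, so <= avoids float equality.
    "unsupported_claim_rate": (operator.le, 0.0),
}

# v0.3.2 values in percent, transcribed from the table.
V0_3_2 = {
    "false_contradiction_rate": 0.0,
    "pressure_hypothesis_correctness": 66.7,
    "not_in_evidence_label_rate": 79.3,
    "unsupported_claim_rate": 0.0,
}


def check_targets(results, targets):
    """Map each metric name to True if its value satisfies the target."""
    return {
        name: compare(results[name], threshold)
        for name, (compare, threshold) in targets.items()
    }


status = check_targets(V0_3_2, TARGETS)
for name, met in status.items():
    print(f"{name}: {'met' if met else 'missed'}")
```

Under these assumptions the two rate metrics with a 0% or <3% target come out as met, and the two with >75% and >80% targets come out as missed, matching the table.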
v0.4: Simplified Baseline
Status: Current version. Simplified for maintainability.
Key simplifications:
- Frame-based contradiction detector removed (status-pair only)
- Slot-mismatch guard removed (semantic relevance is a known limitation)
- Numeric/date/money conflicts logged as `possible_conflict` but NOT forced
- 209 tests pass, zero false contradictions
- 2 LLM calls per case
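To illustrate the "logged but not forced" behavior in the list above, here is a rough sketch of how a numeric mismatch could be recorded without changing the verdict. This is not the project's implementation: the `Verdict` class, the `note_numeric_conflict` function, the logger name, and the number-matching heuristic are all assumptions made for the example.

```python
import logging
import re
from dataclasses import dataclass, field

logger = logging.getLogger("verity_h_sketch")  # hypothetical logger name

NUMBER = re.compile(r"\d+(?:\.\d+)?")  # crude heuristic: bare integers and decimals


@dataclass
class Verdict:
    """Hypothetical verdict record; not the project's real data model."""
    contradiction: bool = False
    notes: list[str] = field(default_factory=list)


def note_numeric_conflict(claim: str, evidence: str, verdict: Verdict) -> Verdict:
    """Attach a possible_conflict note when the numbers differ.

    The contradiction flag is deliberately left untouched: the mismatch
    is logged for review, not forced into a contradiction verdict.
    """
    claim_nums = set(NUMBER.findall(claim))
    evidence_nums = set(NUMBER.findall(evidence))
    if claim_nums and evidence_nums and claim_nums.isdisjoint(evidence_nums):
        verdict.notes.append("possible_conflict")
        logger.info(
            "possible_conflict: claim=%s evidence=%s",
            sorted(claim_nums),
            sorted(evidence_nums),
        )
    return verdict


v = note_numeric_conflict(
    "Revenue was 12 million", "Revenue reached 15 million", Verdict()
)
print(v.contradiction, v.notes)  # prints "False ['possible_conflict']"
```

Keeping the flag and the note separate is what lets a run report zero false contradictions while still surfacing suspicious numeric differences for inspection.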
Stored in: results/ (future runs)