# Results Archive: Project Verity-H

This file summarizes the evaluation results for each version. The raw result files
are stored in the repo history; this document is a human-readable summary.

---

## v0.2.1 (Qwen3-4B-Instruct-2507, 30 cases)

Best pre-batch-verifier results. 7 LLM calls per case.

| Metric | Baseline Normal | Baseline Honesty | Verity-H |
|--------|----------------|-----------------|-------------------|
| unsupported_claim_rate | 10% | 0% | **0%** |
| correct_abstention_rate | 70% | 100% | **100%** |
| grounded_accept_rate | 0% | 0% | **100%** |
| contradiction_detection_rate | 60% | 40% | **80%** |
| pressure_hypothesis_correctness | 0% | 0% | **100%** |
| false_contradiction_rate | 0% | 0% | **0%** |
| partial_answer_coverage | 0% | 0% | **100%** |
| parse_error_rate | n/a | n/a | **0%** |
| latency_p50 | 3,525ms | 3,244ms | 11,339ms |

Stored in: `results_qwen3_4b_v6/`

---

## v0.3 (Qwen3-4B, 30 cases)

First batch-verifier version. 2 LLM calls per case (down from 7). p50 latency is 43% lower.

| Metric | v0.2.1 | v0.3 | Change |
|--------|--------|------|--------|
| unsupported_claim_rate | 0% | 0% | ✅ |
| correct_abstention_rate | 100% | 100% | ✅ |
| grounded_accept_rate | 100% | 100% | ✅ |
| contradiction_detection_rate | 80% | 80% | ✅ |
| pressure_hypothesis_correctness | **100%** | **60%** | 🔻 (fixed in v0.3.1) |
| latency_p50 | 11,339ms | **6,495ms** | ✅ 43% faster |

Stored in: `results_qwen3_4b_v7/`
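
The "43% faster" figure can be reproduced from the two p50 latencies in the table above. The snippet below is a minimal arithmetic sketch; the function and variable names are illustrative and are not taken from the project code.

```python
def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Return the relative drop from before_ms to after_ms, in percent."""
    return (before_ms - after_ms) / before_ms * 100.0


# p50 latencies copied from the v0.3 table above.
v021_p50_ms = 11_339
v03_p50_ms = 6_495

reduction = percent_reduction(v021_p50_ms, v03_p50_ms)
speedup = v021_p50_ms / v03_p50_ms

print(f"latency reduction: {reduction:.1f}%")  # latency reduction: 42.7%
print(f"speedup factor: {speedup:.2f}x")       # speedup factor: 1.75x
```

Note that "43% faster" here means the p50 latency dropped by about 43%, which corresponds to roughly a 1.75x speedup rather than a 1.43x one.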

---

## v0.3.1 (Qwen3-4B, 100 cases)

Added inference detector. Fixed pressure regression.

Stored in: `results_real_llm/`

---

## v0.3.2 (Qwen3-4B, 100 cases, sandbox run)

Conservative frame detector, slot-mismatch guard, safer pressure routing.

| Metric | v0.3.1 | v0.3.2 | Target | Status |
|--------|--------|--------|--------|--------|
| false_contradiction_rate | 9.4% | **0.0%** | <3% | ✅ Fixed |
| pressure_hypothesis_correctness | 53.3% | **66.7%** | >75% | ⚠️ Missed |
| not_in_evidence_label_rate | 69% | **79.3%** | >80% | ⚠️ Just missed |
| unsupported_claim_rate | 15% | **0.0%** | 0% | ✅ Fixed |
| contradiction_detection_rate | 60% | **26.7%** | high | ❌ Regressed |

Stored in: `results/verity_pipeline_v0.3.2.jsonl`
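
The Status column above can be checked mechanically against the numeric targets. The following is a hedged sketch, not the project's actual evaluation code: `CHECKS` and `evaluate_targets` are hypothetical names, and the values are copied by hand from the table rather than read from the result file.

```python
import operator

# (metric, v0.3.2 value in %, comparison, target in %) copied from the table above.
# contradiction_detection_rate is omitted because its target ("high") is not numeric.
CHECKS = [
    ("false_contradiction_rate", 0.0, operator.lt, 3.0),
    ("pressure_hypothesis_correctness", 66.7, operator.gt, 75.0),
    ("not_in_evidence_label_rate", 79.3, operator.gt, 80.0),
    ("unsupported_claim_rate", 0.0, operator.eq, 0.0),
]


def evaluate_targets(checks):
    """Map each metric name to True if its value satisfies the target."""
    return {name: compare(value, target) for name, value, compare, target in checks}


results = evaluate_targets(CHECKS)
for name, met in results.items():
    print(f"{name}: {'met' if met else 'missed'}")
```

Under these assumptions, the two "Fixed" rows come out as met and the two "Missed" rows as missed, matching the table.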

---

## v0.4: Simplified Baseline

**Status:** Current version. Simplified for maintainability.

Key simplifications:
- Frame-based contradiction detector removed (status-pair only)
- Slot-mismatch guard removed (semantic relevance is a known limitation)
- Numeric/date/money conflicts logged as `possible_conflict` but NOT forced
- 209 tests pass, zero false contradictions
- 2 LLM calls per case
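
To illustrate what "logged as `possible_conflict` but NOT forced" could look like, here is a minimal sketch. It is an assumption about the behaviour described above, not the project's implementation: the helper `check_numeric_conflict`, the logger name, and the returned dictionary keys are all hypothetical.

```python
import logging

logger = logging.getLogger("verity_sketch")  # hypothetical logger name


def check_numeric_conflict(field, value_a, value_b):
    """Hypothetical helper: flag differing values without forcing a verdict.

    Returns a note labelled `possible_conflict` when the two values differ,
    or None when they agree. The note never marks a contradiction as forced;
    the final decision is left to the rest of the pipeline.
    """
    if value_a != value_b:
        logger.warning("possible_conflict on %s: %r vs %r", field, value_a, value_b)
        return {
            "field": field,
            "label": "possible_conflict",
            "forced_contradiction": False,
        }
    return None


note = check_numeric_conflict("invoice_total", 1200, 1250)
print(note)
```

The design point is that a numeric, date, or money mismatch is recorded for inspection but does not by itself produce a contradiction label, which is consistent with the goal of zero false contradictions.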

Stored in: `results/` (future runs)