
Results Archive: Project Verity-H

This file is a human-readable summary of the evaluation results for every version. The raw result files are stored in the repo history.


v0.2.1 (Qwen3-4B-Instruct-2507, 30 cases)

Best pre-batch-verifier results. 7 LLM calls per case.

| Metric | Baseline Normal | Baseline Honesty | Verity-H |
|---|---|---|---|
| unsupported_claim_rate | 10% | 0% | 0% |
| correct_abstention_rate | 70% | 100% | 100% |
| grounded_accept_rate | 0% | 0% | 100% |
| contradiction_detection_rate | 60% | 40% | 80% |
| pressure_hypothesis_correctness | 0% | 0% | 100% |
| false_contradiction_rate | 0% | 0% | 0% |
| partial_answer_coverage | 0% | 0% | 100% |
| parse_error_rate | - | - | 0% |
| latency_p50 | 3,525 ms | 3,244 ms | 11,339 ms |

Stored in: results_qwen3_4b_v6/


v0.3 (Qwen3-4B, 30 cases)

First batch-verifier version. 2 LLM calls per case (down from 7 in v0.2.1). p50 latency is 43% lower.

| Metric | v0.2.1 | v0.3 | Change |
|---|---|---|---|
| unsupported_claim_rate | 0% | 0% | ✅ |
| correct_abstention_rate | 100% | 100% | ✅ |
| grounded_accept_rate | 100% | 100% | ✅ |
| contradiction_detection_rate | 80% | 80% | ✅ |
| pressure_hypothesis_correctness | 100% | 60% | 🔻 (fixed in v0.3.1) |
| latency_p50 | 11,339 ms | 6,495 ms | ✅ 43% faster |

Stored in: results_qwen3_4b_v7/
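
The "43% faster" figure can be reproduced from the two p50 latencies in the table above. The snippet below is a minimal sketch of that arithmetic; the function name `latency_reduction_pct` is hypothetical and not part of the project code.

```python
def latency_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Percent reduction in latency when going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms * 100.0


# p50 latencies taken from the v0.2.1 and v0.3 columns of the table.
reduction = latency_reduction_pct(11_339, 6_495)
speedup = 11_339 / 6_495

print(f"{reduction:.1f}% lower p50 latency")  # prints "42.7% lower p50 latency"
print(f"{speedup:.2f}x speedup")
```

Note that a 43% reduction in latency is not the same as a 43% gain in throughput: the same two numbers correspond to a speedup factor of roughly 1.75x.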


v0.3.1 (Qwen3-4B, 100 cases)

Added inference detector. Fixed pressure regression.

Stored in: results_real_llm/


v0.3.2 (Qwen3-4B, 100 cases, sandbox run)

Conservative frame detector, slot-mismatch guard, safer pressure routing.

| Metric | v0.3.1 | v0.3.2 | Target | Status |
|---|---|---|---|---|
| false_contradiction_rate | 9.4% | 0.0% | <3% | ✅ Fixed |
| pressure_hypothesis_correctness | 53.3% | 66.7% | >75% | ⚠️ Missed |
| not_in_evidence_label_rate | 69% | 79.3% | >80% | ⚠️ Just missed |
| unsupported_claim_rate | 15% | 0.0% | 0% | ✅ Fixed |
| contradiction_detection_rate | 60% | 26.7% | high | ❌ Regressed |

Stored in: results/verity_pipeline_v0.3.2.jsonl
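
The Status column above follows from comparing each v0.3.2 value against its target. The snippet below is a minimal sketch of that comparison, assuming targets can be written as an operator plus a threshold in percent; the `Target` class and `meets_target` function are hypothetical names, not part of the project code. `contradiction_detection_rate` is left out because its target ("high") is not numeric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    op: str       # one of "<", ">", "=="
    value: float  # threshold, in percent


def meets_target(observed: float, target: Target) -> bool:
    """Return True if the observed percentage satisfies the target."""
    if target.op == "<":
        return observed < target.value
    if target.op == ">":
        return observed > target.value
    if target.op == "==":
        return observed == target.value
    raise ValueError(f"unsupported operator: {target.op!r}")


# v0.3.2 values and targets copied from the table.
V032 = {
    "false_contradiction_rate": (0.0, Target("<", 3.0)),
    "pressure_hypothesis_correctness": (66.7, Target(">", 75.0)),
    "not_in_evidence_label_rate": (79.3, Target(">", 80.0)),
    "unsupported_claim_rate": (0.0, Target("==", 0.0)),
}

statuses = {name: meets_target(obs, tgt) for name, (obs, tgt) in V032.items()}
for name, ok in statuses.items():
    print(f"{name}: {'met' if ok else 'missed'}")
```

Under these assumptions the two metrics marked "Fixed" meet their targets and the two marked "Missed" or "Just missed" do not.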


v0.4: Simplified Baseline

Status: Current version. Simplified for maintainability.

Key simplifications:

  • Frame-based contradiction detector removed (status-pair only)
  • Slot-mismatch guard removed (semantic relevance is a known limitation)
  • Numeric/date/money conflicts logged as possible_conflict but NOT forced
  • 209 tests pass, zero false contradictions
  • 2 LLM calls per case

Stored in: results/ (future runs)
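
The "logged as possible_conflict but NOT forced" behaviour in the list above can be pictured as an advisory flag that never changes the verdict. The snippet below is only an illustrative sketch of that idea for the numeric case: the function names, the regex-based number matching, and the verdict strings are all assumptions, not the project's actual implementation.

```python
from __future__ import annotations

import logging
import re

logger = logging.getLogger("verity_h_sketch")  # hypothetical logger name

# Simplistic number matcher (assumption): plain integers and decimals only.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(text: str) -> set[str]:
    """Collect the numeric tokens that appear in a piece of text."""
    return set(NUMBER.findall(text))


def check_numeric(claim: str, evidence: str, verdict: str) -> tuple[str, list[str]]:
    """Return the verdict unchanged, plus any advisory flags.

    If both texts contain numbers and none of them match, the case is
    flagged as possible_conflict and logged, but the verdict is not
    overridden.
    """
    flags: list[str] = []
    claim_nums, evidence_nums = numbers_in(claim), numbers_in(evidence)
    if claim_nums and evidence_nums and claim_nums.isdisjoint(evidence_nums):
        flags.append("possible_conflict")
        logger.info("possible_conflict: claim=%s evidence=%s",
                    sorted(claim_nums), sorted(evidence_nums))
    return verdict, flags


verdict, flags = check_numeric(
    "Revenue was 12 million", "Revenue was 15 million", "supported"
)
print(verdict, flags)
```

Keeping the flag separate from the verdict is one way to stay consistent with the zero-false-contradictions goal: a mismatch is recorded for review without being promoted to a contradiction.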