# Project Verity-H v0.4

**Teaching AI to say "I don't know."**

When humans lack knowledge, they admit it — "I'm not sure", "I don't know", "let me check." LLMs don't. They fill gaps with plausible-sounding assumptions and present them as facts. Verity-H researches whether a lightweight verification pipeline can enforce honest behavior: **share what you know, flag what you don't, never silently guess.**

The system lets an LLM answer a question, then **verifies every claim against the provided evidence** before the user sees it. Supported claims pass through. Unsupported claims get flagged. Contradictions get caught. The user sees what's verified vs. what's a guess — like talking to an honest colleague.

> For the full architecture, research grounding, and design decisions, see **[DESIGN.md](DESIGN.md)**.

---

## Quick Start

```bash
# Clone
git clone https://huggingface.co/Sravanth18/verity-h-prototype
cd verity-h-prototype

# Setup
python -m venv .venv && source .venv/bin/activate
pip install -e ".[test]"

# Run tests (mock mode, no API key needed)
pytest
```

## Run Evaluation

```bash
# Set environment
export LLM_MODE=hf_api
export HF_API_KEY=your-key-here
export MODEL_NAME=Qwen/Qwen3-4B-Instruct-2507
export LLM_CALL_DELAY=2

# Baselines
python -m src.baseline_runner --mode normal --output results/baseline_normal.jsonl
python -m src.baseline_runner --mode honesty --output results/baseline_honesty.jsonl

# Pipeline
python -m src.pipeline_runner --output results/verity_pipeline_v0.4.jsonl

# Batched (resumable if interrupted)
python run_pipeline_batched.py --delay 0.5 --output results/verity_pipeline_v0.4.jsonl

# Report
python -m src.report --normal results/baseline_normal.jsonl \
                     --honesty results/baseline_honesty.jsonl \
                     --pipeline results/verity_pipeline_v0.4.jsonl \
                     --output results/report.md
```

## How It Works

```
Question + Evidence
       │
       ▼
  1. Split evidence into spans          (deterministic)
  2. Draft answer                        (LLM call #1)
  3. Extract + label claims              (LLM call #2)
  4. Post-process:                       (deterministic)
     • Filter junk/meta claims
     • Fix mislabeled claims via span matching
     • Detect inferential claims (4-tier)
     • Detect contradictions (status-pair only; numeric/date logged for audit)
  5. Gate decision                       (deterministic)
       │
       ▼
  Final answer with transparency metadata
```

**2 LLM calls per case.** Everything else is deterministic and auditable.

## Pipeline Decisions

| Situation | Decision | What user sees |
|-----------|----------|---------------|
| All claims verified | `accept` | Clean answer from verified claims |
| Some claims unverified | `partial` | "What I can verify" + "What I cannot verify" |
| Status-pair contradiction (open/closed, approved/rejected, etc.) | `contradiction` | Flags conflict, shows both sides |
| No evidence for the question | `needs_info` | "I don't have enough info" + what's needed |
| Speculative question (pressure=1) | `hypothesis` | Low-confidence guess with full caveats |
| Verifier failed to parse | `verifier_error` | Refuses to answer |

## Inference Detection (v0.3.1+)

The verifier catches claims the LLM wrongly marks as SUPPORTED:

| Tier | What it catches | Example |
|------|----------------|---------|
| 1. Epistemic hedges | "suggests", "consistent with", "most likely" | "Symptoms are *consistent with* bacterial infection" |
| 2. Logical leaps | "therefore", "based on these findings" | "*Therefore* the patient has strep throat" |
| 3. Deontic/normative | "should", "recommended", "indicated" | "Antibiotics *should be* started" |
| 4. Speculative questions | Question asks for judgment/prediction | "Should we invest?" → answer is inherently inferential |

Grounded in: CogniBench (arxiv:2505.20767), GME modality taxonomy (arxiv:2106.08037), BioScope corpus.

## Results (Qwen3-4B, 30 cases, v0.2.1)

| Metric | Baseline Normal | Baseline Honesty | Verity-H |
|--------|:-:|:-:|:-:|
| Unsupported claim rate (↓) | 10% | 0% | **0%** |
| Correct abstention (↑) | 70% | 100% | **100%** |
| Grounded accept (↑) | 0% | 0% | **100%** |
| Contradiction detection (↑) | 60% | 40% | **80%** |
| Pressure hypothesis (↑) | 0% | 0% | **100%** *(v0.2.1)* |
| False contradiction (↓) | 0% | 0% | **0%** |
| Partial coverage (↑) | 0% | 0% | **100%** |
| Latency p50 | 3,525ms | 3,244ms | **6,495ms** *(v0.3, 2-call batch)* |

See [RESULTS_ARCHIVE.md](RESULTS_ARCHIVE.md) for full version history.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_MODE` | `mock` | `mock` / `api` (OpenAI) / `hf_api` (HuggingFace) |
| `HF_API_KEY` | — | HuggingFace API key (for `hf_api` mode) |
| `OPENAI_API_KEY` | — | OpenAI API key (for `api` mode) |
| `MODEL_NAME` | `Qwen/Qwen3-4B-Instruct-2507` | Model to use |
| `LLM_TEMPERATURE` | `0.0` | Temperature |
| `LLM_MAX_TOKENS` | `2048` | Max tokens per response |
| `LLM_CALL_DELAY` | `2` | Seconds between API calls (rate limiting) |
| `LLM_MAX_CALLS_PER_MINUTE` | `30` | Per-minute rate limit |

## Gold Cases

100 cases across 6 categories (development set):

| Category | Count | Tests |
|----------|:-----:|-------|
| `grounded` | 17 | All claims in evidence → accept |
| `missing_info` | 14 | Evidence doesn't cover question → abstain |
| `contradiction` | 15 | Conflicting facts in evidence → flag |
| `pressure` | 15 | Speculative question → hypothesis with caveats |
| `filler_trap` | 15 | Tempts model to invent facts → abstain |
| `partial_answer` | 24 | Some facts available, some not → partial |

**100 total cases — development set only. Not a held-out evaluation.**

## Tests

209 tests covering all modules. Run with `pytest -v`.

```
tests/
├── test_calibration.py          # Table-format probe validation
├── test_claim_filter.py         # Slot-aware relevance filtering
├── test_constants.py            # Shared stop words
├── test_contradiction_checks.py # Status-pair contradictions + possible_conflict audit
├── test_evidence_spans.py       # Abbreviation-aware splitting
├── test_gate.py                 # All gate rules + edge cases
├── test_inference_detector.py   # All 4 tiers + exact failure cases
├── test_metrics.py              # Pipeline + baseline metrics
├── test_schemas.py              # Pydantic validation
├── test_span_matcher.py         # Substring/fuzzy/numeric matching
└── test_verifier.py             # Batch table parser + integration
```

## What This Does NOT Do

- No internet search or retrieval (RAG)
- No vector databases
- No fine-tuning
- No UI or deployment
- No GPU required

This is a **research harness**, not a product.

## Known Limitations (v0.4)

The v0.4 baseline intentionally trades some detection for **zero false positives** and **maintainable code**.

| # | Limitation | Why | Mitigation |
|---|-----------|-----|------------|
| 1 | **Numeric contradictions not caught deterministically** | Money/percentage/count/date conflicts have too many false positives (e.g., revenue target vs actual revenue). | Relies on verifier LLM. If LLM misses, contradiction is not flagged. |
| 2 | **Semantic relevance not enforced** | "How fast can the car go?" with only engine specs supported → `accept`. v0.3.2 had a 20-entry synonym-table guard but it was too rule-heavy for a baseline. | Acceptable for v0.4. Future: semantic similarity check (not synonym table). |
| 3 | **100 cases = dev set only** | The deterministic rules were tuned against failures on this set. Results are directional, not publication-grade. | Create held-out 50-case test set for unbiased validation. |
| 4 | **Inference detector is regex-based** | Covers common hedges but cannot catch all inferential reasoning. | Grounded in CogniBench + GME + BioScope; handles most common cases. |
| 5 | **Single evidence document** | No multi-document consensus or evidence weighting. | Designed for single-pass evaluation. |

## Next Steps

- [x] Simplify to v0.4 baseline — status-pair contradictions only, no frame detector
- [x] Remove slot-mismatch guard (semantic relevance is known limitation)
- [x] 209 tests pass, zero false contradictions
- [ ] Run v0.4 eval on full 100-case development set
- [ ] Test on multiple models (1B, 4B, 70B+) to prove model independence
- [ ] Create held-out 50-case test set for unbiased evaluation
- [ ] Confidence calibration analysis

---

*See [DESIGN.md](DESIGN.md) for the full architecture document.*