
Commit fed3912

Feat/selfcheckgpt (#4)
* Update lock, toml and config
* Prompt utilities per language
* Save hallucinated labels
* Train hallucination detector
* Catch if text appears multiple times in hallucinated answers
* Update languages to literal type
* Apply suggestions from code review (minor changes)
* Minor changes based on review
* Changes from review
* Detect hallucinations in model to evaluate
* Minor edits
* Clean up yaml
* Clean up yaml
* Bugfix
* Temperature as kwargs
* Implementation of selfcheckgpt
* Bugfix
* Bugfix
* Implementation of selfcheckgpt
* cleanup
* selfcheckgpt update
* Clean up and temperature changes
* Bugfix
* Remove junk from Qwen outputs
* selfcheckgpt update
* Change yaml settings, and minor cleanups
* Increase max output tokens for selfcheckgpt
* Add support for OpenAI models
* Clean up content in answers
* OpenAI support for SelfCheckGPT
* Prompt utils for selfcheckgpt
* Selfcheckgpt cleanup
* Selfcheckgpt simplified
* Code-check fix
* Resolving copilot code review
* Resolving copilot code review
* Fix for test
* Fix for mypy
* Mainly mypy checks
* Fix code-check
* Fix code-check
* Need to add package as dependency for mypy
* Relative import for mypy
* English as default may solve the mypy issue
* Try suppress the mypy error
* Exclude train.py from pre-commit mypy check
* Explicit literal for mypy
* Add all languages for EuroEval
* Add QA prompt for each language, used for formatting
* Remove forced use_safetensors
* Implement review comments on mainly formatting and docstrings
* New logic for clearing model outputs from special tokens
* Str formatting bug
* Update src/factuality_eval/hallucination_detection.py
* Implement review
* Update src/factuality_eval/model_generation.py
* Update src/factuality_eval/model_generation.py

Co-authored-by: Dan Saattrup Smart <47701536+saattrupdan@users.noreply.github.com>
1 parent 04ec722 · commit fed3912

40 files changed · 1,117 additions and 122 deletions

.pre-commit-config.yaml

Lines changed: 3 additions & 3 deletions
@@ -10,7 +10,7 @@ repos:
       - id: trailing-whitespace
       - id: debug-statements
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.13.0
+    rev: v0.14.13
     hooks:
       - id: ruff-check
         args:
@@ -27,11 +27,11 @@ repos:
         - pyi
         - jupyter
   - repo: https://github.com/kynan/nbstripout
-    rev: 0.8.1
+    rev: 0.9.0
     hooks:
       - id: nbstripout
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.18.1
+    rev: v1.19.1
     hooks:
       - id: mypy
         args:

README.md

Lines changed: 10 additions & 14 deletions
@@ -1,32 +1,30 @@
 # Factuality Evaluation of LLMs
 
 ______________________________________________________________________
-[![Code Coverage](https://img.shields.io/badge/Coverage-91%25-green.svg)](https://github.com/alexandrainst/factuality_eval/tree/main/tests)
+[![Code Coverage](https://img.shields.io/badge/Coverage-48%25-orange.svg)](https://github.com/alexandrainst/factuality_eval/tree/main/tests)
 [![Documentation](https://img.shields.io/badge/docs-passing-green)](https://alexandrainst.github.io/factuality_eval)
 [![License](https://img.shields.io/github/license/alexandrainst/factuality_eval)](https://github.com/alexandrainst/factuality_eval/blob/main/LICENSE)
 [![LastCommit](https://img.shields.io/github/last-commit/alexandrainst/factuality_eval)](https://github.com/alexandrainst/factuality_eval/commits/main)
 [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/alexandrainst/factuality_eval/blob/main/CODE_OF_CONDUCT.md)
 
-
 ## Literature Review
 
 ### Evaluation Tools
 
 | Paper title | Authors | Affiliation | Published | Code | Summary | Comments | Languages | Tool |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs | Iqbal, H., Wang, Y., Wang, M., Georgiev, G., Geng, J., Gurevych, I., & Nakov, P. | MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) | 2024-08 | https://github.com/mbzuai-nlp/openfactcheck | OpenFactCheck has 3 modules: \n\n- RESPONSEEVAL: customize fact-checking system and assess the factuality of all claims in an input document\n- LLMEVAL: assess overall factuality of an LLM\n- CHECKEREVAL: evaluate automatic fact-checking systems | They created two datasets: [FactQA](https://raw.githubusercontent.com/hasaniqbal777/OpenFactCheck/main/src/openfactcheck/templates/llm/questions.csv) (6480 questions) and [FactBench](https://raw.githubusercontent.com/hasaniqbal777/OpenFactCheck/main/src/openfactcheck/templates/factchecker/claims.jsonl) (4507 claims). | English, Urdu | OpenFactCheck |
-| Loki: An Open-Source Tool for Fact Verification | Li, H., Han, X., Wang, H., Wang, Y., Wang, M., Xing, R., ... & Baldwin | LibrAI, MBZUAI, Monash University, The University of Melbourne | 2024-10 | https://github.com/Libr-AI/OpenFactVerification | | https://loki.librai.tech/ | Multilingual | Loki |
+| OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs | Iqbal, H., Wang, Y., Wang, M., Georgiev, G., Geng, J., Gurevych, I., & Nakov, P. | MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) | 2024-08 | <https://github.com/mbzuai-nlp/openfactcheck> | OpenFactCheck has 3 modules: \n\n- RESPONSEEVAL: customize fact-checking system and assess the factuality of all claims in an input document\n- LLMEVAL: assess overall factuality of an LLM\n- CHECKEREVAL: evaluate automatic fact-checking systems | They created two datasets: [FactQA](https://raw.githubusercontent.com/hasaniqbal777/OpenFactCheck/main/src/openfactcheck/templates/llm/questions.csv) (6480 questions) and [FactBench](https://raw.githubusercontent.com/hasaniqbal777/OpenFactCheck/main/src/openfactcheck/templates/factchecker/claims.jsonl) (4507 claims). | English, Urdu | OpenFactCheck |
+| Loki: An Open-Source Tool for Fact Verification | Li, H., Han, X., Wang, H., Wang, Y., Wang, M., Xing, R., ... & Baldwin | LibrAI, MBZUAI, Monash University, The University of Melbourne | 2024-10 | <https://github.com/Libr-AI/OpenFactVerification> | | <https://loki.librai.tech/> | Multilingual | Loki |
 | | | | | | | | | FactScore |
-| | | | | https://www.comet.com/site/blog/selfcheckgpt-for-llm-evaluation/ | | A blackbox hallucination detection method that relies solely on stochastic sampling of model responses. The core intuition of their method is that factually accurate responses are typically consistent and frequent, whereas hallucinated outputs tend to vary and contradict each other. | | SelfCheckGPT |
+| | | | | <https://www.comet.com/site/blog/selfcheckgpt-for-llm-evaluation/> | | A blackbox hallucination detection method that relies solely on stochastic sampling of model responses. The core intuition of their method is that factually accurate responses are typically consistent and frequent, whereas hallucinated outputs tend to vary and contradict each other. | | SelfCheckGPT |
 | Long-form factuality in large language models | | | | | | | | LongForm SAFE |
 | | | | | | | Not open-source | | Perplexity fact checker |
 | Hallucination to Truth: A Review of Fact-Checking and Factuality\n\nEvaluation in Large Language Models | Rahman, S. S., Islam, M. A., Alam, M. M., Zeba, M., Rahman, M. A., Chowa, S. S., ... & Azam, S. | United International University (Bangladesh), Daffodil International University (Bangladesh), Charles Darwin University (Australia) | 2025-08 | | | | | |
 | FACTTEST: FACTUALITY TESTING IN LARGE LANGUAGE MODELS WITH FINITE-SAMPLE AND DISTRIBUTION-FREE GUARANTEES | Fan Nie1 Xiaotian Hou2 Shuhang Lin2 James Zou1 Huaxiu Yao3 Linjun Zhang | Stanford University, 2Rutgers University, 3UNC-Chapel Hill | 2024-11 | | Used to "finetune" models to not answer if the answer is likely to be false. | | | |
 | Seq vs Seq: An Open Suite of Paired Encoders and Decoders | | | | | TinyLettuce is used to have a dataset consisting of hallunications and correct responses.\n\n*"**The Problem**: Training robust hallucination detection models requires large datasets of both correct and hallucinated responses. Manually creating such datasets is expensive and time-consuming.*\n\n***Our Solution****: LettuceDetect's synthetic data generation pipeline can generate realistic hallucinations from factual content."* | | | |
-| Hallucination Risk Calculator & Prompt Re‑engineering Toolkit (OpenAI‑only) | | | | https://hassana.io/readme.html | Calculate the risk of hallucination based on a prompt.\n\nBasically just entropy calculation?\n\nProblem is, which prompts should we supply? | | | |
+| Hallucination Risk Calculator & Prompt Re‑engineering Toolkit (OpenAI‑only) | | | | <https://hassana.io/readme.html> | Calculate the risk of hallucination based on a prompt.\n\nBasically just entropy calculation?\n\nProblem is, which prompts should we supply? | | | |
 | (Im)possibility of Automated Hallucination Detection in\n\nLarge Language Models | | | | | Not possible if trained only on correct samples (duh) | | | |
-| HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | | | | https://github.com/RUCAIBox/HaluEval | Many citations | | | |
-
+| HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | | | | <https://github.com/RUCAIBox/HaluEval> | Many citations | | | |
 
 ### Evaluation Benchmarks and Datasets
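The SelfCheckGPT row above is the method this commit wires in: sample the model several times at non-zero temperature and flag statements the samples fail to corroborate, since consistent answers tend to be factual while hallucinations drift. A minimal sketch of that consistency idea, using plain token overlap as a stand-in for the NLI- or LLM-based scorer (the helper below is illustrative, not this repo's API):

```python
# Toy sketch of the SelfCheckGPT intuition: a claim corroborated by many
# independently sampled answers is likely factual; one the samples contradict
# or omit is likely hallucinated. Token overlap stands in for a real
# NLI- or LLM-based consistency scorer.


def consistency_score(sentence: str, samples: list[str]) -> float:
    """Mean lexical overlap between `sentence` and each sampled answer.

    Low scores suggest the sentence is unsupported, i.e. a possible
    hallucination.
    """
    tokens = set(sentence.lower().split())
    if not tokens or not samples:
        return 0.0
    overlaps = [
        len(tokens & set(sample.lower().split())) / len(tokens) for sample in samples
    ]
    return sum(overlaps) / len(overlaps)


# Stochastic samples (temperature > 0) from the same model and prompt.
samples = [
    "Copenhagen is the capital of Denmark.",
    "Denmark's capital city is Copenhagen.",
    "The capital of Denmark is Copenhagen.",
]
print(consistency_score("Copenhagen is the capital of Denmark.", samples))  # high
print(consistency_score("Aarhus is the capital of Denmark.", samples))  # lower
```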

@@ -53,10 +51,9 @@ ______________________________________________________________________
 | | | | | | | SimpleQA |
 | | | | | | Possibly not public/open. | PersonQA |
 | TRUSTSCORE: REFERENCE-FREE EVALUATION OF LLM RESPONSE TRUSTWORTHINESS | Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan | University of Edinburgh,\n\nHuawei Edinburgh Research Centre | | | | TrustScore |
-| Know What You Don't Know: Unanswerable Questions for SQuAD | | | 2018-11 | https://rajpurkar.github.io/SQuAD-explorer/ | Many | SQuAD |
+| Know What You Don't Know: Unanswerable Questions for SQuAD | | | 2018-11 | <https://rajpurkar.github.io/SQuAD-explorer/> | Many | SQuAD |
 | | | | | | is **an automatic evaluation metric for factual precision in long-form text generation**. It uses large language models and retrieval to break down generations into atomic facts and then measure the correctness with respect to a knowledge source (like Wikipedia). | FactScore |
 
-
 ### Papers from Dan
 
 [Survey on Factuality in Large Language Models](https://dl.acm.org/doi/10.1145/3742420 "https://dl.acm.org/doi/10.1145/3742420")
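The FactScore entry above reduces to a simple recipe: decompose a generation into atomic facts, check each against a knowledge source, and report the supported fraction. A naive sketch, with sentence splitting and substring matching standing in for the LLM decomposition and retrieval the real metric uses:

```python
# Naive sketch of the FactScore recipe: atomic facts checked against a
# knowledge source. Both helpers are deliberate simplifications.


def atomic_facts(generation: str) -> list[str]:
    # Stand-in decomposition: one "fact" per sentence.
    return [s.strip() for s in generation.split(".") if s.strip()]


def is_supported(fact: str, knowledge_source: str) -> bool:
    # Stand-in check: every content word of the fact occurs in the source.
    words = [w for w in fact.lower().split() if len(w) > 3]
    return all(w in knowledge_source.lower() for w in words)


def fact_score(generation: str, knowledge_source: str) -> float:
    facts = atomic_facts(generation)
    if not facts:
        return 0.0
    return sum(is_supported(f, knowledge_source) for f in facts) / len(facts)


wiki = "Niels Bohr was a Danish physicist. He received the Nobel Prize in 1922."
print(fact_score("Niels Bohr was a Danish physicist. He won the Nobel Prize.", wiki))
```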
@@ -83,7 +80,6 @@ ______________________________________________________________________
 
 [TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness](http://arxiv.org/abs/2402.12545 "http://arxiv.org/abs/2402.12545")
 
-
 ## Why are LLMs not factual?
 
 - LLMs do not know what they do not know, sometimes overestimate their capacities and
@@ -98,9 +94,9 @@ know the answer.)
 
 - Studies assessing language models’ factuality or evaluating whether the methods are
 effective to mitigate model hallucinations use different datasets and metrics.
-- This makes it difficult to compare, in the same conditions, the factuality of
-different models as well as to compare the effectiveness of different factuality
-enhancement approaches.
+  - This makes it difficult to compare, in the same conditions, the factuality of
+    different models as well as to compare the effectiveness of different factuality
+    enhancement approaches.
 
 ## Research goals

config/hallucination_detection.yaml

Lines changed: 24 additions & 6 deletions
@@ -3,15 +3,14 @@ defaults:
   - _self_
 
 base_dataset:
-  id: alexandrainst/multi-wiki-qa:da
+  id: multi-wiki-qa
+  organisation: alexandrainst
   split: train
   context_key: context
   question_key: question
   answer_key: answers
   squad_format: true
 
-model: gpt-4.1-mini
-temperature: 1.0
 
 beta_distribution:
   mean: 0.2
@@ -30,9 +29,28 @@ training:
   epochs: 5
   learning_rate: 1e-5
   weight_decay: 0.01
-  language: da
   push_to_hub: True
+  max_length: 8192
 
 models:
-  target_model_name: ettin-encoder-17m-multi-wiki-qa-da
-  pretrained_model_name: jhu-clsp/ettin-encoder-17m
+  hallu_detect_model: mmBERT-small
+  pretrained_model: jhu-clsp/mmBERT-small
+  eval_model: Qwen/Qwen3-0.6B
+  hallu_gen_model: gpt-4.1-mini
+
+language: da
+
+selfcheckgpt:
+  num_samples: 10
+  sampling_temperature: 1.0
+  reference_temperature: 0.0
+  reference_do_sample: false
+  prompt_model: gpt-4o-mini
+  output_dir: data/final/selfcheckgpt
+  max_retries: 3
+  request_timeout: null
+  context_char_limit: null
+
+generation:
+  max_examples: 1000
+  max_new_tokens: 32768
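For orientation, the new `selfcheckgpt` and `generation` blocks would be read through Hydra roughly as sketched below; the entry-point function and the prints are illustrative, not this repo's actual scripts:

```python
# Minimal sketch of consuming this config with Hydra. The entry point is
# hypothetical; only the config fields themselves come from the diff above.
import hydra
from omegaconf import DictConfig


@hydra.main(
    config_path="config", config_name="hallucination_detection", version_base=None
)
def main(cfg: DictConfig) -> None:
    # Draw `num_samples` stochastic answers at `sampling_temperature`, plus a
    # deterministic reference answer (temperature 0.0, sampling disabled).
    print(cfg.selfcheckgpt.num_samples)  # 10
    print(cfg.selfcheckgpt.sampling_temperature)  # 1.0
    print(cfg.selfcheckgpt.reference_do_sample)  # False
    print(cfg.models.eval_model)  # Qwen/Qwen3-0.6B
    print(cfg.generation.max_new_tokens)  # 32768


if __name__ == "__main__":
    main()
```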

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -13,12 +13,14 @@ maintainers = [
 ]
 requires-python = ">=3.11,<4.0"
 dependencies = [
+    "accelerate>=1.10.1",
     "datasets>=4.0.0",
     "hydra-core>=1.3.2",
     "lettucedetect>=0.1.8",
     "nltk>=3.9.1",
     "protobuf>=6.32.1",
     "python-dotenv>=1.0.1",
+    "openai>=1.4.0",
     "tiktoken>=0.11.0",
 ]

src/factuality_eval/dataset_generation.py

Lines changed: 38 additions & 14 deletions
@@ -22,6 +22,7 @@ def load_qa_data(
     answer_key: str,
     squad_format: bool,
     testing: bool,
+    max_examples: int = -1,
 ) -> tuple[list[list[str]], list[str], list[str]]:
     """Load the base dataset.
 
@@ -40,14 +41,22 @@
             Whether the answers are in SQuAD format.
         testing:
             If True, only load a small subset of the data for testing purposes.
+        max_examples:
+            Maximum number of data samples. If -1, it will use all samples.
 
     Returns:
         A tuple of (contexts, questions, answers).
     """
     logger.info(f"Loading base dataset {base_dataset_id!r}...")
     dataset_id = base_dataset_id.split(":")[0]
     subset = base_dataset_id.split(":")[1] if ":" in base_dataset_id else None
-    ds = load_dataset(path=dataset_id, name=subset, split=split)
+
+    ds = load_dataset(path=dataset_id, name=subset)
+
+    if len(ds.keys()) > 1:  # Dataset is already split
+        ds = ds[split]
+    else:
+        ds = ds[split].train_test_split(test_size=0.2, seed=42)[split]
 
     logger.info("Preparing dataset...")
     contexts: list[list[str]] = [[ctx] for ctx in ds[context_key]]
@@ -64,6 +73,11 @@
         contexts = contexts[:10]
         questions = questions[:10]
         answers = answers[:10]
+    elif max_examples != -1:
+        logger.info(f"Truncating dataset to {max_examples} examples...")
+        contexts = contexts[:max_examples]
+        questions = questions[:max_examples]
+        answers = answers[:max_examples]
 
     return contexts, questions, answers
 
@@ -113,8 +127,8 @@
     answers: list[str],
     intensities: list[float],
     model: str,
-    temperature: float,
     output_jsonl_path: Path | None,
+    temperature: float | None = None,
 ) -> Dataset:
     """Generate hallucinations from given QA data.
 
@@ -129,11 +143,12 @@
             A list of hallucination intensities for each QA pair.
         model:
             The model name to use for hallucination generation.
-        temperature:
-            The temperature to use for the model during generation.
         output_jsonl_path:
             The path to save the generated dataset in JSONL format, or None to skip
             saving.
+        temperature:
+            The temperature to use for the model during generation. If None, the
+            default temperature is used. Defaults to None.
 
     Returns:
         A Dataset containing both original and hallucinated QA pairs.
@@ -166,9 +181,12 @@
             )
         except Exception as e:
            logger.error(f"Error during generation: {e}. Skipping...")
-            continue
 
-        hallucinated_labels = get_hallucinated_labels(result)
+        hallucinated_labels = get_hallucinated_labels(hallucinated_dict=result)
+
+        # Skip samples where labels cannot be reliably determined
+        if hallucinated_labels is None:
+            continue
 
         # Save the record
         record = dict(
@@ -237,25 +255,31 @@ def generate_hash(context: list[str], question: str, answer: str) -> str:
     return hashlib.md5((context[0] + question + answer).encode("utf-8")).hexdigest()
 
 
-def get_hallucinated_labels(hallucinated_dict: dict) -> list[dict]:
+def get_hallucinated_labels(hallucinated_dict: dict) -> list[dict] | None:
     """Get the hallucinated labels from the generation result.
 
     Args:
         hallucinated_dict:
             The dictionary from the hallucination generator.
 
     Returns:
-        A list of dictionaries with start, end, and label for each hallucinated part.
+        A list of dictionaries with start, end, and label for each hallucinated part,
+        or None if the labels cannot be reliably determined.
     """
     hallucinated_labels = []
     for part in hallucinated_dict["hallucinated_parts"]:
-        if hallucinated_dict["hallucinated_answer"].count(part) > 1:
-            raise ValueError(
-                f"The part {part!r} appears multiple times in the hallucinated answer "
-                f"{hallucinated_dict['hallucinated_answer']!r}, so could not correctly "
-                "mark the spans."
+        answer = hallucinated_dict["hallucinated_answer"]
+        count = answer.count(part)
+
+        if count > 1:
+            # Cannot reliably label - discard this sample
+            logger.warning(
+                f"Discarding sample - hallucinated part {part!r} appears {count} times "
+                f"in answer, cannot determine which occurrence is hallucinated."
             )
-        start = hallucinated_dict["hallucinated_answer"].find(part)
+            return None
+
+        start = answer.find(part)
         if start != -1:
             hallucinated_labels.append(
                 {"start": start, "end": start + len(part), "label": "hallucinated"}
src/factuality_eval/hallucination_detection.py

Lines changed: 64 additions & 11 deletions
@@ -1,10 +1,13 @@
 """Detection of hallucinations in a dataset."""
 
+import logging
 from collections import defaultdict
 
 from datasets import Dataset
 from lettucedetect.models.inference import HallucinationDetector
 
+logger = logging.getLogger(__name__)
+
 
 def detect_hallucinations(
     dataset: Dataset, model: str = "KRLabsOrg/tinylettuce-ettin-17m-en"
@@ -18,26 +21,76 @@
     Returns:
         A dictionary with the predicted answers and ground truth hallucinated parts.
     """
-    detector = HallucinationDetector(method="transformer", model_path=model)
+    detector = HallucinationDetector(
+        method="transformer", model_path=model, device_map="auto", torch_dtype="auto"
+    )
 
     predict_answers = []
     all_hallucinated_parts = []
-    for context, question, answer, hallucinated_parts in zip(
-        dataset["context"],
-        dataset["question"],
-        dataset["answer"],
-        dataset["hallucinated_parts"],
+    for context, question, answer in zip(
+        dataset["context"], dataset["question"], dataset["answer"]
     ):
         # Use the detector to predict if the answer is hallucinated
-        predict_answer = detector.predict(
-            context=context, question=question, answer=answer
-        )
-
+        try:
+            predict_answer = detector.predict(
+                context=context, question=question, answer=answer
+            )
+        except Exception as e:
+            logger.error(f"Error during hallucination detection: {e}. Skipping...")
+            continue
         predict_answers.append(predict_answer)
-        all_hallucinated_parts.append(hallucinated_parts)
+
+    if "hallucinated_parts" in dataset.column_names:
+        for hallucinated_part in dataset["hallucinated_parts"]:
+            all_hallucinated_parts.append(hallucinated_part)
 
     data_dict: dict[str, list] = defaultdict(list)
     data_dict["predict_answers"] = predict_answers
     data_dict["ground_truth"] = all_hallucinated_parts
 
     return data_dict
+
+
+def evaluate_predicted_answers(hallucinations: dict) -> None:
+    """Evaluate the predicted answers for hallucinations.
+
+    Args:
+        hallucinations:
+            A dictionary with the predicted answers and ground truth hallucinated parts.
+    """
+    logger.info("Evaluating model answers for hallucinations...")
+
+    no_hallucination_in_answers = []
+    no_tokens_in_answers = []
+
+    hallucinated_tokens = 0
+    total_tokens = 0
+    for predict_answer in hallucinations["predict_answers"]:
+        no_hallucination_in_answer = 0
+        no_tokens_in_answer = 0
+        for tokens in predict_answer:
+            hallucinated_tokens += tokens["pred"]
+            total_tokens += 1
+
+            no_hallucination_in_answer += tokens["pred"]
+            no_tokens_in_answer += 1
+        no_hallucination_in_answers.append(no_hallucination_in_answer)
+        no_tokens_in_answers.append(no_tokens_in_answer)
+
+    hallucination_rate = hallucinated_tokens / total_tokens
+
+    answers_with_hallucinations = sum([1 for x in no_hallucination_in_answers if x > 0])
+
+    rate_with_hallucinations = answers_with_hallucinations / len(
+        no_hallucination_in_answers
+    )
+    logger.info("Results ________________________________________")
+    logger.info(
+        f"Hallucination rate (hallucinated_tokens/total_tokens) : "
+        f"{hallucination_rate:.2f}"
+    )
+    logger.info(
+        f"Rate of answers with at least one hallucination: "
+        f"{rate_with_hallucinations:.2f}"
+    )
+    return
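Putting the two functions together, a minimal end-to-end call might look like the sketch below; the single-row dataset and the list-of-passages context layout are assumptions carried over from `load_qa_data` above, not code from this commit:

```python
# Hypothetical end-to-end usage of the functions above on a tiny dataset.
from datasets import Dataset

from factuality_eval.hallucination_detection import (
    detect_hallucinations,
    evaluate_predicted_answers,
)

dataset = Dataset.from_dict(
    {
        # Contexts as lists of passages, matching load_qa_data's output shape.
        "context": [["Copenhagen is the capital of Denmark."]],
        "question": ["What is the capital of Denmark?"],
        "answer": ["The capital of Denmark is Aarhus."],
    }
)

# Token-level predictions from the (default) lettucedetect model; without a
# "hallucinated_parts" column the ground-truth list simply stays empty.
hallucinations = detect_hallucinations(dataset=dataset)

# Logs the token-level hallucination rate and the share of answers that
# contain at least one flagged token.
evaluate_predicted_answers(hallucinations=hallucinations)
```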
