
Commit fed3912

Feat/selfcheckgpt (#4)
* Update lock, toml and config
* Prompt utilities per language
* Save hallucinated labels
* Train hallucination detector
* Catch if text appears multiple times in hallucinated answers
* Update languages to literal type
* Apply suggestions from code review (minor changes)
* Minor changes based on review
* Changes from review
* Detect hallucinations in model to evaluate
* Minor edits
* Clean up yaml
* Clean up yaml
* Bugfix
* Temperature as kwargs
* Implementation of selfcheckgpt
* Bugfix
* Bugfix
* Implementation of selfcheckgpt
* cleanup
* selfcheckgpt update
* Clean up and temperature changes
* Bugfix
* Remove junk from Qwen outputs
* selfcheckgpt update
* Change yaml settings, and minor cleanups
* Increase max output tokens for selfcheckgpt
* Add support for OpenAI models
* Clean up content in answers
* OpenAI support for SelfCheckGPT
* Prompt utils for selfcheckgpt
* Selfcheckgpt cleanup
* Selfcheckgpt simplified
* Code-check fix
* Resolving copilot code review
* Resolving copilot code review
* Fix for test
* Fix for mypy
* Mainly mypy checks
* Fix code-check
* Fix code-check
* Need to add package as dependency for mypy
* Relative import for mypy
* English as default may solve the mypy issue
* Try suppress the mypy error
* Exclude train.py from pre-commit mypy check
* Explicit literal for mypy
* Add all languages for EuroEval
* Add QA prompt for each language, used for formatting
* Remove forced use_safetensors
* Implement review comments on mainly formatting and docstrings
* New logic for clearing model outputs from special tokens
* Str formatting bug
* Update src/factuality_eval/hallucination_detection.py
* Implement review
* Update src/factuality_eval/model_generation.py
* Update src/factuality_eval/model_generation.py

Co-authored-by: Dan Saattrup Smart <47701536+saattrupdan@users.noreply.github.com>
1 parent 04ec722 · commit fed3912

40 files changed · 1,117 additions and 122 deletions

.pre-commit-config.yaml

Lines changed: 3 additions & 3 deletions
@@ -10,7 +10,7 @@ repos:
       - id: trailing-whitespace
       - id: debug-statements
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.13.0
+    rev: v0.14.13
     hooks:
       - id: ruff-check
         args:
@@ -27,11 +27,11 @@ repos:
         - pyi
         - jupyter
   - repo: https://github.com/kynan/nbstripout
-    rev: 0.8.1
+    rev: 0.9.0
     hooks:
       - id: nbstripout
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.18.1
+    rev: v1.19.1
     hooks:
       - id: mypy
         args:

README.md

Lines changed: 10 additions & 14 deletions
@@ -1,32 +1,30 @@
 # Factuality Evaluation of LLMs
 
 ______________________________________________________________________
-[![Code Coverage](https://img.shields.io/badge/Coverage-91%25-green.svg)](https://github.com/alexandrainst/factuality_eval/tree/main/tests)
+[![Code Coverage](https://img.shields.io/badge/Coverage-48%25-orange.svg)](https://github.com/alexandrainst/factuality_eval/tree/main/tests)
 [![Documentation](https://img.shields.io/badge/docs-passing-green)](https://alexandrainst.github.io/factuality_eval)
 [![License](https://img.shields.io/github/license/alexandrainst/factuality_eval)](https://github.com/alexandrainst/factuality_eval/blob/main/LICENSE)
 [![LastCommit](https://img.shields.io/github/last-commit/alexandrainst/factuality_eval)](https://github.com/alexandrainst/factuality_eval/commits/main)
 [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/alexandrainst/factuality_eval/blob/main/CODE_OF_CONDUCT.md)
 
-
 ## Literature Review
 
 ### Evaluation Tools
 
 | Paper title | Authors | Affiliation | Published | Code | Summary | Comments | Languages | Tool |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs | Iqbal, H., Wang, Y., Wang, M., Georgiev, G., Geng, J., Gurevych, I., & Nakov, P. | MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) | 2024-08 | https://github.com/mbzuai-nlp/openfactcheck | OpenFactCheck has 3 modules: \n\n- RESPONSEEVAL: customize fact-checking system and assess the factuality of all claims in an input document\n- LLMEVAL: assess overall factuality of an LLM\n- CHECKEREVAL: evaluate automatic fact-checking systems | They created two datasets: [FactQA](https://raw.githubusercontent.com/hasaniqbal777/OpenFactCheck/main/src/openfactcheck/templates/llm/questions.csv) (6480 questions) and [FactBench](https://raw.githubusercontent.com/hasaniqbal777/OpenFactCheck/main/src/openfactcheck/templates/factchecker/claims.jsonl) (4507 claims). | English, Urdu | OpenFactCheck |
-| Loki: An Open-Source Tool for Fact Verification | Li, H., Han, X., Wang, H., Wang, Y., Wang, M., Xing, R., ... & Baldwin | LibrAI, MBZUAI, Monash University, The University of Melbourne | 2024-10 | https://github.com/Libr-AI/OpenFactVerification | | https://loki.librai.tech/ | Multilingual | Loki |
+| OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs | Iqbal, H., Wang, Y., Wang, M., Georgiev, G., Geng, J., Gurevych, I., & Nakov, P. | MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) | 2024-08 | <https://github.com/mbzuai-nlp/openfactcheck> | OpenFactCheck has 3 modules: \n\n- RESPONSEEVAL: customize fact-checking system and assess the factuality of all claims in an input document\n- LLMEVAL: assess overall factuality of an LLM\n- CHECKEREVAL: evaluate automatic fact-checking systems | They created two datasets: [FactQA](https://raw.githubusercontent.com/hasaniqbal777/OpenFactCheck/main/src/openfactcheck/templates/llm/questions.csv) (6480 questions) and [FactBench](https://raw.githubusercontent.com/hasaniqbal777/OpenFactCheck/main/src/openfactcheck/templates/factchecker/claims.jsonl) (4507 claims). | English, Urdu | OpenFactCheck |
+| Loki: An Open-Source Tool for Fact Verification | Li, H., Han, X., Wang, H., Wang, Y., Wang, M., Xing, R., ... & Baldwin | LibrAI, MBZUAI, Monash University, The University of Melbourne | 2024-10 | <https://github.com/Libr-AI/OpenFactVerification> | | <https://loki.librai.tech/> | Multilingual | Loki |
 | | | | | | | | | FactScore |
-| | | | | https://www.comet.com/site/blog/selfcheckgpt-for-llm-evaluation/ | | A blackbox hallucination detection method that relies solely on stochastic sampling of model responses. The core intuition of their method is that factually accurate responses are typically consistent and frequent, whereas hallucinated outputs tend to vary and contradict each other. | | SelfCheckGPT |
+| | | | | <https://www.comet.com/site/blog/selfcheckgpt-for-llm-evaluation/> | | A blackbox hallucination detection method that relies solely on stochastic sampling of model responses. The core intuition of their method is that factually accurate responses are typically consistent and frequent, whereas hallucinated outputs tend to vary and contradict each other. | | SelfCheckGPT |
 | Long-form factuality in large language models | | | | | | | | LongForm SAFE |
 | | | | | | | Not open-source | | Perplexity fact checker |
 | Hallucination to Truth: A Review of Fact-Checking and Factuality\n\nEvaluation in Large Language Models | Rahman, S. S., Islam, M. A., Alam, M. M., Zeba, M., Rahman, M. A., Chowa, S. S., ... & Azam, S. | United International University (Bangladesh), Daffodil International University (Bangladesh), Charles Darwin University (Australia) | 2025-08 | | | | | |
 | FACTTEST: FACTUALITY TESTING IN LARGE LANGUAGE MODELS WITH FINITE-SAMPLE AND DISTRIBUTION-FREE GUARANTEES | Fan Nie1 Xiaotian Hou2 Shuhang Lin2 James Zou1 Huaxiu Yao3 Linjun Zhang | Stanford University, 2Rutgers University, 3UNC-Chapel Hill | 2024-11 | | Used to "finetune" models to not answer if the answer is likely to be false. | | | |
 | Seq vs Seq: An Open Suite of Paired Encoders and Decoders | | | | | TinyLettuce is used to have a dataset consisting of hallunications and correct responses.\n\n*"**The Problem**: Training robust hallucination detection models requires large datasets of both correct and hallucinated responses. Manually creating such datasets is expensive and time-consuming.*\n\n***Our Solution****: LettuceDetect's synthetic data generation pipeline can generate realistic hallucinations from factual content."* | | | |
-| Hallucination Risk Calculator & Prompt Re‑engineering Toolkit (OpenAI‑only) | | | | https://hassana.io/readme.html | Calculate the risk of hallucination based on a prompt.\n\nBasically just entropy calculation?\n\nProblem is, which prompts should we supply? | | | |
+| Hallucination Risk Calculator & Prompt Re‑engineering Toolkit (OpenAI‑only) | | | | <https://hassana.io/readme.html> | Calculate the risk of hallucination based on a prompt.\n\nBasically just entropy calculation?\n\nProblem is, which prompts should we supply? | | | |
 | (Im)possibility of Automated Hallucination Detection in\n\nLarge Language Models | | | | | Not possible if trained only on correct samples (duh) | | | |
-| HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | | | | https://github.com/RUCAIBox/HaluEval | Many citations | | | |
-
+| HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | | | | <https://github.com/RUCAIBox/HaluEval> | Many citations | | | |
 
 ### Evaluation Benchmarks and Datasets
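The SelfCheckGPT row above is the method this commit wires in: sample the model several times at non-zero temperature and flag statements the samples fail to corroborate, since consistent answers tend to be factual while hallucinations drift. A minimal sketch of that consistency idea, using plain token overlap as a stand-in for the NLI- or LLM-based scorer (the helper below is illustrative, not this repo's API):

```python
# Toy sketch of the SelfCheckGPT intuition: a claim corroborated by many
# independently sampled answers is likely factual; one the samples contradict
# or omit is likely hallucinated. Token overlap stands in for a real
# NLI- or LLM-based consistency scorer.


def consistency_score(sentence: str, samples: list[str]) -> float:
    """Mean lexical overlap between `sentence` and each sampled answer.

    Low scores suggest the sentence is unsupported, i.e. a possible
    hallucination.
    """
    tokens = set(sentence.lower().split())
    if not tokens or not samples:
        return 0.0
    overlaps = [
        len(tokens & set(sample.lower().split())) / len(tokens) for sample in samples
    ]
    return sum(overlaps) / len(overlaps)


# Stochastic samples (temperature > 0) from the same model and prompt.
samples = [
    "Copenhagen is the capital of Denmark.",
    "Denmark's capital city is Copenhagen.",
    "The capital of Denmark is Copenhagen.",
]
print(consistency_score("Copenhagen is the capital of Denmark.", samples))  # high
print(consistency_score("Aarhus is the capital of Denmark.", samples))  # lower
```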

@@ -53,10 +51,9 @@ ______________________________________________________________________
 | | | | | | | SimpleQA |
 | | | | | | Possibly not public/open. | PersonQA |
 | TRUSTSCORE: REFERENCE-FREE EVALUATION OF LLM RESPONSE TRUSTWORTHINESS | Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan | University of Edinburgh,\n\nHuawei Edinburgh Research Centre | | | | TrustScore |
-| Know What You Don't Know: Unanswerable Questions for SQuAD | | | 2018-11 | https://rajpurkar.github.io/SQuAD-explorer/ | Many | SQuAD |
+| Know What You Don't Know: Unanswerable Questions for SQuAD | | | 2018-11 | <https://rajpurkar.github.io/SQuAD-explorer/> | Many | SQuAD |
 | | | | | | is **an automatic evaluation metric for factual precision in long-form text generation**. It uses large language models and retrieval to break down generations into atomic facts and then measure the correctness with respect to a knowledge source (like Wikipedia). | FactScore |
 
-
 ### Papers from Dan
 
 [Survey on Factuality in Large Language Models](https://dl.acm.org/doi/10.1145/3742420 "https://dl.acm.org/doi/10.1145/3742420")
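The FactScore entry above reduces to a simple recipe: decompose a generation into atomic facts, check each against a knowledge source, and report the supported fraction. A naive sketch, with sentence splitting and substring matching standing in for the LLM decomposition and retrieval the real metric uses:

```python
# Naive sketch of the FactScore recipe: atomic facts checked against a
# knowledge source. Both helpers are deliberate simplifications.


def atomic_facts(generation: str) -> list[str]:
    # Stand-in decomposition: one "fact" per sentence.
    return [s.strip() for s in generation.split(".") if s.strip()]


def is_supported(fact: str, knowledge_source: str) -> bool:
    # Stand-in check: every content word of the fact occurs in the source.
    words = [w for w in fact.lower().split() if len(w) > 3]
    return all(w in knowledge_source.lower() for w in words)


def fact_score(generation: str, knowledge_source: str) -> float:
    facts = atomic_facts(generation)
    if not facts:
        return 0.0
    return sum(is_supported(f, knowledge_source) for f in facts) / len(facts)


wiki = "Niels Bohr was a Danish physicist. He received the Nobel Prize in 1922."
print(fact_score("Niels Bohr was a Danish physicist. He won the Nobel Prize.", wiki))
```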
@@ -83,7 +80,6 @@ ______________________________________________________________________
 
 [TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness](http://arxiv.org/abs/2402.12545 "http://arxiv.org/abs/2402.12545")
 
-
 ## Why are LLMs not factual?
 
 - LLMs do not know what they do not know, sometimes overestimate their capacities and
@@ -98,9 +94,9 @@ know the answer.)
 
 - Studies assessing language models’ factuality or evaluating whether the methods are
 effective to mitigate model hallucinations use different datasets and metrics.
-- This makes it difficult to compare, in the same conditions, the factuality of
-different models as well as to compare the effectiveness of different factuality
-enhancement approaches.
+  - This makes it difficult to compare, in the same conditions, the factuality of
+    different models as well as to compare the effectiveness of different factuality
+    enhancement approaches.
 
 ## Research goals

config/hallucination_detection.yaml

Lines changed: 24 additions & 6 deletions
@@ -3,15 +3,14 @@ defaults:
   - _self_
 
 base_dataset:
-  id: alexandrainst/multi-wiki-qa:da
+  id: multi-wiki-qa
+  organisation: alexandrainst
   split: train
   context_key: context
   question_key: question
   answer_key: answers
   squad_format: true
 
-model: gpt-4.1-mini
-temperature: 1.0
 
 beta_distribution:
   mean: 0.2
@@ -30,9 +29,28 @@ training:
   epochs: 5
   learning_rate: 1e-5
   weight_decay: 0.01
-  language: da
   push_to_hub: True
+  max_length: 8192
 
 models:
-  target_model_name: ettin-encoder-17m-multi-wiki-qa-da
-  pretrained_model_name: jhu-clsp/ettin-encoder-17m
+  hallu_detect_model: mmBERT-small
+  pretrained_model: jhu-clsp/mmBERT-small
+  eval_model: Qwen/Qwen3-0.6B
+  hallu_gen_model: gpt-4.1-mini
+
+language: da
+
+selfcheckgpt:
+  num_samples: 10
+  sampling_temperature: 1.0
+  reference_temperature: 0.0
+  reference_do_sample: false
+  prompt_model: gpt-4o-mini
+  output_dir: data/final/selfcheckgpt
+  max_retries: 3
+  request_timeout: null
+  context_char_limit: null
+
+generation:
+  max_examples: 1000
+  max_new_tokens: 32768
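For orientation, the new `selfcheckgpt` and `generation` blocks would be read through Hydra roughly as sketched below; the entry-point function and the prints are illustrative, not this repo's actual scripts:

```python
# Minimal sketch of consuming this config with Hydra. The entry point is
# hypothetical; only the config fields themselves come from the diff above.
import hydra
from omegaconf import DictConfig


@hydra.main(
    config_path="config", config_name="hallucination_detection", version_base=None
)
def main(cfg: DictConfig) -> None:
    # Draw `num_samples` stochastic answers at `sampling_temperature`, plus a
    # deterministic reference answer (temperature 0.0, sampling disabled).
    print(cfg.selfcheckgpt.num_samples)  # 10
    print(cfg.selfcheckgpt.sampling_temperature)  # 1.0
    print(cfg.selfcheckgpt.reference_do_sample)  # False
    print(cfg.models.eval_model)  # Qwen/Qwen3-0.6B
    print(cfg.generation.max_new_tokens)  # 32768


if __name__ == "__main__":
    main()
```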

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -13,12 +13,14 @@ maintainers = [
 ]
 requires-python = ">=3.11,<4.0"
 dependencies = [
+    "accelerate>=1.10.1",
     "datasets>=4.0.0",
     "hydra-core>=1.3.2",
     "lettucedetect>=0.1.8",
     "nltk>=3.9.1",
     "protobuf>=6.32.1",
     "python-dotenv>=1.0.1",
+    "openai>=1.4.0",
     "tiktoken>=0.11.0",
 ]

src/factuality_eval/dataset_generation.py

Lines changed: 38 additions & 14 deletions
@@ -22,6 +22,7 @@ def load_qa_data(
     answer_key: str,
     squad_format: bool,
     testing: bool,
+    max_examples: int = -1,
 ) -> tuple[list[list[str]], list[str], list[str]]:
     """Load the base dataset.
 
@@ -40,14 +41,22 @@
             Whether the answers are in SQuAD format.
         testing:
             If True, only load a small subset of the data for testing purposes.
+        max_examples:
+            Maximum number of data samples. If -1, it will use all samples.
 
     Returns:
         A tuple of (contexts, questions, answers).
     """
     logger.info(f"Loading base dataset {base_dataset_id!r}...")
     dataset_id = base_dataset_id.split(":")[0]
     subset = base_dataset_id.split(":")[1] if ":" in base_dataset_id else None
-    ds = load_dataset(path=dataset_id, name=subset, split=split)
+
+    ds = load_dataset(path=dataset_id, name=subset)
+
+    if len(ds.keys()) > 1:  # Dataset is already split
+        ds = ds[split]
+    else:
+        ds = ds[split].train_test_split(test_size=0.2, seed=42)[split]
 
     logger.info("Preparing dataset...")
     contexts: list[list[str]] = [[ctx] for ctx in ds[context_key]]
@@ -64,6 +73,11 @@
         contexts = contexts[:10]
         questions = questions[:10]
         answers = answers[:10]
+    elif max_examples != -1:
+        logger.info(f"Truncating dataset to {max_examples} examples...")
+        contexts = contexts[:max_examples]
+        questions = questions[:max_examples]
+        answers = answers[:max_examples]
 
     return contexts, questions, answers
 
@@ -113,8 +127,8 @@
     answers: list[str],
     intensities: list[float],
     model: str,
-    temperature: float,
     output_jsonl_path: Path | None,
+    temperature: float | None = None,
 ) -> Dataset:
     """Generate hallucinations from given QA data.
 
@@ -129,11 +143,12 @@
             A list of hallucination intensities for each QA pair.
         model:
             The model name to use for hallucination generation.
-        temperature:
-            The temperature to use for the model during generation.
         output_jsonl_path:
             The path to save the generated dataset in JSONL format, or None to skip
             saving.
+        temperature:
+            The temperature to use for the model during generation. If None, the
+            default temperature is used. Defaults to None.
 
     Returns:
         A Dataset containing both original and hallucinated QA pairs.
@@ -166,9 +181,12 @@
             )
         except Exception as e:
            logger.error(f"Error during generation: {e}. Skipping...")
-            continue
 
-        hallucinated_labels = get_hallucinated_labels(result)
+        hallucinated_labels = get_hallucinated_labels(hallucinated_dict=result)
+
+        # Skip samples where labels cannot be reliably determined
+        if hallucinated_labels is None:
+            continue
 
         # Save the record
         record = dict(
@@ -237,25 +255,31 @@ def generate_hash(context: list[str], question: str, answer: str) -> str:
     return hashlib.md5((context[0] + question + answer).encode("utf-8")).hexdigest()
 
 
-def get_hallucinated_labels(hallucinated_dict: dict) -> list[dict]:
+def get_hallucinated_labels(hallucinated_dict: dict) -> list[dict] | None:
     """Get the hallucinated labels from the generation result.
 
     Args:
         hallucinated_dict:
             The dictionary from the hallucination generator.
 
     Returns:
-        A list of dictionaries with start, end, and label for each hallucinated part.
+        A list of dictionaries with start, end, and label for each hallucinated part,
+        or None if the labels cannot be reliably determined.
     """
     hallucinated_labels = []
     for part in hallucinated_dict["hallucinated_parts"]:
-        if hallucinated_dict["hallucinated_answer"].count(part) > 1:
-            raise ValueError(
-                f"The part {part!r} appears multiple times in the hallucinated answer "
-                f"{hallucinated_dict['hallucinated_answer']!r}, so could not correctly "
-                "mark the spans."
+        answer = hallucinated_dict["hallucinated_answer"]
+        count = answer.count(part)
+
+        if count > 1:
+            # Cannot reliably label - discard this sample
+            logger.warning(
+                f"Discarding sample - hallucinated part {part!r} appears {count} times "
+                f"in answer, cannot determine which occurrence is hallucinated."
             )
-        start = hallucinated_dict["hallucinated_answer"].find(part)
+            return None
+
+        start = answer.find(part)
         if start != -1:
             hallucinated_labels.append(
                 {"start": start, "end": start + len(part), "label": "hallucinated"}
src/factuality_eval/hallucination_detection.py

Lines changed: 64 additions & 11 deletions
@@ -1,10 +1,13 @@
 """Detection of hallucinations in a dataset."""
 
+import logging
 from collections import defaultdict
 
 from datasets import Dataset
 from lettucedetect.models.inference import HallucinationDetector
 
+logger = logging.getLogger(__name__)
+
 
 def detect_hallucinations(
     dataset: Dataset, model: str = "KRLabsOrg/tinylettuce-ettin-17m-en"
@@ -18,26 +21,76 @@
     Returns:
         A dictionary with the predicted answers and ground truth hallucinated parts.
     """
-    detector = HallucinationDetector(method="transformer", model_path=model)
+    detector = HallucinationDetector(
+        method="transformer", model_path=model, device_map="auto", torch_dtype="auto"
+    )
 
     predict_answers = []
     all_hallucinated_parts = []
-    for context, question, answer, hallucinated_parts in zip(
-        dataset["context"],
-        dataset["question"],
-        dataset["answer"],
-        dataset["hallucinated_parts"],
+    for context, question, answer in zip(
+        dataset["context"], dataset["question"], dataset["answer"]
     ):
         # Use the detector to predict if the answer is hallucinated
-        predict_answer = detector.predict(
-            context=context, question=question, answer=answer
-        )
-
+        try:
+            predict_answer = detector.predict(
+                context=context, question=question, answer=answer
+            )
+        except Exception as e:
+            logger.error(f"Error during hallucination detection: {e}. Skipping...")
+            continue
         predict_answers.append(predict_answer)
-        all_hallucinated_parts.append(hallucinated_parts)
+
+    if "hallucinated_parts" in dataset.column_names:
+        for hallucinated_part in dataset["hallucinated_parts"]:
+            all_hallucinated_parts.append(hallucinated_part)
 
     data_dict: dict[str, list] = defaultdict(list)
     data_dict["predict_answers"] = predict_answers
     data_dict["ground_truth"] = all_hallucinated_parts
 
     return data_dict
+
+
+def evaluate_predicted_answers(hallucinations: dict) -> None:
+    """Evaluate the predicted answers for hallucinations.
+
+    Args:
+        hallucinations:
+            A dictionary with the predicted answers and ground truth hallucinated parts.
+    """
+    logger.info("Evaluating model answers for hallucinations...")
+
+    no_hallucination_in_answers = []
+    no_tokens_in_answers = []
+
+    hallucinated_tokens = 0
+    total_tokens = 0
+    for predict_answer in hallucinations["predict_answers"]:
+        no_hallucination_in_answer = 0
+        no_tokens_in_answer = 0
+        for tokens in predict_answer:
+            hallucinated_tokens += tokens["pred"]
+            total_tokens += 1
+
+            no_hallucination_in_answer += tokens["pred"]
+            no_tokens_in_answer += 1
+        no_hallucination_in_answers.append(no_hallucination_in_answer)
+        no_tokens_in_answers.append(no_tokens_in_answer)
+
+    hallucination_rate = hallucinated_tokens / total_tokens
+
+    answers_with_hallucinations = sum([1 for x in no_hallucination_in_answers if x > 0])
+
+    rate_with_hallucinations = answers_with_hallucinations / len(
+        no_hallucination_in_answers
+    )
+    logger.info("Results ________________________________________")
+    logger.info(
+        f"Hallucination rate (hallucinated_tokens/total_tokens) : "
+        f"{hallucination_rate:.2f}"
+    )
+    logger.info(
+        f"Rate of answers with at least one hallucination: "
+        f"{rate_with_hallucinations:.2f}"
+    )
+    return
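Putting the two functions together, a minimal end-to-end call might look like the sketch below; the single-row dataset and the list-of-passages context layout are assumptions carried over from `load_qa_data` above, not code from this commit:

```python
# Hypothetical end-to-end usage of the functions above on a tiny dataset.
from datasets import Dataset

from factuality_eval.hallucination_detection import (
    detect_hallucinations,
    evaluate_predicted_answers,
)

dataset = Dataset.from_dict(
    {
        # Contexts as lists of passages, matching load_qa_data's output shape.
        "context": [["Copenhagen is the capital of Denmark."]],
        "question": ["What is the capital of Denmark?"],
        "answer": ["The capital of Denmark is Aarhus."],
    }
)

# Token-level predictions from the (default) lettucedetect model; without a
# "hallucinated_parts" column the ground-truth list simply stays empty.
hallucinations = detect_hallucinations(dataset=dataset)

# Logs the token-level hallucination rate and the share of answers that
# contain at least one flagged token.
evaluate_predicted_answers(hallucinations=hallucinations)
```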
