* Update lock, toml and config
* Prompt utilities per language
* Save hallucinated labels
* Train hallucination detector
* Catch if text appears multiple times in hallucinated answers
Co-authored-by: Dan Saattrup Smart <47701536+saattrupdan@users.noreply.github.com>
* Update languages to literal type
Co-authored-by: Dan Saattrup Smart <47701536+saattrupdan@users.noreply.github.com>
* Apply suggestions from code review (minor changes)
Co-authored-by: Dan Saattrup Smart <47701536+saattrupdan@users.noreply.github.com>
* Minor changes based on review
* Changes from review
* Detect hallucinations in model to evaluate
* Minor edits
* Clean up yaml
* Clean up yaml
* Bugfix
* Temperature as kwargs
* Implementation of SelfCheckGPT
* Bugfix
* Bugfix
* Implementation of SelfCheckGPT
* Cleanup
* SelfCheckGPT update
* Clean up and temperature changes
* Bugfix
* Remove junk from Qwen outputs
* SelfCheckGPT update
* Change yaml settings, and minor cleanups
* Increase max output tokens for SelfCheckGPT
* Add support for OpenAI models
* Clean up content in answers
* OpenAI support for SelfCheckGPT
* Prompt utils for SelfCheckGPT
* SelfCheckGPT cleanup
* SelfCheckGPT simplified
* Code-check fix
* Resolving copilot code review
* Resolving copilot code review
* Fix for test
* Fix for mypy
* Mainly mypy checks
* Fix code-check
* Fix code-check
* Need to add package as dependency for mypy
* Relative import for mypy
* English as default may solve the mypy issue
* Try suppress the mypy error
* Exclude train.py from pre-commit mypy check
* Explicit literal for mypy
* Add all languages for EuroEval
* Add QA prompt for each language, used for formatting
* Remove forced use_safetensors
* Implement review comments on mainly formatting and docstrings
* New logic for clearing model outputs from special tokens
* Str formatting bug
* Update src/factuality_eval/hallucination_detection.py
Co-authored-by: Dan Saattrup Smart <47701536+saattrupdan@users.noreply.github.com>
* Implement review
* Update src/factuality_eval/model_generation.py
Co-authored-by: Dan Saattrup Smart <47701536+saattrupdan@users.noreply.github.com>
* Update src/factuality_eval/model_generation.py
Co-authored-by: Dan Saattrup Smart <47701536+saattrupdan@users.noreply.github.com>
---------
Co-authored-by: Dan Saattrup Smart <47701536+saattrupdan@users.noreply.github.com>
| OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs | Iqbal, H., Wang, Y., Wang, M., Georgiev, G., Geng, J., Gurevych, I., & Nakov, P. | MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) | 2024-08 |<https://github.com/mbzuai-nlp/openfactcheck>| OpenFactCheck has three modules:<br>- RESPONSEEVAL: customize a fact-checking system and assess the factuality of all claims in an input document<br>- LLMEVAL: assess the overall factuality of an LLM<br>- CHECKEREVAL: evaluate automatic fact-checking systems | They created two datasets: [FactQA](https://raw.githubusercontent.com/hasaniqbal777/OpenFactCheck/main/src/openfactcheck/templates/llm/questions.csv) (6,480 questions) and [FactBench](https://raw.githubusercontent.com/hasaniqbal777/OpenFactCheck/main/src/openfactcheck/templates/factchecker/claims.jsonl) (4,507 claims). | English, Urdu | OpenFactCheck |
| Loki: An Open-Source Tool for Fact Verification | Li, H., Han, X., Wang, H., Wang, Y., Wang, M., Xing, R., ... & Baldwin | LibrAI, MBZUAI, Monash University, The University of Melbourne | 2024-10 |<https://github.com/Libr-AI/OpenFactVerification>||<https://loki.librai.tech/>| Multilingual | Loki |
||||||||| FactScore |
|||||<https://www.comet.com/site/blog/selfcheckgpt-for-llm-evaluation/>|| A black-box hallucination detection method that relies solely on stochastic sampling of model responses. The core intuition is that factually accurate responses tend to be consistent across samples, whereas hallucinated outputs vary and contradict each other. || SelfCheckGPT |
| Long-form factuality in large language models |||||||| LongForm SAFE |
||||||| Not open-source || Perplexity fact checker |
| Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models | Rahman, S. S., Islam, M. A., Alam, M. M., Zeba, M., Rahman, M. A., Chowa, S. S., ... & Azam, S. | United International University (Bangladesh), Daffodil International University (Bangladesh), Charles Darwin University (Australia) | 2025-08 ||||||
| FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | Fan Nie, Xiaotian Hou, Shuhang Lin, James Zou, Huaxiu Yao, Linjun Zhang | Stanford University, Rutgers University, UNC-Chapel Hill | 2024-11 || Used to "fine-tune" models so they decline to answer when the answer is likely to be false. ||||
| Seq vs Seq: An Open Suite of Paired Encoders and Decoders ||||| TinyLettuce is used to build a dataset consisting of hallucinations and correct responses.<br>*"**The Problem**: Training robust hallucination detection models requires large datasets of both correct and hallucinated responses. Manually creating such datasets is expensive and time-consuming.*<br>***Our Solution****: LettuceDetect's synthetic data generation pipeline can generate realistic hallucinations from factual content."*||||
| Hallucination Risk Calculator & Prompt Re‑engineering Toolkit (OpenAI‑only) ||||<https://hassana.io/readme.html>| Calculates the risk of hallucination based on a prompt.<br>Basically just entropy calculation?<br>Problem is, which prompts should we supply? ||||
| (Im)possibility of Automated Hallucination Detection in Large Language Models ||||| Not possible if trained only on correct samples (duh) ||||
| HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models ||||<https://github.com/RUCAIBox/HaluEval>| Many citations ||||
| TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness | Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan | University of Edinburgh, Huawei Edinburgh Research Centre |||| TrustScore |
| Know What You Don't Know: Unanswerable Questions for SQuAD ||| 2018-11 |<https://rajpurkar.github.io/SQuAD-explorer/>| Many | SQuAD |
|||||| FactScore is **an automatic evaluation metric for factual precision in long-form text generation**. It uses large language models and retrieval to break generations down into atomic facts, then measures their correctness with respect to a knowledge source (such as Wikipedia). | FactScore |
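The FactScore recipe in the table, decompose a generation into atomic facts and verify each one against a knowledge source, reduces to a precision computation once a verifier exists. A minimal sketch in Python, where `is_supported` is a hypothetical stand-in for the real retrieval-plus-LLM judgment step:

```python
def factscore(atomic_facts: list[str], is_supported) -> float:
    """FactScore-style precision: the fraction of atomic facts that the
    knowledge source supports. `is_supported` is a placeholder for the
    retrieval + LLM verification step of the actual pipeline."""
    if not atomic_facts:
        return 0.0
    return sum(1 for fact in atomic_facts if is_supported(fact)) / len(atomic_facts)
```

In practice the expensive parts are the fact decomposition (an LLM call) and the verifier; the aggregation itself is just this average.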
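The SelfCheckGPT entry describes the method only in prose: sample the model several times and flag sentences the samples fail to corroborate. A hedged sketch using unigram overlap as the consistency proxy (the paper also offers BERTScore, NLI, and question-answering variants; the function names here are illustrative, not the library's API):

```python
def support(sentence: str, sample: str) -> float:
    """Fraction of the sentence's tokens that also occur in one sampled answer."""
    tokens = sentence.lower().split()
    sample_tokens = set(sample.lower().split())
    if not tokens:
        return 0.0
    return sum(t in sample_tokens for t in tokens) / len(tokens)

def selfcheck_scores(sentences: list[str], samples: list[str]) -> list[float]:
    """Per-sentence hallucination score in [0, 1]: one minus the mean support
    across stochastic samples. Consistent (likely factual) sentences score
    near 0; sentences the samples contradict or omit score near 1."""
    return [1.0 - sum(support(s, smp) for smp in samples) / len(samples)
            for s in sentences]
```

Because the method only needs repeated sampling of the same model, it works with any black-box API, which is why it fits OpenAI-backed setups as well as local models.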
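The "basically just entropy calculation?" note on the hallucination-risk toolkit can be made concrete. A sketch of predictive entropy over repeated samples of short answers, assuming exact-match binning (real implementations typically normalize or cluster semantically equivalent answers first):

```python
import math
from collections import Counter

def predictive_entropy(sampled_answers: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of answers
    drawn by sampling the model repeatedly on one prompt. Zero means the
    model always answers the same way; higher values signal the unstable,
    self-contradicting behavior associated with hallucination."""
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

This is also why the choice of probe prompts matters, as the table note asks: the entropy is a property of one prompt's answer distribution, so the risk estimate is only as representative as the prompt set.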
### Papers from Dan
[Survey on Factuality in Large Language Models](https://dl.acm.org/doi/10.1145/3742420)