TWEAK at NAACL 2024: Decoding Without Hallucinations

Tags: LLMs, NLP, hallucination, knowledge graphs, conference, paper, research, generative AI

TWEAK ranks decoding candidates by how well their continuations support the input facts — a decoding-only fix that improves faithfulness with minimal quality loss.

Author: synesis

Published: June 15, 2024

See how we TWEAK’ed the decoding process to reduce hallucination when verbalizing knowledge graphs, at NAACL 2024 in Mexico City!

For the runners among the conference participants, two running tours are already scheduled (a 10K and a 5K). Check Whova!

Poster (PDF): tweak-poster.pdf

This is the conference-week follow-up to the original acceptance announcement (April 2024), which also covered our LAGRANGE paper at LREC-COLING.

References

[1] Qiu, Yifu, Varun Embar, Shay Cohen, and Benjamin Han. “Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation.” 2023. Apple Machine Learning Research: https://machinelearning.apple.com/research/write-hypothesis

Abstract: Neural knowledge-to-text generation models often struggle to faithfully generate descriptions for the input facts: they may produce hallucinations that contradict the given facts, or describe facts not present in the input. To reduce hallucinations, we propose a novel decoding method, TWEAK (Think While Effectively Articulating Knowledge). TWEAK treats the generated sequences at each decoding step and its future sequences as hypotheses, and ranks each generation candidate based on how well their corresponding hypotheses support the input facts using a Hypothesis Verification Model (HVM). We first demonstrate the effectiveness of TWEAK by using a Natural Language Inference (NLI) model as the HVM and report improved faithfulness with minimal impact on the quality. We then replace the NLI model with our task-specific HVM trained with a first-of-a-kind dataset, FATE (Fact-Aligned Textual Entailment), which pairs input facts with their faithful and hallucinated descriptions with the hallucinated spans marked. The new HVM improves the faithfulness and the quality further and runs faster. Overall the best TWEAK variants improve on average 2.22/7.17 points on faithfulness measured by FactKB over WebNLG and TekGen/GenWiki, respectively, with only 0.14/0.32 points degradation on quality measured by BERTScore over the same datasets. Since TWEAK is a decoding-only approach, it can be integrated with any neural generative model without retraining.
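To make the ranking idea in the abstract concrete, here is a minimal Python sketch of one reranking step. It is not the paper's implementation. The `Candidate` class, the `toy_hvm` word-overlap scorer, the `rerank` helper, and the `alpha` weighting are all hypothetical stand-ins chosen for illustration. A real setup would plug in an NLI model or a trained HVM and would take candidates and look-ahead continuations from the generator's beam.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    text: str        # sequence generated so far
    lookahead: str   # hypothetical future continuation of that sequence
    log_prob: float  # generation score assigned by the base model


def toy_hvm(facts: List[str], hypothesis: str) -> float:
    """Hypothetical stand-in for an HVM: the fraction of hypothesis
    tokens that also appear in the input facts (0.0 to 1.0)."""
    fact_tokens = {tok.lower() for fact in facts for tok in fact.split()}
    hyp_tokens = [tok.lower().strip(".,") for tok in hypothesis.split()]
    if not hyp_tokens:
        return 0.0
    return sum(tok in fact_tokens for tok in hyp_tokens) / len(hyp_tokens)


def rerank(
    facts: List[str],
    candidates: List[Candidate],
    hvm: Callable[[List[str], str], float] = toy_hvm,
    alpha: float = 0.5,
) -> List[Candidate]:
    """Sort candidates by a weighted mix of generation score and
    faithfulness score. The linear mix and alpha are assumptions;
    the two scores are on different scales, so this is only a toy."""
    def score(c: Candidate) -> float:
        hypothesis = f"{c.text} {c.lookahead}".strip()
        return alpha * c.log_prob + (1 - alpha) * hvm(facts, hypothesis)

    return sorted(candidates, key=score, reverse=True)


# Illustrative input: one fact and two candidates with the same prefix.
facts = ["Alan Turing birthPlace London"]
candidates = [
    Candidate("Alan Turing was born in", "Paris.", log_prob=-0.2),
    Candidate("Alan Turing was born in", "London.", log_prob=-0.3),
]
ranked = rerank(facts, candidates)
print(ranked[0].lookahead)
```

In this toy run the base model slightly prefers the hallucinated continuation, but the faithfulness term moves the fact-supported candidate to the top. Setting `alpha=1.0` turns the verification signal off and falls back to ranking by generation score alone.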

Originally posted on LinkedIn.