Summary: Bias in Fake News Detection of LLMs (arxiv.org)
7,563 words - PDF document
One Line
Fake news detectors often misclassify content generated by Large Language Models (LLMs) as fake, but detection accuracy can be improved through adversarial training with LLM-paraphrased genuine news.
Key Points
- Fake news detectors are biased against texts generated by Large Language Models (LLMs).
- Existing detectors are more likely to flag LLM-generated content as fake news while misclassifying human-written fake news as genuine.
- The bias is due to distinct linguistic patterns inherent to LLM outputs.
- Adversarial training with LLM-paraphrased genuine news can mitigate this bias and improve detection accuracy.
- Researchers have released two comprehensive datasets, GossipCop++ and PolitiFact++, for further research in developing and evaluating fake news detectors.
Summaries
20 word summary
Fake news detectors are biased against LLM-generated content, misclassifying it as fake. Adversarial training with LLM-paraphrased genuine news improves detection accuracy.
62 word summary
A study found that fake news detectors are biased against texts generated by Large Language Models (LLMs), flagging LLM-generated content as fake while misclassifying human-written fake news as genuine. The researchers proposed adversarial training with LLM-paraphrased genuine news to mitigate this bias, improving detection accuracy. They also released two datasets to support further research, and highlighted the challenge fake news poses and the role LLMs play in it.
155 word summary
Fake news detectors have been found to be biased against texts generated by Large Language Models (LLMs), according to a study conducted by Jinyan Su, Terry Yue Zhuo, Jonibek Mansurov, Di Wang, and Preslav Nakov. The detectors were more likely to flag LLM-generated content as fake news while misclassifying human-written fake news as genuine. To address this bias, the researchers proposed adversarial training with LLM-paraphrased genuine news, which improved detection accuracy for both human-written and LLM-generated news. They released two comprehensive datasets, GossipCop++ and PolitiFact++, to facilitate further research. The study emphasized the challenge posed by fake news and the capacity of LLMs to generate believable content at scale. It introduced a new evaluation setting that includes both human-written and LLM-generated fake news, and its analysis revealed that detectors tend to label LLM-generated text as fake regardless of its veracity, a bias the proposed adversarial-training technique mitigates.
472 word summary
Fake news detectors have been found to be biased against texts generated by Large Language Models (LLMs), according to a study conducted by Jinyan Su, Terry Yue Zhuo, Jonibek Mansurov, Di Wang, and Preslav Nakov. The study aimed to evaluate the performance of fake news detectors in scenarios involving both human-written and LLM-generated misinformation. The findings revealed a significant bias in many existing detectors, as they were more likely to flag LLM-generated content as fake news while misclassifying human-written fake news as genuine. This bias appeared to be due to distinct linguistic patterns inherent to LLM outputs.
To address this bias, the researchers proposed a mitigation strategy that leverages adversarial training with LLM-paraphrased genuine news. This approach improved the detection accuracy for both human and LLM-generated news. In order to facilitate further research in this domain, the researchers released two comprehensive datasets, GossipCop++ and PolitiFact++, which contain human-validated articles along with LLM-generated fake and real news.
The study highlighted the critical challenge of fake news, which undermines trust and poses threats to society. The emergence of LLMs has intensified these concerns, as they have the capability to generate believable fake content at an unprecedented scale. Adversaries are increasingly using LLMs to automate fake news curation, resulting in a surge in the amount of fake news. The researchers emphasized the need to study how LLMs affect fake news detection, particularly the detection of LLM-generated fake news.
The researchers introduced a new and realistic setting for evaluating fake news detectors, where detectors must identify both human-written and LLM-generated fake news. This reflects real-world situations more accurately, considering the increasing usage of LLMs in disseminating disinformation. Testing detectors against human and LLM-generated content allows for the assessment of their resilience and effectiveness in an evolving fake news landscape.
The analysis of various fake news detectors revealed a bias towards classifying LLM-generated text as fake, even when it was truthful. The detectors performed better at detecting LLM-generated fake news than human-written fake news, contrary to previous concerns about the difficulty of identifying LLM-generated fake news. When the researchers paraphrased human-written real news using ChatGPT, the detectors performed much worse on the LLM-paraphrased versions than on the human-written originals. This bias against LLM-generated text led to LLM-generated real news being misclassified as fake.
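The four-way setting the study describes (human vs. LLM source, real vs. fake label) can be probed by scoring a detector separately on each subset. The sketch below is purely illustrative: `per_subset_accuracy`, `toy_detector`, and the examples are invented stand-ins rather than the study's models or data, and the "delve" heuristic merely mimics the kind of stylistic shortcut the authors describe.

```python
# Hypothetical sketch: score a detector on each (source, label) subset
# to expose the bias pattern the study reports.
from collections import defaultdict

def per_subset_accuracy(examples, predict):
    """examples: list of (text, source, label); predict: text -> label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, source, label in examples:
        key = (source, label)              # e.g. ("human", "real")
        total[key] += 1
        correct[key] += int(predict(text) == label)
    return {k: correct[k] / total[k] for k in total}

# Toy detector that treats LLM-style wording as a proxy for fakeness,
# mimicking the stylistic shortcut described in the study.
def toy_detector(text):
    return "fake" if "delve" in text else "real"

examples = [
    ("plain report of an event", "human", "real"),
    ("fabricated plain story", "human", "fake"),
    ("we delve into the event", "llm", "real"),
    ("we delve into a fabrication", "llm", "fake"),
]
print(per_subset_accuracy(examples, toy_detector))
```

Under this toy shortcut, accuracy is perfect on human-written real news and LLM-generated fake news, but zero on human-written fake news and LLM-paraphrased real news, which is exactly the bias pattern the study reports.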
To understand the source of this bias, the researchers investigated whether detectors were taking 'shortcuts', relying on the stylistic cues of LLM text rather than its veracity. They analyzed content-based features of news articles and proposed a debiasing technique based on adversarial training with LLM-paraphrased real news. This strategy effectively reduced the bias and improved the performance of fake news detectors on both human-written and LLM-generated content.
In conclusion, the study revealed a significant bias in fake news detectors towards LLM-generated content. The researchers proposed a mitigation strategy that improved detection accuracy and released two comprehensive datasets for further research in this domain.
506 word summary
Fake news detectors are biased against texts generated by Large Language Models (LLMs), according to a study conducted by Jinyan Su, Terry Yue Zhuo, Jonibek Mansurov, Di Wang, and Preslav Nakov. The study aims to evaluate the performance of fake news detectors in scenarios involving both human-written and LLM-generated misinformation. The findings reveal a significant bias in many existing detectors, as they are more likely to flag LLM-generated content as fake news while misclassifying human-written fake news as genuine. This bias appears to be due to distinct linguistic patterns inherent to LLM outputs.
To address this bias, the researchers propose a mitigation strategy that leverages adversarial training with LLM-paraphrased genuine news. This approach improves the detection accuracy for both human and LLM-generated news. To facilitate further research in this domain, the researchers release two comprehensive datasets, GossipCop++ and PolitiFact++, which contain human-validated articles along with LLM-generated fake and real news.
The study begins by highlighting the critical challenge of fake news, which undermines trust and poses threats to society. The emergence of LLMs has intensified these concerns, as they have the capability to generate believable fake content at an unprecedented scale. Adversaries are increasingly using LLMs to automate fake news curation, resulting in a surge in the amount of fake news. The researchers emphasize the need to study how LLMs affect fake news detection, particularly the detection of LLM-generated fake news.
The researchers introduce a new and realistic setting for evaluating fake news detectors, where detectors must identify both human-written and LLM-generated fake news. This reflects real-world situations more accurately, considering the increasing usage of LLMs in disseminating disinformation. Testing detectors against human and LLM-generated content allows for the assessment of their resilience and effectiveness in an evolving fake news landscape.
The analysis of various fake news detectors reveals a bias towards classifying LLM-generated text as fake, even when it is truthful. The detectors perform better at detecting LLM-generated fake news than human-written fake news, contrary to previous concerns about the difficulty of identifying LLM-generated fake news. When the researchers paraphrase human-written real news using ChatGPT, the detectors perform much worse on the LLM-paraphrased versions than on the human-written originals. This bias against LLM-generated text leads to LLM-generated real news being misclassified as fake.
To understand the source of this bias, the researchers investigate whether detectors take 'shortcuts', relying on the stylistic cues of LLM text rather than its veracity. They analyze content-based features of news articles and propose a debiasing technique based on adversarial training with LLM-paraphrased real news. This strategy effectively reduces the bias and improves the performance of fake news detectors on both human-written and LLM-generated content.
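The data-augmentation side of this idea can be sketched as follows, with a placeholder `paraphrase_with_llm` standing in for a real LLM call: genuine news is paraphrased and added back to the training set still labeled "real", so that LLM style stops correlating with the "fake" label. This is a minimal illustration of the general technique, not the authors' implementation.

```python
# Hypothetical sketch of debiasing via LLM-paraphrased real news.
def paraphrase_with_llm(text):
    # Stand-in paraphraser; a real system would call an LLM API here.
    return "paraphrased: " + text

def augment_with_paraphrased_real(train_set):
    """train_set: list of (text, label). Returns an augmented copy in which
    every genuine article also appears in LLM-paraphrased form, still
    labeled "real", so LLM style no longer predicts fakeness."""
    augmented = list(train_set)
    for text, label in train_set:
        if label == "real":                # only paraphrase genuine news
            augmented.append((paraphrase_with_llm(text), "real"))
    return augmented

train = [("true story", "real"), ("made-up story", "fake")]
aug = augment_with_paraphrased_real(train)
# aug now pairs LLM-styled text with the "real" label; a detector
# fine-tuned on it cannot use LLM style as a shortcut for "fake".
```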
The researchers provide two new datasets, GossipCop++ and PolitiFact++, which contain human-validated articles along with LLM-generated fake and real news. These datasets serve as benchmarks and valuable resources for further research into developing and evaluating fake news detectors.
In conclusion, the study reveals a significant bias in fake news detectors towards LLM-generated content. The researchers propose a mitigation strategy that improves detection accuracy and release two comprehensive datasets for further research in this domain.