Summary: Large Language Models and Causal Inference (arxiv.org)
8,292 words - PDF document
One Line
This paper introduces a new dataset, Corr2Cause, to test large language models' ability to infer causation from correlation, evaluates their performance on it, and highlights their limited causal inference skills, along with the process used to generate the dataset.
Key Points
- Large language models have limited causal inference skills and perform poorly on the Corr2Cause task.
- A new dataset of over 400K samples is proposed to test causal reasoning abilities, and 17 LLMs are evaluated on it.
- Directed graphical causal models (DGCMs) are used to represent causal relationships among variables.
- RoBERTa-Large MNLI is the best-performing model for causal inference, but identifying the V-structure remains challenging (a sketch of V-structure detection follows this list).
- The authors suggest future work to enhance LLMs' skills with out-of-distribution perturbations and connect the benchmark to real-world false beliefs.
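To make the graph vocabulary in these key points concrete, here is a minimal sketch (not taken from the paper's released code; the function name is illustrative) of how a DGCM can be encoded as a DAG with networkx and how V-structures can be detected:

```python
# Minimal sketch (not the paper's code): a DGCM represented as a DAG with
# networkx, plus detection of V-structures, i.e. colliders X -> Z <- Y where
# X and Y are not adjacent. The function name is illustrative.
import itertools
import networkx as nx

def find_v_structures(dag):
    """Return all (x, z, y) triples forming a collider x -> z <- y."""
    v_structures = []
    for z in dag.nodes:
        for x, y in itertools.combinations(sorted(dag.predecessors(z)), 2):
            # A V-structure also requires x and y to be non-adjacent.
            if not dag.has_edge(x, y) and not dag.has_edge(y, x):
                v_structures.append((x, z, y))
    return v_structures

# Example DGCM: A -> C <- B. A and B are marginally independent but become
# dependent once we condition on the collider C.
g = nx.DiGraph([("A", "C"), ("B", "C")])
print(find_v_structures(g))  # [('A', 'C', 'B')]
```

Conditioning on the collider C induces a dependence between A and B, which is why identifying V-structures requires reasoning about both independence and conditional dependence.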
Summaries
144 word summary
This article discusses robustness tests for large language models, used to check for spurious correlations in the data. A new dataset, Corr2Cause, is introduced to test large language models' (LLMs) ability to infer causation from correlation. The paper presents a method for verbalizing causal relations between variables using natural language and evaluates the performance of various large language models on a dataset designed for causal inference. Large language models have limited causal inference skills, as shown by their poor performance on the Corr2Cause task. The authors propose a new dataset of 400K samples to test causal reasoning abilities and discuss the use of directed graphical causal models (DGCMs) to represent causal relationships among variables. The document presents a dataset generation process for large language models and causal inference, which involves constructing causal graphs using isomorphism checks, generating unique DAGs, and identifying MECs.
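The "isomorphism checks, unique DAGs, MECs" step can be pictured with a small brute-force sketch; this reflects the general idea rather than the paper's actual implementation, and the exhaustive enumeration only scales to very small graphs:

```python
# Brute-force sketch of the "unique DAGs via isomorphism checks" step; an
# assumption about the general idea, not the paper's implementation.
import itertools
import networkx as nx

def all_dags(n):
    """Yield every labeled DAG on nodes 0..n-1."""
    nodes = list(range(n))
    candidate_edges = [(u, v) for u in nodes for v in nodes if u != v]
    for r in range(len(candidate_edges) + 1):
        for edges in itertools.combinations(candidate_edges, r):
            g = nx.DiGraph(edges)
            g.add_nodes_from(nodes)
            if nx.is_directed_acyclic_graph(g):
                yield g

def unique_up_to_isomorphism(graphs):
    """Keep one representative DAG per isomorphism class."""
    reps = []
    for g in graphs:
        if not any(nx.is_isomorphic(g, h) for h in reps):
            reps.append(g)
    return reps

print(len(unique_up_to_isomorphism(all_dags(3))))  # 6 non-isomorphic DAGs on 3 nodes
```

For the up-to-6-node graphs mentioned in the summaries, a smarter enumeration than this exhaustive edge-subset loop would be required.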
494 word summary
Large language models (LLMs) have limited causal inference skills, as shown by their poor performance on a novel task called Corr2Cause. The authors propose a new dataset of over 400K samples to test causal reasoning abilities and argue that the ability to perform Corr2Cause inference is a must-have skill for LLMs. The document discusses the use of directed graphical causal models (DGCMs) to represent causal relationships among variables and evaluates the performance of seventeen LLMs on the dataset. The authors suggest future work to explore ways to enhance this skill in LLMs with out-of-distribution perturbations. Furthermore, they explore whether LLMs can learn the skill through finetuning, but find that they still perform close to the random baseline.

The document presents a dataset generation process for large language models and causal inference, which involves constructing causal graphs using isomorphism checks, generating unique DAGs, and identifying MECs. The focus is on smaller graphs with up to 6 nodes, and the dataset statistics are provided in Table 1. The paper presents a method for verbalizing causal relations between variables using natural language. It focuses on six common causal relations and determines whether variables are independent or correlated based on d-separation sets. The article discusses experiments performed on various large language models to test their performance on the dataset. The best-performing model is RoBERTa-Large MNLI. The study also identifies the V-structure as the most challenging causal relationship to identify. The authors suggest that future studies should use out-of-distribution data as a test set to benchmark LLMs' performance in causal inference. The study proposes two robustness tests to determine whether the models have learned causal inference skills, with RoBERTa-Large MNLI again being the best-performing model.

A new dataset, Corr2Cause, is introduced to test LLMs' ability to infer causation from correlation. The authors evaluate an extensive list of LLMs on this new task and show that off-the-shelf LLMs perform poorly. They recommend using this dataset to benchmark the causal inference skills of LLMs and welcome future work connecting the idea of this benchmark to more real-world false beliefs based on confusing correlation with causation. The document also references various papers and conferences related to language models, causal inference, and natural language processing, as well as different language models and their pretraining approaches. Finally, the authors provide details on how the GPT-based models are finetuned on the data. The article also discusses robustness tests for large language models, including paraphrasing and variable refactorization, to check for spurious correlations in the data. The authors use verbalization templates to form hypotheses for six causal relations and report the point-wise mutual information between the label and n-grams of no more than four tokens. The authors train the models until convergence, using a batch size of 8 and tuning the learning rate on the validation set. They use the finetuning API for non-BERT models and the transformers library for BERT-based models.
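As one way to picture the d-separation step described above, the following sketch derives marginal (in)dependence statements from a DAG and verbalizes them as a correlation premise; the sentence wording is illustrative rather than the paper's exact templates, and it assumes a networkx version that still exposes d_separated:

```python
# Hedged sketch: derive marginal (in)dependence statements from a causal DAG
# via d-separation and verbalize them as a correlation premise. The wording
# is illustrative, not the paper's templates.
import itertools
import networkx as nx

def correlation_premise(dag):
    """Verbalize pairwise marginal (in)dependence implied by the DAG."""
    sentences = []
    for a, b in itertools.combinations(sorted(dag.nodes), 2):
        # d-separation by the empty set <=> marginal independence
        # (under the Markov and faithfulness assumptions).
        if nx.d_separated(dag, {a}, {b}, set()):
            sentences.append(f"{a} is independent of {b}.")
        else:
            sentences.append(f"{a} correlates with {b}.")
    return " ".join(sentences)

# Collider A -> C <- B: A and B are marginally independent, both correlate with C.
g = nx.DiGraph([("A", "C"), ("B", "C")])
print(correlation_premise(g))
```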
1515 word summary
The article discusses robustness tests for large language models, including paraphrasing and variable refactorization, to check for spurious correlations in the data. The authors use verbalization templates to compose hypotheses for six causal relations and report the point-wise mutual information between the label and n-grams of no more than four tokens. The authors train the models until convergence, using a batch size of 8 and tuning the learning rate on the validation set. They use the finetuning API for non-BERT models and the transformers library for BERT-based models. This document discusses the use of large language models for causal inference, with a focus on GPT-based models, and the authors provide details on how these models are finetuned on the data. Various papers and resources related to natural language processing, causal discovery, and machine learning are cited throughout the document.

The references include papers and conferences related to language models, causal inference, and natural language processing, among them studies on commonsense reasoning about social interactions, counterfactual story reasoning and generation, and modeling semantic containment and exclusion in natural language inference. The document also mentions different language models, such as BERT, RoBERTa, and DistilBERT, and their pretraining approaches, and includes references to books on causal inference and to work on practical graph isomorphism. It also highlights the GPT-4 technical report and a framework for adversarial attacks, data augmentation, and adversarial training in NLP. Further cited works include BART, a denoising sequence-to-sequence model for natural language generation, translation, and comprehension; work on causal reasoning with language models and on logical fallacy detection; nonlinear causal discovery with additive noise models; DeBERTa, a decoding-enhanced bidirectional transformer with disentangled attention; the PASCAL Recognising Textual Entailment challenge and the accompanying evaluation of predictive uncertainty and visual object classification; and GPT-3 ("Language models are few-shot learners").

This paper discusses the limited reasoning abilities of current large language models (LLMs) and the difficulty of separating actual reasoning from training-corpus-derived knowledge. The authors introduce a new task, Corr2Cause, to infer causation from correlation, and collect a large-scale dataset of more than 400K samples. They evaluate an extensive list of LLMs on this new task and show that off-the-shelf LLMs perform poorly. The authors recommend using this dataset to benchmark the pure causal inference skills of LLMs that have not seen this dataset and welcome future work to connect the idea of this benchmark to more real-world false beliefs based on confusing correlation with causation. The paper discusses the development of the new Corr2Cause dataset, which tests the ability of LLMs to infer causal relationships between variables. This task is distinct from other inference tasks, such as natural language inference (NLI), as it focuses solely on causal inference skills. The paper also identifies limitations of the current work and directions for future research. 
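The PMI diagnostic mentioned above (point-wise mutual information between the label and n-grams of at most four tokens) can be sketched as follows; tokenization, the counting scheme, and the absence of smoothing are assumptions, not details taken from the paper:

```python
# Sketch of the PMI(label, n-gram) diagnostic. Whitespace tokenization,
# counting by n-gram presence per example, and no smoothing are assumptions.
import math
from collections import Counter

def ngrams(tokens, max_n=4):
    """All n-grams of length 1..max_n over a token list."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def pmi_by_label(examples, max_n=4):
    """examples: iterable of (text, label). Returns {(label, ngram): PMI}."""
    joint, ngram_counts, label_counts, total = Counter(), Counter(), Counter(), 0
    for text, label in examples:
        grams = set(ngrams(text.split(), max_n))  # presence, not frequency
        total += len(grams)
        label_counts[label] += len(grams)
        for g in grams:
            joint[(label, g)] += 1
            ngram_counts[g] += 1
    # PMI(label, g) = log p(label, g) / (p(label) * p(g))
    return {(label, g): math.log(joint[(label, g)] * total /
                                 (label_counts[label] * ngram_counts[g]))
            for (label, g) in joint}

# Toy usage: an n-gram with high PMI for one label would hint at a spurious cue.
toy = [("A is independent of B", 1), ("A correlates with B", 0)]
print(sorted(pmi_by_label(toy).items(), key=lambda kv: -kv[1])[:3])
```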
The authors provide a fine-grained analysis of the best-performing model, RoBERTa-Large MNLI, on the Corr2Cause dataset. Additionally, they suggest accompanying adversarial attacks with i.i.d. testing to improve the generalizability of finetuned models. The study focuses on analyzing the performance of large language models (LLMs) in causal inference, specifically in identifying causal relationships between variables. The authors propose two robustness tests to determine whether the models have learned causal inference skills. The first test involves paraphrasing the hypothesis, while the second involves variable refactorization. The results show that the models are relatively robust, with F1 scores over 70% for most classes, except for Is-Ancestor and Is-Descendant. The best-performing model is RoBERTa-Large MNLI, which is especially sensitive to paraphrasing but maintains a high F1 score of 67.87 under variable refactorization.
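A minimal sketch of the variable-refactorization perturbation is shown below; the helper name and the replacement name pool (X1, X2, X3) are hypothetical, but the idea matches the test described here: rename the abstract variables consistently in both premise and hypothesis and check whether the finetuned model's predictions survive.

```python
# Minimal sketch of the variable-refactorization perturbation. The helper
# name and the replacement name pool are hypothetical.
import re

def refactor_variables(text, mapping):
    """Replace each variable name with its new name, matching whole tokens."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

mapping = {"A": "X1", "B": "X2", "C": "X3"}
premise = "A correlates with C. B correlates with C. A is independent of B."
hypothesis = "A directly causes C."
print(refactor_variables(premise, mapping))    # X1 correlates with X3. ...
print(refactor_variables(hypothesis, mapping)) # X1 directly causes X3.
```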
The study also identifies the V-structure as the most challenging causal relationship to identify, since it requires identifying both unconditional independence and collider relations. The model performs well in judging relations such as Is-Parent, Is-Descendant, and Has-Confounder, with F1 scores over 96%. The authors suggest future studies should use out-of-distribution data as a test set to benchmark LLMs' performance in causal inference. The study adopts the common setup of text adversarial attacks to test the models' robustness and paraphrases the template for each causal relation into semantically equivalent alternatives. Finally, the study analyzes the performance of finetuned models on the original test set and on test sets perturbed by paraphrasing and variable refactorization.

The document discusses the performance of various language models on the causal inference task. The models tested include BERT-Base, GPT-3 Davinci, GPT-3 Curie, GPT-3 Babbage, GPT-3 Ada, and RoBERTa-Large MNLI. The best-performing model was RoBERTa-Large MNLI. The document also includes a fine-grained analysis of the models' performance by causal relation type. Overall, pure causal inference is a challenging task for language models, with most models performing worse than random guessing. The document also mentions more efficient models, such as LLaMA and Alpaca.

The article discusses experiments performed on various large language models (LLMs) to test their performance on the causal inference dataset. The LLMs evaluated include GPT-4, GPT-3.5, and various BERT-based models. The dataset used in the experiments is called Corr2Cause and contains hypotheses with varying numbers of nodes and causal relations. The statistics of the dataset are provided in Table 3. The experiments involve testing the LLMs on the dataset and comparing their performance, with results presented in Table 1. The article also includes a table of hypothesis templates for each causal relation.

The paper describes a method for verbalizing causal relations between variables using natural language. The method involves identifying statistical correlations and determining whether variables are independent or correlated based on d-separation sets. Six common causal relations are considered, and hypotheses are composed and labeled based on the validity of the proposed causal relationship. The graphs are clustered into MECs using d-separation sets. The method relies on a faithfulness assumption and uses a graph-theoretic algorithm to check for chain, fork, and collider structures.

The document presents a dataset generation process for large language models and causal inference. The process involves constructing causal graphs using isomorphism checks, generating unique DAGs, and identifying MECs. The dataset is grounded in concepts of causal inference and includes a task formulation that maps correlation statements and causal hypotheses to their validity. The focus is on smaller graphs with up to 6 nodes, and the dataset statistics are provided in Table 1. The process is described in detail, including specific steps and their descriptions. The dataset construction aims to support inferring causation from correlations: it involves selecting a closed system of variables, mapping each graph to a set of statistical correlations, and using the Peter-Clark (PC) algorithm to identify causal relationships among variables. 
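To illustrate what the hypothesis templates mentioned above might look like, here is a hedged sketch; the template wording is invented for illustration, and any relation names beyond those quoted in this summary (Is-Parent, Is-Ancestor, Is-Descendant, Has-Confounder, and the collider/V-structure relation) are guesses rather than the paper's exact labels:

```python
# Illustrative hypothesis templates. The wording is invented for this sketch
# and relation names beyond those quoted in the summary are guesses.
HYPOTHESIS_TEMPLATES = {
    "Is-Parent":      "{a} directly causes {b}.",
    "Is-Ancestor":    "{a} causes something which in turn causes {b}.",
    "Is-Child":       "{a} is directly caused by {b}.",
    "Is-Descendant":  "{a} is an effect of an effect of {b}.",
    "Has-Collider":   "There exists a common effect (collider) of {a} and {b}.",
    "Has-Confounder": "There exists a common cause (confounder) of {a} and {b}.",
}

def verbalize_hypothesis(relation, a, b):
    """Fill a relation template with two variable names."""
    return HYPOTHESIS_TEMPLATES[relation].format(a=a, b=b)

print(verbalize_hypothesis("Is-Parent", "A", "C"))  # "A directly causes C."
print(verbalize_hypothesis("Has-Confounder", "A", "B"))
```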
The document explains that while there is a one-to-many mapping between causal graphs and statistical distributions, the graphs can be organized into Markov equivalence classes. The goal of the dataset construction is to provide a basis for large language models to learn causal inference. The document discusses large language models (LLMs) and causal inference, focusing on the use of directed graphical causal models (DGCMs) to represent causal relationships among variables. The Markov property and d-separation are fundamental concepts in graphical models used to determine conditional independence between variables. The document evaluates the performance of seventeen LLMs on a dataset of over 400K samples, finding that all of them perform poorly on pure causal inference. The authors suggest future work to explore ways to enhance this skill in LLMs with out-of-distribution perturbations. Furthermore, they explore whether LLMs can learn the skill through finetuning, but find that they still perform close to the random baseline.

The document proposes a new task for LLMs called Corr2Cause, which tests their ability to infer causation from correlation. The authors show that existing LLMs do not perform well on this task and propose a new dataset of 400K samples to test causal reasoning abilities. The dataset is grounded in the formal framework of causal discovery and provides rules about when it is valid or invalid to infer causation from correlation. The authors argue that the ability to perform Corr2Cause inference is a must-have skill for LLMs and a fundamental building block for deducing causal relationships. The code and data for the dataset are available online.

Causal inference is a crucial aspect of human intelligence and involves establishing causal relationships between variables or events. Large language models (LLMs) have limited causal inference skills, as shown by their poor performance on the novel Corr2Cause task, which involves determining whether a causal relationship holds given a set of correlational statements. Existing causal inference datasets in natural language processing rely on discovering causality from empirical knowledge, whereas this task tests LLMs' pure causal inference skills. Through experiments on a large-scale dataset of more than 400K samples, the study identifies shortcomings of seventeen existing LLMs in performing causal inference, even after finetuning. The study highlights the need to improve LLMs' pure reasoning skills and generalizability to guide future research in this area.
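The grouping of causal graphs into Markov equivalence classes can be sketched with the standard criterion that two DAGs are Markov equivalent iff they share the same skeleton and the same V-structures; this is a textbook criterion, not necessarily the paper's exact procedure, which the summary says works from d-separation sets:

```python
# Sketch of grouping DAGs into Markov equivalence classes (MECs) by the
# (skeleton, V-structures) criterion. A textbook criterion, not necessarily
# the paper's exact procedure.
import itertools
from collections import defaultdict
import networkx as nx

def mec_signature(dag):
    """(skeleton, V-structures) pair that characterizes the MEC of a DAG."""
    skeleton = frozenset(frozenset(edge) for edge in dag.edges)
    v_structs = set()
    for z in dag.nodes:
        for x, y in itertools.combinations(sorted(dag.predecessors(z)), 2):
            if not dag.has_edge(x, y) and not dag.has_edge(y, x):
                v_structs.add((x, z, y))  # collider x -> z <- y
    return (skeleton, frozenset(v_structs))

def group_into_mecs(dags):
    """Group DAGs that share the same MEC signature."""
    mecs = defaultdict(list)
    for dag in dags:
        mecs[mec_signature(dag)].append(dag)
    return list(mecs.values())

# The chains A->B->C and C->B->A and the fork A<-B->C share one MEC (and
# hence the same correlations); the collider A->B<-C is in its own MEC.
dags = [nx.DiGraph([("A", "B"), ("B", "C")]),
        nx.DiGraph([("C", "B"), ("B", "A")]),
        nx.DiGraph([("B", "A"), ("B", "C")]),
        nx.DiGraph([("A", "B"), ("C", "B")])]
print([len(cls) for cls in group_into_mecs(dags)])  # [3, 1]
```

Within one MEC the implied statistical correlations are identical, which is exactly why some causal hypotheses cannot be settled from correlation alone.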