Structure learning in human causal induction 
Joshua B. Tenenbaum & Thomas L. Griffiths 
Department of Psychology 
Stanford University, Stanford, CA 94305 
{jbt, gruffydd}@psych.stanford.edu
Abstract 
We use graphical models to explore the question of how people learn sim- 
ple causal relationships from data. The two leading psychological theo- 
ries can both be seen as estimating the parameters of a fixed graph. We 
argue that a complete account of causal induction should also consider 
how people learn the underlying causal graph structure, and we propose 
to model this inductive process as a Bayesian inference. Our argument is 
supported through the discussion of three data sets. 
1 Introduction 
Causality plays a central role in human mental life. Our behavior depends upon our under- 
standing of the causal structure of our environment, and we are remarkably good at infer- 
ring causation from mere observation. Constructing formal models of causal induction is 
currently a major focus of attention in computer science [7], psychology [3,6], and philos- 
ophy [5]. This paper attempts to connect these literatures, by framing the debate between 
two major psychological theories in the computational language of graphical models. We 
show that existing theories equate human causal induction with maximum likelihood pa- 
rameter estimation on a fixed graphical structure, and we argue that to fully account for hu- 
man behavioral data, we must also postulate that people make Bayesian inferences about 
the underlying causal graph structure itself. 
Psychological models of causal induction address the question of how people learn
associations between causes and effects, such as P(C→E), the probability that some event C
causes outcome E. This question might seem trivial at first; why isn't P(C→E) simply
P(e+|c+), the conditional probability that E occurs (E = e+ as opposed to e-) given that
C occurs? But consider the following scenarios. Three case studies have been done to
evaluate the probability that certain chemicals, when injected into rats, cause certain genes
to be expressed. In case 1, levels of gene 1 were measured in 100 rats injected with
chemical 1, as well as in 100 uninjected rats; cases 2 and 3 were conducted likewise but with
different chemicals and genes. In case 1, 40 out of 100 injected rats were found to have
expressed the gene, while 0 out of 100 uninjected rats expressed the gene. We will denote
these results as {40/100, 0/100}. Case 2 produced the results {7/100, 0/100}, while case
3 yielded {53/100, 46/100}. For each case, we would like to know the probability that the
chemical causes the gene to be expressed, P(C→E), where C denotes the chemical and E
denotes gene expression.
People typically rate P(C-E) highest for case 1, followed by case 2 and then case 3. In an 
experiment described below, these cases received mean ratings (on a 0-20 scale) of 14.9 +. 8, 
8.6 + .9, and 4.9 + .7, respectively. Clearly P(C-E) k p(e+lc+), because case 3 has the 
highest value of P(e+l c+) but receives the lowest rating for P(C-E). 
The two leading psychological models of causal induction elaborate upon this basis
in attempting to specify P(C→E). The ΔP model [6] claims that people estimate
P(C→E) according to

$$\Delta P \equiv P(e^+ \mid c^+) - P(e^+ \mid c^-). \qquad (1)$$

(We restrict our attention here to facilitatory causes, in which case ΔP is always between 0
and 1.) Equation 1 captures the intuition that C is perceived to cause E to the extent that
C's occurrence increases the likelihood of observing E. Recently, Cheng [3] has identified
several shortcomings of ΔP and proposed that P(C→E) instead corresponds to causal power,
the probability that C produces E in the absence of all other causes. Formally, the power
model can be expressed as:

$$\text{power} = \frac{\Delta P}{1 - P(e^+ \mid c^-)}. \qquad (2)$$
There are a variety of normative arguments in favor of either of these models [3,7].
Empirically, however, neither model is fully adequate to explain human causal induction. We
will present ample evidence for this claim below, but for now, the basic problem can be
illustrated with the three scenarios above. While people rate P(C→E) higher for case 2,
{7/100, 0/100}, than for case 3, {53/100, 46/100}, ΔP rates them equally and the power
model ranks case 3 over case 2. To understand this discrepancy, we have to distinguish
between two possible senses of P(C→E): "the probability that C causes E (on any given trial
when C is present)" versus "the probability that C is a cause of E (in general, as opposed
to being causally independent of E)". Our claim is that the ΔP and power models concern
only the former sense, while people's intuitions about P(C→E) are often concerned with
the latter. In our example, while the effect of C on any given trial in case 3 may be equal
to (according to ΔP) or stronger than (according to power) its effect in case 2, the general
pattern of results seems more likely in case 2 than in case 3 to be due to a genuine causal
influence, as opposed to a spurious correlation between random samples of two independent
variables. In the following section, we formalize this distinction in terms of parameter
estimation versus structure learning on a graphical model. Section 3 then compares two
variants of our structure learning model with the parameter estimation models (ΔP and power)
in light of data from three experiments on human causal induction.
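The disagreement between the two measures on cases 2 and 3 can be reproduced directly
from Equations 1 and 2. The following is a minimal sketch; the function names `delta_p`
and `causal_power` are labels chosen here for illustration only.

```python
def delta_p(k_c, n_c, k_nc, n_nc):
    """Equation 1: P(e+|c+) - P(e+|c-), from counts of e+ with and without the cause."""
    return k_c / n_c - k_nc / n_nc


def causal_power(k_c, n_c, k_nc, n_nc):
    """Equation 2: Delta-P divided by 1 - P(e+|c-)."""
    base_rate = k_nc / n_nc
    return delta_p(k_c, n_c, k_nc, n_nc) / (1.0 - base_rate)


# (e+ count with cause, trials with cause, e+ count without cause, trials without cause)
cases = {
    "case 1": (40, 100, 0, 100),
    "case 2": (7, 100, 0, 100),
    "case 3": (53, 100, 46, 100),
}
results = {name: (delta_p(*c), causal_power(*c)) for name, c in cases.items()}
for name, (dp, pw) in results.items():
    print(f"{name}: delta_p={dp:.3f} power={pw:.3f}")
```

Cases 2 and 3 share ΔP = 0.07, while power is 0.07 for case 2 and about 0.13 for case 3,
so neither measure reproduces the human preference for case 2 over case 3.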
2 Graphical models of causal induction 
The language of causal graphical models provides a useful framework for thinking about
people's causal intuitions [5,7]. All the induction models we consider here can be viewed
as computations on a simple directed graph (Graph 1 in Figure 1). The effect node E is the
child of two binary-valued parent nodes: C, the putative cause, and B, a constant
background. Let X = (C_1, E_1), ..., (C_N, E_N) denote a sequence of N trials in which C and
E are each observed to be present or absent; B is assumed to be present on all trials. (To
keep notation concise in this section, we use 1 or 0 in addition to + or - to denote presence
or absence of an event, e.g. c_i = 1 if the cause is present on the ith trial.) Each parent
node is associated with a parameter, w_B or w_C, that defines the strength of its effect on E.
In the ΔP model, the probability of E occurring is a linear function of C:

$$Q(e^+ \mid c; w_B, w_C) = w_B + w_C \cdot c. \qquad (3)$$

(We use Q to denote model probabilities and P for empirical probabilities in the sample X.)
In the causal power model, as first shown by Glymour [5], E is a noisy-OR gate:

$$Q(e^+ \mid c; w_B, w_C) = 1 - (1 - w_B)(1 - w_C)^c. \qquad (4)$$
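To make the difference between Equations 3 and 4 concrete, the sketch below evaluates
both parameterizations for the same pair of strengths. The function names and the
particular values of w_B and w_C are assumptions made for illustration.

```python
def q_linear(c, w_b, w_c):
    """Equation 3: probability of e+ under the linear (Delta-P) parameterization."""
    return w_b + w_c * c


def q_noisy_or(c, w_b, w_c):
    """Equation 4: probability of e+ under the noisy-OR (causal power) parameterization."""
    return 1.0 - (1.0 - w_b) * (1.0 - w_c) ** c


# Illustrative strengths (assumed values, not estimates from any experiment).
w_b, w_c = 0.46, 0.13
for c in (0, 1):
    print(c, q_linear(c, w_b, w_c), q_noisy_or(c, w_b, w_c))
```

With the cause absent (c = 0) both forms reduce to w_B. With the cause present, the
noisy-OR form gives w_B + w_C - w_B w_C, which is never larger than the linear form
w_B + w_C.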
2.1 Parameter inferences: ΔP and Causal Power
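In this framework, both the ΔP and power model's predictions for P(C→E) can be seen
as maximum likelihood estimates of the causal strength parameter w_C in Graph 1, but under
different parameterizations. For either model, the loglikelihood of the data is given by

$$\ell(X \mid w_B, w_C) = \sum_{i=1}^{N} \log \left[ Q(e^+ \mid c_i)^{e_i} \, \bigl(1 - Q(e^+ \mid c_i)\bigr)^{1 - e_i} \right] \qquad (5)$$

$$= \sum_{i=1}^{N} e_i \log Q(e^+ \mid c_i) + (1 - e_i) \log \bigl(1 - Q(e^+ \mid c_i)\bigr), \qquad (6)$$

where we have suppressed the dependence of Q(e+|c_i) on w_B, w_C. Breaking this sum into
four parts, one for each possible combination of {e+, e-} and {c+, c-} that could be
observed, ℓ(X | w_B, w_C) can be written as

$$\begin{aligned}
\ell(X \mid w_B, w_C) = {} & N P(c^+) \left[ P(e^+ \mid c^+) \log Q(e^+ \mid c^+) + \bigl(1 - P(e^+ \mid c^+)\bigr) \log \bigl(1 - Q(e^+ \mid c^+)\bigr) \right] \\
& + N P(c^-) \left[ P(e^+ \mid c^-) \log Q(e^+ \mid c^-) + \bigl(1 - P(e^+ \mid c^-)\bigr) \log \bigl(1 - Q(e^+ \mid c^-)\bigr) \right].
\end{aligned} \qquad (7)$$

By the Information inequality [4], Equation 7 is maximized whenever w_B and w_C can be
chosen to make the model probabilities equal to the empirical probabilities:

$$Q(e^+ \mid c^+; w_B, w_C) = P(e^+ \mid c^+), \qquad (8)$$

$$Q(e^+ \mid c^-; w_B, w_C) = P(e^+ \mid c^-). \qquad (9)$$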
To show that the ΔP model's predictions for P(C→E) correspond to maximum likelihood
estimates of w_C under a linear parameterization of Graph 1, we identify w_C in Equation 3
with ΔP (Equation 1), and w_B with P(e+|c-). Equation 3 then reduces to P(e+|c+) for
the case c = c+ (i.e., c = 1) and to P(e+|c-) for the case c = c- (i.e., c = 0), thus
satisfying the sufficient conditions in Equations 8-9 for w_B and w_C to be maximum likelihood
estimates. To show that the causal power model's predictions for P(C→E) correspond to
maximum likelihood estimates of w_C under a noisy-OR parameterization, we follow the
analogous procedure: identify w_C in Equation 4 with power (Equation 2), and w_B with
P(e+|c-). Then Equation 4 reduces to P(e+|c+) for c = c+ and to P(e+|c-) for c = c-,
again satisfying the conditions for w_B and w_C to be maximum likelihood estimates.
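The argument for the noisy-OR case can be checked numerically. The sketch below plugs
w_B = P(e+|c-) and w_C = power into the noisy-OR gate for the counts of case 3 from the
introduction, confirms that the model probability matches the empirical one, and compares
the resulting loglikelihood against a brute-force grid over (w_B, w_C). The helper name
`log_likelihood` and the grid resolution are assumptions made for illustration.

```python
import math


def log_likelihood(q1, q0, k1, n1, k0, n0):
    """Loglikelihood grouped by condition: q1 = Q(e+|c+), q0 = Q(e+|c-)."""
    def term(q, k, n):
        # Skip terms with a zero count so that 0 * log(0) never has to be evaluated.
        a = k * math.log(q) if k else 0.0
        b = (n - k) * math.log(1.0 - q) if n - k else 0.0
        return a + b
    return term(q1, k1, n1) + term(q0, k0, n0)


k1, n1, k0, n0 = 53, 100, 46, 100      # case 3: {53/100, 46/100}
p1, p0 = k1 / n1, k0 / n0              # empirical P(e+|c+) and P(e+|c-)

# Closed-form estimates under the noisy-OR gate: w_B = P(e+|c-), w_C = power.
w_b_hat = p0
w_c_hat = (p1 - p0) / (1.0 - p0)
q1_hat = 1.0 - (1.0 - w_b_hat) * (1.0 - w_c_hat)
best = log_likelihood(q1_hat, w_b_hat, k1, n1, k0, n0)

# Brute-force comparison over interior grid points of the unit square.
grid = [(i + 0.5) / 200 for i in range(200)]
grid_max = max(
    log_likelihood(1.0 - (1.0 - wb) * (1.0 - wc), wb, k1, n1, k0, n0)
    for wb in grid
    for wc in grid
)
print(q1_hat, best, grid_max)
```

No grid point should exceed the loglikelihood at the closed-form estimates, which is what
Equations 8-9 predict.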
2.2 Structural inferences: Causal Support and χ²
The central claim of this paper is that people's judgments of P(C→E) reflect something
other than estimates of causal strength parameters, the quantities that we have just shown
to be computed by ΔP and the power model. Rather, people's judgments may correspond
to inferences about the underlying causal structure, such as the probability that C is a direct
cause of E. In terms of the graphical model in Figure 1, human causal induction may be
focused on trying to distinguish between Graph 1, in which C is a parent of E, and the "null
hypothesis" of Graph 0, in which C is not.
This structural inference can be formalized as a Bayesian decision. Let h_C be a binary
variable indicating whether or not the link C→E exists in the true causal model responsible
for generating our observations. We will assume a noisy-OR gate, and thus our model is closely
related to causal power. However, we propose to model human estimates of P(C→E) as
causal support, the log posterior odds in favor of Graph 1 (h_C = 1) over Graph 0 (h_C = 0):

$$\text{support} = \log \frac{P(h_C = 1 \mid X)}{P(h_C = 0 \mid X)}. \qquad (10)$$

Via Bayes' rule, we can express P(h_C = 1 | X) in terms of the marginal likelihood or
evidence, P(X | h_C = 1), and the prior probability that C is a cause of E, P(h_C = 1):

$$P(h_C = 1 \mid X) = \frac{P(X \mid h_C = 1) \, P(h_C = 1)}{P(X)}. \qquad (11)$$

For now, we take P(h_C = 1) = P(h_C = 0) = 1/2. Computing the evidence requires
integrating the likelihood P(X | w_B, w_C) over all possible values of the strength parameters:

$$P(X \mid h_C = 1) = \int_0^1 \int_0^1 P(X \mid w_B, w_C) \, p(w_B, w_C \mid h_C = 1) \, dw_B \, dw_C. \qquad (12)$$
We take p(w_B, w_C | h_C = 1) to be a uniform density, and we note that P(X | w_B, w_C) is
simply the exponential of ℓ(X | w_B, w_C) as defined in Equation 5. P(X | h_C = 0), the
marginal likelihood for Graph 0, is computed similarly, but with the prior
p(w_B, w_C | h_C = 1) in Equation 12 replaced by p(w_B | h_C = 0) δ(w_C). We again take
p(w_B | h_C = 0) to be uniform. The Dirac delta distribution on w_C = 0 enforces the
restriction that the C→E link is absent. By making these assumptions, we eliminate the need
for any free numerical parameters in our probabilistic model (in contrast to a similar
Bayesian account proposed by Anderson [1]).
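One simple way to approximate the integrals in Equation 12 is a midpoint rule on a grid
over the unit square. The sketch below does this for the three cases from the
introduction. It is an illustrative approximation under the stated assumptions (uniform
priors on the strengths, equal priors on the two graphs, noisy-OR gate), not necessarily
the numerical method used to produce the reported fits; the function names and the grid
resolution are assumptions.

```python
import math


def likelihood(w_b, w_c, k1, n1, k0, n0):
    """P(X | w_B, w_C) under the noisy-OR gate (Equation 4) for a fixed trial sequence."""
    q0 = w_b                                  # Q(e+|c-)
    q1 = 1.0 - (1.0 - w_b) * (1.0 - w_c)      # Q(e+|c+)
    return q1**k1 * (1.0 - q1)**(n1 - k1) * q0**k0 * (1.0 - q0)**(n0 - k0)


def causal_support(k1, n1, k0, n0, grid=300):
    """Log evidence ratio of Graph 1 to Graph 0; equals Equation 10 when priors are equal."""
    ws = [(i + 0.5) / grid for i in range(grid)]   # cell midpoints, never exactly 0 or 1
    # Graph 1: average the likelihood over a uniform prior on (w_B, w_C).
    ev1 = sum(likelihood(wb, wc, k1, n1, k0, n0) for wb in ws for wc in ws) / grid**2
    # Graph 0: w_C is pinned at 0, so only w_B is averaged.
    ev0 = sum(likelihood(wb, 0.0, k1, n1, k0, n0) for wb in ws) / grid
    return math.log(ev1) - math.log(ev0)


cases = {
    "case 1": (40, 100, 0, 100),
    "case 2": (7, 100, 0, 100),
    "case 3": (53, 100, 46, 100),
}
supports = {name: causal_support(*c) for name, c in cases.items()}
for name, s in supports.items():
    print(f"{name}: support={s:.2f}")
```

Under these assumptions the support values order the cases as case 1, then case 2, then
case 3, which is the ordering of the human ratings reported in the introduction.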
Because causal support depends on the full likelihood functions for both Graph 1 and
Graph 0, we may expect the support model to be modulated by causal power (which is
based strictly on the maximum likelihood estimate for Graph 1), but only in interaction with
other factors that determine how much of the posterior probability mass for w_C in Graph 1
is bounded away from zero (where it is pinned in Graph 0). In general, evaluating causal
support may require fairly involved computations, but in the limit of large N and weak
causal strength w_C, it can be approximated by the familiar χ² statistic for independence,

$$\chi^2 = N \sum_{c, e} \frac{\bigl(P(c, e) - P_0(c, e)\bigr)^2}{P_0(c, e)}.$$

Here P_0(c, e) = P(c)P(e) is the factorized approximation to
P(c, e), which assumes C and E to be independent (as they are in Graph 0).
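The χ² statistic is straightforward to compute from the four cell counts. The following
sketch uses the same count convention as the contingencies above; the function name
`chi_square` and the choice to skip cells whose expected probability is zero are
assumptions made for illustration.

```python
def chi_square(k1, n1, k0, n0):
    """N * sum over cells of (P - P0)^2 / P0, with P0(c, e) = P(c) P(e)."""
    n = n1 + n0
    joint = {
        (1, 1): k1 / n, (1, 0): (n1 - k1) / n,   # cause present: e+, e-
        (0, 1): k0 / n, (0, 0): (n0 - k0) / n,   # cause absent:  e+, e-
    }
    p_c = {1: n1 / n, 0: n0 / n}
    p_e = {1: (k1 + k0) / n, 0: (n - k1 - k0) / n}
    total = 0.0
    for (c, e), p in joint.items():
        p0 = p_c[c] * p_e[e]
        if p0 > 0.0:   # a cell with zero expected mass contributes nothing
            total += (p - p0) ** 2 / p0
    return n * total


print(chi_square(7, 100, 0, 100))     # case 2
print(chi_square(53, 100, 46, 100))   # case 3
print(chi_square(4, 8, 4, 8))         # Delta-P = 0
```

Case 2 gives a χ² of about 7.25 and case 3 about 0.98, so χ² separates the two cases that
ΔP treats as equal. Any contingency with P(e+|c+) = P(e+|c-) gives exactly 0.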
3 Comparison with experiments 
In this section we examine the strengths and weaknesses of the two parameter inference
models, ΔP and causal power, and the two structural inference models, causal support and
χ², as accounts of data from three behavioral experiments, each designed to address
different aspects of human causal induction. To compensate for possible nonlinearities in
people's use of numerical rating scales on these tasks, all model predictions have been scaled
by power-law transformations, f(x) = sign(x)|x|^γ, with γ chosen separately for each model
and each data set to maximize the linear correlation between predictions and data. In the
figures, predictions are expressed over the same range as the data, with minimum and maximum
values aligned.
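The rescaling step can be sketched as a one-dimensional search over γ. The example below
uses a simple grid of candidate exponents, which is an assumption; the search procedure
is not specified in the text. The inputs are a synthetic check, not experimental data:
the "ratings" are generated as the signed square root of the "predictions", so the search
should recover γ = 0.5.

```python
import math


def power_law(x, gamma):
    """f(x) = sign(x) * |x|**gamma, the rescaling applied to model predictions."""
    return math.copysign(abs(x) ** gamma, x)


def pearson_r(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_gamma(predictions, ratings, gammas):
    """Return (best_r, best_gamma) over a grid of candidate exponents."""
    return max(
        (pearson_r([power_law(p, g) for p in predictions], ratings), g)
        for g in gammas
    )


# Synthetic inputs for a self-check (hypothetical values, not data from any experiment).
predictions = [-0.5, 0.0, 0.1, 0.4, 0.9]
ratings = [power_law(p, 0.5) for p in predictions]
gammas = [i / 20 for i in range(1, 61)]   # 0.05, 0.10, ..., 3.00
best_r, best_gamma = fit_gamma(predictions, ratings, gammas)
print(best_gamma, best_r)
```

Because γ is fit separately for each model and data set, each reported correlation
reflects the best monotone power-law rescaling of that model's predictions.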
Figure 2 presents data from a study by Buehner & Cheng [2], designed to contrast the
predictions of ΔP and causal power. People judged P(C→E) for hypothetical medical studies
much like the gene expression scenarios described above, seeing eight cases in which C
occurred and eight in which C did not occur. Some trends in the data are clearly captured by
the causal power model but not by ΔP, such as the monotonic decrease in P(C→E) from
{1.00, 0.75} to {0.25, 0.00}, as ΔP stays constant but P(e+|c-) (and hence power)
decreases (columns 6-9). Other trends are clearly captured by ΔP but not by the power model,
like the monotonic increase in P(C→E) as P(e+|c+) stays constant at 1.0 but P(e+|c-)
decreases, from {1.00, 1.00} to {1.00, 0.00} (columns 1, 6, 10, 13, 15). However, one of
the most salient trends is captured by neither model: the decrease in P(C→E) as ΔP stays
constant at 0 but P(e+|c-) decreases (columns 1-5). The causal support model predicts
this decrease, as well as the other trends. The intuition behind the model's predictions for
ΔP = 0 is that decreasing the base rate P(e+|c-) increases the opportunity to observe the
cause's influence and thus increases the statistical force behind the inference that C does
not cause E, given ΔP = 0. This effect is most obvious when P(e+|c+) = P(e+|c-) = 1,
yielding a ceiling effect with no statistical leverage [3], but also occurs to a lesser extent
for P(e+|c-) < 1. While χ² generally approximates the support model rather well, it also
fails to explain the cases with P(e+|c+) = P(e+|c-), which always yield χ² = 0. The superior
fit of the support model is reflected in its correlation with the data, giving r = 0.95, while
the power, ΔP, and χ² models gave r values of 0.81, 0.82, and 0.82 respectively.
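The predicted decrease for the ΔP = 0 contingencies can be illustrated with a small
numerical sketch. It evaluates causal support for eight trials with the cause and eight
without, at base rates 8/8, 6/8, 4/8, 2/8, and 0/8, using a midpoint grid to approximate
the evidence integrals. The function name `support`, the grid resolution, and the use of
equal priors on the two graphs are assumptions made for illustration.

```python
import math


def support(k1, n1, k0, n0, grid=200):
    """Grid approximation to causal support with uniform priors and a noisy-OR gate."""
    ws = [(i + 0.5) / grid for i in range(grid)]

    def lik(w_b, w_c):
        q0 = w_b
        q1 = 1.0 - (1.0 - w_b) * (1.0 - w_c)
        return q1**k1 * (1.0 - q1)**(n1 - k1) * q0**k0 * (1.0 - q0)**(n0 - k0)

    ev1 = sum(lik(wb, wc) for wb in ws for wc in ws) / grid**2   # Graph 1
    ev0 = sum(lik(wb, 0.0) for wb in ws) / grid                  # Graph 0 (w_C = 0)
    return math.log(ev1) - math.log(ev0)


# Contingencies with Delta-P = 0 and a decreasing base rate P(e+|c-).
base_counts = [8, 6, 4, 2, 0]
supports = [support(k, 8, k, 8) for k in base_counts]
for k, s in zip(base_counts, supports):
    print(f"{{{k}/8, {k}/8}}: support={s:.2f}")
```

Under these assumptions support falls steadily as the base rate drops, from slightly
positive at {8/8, 8/8} to -log 9 at {0/8, 0/8}, whereas ΔP and χ² are 0 for all five
contingencies.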
Figure 3 shows results from an experiment conducted by Lober and Shanks [6], designed
to explore the trend in Buehner and Cheng's experiment that was predicted by ΔP but not
by the power model. Columns 4-7 replicated the monotonic increase in P(C→E) when
P(e+|c+) remains constant at 1.0 but P(e+|c-) decreases, this time with 28 cases in which
C occurred and 28 in which C did not occur. Columns 1-3 show a second situation in which
the predictions of the power model are constant, but judgments of P(C→E) increase.
Columns 8-10 feature three scenarios with equal ΔP, for which the causal power model
predicts a decreasing trend. These effects were explored by presenting a total of 60 trials,
rather than the 56 used in Columns 4-7. For each of these trends the ΔP model outperforms
the causal power model, with overall r values of 0.96 and 0.36 respectively. However, it
is important to note that the responses of the human subjects in columns 8-10
(contingencies {1.00, 0.60}, {0.80, 0.40}, {0.40, 0.00}) are not quite consistent with the
predictions of ΔP: they show a slight U-shaped non-linearity, with P(C→E) judged to be
smaller for {0.80, 0.40} than for either of the extreme cases. This trend is predicted by the
causal support model and its χ² approximation, however, which both give the slightly better
r of 0.99.
Figure 4 shows data that we collected in a similar survey, aiming to explore this non-linear
effect in greater depth. 35 students in an introductory psychology class completed the
survey for partial course credit. They each provided a judgment of P(C→E) in 14 different
medical scenarios, where information about P(e+|c+) and P(e+|c-) was provided in terms
of how many mice from a sample of 100 expressed a particular gene. Columns 1-3, 5-7, and
9-11 show contingency structures designed to elicit U-shaped trends in P(C→E). Columns
4 and 8 give intermediate values, also consistent with the observed non-linearity. Column 14
attempted to explore the effects of manipulating sample size, with a contingency structure
of {7/7, 93/193}. In each case, we observed the predicted nonlinearity: in a set of
situations with the same ΔP, the situations involving less extreme probabilities show reduced
judgments of P(C→E). These non-linearities are not consistent with the ΔP model, but
are predicted by both causal support and χ². ΔP actually achieves a correlation comparable
to χ² (r = 0.92 for both models) because the non-linear effects contribute only weakly to
the total variance. The support model gives a slightly worse fit than χ², r = 0.80, while
the power model gives a poor account of the data, r = 0.38.
4 Conclusions and future directions 
In each of the studies above, the structural inference models based on causal support or
χ² consistently outperformed the parameter estimation models, ΔP and causal power.
While causal power and ΔP were each capable of capturing certain trends in the data,
causal support was the only model capable of predicting all the trends. For the third data
set, χ² provided a significantly better fit to the data than did causal support. This finding
merits future investigation in a study designed to tease apart χ² and causal support; in any
case, due to the close relationship between the two models, this result does not undermine
our claim that probabilistic structural inferences are central to human causal induction.
One unique advantage of the Bayesian causal support model is its ability to draw inferences 
from very few observations. We have begun a line of experiments, inspired by Gopnik, 
Sobel & Glymour (submitted), to examine how adults revise their causal judgments when 
given only one or two observations, rather than the large samples used in the above studies. 
In one study, subjects were faced with a machine that would inform them whether a pen- 
cil placed upon it contained "superlead" or ordinary lead. Subjects were either given prior 
knowledge that superlead was rare or that it was common. They were then given two pen- 
cils, analogous to B and C in Figure 1, and asked to rate how likely these pencils were to 
have superlead, that is, to cause the detector to activate. Mean responses reflected the in- 
duced prior. Next, they were shown that the superlead detector responded when B and C 
were tested together, and their causal ratings of both B and C increased. Finally, they were 
shown that B set off the superlead detector on its own, and causal ratings of B increased to 
ceiling while ratings of C returned to their prior levels. This situation is exactly analogous 
to that explored in the medical tasks described above, and people were able to perform accu- 
rate causal inductions given only one trial of each type. Of the models we have considered, 
only Bayesian causal support can explain this behavior, by allowing the prior in Equation 
11 to adapt depending on whether superlead is rare or common. 
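The qualitative pattern in the superlead study can be illustrated with a deliberately
simplified Bayesian sketch. It is not the causal support computation itself: it assumes
that each pencil independently contains superlead with prior probability `prior`, and
that the detector fires deterministically whenever at least one tested pencil contains
superlead. Both assumptions, and the function name `posterior`, are introduced here for
illustration only.

```python
from itertools import product


def posterior(prior, observations):
    """Return (P(B has superlead), P(C has superlead)) by enumerating four hypotheses.

    observations: list of (tested, fired), where tested is a string of pencil names
    drawn from "B" and "C", and fired says whether the detector responded.
    Assumes independent priors and a deterministic OR detector (illustrative only).
    """
    weights = {}
    for b, c in product((0, 1), repeat=2):
        w = (prior if b else 1.0 - prior) * (prior if c else 1.0 - prior)
        for tested, fired in observations:
            predicted = any({"B": b, "C": c}[name] for name in tested)
            if predicted != fired:
                w = 0.0   # hypothesis contradicted by this observation
        weights[(b, c)] = w
    z = sum(weights.values())
    p_b = sum(w for (b, _), w in weights.items() if b) / z
    p_c = sum(w for (_, c), w in weights.items() if c) / z
    return p_b, p_c


for prior in (0.2, 0.8):   # superlead rare vs. common (hypothetical prior values)
    after_both = posterior(prior, [("BC", True)])
    after_b_alone = posterior(prior, [("BC", True), ("B", True)])
    print(prior, after_both, after_b_alone)
```

With a prior of 0.2, seeing the detector fire for B and C together raises both
probabilities to about 0.56. Seeing B fire alone then pushes B to 1 and returns C to
0.2, mirroring the way ratings of C returned to their prior levels.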
We also hope to look at inferences about more complex causal structures, including those
with hidden variables. With just a single cause, causal support and χ² are highly correlated,
but with more complex structures, the Bayesian computation of causal support becomes
increasingly intractable while the χ² approximation becomes less accurate. Through
experiments with more complex structures, we hope to discover where and how human causal
induction strikes a balance between ponderous rationality and efficient heuristic.
Finally, we should stress that despite the superior performance of the structural inference
models here, in many situations estimating causal strength parameters is likely to be just as
important as inferring causal structure. Our hope is that by using graphical models to relate
and extend existing accounts of causal induction, we have provided a framework for
exploring the interplay between the different kinds of judgments that people make.
References 
[1] J. Anderson (1990). The adaptive character of thought. Erlbaum.
[2] M. Buehner & P. Cheng (1997). Causal induction: The power PC theory versus the
Rescorla-Wagner theory. In Proceedings of the 19th Annual Conference of the Cognitive
Science Society.
[3] P. Cheng (1997). From covariation to causation: A causal power theory. Psychological Review
104, 367-405.
[4] T. Cover & J. Thomas (1991). Elements of information theory. Wiley.
[5] C. Glymour (1998). Learning causes: Psychological explanations of causal explanation. Minds
and Machines 8, 39-60.
[6] K. Lober & D. Shanks (2000). Is causal induction based on causal power? Critique of Cheng
(1997). Psychological Review 107, 195-212.
[7] J. Pearl (2000). Causality. Cambridge University Press.
Graph 1 (h_C = 1): B → E ← C          Graph 0 (h_C = 0): B → E, with no C→E link

Model     Form of P(e|b,c)    P(C→E)
ΔP        Linear              w_C
Power     Noisy-OR gate       w_C
Support   Noisy-OR gate       log [P(h_C = 1 | X) / P(h_C = 0 | X)]

Figure 1: Different theories of human causal induction expressed as different operations on
a simple graphical model. The ΔP and power models correspond to maximum likelihood
parameter estimates on a fixed graph (Graph 1), while the support model corresponds to a
(Bayesian) inference about which graph is the true causal structure.

[Plot: panels for Humans, ΔP, Power, Support, and χ²; vertical axis from 0 to 100; one
column per stimulus contingency.]
P(e+|c+): 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 1.00 0.75 0.50 1.00 0.75 1.00
P(e+|c-): 1.00 0.75 0.50 0.25 0.00 0.75 0.50 0.25 0.00 0.50 0.25 0.00 0.25 0.00 0.00
Figure 2: Computational models compared with the performance of human participants from
Buehner and Cheng [2], Experiment 1B. Numbers along the top of the figure show stimulus
contingencies.

[Plot: panels for Humans, ΔP, Power, Support, and χ²; vertical axis from 0 to 100; one
column per stimulus contingency.]
P(e+|c+): 0.90 0.80 0.70 1.00 1.00 1.00 1.00 1.00 0.80 0.40 0.90
P(e+|c-): 0.66 0.33 0.00 0.75 0.50 0.25 0.00 0.60 0.40 0.00 0.83
Figure 3: Computational models compared with the performance of human participants
from Lober and Shanks [6], Experiments 4-6.

[Plot: panels for Humans, ΔP, Power, Support, and χ²; vertical axis from 0 to 20; one
column per stimulus contingency.]
P(e+|c+): 0.40 0.70 1.00 0.90 0.07 0.53 1.00 0.74 0.02 0.51 1.00 0.10 1.00 1.00
P(e+|c-): 0.00 0.30 0.60 0.83 0.00 0.46 0.93 0.72 0.00 0.49 0.98 0.10 1.00 0.48
Figure 4: Computational models compared with the performance of human participants on a
set of stimuli designed to elicit the non-monotonic trends shown in the data of Lober and
Shanks [6].
