Conditioning on a confounding variable: demonstration using a three-dimensional scatterplot

R script: 3D_scatterplot_demo.R

In this demonstration I use simulated data to show how a third variable that affects two other variables generates a correlation between them even if they have no effect on each other, and I show how conditioning on the third variable eliminates that correlation.

We start by generating a simulated set of 1000 data points. Each unit, or case, is given a value on each of three variables: x, y, and z. First the values of z are drawn randomly from a standard normal distribution. The values of x and y are then made to depend on z: each is drawn randomly from a normal distribution and then adjusted upward or downward a little bit depending on the corresponding value of z. So x and y are both affected by z, but x and y have no effect on each other.

However, if the values of x and y are plotted while ignoring the values of z, a clear correlation between x and y can be seen. In the 3-dimensional scatterplot, this correlation is visible when the plot is oriented such that the x axis is displayed horizontally and the y axis vertically. This is because higher values of z cause higher values of x and higher values of y. When z is higher, we should expect to see higher values of both x and y. When z is lower, we should expect to see lower values of both.

But what if z weren't allowed to vary? If z isn't allowed to be higher or lower, then there is no opportunity for higher values of z to produce higher values of x and y. All of the covariation between x and y is produced by variation in z, so if z isn't allowed to vary, there should be no covariation between x and y. If we condition on z when examining the covariation between x and y, the covariation should disappear.
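
The data-generating step described above can be sketched in a few lines. This is a rough Python illustration, not the R script itself. The effect size of 1.0, the unit-variance noise, the seed, and the helper names simulate and pearson are all my assumptions, chosen only to make the pattern easy to see.

```python
import random


def simulate(n=1000, effect=1.0, seed=1):
    """Draw z ~ N(0, 1), then build x and y as independent noise shifted by z.

    The effect size and seed are illustrative assumptions, not values
    taken from the original R script.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    # x and y each depend on z, but neither depends on the other.
    x = [rng.gauss(0, 1) + effect * zi for zi in z]
    y = [rng.gauss(0, 1) + effect * zi for zi in z]
    return x, y, z


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


x, y, z = simulate()
r_xy = pearson(x, y)
print(f"correlation between x and y, ignoring z: {r_xy:.2f}")
```

Under these assumptions the covariance of x and y comes entirely from z, so the correlation is clearly positive even though the code never lets x influence y or y influence x.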

We can do that by examining just one small slice of the data: only those cases that have a certain value of z (or at least only those that have very similar values of z). We can specify any arbitrary value of z, and we specify a small interval around that value, within which the values of z are similar enough to each other that they generate very little variation in x and y. Ideally we would use an infinitely small interval, but then we wouldn't have anything to look at, so we just use a fairly small interval for the purpose of the demonstration.

The data points are then separated into two groups: those whose values of z are within the specified interval and those whose values are not. The two groups are differentiated by color in the 3D scatterplot, with the points inside the interval shown in red. We can now see how the data have been sliced at a particular point on the z axis. All the red points have approximately the same value of z. If the plot is then oriented such that the x axis is displayed horizontally and the y axis vertically, we can see that among the red points there doesn't appear to be any covariation between x and y. All of the covariation between x and y was produced by variation in z, so when z isn't allowed to vary, there isn't any covariation between x and y. Conditioning on the confounding variable z has eliminated the spurious correlation between x and y.
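
The slicing step can be checked numerically as well as visually. The following is a minimal Python sketch rather than the R script. It simulates data in the way described in the text, with an assumed effect size of 1.0, and it assumes a slice centered at z = 0 with a half-width of 0.25. The names z0, half_width, and pearson are hypothetical.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


# Simulated data: z affects both x and y (effect size 1.0 is an assumption).
rng = random.Random(1)
n = 1000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [rng.gauss(0, 1) + zi for zi in z]
y = [rng.gauss(0, 1) + zi for zi in z]

# Slice definition: an arbitrary center and a fairly small interval (assumed).
z0 = 0.0
half_width = 0.25

# Flag the cases inside the slice; these are the "red" points.
in_slice = [abs(zi - z0) <= half_width for zi in z]
xs = [xi for xi, keep in zip(x, in_slice) if keep]
ys = [yi for yi, keep in zip(y, in_slice) if keep]

r_all = pearson(x, y)
r_slice = pearson(xs, ys)
print(f"all {n} points:           r = {r_all:.2f}")
print(f"{len(xs)} points in the slice: r = {r_slice:.2f}")
```

Because the interval has some width, a trace of variation in z remains inside the slice, so the within-slice correlation should be close to zero rather than exactly zero. Changing z0 moves the slice along the z axis.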

We could specify a different value of z at which to slice the data. For any value of z we choose, we should see no covariation between x and y within that value of z. Ideally we would look at every possible value of z, calculate the covariation between x and y within each value of z, and then calculate an average of all those covariations, and we would find it to be approximately zero. That is essentially what a multiple regression model does. If the confounding variable z is included in the model, then the slope coefficient for x represents the average covariation between x and y while conditioning on the values of z.

Recall that a bivariate linear regression model can be thought of as a line fitted through a 2-dimensional scatterplot of the values of x and y. A multiple regression model can be thought of as a 2-dimensional plane fitted through a 3-dimensional scatterplot of the values of x, y, and z. In the next demonstration plot, our simulated values of x, y, and z are plotted with a plane fitted through them. If the plot is oriented such that the z axis is displayed horizontally and the y axis vertically, we can see a clear positive slope. This is because z has a positive effect on y. But if the plot is oriented such that the x axis is displayed horizontally and the y axis vertically, we see no slope. This is because x has no effect on y. Within each level of z there is no covariation between x and y. Within each level of z, the model describing the covariation between x and y is basically a flat line with zero slope.
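
The fitted plane can also be computed directly. The sketch below is in Python rather than R and is only an illustration under assumptions. It simulates data as described in the text, with an assumed effect size of 1.0, and fits y on x and z by solving the two least-squares normal equations by hand. The helper names centered and dot are hypothetical.

```python
import random

# Simulated data: z affects both x and y (effect size 1.0 is an assumption).
rng = random.Random(1)
n = 1000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [rng.gauss(0, 1) + zi for zi in z]
y = [rng.gauss(0, 1) + zi for zi in z]


def centered(values):
    """Subtract the mean so the intercept drops out of the slope equations."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    """Sum of element-wise products of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


xc, yc, zc = centered(x), centered(y), centered(z)
sxx, szz, sxz = dot(xc, xc), dot(zc, zc), dot(xc, zc)
sxy, szy = dot(xc, yc), dot(zc, yc)

# Bivariate regression of y on x alone: a line through the x-y scatterplot.
slope_ignoring_z = sxy / sxx

# Multiple regression of y on x and z: a plane through the 3D scatterplot.
# Normal equations:  sxx*bx + sxz*bz = sxy   and   sxz*bx + szz*bz = szy
det = sxx * szz - sxz ** 2
slope_x = (sxy * szz - sxz * szy) / det
slope_z = (sxx * szy - sxz * sxy) / det

print(f"slope of x, ignoring z:      {slope_ignoring_z:.2f}")
print(f"slope of x, conditioning on z: {slope_x:.2f}")
print(f"slope of z, conditioning on x: {slope_z:.2f}")
```

With z left out, the line through the x-y scatterplot has a clearly positive slope. With z included, the slope for x should fall to roughly zero while the slope for z stays near the assumed effect size, which is the tilt of the plane along the z axis.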


UPDATE: Randy Cragun has recently produced a more accessible web version of my demonstration so you can see my graphs even if you don't know how to use R: http://jamescragun.com/teaching/what_does_controlling_for_mean.html