Adventures in econometricland
I took Oxford’s advanced undergraduate econometrics course. My experience of the course, and really of the entire field, was the following: the concepts are simple, the real challenge is making sense of notation so obfuscatory that you wonder if it’s done on purpose.
In order to arrive at this view, I went through a long and confusing journey, one I wish upon no friend. This document’s structure takes my journey in reverse order.1 I start with what I eventually pinned down as the clear mathematical facts. Once armed with this toolkit, I do my best to explain why standard notation is confusing, and attempt to guess, from context, what econometricians actually mean.
The examples I give really are of standard practice. I give quotes from a few textbooks and our lecture slides, but I promise that you will find the same thing almost everywhere. And the confusing usages are not just a convenience of notation that is readily acknowledged in conversation. When I asked people about this in person, all I got were long, confusing back-and-forths.
- The facts
- The hermeneutics (I)
- Inconsistent causal language
- Appendix A
- Appendix B
- Appendix C
We start with a set of ordered pairs $(X_i, Y_i)$.
You can think of $X_i$ and $Y_i$ as
- real numbers (facts about each of the individuals in the population)
- or as random variables (probability distributions over facts about individuals in a sample),

all the maths will apply equally. (I will return to this fact and comment on it.)
The CEF minimises $E\big[(Y - g(X))^2\big]$
Some algebraic facts
We write the equality:

$$Y = m(X) + e$$

where $Y$ and $X$ are known, but $e$ depends on our choice of $m$.

Suppose we want to solve

$$\min_{g} \; E\big[(Y - g(X))^2\big]$$

The solution is $g(X) = E[Y \mid X]$. The proof of this is in appendix A. Suppose we specify $m(X) = E[Y \mid X]$; we then get:

$$Y = E[Y \mid X] + e$$

Now $E[Y \mid X]$ is known and $e$ is known (by the subtraction $e = Y - E[Y \mid X]$).
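To make this concrete, here is a toy simulation of my own (not from any textbook), assuming numpy: in a finite population where $X$ takes three values, the CEF is just the within-group mean of $Y$, and $e$ is obtained by subtraction.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small finite "population": X takes the values 0, 1, 2.
X = rng.integers(0, 3, size=9_000)
Y = 2.0 * X**2 + rng.normal(0, 1, size=X.size)  # any relationship will do

# The CEF evaluated at each observation: the within-group mean of Y.
cef = np.array([Y[X == x].mean() for x in (0, 1, 2)])
m_X = cef[X]

# e is known by subtraction, and averages to zero within every group.
e = Y - m_X
for x in (0, 1, 2):
    assert abs(e[X == x].mean()) < 1e-8

# The CEF beats any other function g(X) on mean squared error --
# here, e.g., the noiseless curve g(x) = 2x^2:
mse_cef = np.mean((Y - m_X) ** 2)
mse_alt = np.mean((Y - 2.0 * X**2) ** 2)
assert mse_cef <= mse_alt
```

The inequality at the end holds for any alternative $g$, since the group mean minimises the squared error within each group.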
The LRM minimises $E\big[(Y - a - bX)^2\big]$
Some algebraic facts
Now we write the following equality:

$$Y = \alpha + \beta X + u$$

This says that $Y$ is equal to a linear function of $X$ plus some number $u$.

We then have

$$u = Y - \alpha - \beta X$$

As before $Y$ is known, whereas $u$ is a function of $\alpha$ and $\beta$.

Here $m(X_i) - \alpha - \beta X_i$ is the distance, for observation $i$, between the LRM and the CEF; while $e_i = Y_i - m(X_i)$ is the distance between the CEF and the actual value of $Y_i$. We can then call $u_i = \big(m(X_i) - \alpha - \beta X_i\big) + e_i$ the distance between the LRM and the actual value.2

We can also see that $u = e$ is equivalent to $m(X) = \alpha + \beta X$, i.e. the CEF and the LRM occupy the same coordinates.
Suppose we want to solve

$$\min_{a, b} \; E\big[(Y - a - bX)^2\big]$$

The solution is

$$\beta = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}, \qquad \alpha = E[Y] - \beta\,E[X]$$

I prove this in appendix B. (It’s possible to prove an analogous result in general using matrix algebra, see appendix C.)

Suppose we specify that $\alpha$ and $\beta$ are equal to these solution values. Now that $\alpha$ and $\beta$ are known, $u$ is known too (by the subtraction $u = Y - \alpha - \beta X$). As before, $e$ is known.

Thus, in our regression equation,

$$Y = \alpha + \beta X + u$$

all of $Y$, $X$, $\alpha$, $\beta$, and $u$ (and thus $e$), are known.
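The same point numerically (again my own toy numbers, assuming numpy): compute $\beta = \mathrm{Cov}(X,Y)/\mathrm{Var}(X)$ and $\alpha = E[Y] - \beta E[X]$ directly, then recover $u$ by subtraction. Nothing is left unknown.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0, 1, 100_000)
Y = 1.0 + 3.0 * X + rng.normal(0, 1, X.size)

# The minimisers of E[(Y - a - bX)^2]:
beta = np.cov(X, Y, bias=True)[0, 1] / X.var()
alpha = Y.mean() - beta * X.mean()

# u is now known, by subtraction:
u = Y - alpha - beta * X

# Two consequences of the first-order conditions, holding by construction:
assert abs(u.mean()) < 1e-8          # E[u] = 0
assert abs(np.mean(X * u)) < 1e-8    # E[Xu] = 0
```

The two final identities hold for *any* joint distribution of $(X, Y)$; they are algebra, not assumptions.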
Two things to note about the facts above.
- Whether we are using real numbers or random variables does not matter for anything we’ve said so far. All we have used are the expectation and summation operators and their properties. Textbooks often warn about the important distinction between the sample and the population, but as far as these algebraic facts are concerned the difference is immaterial!
- I have not used “hat” notation (as in $\hat{\beta}$). Instead I have described the results of optimisation procedures carefully using words, like “the solution to this minimisation problem is …”. The way standard econometrics uses the hat is a good example of obfuscatory notation.
The hermeneutics (I)
Econometrics textbooks, within the same sentence or paragraph, routinely use the hat in two ways which seem to me to be incompatible.
In Stock and Watson, p. 158, we have Claim A:
The linear regression model is:

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

where $\beta_0 + \beta_1 X$ is the population regression line or population regression function, $\beta_0$ is the intercept of the population regression line, and $\beta_1$ is the slope of the population regression line.
And Stock and Watson, p. 163 gives Claim B:
The OLS estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$, are sample counterparts of the population coefficients $\beta_0$ and $\beta_1$. Similarly, the OLS regression line $\hat{\beta}_0 + \hat{\beta}_1 X$ is the sample counterpart of the population regression line $\beta_0 + \beta_1 X$, and the OLS residuals $\hat{u}_i$ are sample counterparts of the population errors $u_i$.
So far so good.
Loss function minimisers?
Stock and Watson, p. 187 (Claim C):
The OLS estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$, are the values of $b_0$ and $b_1$ that minimise $\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$.
This quote is the biggest culprit. After many conversations, I finally understood that we’re supposed to take the quote to mean:
The OLS estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$, are the values of $b_0$ and $b_1$ that minimise $\sum_{i=1}^{n} (Y_i^s - b_0 - b_1 X_i^s)^2$, where $n$ is the number of observations in the sample (the sample size), and $X_i^s$ and $Y_i^s$ are the $i$th values in the sample.
I swear, I’m not taking this quote out of context! Nowhere in the entire textbook would you find a clue that the $X_i$ and $Y_i$ in claim C are completely different quantities from the $X_i$ and $Y_i$ in claim A. This is criminal negligence. (I’m also not cherry-picking. My lecture notes cheerfully call $\alpha$ and $\beta$ the ‘OLS’ solutions, and this usage is standard.)
Of course, I’m the kind of person to take claim C at face value, and combine it with claim A, to arrive at $\hat{\beta}_0 = \beta_0$ and $\hat{\beta}_1 = \beta_1$, which, I gathered from context, was not a desirable conclusion.
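The distinction claim C silently relies on is easy to see numerically (toy numbers of my own, assuming numpy): minimising the sum of squares over a sample gives different numbers than minimising it over the population.

```python
import numpy as np

rng = np.random.default_rng(2)

# "Population": one million individuals.
X = rng.normal(0, 1, 1_000_000)
Y = 1.0 + 3.0 * X + rng.normal(0, 1, X.size)

def ols(x, y):
    """Minimisers of the sum of squared differences over the given pairs."""
    b = np.cov(x, y, bias=True)[0, 1] / x.var()
    return y.mean() - b * x.mean(), b

# Population coefficients (claim A's beta_0, beta_1) ...
alpha, beta = ols(X, Y)

# ... versus the OLS estimates from a 50-observation sample (claim C).
idx = rng.choice(X.size, size=50, replace=False)
alpha_hat, beta_hat = ols(X[idx], Y[idx])

print(alpha, beta)          # close to (1, 3)
print(alpha_hat, beta_hat)  # noticeably different in a small sample
assert (alpha_hat, beta_hat) != (alpha, beta)
```

The same minimisation procedure, fed different pairs, returns different coefficients; the notation hides which pairs are being fed in.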
The following is not as bad as the above, since it avoids explicit contradiction, but still sows confusion by using the hat to mean different things when put on top of different values.
Claim D, from Stock and Watson p. 163:
The predicted value of $Y_i$, given $X_i$, based on the OLS regression line, is $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$.
This is compatible with the loss-function-minimiser usage of the hat: claim C tells us $\hat{\beta}_0$ and $\hat{\beta}_1$ are loss function minimisers; claim D then tells us that $\hat{Y}_i$ is the value obtained when you compute the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ which minimise a loss function, and plug them into the regression function.
But, of course, this “predicted value” verbiage is incompatible with the sample analogue usage. $\hat{Y}_i$ can’t be both the predicted value of $Y_i$ (whether in a sample or not) and the actual value of $Y_i$ in a sample. That would imply that predictions are always perfect!
So even if we amend claim C as I’ve done above, we still can’t say that the hat is consistently used to mean sample analogue, since in the case of $\hat{Y}_i$ it’s apparently used to mean predicted value. (More specifically, predicted value in a sample, one guesses from context.)
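On the loss-minimiser reading, the predicted values fit the sample as well as any line can, but they are not the sample values themselves; the residuals are generically nonzero. A quick check (my own toy numbers, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 30)                  # a sample of 30 observations
y = 1.0 + 3.0 * x + rng.normal(0, 1, 30)

# OLS estimates from the sample:
beta_hat = np.cov(x, y, bias=True)[0, 1] / x.var()
alpha_hat = y.mean() - beta_hat * x.mean()

y_hat = alpha_hat + beta_hat * x   # "predicted values"
resid = y - y_hat                  # "residuals"

# Predictions are not the actual sample values:
assert not np.allclose(y_hat, y)
assert np.max(np.abs(resid)) > 0.1
```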
Inconsistent causal language
Here is an entirely separate category of wrongdoing. In all of the above we have taken the statement

$$Y = \alpha + \beta X + u$$

to be an innocuous equality: $Y$ is equal to the regression intercept, plus the regression slope times $X$, plus some remaining difference $u$. Call this the algebraic claim.
But it turns out that the statement is sometimes used to make a completely different, and incredibly strong, causal claim. Econometricians switch between the two usages.
In keeping with the above structure, I’ll start by clearly stating the causal claim, then I’ll analyse quotes which trade on the ambiguity between the causal and algebraic claims.
The causal claim
We think of

$$Y = \alpha + \beta X + u$$

not as a regression equation, but as a complete causal account of everything causally affecting $Y$. (Sometimes the equation is said to describe the data generating process, another case of dressing a big implausible claim in sheep’s clothing.) For example, if there are $k$ things $X, W_2, \dots, W_k$ causally affecting $Y$, we have:

$$Y = \alpha + \beta X + \gamma_2 W_2 + \dots + \gamma_k W_k$$

We can think of this claim as equivalent to an infinite list of counterfactuals, giving the potential values of $Y$ for every combination of values of the causal factors $X, W_2, \dots, W_k$. It also makes the claim that nothing else has a causal effect on $Y$.

(If we think the world is non-deterministic, the claim becomes $Y = \alpha + \beta X + \gamma_2 W_2 + \dots + \gamma_k W_k + \varepsilon$, where $Y$ and $\varepsilon$ are random variables, and we have a list of counterfactuals giving the potential distributions of $Y$ for every combination of values of the causal factors.)
That’s a rather huge claim. In any realistic case, causal chains are incredibly long and entangled, so that basically everything affects everything else in some small way. So the claim often amounts to an entire causal model of the world.
In the first part, I have restricted my attention to the confusions that arise when taking the algebraic interpretation as given. It’s clearly the interpretation they want you to use. Regression is a mathematical operation, “$\beta$” is an algebraic symbol, and so on. Phrases like “slope of the population regression line” are routinely used while no hint is ever made at any causal meaning of the claim $Y = \alpha + \beta X + u$. But you’ll see below many claims which only make sense under the causal interpretation.
Algebra or causes?
Stock and Watson p. 158, claim E:
The term $u_i$ is the error term […]. This term contains all the other factors besides $X_i$ that determine the value of the dependent variable, $Y_i$, for a specific observation $i$.
This is a favourite trick: use a word like “determines”, which heavily implies a causal claim, but stay just shy of being unambiguously causal. That way you can always retreat to the algebraic claim. (Other favourites which I see all the time in published papers are “contributes to”, “is associated with”, “explains”, “influences”…).
Indeed, under the algebraic interpretation, claim E is puzzling. What on earth does it mean for a number to “contain” “factors” that “determine” the value of another number? As far as the mathematics is concerned, we have no concept of “determine”, much less of a number “containing” another number.
A causal variant of claim E would be:
The term $u_i$ is the further-causes term […]. This term contains all the other factors besides $X_i$ that cause the value of the dependent variable, $Y_i$, for a specific observation $i$.
Wooldridge, p. 92f, claim F:
When assumption MLR.4 holds, we often say that we have exogenous explanatory variables. If $x_j$ is correlated with $u$ for any reason, then $x_j$ is said to be an endogenous explanatory variable […] Unfortunately, we will never know for sure whether the average value of the unobservables is unrelated to the explanatory variables.
Under the algebraic interpretation, MLR.4 is the claim that the conditional expectation function is exactly the regression line. (In the notation I use above, $m(X) = \alpha + \beta X$, i.e. $u = e$.) This is a pretty strong claim, but has nothing to do with exogeneity. The exogeneity part of claim F only makes sense under the causal interpretation, and I suspect that in the end we are to take claim F causally. In that case, claim F uses the language of correlation (“if $x_j$ is correlated with $u$ for any reason”) to make an extremely strong causal claim. “Correlation does not imply causation” is a very good slogan which it would be beneficial to actually apply.
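A small simulation (mine, not Wooldridge’s; assuming numpy) makes the algebraic reading of MLR.4 concrete: when the CEF is nonlinear, the regression remainder $u$ has $E[u \mid X] \neq 0$, even though $E[u] = 0$ and $E[Xu] = 0$ hold by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(0, 1, 200_000)
Y = X**2 + rng.normal(0, 0.1, X.size)   # nonlinear CEF: E[Y|X] = X^2

# Population regression coefficients:
beta = np.cov(X, Y, bias=True)[0, 1] / X.var()
alpha = Y.mean() - beta * X.mean()
u = Y - alpha - beta * X

# These hold for *any* distribution, by the first-order conditions:
assert abs(u.mean()) < 1e-8
assert abs(np.mean(X * u)) < 1e-8

# But E[u | X] is far from zero where the CEF bends away from the line:
tail = np.abs(X) > 2
print(u[tail].mean())   # clearly positive: the line underpredicts the tails
assert u[tail].mean() > 0.5
```

So the mathematical content of MLR.4 is about the shape of the CEF; whether anything is “exogenous” in the causal sense is a separate question the algebra cannot answer.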
While claim F seems to require the causal interpretation, the phrase “error term” in claim E calls for the algebraic one. And most of the quotes from part one, such as claims A and B, which call $\beta_0 + \beta_1 X$ the “population regression function”, rely on the algebraic claim.
Stock and Watson, p.131, claim G:
The causal effect of a treatment is the expected effect on the outcome of interest of the treatment as measured in an ideal randomized controlled experiment. This effect can be expressed as the difference of two conditional expectations. Specifically, the causal effect on $Y$ of treatment level $x$ is the difference in the conditional expectations, $E(Y \mid X = x) - E(Y \mid X = 0)$, where $E(Y \mid X = x)$ is the expected value of $Y$ for the treatment group (which received treatment level $X = x$) in an ideal randomized controlled experiment and $E(Y \mid X = 0)$ is the expected value of $Y$ for the control group (which receives treatment level $X = 0$).
Stock and Watson, p. 170, claim H:
The first of the three least squares assumptions is that the conditional distribution of $u_i$ given $X_i$ has a mean of zero. This assumption is a formal mathematical statement about the “other factors” contained in $u_i$ and asserts that these other factors are unrelated to $X_i$ in the sense that, given a value of $X_i$, the mean of the distribution of these other factors is zero.
Claim G is good because there is appropriate hedging: causal effects are the difference between conditional expectations only in an idealised RCT. An idealised RCT is the only case where the causal claim and the algebraic claim have the same meaning.
In claim H, however, the sentence “This assumption is a formal mathematical statement about the ‘other factors’ contained in $u_i$” trades on the ambiguity between the algebraic and causal claims. Mathematical statements are about sums and products, not about causality in the world. This kind of writing promotes a kind of magical thinking in which, say, the expectation operator (really just a sum) can tell us about what we would causally “expect” to see if we intervened on the world.
Knowns or unknowns?
I want to go back to a part of claim F, which I did not discuss above:
Unfortunately, we will never know for sure whether the average value of the unobservables is unrelated to the explanatory variables.
We see the same talk of unobservables in the University of Oxford Econometrics lecture slides, Michaelmas Term 2017 (claim J):
The simple regression model is

$$y = \beta_0 + \beta_1 x + u$$

- $x$ and $y$ are observable random scalars
- $u$ is the unobservable random disturbance or error
- $\beta_0$ and $\beta_1$ are the parameters (constants) we would like to estimate
On the causal usage, the $u_i$ are indeed practically impossible to observe. But then so are $\beta_0$ and $\beta_1$; yet these are simply called parameters, not unobservables.
But on the algebraic usage, we are presumably to take $\beta_0$ and $\beta_1$ to be loss function minimising coefficients. Then, if $x$ and $y$ are known, so are $\beta_0$ and $\beta_1$, and by a simple subtraction, $u$ is known too.
The same thing happens with $e = Y - E[Y \mid X]$. When that equality is first introduced, it is presented as a mere piece of algebra. If we know $Y$ and $E[Y \mid X]$ we can obviously get $e$ by a subtraction. Yet econometricians insist on calling $e$ unknown; they are laying the groundwork to hoodwink you later by switching to the causal usage.
Proof that the solution to $\min_g E\big[(Y - g(X))^2\big]$ is $g(X) = E[Y \mid X]$.

By the law of iterated expectations, it is enough to minimise $E\big[(Y - c)^2 \mid X = x\big]$ over $c$ separately for each value $x$. Taking the first-order condition:

$$\frac{d}{dc}\, E\big[(Y - c)^2 \mid X = x\big] = -2\big(E[Y \mid X = x] - c\big) = 0$$

so $c = E[Y \mid X = x]$, i.e. $g(X) = E[Y \mid X]$.
Proof that the solution to $\min_{a,b} E\big[(Y - a - bX)^2\big]$ is $\beta = \mathrm{Cov}(X, Y)/\mathrm{Var}(X)$, $\alpha = E[Y] - \beta\,E[X]$.

Taking the first-order conditions with respect to $a$ and $b$:

$$-2\,E[Y - a - bX] = 0, \qquad -2\,E\big[X(Y - a - bX)\big] = 0$$

The first gives $a = E[Y] - b\,E[X]$. Substituting into the second gives $E[XY] - E[X]E[Y] - b\big(E[X^2] - E[X]^2\big) = 0$. Thus we write:

$$\beta = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}, \qquad \alpha = E[Y] - \beta\,E[X]$$
Fact A is proven here:

$$\frac{\partial}{\partial b}\,(b' A b) = (A + A')\,b$$

A special case of fact A is:

$$\frac{\partial}{\partial b}\,(b' A b) = 2Ab \quad \text{when } A \text{ is symmetric}$$

We can also write:

$$(Y - Xb)'(Y - Xb) = Y'Y - b'X'Y - Y'Xb + b'X'Xb$$

Since $b'X'Y$ is a scalar, it is equal to its transpose $Y'Xb$. Thus:

$$(Y - Xb)'(Y - Xb) = Y'Y - 2\,Y'Xb + b'X'Xb$$

We then solve:

$$\frac{\partial}{\partial b}\,\big(Y'Y - 2\,Y'Xb + b'X'Xb\big) = -2\,X'Y + 2\,X'Xb = 0$$

Assuming that $X'X$ is invertible (since it’s a square matrix, this is equivalent to $\det(X'X) \neq 0$), we have:

$$\beta = (X'X)^{-1}X'Y$$
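A quick numerical check of my own (assuming numpy) that the matrix formula agrees with the scalar covariance formula from appendix B:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 1_000)
y = 1.0 + 3.0 * x + rng.normal(0, 1, 1_000)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# b = (X'X)^{-1} X'y  (solve is numerically safer than an explicit inverse)
b = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar formulas:
beta = np.cov(x, y, bias=True)[0, 1] / x.var()
alpha = y.mean() - beta * x.mean()

assert np.allclose(b, [alpha, beta])
```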
For the curious, or those who have too much time on their hands, I include a full version history, showing how this document evolved over the past few weeks. It’s an interesting window into my thought process. ↩
As a separate gripe from the main one in this post, I note that often what I call $u$ is just written as $e$; by this I mean that in the same document, people will write $Y = m(X) + e$ and $Y = \alpha + \beta X + e$. This is either a terrible choice of notation (same name for two different objects) or an implicit and unnecessary (in this case) assumption that $m(X) = \alpha + \beta X$ and $u = e$. ↩