Supplement to Bounded Rationality

The Bias-Variance Decomposition of Mean Squared Error

Trading off bias and variance to improve predictive accuracy is a common practice in statistics and machine learning. The trade-off is typically introduced through the decomposition of mean squared error (MSE) in ordinary least squares regression. This section provides a stand-alone derivation of MSE in terms of bias and variance, starting with an example prediction task.

Predicting the exact volume of gelato to be consumed in Rome next summer is more difficult than predicting that more gelato will be consumed next summer than next winter. Although higher temperatures lead to higher demand for gelato, the precise relationship between daily temperatures in Rome and gelato consumption is uncertain. Modeling quantitative, predictive relationships between random variables like the temperature in Rome, X, and volume of gelato consumption, Y, is the subject of regression analysis.

Suppose we predict that the value of Y is h. How should we evaluate whether this prediction is accurate? Intuitively, the best prediction minimizes the difference \(Y - h\). If we are indifferent to the direction of our errors, it is common to measure performance by the squared difference from Y, \((Y - h)^2\). Since the value of Y varies, we consider the average value of \((Y - h)^2\) by computing its expectation, \(\mathbb{E} \left[ (Y - h)^2 \right]\). This quantity is the mean squared error of h,

\[\textrm{MSE}(h) := \mathbb{E} \left[ (Y - h)^2 \right].\]
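As a rough illustration of this definition, the following Python sketch estimates the mean squared error of a few constant predictions h from a simulated sample. The gelato volumes, the sample size, and the candidate values are invented for the example; the only point is that, among the candidates tried, the sample mean gives the smallest empirical MSE.

```python
import random


def mse(ys, h):
    """Empirical mean squared error of the constant prediction h."""
    return sum((y - h) ** 2 for y in ys) / len(ys)


rng = random.Random(0)
# Hypothetical daily gelato volumes (arbitrary units); these numbers are made up.
ys = [rng.gauss(100.0, 15.0) for _ in range(1000)]

mean_y = sum(ys) / len(ys)
# A handful of candidate constant predictions around the sample mean.
candidates = [mean_y - 10, mean_y - 1, mean_y, mean_y + 1, mean_y + 10]
errors = [mse(ys, h) for h in candidates]
best = candidates[errors.index(min(errors))]

for h, err in zip(candidates, errors):
    print(f"h = {h:8.2f}   empirical MSE = {err:8.2f}")
```

Moving the prediction away from the sample mean by an amount d raises the empirical MSE by d squared, which is why the penalty grows quickly for the more distant candidates.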

Now imagine our prediction of Y is based on some data \(\mathcal{D}\) about the relationship between X and Y, such as last year’s daily temperatures and gelato sales in Rome. The specific dataset \(\mathcal{D}\) will become important later. For now, view our prediction of Y as a function of X, written \(h(X)\). We aim to minimize \(\mathbb{E} \left[ (Y - h(X))^2 \right]\), where the accuracy of \(h(\cdot)\) depends on the possible values of X, represented by the conditional expectation

\[\mathbb{E} \left[ (Y - h(X))^2 \right] = \mathbb{E} \left[ \mathbb{E} \left[ (Y - h(X))^2 \mid X\right] \right].\]

To evaluate this conditional prediction, we use the same method as before, now accounting for X. For each possible value x of X, the best prediction of Y is the conditional mean, \(\mathbb{E}\left[ Y \mid X = x\right]\). The regression function of Y on X, \(r(x)\), gives the optimal prediction of Y for each value x of X:

\[r(x) := \mathbb{E}\left[ Y \mid X = x\right].\]
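The regression function can be approximated from data by averaging Y within each observed value of X. The sketch below does this for a small set of temperatures. The function `true_r`, the noise level, and the competing rule `5.0 * x` are assumptions made up for illustration, not part of the account above.

```python
import random
from collections import defaultdict

rng = random.Random(1)


def true_r(x):
    # Hypothetical regression function: demand rises nonlinearly with temperature.
    return 2.0 * x + 0.1 * x ** 2


temps = [20, 25, 30, 35]  # hypothetical temperatures in degrees Celsius
data = [(x, true_r(x) + rng.gauss(0.0, 5.0)) for x in temps for _ in range(500)]

# Estimate r(x) = E[Y | X = x] by the average of Y within each value of X.
groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)
r_hat = {x: sum(v) / len(v) for x, v in groups.items()}


def mse(predict):
    """Empirical mean squared error of a prediction rule over the sample."""
    return sum((y - predict(x)) ** 2 for x, y in data) / len(data)


mse_cond_mean = mse(lambda x: r_hat[x])
mse_linear_guess = mse(lambda x: 5.0 * x)  # an arbitrary competing rule
```

Within the sample, no other rule that assigns one prediction per temperature can have a lower mean squared error than the group averages, which mirrors the optimality of the conditional mean in the population.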

Although the regression function represents the true population value of Y given X, this function is usually unknown and often complex, leading to its approximation by a simplified model or learning algorithm, \(h(\cdot)\).

We might restrict candidates for \(h(X)\) to linear (or affine) functions of X. Making predictions about Y with a simplified model can introduce a systematic prediction error called bias. Bias results from the difference between the central tendency of data generated by the true model, \(r(X)\) (for all \(x \in X\)), and the central tendency of our estimator, \(\mathbb{E}\left[h(X)\right]\), written

\[\textrm{Bias}(h(X)) := r(X) - \mathbb{E}\left[h(X) \right],\]

where any non-zero difference between the pair is interpreted as a systematically positive or systematically negative error of the estimator, \(h(X)\).

Variance measures the average squared deviation of a random variable from its expected value. In this context, we compare the predicted value \(h(X)\) of Y, based on data \(\mathcal{D}\) about the relationship between X and Y, to the average value of \(h(X)\), \(\mathbb{E}\left[ h(X) \right]\). We express this variance as:

\[\textrm{Var}(h(X)) = \mathbb{E}\left[( \mathbb{E}\left[ h(X) \right] - h(X))^2\right].\]

The bias-variance decomposition of mean squared error is fundamental in frequentist statistics, where the goal is to compute an estimate \(h(X)\) of the "true parameter" \(r(X)\) using data \(\mathcal{D}\). Here \(r(X)\) is assumed to be fixed and the data \(\mathcal{D}\) is treated as a random quantity, exactly the reverse of Bayesian statistics. This means that the data set \(\mathcal{D}\) is interpreted to be one among many possible data sets of the same dimension generated by the true data generating process, \(r(X)\).
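The idea of one data set among many can be made concrete by simulation. The following sketch draws many hypothetical data sets from a fixed process, fits a straight line to each by ordinary least squares, and records the prediction at a single point `x0`. The quadratic form of `r`, the noise level, the sample size, and `x0` are all assumptions chosen for illustration. Because the fitted model is linear while the assumed truth is curved, a systematic error at `x0` is to be expected.

```python
import random

rng = random.Random(2)


def r(x):
    # Hypothetical true regression function (an assumption for illustration).
    return x ** 2


def draw_dataset(n=20, noise_sd=0.5):
    """Draw one data set D of n (x, y) pairs from the fixed process."""
    xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
    ys = [r(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns the fitted predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


x0 = 2.0  # the point at which predictions are compared
# One prediction h(x0) per simulated data set.
preds = [fit_line(*draw_dataset())(x0) for _ in range(2000)]

mean_pred = sum(preds) / len(preds)  # estimate of E[h(x0)]
bias = r(x0) - mean_pred  # Bias(h(x0)) = r(x0) - E[h(x0)]
variance = sum((mean_pred - p) ** 2 for p in preds) / len(preds)  # Var(h(x0))

print(f"bias = {bias:.3f}, variance = {variance:.3f}")
```

Here the spread of `preds` across data sets estimates the variance, while the gap between their average and `r(x0)` estimates the bias.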

Following Christopher M. Bishop (2006), the bias-variance decomposition of MSE can be derived as follows. Let h refer to our estimate \(h(X)\) of Y, r refer to the true value of Y, and \(\mathbb{E}\left[ h \right]\) the expected value of the estimate h. Then,

\[\begin{align} &\textrm{MSE}(h) \\ &\quad = \mathbb{E}\left[ ( r -h)^2 \right] \\ &\quad = \mathbb{E}\left[ \left( \left( r - \mathbb{E}\left[ h \right] \right) + \left( \mathbb{E}\left[ h \right] - h \right) \right )^2 \right] \\ &\quad = \mathbb{E}\left[ \left( r - \mathbb{E}\left[ h \right] \right)^2 \right] + \mathbb{E}\left[ \left( \mathbb{E}\left[ h \right] - h \right)^2 \right] + 2 \mathbb{E}\left[ \left( \mathbb{E}\left[ h \right] - h \right) \cdot \left( r - \mathbb{E}\left[ h \right] \right) \right] \\ &\quad = \left( r - \mathbb{E}\left[h \right] \right)^2 + \mathbb{E}\left[ \left( \mathbb{E}\left[h \right] - h \right)^2\right] + 0\\ &\quad = \textrm{Bias}(h)^2 \ + \ \textrm{Var}(h) \end{align}\]

where the term \(2 \mathbb{E}\left[ \left( \mathbb{E}\left[ h \right] - h \right) \cdot \left( r - \mathbb{E}\left[ h \right] \right) \right]\) is zero, since

\[\begin{align} &\mathbb{E}\left[ \left( \mathbb{E}\left[ h \right] - h \right) \cdot \left( r - \mathbb{E}\left[ h \right] \right) \right] \notag\\ &\qquad = \mathbb{E} \left[ r \cdot \mathbb{E} \left[ h \right] \right] - \mathbb{E}\left[ \mathbb{E}\left[ h \right]^2 \right] - \mathbb{E}\left[ h \cdot r \right] + \mathbb{E}\left[ h\cdot \mathbb{E}\left[ h \right] \right] \tag{1}\\ &\qquad = r \cdot \mathbb{E} \left[ h \right] - \mathbb{E}\left[ h \right]^2 - r \cdot \mathbb{E} \left[ h \right] + \mathbb{E}\left[ h \right]^2 \label{eq:owl}\tag{2}\\ &\qquad = 0. \tag{3} \end{align}\]

Note that the frequentist assumption that r is a deterministic process is necessary for the derivation to go through; for if r were a random quantity, the reduction of \(\mathbb{E} \left[ r \cdot \mathbb{E} \left[ h \right] \right]\) to \( r \cdot \mathbb{E} \left[ h \right]\) in line (2) would be invalid.
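The cancellation of the cross term can also be checked numerically. In the sketch below, the fixed value `r` and the simulated predictions `hs` are hypothetical, and sample averages stand in for expectations. With those substitutions, the identity holds up to floating-point rounding rather than only approximately.

```python
import random

rng = random.Random(3)

r = 4.0  # fixed "true" value, treated as deterministic
# Hypothetical predictions h from many data sets: off-centre (biased) and noisy.
hs = [rng.gauss(3.5, 0.4) for _ in range(5000)]


def mean(values):
    """Sample average, standing in for the expectation."""
    return sum(values) / len(values)


e_h = mean(hs)  # stands in for E[h]
mse = mean([(r - h) ** 2 for h in hs])  # E[(r - h)^2]
bias_sq = (r - e_h) ** 2  # (r - E[h])^2
var = mean([(e_h - h) ** 2 for h in hs])  # E[(E[h] - h)^2]
cross = mean([(e_h - h) * (r - e_h) for h in hs])  # the cross term

print(f"MSE          = {mse:.6f}")
print(f"Bias^2 + Var = {bias_sq + var:.6f}")
print(f"cross term   = {cross:.2e}")
```

The cross term vanishes because `r - e_h` is a constant that factors out of the average, leaving the average deviation of `hs` from its own mean, which is zero.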

Finally, prediction error due to noise, N, which arises independently of the model or learning algorithm, adds irreducible error. Thus, the full bias-variance decomposition of the mean-squared error is:

\[\tag{4}\label{eq-puppy} \textrm{MSE}(h)\ = \ \textrm{Bias}(h)^2 \ + \ \textrm{Var}(h) \ + \ N\]
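One way to see where the noise term comes from, sketched here under assumptions not stated above: suppose \(Y = r + \varepsilon\), where the noise \(\varepsilon\) has mean zero and variance \(\sigma^2\) and is independent of h. On that reading, N corresponds to \(\sigma^2\), and the error measured against Y rather than r expands as

\[\begin{align} \mathbb{E}\left[ (Y - h)^2 \right] &= \mathbb{E}\left[ \left( (r - h) + \varepsilon \right)^2 \right] \notag\\ &= \mathbb{E}\left[ (r - h)^2 \right] + 2\, \mathbb{E}\left[ \varepsilon \right] \cdot \mathbb{E}\left[ r - h \right] + \mathbb{E}\left[ \varepsilon^2 \right] \notag\\ &= \textrm{Bias}(h)^2 \ + \ \textrm{Var}(h) \ + \ \sigma^2, \notag \end{align}\]

where the middle term vanishes because \(\mathbb{E}\left[ \varepsilon \right] = 0\), and independence is what allows the expectation of the product \(\varepsilon \cdot (r - h)\) to be written as a product of expectations.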

Copyright © 2024 by
Gregory Wheeler <g.wheeler@fs.de>

This is a file in the archives of the Stanford Encyclopedia of Philosophy.