Choosing the R² of a Data Generating Process

If you have ever set up a linear regression, there is a good chance you’re familiar with the R² measure, also known as the coefficient of determination. The measure is not limited to linear models; it serves as a general goodness-of-fit indicator in regression problems. The main idea behind the R² is to express, as a percentage, how much of the dependent variable’s variance is predictable from the independent variables.

For my research, I recently wanted to create a data generating process (DGP) with a specific degree of predictability, i.e., a specific R². This post discusses the general intuition behind creating a DGP with a predefined R².

Structure of the DGP

To begin, let us introduce a simple DGP with a linear structure:

\[\mathbf{y} = \alpha + \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon},\]

where \(\alpha\) and \(\boldsymbol{\beta}\) are the coefficients of our DGP, \(\mathbf{y}\) is the dependent variable, \(\mathbf{X}\) are the independent variables, and \(\boldsymbol{\epsilon}\) is a random error term with zero mean. The error term \(\boldsymbol{\epsilon}\) has a variance of \(\sigma^2_\epsilon\), and the explanatory variables \(\mathbf{X}\) are randomly distributed with mean \(\boldsymbol{\mu}\) and variance-covariance matrix \(\boldsymbol{\Sigma}\).

This DGP structure makes things a little easier, but, in general, any other DGP could be chosen, and the following concepts will still apply.
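To make the setup concrete, here is a minimal sketch of how such a DGP could be simulated with NumPy. All parameter values (`N`, `alpha`, `beta`, `mu`, `Sigma`, `sigma2_eps`) are hypothetical and chosen only for illustration, and the sketch assumes Gaussian predictors and Gaussian errors drawn independently of the predictors, which the DGP above does not strictly require.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Hypothetical parameter values, chosen purely for illustration.
N = 100_000
alpha = 1.0
beta = np.array([0.5, -1.0, 2.0])
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])
sigma2_eps = 4.0

# Assumption: Gaussian predictors, Gaussian errors drawn independently of X.
X = rng.multivariate_normal(mu, Sigma, size=N)
eps = rng.normal(0.0, np.sqrt(sigma2_eps), size=N)

# The linear DGP: y = alpha + X beta + eps
y = alpha + X @ beta + eps
```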

Decomposing the coefficient of determination

As previously noted, we want to create our DGP with a specific R², which we will call \(r^\star\). For observations \(i=1, 2, \dots, N\), the formula of the R² measure is given by

\[\begin{align} R^2 &= 1 - \frac{\text{residual sum of squares}}{\text{total sum of squares}} \\ &= 1 - \frac{\sum_{i=1}^N \left(y_i - \hat{y}_i \right)^2}{\sum_{i=1}^N \left(y_i - \bar{y}\right)^2} \end{align}\]

Looking at the total sum of squares, \(\sum_{i=1}^N \left(y_i - \bar{y}\right)^2\), we see that this quantity is nothing but the sample variance of the dependent variable multiplied by \(N-1\) (or \(N\) if you don’t use Bessel’s correction), i.e., we have:

\[\text{total sum of squares} = (N-1) \cdot \text{Var}\left(y\right).\]

Given knowledge of \(\alpha\) and \(\boldsymbol{\beta}\), we can model our target, \(\hat{\mathbf{y}}\), as

\[\hat{\mathbf{y}} = \alpha + \mathbf{X} \boldsymbol{\beta},\]

thus we can rewrite the residual sum of squares (the part of the variation that is not explained) as

\[\begin{align} \text{residual sum of squares} &= \sum_{i=1}^N \left(y_i - \hat{y}_i\right)^2 \\ &= \sum_{i=1}^N \left(y_i - \alpha - \mathbf{X}_i \boldsymbol{\beta} \right)^2 \\ &= \sum_{i=1}^N \left(\alpha + \mathbf{X}_i \boldsymbol{\beta} + \epsilon_i - \alpha - \mathbf{X}_i \boldsymbol{\beta} \right)^2 \\ &= \sum_{i=1}^N \epsilon_i^2. \end{align}\]

Because the mean of the error term is zero, the same logic as above applies. In a finite sample the relationship is only approximate, since the sample mean of the errors is close to, but not exactly, zero:

\[\text{residual sum of squares} \approx (N-1) \cdot \text{Var}(\epsilon).\]
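As a quick numerical sanity check of the two sums above, the following sketch simulates a one-predictor DGP and computes the R² once from the sums over \((y_i - \hat{y}_i)^2\) and \((y_i - \bar{y})^2\), and once from the sample variances. The parameter values are hypothetical, and the errors are assumed to be Gaussian and independent of the predictor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical one-predictor DGP; all values are illustrative assumptions.
N, alpha, beta, sigma_eps = 50_000, 2.0, 1.5, 1.0
x = rng.normal(0.0, 1.0, size=N)
eps = rng.normal(0.0, sigma_eps, size=N)
y = alpha + beta * x + eps

y_hat = alpha + beta * x              # prediction under the true coefficients
ss_res = np.sum((y - y_hat) ** 2)     # collapses to the sum of eps_i^2
ss_tot = np.sum((y - y.mean()) ** 2)  # (N - 1) times the sample variance of y

r2_sums = 1.0 - ss_res / ss_tot
r2_vars = 1.0 - np.var(eps, ddof=1) / np.var(y, ddof=1)
print(r2_sums, r2_vars)
```

The two values should agree closely but not exactly, because the sample mean of the simulated errors is not exactly zero.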

Expressing R² in terms of DGP parameters

Putting it all together, the R² can be rewritten as

\[R^2 = 1 - \frac{\text{Var}(\epsilon)}{\text{Var}(y)},\]

where \(\text{Var}(\epsilon) = \sigma^2_\epsilon\) is a parameter of our DGP. Hence, all we have to do is express the variance of the dependent variable, \(\text{Var}(y)\), in terms of the DGP parameters, so that the coefficient of determination becomes a function of those parameters alone. Assuming the error term is uncorrelated with the predictors, the covariance between \(\mathbf{X} \boldsymbol{\beta}\) and \(\boldsymbol{\epsilon}\) vanishes, and we get:

\[\begin{align} \text{Var}(\mathbf{y}) &= \text{Var}\left(\alpha + \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\right) \\ &= \boldsymbol{\beta}^\top \text{Var}(\mathbf{X}) \boldsymbol{\beta} + \text{Var}(\boldsymbol{\epsilon}) \\ &= \boldsymbol{\beta}^\top \boldsymbol{\Sigma} \boldsymbol{\beta} + \sigma^2_\epsilon. \end{align}\]
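The variance decomposition can be checked by simulation. The sketch below compares \(\boldsymbol{\beta}^\top \boldsymbol{\Sigma} \boldsymbol{\beta} + \sigma^2_\epsilon\) with the sample variance of a simulated \(\mathbf{y}\). The parameter values are hypothetical, and the errors are drawn independently of the predictors, which is the assumption under which the cross term drops out.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameters, for illustration only.
beta = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
sigma2_eps = 0.5
N = 200_000

X = rng.multivariate_normal(np.zeros(2), Sigma, size=N)
eps = rng.normal(0.0, np.sqrt(sigma2_eps), size=N)  # independent of X
y = 3.0 + X @ beta + eps  # the intercept (3.0) does not affect the variance

var_theory = beta @ Sigma @ beta + sigma2_eps  # beta' Sigma beta + sigma^2
var_sample = np.var(y, ddof=1)
print(var_theory, var_sample)
```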

That’s it. We can now write our target R², \(r^\star\), as

\[r^\star(\boldsymbol{\beta}, \boldsymbol{\Sigma}, \sigma^2_\epsilon) = 1 - \frac{\sigma^2_\epsilon}{\boldsymbol{\beta}^\top \boldsymbol{\Sigma} \boldsymbol{\beta} + \sigma^2_\epsilon},\]

which gives us a nifty formula such that we can tune the variance of the error term, conditional on the chosen distribution of the independent variables and the selected coefficients. Alternatively, one could also fix the variance of the error term and tune the coefficients or the variance-covariance matrix of the predictors. However, this feels less intuitive, at least to me.
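Tuning the error variance amounts to solving the formula for \(\sigma^2_\epsilon\), which gives \(\sigma^2_\epsilon = \boldsymbol{\beta}^\top \boldsymbol{\Sigma} \boldsymbol{\beta} \cdot (1 - r^\star) / r^\star\) for \(0 < r^\star \le 1\). The sketch below wraps this in a helper and checks the result by simulation. The function name `noise_variance_for_r2` and all parameter values are hypothetical, and Gaussian draws with errors independent of the predictors are assumed.

```python
import numpy as np

def noise_variance_for_r2(beta, Sigma, r_star):
    """Error variance implied by a target R² (hypothetical helper)."""
    if not 0.0 < r_star <= 1.0:
        raise ValueError("r_star must lie in (0, 1]")
    signal_var = beta @ Sigma @ beta  # beta' Sigma beta
    return signal_var * (1.0 - r_star) / r_star

rng = np.random.default_rng(3)

# Hypothetical coefficients, predictor covariance, and target R².
beta = np.array([0.5, -1.0, 2.0])
Sigma = np.eye(3)
r_star = 0.3

sigma2_eps = noise_variance_for_r2(beta, Sigma, r_star)

# Simulate the DGP with the tuned error variance and measure the realized R².
N = 200_000
X = rng.multivariate_normal(np.zeros(3), Sigma, size=N)
eps = rng.normal(0.0, np.sqrt(sigma2_eps), size=N)
y = 1.0 + X @ beta + eps

r2_realized = 1.0 - np.sum(eps ** 2) / np.sum((y - y.mean()) ** 2)
print(sigma2_eps, r2_realized)
```

The realized R² fluctuates around the target from sample to sample; the formula pins down the population value, not the value in any particular draw.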

As both the equation above and the plot below show, holding \(\boldsymbol{\beta}^\top \boldsymbol{\Sigma} \boldsymbol{\beta}\) constant, increasing the variance of the error term \(\sigma^2_\epsilon\) reduces the R².
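The same relationship can be tabulated directly from the formula. In this sketch, the helper name `population_r2`, the value used for \(\boldsymbol{\beta}^\top \boldsymbol{\Sigma} \boldsymbol{\beta}\), and the grid of error variances are all illustrative assumptions.

```python
import numpy as np

def population_r2(signal_var, sigma2_eps):
    """R² implied by the formula for r* (hypothetical helper name)."""
    return 1.0 - sigma2_eps / (signal_var + sigma2_eps)

signal_var = 5.25  # illustrative value of beta' Sigma beta, held constant
sigma2_grid = np.array([0.0, 1.0, 5.25, 20.0, 100.0])
r2_grid = population_r2(signal_var, sigma2_grid)

for s2, r2 in zip(sigma2_grid, r2_grid):
    print(f"sigma2_eps = {s2:7.2f}  ->  R2 = {r2:.3f}")
```

With no noise the R² is one, it equals one half when the error variance matches \(\boldsymbol{\beta}^\top \boldsymbol{\Sigma} \boldsymbol{\beta}\), and it approaches zero as the error variance grows.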

Et voilà: if you ever need to build a DGP with a specific degree of predictability, all there is to do is tune the error term’s variance \(\sigma^2_\epsilon\) relative to \(\boldsymbol{\beta}^\top \boldsymbol{\Sigma} \boldsymbol{\beta}\), the variance contributed by the predictors and their coefficients.