Intro to Statistics and Data Science

6 Lab: Linear Model Selection and Regularization

This is a modified version of Lab 1 (Subset Selection Methods), Lab 2 (Ridge Regression and the Lasso), and Lab 3 (PCR and PLS Regression) from Chapter 6 of An Introduction to Statistical Learning with Applications in R. This version uses tidyverse techniques and methods that allow for scalability and a more efficient data analytic pipeline.

We will need the packages loaded below.
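A sketch of a plausible setup chunk follows; the exact package list used in the original lab is not shown here, so treat this as an assumption.

```r
# Packages assumed for this lab
library(tidyverse)    # data manipulation, purrr, and plotting
library(janitor)      # clean_names() for snake_case variable names
library(skimr)        # skim() for quick data summaries
library(leaps)        # regsubsets() for subset selection
library(glmnetUtils)  # formula interface to glmnet (ridge & lasso)
library(pls)          # pcr() and plsr()
library(ISLR)         # the Hitters dataset
```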

Since we will be making use of validation set and cross-validation techniques to build and select models, we should set our seed to ensure the analyses and procedures being performed are reproducible. Readers will be able to reproduce the lab results only if they run all code in sequence from start to finish; code chunks cannot be run out of order. Setting the seed should occur towards the top of an analytic script, say directly after loading the necessary packages.
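For example (the seed value itself is arbitrary):

```r
# Set the seed once, directly after loading packages, so every
# resampling step below is reproducible
set.seed(27)
```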

6.1 Data Setup

We will be using the Hitters dataset from the ISLR package. Take a moment to inspect the codebook with ?ISLR::Hitters. Our goal will be to predict salary from the other available predictors.

Let’s skim the data to make sure it has been encoded correctly and quickly inspect it for other potential issues such as missing data.
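A sketch of the kind of call that produces the summary below, assuming the variable names are converted to snake_case with janitor::clean_names() (the object name hitters is an assumption):

```r
# Pull Hitters into a tibble, keep player names as a column,
# convert names to snake_case, and skim for a quick overview
hitters <- ISLR::Hitters %>%
  as_tibble(rownames = "name") %>%
  janitor::clean_names()

hitters %>%
  skimr::skim()
```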

## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             322       
## Number of columns          21        
## _______________________              
## Column type frequency:               
##   character                1         
##   factor                   3         
##   numeric                  17        
## ________________________             
## Group variables            None      
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 name                  0             1     7    17     0      322          0
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts    
## 1 league                0             1 FALSE          2 A: 175, N: 147
## 2 division              0             1 FALSE          2 W: 165, E: 157
## 3 new_league            0             1 FALSE          2 A: 176, N: 146
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##    skim_variable n_missing complete_rate    mean      sd    p0   p25    p50    p75  p100
##  1 at_bat                0         1      381.    153.    16   255.   380.   512     687
##  2 hits                  0         1      101.     46.5    1    64     96    137     238
##  3 hm_run                0         1       10.8     8.71   0     4      8     16      40
##  4 runs                  0         1       50.9    26.0    0    30.2   48     69     130
##  5 rbi                   0         1       48.0    26.2    0    28     44     64.8   121
##  6 walks                 0         1       38.7    21.6    0    22     35     53     105
##  7 years                 0         1        7.44    4.93   1     4      6     11      24
##  8 c_at_bat              0         1     2649.   2324.    19   817.  1928   3924.  14053
##  9 c_hits                0         1      718.    654.     4   209    508   1059.   4256
## 10 c_hm_run              0         1       69.5    86.3    0    14     37.5   90     548
## 11 c_runs                0         1      359.    334.     1   100.   247    526.   2165
## 12 c_rbi                 0         1      330.    333.     0    88.8  220.   426.   1659
## 13 c_walks               0         1      260.    267.     0    67.2  170.   339.   1566
## 14 put_outs              0         1      289.    281.     0   109.   212    325    1378
## 15 assists               0         1      107.    137.     0     7     39.5  166     492
## 16 errors                0         1        8.04    6.37   0     3      6     11      32
## 17 salary               59         0.817  536.    451.    67.5 190    425    750    2460

We see there are some missing values, all of which are contained in the response variable salary. There isn’t much we can do about this unless we wanted to do some research ourselves and track the missing salaries down from another source, so we will remove these observations from our dataset. Note that we could still produce predicted salaries for these players; we just won’t be able to determine whether those predictions are any good.
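A minimal sketch of that step, assuming the data object is named hitters:

```r
# Drop the 59 players with a missing salary
hitters <- hitters %>%
  filter(!is.na(salary))
```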

We will be implementing model selection later in this lab, and we have a choice that impacts how we need to handle the dataset. We could opt to use \(C_p\), \(BIC\), or \(R^2_{adj}\), or we could use validation set or cross-validation approaches. Remember that \(C_p\), \(BIC\), and \(R^2_{adj}\) are methods that indirectly estimate the test \(MSE\), while validation set and cross-validation approaches provide direct estimates of the test \(MSE\). The indirect approaches rely on assumptions that may not be tenable, although they do make use of the entire dataset. We prefer the direct approaches to estimating test \(MSE\) because they are more flexible, and computational limitations are no longer a barrier.
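A sketch of how the data might be split into a model-building set and a model-comparison set; the object names match those used later in the lab, but the splitting proportion here is an assumption.

```r
# Hold out part of the data for comparing candidate models at the end
hitter_mod_bldg_dat <- hitters %>% slice_sample(prop = 0.85)
hitter_mod_comp_dat <- hitters %>% anti_join(hitter_mod_bldg_dat, by = "name")
```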

6.2 Subset Selection Methods

Results for best subset search.
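The table below could be produced with something like the following sketch, assuming hitter_mod_bldg_dat has been split further into a training set (hitter_train) and a validation set (hitter_valid); those two object names are hypothetical.

```r
# Best subset search on the training split; leaps has no predict() method,
# so predictions are built by hand from the validation model matrix
best_subset <- regsubsets(salary ~ . - name, data = hitter_train, nvmax = 19)

valid_mat <- model.matrix(salary ~ . - name, data = hitter_valid)

tibble(model_index = 1:19) %>%
  mutate(
    test_mse = map_dbl(model_index, function(i) {
      beta <- coef(best_subset, id = i)
      pred <- valid_mat[, names(beta)] %*% beta
      mean((hitter_valid$salary - pred)^2)
    })
  ) %>%
  arrange(test_mse)
```

The forward and backward tables further below use the same approach, swapping in method = "forward" or method = "backward" when calling regsubsets().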

## # A tibble: 19 x 2
##    model_index test_mse
##          <int>    <dbl>
##  1           9  114697.
##  2           8  120848.
##  3          10  122537.
##  4          11  125281.
##  5          12  127959.
##  6          13  128225.
##  7          14  129884.
##  8          15  130396.
##  9          17  130706.
## 10          16  131059.
## 11          18  131543.
## 12          19  132190.
## 13           4  138234.
## 14           7  139159.
## 15           6  140677.
## 16           2  144966.
## 17           3  145706.
## 18           5  147677.
## 19           1  151427.

Results for the forward selection search.

## # A tibble: 19 x 2
##    model_index test_mse
##          <int>    <dbl>
##  1          10  122800.
##  2           9  123859.
##  3          11  125631.
##  4          12  125944.
##  5          13  127514.
##  6          14  128240.
##  7           8  128615.
##  8          15  129167.
##  9          17  129379.
## 10          18  129533.
## 11          16  129768.
## 12          19  129838.
## 13           6  131553.
## 14           7  131891.
## 15           5  132531.
## 16           4  135324.
## 17           3  143196.
## 18           2  147530.
## 19           1  168375.

Results for the backward selection search.

## # A tibble: 19 x 2
##    model_index test_mse
##          <int>    <dbl>
##  1          10  124270.
##  2           9  125919.
##  3           8  126654.
##  4          11  128029.
##  5           6  130783.
##  6          12  132626.
##  7          13  133752.
##  8          15  133922.
##  9          16  133988.
## 10           5  134176.
## 11           7  134629.
## 12          19  134881.
## 13          14  135071.
## 14          18  135404.
## 15          17  135502.
## 16           4  142425.
## 17           2  143606.
## 18           3  144385.
## 19           1  164982.

The results above suggest that using a model with 9 predictors would be reasonable, regardless of selection method. We now implement each selection method on the entirety of hitter_mod_bldg_dat and extract the best model that uses 9 predictors from each. These will be 3 of our candidate models.
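A sketch of that step (the fit object names are assumptions):

```r
# Refit each search on all of hitter_mod_bldg_dat and keep the 9-predictor model
fit_best <- regsubsets(salary ~ . - name, data = hitter_mod_bldg_dat, nvmax = 19)
fit_fwd  <- regsubsets(salary ~ . - name, data = hitter_mod_bldg_dat, nvmax = 19,
                       method = "forward")
fit_back <- regsubsets(salary ~ . - name, data = hitter_mod_bldg_dat, nvmax = 19,
                       method = "backward")

coef(fit_best, id = 9)
coef(fit_fwd,  id = 9)
coef(fit_back, id = 9)
```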

All three of these candidate models have 9 predictors. It is possible that all three methods produce the same model. Let’s take a look at the estimated coefficients for these models.

name             best       fwd      back
(Intercept)   139.789   141.091   133.610
at_bat         -2.093    -1.556    -1.741
hits            6.833     5.548     6.110
walks           5.395     4.991     5.150
c_hm_run        1.796        NA        NA
c_runs          0.988     1.265     1.406
c_walks        -0.903    -0.824    -0.935
divisionW    -132.673  -130.710  -130.760
put_outs        0.307     0.270     0.282
assists         0.278        NA        NA
c_at_bat           NA    -0.092        NA
c_rbi              NA     0.692     0.683
c_hits             NA        NA    -0.350

As the above table demonstrates, the three methods selected three different models that use 9 of the available predictors. There are 7 predictors that all three models have in common. The forward and backward models have 8 predictors in common.

6.3 Ridge Regression & the Lasso

The glmnet package allows users to fit elastic net regressions, which include ridge regression (alpha = 0) and the lasso (alpha = 1). Unfortunately the package does not use our familiar formula syntax, y ~ x. It requires the user to encode their data such that the penalized least squares algorithm can be applied. Both lm() and glm() also require the data to be correctly encoded for the least squares algorithm to work, but this is done automatically within those functions by model.matrix(), which takes the inputted formula and appropriately re-encodes the data. glmnet requires the user to call model.matrix() themselves. Wouldn’t it be nice if the glmnet functions did this automatically, just like lm() and glm()? Fortunately the glmnetUtils package allows us to keep using our familiar formula interface with glmnet functions.

The tuning parameter for both ridge regression and the lasso is \(\lambda\), the size of the penalty to apply in the least squares algorithm. How do we go about tuning/finding an estimate for the \(\lambda\) parameter? We call upon cross-validation techniques once again. glmnet (via glmnetUtils) has a built-in function that will implement cross-validation quickly and efficiently for us, meaning there is no need for us to use crossv_kfold(). We will be using cv.glmnet(); use ?glmnetUtils::cv.glmnet to access the correct documentation. It is the alpha argument to these functions that determines whether we are using ridge regression (alpha = 0) or the lasso (alpha = 1).

Since we are in search of an adequate \(\lambda\), we need to specify a range of values to search, much like we specified the number of predictors to search over when using regsubsets() (we started with 1 and went up to 19). The difference here is that \(\lambda\) can be any positive real number (the size of the penalty to apply). Obviously we cannot use every possible value greater than 0, so we need to set up a grid search. We could go with an automatically selected grid for \(\lambda\) or we can build our own grid. We will demonstrate both.
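Here is a sketch of the custom-grid version for ridge regression; the grid endpoints (10^10 down to 10^-2) follow the convention in the ISLR lab and are an assumption about what was used here. The final plot() call produces the figure described next.

```r
# Custom grid: 200 candidate lambda values, evenly spaced on the log scale
lambda_grid <- 10^seq(10, -2, length = 200)

# 10-fold cross-validation for ridge regression (alpha = 0),
# using the glmnetUtils formula interface
ridge_cv <- cv.glmnet(
  salary ~ . - name,
  data   = hitter_mod_bldg_dat,
  alpha  = 0,
  lambda = lambda_grid,
  nfolds = 10
)

plot(ridge_cv)
```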

The following plot displays the 200 estimated test \(MSE\) values from 10-fold cross-validation. The plot indicates the minimum \(\lambda\) (left dotted line). The right dotted line corresponds to a \(\lambda\) value that would result in a more regularized model (forcing more, or further, shrinkage of coefficients towards zero) and that is within one standard error of the minimum \(\lambda\) (the models are not all that different in estimated test \(MSE\)). Why would we desire a more regularized model? A more regularized model typically shifts more of the prediction burden/importance onto fewer predictor variables. The top of the plot periodically displays the number of predictors in the models; in the case of ridge regression this number is very unlikely to change, since ridge regression is highly unlikely to force any of the coefficients to be exactly zero.

Strictly speaking, the best \(\lambda\) is the one which minimizes the test error. We will extract that value to use for ridge regression. We will also extract the \(\lambda\) that is within one standard error, which can help safeguard against over-fitting.
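Extracting both values is straightforward (assuming the ridge_cv object from the sketch above):

```r
# Lambda minimizing CV error, and the largest lambda within one SE of it
ridge_lambda_min <- ridge_cv$lambda.min
ridge_lambda_1se <- ridge_cv$lambda.1se
```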

Let’s get the best \(\lambda\) values for the lasso. Recall that alpha = 1 now and we will let the function select the search grid for \(\lambda\).
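A sketch, mirroring the ridge call but with alpha = 1 and no user-supplied grid; the plot() call produces the figure discussed below.

```r
# 10-fold cross-validation for the lasso; cv.glmnet picks its own lambda grid
lasso_cv <- cv.glmnet(
  salary ~ . - name,
  data   = hitter_mod_bldg_dat,
  alpha  = 1,
  nfolds = 10
)

plot(lasso_cv)

lasso_lambda_min <- lasso_cv$lambda.min
lasso_lambda_1se <- lasso_cv$lambda.1se
```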

There is an extremely important difference between this plot, which uses the lasso, and the plot above, which uses ridge regression: the number of predictors (top of graph) decreases as \(\lambda\) increases. This is an important distinction between the lasso and ridge regression. The lasso forces coefficients to be exactly zero (i.e. it performs variable selection). For the lasso, a more regularized model is therefore a more parsimonious (simpler) model. This is extremely useful when theory indicates that only a handful of variables are likely to be truly related to the response, with the others over-fitting to irreducible error (i.e. they won’t be useful for out-of-sample prediction). For this reason it is fairly standard practice to use lambda.1se as the best \(\lambda\).

Now that we have identified the best \(\lambda\) values for each method, we need to fit them to the entire model building dataset hitter_mod_bldg_dat. These will be our candidate models produced by ridge regression and the lasso. We do this within a tibble framework so that later we are able to calculate each model’s test error (test \(MSE\)) on hitter_mod_comp_dat. This will allow us to easily compare all of our candidate models against each other in order to pick a final prediction model.
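One way to organize this, sketched under the assumption that the \(\lambda\) values were stored as above (the tibble and column names are hypothetical):

```r
# Candidate ridge and lasso models fit to all of hitter_mod_bldg_dat,
# stored in a tibble so test MSE can be computed later on hitter_mod_comp_dat
glmnet_candidates <- tibble(
  method = c("ridge_min", "ridge_1se", "lasso_min", "lasso_1se"),
  alpha  = c(0, 0, 1, 1),
  lambda = c(ridge_lambda_min, ridge_lambda_1se, lasso_lambda_min, lasso_lambda_1se)
) %>%
  mutate(
    fit = map2(alpha, lambda,
               ~ glmnet(salary ~ . - name, data = hitter_mod_bldg_dat,
                        alpha = .x, lambda = .y))
  )
```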

Let’s take a look at the coefficients for the candidate models produced by ridge regression and the lasso.

name          ridge_min  ridge_1se  lasso_min  lasso_1se
(Intercept)     -18.450    232.822    -73.143    221.115
at_bat            0.085      0.088         NA         NA
hits              0.892      0.345      2.076      1.315
hm_run            1.199      1.247         NA         NA
runs              0.980      0.541         NA         NA
rbi               0.804      0.572         NA         NA
walks             1.281      0.669      1.046      0.410
years             1.552      2.635         NA         NA
c_at_bat          0.012      0.008         NA         NA
c_hits            0.060      0.028         NA         NA
c_hm_run          0.615      0.243      0.970      0.464
c_runs            0.127      0.058      0.329      0.180
c_rbi             0.139      0.062      0.154      0.206
c_walks           0.043      0.060         NA         NA
leagueA         -12.689      0.122         NA         NA
leagueN          12.693     -0.122         NA         NA
divisionE        58.498     17.297    113.273         NA
divisionW       -58.503    -17.297     -2.676         NA
put_outs          0.195      0.048      0.247      0.026
assists           0.077      0.007         NA         NA
errors           -1.633     -0.141         NA         NA
new_leagueA       2.442      0.236         NA         NA
new_leagueN      -2.448     -0.236         NA         NA

6.4 PCR & PLS Regression

We will be using pcr() from the pls package to implement principal component regression. Recall that the purpose of PCR is to reduce the dimension of our predictor set. That is, we want to find a set of \(m\) principal components (linear combinations of the predictors) that we will then use in a linear regression. This is to solve the problem of having too many predictors relative to observations, especially when the number of predictors exceeds the number of observations (\(p > n\)).

Hopefully it is clear that \(m\), the number of principal components, is just a tuning parameter, meaning we can once again turn to cross-validation techniques to pick the optimal \(m\) (optimal in that we want the lowest possible test error). We can set validation = "CV" in pcr() to conduct 10-fold cross-validation by default. Note that you can control the number of folds by specifying, for example, segments = 5 for 5-fold cross-validation.

Before we proceed to tuning the \(m\) parameter there is one more important decision to be made: do we want to standardize each predictor variable prior to forming the principal components? This is a very important decision because forming principal components is sensitive to scale, and standardizing each variable forces them all onto the same scale. If we choose to standardize the predictors, then we are saying they are all equally important and none should be prioritized over another because of how it is measured. If not, we are indicating that the scale of the predictor variables is important and should be taken into account when forming principal components. In most cases we will want scale = TRUE because we will have many variables measured on many different scales, with no reasonable justification to believe that the scale of measurement should matter. For example, a variable such as height does not become more useful when measured in centimeters rather than in meters.
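A sketch of the fit, with standardized predictors and 10-fold cross-validation (the object name pcr_fit is an assumption):

```r
# Principal component regression on the model-building data
pcr_fit <- pcr(
  salary ~ . - name,
  data       = hitter_mod_bldg_dat,
  scale      = TRUE,
  validation = "CV"
)
```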

We can use validationplot() to find the \(m\) (number of components) that corresponds to the lowest estimated test error (reported as either root mean squared error or mean squared error). The plots below indicate that 6 would be the optimal number of principal components; a vertical line has been added at 6 to help identify it.
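For example (assuming the pcr_fit object above):

```r
# Estimated test error (MSEP) versus number of components,
# with a vertical guide at the apparent minimum
validationplot(pcr_fit, val.type = "MSEP")
abline(v = 6, lty = "dashed")
```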

If it becomes too difficult to determine the optimal value using a validationplot() then use summary() to get a table of the values.

## Data:    X dimension: 224 19 
##  Y dimension: 224 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           458.7    351.4    352.4    352.8    351.3    348.4    339.9
## adjCV        458.7    351.0    352.0    352.3    350.8    348.0    339.1
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       341.8      345    344.1     349.8     352.4     353.4     356.3
## adjCV    341.0      344    343.2     348.2     350.8     351.7     354.4
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        356.5     351.1     344.3     351.4     352.8     358.3
## adjCV     353.6     349.2     342.5     349.0     350.2     355.4
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         37.57    60.18    71.18    79.31    84.61    88.95    92.34    95.01
## salary    42.30    43.27    43.67    44.55    46.10    48.66    48.66    48.77
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         96.31     97.33     98.06     98.73     99.19     99.47     99.75
## salary    49.23     50.39     50.48     50.58     50.59     51.66     51.99
##         16 comps  17 comps  18 comps  19 comps
## X          99.89     99.97     99.99    100.00
## salary     53.62     54.97     55.31     55.38

Let’s move on to calculating the optimal \(m\) when using plsr(), partial least squares regression. Note that \(m\) stands for the number of partial least squares directions (much like the number of principal components).
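The fit mirrors the PCR call (a sketch; the object name pls_fit is an assumption):

```r
# Partial least squares regression with standardized predictors and 10-fold CV
pls_fit <- plsr(
  salary ~ . - name,
  data       = hitter_mod_bldg_dat,
  scale      = TRUE,
  validation = "CV"
)

validationplot(pls_fit, val.type = "MSEP")
```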

Looks like 4 partial least squares directions produce the lowest cross-validation error.

There doesn’t appear to be much difference between using 2, 3, or 4 partial least squares directions (see the plot above and the table below). The fewer directions the better, because the purpose of partial least squares regression is to reduce the dimension of our predictor space. Let’s also plan on proposing a candidate model with 2 partial least squares directions.

## Data:    X dimension: 224 19 
##  Y dimension: 224 1
## Fit method: kernelpls
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           458.7    346.5    344.3    343.7    343.7    348.9    349.6
## adjCV        458.7    346.2    343.6    342.7    342.6    347.2    347.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       349.1    347.1    351.7     347.0     346.6     346.5     348.3
## adjCV    347.1    344.8    349.1     344.8     344.1     344.4     346.0
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        350.7     349.8     349.2     349.9     350.0     352.9
## adjCV     348.1     347.3     346.8     347.5     347.5     350.2
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         37.35    50.19    65.57    73.78    78.79    84.70    88.71    90.62
## salary    44.80    48.39    49.82    50.67    51.88    52.57    53.18    54.24
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         92.81     93.85      94.5     97.24     98.38     98.57     99.15
## salary    54.58     54.85      55.1     55.14     55.19     55.28     55.30
##         16 comps  17 comps  18 comps  19 comps
## X          99.40     99.71     99.97    100.00
## salary     55.34     55.35     55.36     55.38

Now that we have identified the best \(m\) for each method, we need to fit them to the entire model building dataset hitter_mod_bldg_dat. These will be the candidate models produced by principal component regression and partial least squares regression. We do this within a tibble framework so that later we are able to calculate each model’s test error (test \(MSE\)) on hitter_mod_comp_dat. This will allow us to easily compare all of our candidate models against each other in order to pick a final prediction model.
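A sketch of that final step for the dimension-reduction candidates; the numbers of components come from the tuning above, while the tibble and column names are hypothetical.

```r
# Fit the chosen PCR and PLS candidates on all of hitter_mod_bldg_dat,
# then estimate test MSE on the held-out hitter_mod_comp_dat
dim_reduction_candidates <- tibble(
  method = c("pcr_6", "pls_4", "pls_2"),
  ncomp  = c(6, 4, 2),
  fit    = list(
    pcr(salary ~ . - name,  data = hitter_mod_bldg_dat, scale = TRUE, ncomp = 6),
    plsr(salary ~ . - name, data = hitter_mod_bldg_dat, scale = TRUE, ncomp = 4),
    plsr(salary ~ . - name, data = hitter_mod_bldg_dat, scale = TRUE, ncomp = 2)
  )
) %>%
  mutate(
    test_mse = map2_dbl(fit, ncomp, function(mod, m) {
      pred <- drop(predict(mod, newdata = hitter_mod_comp_dat, ncomp = m))
      mean((hitter_mod_comp_dat$salary - pred)^2)
    })
  )
```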