Intro to Statistics and Data Science

6 Lab: Linear Model Selection and Regularization

This is a modified version of Lab 1 (Subset Selection Methods), Lab 2 (Ridge Regression and the Lasso), and Lab 3 (PCR and PLS Regression) from Chapter 6 of An Introduction to Statistical Learning with Applications in R. This version uses tidyverse techniques and methods that allow for scalability and a more efficient data analytic pipeline.

We will need the packages loaded below.
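A sketch of a plausible setup chunk follows; the exact package list used in the original lab is not shown here, so treat this as an assumption.

```r
# Packages assumed for this lab
library(tidyverse)    # data manipulation, purrr, and plotting
library(janitor)      # clean_names() for snake_case variable names
library(skimr)        # skim() for quick data summaries
library(leaps)        # regsubsets() for subset selection
library(glmnetUtils)  # formula interface to glmnet (ridge & lasso)
library(pls)          # pcr() and plsr()
library(ISLR)         # the Hitters dataset
```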

Since we will be making use of validation set and cross-validation techniques to build and select models, we should set our seed to ensure the analyses and procedures being performed are reproducible. Readers will be able to reproduce the lab results only if they run all code in sequence from start to finish; code chunks cannot be run out of order. Setting the seed should occur towards the top of an analytic script, say directly after loading the necessary packages.
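For example (the seed value itself is arbitrary):

```r
# Set the seed once, directly after loading packages, so every
# resampling step below is reproducible
set.seed(27)
```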

6.1 Data Setup

We will be using the Hitters dataset from the ISLR package. Take a moment to inspect the codebook with ?ISLR::Hitters. Our goal will be to predict salary from the other available predictors.

Let’s skim the data to make sure it has been encoded correctly and quickly inspect it for other potential issues such as missing data.
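A sketch of the kind of call that produces the summary below, assuming the variable names are converted to snake_case with janitor::clean_names() (the object name hitters is an assumption):

```r
# Pull Hitters into a tibble, keep player names as a column,
# convert names to snake_case, and skim for a quick overview
hitters <- ISLR::Hitters %>%
  as_tibble(rownames = "name") %>%
  janitor::clean_names()

hitters %>%
  skimr::skim()
```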

## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             322       
## Number of columns          21        
## _______________________              
## Column type frequency:               
##   character                1         
##   factor                   3         
##   numeric                  17        
## ________________________             
## Group variables            None      
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 name                  0             1     7    17     0      322          0
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts    
## 1 league                0             1 FALSE          2 A: 175, N: 147
## 2 division              0             1 FALSE          2 W: 165, E: 157
## 3 new_league            0             1 FALSE          2 A: 176, N: 146
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##    skim_variable n_missing complete_rate    mean      sd    p0   p25    p50    p75  p100
##  1 at_bat                0         1      381.    153.    16   255.   380.   512     687
##  2 hits                  0         1      101.     46.5    1    64     96    137     238
##  3 hm_run                0         1       10.8     8.71   0     4      8     16      40
##  4 runs                  0         1       50.9    26.0    0    30.2   48     69     130
##  5 rbi                   0         1       48.0    26.2    0    28     44     64.8   121
##  6 walks                 0         1       38.7    21.6    0    22     35     53     105
##  7 years                 0         1        7.44    4.93   1     4      6     11      24
##  8 c_at_bat              0         1     2649.   2324.    19   817.  1928   3924.  14053
##  9 c_hits                0         1      718.    654.     4   209    508   1059.   4256
## 10 c_hm_run              0         1       69.5    86.3    0    14     37.5   90     548
## 11 c_runs                0         1      359.    334.     1   100.   247    526.   2165
## 12 c_rbi                 0         1      330.    333.     0    88.8  220.   426.   1659
## 13 c_walks               0         1      260.    267.     0    67.2  170.   339.   1566
## 14 put_outs              0         1      289.    281.     0   109.   212    325    1378
## 15 assists               0         1      107.    137.     0     7     39.5  166     492
## 16 errors                0         1        8.04    6.37   0     3      6     11      32
## 17 salary               59         0.817  536.    451.    67.5 190    425    750    2460

We see there are some missing values, all of which are contained in the response variable salary. There isn’t much we can do about this unless we wanted to do some research ourselves and track the missing salaries down from another source, so we will remove these observations from our dataset. Note that we could still produce predicted salaries for these players; we just won’t be able to determine whether those predictions are any good.
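A minimal sketch of that step, assuming the data object is named hitters:

```r
# Drop the 59 players with a missing salary
hitters <- hitters %>%
  filter(!is.na(salary))
```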

We will be implementing model selection later in this lab, and we have a choice that impacts how we need to handle the dataset. We could opt to use \(C_p\), \(BIC\), or \(R^2_{adj}\), or we could use validation set or cross-validation approaches. Remember that \(C_p\), \(BIC\), and \(R^2_{adj}\) are methods that indirectly estimate the test \(MSE\), while validation set and cross-validation approaches provide direct estimates of the test \(MSE\). The indirect approaches rely on assumptions that may not be tenable, although they do make use of the entire dataset. We prefer the direct approaches to estimating test \(MSE\) because they are more flexible, and computational limitations are no longer a barrier.
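A sketch of how the data might be split into a model-building set and a model-comparison set; the object names match those used later in the lab, but the splitting proportion here is an assumption.

```r
# Hold out part of the data for comparing candidate models at the end
hitter_mod_bldg_dat <- hitters %>% slice_sample(prop = 0.85)
hitter_mod_comp_dat <- hitters %>% anti_join(hitter_mod_bldg_dat, by = "name")
```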

6.2 Subset Selection Methods

Results for best subset search.
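The table below could be produced with something like the following sketch, assuming hitter_mod_bldg_dat has been split further into a training set (hitter_train) and a validation set (hitter_valid); those two object names are hypothetical.

```r
# Best subset search on the training split; leaps has no predict() method,
# so predictions are built by hand from the validation model matrix
best_subset <- regsubsets(salary ~ . - name, data = hitter_train, nvmax = 19)

valid_mat <- model.matrix(salary ~ . - name, data = hitter_valid)

tibble(model_index = 1:19) %>%
  mutate(
    test_mse = map_dbl(model_index, function(i) {
      beta <- coef(best_subset, id = i)
      pred <- valid_mat[, names(beta)] %*% beta
      mean((hitter_valid$salary - pred)^2)
    })
  ) %>%
  arrange(test_mse)
```

The forward and backward tables further below use the same approach, swapping in method = "forward" or method = "backward" when calling regsubsets().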

## # A tibble: 19 x 2
##    model_index test_mse
##          <int>    <dbl>
##  1           9  114697.
##  2           8  120848.
##  3          10  122537.
##  4          11  125281.
##  5          12  127959.
##  6          13  128225.
##  7          14  129884.
##  8          15  130396.
##  9          17  130706.
## 10          16  131059.
## 11          18  131543.
## 12          19  132190.
## 13           4  138234.
## 14           7  139159.
## 15           6  140677.
## 16           2  144966.
## 17           3  145706.
## 18           5  147677.
## 19           1  151427.

Results for the forward selection search.

## # A tibble: 19 x 2
##    model_index test_mse
##          <int>    <dbl>
##  1          10  122800.
##  2           9  123859.
##  3          11  125631.
##  4          12  125944.
##  5          13  127514.
##  6          14  128240.
##  7           8  128615.
##  8          15  129167.
##  9          17  129379.
## 10          18  129533.
## 11          16  129768.
## 12          19  129838.
## 13           6  131553.
## 14           7  131891.
## 15           5  132531.
## 16           4  135324.
## 17           3  143196.
## 18           2  147530.
## 19           1  168375.

Results for the backward selection search.

## # A tibble: 19 x 2
##    model_index test_mse
##          <int>    <dbl>
##  1          10  124270.
##  2           9  125919.
##  3           8  126654.
##  4          11  128029.
##  5           6  130783.
##  6          12  132626.
##  7          13  133752.
##  8          15  133922.
##  9          16  133988.
## 10           5  134176.
## 11           7  134629.
## 12          19  134881.
## 13          14  135071.
## 14          18  135404.
## 15          17  135502.
## 16           4  142425.
## 17           2  143606.
## 18           3  144385.
## 19           1  164982.

The results above suggest that using a model with 9 predictors would be reasonable, regardless of selection method. We now implement each selection method on the entirety of hitter_mod_bldg_dat and extract the best model that uses 9 predictors from each. These will be 3 of our candidate models.
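A sketch of that step (the fit object names are assumptions):

```r
# Refit each search on all of hitter_mod_bldg_dat and keep the 9-predictor model
fit_best <- regsubsets(salary ~ . - name, data = hitter_mod_bldg_dat, nvmax = 19)
fit_fwd  <- regsubsets(salary ~ . - name, data = hitter_mod_bldg_dat, nvmax = 19,
                       method = "forward")
fit_back <- regsubsets(salary ~ . - name, data = hitter_mod_bldg_dat, nvmax = 19,
                       method = "backward")

coef(fit_best, id = 9)
coef(fit_fwd,  id = 9)
coef(fit_back, id = 9)
```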

All three of these candidate models have 9 predictors. It is possible that all three methods produce the same model. Let’s take a look at the estimated coefficients for these models.

name             best       fwd      back
(Intercept)   139.789   141.091   133.610
at_bat         -2.093    -1.556    -1.741
hits            6.833     5.548     6.110
walks           5.395     4.991     5.150
c_hm_run        1.796        NA        NA
c_runs          0.988     1.265     1.406
c_walks        -0.903    -0.824    -0.935
divisionW    -132.673  -130.710  -130.760
put_outs        0.307     0.270     0.282
assists         0.278        NA        NA
c_at_bat           NA    -0.092        NA
c_rbi              NA     0.692     0.683
c_hits             NA        NA    -0.350

As the above table demonstrates, the three methods selected three different models that use 9 of the available predictors. There are 7 predictors that all three models have in common. The forward and backward models have 8 predictors in common.

6.3 Ridge Regression & the Lasso

The glmnet package allows users to fit elastic net regressions, which include ridge regression (alpha = 0) and the lasso (alpha = 1). Unfortunately the package does not use our familiar formula syntax, y ~ x. It requires the user to encode their data such that the penalized least squares algorithm can be applied. Both lm() and glm() also require the data to be correctly encoded for the least squares algorithm to work, but this is done automatically within those functions by model.matrix(), which takes the inputted formula and appropriately re-encodes the data. glmnet requires the user to call model.matrix() themselves. Wouldn’t it be nice if the glmnet functions did this automatically, just like lm() and glm()? Fortunately the glmnetUtils package allows us to keep using our familiar formula interface with glmnet functions.

The tuning parameter for both ridge regression and the lasso is \(\lambda\), the size of the penalty to apply in the least squares algorithm. How do we go about tuning/finding an estimate for the \(\lambda\) parameter? We call upon cross-validation techniques once again. glmnet (via glmnetUtils) has a built-in function that will implement cross-validation quickly and efficiently for us, meaning there is no need for us to use crossv_kfold(). We will be using cv.glmnet(); use ?glmnetUtils::cv.glmnet to access the correct documentation. It is the alpha argument to these functions that determines whether we are using ridge regression (alpha = 0) or the lasso (alpha = 1).

Since we are in search of an adequate \(\lambda\), we need to specify a range of values to search, much like we specified the number of predictors to search over when using regsubsets() (we started with 1 and went up to 19). The difference here is that \(\lambda\) can be any positive real number (the size of the penalty to apply). Obviously we cannot use every possible value greater than 0, so we need to set up a grid search. We could go with an automatically selected grid for \(\lambda\) or we can build our own grid. We will demonstrate both.
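Here is a sketch of the custom-grid version for ridge regression; the grid endpoints (10^10 down to 10^-2) follow the convention in the ISLR lab and are an assumption about what was used here. The final plot() call produces the figure described next.

```r
# Custom grid: 200 candidate lambda values, evenly spaced on the log scale
lambda_grid <- 10^seq(10, -2, length = 200)

# 10-fold cross-validation for ridge regression (alpha = 0),
# using the glmnetUtils formula interface
ridge_cv <- cv.glmnet(
  salary ~ . - name,
  data   = hitter_mod_bldg_dat,
  alpha  = 0,
  lambda = lambda_grid,
  nfolds = 10
)

plot(ridge_cv)
```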

The following plot displays the 200 estimated test \(MSE\) values from 10-fold cross-validation. The plot indicates the minimum \(\lambda\) (left dotted line). The right dotted line corresponds to a \(\lambda\) value that would result in a more regularized model (forcing more, or further, shrinkage of coefficients towards zero) and that is within one standard error of the minimum \(\lambda\) (the models are not all that different in estimated test \(MSE\)). Why would we desire a more regularized model? A more regularized model typically shifts more of the prediction burden/importance onto fewer predictor variables. The top of the plot periodically displays the number of predictors in the models; in the case of ridge regression this number is very unlikely to change, since ridge regression is highly unlikely to force any of the coefficients to be exactly zero.

Strictly speaking, the best \(\lambda\) is the one which minimizes the test error. We will extract that value to use for ridge regression. We will also extract the \(\lambda\) that is within one standard error, which can help safeguard against over-fitting.
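Extracting both values is straightforward (assuming the ridge_cv object from the sketch above):

```r
# Lambda minimizing CV error, and the largest lambda within one SE of it
ridge_lambda_min <- ridge_cv$lambda.min
ridge_lambda_1se <- ridge_cv$lambda.1se
```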

Let’s get the best \(\lambda\) values for the lasso. Recall that alpha = 1 now and we will let the function select the search grid for \(\lambda\).
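A sketch, mirroring the ridge call but with alpha = 1 and no user-supplied grid; the plot() call produces the figure discussed below.

```r
# 10-fold cross-validation for the lasso; cv.glmnet picks its own lambda grid
lasso_cv <- cv.glmnet(
  salary ~ . - name,
  data   = hitter_mod_bldg_dat,
  alpha  = 1,
  nfolds = 10
)

plot(lasso_cv)

lasso_lambda_min <- lasso_cv$lambda.min
lasso_lambda_1se <- lasso_cv$lambda.1se
```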

There is an extremely important difference between this plot, which uses the lasso, and the plot above, which uses ridge regression: the number of predictors (top of graph) decreases as \(\lambda\) increases. This is an important distinction between the lasso and ridge regression. The lasso forces coefficients to be exactly zero (i.e. it performs variable selection). For the lasso, a more regularized model is therefore a more parsimonious (simpler) model. This is extremely useful when theory indicates that only a handful of variables are likely to be truly related to the response, with the others over-fitting to irreducible error (i.e. they won’t be useful for out-of-sample prediction). For this reason it is fairly standard practice to use lambda.1se as the best \(\lambda\).

Now that we have identified the best \(\lambda\) values for each method, we need to fit them to the entire model building dataset hitter_mod_bldg_dat. These will be our candidate models produced by ridge regression and the lasso. We do this within a tibble framework so that later we are able to calculate each model’s test error (test \(MSE\)) on hitter_mod_comp_dat. This will allow us to easily compare all of our candidate models against each other in order to pick a final prediction model.
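One way to organize this, sketched under the assumption that the \(\lambda\) values were stored as above (the tibble and column names are hypothetical):

```r
# Candidate ridge and lasso models fit to all of hitter_mod_bldg_dat,
# stored in a tibble so test MSE can be computed later on hitter_mod_comp_dat
glmnet_candidates <- tibble(
  method = c("ridge_min", "ridge_1se", "lasso_min", "lasso_1se"),
  alpha  = c(0, 0, 1, 1),
  lambda = c(ridge_lambda_min, ridge_lambda_1se, lasso_lambda_min, lasso_lambda_1se)
) %>%
  mutate(
    fit = map2(alpha, lambda,
               ~ glmnet(salary ~ . - name, data = hitter_mod_bldg_dat,
                        alpha = .x, lambda = .y))
  )
```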

Let’s take a look at the coefficients for the candidate models produced by ridge regression and the lasso.

name          ridge_min  ridge_1se  lasso_min  lasso_1se
(Intercept)     -18.450    232.822    -73.143    221.115
at_bat            0.085      0.088         NA         NA
hits              0.892      0.345      2.076      1.315
hm_run            1.199      1.247         NA         NA
runs              0.980      0.541         NA         NA
rbi               0.804      0.572         NA         NA
walks             1.281      0.669      1.046      0.410
years             1.552      2.635         NA         NA
c_at_bat          0.012      0.008         NA         NA
c_hits            0.060      0.028         NA         NA
c_hm_run          0.615      0.243      0.970      0.464
c_runs            0.127      0.058      0.329      0.180
c_rbi             0.139      0.062      0.154      0.206
c_walks           0.043      0.060         NA         NA
leagueA         -12.689      0.122         NA         NA
leagueN          12.693     -0.122         NA         NA
divisionE        58.498     17.297    113.273         NA
divisionW       -58.503    -17.297     -2.676         NA
put_outs          0.195      0.048      0.247      0.026
assists           0.077      0.007         NA         NA
errors           -1.633     -0.141         NA         NA
new_leagueA       2.442      0.236         NA         NA
new_leagueN      -2.448     -0.236         NA         NA

6.4 PCR & PLS Regression

We will be using pcr() from the pls package to implement principal component regression. Recall that the purpose of PCR is to reduce the dimension of our predictor set. That is, we want to find a set of \(m\) principal components (linear combinations of the predictors) that we will then use in a linear regression. This is to solve the problem of having too many predictors relative to observations, especially when the number of predictors exceeds the number of observations (\(p > n\)).

Hopefully it is clear that \(m\), the number of principal components, is just a tuning parameter, meaning we can once again turn to cross-validation techniques to pick the optimal \(m\) (optimal in that we want the lowest possible test error). We can set validation = "CV" in pcr() to conduct 10-fold cross-validation by default. Note that you can control the number of folds by specifying, for example, segments = 5 for 5-fold cross-validation.

Before we proceed to tuning the \(m\) parameter there is one more important decision to be made: do we want to standardize each predictor variable prior to forming the principal components? This is a very important decision because forming principal components is sensitive to scale, and standardizing each variable forces them all onto the same scale. If we choose to standardize the predictors, then we are saying they are all equally important and none should be prioritized over another because of how it is measured. If not, we are indicating that the scale of the predictor variables is important and should be taken into account when forming principal components. In most cases we will want scale = TRUE because we will have many variables measured on many different scales, with no reasonable justification to believe that the scale of measurement should matter. For example, a variable such as height does not become more useful when measured in centimeters rather than in meters.
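A sketch of the fit, with standardized predictors and 10-fold cross-validation (the object name pcr_fit is an assumption):

```r
# Principal component regression on the model-building data
pcr_fit <- pcr(
  salary ~ . - name,
  data       = hitter_mod_bldg_dat,
  scale      = TRUE,
  validation = "CV"
)
```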

We can use validationplot() to find the \(m\) (number of components) that corresponds to the lowest estimated test error (reported as either root mean squared error or mean squared error). The plots below indicate that 6 would be the optimal number of principal components; a vertical line has been added at 6 to help identify it.
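For example (assuming the pcr_fit object above):

```r
# Estimated test error (MSEP) versus number of components,
# with a vertical guide at the apparent minimum
validationplot(pcr_fit, val.type = "MSEP")
abline(v = 6, lty = "dashed")
```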

If it becomes too difficult to determine the optimal value using a validationplot() then use summary() to get a table of the values.

## Data:    X dimension: 224 19 
##  Y dimension: 224 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           458.7    351.4    352.4    352.8    351.3    348.4    339.9
## adjCV        458.7    351.0    352.0    352.3    350.8    348.0    339.1
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       341.8      345    344.1     349.8     352.4     353.4     356.3
## adjCV    341.0      344    343.2     348.2     350.8     351.7     354.4
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        356.5     351.1     344.3     351.4     352.8     358.3
## adjCV     353.6     349.2     342.5     349.0     350.2     355.4
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         37.57    60.18    71.18    79.31    84.61    88.95    92.34    95.01
## salary    42.30    43.27    43.67    44.55    46.10    48.66    48.66    48.77
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         96.31     97.33     98.06     98.73     99.19     99.47     99.75
## salary    49.23     50.39     50.48     50.58     50.59     51.66     51.99
##         16 comps  17 comps  18 comps  19 comps
## X          99.89     99.97     99.99    100.00
## salary     53.62     54.97     55.31     55.38

Let’s move on to calculating the optimal \(m\) when using plsr(), partial least squares regression. Note that \(m\) stands for the number of partial least squares directions (much like the number of principal components).
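The fit mirrors the PCR call (a sketch; the object name pls_fit is an assumption):

```r
# Partial least squares regression with standardized predictors and 10-fold CV
pls_fit <- plsr(
  salary ~ . - name,
  data       = hitter_mod_bldg_dat,
  scale      = TRUE,
  validation = "CV"
)

validationplot(pls_fit, val.type = "MSEP")
```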

Looks like 4 partial least squares directions produce the lowest cross-validation error.

There doesn’t appear to be much difference between using 2, 3, or 4 partial least squares directions (see the plot above and the table below). The fewer directions the better, because the purpose of partial least squares regression is to reduce the dimension of our predictor space. Let’s also plan on proposing a candidate model with 2 partial least squares directions.

## Data:    X dimension: 224 19 
##  Y dimension: 224 1
## Fit method: kernelpls
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           458.7    346.5    344.3    343.7    343.7    348.9    349.6
## adjCV        458.7    346.2    343.6    342.7    342.6    347.2    347.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       349.1    347.1    351.7     347.0     346.6     346.5     348.3
## adjCV    347.1    344.8    349.1     344.8     344.1     344.4     346.0
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        350.7     349.8     349.2     349.9     350.0     352.9
## adjCV     348.1     347.3     346.8     347.5     347.5     350.2
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         37.35    50.19    65.57    73.78    78.79    84.70    88.71    90.62
## salary    44.80    48.39    49.82    50.67    51.88    52.57    53.18    54.24
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         92.81     93.85      94.5     97.24     98.38     98.57     99.15
## salary    54.58     54.85      55.1     55.14     55.19     55.28     55.30
##         16 comps  17 comps  18 comps  19 comps
## X          99.40     99.71     99.97    100.00
## salary     55.34     55.35     55.36     55.38

Now that we have identified the best \(m\) for each method, we need to fit them to the entire model building dataset hitter_mod_bldg_dat. These will be the candidate models produced by principal component regression and partial least squares regression. We do this within a tibble framework so that later we are able to calculate each model’s test error (test \(MSE\)) on hitter_mod_comp_dat. This will allow us to easily compare all of our candidate models against each other in order to pick a final prediction model.
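A sketch of that final step for the dimension-reduction candidates; the numbers of components come from the tuning above, while the tibble and column names are hypothetical.

```r
# Fit the chosen PCR and PLS candidates on all of hitter_mod_bldg_dat,
# then estimate test MSE on the held-out hitter_mod_comp_dat
dim_reduction_candidates <- tibble(
  method = c("pcr_6", "pls_4", "pls_2"),
  ncomp  = c(6, 4, 2),
  fit    = list(
    pcr(salary ~ . - name,  data = hitter_mod_bldg_dat, scale = TRUE, ncomp = 6),
    plsr(salary ~ . - name, data = hitter_mod_bldg_dat, scale = TRUE, ncomp = 4),
    plsr(salary ~ . - name, data = hitter_mod_bldg_dat, scale = TRUE, ncomp = 2)
  )
) %>%
  mutate(
    test_mse = map2_dbl(fit, ncomp, function(mod, m) {
      pred <- drop(predict(mod, newdata = hitter_mod_comp_dat, ncomp = m))
      mean((hitter_mod_comp_dat$salary - pred)^2)
    })
  )
```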