Intro to Statistics and Data Science

5 Lab: Cross-Validation and the Bootstrap

This is a modified version of the "Lab: Cross-Validation and the Bootstrap" section of Chapter 5 of Introduction to Statistical Learning with Applications in R. This version uses tidyverse techniques and methods that allow for scalability and a more efficient data analytic pipeline.

We will need the packages loaded below.
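The package-loading chunk itself is not shown in this version of the text; a plausible setup, assuming the tidyverse tools used throughout plus ISLR for the dataset codebooks, would be:

```r
# Assumed setup for this lab (the actual chunk is not shown):
# tidyverse supplies the pipeline verbs, ISLR the dataset codebooks,
# skimr the data summaries, and broom the tidy model output.
library(tidyverse)
library(ISLR)
library(skimr)
library(broom)
```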

Whenever performing analyses or procedures that include randomization or resampling, it is considered best practice to set the seed of your software's random number generator. In R this is done with set.seed(). This ensures the analyses and procedures being performed are reproducible. For instance, readers following along in the lab will be able to produce precisely the same results as those produced in the lab, provided they run all code in the lab in sequence from start to finish; running code chunks out of order will break reproducibility. Setting the seed should occur towards the top of an analytic script, say directly after loading the necessary packages.
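For example, a seed chunk placed right after the package loading might look like the following (the specific seed value is an arbitrary illustration, not the lab's actual seed):

```r
# Any fixed integer works; what matters is that the seed is set once,
# near the top of the script, before any resampling is performed.
set.seed(27)
```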

5.1 Validation Set Approach

We explore the use of the validation set approach in order to estimate the test error rates that result from fitting various linear models to the Auto dataset. This dataset is from the ISLR library. Take a moment to inspect the codebook (?ISLR::Auto). We will read in the data from the Auto.csv file and do a little processing.
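The reading and processing chunk is not shown; a sketch of what it could look like (the file path and cleaning steps are assumptions; in the raw Auto.csv, horsepower is read as character because missing values are coded as "?"):

```r
# Hypothetical read-and-clean step; the path is an assumption.
# Coercing horsepower to numeric turns the "?" codes into NAs,
# and dropping those rows leaves the 392 complete observations.
auto_dat <- read_csv("data/Auto.csv") %>%
  mutate(horsepower = as.numeric(horsepower)) %>%
  drop_na()
```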

We begin by using the sample_frac() function to split the observations into two halves, selecting a random subset of 196 of the original 392 observations. We refer to these observations as the training set.
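A sketch of the split, assuming the data live in a tibble named auto_dat (a hypothetical name):

```r
# sample_frac() draws half of the rows at random for training;
# dplyr's setdiff() recovers the other half as the test set
# (this works because the rows here are distinct).
auto_train <- auto_dat %>% sample_frac(0.5)
auto_test  <- auto_dat %>% setdiff(auto_train)
```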

Let’s keep it relatively simple and fit a simple linear regression using horsepower to predict mpg, along with polynomial regressions of horsepower up to degree 10.
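One way the table below could have been produced (a sketch; the tibble and column names are illustrative, and auto_train/auto_test denote the two halves of the split):

```r
# Fit mpg ~ poly(horsepower, d) for d = 1, ..., 10 on the training
# half, then score each fit on the held-out half with test MSE.
poly_fits <- tibble(degree = 1:10) %>%
  mutate(
    fit = map(degree, ~ lm(mpg ~ poly(horsepower, .x), data = auto_train)),
    test_mse = map_dbl(fit, ~ mean((auto_test$mpg - predict(.x, auto_test))^2))
  ) %>%
  select(degree, test_mse) %>%
  arrange(test_mse)
```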

## # A tibble: 10 x 2
##    degree test_mse
##     <int>    <dbl>
##  1      2     18.6
##  2      4     18.9
##  3      3     18.9
##  4      7     19.0
##  5      6     19.1
##  6      8     19.3
##  7      5     19.3
##  8      9     21.4
##  9      1     22.1
## 10     10     22.6

5.4 The Bootstrap

We will be using the Portfolio dataset from ISLR (see ?ISLR::Portfolio for details). We will load the dataset from the Portfolio.csv file.
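A sketch of the loading step (the path and object name are assumptions); piping the result to skimr::skim() produces a summary like Table 5.1:

```r
# Hypothetical path; Portfolio has 100 rows of two returns, x and y.
portfolio_dat <- read_csv("data/Portfolio.csv")
portfolio_dat %>% skimr::skim()
```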

TABLE 5.1: Data summary
Name Piped data
Number of rows 100
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd     p0    p25    p50   p75  p100
x                      0              1  -0.08  1.06  -2.43  -0.89  -0.27  0.56  2.46
y                      0              1  -0.10  1.14  -2.73  -0.89  -0.23  0.81  2.57

5.4.1 Estimating the Accuracy of a Statistic of Interest
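The statistic of interest here is the minimum-variance portfolio allocation alpha = (var(y) - cov(x, y)) / (var(x) + var(y) - 2 cov(x, y)) from ISLR Section 5.2. A sketch of a helper computing it (alpha_fn and portfolio_dat are hypothetical names):

```r
# alpha minimizes the variance of the portfolio a*x + (1 - a)*y.
alpha_fn <- function(df) {
  (var(df$y) - cov(df$x, df$y)) /
    (var(df$x) + var(df$y) - 2 * cov(df$x, df$y))
}

# Point estimate of alpha on the full Portfolio data:
alpha_fn(portfolio_dat)
```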

## [1] 0.5758321
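The 1,000 bootstrap replicates summarized in Table 5.2 could be generated with a sketch like the one below (names are illustrative; the helper alpha_fn is defined here so the chunk stands alone):

```r
# alpha_fn computes the minimum-variance allocation for one sample.
alpha_fn <- function(df) {
  (var(df$y) - cov(df$x, df$y)) /
    (var(df$x) + var(df$y) - 2 * cov(df$x, df$y))
}

# 1,000 resamples of the 100 rows, drawn with replacement.
boot_tbl <- tibble(
  alpha_boot = map_dbl(
    1:1000,
    ~ portfolio_dat %>% slice_sample(prop = 1, replace = TRUE) %>% alpha_fn()
  )
)

# Bootstrap point estimate and its standard error.
boot_tbl %>%
  summarise(est_boot = mean(alpha_boot), est_se = sd(alpha_boot))
```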
TABLE 5.2: Data summary
Name Piped data
Number of rows 1000
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd    p0   p25   p50   p75  p100
alpha_boot             0              1  0.58  0.09  0.32  0.51  0.57  0.63  0.86

## # A tibble: 1 x 2
##   est_boot est_se
##      <dbl>  <dbl>
## 1    0.576 0.0881

5.4.2 Estimating the Accuracy of a Linear Regression Model
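A sketch of how the 1,000 bootstrap refits summarized below could be produced (object names are illustrative; auto_dat denotes the processed Auto data, and broom is assumed to be loaded for tidy()):

```r
# Each resample refits mpg ~ horsepower; tidy() returns the two
# coefficient rows per fit, so stacking 1000 fits gives the
# 2 x 1000 = 2000 rows summarized in Table 5.3.
boot_coefs <- map_dfr(
  1:1000,
  ~ auto_dat %>%
      slice_sample(prop = 1, replace = TRUE) %>%
      lm(mpg ~ horsepower, data = .) %>%
      tidy()
)

# Bootstrap standard errors, grouped by coefficient:
boot_coefs %>%
  group_by(term) %>%
  summarise(est_se = sd(estimate))
```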

TABLE 5.3: Data summary
Name Piped data
Number of rows 2000
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables term

Variable type: numeric

skim_variable  term         n_missing  complete_rate   mean    sd     p0    p25    p50    p75   p100
estimate       (Intercept)          0              1  39.97  0.87  37.27  39.43  39.95  40.57  43.10
estimate       horsepower           0              1  -0.16  0.01  -0.18  -0.16  -0.16  -0.15  -0.14

## # A tibble: 2 x 3
##   term        estimate std.error
##   <chr>          <dbl>     <dbl>
## 1 (Intercept)   39.9     0.717  
## 2 horsepower    -0.158   0.00645
## # A tibble: 2 x 2
##   term         est_se
##   <chr>         <dbl>
## 1 (Intercept) 0.866  
## 2 horsepower  0.00743