9 Multiple Linear Regression

This lab is structured to guide you through an organized process so that you can easily turn your work, meaning your commented R script, into a lab report. We suggest getting into the habit of writing an organized and commented R script that completes the tasks and answers the questions provided in the lab, including those in the On Your Own section.


9.1 Getting Started

Recall that we explored simple linear regression by examining baseball data from the 2011 Major League Baseball (MLB) season. We will also use this data to explore multiple regression. Our inspiration for exploring this data stems from the movie Moneyball, which focused on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to find the model that best predicts a team’s runs scored in a season. We also aim to find the model that best predicts a team’s total wins in a season. The first model would tell us which player statistics we should pay attention to if we wish to purchase runs and the second model would indicate which player statistics we should utilize when we wish to purchase wins.

9.2 The data

Let’s load up the data for the 2011 season.
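One way to load the data is sketched below; the URL is the one the OpenIntro labs have used for this dataset and should be treated as an assumption here:

```r
# Download and load the 2011 MLB data; the URL is assumed from the OpenIntro labs
download.file("http://www.openintro.org/stat/data/mlb11.RData",
              destfile = "mlb11.RData")
load("mlb11.RData")  # creates a data frame called mlb11
str(mlb11)
```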

In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the newer variables on your own.

We also would like to modify the data so that it is easier to work with during model selection. We remove the variable team from the dataset and store the updated version in mlb11_wins.

Since wins is not a player-level statistic - at least for non-pitchers - we do not want to use it when predicting runs. Therefore we create another modified dataset to use when searching for the best model for predicting a team's total runs in a season. The reverse is not an issue when predicting a team's wins for a season - runs can be used to predict wins.
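The two modifications described above can be carried out as below; the name mlb11_runs for the second dataset is an assumption, since the text only names mlb11_wins:

```r
# Data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))

# Drop the team identifier before model selection
mlb11_wins <- subset(mlb11, select = -team)

# Additionally drop wins when the response of interest is runs
mlb11_runs <- subset(mlb11_wins, select = -wins)
names(mlb11_runs)
```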


9.3 The search for the best model

As discussed in class there are many ways to go about model selection. We will look at both forward and backward selection methods that utilize different criteria (\(R^2_{adj}\), p-values, or AIC).

9.3.1 Predicting runs with backward selection

The first step in backward selection is to define a full model. Since we created a modified dataset for predicting runs we can use a shortcut, runs ~ ., for telling R to use all remaining variables to predict runs.
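A sketch of fitting the full model; the object name m_full_runs and the dataset name mlb11_runs are assumptions:

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))

# Full model: every remaining variable is used to predict runs
m_full_runs <- lm(runs ~ ., data = mlb11_runs)
summary(m_full_runs)
```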


Exercise 1: How many variables are being used to predict runs in the full model? How many parameters are being estimated in the full model? How many of the parameters are significantly different than 0 at the 0.05 level? What is the full model’s \(R^2_{adj}\)?


Now that we have a full model defined we can go about backward model selection. The step() function in R makes it extremely easy to use AIC (Akaike's Information Criterion) for model selection. Similar to \(R^2_{adj}\), AIC applies a penalty to models using more predictor variables. Run the following code to determine the best model for predicting a team's runs in a season using backward selection with AIC as the criterion (note that lower AIC indicates a better model).
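A sketch of the backward AIC search; the object names are assumptions:

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))
m_full_runs <- lm(runs ~ ., data = mlb11_runs)

# Backward selection with AIC; step() prints the candidate models at each step
m_back_aic <- step(m_full_runs, direction = "backward")
summary(m_back_aic)
```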


Exercise 2: How many steps did the backward selection using AIC conduct before selecting a model? Which variable was the first to be removed? Which variables ended up in the final model? How many parameters are being estimated in this final model? How many of the parameters in this final model are significantly different than 0 at the 0.05 level? Does this final model have a higher \(R^2_{adj}\) than the full model for runs?


Instead of AIC, let’s use \(R^2_{adj}\) as our criterion when conducting backward selection. Remember that a higher \(R^2_{adj}\) indicates a better model.
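step() does not offer \(R^2_{adj}\) as a criterion, so each backward step is done by hand: refit the model with each predictor dropped in turn and compare the resulting \(R^2_{adj}\) values. A sketch of step 1, with object names as assumptions:

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))

preds <- setdiff(names(mlb11_runs), "runs")

# Adjusted R^2 after dropping each predictor in turn
adj_r2_drop <- sapply(preds, function(v) {
  f <- reformulate(setdiff(preds, v), response = "runs")
  summary(lm(f, data = mlb11_runs))$adj.r.squared
})
sort(adj_r2_drop, decreasing = TRUE)  # top entry names the variable to remove
```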

Since the model which removed new_onbase has the highest \(R^2_{adj}\), we move on to step 2 using that model and continue by removing one variable at a time and calculating the new \(R^2_{adj}\) for each model.

Since the model in step 2 that removed strikeouts has the highest \(R^2_{adj}\), we move on to step 3 using the model that now has both new_onbase and strikeouts removed and continue by removing one variable at a time and calculating the new \(R^2_{adj}\) for each model.

Since none of the models which remove one more additional variable from the model that already excludes both new_onbase and strikeouts have larger \(R^2_{adj}\), we stop the process and now have a final model. In the code below we store the final model and look at a summary of the final model.
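The final model from this search can be stored along these lines; the removed variable names are taken from the text, while the object name is an assumption:

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))

# Final model from backward selection with adjusted R^2:
# the full model minus new_onbase and strikeouts
m_back_adj <- lm(runs ~ . - new_onbase - strikeouts, data = mlb11_runs)
summary(m_back_adj)
```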

Exercise 3: Which variables ended up in the final model when using backward selection with \(R^2_{adj}\)? How many parameters are being estimated in this final model? How many of the parameters in this final model are significantly different than 0 at the 0.05 level? Does this final model have a higher \(R^2_{adj}\) than the full model for runs? Higher than the final model when using backward selection with AIC?

Finally let’s use a p-value method with a 0.05 significance level as our criterion when conducting backward selection. Remember that high p-values are bad, so at each step we remove the variable with the highest p-value, provided that p-value is greater than 0.05. If all p-values are less than 0.05, then we stop and we have arrived at our final model. Fortunately, R does have a function that makes this easier, which is drop1() - its counterpart add1() is for forward selection. We must input the model fit and then indicate test = "F" so that p-values are printed.
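A sketch of one round of this method with drop1(); the commented update() line shows how a removal would be carried out, with some_variable as a placeholder rather than an actual result:

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))
m_full_runs <- lm(runs ~ ., data = mlb11_runs)

# F-tests for dropping each single term from the current model
drop1(m_full_runs, test = "F")

# If the largest p-value exceeds 0.05, remove that variable and repeat, e.g.:
# m_next <- update(m_full_runs, . ~ . - some_variable)  # placeholder name
# drop1(m_next, test = "F")
```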

Below we store the final model selected under this method and examine it using summary().


Exercise 4: Which variables ended up in the final model using a p-value method with a 0.05 significance level as our criterion when conducting backward selection? How many parameters are being estimated in this final model? How many of the parameters in this final model are significantly different than 0 at the 0.05 level? Does this final model have a higher \(R^2_{adj}\) than the full model for runs? Why might someone prefer this final model over all of the models thus far?


9.3.2 Predicting runs with forward selection

The first step in forward selection is to set up a null/base model to build up from. This model could include variables that researchers stipulate a model must have for theoretical reasons. No such variables exist in our case, which means our null model will only have the intercept in it. We must also specify the full model so the procedure knows which models to attempt. Note that the full model will be the same as in backward selection.
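A sketch of the two models; the object names are assumptions:

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))

# Null model: intercept only
m_null_runs <- lm(runs ~ 1, data = mlb11_runs)

# Full model: the same full model as in backward selection
m_full_runs <- lm(runs ~ ., data = mlb11_runs)
```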

Now that we have a null model defined we can go about forward model selection. Once again we will use the step() function in R with AIC (Akaike's Information Criterion) for model selection - remember that lower AIC indicates a better model.
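A sketch of the forward AIC search; the object names are assumptions:

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))
m_null_runs <- lm(runs ~ 1, data = mlb11_runs)
m_full_runs <- lm(runs ~ ., data = mlb11_runs)

# Forward selection with AIC, building up from the intercept-only model
m_fwd_aic <- step(m_null_runs, scope = formula(m_full_runs),
                  direction = "forward")
summary(m_fwd_aic)
```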


Exercise 5: How many steps did the forward selection using AIC conduct before selecting a model? Which variable was the first to be added? Which previous selection method does this final model agree with? (This doesn’t always happen.)


Instead of AIC, let’s use \(R^2_{adj}\) as our criterion when conducting forward selection. Remember that a higher \(R^2_{adj}\) indicates a better model.
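As with the backward version, each forward step is done by hand. For step 1 we fit every one-predictor model and compare the \(R^2_{adj}\) values (a sketch; object names are assumptions):

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))

preds <- setdiff(names(mlb11_runs), "runs")

# Adjusted R^2 of each one-predictor model
adj_r2_add <- sapply(preds, function(v) {
  summary(lm(reformulate(v, response = "runs"), data = mlb11_runs))$adj.r.squared
})
sort(adj_r2_add, decreasing = TRUE)  # top entry names the first variable to add
```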

Since the model that added new_obs has the highest \(R^2_{adj}\), we move on to step 2 using that model and continue by adding one variable at a time and calculating the new \(R^2_{adj}\) for each model.

Since the model in step 2 that added stolen_bases has the highest \(R^2_{adj}\), we move on to step 3 using that model and continue by adding one variable at a time and calculating the new \(R^2_{adj}\) for each model.

Since none of the models that added one more variable in step 3 resulted in an increased \(R^2_{adj}\), we stop the process and now have a final model. In the code below we store the final model and look at a summary of the final model.
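The final model can be stored along these lines; the two variable names are taken from the text, while the object name is an assumption:

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))

# Final model from forward selection with adjusted R^2
m_fwd_adj <- lm(runs ~ new_obs + stolen_bases, data = mlb11_runs)
summary(m_fwd_adj)
```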


Exercise 6: Does the final model selected using forward selection with \(R^2_{adj}\) differ from the final model using forward selection with AIC?


Finally let’s use a p-value method with a 0.05 significance level as our criterion when conducting forward selection. Remember that lower p-values are considered better, so at each step we add the variable with the lowest p-value, provided that p-value is less than 0.05. If no candidate variable has a p-value below 0.05, then we stop and conclude that we have arrived at our final model. Fortunately, R does have a function that makes this easier, which is add1(). We must input the model fit, the possible full model, and then indicate test = "F" so that p-values are printed.
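A sketch of the first add1() round; the object names are assumptions:

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))
m_null_runs <- lm(runs ~ 1, data = mlb11_runs)
m_full_runs <- lm(runs ~ ., data = mlb11_runs)

# F-tests for adding each single candidate term to the current model
add1(m_null_runs, scope = formula(m_full_runs), test = "F")
```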

Below we store the final model selected under this method and examine it using summary().


Exercise 7: What do you note about the final model selected using forward selection with p-values? This does not always occur.


9.4 Assessing the conditions

After conducting a model selection procedure we should conduct graphical checks to explore whether our conditions for multiple regression are being met. R has a built-in command for basic diagnostic plots. Simply use the plot() function and input the model that you want diagnostic plots for, as demonstrated in the code below.
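For example, diagnostics for the model chosen by backward selection with AIC could be drawn as follows (a sketch; object names are assumptions):

```r
# Setup repeated so the chunk runs on its own; data URL assumed from the OpenIntro labs
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
mlb11_runs <- subset(mlb11, select = -c(team, wins))
m_full_runs <- lm(runs ~ ., data = mlb11_runs)
m_back_aic <- step(m_full_runs, direction = "backward", trace = 0)

# Four standard diagnostics: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(m_back_aic)
```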

None of the models have any severe departures from our necessary conditions for multiple regression. Diagnostic methods can be tricky, and really getting a good understanding of them will require either further self-study or taking a regression course.


9.5 On Your Own

  • Using all the available variables in our dataset, conduct backward selection using AIC to select the best model for predicting wins for a team in a single season. Hint: make sure you are using the mlb11_wins dataset. How many variables are being used to predict wins in the full model? Which variables are included in this final model?

  • Construct a 95% confidence interval for one of the slopes in the final model from part 1.

  • Conduct backward selection using \(R^2_{adj}\) and then the p-value method to select the best model for predicting wins. Do either of the final models selected using these criteria match the final model selected using backward selection with AIC? Do they match each other?

  • Conduct forward selection using each of the three criteria we have been using to select the best model for predicting wins. Are the final models all the same? Different?