knitr::opts_chunk$set(echo = TRUE, comment = "AL")

1 Introduction

When the coronavirus spread to the USA, the National Basketball Association (NBA) was in the middle of its 2019-2020 season. The season was immediately suspended until further notice when Rudy Gobert, a player for the Utah Jazz (Salt Lake City's NBA team), tested positive for Covid-19. However, talks surfaced about a potential "bubble environment" that could completely isolate the teams so they could keep playing without having to worry about the possibility of being infected. So on July 7th, 22 teams were invited to Disney World to continue playing against each other, and the result was one of the most magical seasons ever played.

Even though many top teams were playing, only one can be crowned champion at the end of the season. Is there a way to predict who the champion will be based on how well (or how poorly) a team performed in the regular season? This project will attempt to build a model and determine which variables can best be used to model how well a team will succeed.

NBA logo

1.1 Variables

First, let's read in the data to see what variables we will be working with. More specifically, we will be looking at the variables that are recorded game to game for each individual team while that team is on OFFENSE.

#reading in the data
library(readxl)
NBA = read_xlsx("C:/OU/Math 4773/Projects/Project 2/NBA team stat.xlsx")
names(NBA)
AL  [1] "Team" "FG"   "3P"   "2P"   "2PA"  "FT"   "ORB"  "AST"  "STL"  "W"
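Before going through the variables one by one, it can help to peek at how R read them in; this is a small optional check (output not shown in the original report), using only the NBA object created above:

#quick structure check of the data (optional)
str(NBA) #Team should be character, every other column numeric
summary(NBA$W) #quick summary of the response variable, wins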

The variables that we will be looking at closely are:

1.1.1 TEAM

Team: the name of the NBA team you are looking at. (qualitative)

The NBA teams

1.1.2 3P

3P: The total number of 3 pointers made by a team per game; a 3 pointer is a successful shot taken from behind the arc called the "3 point line." (quantitative)

Jeremy Lin scoring a 3 pointer

1.1.3 2P

2P: The total number of 2 pointers made by a team per game, any successful shot inside the 3 point line (can be a shot or a dunk). (quantitative)

A mid-range 2 point shot

1.1.4 FT

FT: The total number of free throws made by a team per game; a free throw is a 1 point shot awarded to a player after being fouled. Free throws are taken from the "free throw line," located 15 feet (4.572 meters) from the basket. (quantitative)

an underhanded free throw

1.1.5 ORB

ORB: The total number of offensive rebounds by a team per game. An offensive rebound occurs when an offensive player regains possession of the basketball after a missed shot by himself or a teammate. (quantitative)

offensive rebound by Marcus Smart off a missed free throw by his teammate

1.1.6 AST

AST: The total number of assists made by a team per game. An assist is credited when a player passes to a teammate and that teammate scores. Only the pass that led directly to the 2/3 pointer is counted as an assist; all prior passes do not count as extra assists. (quantitative)

Assist by Russell Westbrook to Andre Roberson

1.1.7 W

W: The total number of wins a team had at the end of the regular season. (quantitative)

This year's NBA champions, the Los Angeles Lakers!

1.2 How were the variables collected?

During each regular-season NBA game, each of the aforementioned variables is recorded game to game and then averaged over the 82 game season.

1.3 Why is the data collected?

The NBA is a global phenomenon, drawing millions of viewers and millions of dollars in revenue each season. One way to see these statistics is as an answer to "how do we keep the NBA interesting so we can profit more from the entertainment market?" But the more practical view, and the one this project is based on, is how one NBA team can get an edge over the others by studying strengths and weaknesses through statistics.

1.3.1 What is my interest in the data?

I want to see what variables contribute the most to a team's success, so I can hopefully sharpen my analysis skills and one day work for one of the 30 NBA teams, helping them win a championship through the game of statistics.

1.4 Research question:

With so many variables and different focuses during a basketball game, I want to narrow down what NBA teams should focus on in order to win the most games possible. That is to say, I want to look at which variables have the greatest effect on the number of wins NBA teams achieve.

1.5 Plotting the data

These plots show each variable vs W.

1.5.1 3P 3 Pointer plot

library(ggplot2)
g = ggplot(data = NBA, aes(x = `3P`, y = W)) + geom_point()
g = g + geom_smooth(method = "loess") + geom_text(aes(label = Team), hjust = 0, vjust = 1) + labs(title = "3 Pointers")
g
AL `geom_smooth()` using formula 'y ~ x'

This is a plot of 3 pointers per game vs the number of wins a team got.

1.5.2 2P 2 Pointer plot

g = ggplot(data = NBA, aes(x = `2P`, y = W)) + geom_point()
g = g + geom_smooth(method = "loess") + geom_text(aes(label = Team), hjust = 0, vjust = 1) + labs(title = "2 Pointers")
g
AL `geom_smooth()` using formula 'y ~ x'

This plot shows the number of 2 pointers per game vs the number of wins for each of the 30 NBA teams.

1.5.3 FT Free Throw

#Free throw plot of free throws made per game vs wins
g = ggplot(data = NBA, aes(x = FT, y = W)) + geom_point()
g = g + geom_smooth(method = "loess") + geom_text(aes(label = Team), hjust = 0, vjust = 1) + labs(title = "Free Throws")
g
AL `geom_smooth()` using formula 'y ~ x'

This is the number of Free throws per game vs the number of wins for each of the 30 teams.

1.5.4 ORB Offensive rebound plot

g = ggplot(data = NBA, aes(x = ORB, y = W)) + geom_point()
g = g + geom_smooth(method = "loess") + geom_text(aes(label = Team), hjust = 0, vjust = 1) + labs(title = "Offensive Rebounds")
g
AL `geom_smooth()` using formula 'y ~ x'

This is the number of offensive rebounds per game vs wins for each of the 30 teams in the NBA.

1.5.5 AST Assist plot

g = ggplot(data = NBA, aes(x = AST, y = W)) + geom_point()
g = g + geom_smooth(method = "loess") + geom_text(aes(label = Team), hjust = 0, vjust = 1) + labs(title = "Assists")
g
AL `geom_smooth()` using formula 'y ~ x'

This is a plot of the number of assists per game a team has vs the number of wins.

2 Theory behind MLR

Below are 3 important proofs/theories around MLR

2.1 General formula

In a simple linear model, we have \(E(Y|X=x) = \beta_0 + \beta_1x\). Now let's say we observe \(n\) cases \(Y_1, Y_2, \dots, Y_n\), independent response values of Y. Then the simple linear model for case \(i\) is \(Y_i = E(Y_i|X_i=x_i) + \epsilon_i = \beta_0 + \beta_1x_i + \epsilon_i\). If we then add more predictors, the formula transforms into \(E(Y|X_1 = x_1, X_2 = x_2, \dots, X_p = x_p) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p\) (Sheather 2008)
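Stacking the \(n\) cases gives the matrix form used in the next subsection (this is just a restatement of the formula above, with \(Y\) the \(n \times 1\) response vector, \(X\) the \(n \times (p+1)\) design matrix whose first column is all 1s, \(\beta\) the \((p+1) \times 1\) coefficient vector, and \(\epsilon\) the \(n \times 1\) error vector):

\[Y = X\beta + \epsilon\]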

2.2 RSS to calculate \(\hat{\beta}\)

RSS can be written in matrix form as a function of \(\beta\): \(RSS(\beta) = (Y-X\beta)^T(Y-X\beta)\), where the model for the response vector is \(Y = X\beta+\epsilon\).

Now expand the product, remembering the transpose rule \((AB)^T = B^TA^T\): \[RSS(\beta) = Y^TY + (X\beta)^TX\beta - Y^TX\beta - (X\beta)^TY = Y^TY + \beta^T(X^TX)\beta - 2\beta^TX^TY\]

The two cross terms are equal scalars, so they combine into the single term \(-2\beta^TX^TY\) (Rosenfeld, n.d.)

Now, to find the minimum, we take the derivative with respect to \(\beta\) and set it equal to zero, which gives us: \(-2X^TY+2X^TX\hat{\beta}=0\) (Rosenfeld, n.d.)

Now we can solve for \(\hat{\beta}\) and we get the following: \[2X^TX\hat{\beta} = 2X^TY \implies X^TX\hat{\beta} = X^TY\] Multiply both sides by the inverse of \(X^TX\) to get \(\hat{\beta}\) by itself:

\[\hat{\beta} = (X^TX)^{-1}X^TY\] (Rencher and Schaalje 2018)
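As a quick numerical check of this formula (a sketch using the NBA data read in earlier and the 3 pointer / 2 pointer / offensive rebound model that gets selected later in this report), we can build the design matrix with model.matrix() and confirm that \((X^TX)^{-1}X^TY\) reproduces the coefficients that lm() reports:

#verify beta-hat = (X^T X)^{-1} X^T Y on the NBA data
X = model.matrix(W ~ `3P` + `2P` + ORB, data = NBA) #design matrix (intercept column of 1s plus the predictors)
Y = NBA$W #response vector of wins
beta.hat = solve(t(X) %*% X) %*% t(X) %*% Y #(X^T X)^{-1} X^T Y
beta.hat #should match coef(lm(W ~ `3P` + `2P` + ORB, data = NBA))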

2.3 Visual aid

Here is a visual aid to help with the idea of linear regression. Since we are now working with vectors and matrices, imagine a hyper-space in which we want a visual representation of \(Y = X\beta+\epsilon\); then we would get a picture such as the following.

a rough visual aid

As we can see, the blue line is the Y vector of our responses, and the green and pink vectors are \(\hat{\epsilon}\) and \(X\hat{\beta}\) respectively. The red lines are vectors orthogonal to \(\hat{\epsilon}\) and represent \(X\beta\) and \(\epsilon\) in the hyper-plane.

3 Model selection

From the before mentioned variables, the FULL model I have in mind right now will look something like \(E(Y_{Wins}) = \beta_0 + \beta_1x_{3p} + \beta_2x_{2p} + \beta_3x_{FT} +\beta_4x_{ORB} + \beta_5x_{AST} + \beta_6x_{3p}x_{2p}\)

#the full model
nba.lm = lm(W ~`3P` + `2P` + FT + ORB + AST + `3P`*`2P`,data = NBA)

3.1 Using the lm() function and interpreting the summary output

summary(nba.lm)
AL 
AL Call:
AL lm(formula = W ~ `3P` + `2P` + FT + ORB + AST + `3P` * `2P`, 
AL     data = NBA)
AL 
AL Residuals:
AL     Min      1Q  Median      3Q     Max 
AL -15.743  -6.668  -0.774   7.330  14.067 
AL 
AL Coefficients:
AL              Estimate Std. Error t value Pr(>|t|)
AL (Intercept) -326.7440   245.3218  -1.332    0.196
AL `3P`          21.6834    19.4184   1.117    0.276
AL `2P`          10.7492     8.3214   1.292    0.209
AL FT             0.7591     1.3274   0.572    0.573
AL ORB           -3.5390     2.4033  -1.473    0.154
AL AST            0.3506     1.1943   0.294    0.772
AL `3P`:`2P`     -0.5657     0.6925  -0.817    0.422
AL 
AL Residual standard error: 10.11 on 23 degrees of freedom
AL Multiple R-squared:  0.4293, Adjusted R-squared:  0.2804 
AL F-statistic: 2.883 on 6 and 23 DF,  p-value: 0.03032

From the summary output, we can see that the full model is not a good model for our response variable (number of wins) due to the low multiple \(R^2\) of 0.4293 (only about 43% of the variability in wins is explained by this model) and the even smaller \(R^2_a\) of 0.2804, which shows the model does not hold up once we adjust for its complexity (number of terms).

However, the P-value of the F-statistic shows that the model is adequate at the \(\alpha\) level of 0.05, meaning that at least one of the parameters (coefficients on the independent variables) is \(\neq 0\).

It is also not very clear which parameters we should keep or remove, due to the high P-values on each parameter. We will use the AIC next to determine what our final model should look like, because we can't get it from the raw summary output alone.

3.2 Using AIC

step(nba.lm,direction = "backward")
AL Start:  AIC=144.82
AL W ~ `3P` + `2P` + FT + ORB + AST + `3P` * `2P`
AL 
AL             Df Sum of Sq    RSS    AIC
AL - AST        1     8.804 2358.1 142.93
AL - FT         1    33.408 2382.7 143.24
AL - `3P`:`2P`  1    68.150 2417.4 143.68
AL <none>                   2349.3 144.82
AL - ORB        1   221.484 2570.8 145.52
AL 
AL Step:  AIC=142.93
AL W ~ `3P` + `2P` + FT + ORB + `3P`:`2P`
AL 
AL             Df Sum of Sq    RSS    AIC
AL - FT         1    31.880 2390.0 141.34
AL - `3P`:`2P`  1    64.088 2422.2 141.74
AL <none>                   2358.1 142.93
AL - ORB        1   228.429 2586.5 143.71
AL 
AL Step:  AIC=141.34
AL W ~ `3P` + `2P` + ORB + `3P`:`2P`
AL 
AL             Df Sum of Sq    RSS    AIC
AL - `3P`:`2P`  1    65.595 2455.6 140.15
AL <none>                   2390.0 141.34
AL - ORB        1   308.011 2698.0 142.97
AL 
AL Step:  AIC=140.15
AL W ~ `3P` + `2P` + ORB
AL 
AL        Df Sum of Sq    RSS    AIC
AL <none>              2455.6 140.15
AL - ORB   1    360.50 2816.1 142.26
AL - `2P`  1    895.05 3350.6 147.47
AL - `3P`  1   1169.48 3625.0 149.83
AL 
AL Call:
AL lm(formula = W ~ `3P` + `2P` + ORB, data = NBA)
AL 
AL Coefficients:
AL (Intercept)         `3P`         `2P`          ORB  
AL    -118.512        6.285        4.197       -4.257

From the AIC step function, we can see that the final model we should be using will only include 3 pointers, 2 pointers, and offensive rebounds (ORB). This means that the model with only these three variables is the one that is "closest" to the true model among the candidates considered.
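As a side check (a small sketch, not part of the step() output above), the AIC values printed by step() come from extractAIC(), so we can confirm that the selected model really does have a lower AIC than the full model:

#compare the full model and the step()-selected model on the AIC scale used by step()
nba.lm.step = lm(W ~ `3P` + `2P` + ORB, data = NBA) #the model chosen by step()
extractAIC(nba.lm) #returns (equivalent df, AIC) for the full model
extractAIC(nba.lm.step) #the AIC here should be the lower of the two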

3.3 Final model

Now let's make an object that stores the improved and FINAL model that we will be using for the rest of our analysis.

#the final model
nba.lm.reduced = lm(W ~ `3P` + `2P` + ORB, data = NBA)
summary(nba.lm.reduced) #summary output of our final model
AL 
AL Call:
AL lm(formula = W ~ `3P` + `2P` + ORB, data = NBA)
AL 
AL Residuals:
AL     Min      1Q  Median      3Q     Max 
AL -15.130  -7.583   1.271   6.735  15.547 
AL 
AL Coefficients:
AL             Estimate Std. Error t value Pr(>|t|)   
AL (Intercept) -118.512     61.229  -1.936  0.06387 . 
AL `3P`           6.285      1.786   3.519  0.00162 **
AL `2P`           4.197      1.363   3.078  0.00486 **
AL ORB           -4.257      2.179  -1.954  0.06157 . 
AL ---
AL Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AL 
AL Residual standard error: 9.718 on 26 degrees of freedom
AL Multiple R-squared:  0.4035, Adjusted R-squared:  0.3346 
AL F-statistic: 5.861 on 3 and 26 DF,  p-value: 0.003383

3.4 Anova test

anova(nba.lm.reduced,nba.lm) #anova of reduced vs full model to test for what we need to include in our model

We can see from the anova analysis that the extra terms (terms not in the nested model) have a P-value greater than 0.05, so we accept the NULL as plausible: the \(\beta\) terms not in the nested model are 0, meaning \(\beta_3 = \beta_5 = \beta_6 = 0\) (Fritz 2015)
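For reference, here is a sketch of the arithmetic anova() performs for this comparison, computing the partial F statistic by hand from the residual sums of squares of the two models:

#partial F test of the reduced (nested) model against the full model, by hand
rss.reduced = sum(residuals(nba.lm.reduced)^2) #RSS of the 3-variable model
rss.full = sum(residuals(nba.lm)^2) #RSS of the full 6-term model
q = df.residual(nba.lm.reduced) - df.residual(nba.lm) #number of beta terms dropped (FT, AST, 3P:2P)
F.stat = ((rss.reduced - rss.full)/q)/(rss.full/df.residual(nba.lm))
F.stat
pf(F.stat, q, df.residual(nba.lm), lower.tail = FALSE) #should match the anova() P-value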

3.5 What do the point estimates mean from our formula?

From the final reduced formula that we got from the step() function, our point estimates are: \[\hat{\beta}_0 = -118.512\] \[\hat{\beta}_1 = 6.285\] \[\hat{\beta}_2 = 4.197\] (meaning that each additional 3 pointer made per game is associated with about 6.285 more wins, and each additional 2 pointer per game with about 4.197 more wins, holding the other variables constant)

and \[\hat{\beta}_4 = -4.257\] (meaning that each additional offensive rebound per game is associated with about 4.257 fewer wins, holding the other variables constant)

4 Checking validity

Remember that MLR takes on the 4 assumptions of:

  1. Errors have a mean of 0
  2. The errors have a constant variance of \(I\sigma^2\)
  3. The errors are normally distributed: \(\epsilon \sim N_n(0, I\sigma^2)\)
  4. Errors are independent of each other

4.1 Shapiro-Wilk test

library(s20x)
normcheck(nba.lm.reduced, shapiro.wilk = T)

From the Shapiro-Wilk test, we see that the residuals look very close to normally distributed, judging from the large P-value. This means that we can accept the NULL as plausible and say that the errors of our model follow a normal distribution. This means that assumption number 3 is checked off.
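The same test can be run directly on the residuals with base R's shapiro.test(); this is just a quick cross-check of the normcheck() output above:

#Shapiro-Wilk test on the residuals of the reduced model
shapiro.test(residuals(nba.lm.reduced))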

4.2 Residual VS Fitted

nba.res = residuals(nba.lm.reduced)
nba.fit = fitted(nba.lm.reduced)

trendscatter(nba.res ~ nba.fit, f = 0.5, xlab = "Fitted Values", ylab = "Residuals", main = "Residual vs Fitted Value")

There seems to be rough symmetry of the residuals about 0. Although there are a few possible outliers, the general band of residuals is centered close to y = 0, so the errors appear to have mean 0, checking off assumption 1.

We can also see that there is no real pattern appearing within this graph, so it looks like there is a constant variance, fulfilling assumption number 2.
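As an optional cross-check, base R's built-in diagnostic plot for a fitted lm object shows the same residual vs fitted picture with a smoothed trend line:

#base-R residual vs fitted diagnostic for the reduced model
plot(nba.lm.reduced, which = 1)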

4.3 Independence in the data

Because this isn't a time series, we have to consider how the "experiment" was set up to argue that the observations are independent. Because every regular season game has no effect on the games before and after it, we know that one game does not depend on the game before or after it. Therefore, we can say that the data here is independent, fulfilling the final MLR assumption that the errors are independent.

5 Checking confidence intervals

ciReg(nba.lm.reduced)
AL             95 % C.I.lower    95 % C.I.upper
AL (Intercept)     -244.37026           7.34578
AL `3P`               2.61365           9.95611
AL `2P`               1.39474           7.00003
AL ORB               -8.73519           0.22177

As we can see, we can say with 95% confidence that the true underlying coefficients for the variables of our reduced model lie in the following intervals: 3 pointers: (2.61365, 9.95611); 2 pointers: (1.39474, 7.00003); offensive rebounds: (-8.73519, 0.22177).
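These intervals can also be reproduced with base R's confint(), a quick cross-check of the ciReg() output:

#95% confidence intervals for the reduced-model coefficients using base R
confint(nba.lm.reduced, level = 0.95)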

6 Using predict()

predict(nba.lm.reduced) 
AL        1        2        3        4        5        6        7        8 
AL 34.81015 35.59544 33.28721 16.48513 28.47623 28.04990 42.92472 35.18770 
AL        9       10       11       12       13       14       15       16 
AL 29.77921 22.64972 43.17051 42.45681 36.01742 36.45285 38.78626 38.65516 
AL       17       18       19       20       21       22       23       24 
AL 51.60326 34.13038 41.01711 18.34265 36.81439 26.19186 33.51982 36.50176 
AL       25       26       27       28       29       30 
AL 42.12769 38.38195 42.64328 38.59136 39.03931 37.31078

As we can see, when we call the predict() function, it returns the number of wins each of the 30 teams are predicted to have (point estimates).

Notably, our hometown favorites, the OKC Thunder (number 21 in the above output), are predicted to win about 36 games with their values of 3 pointers, 2 pointers, and offensive rebounds.
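We can also hand predict() a newdata argument to get a prediction interval for a hypothetical team; the stat line below (13 threes, 28 twos, and 10 offensive rebounds per game) is made up purely to illustrate the call:

#prediction interval for a hypothetical per-game stat line
new.team = data.frame(`3P` = 13, `2P` = 28, ORB = 10, check.names = FALSE) #keep the original column names
predict(nba.lm.reduced, newdata = new.team, interval = "prediction")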

7 Regression plane

7.1 Regression plane with 2 pointers and 3 pointers with ORB = 0

#library(rockchalk)
library(plotly)
AL 
AL Attaching package: 'plotly'
AL The following object is masked from 'package:ggplot2':
AL 
AL     last_plot
AL The following object is masked from 'package:stats':
AL 
AL     filter
AL The following object is masked from 'package:graphics':
AL 
AL     layout
nba.3p.2p = lm(W ~`3P` + `2P` , data = NBA)
Three.pointers = NBA$`3P`
Two.Pointers = NBA$`2P`

x <- seq(8, 18, by = 2)
y <- seq(12, 32, by = 4)
plane <- outer(x, y, function(a, b){nba.3p.2p$coef[1] + 
    nba.3p.2p$coef[2]*a + nba.3p.2p$coef[3]*b})

plot_ly(data = NBA, z = ~W, x = ~Three.pointers, y = ~Two.Pointers, opacity = 0.5) %>%
  add_markers() %>%
  add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
AL Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
AL Please use `arrange()` instead.
AL See vignette('programming') for more help
AL This warning is displayed once every 8 hours.
AL Call `lifecycle::last_warnings()` to see where this warning was generated.

This is probably the most interesting regression plane to study, because (excluding free throws) every shot from a live ball in play is either a 3 pointer or a 2 pointer. This regression plane idea was taken from (Wood, n.d.)

Notice that the points are not that far off from the plane, meaning the errors should be small. We can also check the sum of the residuals of our reduced model:

sum(nba.lm.reduced$residuals)
AL [1] 2.131628e-14

Notice how, when we sum our residuals, we get a number essentially equal to 0! (Mendenhall 2016)

8 Checking for outliers with Cooks plot

cooks20x(nba.lm.reduced)

8.1 What to conclude from cooks plot

We see that observation 14 is the main outlier because its Cook's distance exceeds 0.133 (the 4/n rule of thumb for detecting influential points with Cook's distance) (Zach 2019), so we will remove the 14th observation from our data set. This 14th observation may be having too much of an impact on our fitted model since it has such a large Cook's distance (Gao, Ahn, and Zhu 2015)
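The flagged observation can be confirmed numerically with base R's cooks.distance() and the 4/n rule of thumb; a small sketch:

#find observations whose Cook's distance exceeds the 4/n cutoff
cd = cooks.distance(nba.lm.reduced)
cutoff = 4/nrow(NBA) #4/30, approximately 0.133
which(cd > cutoff) #should flag observation 14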

8.1.1 New dataset without 14th observation

nba_new.lm=lm(W~`3P` + `2P` + ORB, data=NBA[-c(14),])
summary(nba_new.lm)
AL 
AL Call:
AL lm(formula = W ~ `3P` + `2P` + ORB, data = NBA[-c(14), ])
AL 
AL Residuals:
AL      Min       1Q   Median       3Q      Max 
AL -14.9877  -7.6515   0.1994   5.6550  14.1668 
AL 
AL Coefficients:
AL             Estimate Std. Error t value Pr(>|t|)   
AL (Intercept)  -96.282     60.156  -1.601  0.12204   
AL `3P`           6.112      1.719   3.556  0.00154 **
AL `2P`           3.648      1.346   2.710  0.01197 * 
AL ORB           -4.752      2.112  -2.250  0.03351 * 
AL ---
AL Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AL 
AL Residual standard error: 9.339 on 25 degrees of freedom
AL Multiple R-squared:  0.4304, Adjusted R-squared:  0.362 
AL F-statistic: 6.296 on 3 and 25 DF,  p-value: 0.002485

We observe that the P-value of the F-statistic got "better," going from 0.003383 to 0.002485, giving us more evidence to reject the NULL hypothesis and say that the (new) model has even better adequacy.

9 Conclusion

An NBA game has many different variables that can help a team excel on the offensive end. Now we can see which offensive stats teams should focus on if they want to win as many games as possible in order to qualify for the playoffs in the spring.

9.1 Answer to research question

Earlier, I posed the research question: what stats should teams focus on offensively in order to achieve the most wins in the regular season? With our model, the answer to that question is: teams should focus on 3 pointers, 2 pointers, and offensive rebounds on offense to win the most games possible.

9.2 Ways to improve model

Just like anything in life, there are a number of ways this experiment can be improved upon.

9.2.1 Variance inflation

One variable from the final model that caught my eye because of its odd coefficient was ORB (offensive rebounds), which had a value of -4.257. This is surprising because, usually, when a team gets an offensive rebound, it gets another chance to score after a missed shot. That would lead us to believe the coefficient should be POSITIVE, since more offensive rebounds should lead to more points and therefore more wins. There are 3 reasons why I think the coefficient is negative:

  1. There is some sort of dependence between 3 pointers and/or 2 pointers, leading to some variance inflation.
  2. When a team gets an offensive rebound, it disrupts the pace of the game, because most teams have a realistic mindset: if a player or teammate misses a shot, the team switches its mindset to how it will defend the opposing team. If they instead get an offensive rebound, they might not know what set play to run, leading to a turnover (meaning the opponent has a free break-away to the basket) because players aren't on the same page (from personal experience).
  3. When players are scrambling for an offensive rebound, referees will often call a loose ball foul, which sets the offensive team back and penalizes them for scrambling for the ball if the referee deems the offensive player too rough while chasing the rebound (from personal experience).

Because only reason 1 above can be quantified, we will attempt to see if there is some sort of multi-collinearity / dependence between the variables, specifically involving offensive rebounds.

9.2.2 Variance inflation test

car::vif(nba.lm.reduced)
AL     `3P`     `2P`      ORB 
AL 2.002835 1.996323 1.005721

Since the variance inflation factors are very low, we will have to favor the game-play explanations (reasons 2 and 3) over multi-collinearity as the reason for the negative coefficient.

9.2.2.1 Pairs plot

pairs(W ~ `3P`+`2P`+ORB, data = NBA)

From the pairs plot, we see no clear evidence that any higher-degree (curved) relationship should be associated with the independent variables of our model, so we can keep using a degree of 1. This gives us even more evidence that offensive rebounds genuinely have a negative association with wins, since the variable appears to enter the model with the correct degree.

9.2.3 Experimental ways to improve the model

As I discussed in the previous project in MATH 4753, the NBA is split into 2 conferences: EAST and WEST. In most years, one conference will have stronger teams, because NBA players tend to shift over to the stronger conference for a better chance of playing for a team that can seriously contend for the championship.

This can be seen in the quick calculation below for the 2019-2020 season, where we take the average number of wins of the top 4 teams from each conference.

avg.east.top4 = (45+48+53+56)/4
avg.west.top4 = (52+49+44+46)/4
  avg.east.top4
AL [1] 50.5
  avg.west.top4
AL [1] 47.75

Notice how the East has \(\approx\) 3 more wins on average than the West.

A way that we could potentially improve this experiment is to limit our observations to only one conference, but that means cutting our sample size in half (from 30 teams to 15), and we don't know if we can then apply certain assumptions such as the CLT. One possible solution is to look at multiple seasons within the one conference we are studying, to increase our sample size.

We could also try a randomized block design, where each "block" is a group of 30/n teams and the treatment is which other block a given block plays. But this has many issues: teams within a block won't be able to play each other, and it would (most likely) completely disrupt the NBA's current EAST vs WEST playoff structure, because with random teams playing each other, some blocks may end up weaker than others and the playoffs might be skewed.

References

Fritz, Mike, and Paul D. Berger. 2015. Improving the User Experience Through Practical Data Analytics: Gain Meaningful Insight and Increase Your Bottom Line.

Gao, Qibing, Mihye Ahn, and Hongtu Zhu. 2015. “Cook’s Distance Measures for Varying Coefficient Models with Functional Responses.” Technometrics : A Journal of Statistics for the Physical, Chemical, and Engineering Sciences 57 (2): 268.

Mendenhall, William M., and Terry L. Sincich. 2016. Simple Linear Regression. 6th ed. Florida, USA: Taylor & Francis Group, LLC.

Rencher, Alvin C., and G. Bruce Schaalje. 2018. “Chapter 7: Multiple Regression: Estimation.”

Rosenfeld, Michael J. n.d. “OLS in Matrix Form.”

Sheather, Simon J. 2008. A Modern Approach to Regression with R Simon J. Sheather. Springer Texts in Statistics. New York ; London: Springer.

Wood, Robert. n.d. “Visualizing Multiple Regression in 3D.”

Zach. 2019. “How to Identify Influential Data Points Using Cook’s Distance.”