Abstract
This project applies simple linear regression (SLR) to real data using R. I will be using a dataset from the 2018-2019 NBA season containing selected stats for the 30 NBA teams. The National Basketball Association (NBA) is the premier basketball league, where top players from around the world compete on teams spread across the USA and Toronto for the title of world champion each year. Even though the teams don't change that much, each season is different from the one before, which draws in millions of viewers every year to watch and see who will stand on top of the league.
Besides my big interest in basketball, I am very interested in the data because I hope one day to become a data scientist, work for one of the NBA teams, and help them win a championship. I want to use quantitative methods on data to help guide a team's training and competition for the championship. Each team is different, but they all want to be the ones lifting the trophy at the end, and I want to be part of that.
The data is from Basketball Reference, the official reference partnered with the NBA, which records real-time stats for all teams, players, and organizations. The data is collected over the 82 games of the regular season. More specifically, I will be examining the 2018-2019 NBA season, in which the Toronto Raptors won the championship.
The Toronto Raptors holding the trophy
The variables are:
Team: one of the 30 teams in the NBA
Playoffs: whether or not the team reached the 2018-2019 playoffs (1 = yes, 0 = no)
W: the number of wins in the season
PTS: total points scored in the season
oppPTS: total points scored by all opponents during the season
TRB: total rebounds by a team over the season
oppTRB: total rebounds by the team's opponents over the season
AST: total assists made by a team
oppAST: total assists made by all opponents
This can be seen more clearly in the code and table below.
library(DT) # for datatable(), formatStyle(), and the %>% pipe; assumes the nba data frame is already loaded

datatable( # using the example in the html document on canvas
  nba, filter = 'top', extensions = 'Buttons', options = list(
    pageLength = 5, autoWidth = TRUE, editable = TRUE, dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print')),
  caption = htmltools::tags$caption(
    style = 'caption-side: bottom; text-align: center;',
    'Table 2: ', htmltools::em('stats for all 30 teams in the NBA.')
  )
) %>%
  formatStyle('W', color = 'red', backgroundColor = 'blue', fontWeight = 'bold')
The data was collected by taking the total number of points scored by a team in each game, along with the total number of points scored by their opponent and the winner of the game. These are summed over the season (82 games) to get point totals. The rebound and assist variables were collected the same way, over the same 82 games. If a team advances to the playoffs, it is denoted by a 1 in the data table, and a 0 otherwise.
I want to look at 3 stats that can be collected in a basketball game (points, rebounds, and assists) and see which stat best predicts how many wins a team will have by the end of the season, and ultimately whether that team will qualify for the playoffs.
The three stats can be seen in game with the examples below:
Jeremy Lin scoring 3 points
(NBA 2018)
An assist to LeBron James
First, let's plot our data to see which variable gives us the most linear trend with the points closest to the line. We will calculate the difference of each stat per team, meaning we take a team's total stat minus the total stat of all its opponents, to see how the team fared against the league over the 82 games. (Ajmera 2018)
\(\text{stat.diff} = \text{team}_{stat} - \text{opponent}_{stat}\)
We will repeat this 3 times: for points, rebounds, and assists.
library(ggplot2)

nba$point_diff = nba$PTS - nba$oppPTS # points scored minus points allowed
g = ggplot(nba, aes(x = point_diff, y = W, color = Playoffs)) +
  geom_point() + ggtitle("POINT DIFFERENCE")
g = g + geom_smooth(method = "loess")
g
From the graph above, we can see a very strong linear relationship between the point difference and the number of wins, along with whether a team makes the playoffs. If a team has a higher point difference (meaning it scores more points than its opponents), then the team is more likely to advance to the playoffs.
nba$rb.diff = nba$TRB - nba$oppTRB # rebounds grabbed minus rebounds allowed
g = ggplot(nba, aes(x = rb.diff, y = W, color = Playoffs)) + geom_point()
g = g + geom_smooth(method = "loess")
g
Although there seems to be some correlation, it is not as strong as the point difference graphed before it.
nba$ast.diff = nba$AST - nba$oppAST # assists made minus assists allowed
g = ggplot(nba, aes(x = ast.diff, y = W, color = Playoffs)) + geom_point()
g = g + geom_smooth(method = "loess")
g
Again, although the plot suggests a positive linear relationship, the points fall far from the best-fit line.
Ultimately, from these plots we can see that the difference in points is the best predictor of the number of wins a team will have and whether or not it makes the playoffs.
I think that the relationship between wins (our dependent variable Y) and the point difference (our independent variable X) follows a linear model. To build an accurate model of the relationship, I will be using the theory of simple linear regression (SLR).
One of the main ideas of SLR is that \(\bar Y\) (the mean value of Y) for any \(X_i\) value falls on a straight line when plotted. Any deviations below/above the plotted line are denoted \(\epsilon\). The model is \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\) for the data, and the fitted line is \(\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i\) (the estimate carries no error term). (Mirman 2014)
\(\beta_0\) is the Y-intercept of our model and \(\beta_1\) is its slope; these are the two parameters we must estimate.
The mean of the probability distribution of \(\epsilon\) is 0.
The variance of the probability distribution of \(\epsilon\) is constant for each X (meaning \(\epsilon\) has a constant variance \(\sigma^2\) for all X).
\(\epsilon\) ~ Normal
\(\epsilon\) is independently and identically distributed, meaning the errors are all independent of each other. Since our data is not a time series, we know the observations are independent: each game is INDEPENDENT of the other games and its outcome is random, since we do not know who is going to win. (Mendenhall 2016) (Sheather 2008)
In order to estimate \(\beta_0\) and \(\beta_1\), I will be using the method of least squares, as in labs 3 and 4. To accomplish this, I need to find the estimates that minimize the sum of the squared errors, which gives me the line of best fit in the process.
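The summary output below was presumably produced by a call like the following (per the Call line in the output; the fitting chunk itself is not echoed):

nba.lm = lm(W ~ point_diff, data = nba) # least-squares fit of wins on point difference
summary(nba.lm)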
##
## Call:
## lm(formula = W ~ point_diff, data = nba)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.9674 -1.6577  0.4238  1.6372  4.8466
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.079025   0.457269   89.84   <2e-16 ***
## point_diff   0.029635   0.001171   25.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.505 on 28 degrees of freedom
## Multiple R-squared:  0.9581, Adjusted R-squared:  0.9566
## F-statistic: 640.9 on 1 and 28 DF,  p-value: < 2.2e-16
Since |min| \(\approx\) |max| and the median is close to 0, the residuals are roughly symmetric, which suggests a good fit to the data.
We also obtain the following estimates of \(\beta_0\) and \(\beta_1\):
\(\hat \beta_0\) = 41.079025, \(\hat \beta_1\) = 0.029635
giving the fitted line \(\hat Y_i = 41.079025 + 0.029635\,x_i\)
This formula implies that expected wins increase by 0.029635 for each one-point increase in point difference; for example, outscoring opponents by 100 points over the season adds roughly 3 expected wins.
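The interval output below matches the format of ciReg() from the s20x package (loaded later in this document), so it was presumably produced by something like:

library(s20x) # for ciReg()
ciReg(nba.lm, conf.level = 0.95) # 95% confidence intervals for both coefficients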
##             95 % C.I.lower 95 % C.I.upper
## (Intercept)       40.14235       42.01570
## point_diff         0.02724        0.03203
We see that the 95% confidence interval for \(\beta_0\) is (40.14235, 42.01570).
We see that the 95% confidence interval for \(\beta_1\) is (0.02724, 0.03203).
Both intervals are narrow, and the interval for \(\beta_1\) excludes 0, so a real linear relationship between point difference and wins is plausible.
Next we plot the residuals, defined as \(y - \hat y\) for each point; summing their squares, \(\sum_i (Y_i - \hat Y_i)^2\), gives the RSS, or residual sum of squares.
plot(W ~ point_diff, bg = "Blue", pch = 21, cex = 1.2,
     ylim = c(10, 1.1 * max(nba$W)),
     main = "Residual Line Segments of Wins vs Point difference", data = nba)
# Code taken from lab 3 and modified
abline(nba.lm) # fitted least-squares line
yhat = with(nba, predict(nba.lm, data.frame(point_diff))) # fitted values
with(nba, segments(point_diff, W, point_diff, yhat)) # vertical residual segments
Now we will make a plot showing the deviations of the fitted values from the mean of wins; summing their squares, \(\sum_i (\hat y_i - \bar Y)^2\), gives the MSS, or model sum of squares.
plot(W ~ point_diff, bg = "Blue", pch = 21, cex = 1.2,
     ylim = c(0, 1.1 * max(nba$W)),
     main = "Mean of Wins vs Point difference", data = nba)
abline(nba.lm) # fitted line (code taken from lab 3 and modified)
with(nba, abline(h = mean(W))) # horizontal line at the mean number of wins
with(nba, segments(point_diff, mean(W), point_diff, yhat, col = "Red")) # deviations of fitted values from the mean
When we combine the mean deviations and the residuals, we get the total deviation line segments (\(Y_i - \bar Y\)), which give the Total Sum of Squares: TSS = MSS + RSS.
plot(W ~ point_diff, bg = "Blue", pch = 21, cex = 1.2,
     ylim = c(0, 1.1 * max(nba$W)),
     main = "Total Deviation Line Segments of W vs Point difference", data = nba)
with(nba, abline(h = mean(W))) # horizontal line at the mean number of wins
with(nba, segments(point_diff, W, point_diff, mean(W), col = "Green")) # total deviation segments
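The three sums of squares printed below are not computed in any chunk shown above; a minimal sketch of the calculation, reusing the yhat fitted values from the residual plot earlier:

RSS = with(nba, sum((W - yhat)^2))       # residual sum of squares
MSS = with(nba, sum((yhat - mean(W))^2)) # model sum of squares
TSS = with(nba, sum((W - mean(W))^2))    # total sum of squares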
RSS
## [1] 175.6315
MSS
## [1] 4020.368
TSS
## [1] 4196
MSS/TSS
## [1] 0.9581431
Since the value of MSS/TSS is \(\approx\) 1 (it equals the Multiple R-squared of 0.9581 from the summary above), we know we have a good fit for the model.
MSS + RSS
## [1] 4196
## [1] 4196
We confirm that TSS = MSS + RSS, meaning our numbers make sense.
I will again use a simple plot() to show that a linear model is a good fit for the data.
nba.lm = lm(W ~ point_diff, data = nba)
plot(W ~ point_diff, data = nba, bg = "blue", pch = 21, cex = 1.2, lwd = 2,
     ylim = c(0, 1.1 * max(nba$W)),
     main = "Scatter Plot and Fitted Line of Wins vs Point difference")
abline(nba.lm)
A simple scatter plot of wins against point difference suggests that a straight line is a good fit for the data. But how good a fit is the line? Further on in the document I will do calculations to see how well this straight line represents our data.
library(s20x)
trendscatter(W~point_diff, f = 0.5, data = nba, main="Wins vs point difference scatterplot")
As we can see, the trend is again linear, with the dashed red lines showing the spread of the errors and the solid blue line representing our fitted trend.
As we saw before when we computed confidence intervals for our estimates of \(\beta_0\) and \(\beta_1\), the estimates were inside their intervals.
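The P-value quoted below presumably comes from a Shapiro-Wilk test of the residuals' normality; the original chunk is not shown, but a minimal version of that check with the s20x package would be:

library(s20x)
normcheck(nba.lm) # residual normality plots, annotated with the Shapiro-Wilk P-value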
As we can see, the P-value is 0.859, which means we do not have enough evidence to reject our null hypothesis that \(\epsilon\) ~ Normal(0, \(\sigma^2\)).
nba.res = residuals(nba.lm) # model residuals
nba.fit = fitted(nba.lm)    # fitted values
plot(nba$point_diff, nba.res, xlab = "Point Difference", ylab = "Residuals",
     main = "Residuals vs Point Difference")
There seems to be rough symmetry about Residuals = 0. Although the symmetry is loose, there is no systematic pattern or significant deviation from the best-fit line, implying a linear model is a decent fit for the data.
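The three predictions below correspond to point differences of -100, 0, and 200, so the call was presumably:

predict(nba.lm, data.frame(point_diff = c(-100, 0, 200))) # predicted wins at three point differences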
##        1        2        3
## 38.11557 41.07903 47.00593
As we can see from the predict() function, if a team has a -100 point difference (meaning opponents scored a total of 100 points MORE than the team), it is predicted to win about 38 games.
If a team has a 0 point difference, meaning it scored as many points as all its opponents scored against it, it is predicted to win about 41 games.
Finally, if a team has a 200 point difference, then that team is predicted to have about 47 wins in the season.
We must check for outliers to see if they are affecting the interpretation of our data. We can do this with Cook's distance, which measures how much impact each observation has on the fitted model. (Gao, Ahn, and Zhu 2015)
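The Cook's distance plot referenced below is not echoed; one standard way to produce it from the fitted model is:

plot(nba.lm, which = 4) # Cook's distance per observation; the most influential points are labelled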
We see that observations 7, 17, and 29 have a major impact on our fit, with large distance values. If we remove those observations, we can get a much better interpretation.
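Per the Call lines in the output below, the refit drops rows 7, 17, and 29; a sketch of the comparison (the name nba2.lm is my placeholder):

nba2.lm = lm(W ~ point_diff, data = nba[-c(7, 17, 29), ]) # refit without the influential rows
summary(nba2.lm)
summary(nba.lm) # original fit, repeated for comparison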
##
## Call:
## lm(formula = W ~ point_diff, data = nba[-c(7, 17, 29), ])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.2105 -1.6254  0.2088  1.5429  4.3190
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.541692   0.426246   97.46   <2e-16 ***
## point_diff   0.030561   0.001128   27.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.201 on 25 degrees of freedom
## Multiple R-squared:  0.967, Adjusted R-squared:  0.9657
## F-statistic: 733.6 on 1 and 25 DF,  p-value: < 2.2e-16
##
## Call:
## lm(formula = W ~ point_diff, data = nba)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.9674 -1.6577  0.4238  1.6372  4.8466
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.079025   0.457269   89.84   <2e-16 ***
## point_diff   0.029635   0.001171   25.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.505 on 28 degrees of freedom
## Multiple R-squared:  0.9581, Adjusted R-squared:  0.9566
## F-statistic: 640.9 on 1 and 28 DF,  p-value: < 2.2e-16
We can see that our standard error for \(\beta_1\) improved a little, going from 0.001171 to 0.001128; what is more interesting is that our standard error for \(\beta_0\) improved noticeably more, going from 0.457269 to 0.426246.
The NBA is one of the most exciting sports leagues to watch, due to the great skill of the players and the unpredictability of each season. However, with the power of statistics, we are able to predict some outcomes given the right amount of data.
Looking at the difference of points is the most accurate of the three ways of determining how many wins a team can expect in a regular 82-game season. The data answer the questions of how many wins a team will have, and whether it will make the playoffs, very precisely! In the 2018-2019 NBA season, the lowest win total that still reached the playoffs was the Detroit Pistons' 41 wins, and our model predicts a baseline of 41 wins for a team with a point difference of 0. Then, depending on a team's point difference, we can add to or subtract from that baseline to see how the teams stack up against each other and whether they will advance to the playoffs.
When we invoked the function to look at the Cook's distance, we saw some big outliers that swayed our data. This may be caused by the league splitting the NBA into 2 conferences (West and East). Because of the division, one conference may be tougher than the other from year to year. When a stronger conference plays a weaker conference, it will obviously sway the data toward the stronger conference, because its teams are more likely to win against their weaker opponents. For example, in the 2018-2019 NBA season, the minimum number of wins needed to advance to the playoffs in the Eastern Conference was 41, but in the Western Conference the minimum was 48 wins, a total TIED by the Los Angeles Clippers and San Antonio Spurs.
A way we can perhaps overcome this is to look at data from only one conference, so the comparison will not be biased. However, if we do that we will cut our sample size in half, from 30 to 15. So to make sure we have a sufficient sample size, I would suggest looking at multiple seasons from one conference. (IBM, n.d.a) (IBM, n.d.b) (CNN 2010) (Forbes, n.d.)
Ajmera, Aman. 2018. “Linear Regression Model on National Basketball Association (NBA) Dataset,” no. 2.
CNN. 2010. “5 Billion Dollar Tech Gambles - Watson.”
Forbes. n.d. “IBM’s Watson Gets Its First Piece of Business in Healthcare.”
Gao, Qibing, Mihye Ahn, and Hongtu Zhu. 2015. “Cook’s Distance Measures for Varying Coefficient Models with Functional Responses.” Technometrics : A Journal of Statistics for the Physical, Chemical, and Engineering Sciences 57 (2): 268.
IBM. n.d.a. “Deep Thunder - Transforming the World.”
———. n.d.b. “Watson Machine Learning Pricing.”
Mendenhall, William M., and Terry L. Sincich. 2016. Simple Linear Regression. 6th ed. Florida, USA: Taylor & Francis Group, LLC.
Mirman, Daniel. 2014. Growth Curve Analysis and Visualization Using R. Chapman & Hall/CRC The R Series. CRC Press.
Sheather, Simon J. 2008. A Modern Approach to Regression with R. Springer Texts in Statistics. New York; London: Springer.