
Megan Robertson, Corey Schwab, Meredith Manley

Statistics 225 Nonparametric Statistics


Final Project Write-Up
An Analysis of NBA Player Tracking Data
Abstract
The purpose of this project was to investigate the player tracking data
available for the NBA. We first explored how the distributions of several tracking
variables differ by position. We then fit models to predict the minutes per game that
an NBA player can expect to play based on other game-level statistics. Finally, we
estimated the densities of select variables using different kernels and bandwidths
and selected the best options for each.
Introduction
For this project, our group wanted to take advantage of the data available
through the tracking cameras to learn more about the game of basketball. The
tracking data provides more detailed information than what is available in the
standard box score. The data, described in the next section, provides information
about the locations of players at different points during the game as well as data on
things such as how many times a player touches the ball or drives to the basket. We
wanted to take advantage of this richer data set and see what more we could learn
about professional basketball.
The first part of the project examined how the distributions of several
variables differ between guards and forwards/centers. Looking at these
differences could provide insight into the roles of the positions on a team. It could
also be helpful in recruiting, since coaches would know what skills a player at a
given position should have.

In the model-fitting section, we created two different models to predict the
minutes per game that a player would play. We fit one model using parametric
multiple linear regression and the other using the nonparametric method of
rank-based regression. Predicting the minutes per game that a player should play
could provide insight for a player who believes that he is not receiving the playing
time that he should. He could use the models to predict the minutes that he should
be playing based on his other statistics, and then approach the coach with an
argument. Our parametric model has a high adjusted R^2 (0.9229). It also predicted
the minutes per game of about 87% of players to within five minutes of the actual
value for the 2014-2015 regular NBA season.
In addition to examining the differences in distributions and fitting the
minutes per game models, we estimated the densities of several variables by trying
different kernels and bandwidth options. Knowing the distributions of these variables
has the potential to help players understand how their skills compare to those of
other NBA players. It also allows us to understand how the percentiles of different
statistical categories change across the data.
Data
The data used for this project was obtained from stats.nba.com. The data is
collected using player tracking cameras that have been in every NBA arena since
the 2013-2014 season.[1] These cameras keep track of the locations of the
players and the ball throughout the course of the game by capturing multiple
images per second. As a result, this data contains more detailed information than
the typical box score. Our final data set contained over 70 variables,
so the following table defines only the variables that are referenced in the
procedures that were carried out.

[1] NBA partners with Stats LLC for tracking technology. Nba.com. NBA, September 5, 2013. http://www.nba.com/2013/news/09/05/nba-statsllc-player-tracking-technology/ (accessed April 30, 2015).
Variable | Description
Distance.Traveled.per.game | Distance that a player travels while on the court per game (miles)
Opp.FGA.at.rim.per.game* | Number of field goals that an opponent attempts at the rim (close to the basket) per game
Opponent.FGP.at.rim | Percentage of field goals made by an opponent at the rim
Rebounds.per.game | Number of rebounds by the player per game; a rebound occurs when a player grabs a missed shot, and it can occur when the player is on offense or defense
Touches.per.game* | Number of instances where a player touches and possesses the ball
PTS.per.game* | Number of points scored by a player per game
Passes.per.game* | Number of passes that are either made or received by the player per game
Uncontested.REB.per.game* | Rebounds that a player gets when an opponent is not within 3.5 feet of the player
Drives.per.game* | Number of times that a player drives to the basket per game; a drive is defined as a touch that begins at least twenty feet away from the basket and is then dribbled to within ten feet of the basket
STL.per.game* | Number of steals per game; a steal occurs when a defensive player takes the ball from an offensive player
BLK.per.game* | Number of blocks per game; a block occurs when a player hits the ball while an opponent is shooting and prevents the opponent from scoring
Points.created.by.assist.per.game* | Number of points that a player creates through assists; an assist occurs if the player makes a pass that directly leads to a made basket
Guard, Forward, Center* | An indicator variable that is 1 if a player plays the position and 0 otherwise
Table 1 (variables marked with * were the starting candidates for model fitting)

Part I Comparing Distributions


One of the goals of this research project was to see where certain positions
add value to the team. To do this, the distributions of certain statistics were
compared between guards and forwards/centers. These groups were chosen
because there were not many pure centers in our data (only 61 out of 480), and
some of the players were listed as both a forward and a center. To
compare the two groups, both Fligner-Policello tests and t-tests were performed.
The observations are not strictly independent because the players are
playing against each other: if one center runs farther, the center who is
covering him will also run farther in that game. However, players do not
compete against players of the same skill level every game, and any two players
face each other in only a small portion of the games. Thus, it is reasonable to
treat the independence condition as satisfied.
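The Fligner-Policello comparison described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code used for the project (which was run in R). The function names and the sample numbers are hypothetical, and the p-value relies on the large-sample normal approximation.

```python
from math import sqrt
from statistics import NormalDist, mean


def placements(sample, other):
    """For each value in `sample`, count the values in `other` below it.

    Ties are counted as one half, a common convention.
    """
    return [
        sum(1.0 if o < s else 0.5 if o == s else 0.0 for o in other)
        for s in sample
    ]


def fligner_policello(x, y):
    """Return (U-hat, two-sided p-value) for the Fligner-Policello test.

    A positive statistic suggests that values in `y` tend to be larger
    than values in `x`.
    """
    p = placements(x, y)  # placements of the x values among y
    q = placements(y, x)  # placements of the y values among x
    p_bar, q_bar = mean(p), mean(q)
    v1 = sum((pi - p_bar) ** 2 for pi in p)
    v2 = sum((qj - q_bar) ** 2 for qj in q)
    denom = 2.0 * sqrt(v1 + v2 + p_bar * q_bar)
    if denom == 0.0:
        # Completely separated samples: the statistic is unbounded.
        u_hat = float("inf") if sum(q) > sum(p) else float("-inf")
        return u_hat, 0.0
    u_hat = (sum(q) - sum(p)) / denom
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(u_hat)))
    return u_hat, p_value


# Made-up values standing in for one statistic in each position group.
guards = [2.1, 2.4, 2.6, 2.2, 2.8, 2.5]
bigs = [2.0, 2.3, 2.7, 2.9, 2.4, 2.6]
print(fligner_policello(guards, bigs))
```

With real data, each list would hold one value per player in the corresponding position group.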
The distance each player ran per game (Distance.Traveled.per.game) was one
variable for which we compared the distributions for the two position groups. Before
running the tests, we thought that the forwards and centers might run more
because they run from one baseline of the court to the other, while the guards stay
more around the mid-court area and the three-point lines. The assumptions for the
t-test were not met here because the distributions of Distance.Traveled.per.game for
each group did not appear to be normal. In addition, the assumptions for the
Fligner-Policello test were not met because the distributions are not symmetric
about their medians. Despite the issues with the assumptions, the tests were
conducted. Neither the t-test nor the Fligner-Policello test provided enough
evidence to conclude that a difference exists between the distance that guards run
per game and the distance that forwards/centers run per game.
Next, these procedures were performed on the distributions of opponents' field goal
percentage at the rim (OPP.FGP.at.rim). This statistic provides one way
to quantify the quality of a player's defense. We thought that this would be
interesting to examine since there is not much current research available on
how to measure a player's defensive skills. For these distributions, the normality
condition appeared to be met based on the density plots shown in the Appendix.
Both the t-test and the Fligner-Policello test indicated
that there is a statistically significant difference between the two groups. Both tests
provided evidence that opponents' field goal percentage at the rim is
higher against guards than it is against forwards and centers.
Another variable examined was rebounds per game (REB.per.game). The
Fligner-Policello test indicated that there was a statistically significant
difference in the number of rebounds between the two position groups. This could be
due to the fact that forwards and centers are generally closer to the basket on both
offense and defense, whereas the guards are generally positioned around the
perimeter. We then ran the parametric t-test, which gave a similar statistically
significant result. However, it is important to note that the normality condition was
not met for this procedure, since these distributions are clearly right-skewed. Thus,
we prefer the nonparametric test for this particular variable.
In addition, the number of times a player touched the ball or had possession
of the ball during the game was examined (Touches.per.game). The Fligner-Policello
test indicated that there was a statistically significant difference between the
number of times that the guards touched the ball during a game and the number of
times that the forwards and centers did. This could be due to the fact that the
guards take possession after every basket scored by the opposing team and therefore
accumulate more touches as they bring the ball back up the court. They are also the
ones setting up the majority of the plays, which could account for the disparity in
touches between the positions. The parametric t-test resulted in the same conclusion.
The distributions of touches for the two position groups were not normal, so the
nonparametric procedure is preferred in this situation.
Part II Model Fitting

The response variable in the models is minutes per game (MIN.per.game). Models
were fit using both parametric and nonparametric methods.
Because of the size of the data set (there were over seventy variables in the final
data set), it was necessary to choose a smaller number of candidate variables
in order to fit the models. These variables were selected using
scatterplots of minutes per game against the different variables. Since the models
were created to predict MIN.per.game, the possible explanatory variables are on the
game level. In addition, Spearman's rho was calculated between minutes per
game and each variable. Spearman's rho was used because it measures the strength
of a monotonic relationship and does not assume that the
relationship is linear. The variables that were used as the starting candidates for
model fitting are marked with * in Table 1.
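As an illustration of the screening statistic, Spearman's rho is the Pearson correlation computed on ranks. The sketch below is a minimal Python version using only the standard library; it is not the project's R code, and the function names and example numbers are made up.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank the values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical points per game and minutes per game for six players.
pts = [4.2, 8.1, 12.5, 15.0, 21.3, 25.9]
minutes = [10.5, 18.0, 24.2, 23.9, 33.1, 36.4]
print(spearman_rho(pts, minutes))
```

Because only the ranks enter the calculation, a curved but steadily increasing relationship still yields a rho of 1, which is why the statistic suits screening before any transformations are chosen.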
Parametric
Two of the aforementioned variables, Points.created.by.assist.per.game and
Total.Drives, required transformations in order to satisfy the linearity condition.
After the transformations, the scatterplots between MIN.per.game and the chosen
variables showed that the linearity assumption was met. The automated stepwise
procedure in R was used to reduce the number of variables in the model. The final
model that resulted from this was:
MIN.per.game ~ PTS.per.game + Passes.per.game + Uncontested.REB.per.game
+ STL.per.game + log.Points.created.by.assist.per.game + BLK.per.game +
Forward + Guard + Opp.FGA.at.rim.per.game + Center + Touches.per.game +
BLK.per.game:Forward + Guard:Opp.FGA.at.rim.per.game
The interaction terms included in the model are not surprising. One of these
interactions is between Opp.FGA.at.rim.per.game and Guard. Forwards and centers
tend to be near the basket on defense and thus are able to discourage field goal
attempts near the rim. The other interaction is between BLK.per.game and Forward.
Forwards are more likely to get blocks than guards since
forwards tend to be taller. It is interesting that the interaction term
BLK.per.game:Center was not found to be significant, as centers tend to be taller and
better at blocking. This term may not affect playing time because centers are
expected to block, and thus the centers in the league are all good at blocking.
The linearity condition for each variable appeared to be satisfied based on
scatterplots of each variable against MIN.per.game. However, there were concerns
with the other conditions based on the residuals vs. fitted plot and the Q-Q plot; see
the technical appendix for these displays. The residuals vs. fitted plot showed a
downward trend for players who played more than 30 minutes per game. The data
is not entirely independent because players in the data set compete against one
another, and thus the performance of one player affects the performance of another
player. For example, if two players are guarding one another, the distances they
each run will be related. However, there are over 450 players in the data, and
players compete against opponents with a variety of skills, so this
amount of dependence is small enough that we are comfortable proceeding with the
parametric model fitting.
The 2014-2015 regular season data was used to assess the performance of
the model. The model produced predictions of MIN.per.game for this data, and
these predictions were compared to the actual MIN.per.game for each player. The
average difference was about 2.42 minutes and the maximum difference was 11.04
minutes. An NBA game is 48 minutes, so being off by 11 minutes is a large portion
of the game. However, the middle 50% of the differences fell between 0.79 and 3.30
minutes, so few differences were near the eleven-minute mark. The adjusted R^2 for
the final parametric model is 0.9229.
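The summary of prediction differences reported above (mean, maximum, quartiles, and the share of players predicted to within five minutes) could be computed along these lines. This is a hedged Python sketch using only the standard library; the function name, the use of absolute differences, and the sample values are assumptions rather than details taken from the project.

```python
from statistics import mean, quantiles


def summarize_errors(actual, predicted, tolerance=5.0):
    """Summarize absolute differences between actual and predicted values."""
    diffs = [abs(a - p) for a, p in zip(actual, predicted)]
    # The "inclusive" method treats the data as a complete population.
    q1, _median, q3 = quantiles(diffs, n=4, method="inclusive")
    return {
        "mean": mean(diffs),
        "max": max(diffs),
        "q1": q1,
        "q3": q3,
        "share_within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
    }


# Made-up minutes per game for four players and their model predictions.
actual_min = [10.0, 20.0, 30.0, 40.0]
predicted_min = [12.0, 19.0, 36.0, 40.0]
print(summarize_errors(actual_min, predicted_min))
```

Reporting the quartiles alongside the maximum shows whether a large worst-case miss is typical or an isolated case.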

[Figure 1: Densities of the actual and predicted MIN.per.game for the parametric model]

Figure 1 provides a comparison of the actual MIN.per.game and
the MIN.per.game predicted by the model. The density
plots show that the model does not do well at predicting minutes for
players who are getting more than 25 or 30
minutes a game. This may occur because there are
skills that our data set does not capture that could
influence playing time. For example, a player may be very good at setting screens
on offense or cutting through the lane to get their teammates open. Perhaps a
player sees an increase in playing time as a result of their leadership capabilities.
These are factors that are not accounted for in the data set.
Nonparametric

The same initial variables were used in the fitting of the nonparametric
models. In order to reduce the number of variables, backward elimination was
used. The nonparametric model contains the same variables as the parametric
model except that it also has the variables BLK.per.game:Center, sqrt.Drives.per.game,
sqrt.Drives.per.game:Guard and sqrt.Drives.per.game:Center, and it does not include
Touches.per.game. This difference could be the result of using stepwise regression
to fit the parametric model and backward elimination for
the nonparametric model.[2] The adjusted R^2 value for this model
is approximately 0.8548. The conditions for
the nonparametric model were met: the residuals are
centered at zero and have a symmetric
distribution (see Technical Appendix for histogram).

[Figure 2: Densities of the actual and predicted MIN.per.game for the nonparametric model]


Figure 2 compares the densities of the predicted and actual MIN.per.game and
shows an issue similar to the one observed with the
parametric model. The nonparametric model also does not do well at
predicting the playing time of players who average more than 25 or 30
minutes per game.
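For reference, rank-based regression is commonly formulated as minimizing a dispersion function of the residuals rather than their sum of squares. The write-up does not state which estimator or scores were used, so the form below (Jaeckel's dispersion with Wilcoxon scores) is an assumption, shown only to illustrate the idea. Here R(e_i) denotes the rank of the i-th residual among all n residuals.

```latex
\hat{\beta} = \arg\min_{\beta} D(\beta), \qquad
D(\beta) = \sum_{i=1}^{n} a\bigl(R(e_i(\beta))\bigr)\, e_i(\beta), \qquad
e_i(\beta) = y_i - x_i^{\top}\beta,
\\
a(i) = \varphi\!\left(\frac{i}{n+1}\right), \qquad
\varphi(u) = \sqrt{12}\left(u - \tfrac{1}{2}\right).
```

Because each residual is weighted by a function of its rank rather than by its own size, a few extreme residuals influence the fit less than they do in least squares.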
The preferred model is the parametric model. The conditions for both models
were treated as met, but the adjusted R^2 value for the parametric model was higher
than that of the nonparametric model.

[2] A nonparametric model was also fit using the variables in the final parametric
model. All of these variables were found to be significant, but the model had an
adjusted R^2 of only about 0.851.

Part III Density Estimations

To further explore the distributions of these variables, density plots were
created using different kernels and bandwidths. To begin with, the default options
in R were tried: the Gaussian kernel and the NRD0 bandwidth. The other kernels used
were the box and Epanechnikov kernels. The
other bandwidths used were Normal Reference Distribution (NRD), Unbiased
Cross-Validation (UCV), Biased Cross-Validation (BCV), and Sheather and Jones (SJ).
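To make the roles of the kernel and the bandwidth concrete, the sketch below evaluates a kernel density estimate with the three kernels named above and a rule-of-thumb bandwidth of the NRD0 form, 0.9 * min(sd, IQR/1.34) * n^(-1/5). It is a minimal Python illustration using only the standard library, not the project's R code. The function names and data are hypothetical, and the kernels are left in their textbook scaling, whereas R rescales each kernel so that the bandwidth equals the kernel's standard deviation. The curves for the box and Epanechnikov kernels will therefore not match R's output exactly.

```python
from math import exp, pi, sqrt
from statistics import quantiles, stdev

# Textbook kernel shapes on the standardized distance u = (x - x_i) / h.
KERNELS = {
    "gaussian": lambda u: exp(-0.5 * u * u) / sqrt(2.0 * pi),
    "box": lambda u: 0.5 if abs(u) <= 1.0 else 0.0,
    "epanechnikov": lambda u: 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0,
}


def bandwidth_nrd0(data):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    spread = min(stdev(data), (q3 - q1) / 1.34)
    if spread == 0.0:
        # Fall back to the standard deviation (or 1) if the IQR is zero.
        spread = stdev(data) or 1.0
    return 0.9 * spread * len(data) ** -0.2


def kde(data, x, kernel="gaussian", bandwidth=None):
    """Evaluate the kernel density estimate of `data` at the point `x`."""
    h = bandwidth if bandwidth is not None else bandwidth_nrd0(data)
    k = KERNELS[kernel]
    return sum(k((x - xi) / h) for xi in data) / (len(data) * h)


# Made-up per-game values standing in for one tracking variable.
sample = [1.2, 1.9, 2.1, 2.4, 2.5, 2.6, 2.8, 3.3]
for name in KERNELS:
    print(name, kde(sample, 2.5, kernel=name))
```

A smaller bandwidth makes the estimate follow individual observations more closely, while a larger one smooths them together, which is the trade-off the bandwidth selectors are trying to balance.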
For the density plots of the distance per game distribution, the Gaussian
kernel appeared to fit the data best because it was smoother than both the
box and Epanechnikov kernels. The density plots were then fit using the Gaussian
kernel and each of the bandwidth options. Only the SJ and UCV options noticeably
changed the plots from the default. Since
the distributions were not normal, it might be best to use the UCV, BCV, or SJ
bandwidth options for the distribution of distance per game.
Similarly, for the opponents' field goal percentage at the rim distribution, the
Gaussian kernel fit the data best because it was the smoothest of the three
options. When the bandwidth options were changed, the density did not change
significantly between the default and the other options. The distribution of this
variable was approximately normal, however, so it would be appropriate to use the
default bandwidth option of NRD0.
The next distribution examined was that of rebounds per game
(REB.per.game). Of the different kernels, the Gaussian kernel did the best job
smoothing the distribution. Differences between this kernel and the Epanechnikov
kernel were minimal; however, it was still evident that the Gaussian kernel provided
the smoothest distribution. Among the different options for bandwidths, again,
there were small differences between the four options; however, it appears that
the default nrd bandwidth does the best job smoothing the distribution of
this variable.
Following the same procedure for testing different options of kernels and
bandwidths, we looked at another variable, touches per game. Again, there were
no drastic differences in the extent of smoothing between the Epanechnikov
and Gaussian kernels, but the Gaussian kernel smoothed the distribution
slightly better. When testing the bandwidth options, we observed that the "bcv" and
"nrd" bandwidths do the best job smoothing these distributions. Because "nrd"
does a slightly better job, this bandwidth is selected.
Conclusion
In the first section of the project, it was found that there were statistically
significant differences between guards and forwards/centers in the distributions of
touches per game, rebounds per game, and opponents' field goal percentage at the
rim. No evidence was found of a difference between the two groups in
the distribution of distance traveled per game. The conditions
for both the t-test and Fligner-Policello test were met for opponents' field goal
percentage at the rim. Both tests resulted in the same conclusion, so there is no
preference for one over the other. For the other variables, the conditions were not
met, so their conclusions should be interpreted with caution.
The second part of the project contains a model that was created in order to
predict the minutes per game that a player should receive. The conditions for both
the parametric and nonparametric methods of model fitting were treated as satisfied.
However, the parametric model had a higher adjusted R^2 value than either of the
nonparametric models that were tested. Therefore, the parametric version is
preferred.
In the final part of the project, density estimates with various kernels and
bandwidths were fit to variables in order to determine the best fit. The best
density estimate for the distance traveled per
game variable was fit using a Gaussian kernel. The Gaussian kernel was also the
smoothest for opponents' field goal percentage at the rim, and the best bandwidth
option was NRD0. For both the rebounds per game and touches per game
variables, the Gaussian kernel and the NRD bandwidth were found to provide the
best density estimates.
The analyses in this project were conducted using data from the 2013-2014
NBA regular season. Thus, the conclusions, models and results in this project can
only be applied to NBA players. It would not be appropriate to use this information
to make decisions about other levels of basketball, such as high school or college. It
is also important to note that rule changes in the future could alter the nature of the
NBA and invalidate the conclusions of this report.
