Final Project Write-Up

An Analysis of NBA Player Tracking Data

Abstract

The purpose of this project was to investigate the player tracking data available for the NBA. We first explored how the distributions of several variables differ by position. We then fit a model to predict the minutes per game that an NBA player can expect to play based on other game-level statistics. Finally, we estimated the densities of select variables using different kernels and bandwidths and selected the best options.

Introduction

For this project, our group wanted to take advantage of the data available through the tracking cameras to learn more about the game of basketball. The tracking data provides more detailed information than what is available in the standard box score. The data, described in the next section, provides information about the locations of players at different points during the game as well as data on things such as how many times a player touches the ball or drives to the basket. We wanted to take advantage of this richer data set and see what more we could learn about professional basketball.

The first part of the project examined the difference in distributions of different variables for the guards versus the forwards/centers. Looking at the differences between the variables for the positions could provide insight into the roles of positions on the teams. It could also be helpful in recruiting since coaches would know what skills a player of a certain position should have.

The second part of the project fit two models to predict the minutes per game that a player would play. We fit one model using parametric multiple linear regression and the other using the nonparametric method of rank-based regression. Predicting the minutes per game that a player should play could provide insight for a player who believes that he is not receiving the playing time that he should. He could use the models to predict the minutes that he should be playing based on his other statistics, and then approach the coach with an argument. Our parametric model has a high adjusted R² (0.9229). It also predicted about 87% of players' minutes per game within five minutes of the actual value for the 2014-2015 regular NBA season.

In addition to examining the differences in distributions and fitting the minutes per game model, we estimated the density of various variables by trying different kernels and bandwidth options. Knowing the distributions of these variables has the potential to help players understand how their various skills compare to those of other NBA players. It also allows us to understand how the percentiles of different statistical categories change in the data.

Data

The data used for this project was obtained from stats.nba.com. The data is collected using player tracking cameras that have been in every NBA arena since the 2013-2014 season.¹ These cameras are able to keep track of the locations of the players as well as the ball throughout the course of the game by capturing multiple images per second. As a result, there is more detailed information available in this data than is in the typical box scores. Our final data set contained over 70 variables, so the following table defines the variables that are referenced in the different procedures that were carried out.

¹ NBA partners with Stats LLC for tracking technology. Nba.com. NBA, September 5, 2013. http://www.nba.com/2013/news/09/05/nba-statsllc-player-tracking-technology/ (accessed April 30, 2015).

Variable | Description
Distance.Traveled.per.game | Distance that a player travels while on the court per game (miles)
Opp.FGA.at.rim.per.game* | Number of field goals that an opponent attempts at the rim (close to the basket) per game
Opponent.FGP.at.rim | Percentage of field goals made by an opponent at the rim
Rebounds.per.game | Number of rebounds by the player per game; a rebound occurs when a player grabs a missed shot, on either offense or defense
Touches.per.game* | Number of instances where a player touches and possesses the ball
PTS.per.game* | Number of points scored by a player per game
Passes.per.game* | Number of passes that are either made or received by the player per game
Uncontested.REB.per.game* | Rebounds that a player gets when an opponent is not within 3.5 feet of the player
Drives.per.game* | Number of times that a player drives to the basket per game; a drive is defined as a touch that begins at least twenty feet away from the basket and is then dribbled to within ten feet of the basket
STL.per.game* | Number of steals per game; a steal occurs when a defensive player takes the ball from an offensive player
BLK.per.game* | Number of blocks per game; a block occurs when a player hits the ball while an opponent is shooting and prevents the opponent from scoring
Points.created.by.assist.per.game* | Number of points that a player creates through assists; an assist occurs if the player makes a pass that directly leads to a made basket
Guard, Forward, Center* | An indicator variable that is 1 if a player is the position and 0 otherwise

Table 1

One of the goals in this research project was to see where certain positions add value to the team. To do this, the distributions of certain statistics were compared between guards and forwards/centers. These groups were chosen because there were not many pure centers in our data (only 61 out of 480), and some players were listed as both a forward and a center. To compare the two groups, both Fligner-Policello tests and t-tests were performed. The observations are not strictly independent because the players are playing against each other, so if one center runs farther, the center covering him will also run farther in that game. However, players do not face opponents of the same skill level every game, and they play against the same players for only a small portion of the games. Thus, we treat the independence condition as approximately satisfied.
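As a rough illustration of the nonparametric comparison described above, the following Python sketch computes the Fligner-Policello statistic from placements and uses the large-sample normal approximation for the p-value. The function names and the sample values are hypothetical, and the project itself appears to have been carried out in R, so this is a sketch of the technique rather than the code that produced the reported results.

```python
import math


def placements(sample, other):
    """For each value in `sample`, count the values in `other` below it.

    Ties are counted as one half, a common convention for this test.
    """
    return [
        sum(1.0 if o < s else 0.5 if o == s else 0.0 for o in other)
        for s in sample
    ]


def fligner_policello(x, y):
    """Return (statistic, two-sided p-value) via the normal approximation.

    A positive statistic suggests that values in `y` tend to be larger.
    """
    p = placements(x, y)  # placements of the x values among y
    q = placements(y, x)  # placements of the y values among x
    p_bar = sum(p) / len(p)
    q_bar = sum(q) / len(q)
    v1 = sum((pi - p_bar) ** 2 for pi in p)
    v2 = sum((qj - q_bar) ** 2 for qj in q)
    denom = 2.0 * math.sqrt(v1 + v2 + p_bar * q_bar)
    if denom == 0.0:
        raise ValueError("statistic undefined: samples are completely separated")
    u = (sum(q) - sum(p)) / denom
    p_value = math.erfc(abs(u) / math.sqrt(2.0))  # 2 * (1 - Phi(|u|))
    return u, p_value


# Hypothetical rebounds-per-game values, for illustration only.
guards = [2.1, 3.4, 2.8, 4.0, 3.1, 5.9]
bigs = [5.6, 7.2, 3.9, 8.1, 6.4, 3.0]
statistic, p_value = fligner_policello(guards, bigs)
```

With samples as small as these, the normal approximation is crude; exact or tabulated critical values would be more appropriate.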

The distance each player ran per game (Distance.Traveled.per.game) was one variable for which we compared the distributions for the two position groups. Before running the tests, we thought that the forwards and centers might run more because they run from one baseline of the court to the other, while the guards stay more around the mid-court area and the three-point lines. The assumptions for the t-test were not met here because the distributions of Distance.Traveled.per.game for each group did not appear to be normal. In addition, the assumptions for the Fligner-Policello test were not met because the distributions are not symmetric about their medians. Despite the issues with the assumptions, the tests were conducted. Neither the t-test nor the Fligner-Policello test found evidence of a difference between the distance that guards ran per game and the distance that forwards/centers ran per game.

Next, these procedures were performed on the distributions of opponents' field goal percentage at the rim (OPP.FGP.at.rim). This statistic provides some insight into how to quantify the quality of a player's defense. We thought that this would be interesting to examine since there is not much current research available on how to measure a player's defensive skills. For these distributions, the normality condition appeared to be met based on the density plots shown in the Appendix. Both the t-test and the Fligner-Policello test indicated that there is a statistically significant difference between the two groups. Both tests provided evidence that opponents' field goal percentage at the rim is higher for guards than it is for forwards and centers.
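The parametric comparison can be sketched in the same spirit. The report does not say whether a pooled or an unequal-variance t-test was used; the Python sketch below assumes the unequal-variance (Welch) form, which is also the default of R's t.test. The function name and the sample values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return the Welch t statistic and its approximate degrees of freedom."""
    nx, ny = len(x), len(y)
    vx = variance(x) / nx  # squared standard error of the mean of x
    vy = variance(y) / ny  # squared standard error of the mean of y
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Hypothetical opponent field goal percentages at the rim.
guards_fgp = [0.58, 0.61, 0.55, 0.63, 0.60]
bigs_fgp = [0.49, 0.52, 0.47, 0.55, 0.50]
t_stat, dof = welch_t(guards_fgp, bigs_fgp)
```

The statistic would then be compared with a t distribution with `dof` degrees of freedom to obtain a p-value.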

Another variable examined was rebounds per game (REB.per.game). The Fligner-Policello test indicated a statistically significant difference in the number of rebounds between the two position groups. This could be due to the fact that forwards and centers are generally closer to the basket on both offense and defense, whereas the guards are generally positioned around the perimeter. We then ran the parametric t-test, which gave a similar statistically significant result. However, it is important to note that the normality condition was not met for this procedure, as these distributions are clearly right-skewed. Thus, we prefer the nonparametric procedure for this particular variable.

In addition, the number of times a player touched the ball or had possession of the ball during the game was examined (Touches.per.game). The Fligner-Policello test indicated a statistically significant difference between the number of times that the guards touched the ball during a game and the number of times that the forwards and centers did. This could be because the guards take possession after every basket scored by the opponent and therefore accumulate more touches as they bring the ball back up the court. They are also the ones setting up the majority of the plays, which could account for the disparity in touches between the positions. The parametric t-test resulted in the same conclusion. The distributions for the two position groups were not normal, so the nonparametric procedure is preferred in this situation.

Part II Model Fitting

The response variable investigated in this part is minutes per game. A model was fit using both parametric and nonparametric methods. Because of the size of the data set (there were over seventy variables in the final data set), it was necessary to choose a smaller number of variables to investigate in order to fit the model. These variables were selected using scatterplots between minutes per game and the different variables. Since the models were created to predict MIN.per.game, the possible explanatory variables are on the game level. In addition, Spearman's rho was calculated between minutes per game and each variable. Spearman's rho was used because it measures the strength of a monotonic relationship and does not assume that the relationship is linear. The variables that were used as the starting set for model fitting are marked with * in Table 1.
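To make the screening step concrete, Spearman's rho can be computed as the Pearson correlation of the ranks. The Python sketch below is a minimal illustration; the function names and the sample data are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: a monotone but nonlinear relationship.
touches = [10.0, 20.0, 30.0, 40.0, 50.0]
minutes = [1.0, 4.0, 9.0, 16.0, 25.0]
rho = spearman_rho(touches, minutes)
```

Because the hypothetical relationship is perfectly monotone, rho equals 1 here even though the relationship is not linear, which is the property that motivated its use for screening.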

Parametric

Two of the aforementioned variables, Points.created.by.assist.per.game and Total.Drives, required transformations in order to satisfy the linearity condition. After the transformations, the scatterplots between MIN.per.game and the chosen variables showed that the linearity assumption was met. The automated stepwise procedure in R was used to reduce the number of variables in the model. The final model that resulted from this was:

MIN.per.game ~ PTS.per.game + Passes.per.game + Uncontested.REB.per.game
+ STL.per.game + log.Points.created.by.assist.per.game + BLK.per.game +
Forward + Guard + Opp.FGA.at.rim.per.game + Center + Touches.per.game +
BLK.per.game:Forward + Guard:Opp.FGA.at.rim.per.game

The interaction terms included in the model are not surprising. One of these interactions is between Opp.FGA.at.rim.per.game and Guard. Forwards and centers tend to be near the basket on defense and thus are able to discourage field goal attempts near the rim. Forwards are also more likely to get blocks than guards since

BLK.per.game*Center is not found to be significant as centers tend to be taller and better at blocking. This term may not affect playing time because centers are expected to block and thus the centers in the league are all good at blocking.
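The interaction terms in the final parametric model are simply products of a statistic and a position indicator. The Python sketch below expands one player's statistics into the thirteen terms of that model; the function name and the player values are hypothetical, and the log-transformed column is taken as given rather than recomputed.

```python
MAIN_EFFECTS = [
    "PTS.per.game",
    "Passes.per.game",
    "Uncontested.REB.per.game",
    "STL.per.game",
    "log.Points.created.by.assist.per.game",
    "BLK.per.game",
    "Forward",
    "Guard",
    "Opp.FGA.at.rim.per.game",
    "Center",
    "Touches.per.game",
]


def design_row(player):
    """Expand one player's statistics into the terms of the final model."""
    row = {name: float(player[name]) for name in MAIN_EFFECTS}
    # Each interaction is the product of a statistic and a 0/1 indicator,
    # so it is nonzero only for players at that position.
    row["BLK.per.game:Forward"] = row["BLK.per.game"] * row["Forward"]
    row["Guard:Opp.FGA.at.rim.per.game"] = (
        row["Guard"] * row["Opp.FGA.at.rim.per.game"]
    )
    return row


# Hypothetical forward, for illustration only.
example_player = {
    "PTS.per.game": 14.2,
    "Passes.per.game": 31.0,
    "Uncontested.REB.per.game": 4.1,
    "STL.per.game": 0.9,
    "log.Points.created.by.assist.per.game": 1.6,
    "BLK.per.game": 1.5,
    "Forward": 1,
    "Guard": 0,
    "Opp.FGA.at.rim.per.game": 4.2,
    "Center": 0,
    "Touches.per.game": 48.0,
}
example_row = design_row(example_player)
```

For this forward, the block interaction carries the player's blocks per game, while the guard interaction is zero.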

The linearity condition appeared to be satisfied based on scatterplots of each variable against MIN.per.game. However, there were issues with the other conditions based on the residuals vs. fitted plot and the Q-Q plot; see the technical appendix for these displays. The residuals vs. fitted plot showed a downward trend for players who played more than 30 minutes per game. The data is not entirely independent because players in the data set compete against one another, and thus the performance of one player affects the performance of another. For example, if two players are guarding one another, the distances they each run will be related. However, there are over 450 players in the data, so this amount of dependence is small enough that we are comfortable proceeding with the parametric model fitting. In addition, players are competing against players with a variety of skills.

The 2014-2015 regular season data was used to analyze the performance of the model. The model produced predictions of MIN.per.game for this data, and these predictions were compared to the actual MIN.per.game for each player. The average difference was about 2.42 minutes and the maximum difference was 11.04 minutes. An NBA game is 48 minutes, so being off by 11 minutes is a large portion of the game. However, the middle 50% of the differences fell between 0.79 and 3.30 minutes, so there were not many differences near the eleven-minute mark. The adjusted R² for the final parametric model is 0.9229.
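The summary statistics reported above can be reproduced for any set of predictions with a few lines of Python. The sketch below assumes that the differences are absolute differences between actual and predicted minutes, which the report does not state explicitly; the function name and the sample values are hypothetical.

```python
from statistics import mean, quantiles


def prediction_summary(actual, predicted, tolerance=5.0):
    """Summarise absolute prediction errors in minutes per game."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    q1, _, q3 = quantiles(errors, n=4, method="inclusive")
    return {
        "mean_abs_error": mean(errors),
        "max_abs_error": max(errors),
        "quartiles": (q1, q3),
        # Share of players predicted within `tolerance` minutes.
        "share_within_tolerance": sum(e <= tolerance for e in errors) / len(errors),
    }


# Hypothetical actual and predicted minutes per game.
actual_minutes = [10.0, 20.0, 30.0, 40.0, 12.0]
predicted_minutes = [12.0, 19.0, 36.0, 40.0, 12.0]
summary = prediction_summary(actual_minutes, predicted_minutes)
```

The `share_within_tolerance` entry corresponds to the kind of figure quoted in the introduction, namely the share of players predicted within five minutes of their actual value.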

Figure 1 compares the density of the actual MIN.per.game with the MIN.per.game predicted by the model. The density plots show that the model does not do well with predicting players who are getting more than 25 or 30 minutes a game. This may occur because there are skills that our data set does not capture that could influence playing time. For example, a player may be very good at setting screens or cutting through the lane to get teammates open. Perhaps a player sees an increase in playing time as a result of leadership capabilities. These are factors that are not accounted for in the data set.

Nonparametric

The same initial variables were used in the fitting of the nonparametric models. In order to reduce the number of variables, backward regression was utilized. The nonparametric model contains the same variables as the parametric model except it also has the variables BLK.per.game*Center, sqrt.Drives.per.game, sqrt.Drives.per.game*Guard and sqrt.Drives.per.game*Center. The parametric model has the variable Touches.per.game whereas the nonparametric model does not.

These differences may stem from using the stepwise procedure to fit the parametric model and backwards regression for the nonparametric model.² The adjusted R² value for this model is approximately 0.8548. The conditions for the nonparametric model were met. The residuals are centered at zero and have a symmetric distribution.

Figure 2 compares the densities of the predicted and actual MIN.per.game and displays an issue similar to the one observed with the parametric model. The nonparametric model also does not do well predicting the playing time for players who are averaging more than 25 or 30 minutes per game.
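For readers unfamiliar with rank-based regression, the Python sketch below illustrates the idea for a single predictor: the slope is chosen to minimise Jaeckel's dispersion of the residuals, and the intercept is taken as the median residual. The use of Wilcoxon scores, the golden-section search, the median intercept, and all names and data here are assumptions made for illustration; the report does not describe these details, and its models used many predictors.

```python
import math
from statistics import median


def wilcoxon_dispersion(slope, x, y):
    """Jaeckel's dispersion of the residuals y - slope * x (Wilcoxon scores)."""
    residuals = [yi - slope * xi for xi, yi in zip(x, y)]
    n = len(residuals)
    order = sorted(range(n), key=lambda i: residuals[i])
    total = 0.0
    for rank, i in enumerate(order, start=1):
        score = math.sqrt(12.0) * (rank / (n + 1) - 0.5)
        total += score * residuals[i]
    return total


def rank_based_line(x, y, lo=-100.0, hi=100.0, tol=1e-8):
    """Fit a line by minimising the convex dispersion over the slope."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = wilcoxon_dispersion(c, x, y)
    fd = wilcoxon_dispersion(d, x, y)
    while b - a > tol:  # golden-section search on [a, b]
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = wilcoxon_dispersion(c, x, y)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = wilcoxon_dispersion(d, x, y)
    slope = (a + b) / 2.0
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Hypothetical data on the line y = 2x + 1 with one gross outlier.
xs = [float(i) for i in range(10)]
ys = [2.0 * xi + 1.0 for xi in xs]
ys[9] = 100.0
slope, intercept = rank_based_line(xs, ys)
```

In this example the fitted slope and intercept stay close to 2 and 1 despite the outlier, which illustrates why a rank-based fit can be attractive when residuals are not well behaved.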

The conditions for both models were met, but the adjusted R² value for the parametric model was higher than that of the nonparametric model. Thus, the parametric model is preferred.

² A nonparametric model was fit using the variables in the final parametric model. All of these variables were found to be significant, but the model had an adjusted R² of only about 0.851.

Density plots for the variables were created using different kernels and bandwidths. To begin with, the default options in R were tried. These options are the Gaussian kernel and the NRD0 bandwidth. The other kernels used were the box and Epanechnikov kernels. The other bandwidths used were Normal Reference Distribution (NRD), Unbiased Cross-Validation (UCV), Biased Cross-Validation (BCV), and Sheather and Jones (SJ).
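The three kernels mentioned above can be compared with a small kernel density estimator. The Python sketch below uses the textbook form, in which the box and Epanechnikov kernels have support on [-1, 1]; R's density function instead rescales each kernel so that the bandwidth is the kernel's standard deviation, so the numbers will not match R's output for those two kernels. The names and the sample values are hypothetical.

```python
import math

KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi),
    "box": lambda u: 0.5 if abs(u) <= 1.0 else 0.0,
    "epanechnikov": lambda u: 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0,
}


def kde(data, grid, bandwidth, kernel="gaussian"):
    """Evaluate the kernel density estimate of `data` at each grid point."""
    k = KERNELS[kernel]
    n = len(data)
    return [
        sum(k((g - xi) / bandwidth) for xi in data) / (n * bandwidth)
        for g in grid
    ]


# Hypothetical distances travelled per game (miles).
distances = [2.0, 2.5, 3.1, 4.0]
grid = [-2.0 + 0.001 * i for i in range(10001)]  # covers -2 to 8
estimates = {name: kde(distances, grid, 0.5, name) for name in KERNELS}
```

Plotting the three curves in `estimates` against `grid` shows the pattern the report describes: the Gaussian estimate is the smoothest, the box estimate is piecewise constant, and the Epanechnikov estimate lies in between.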

For the density plots of the distance per game distribution, the Gaussian kernel appeared to fit the data best because it was smoother than both the box and Epanechnikov kernels. The density plots were then fit using the Gaussian kernel and each of the bandwidth options. Only the SJ and UCV options noticeably changed the plots from the default. Since the distributions were not normal, it might be best to use the UCV, BCV, or SJ bandwidth options for the distribution of distance per game.

Similarly, for the opponents' field goal percentage at the rim distribution, the Gaussian kernel fit the data best because it was the smoothest of the three options. When the bandwidth options were changed, the density did not change significantly between the default and the other options. The distribution for this variable was approximately normal, however, so it would be appropriate to use the default bandwidth option of NRD0.
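The two normal-reference bandwidth rules differ only in their leading constant. The Python sketch below follows the formulas used by R's bw.nrd0 and bw.nrd, but omits the fallback that bw.nrd0 applies when the spread estimate is zero; the function names and sample values are hypothetical.

```python
from statistics import quantiles, stdev


def robust_spread(data):
    """Smaller of the standard deviation and the IQR divided by 1.34."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return min(stdev(data), (q3 - q1) / 1.34)


def bw_nrd0(data):
    """Silverman's rule of thumb: 0.9 * spread * n^(-1/5)."""
    return 0.9 * robust_spread(data) * len(data) ** -0.2


def bw_nrd(data):
    """Normal reference rule with the larger constant 1.06."""
    return 1.06 * robust_spread(data) * len(data) ** -0.2


# Hypothetical opponent field goal percentages at the rim.
fgp = [0.47, 0.50, 0.52, 0.55, 0.58, 0.60, 0.61]
h0, h1 = bw_nrd0(fgp), bw_nrd(fgp)
```

Both rules are derived under a normality assumption, which is consistent with the reasoning above that the default is appropriate when the distribution is approximately normal.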

The next distribution examined was that for rebounds per game (REB.per.game). Of the different kernels tested, the Gaussian kernel did the best job smoothing the distribution. Differences between this kernel and the Epanechnikov kernel were minimal; however, it was still evident that the Gaussian kernel provided the smoothest distribution. Among the bandwidth options, there were again only small differences between the four options; however, the "nrd" bandwidth appears to do the best job smoothing the distribution of this variable.

Following the same procedure for testing different kernels and bandwidths, we looked at another variable, Touches.per.game. Again, there were no drastic differences in the extent of smoothing between the Epanechnikov and Gaussian kernels, but the Gaussian kernel smoothed the distribution slightly better. When testing the bandwidth options, we observed that the "bcv" and "nrd" bandwidths do the best job smoothing these distributions. Because "nrd" does a slightly better job, this bandwidth is selected.

Conclusion

In the first section of the project, it was found that there were statistically significant differences between guards and forwards/centers in the distributions of touches per game, rebounds per game, and opponents' field goal percentage at the rim. No evidence was found of a difference between the two groups in the distribution of distance traveled per game. The conditions for both the t-test and the Fligner-Policello test were met for opponents' field goal percentage at the rim. Both tests resulted in the same conclusion, so there is no preference for one over the other. For the other variables, the conditions were not met, so their conclusions should be interpreted with caution.

The second part of the project contains a model that was created in order to predict the minutes per game that a player should receive. The conditions for both the parametric and nonparametric methods of model fitting were satisfied. However, the parametric model had a higher adjusted R² value than either of the nonparametric models that were tested. Therefore, the parametric version is preferred.


In the final part of the project, density estimates were fit to several variables in order to determine the best fit. The best estimate for the distance traveled per game variable was obtained using a Gaussian kernel. The Gaussian kernel was also the smoothest for opponents' field goal percentage at the rim, and the best bandwidth option was NRD0. For both the rebounds per game and touches per game variables, the Gaussian kernel and the NRD bandwidth were found to provide the best estimates.

The analyses in this project were conducted using data from the 2013-2014 NBA regular season. Thus, the conclusions, models and results in this project can only be applied to NBA players. It would not be appropriate to use this information to make decisions about other levels of basketball, such as high school or college. It is also important to note that future rule changes could alter the nature of the NBA and invalidate the findings of this report.
