Basketball Random Walk

Random Walk Picture of Basketball Scoring
Alan Gabel¹ and S. Redner¹

¹Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215, USA
arXiv:1109.2825v1 [physics.data-an] 13 Sep 2011
We present evidence, based on play-by-play data from all 6087 games from the 2006/07–2009/10 seasons of the National Basketball Association (NBA), that basketball scoring is well described by a weakly-biased continuous-time random walk. The time between successive scoring events follows an exponential distribution, with little memory between different scoring intervals. Using this random-walk picture that is augmented by features idiosyncratic to basketball, we account for a wide variety of statistical properties of scoring, such as the distribution of the score difference between opponents and the fraction of game time that one team is in the lead. By further including the heterogeneity of team strengths, we build a computational model that accounts for essentially all statistical features of game scoring data and season win/loss records of each team.
Keywords: scoring statistics, hot hand, stochastics, random walk, Poisson process, antipersistence
INTRODUCTION
Sports provide a rich laboratory in which to study competitive behavior in a well-defined way. The goals of sports competitions are simple, the rules are well defined, and the results are easily quantifiable. With the recent availability of high-quality data for a broad range of performance metrics in many sports [1], it is now possible to address questions about measurable aspects of sports competitions that were inaccessible only a few years ago. Accompanying this wealth of new data is a rapidly growing body of literature, both for scientific and lay audiences (for some general references, see, e.g., [2–7]).

In this spirit, our investigation is motivated by the following simple question: can basketball scoring be described by a random walk? We will present evidence based on play-by-play data [8] from National Basketball Association (NBA) games to answer this question in the affirmative. We focus on basketball primarily because there are many points scored per game (roughly 100 scoring events in a 48-minute game) and also many games in a season. Thus the number of scoring events is sufficiently large that we can reach unambiguous and statistically significant conclusions.

Our random walk picture directly addresses the question of whether sports performance metrics are determined by memory-less stochastic processes or by processes with long-time correlations [9–13]. For example, to the untrained eye, hot streaks or slumps (namely, sustained periods of superior or inferior performance) seem so unusual that they ought to have exceptional explanations. However, this impression is at odds with the data. Impartial analysis of individual player data in basketball has discredited the notion of a "hot hand" [9, 14]. Rather, a player's shooting percentage is independent of past performance, so that apparent hot streaks or slumps are simply a consequence of a series of random uncorrelated scoring events. Similarly, in baseball, teams do not get "hot" or "cold" [15, 16]; instead, winning and losing streaks arise from random statistical fluctuations. In our study, we find that, much like individual players, basketball teams do not have hot streaks or slumps.

In this work, we extend our analysis to game kinetics by focusing on the statistical properties of scoring frequency. As we will discuss, the data show that the scoring rate in basketball is well described by a continuous-time Poisson process. Thus scoring bursts or scoring droughts arise from the statistics of the Poisson process rather than from a temporally correlated process.

The idealized picture of random scoring has to be augmented by two features, one that may well be ubiquitous and one that is idiosyncratic to basketball. The former is the existence of a weak linear restoring force, in which the leading team scores at a slightly lower rate (conversely, the losing team scores at a slightly higher rate). This restoring force seems a natural human response: a team with a large lead may be tempted to coast, while a lagging team would likely play with greater urgency. A similar "rich get poorer and poor get richer" phenomenon has been found to occur in economic competitions where each "battle" has low decisiveness [17, 18]. This low payoff typifies basketball, where the outcome of any single play is unlikely to determine the game. The second feature, idiosyncratic to basketball, is anti-persistence, in which a score by one team is likely to be followed by a score from the opponent because of the change in ball possession after each score. By incorporating these two enhancements into a continuous-time random-walk description of scoring, we build a computational model that accurately accounts for many statistical features of individual game scores and of team win/loss records.
SCORING RATE
Basketball is played between two teams with five players each. Points are scored by making baskets that are each worth 2 points (typically) or 3 points, for sufficiently long-range baskets. Additional single-point baskets can occur by foul shots that are awarded after a physical or a technical foul. The number of successive foul shots is typically 1 or 2, but more can sometimes occur as a result of flagrant or technical fouls. The duration of a game is 48 minutes (2880 seconds). Games are divided into four 12-minute quarters, with stoppage of play at the end of each quarter. The flow of the game is ostensibly continuous, but play does stop for fouls and time-outs. An important feature that sets the time scale of scoring is the 24-second clock. In the NBA, a team must score within 24 seconds of gaining possession of the ball, or else possession is forfeited to the opposing team. At the end of the game, the team with the most points wins.

We analyze play-by-play data from 6087 NBA games from the 2006–07 through the 2009–10 seasons, including playoff games [8]; for win/loss records we also analyzed a larger dataset that consists of 20 NBA seasons [1]. To simplify the data analysis and its interpretation, we study the scoring data only up to the end of regulation time. Thus every game is exactly 48 minutes long and some games end in ties. We do not consider overtime to avoid the complications of games of different durations and the possibility that scoring patterns during overtime could be different from those during regulation time.
FIG. 1. Scoring rate as a function of time. (Axes: scoring rate [plays/s] vs. time [s].)
FIG. 2. Scoring rate near the change of each quarter; zero corresponds to the start/end of a quarter. (Panels: 1st, 2nd, 3rd, and 4th quarter; axes: scoring rate [plays/s] vs. time after quarter ends [s].)
In our analysis, we focus on what we term scoring plays, rather than individual baskets. A scoring play includes any number of baskets that are made with no time elapsed between them on the game clock. For example, a 2-point play could be a single field goal or two consecutive successful foul shots; a 3-point play could be a normal field goal that is immediately followed by a successful foul shot, or a single successful shot from outside the 3-point line. The mean score per play is 2.09 points. The scoring rate is roughly constant over the course of a game, with mean value of 0.033 plays/sec (Fig. 1), corresponding to 95 scoring plays per game and a final score in which each team has approximately 100 points [19, 20].

As a matter of curiosity, there are significant deviations from the constant scoring rate for short durations at the end and beginning of each quarter (Fig. 2). During roughly the first 10 seconds of each quarter, scoring is less likely because of a natural minimum time to make a basket after the initiation of play. Near the end of each of the first three quarters, the scoring rate first decreases and then sharply increases right at the end of the quarter. This anomaly arises because, within the last 24 seconds of the quarter, teams often intentionally delay their final shot until the last moment so that the opponent has no chance for another shot before the quarter ends. However, there is only an increase in the scoring rate before the end of the game, possibly because of urgent efforts of a losing team in attempting to mount a last-minute comeback. While these deviations from a constant scoring rate are visually prominent, they occupy roughly 8% of the game time, and their overall impact on the final score is minor. Thus to a good approximation, scoring in basketball is temporally homogeneous.

In addition to temporal homogeneity, the data show that there is essentially no memory between scoring events. To illustrate this property, we study the distribution of time intervals, $P(t)$, between successive scoring plays. There are two such time intervals that are natural to define: (a) the time interval $t_e$ between successive scores of either team, and (b) the time interval $t_s$ between successive scores of the same team. The probability $P(t_e)$ has a peak at roughly 16 seconds, which evidently is determined by the deadline of the 24-second shot clock. At longer times, the probability distribution decays exponentially with time over essentially the entire range for which data exist (Fig. 3).
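One way to estimate the decay rate of an exponential tail such as the one described above is sketched here. This is only an illustration under stated assumptions: it uses synthetic exponential intervals rather than the NBA data, and it uses the maximum-likelihood estimate for a shifted exponential in place of the linear fit on semi-logarithmic axes quoted in the Fig. 3 caption. The function name `tail_decay_rate` and the 30-second cutoff argument are hypothetical choices made for the example.

```python
import random


def tail_decay_rate(intervals, cutoff=30.0):
    """Estimate the exponential decay rate from intervals longer than `cutoff`.

    For an exponential distribution, the excess (t - cutoff) of every interval
    beyond the cutoff is again exponential with the same rate, so the
    maximum-likelihood estimate is 1 / mean(excess).
    """
    excess = [t - cutoff for t in intervals if t > cutoff]
    if not excess:
        raise ValueError("no intervals beyond the cutoff")
    return len(excess) / sum(excess)


# Synthetic stand-in for the scoring intervals (assumption: pure Poisson
# process with the tail rate of 0.048 plays/sec quoted in the text).
rng = random.Random(0)
synthetic = [rng.expovariate(0.048) for _ in range(100_000)]
rate_hat = tail_decay_rate(synthetic)
print(f"estimated tail decay rate: {rate_hat:.4f} plays/sec")
```

Restricting the estimate to intervals above the cutoff mirrors the choice of fitting only $t_e > 30$ sec, where the short-time peak set by the shot clock no longer matters.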
The longest interval where neither team scored was 402 seconds, while the longest interval for a single team to not score was 685 seconds. Essentially the same behavior arises for $P(t_s)$ except that the time scale is larger by an obvious factor of 2. When all the same-team time intervals are divided by 2, the distributions $P(t_e)$ and $P(t_s)$ overlap substantially. Furthermore, the long-time tails of both $P(t_e)$ and $2P(t_s/2)$ agree well with that of a Poisson process with rate $\lambda = 0.048$ plays/sec. This rate is larger than the 0.033 plays/sec from the scoring data because the Poisson process includes scoring intervals of less than 10 seconds, which occur only rarely.

FIG. 3. Probability distributions of time intervals between successive scores: $P(t_e)$ vs. $t_e$ and $2P(t_s/2)$ vs. $t_s/2$. The line is the linear fit of $P(t_e)$ vs. $t_e$ over the range $t_e > 30$ sec and corresponds to a decay rate $\lambda = 0.048$. (Axes: probability vs. scoring interval [s].)

An important feature of the time intervals between scores is that they are uncorrelated. This feature is illustrated by forming the time-ordered list of scoring intervals $T_1, T_2, T_3, \ldots$, and computing the correlation function

$$C(n) = \frac{\langle T_k T_{k+n}\rangle - \langle T_k\rangle^2}{\langle T_k^2\rangle - \langle T_k\rangle^2}. \qquad (1)$$

We studied both the intervals $t_e$, independent of which team scored, and the intervals $t_s$ for a single team. This correlation function is essentially zero for $n \geq 1$ (see the Supplemental Material); there is little long-term memory between scoring events. Thus the notions of scoring bursts or scoring droughts are nothing more than manifestations of the fluctuations inherent in a Poisson process of random and temporally homogeneous scoring events.

RANDOM-WALK DESCRIPTION OF SCORING

We now turn to the question of which team scores in each play to build a random-walk description of the scoring dynamics. After a given team (labeled A) scores, possession of the ball reverts to the opposing team (B). This change of possession confers a significant advantage for team B to score immediately after a score by team A. On average, team B scores immediately after a score by team A 65.2% of the time. This pattern of preferentially alternating scores is characteristic of an anti-persistent random walk, in which a step in a given direction is more likely to be followed by a step in the opposite direction [21].

FIG. 4. Data for the probability $P(s)$ of a consecutive point streak of $s$ points versus $s$. The dashed line corresponds to $P(s) = A q^{s/2.1}$, with $q = 0.348$ and $A$ the normalization constant. The solid line corresponds to a refined model that incorporates the different probabilities of 1, 2, 3, and 4-point plays. (Axes: probability vs. streak length [points].)

This anti-persistence controls the streak-length distribution, defined as the probability that a given team scores $s$ consecutive points before the opposing team scores. Since there are 2.1 points scored, on average, in a single play, a scoring streak of $s$ points corresponds to $c = s/2.1$ consecutive scoring plays. In terms of an anti-persistent random walk, the probability for a scoring streak of $c$ consecutive plays is $q^c$, with $q = 0.348$. As a result, the probability for a streak of $s$ points is

$$P(s) \propto q^{s/2.1}. \qquad (2)$$
This simple form agrees fairly well with the observed probability distribution of scoring streaks (Fig. 4). However, we can do much better with a slightly more refined model that incorporates the different probabilities for 1, 2, 3, and 4 point plays (see the Supplemental Material). This refined model accounts for essentially all detailed features of the streak-length data (see Fig. 4). Thus long streaks arise simply from random statistical fluctuations; there is no need to invoke the notion that teams get "hot" or "cold" to explain point-scoring streaks.

Another intriguing empirical feature is that the score difference between the two teams affects the scoring probability (Fig. 5). The data indicate that there is a weak restoring force whose effect is to slightly reduce the score difference [22]. That is, the probability that the winning team scores decreases systematically with its lead size or, conversely, the probability that the losing team scores increases systematically with the size of its deficit. This effect is well fit by a linear dependence of the bias on the lead (or deficit) size. The magnitude of the effect is small; assuming a linear dependence, the data give a decrease in the scoring probability of 0.0022 per point of lead. It is natural to speculate that this restoring force could originate from the winning team coasting or the losing team working to mount a comeback.

FIG. 5. Data for the probability $S(L)$ that a team will score next given a lead $L$. The line is the best linear fit, $S(L) = \frac{1}{2} - 0.0022L$. (Axes: score probability vs. lead size [points].)

While the data indicate that the score in an NBA basketball game evolves as an anti-persistent random walk with an additional small restoring force, clearly some teams are better than others. This difference in intrinsic quality should lead to an overall bias in a random-walk description of basketball scoring. We quantify the role of such a bias by studying the evolution of the difference in score $\Delta(t)$ between the two teams as a function of time. In a random-walk picture, the variance $\sigma^2 \equiv \langle(\Delta - \langle\Delta\rangle)^2\rangle$ should grow with time as $2Dt$, with $D$ the diffusion coefficient associated with basketball scoring. As illustrated in Fig. 6, the variance indeed grows roughly linearly with time, except for the last 2.5 minutes of the game; we will discuss this latter anomaly in more detail below. From a best linear fit to all but the last 2.5 minutes of the game data, we obtain the estimate $D_{\rm fit} = 0.0363$ points²/sec.

FIG. 6. Score difference variance, $\sigma^2$ [points²], as a function of time [s]. The line represents the best linear fit, excluding the last 2.5 minutes of data. The variance reaches its maximum 2.5 minutes before the end of the game (dashed line).

This value of the diffusion coefficient accords well with the anti-persistent random walk picture of basketball scoring [21]. In terms of the basic parameters of an anti-persistent random walk, we can express the diffusion coefficient of basketball as (see the Supplemental Material)

$$D = \frac{q}{1-q}\,\frac{\delta^2}{2\tau} = 0.0383\ \frac{(\text{points})^2}{\text{sec}}, \qquad (3)$$
where $q = 0.348$ is the probability for the same team to score consecutively, $\delta = 2.1$ points is the mean size of a scoring event, and $\tau = 30.3$ seconds is the observed average time between successive scoring events. This value for $D$ is close to the value $D_{\rm fit} = 0.0363$ that was obtained from the empirical time dependence of the variance. We attribute this small discrepancy to our neglect of the linear restoring force in the estimate (3).

Perhaps surprisingly, the influence of intrinsic team quality on basketball scoring is not decisive. Basketball games are sufficiently short that diffusive fluctuations substantially obscure the differences in intrinsic team strength. For a biased random walk, the interplay between diffusion and bias is quantified by the Péclet number, $Pe \equiv v^2 t/2D$, where $v$ is the bias velocity, $t$ is the elapsed time, and $D$ is the diffusion coefficient [23, 24]. For $Pe \ll 1$, a random walk is dominated by diffusion and the effects of the drift are minuscule, whereas for $Pe \gg 1$ the bias is important. For basketball, we estimate a typical bias velocity from the observed average final score difference, $|\Delta| \approx 10.7$ points, divided by the game duration of $t = 2880$ seconds to give $v \approx 0.0037$ points/sec. Using the value of $D \approx 0.0363$ points²/sec, we obtain $Pe = 0.55$, a small, but not negligible, Péclet number. Consequently, the bias that stems from intrinsic differences in team strengths is not the dominating factor in determining the outcome of a typical NBA basketball game.

Finally, it is revealing to examine the anomaly associated with the last 2.5 minutes of a game. If the score evolves as an anti-persistent random walk, then the distribution of the score difference should be a Gaussian whose width grows with time as $\sqrt{Dt}$. This is what we observe except during the last 2.5 minutes. As shown in Fig. 7, the distribution of score difference has a Gaussian appearance, with a width that grows slightly more slowly than $\sqrt{Dt}$. This small deviation arises from the weak restoring force, which leads to a diffusion constant that decreases with time. However, in the final 2.5 minutes of the game, the score-difference distribution develops a sharp spike at zero due to tie games. There is also a deficit in probability for small final score differences.
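The diffusion coefficient of Eq. (3) and the Péclet number quoted in this section can be checked with a few lines of arithmetic. The sketch below is only a sanity check that plugs in the rounded values given in the text; the function names are illustrative and not taken from the original analysis, and because the inputs are rounded the results agree with the quoted numbers only to about two significant figures.

```python
def diffusion_coefficient(q, delta, tau):
    """Anti-persistent random-walk estimate D = [q / (1 - q)] * delta**2 / (2 * tau)."""
    return q / (1.0 - q) * delta**2 / (2.0 * tau)


def peclet_number(v, t, d):
    """Peclet number Pe = v**2 * t / (2 * D) for a biased random walk."""
    return v**2 * t / (2.0 * d)


# Rounded values quoted in the text.
q = 0.348            # probability that the same team scores consecutively
delta = 2.1          # mean points per scoring play
tau = 30.3           # mean time between scoring plays [s]
game_length = 2880.0 # regulation time [s]
d_fit = 0.0363       # diffusion coefficient from the variance fit [points^2/s]

d_model = diffusion_coefficient(q, delta, tau)
bias_velocity = 10.7 / game_length               # typical bias [points/s]
pe = peclet_number(bias_velocity, game_length, d_fit)
final_std = (2.0 * d_fit * game_length) ** 0.5   # sqrt(2 D t) at the final buzzer

print(f"D from Eq. (3)  : {d_model:.4f} points^2/s")
print(f"bias velocity   : {bias_velocity:.4f} points/s")
print(f"Peclet number   : {pe:.2f}")
print(f"sqrt(2 D t), end: {final_std:.1f} points")
```

The last quantity is the width that a purely diffusive score difference would have at the end of regulation, which is the scale used on the horizontal axis of Fig. 7.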
Close games therefore tend to end in ties much more often than expected from a random-walk picture of the score evolution. A natural mechanism for this anomaly is that the losing team plays urgently to force a tie, a hypothesis that is consistent with the increase in scoring rate that is observed at the end of NBA games (Fig. 1).

FIG. 7. Scaled probability distributions of score differences at the end of the first quarter ($t = 720$ s), after 45.5 minutes ($t = 2730$ s), and at the end of the game ($t = 2880$ s). (Axes: probability vs. $\Delta/\sqrt{2Dt}$.)

COMPUTATIONAL MODEL

We now build on these empirical facts about scoring to construct a random-walk model to account for a broad range of point-scoring phenomena in basketball and the win/loss record of all teams at the end of the season. In our model, games are viewed merely as a series of scoring plays. The time between plays is drawn from an exponential distribution (so that scoring is a Poisson process) whose mean is the observed value of 30.3 seconds. We ignore the short-lived spikes and dips in the scoring rate at the end of each quarter and treat scoring in a basketball game as a temporally homogeneous process. Plays can be worth 1, 2, 3, or 4 points, with the probabilities for each such outcome drawn from the observed distribution of play values (see table I in the Supplemental Material). Plays of 5 or 6 points, which involve multiple technical fouls, have negligible probabilities of occurrence (0.023% and 0.0012% of all plays, respectively) and are ignored in our computational model. Simulations continue until the final game time of 48 minutes is reached.

There are three factors that determine which team scores. First, intrinsic natural differences in ability mean that a better team has a greater chance of scoring. The second factor is the anti-persistence of successive scoring events. The last is the linear restoring force, in which the scoring probability of a team decreases as its lead increases (and vice versa for a team in deficit). Thus the probabilities $P_A$ and $P_B$ that team A or team B scores next, immediately after a scoring event, are:

$$P_B = I_B + 0.152\,r + 0.0022\,\Delta, \qquad P_A = I_A - 0.152\,r - 0.0022\,\Delta, \qquad (4)$$

where $I_A$ and $I_B$ are the intrinsic team scoring probabilities (which must satisfy $I_A + I_B = 1$; see below) and the term $0.152\,r$ accounts for the anti-persistence. Here $r$ is defined as

$$r = \begin{cases} +1 & \text{team A scored previously},\\ -1 & \text{team B scored previously},\\ \phantom{+}0 & \text{first play of the game}. \end{cases} \qquad (5)$$

Finally, the term $0.0022\,\Delta$, where $\Delta$ is the score difference (positive when team A leads), accounts for the restoring force with the empirically measured restoring coefficient (Fig. 5). When averaged over all teams, Eq. (4) gives 0.348 for the probability for team A to score again immediately after it has just scored.

We now determine the intrinsic team scoring probabilities computationally. We first assign a strength parameter $X_i$ for the $i$-th team, which is fixed for the season, with better teams having higher strengths. We assume that the strengths are drawn from a Gaussian distribution with average value $\langle X\rangle$ and variance $\sigma_X^2$. Essentially identical results occur for other distributions of team strengths. In a game between teams A and B, the team strengths are assumed to determine the intrinsic scoring probabilities according to the classic Bradley-Terry competition model [25]:

$$I_A = \frac{X_A}{X_A + X_B}, \qquad I_B = \frac{X_B}{X_A + X_B}. \qquad (6)$$
Since these intrinsic probabilities depend only on the ratio of the strengths $X_A$, $X_B$, we may choose $\langle X\rangle = 1$ without loss of generality. Thus the only parameter is the variance $\sigma_X^2$. We now calibrate the model by simulating many NBA seasons for a league of 30 teams for a range of values of $\sigma_X^2$ and comparing the simulated probability distributions for basic game observables with corresponding empirical data. Specifically, we examined: (i) the probability for a given final score difference, (ii) the season team winning percentage as a function of its normalized rank (Fig. 8), (iii) the probability for a team to lead for a given fraction of the total game time (Fig. 9), and (iv) the distribution of the number of lead changes during a game (see the Supplemental Material). Here normalized rank is defined so that the team with the best winning percentage has rank 1, while the team with the worst record has rank 0.

FIG. 8. Winning percentage as a function of normalized team rank. The data correspond to the 1991–2010 seasons [1]. The dashed curve is the simulated win/loss record if all teams have equal strength, $\sigma_X^2 = 0$, while the solid curve is the simulated win/loss record with team strength variance $\sigma_X^2 = 0.0083$.

FIG. 9. The probability that a randomly-selected team leads for a fraction $F$ of the total game time. The curve is the result of simulating $10^4$ complete seasons with the optimal variance of team strengths, $\sigma_X^2 = 0.0083$. (Axes: probability vs. time leading [s].)

The probability for a given lead time is motivated by the well-known, but mysterious arcsine law [26]. According to this law, trajectories of a one-dimensional random walk that spend equal total amounts of time to the left and to the right of the origin are unlikely to occur. It is much more likely to have trajectories where the walk is always on one side of the origin. As a corollary to the arcsine law, there are typically $\sqrt{N}$ crossings of the origin for a one-dimensional random walk of $N$ steps, and the distribution of the number of lead changes is Gaussian. These origin crossings correspond to lead changes in basketball games.

For all four quantities, the best match between the empirical NBA game data and the data generated by our random-walk model occurs for $\sigma_X^2 \approx 0.0083$ (see the Supplemental Material).
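The ingredients of the computational model (exponential waiting times, the play-value distribution of table I, the Bradley-Terry probabilities of Eq. (6), and the scoring probabilities of Eqs. (4) and (5)) can be combined into a short simulation of a single game. The following is a minimal sketch under stated assumptions, not the authors' code: the function and constant names are hypothetical, the 1 to 4 point play frequencies are used as relative weights after dropping the 5- and 6-point plays, and the scoring probability is clamped to the interval [0, 1], a safeguard that the text does not discuss.

```python
import random

# Relative frequencies of 1-4 point plays (table I); rarer plays are ignored.
PLAY_POINTS = (1, 2, 3, 4)
PLAY_WEIGHTS = (0.0870, 0.7386, 0.1728, 0.0014)

MEAN_INTERVAL = 30.3      # mean time between scoring plays [s]
GAME_LENGTH = 2880.0      # regulation time [s]
ANTI_PERSISTENCE = 0.152  # coefficient of r in Eq. (4)
RESTORING = 0.0022        # restoring coefficient per point of lead


def simulate_game(x_a, x_b, rng):
    """Simulate one regulation game; return (score_a, score_b)."""
    i_a = x_a / (x_a + x_b)   # intrinsic probability for team A, Eq. (6)
    score_a = score_b = 0
    r = 0                     # Eq. (5): nobody has scored yet
    t = rng.expovariate(1.0 / MEAN_INTERVAL)
    while t < GAME_LENGTH:
        lead_a = score_a - score_b
        # Eq. (4) for team A; P_B is the complement because I_A + I_B = 1.
        p_a = i_a - ANTI_PERSISTENCE * r - RESTORING * lead_a
        p_a = min(1.0, max(0.0, p_a))   # clamp: an added safeguard
        points = rng.choices(PLAY_POINTS, weights=PLAY_WEIGHTS)[0]
        if rng.random() < p_a:
            score_a += points
            r = 1
        else:
            score_b += points
            r = -1
        t += rng.expovariate(1.0 / MEAN_INTERVAL)
    return score_a, score_b


rng = random.Random(1)
games = [simulate_game(1.0, 1.0, rng) for _ in range(2000)]
mean_total = sum(a + b for a, b in games) / len(games)
ties = sum(1 for a, b in games if a == b) / len(games)
print(f"mean total score: {mean_total:.1f} points, tie fraction: {ties:.3f}")
```

To mimic a season, team strengths could be drawn once per team, for example with `rng.gauss(1.0, 0.0083 ** 0.5)`, and passed to `simulate_game` for each pairing; the scheduling and ranking steps are omitted here.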
OUTLOOK
Our main conclusion is that scoring in basketball is well described by a continuous-time random walk with an exponential distribution of times between successive scoring events. This idealized model accounts for a wide range of statistical features about scoring patterns in NBA basketball games. From this exponential distribution, we can calculate all details of the scoring streak-length distribution. The excellent agreement with the data of Fig. 4 decisively shows that there is no streakiness in scoring patterns. When we additionally account for the intrinsic heterogeneity in team strengths, the resulting model accurately describes the win/loss records of each team as well.

A variety of open issues are worth exploring further. First, is it possible that the exponential distribution of time intervals between scoring events is a ubiquitous feature of sports competitions? We speculate that perhaps other sports with free-flowing games, such as lacrosse [12], soccer [13], or hockey [27, 28], will have the same scoring pattern as basketball when the time intervals between scores are rescaled by the average scoring rate for each sport. It also seems plausible that other tactical metrics, such as the time intervals between successive crossings of midfield by the game ball (or puck), may also be described by Poisson statistics. If borne out, it could be that there is a universal rule that governs the scoring time distribution in sports.

Another intriguing facet of NBA basketball games is the small but clearly discernible restoring force toward a small score differential between competitors. This feature seems to be a natural consequence of human nature. It is not unreasonable for a winning team to relax its effort slightly as its lead grows. Conversely, a team would likely play with more urgency as it falls behind. A test of this hypothesis should be easy to accomplish for any sport by the analysis of game-scoring data.

Another remarkable feature of basketball scoring data is that it can be well described by a nearly unbiased random walk, even though there is a measurable difference in the intrinsic strength between two typical teams. Because of the elite nature of the competition, the disparities between strong and weak teams are small, so that the Péclet number that quantifies the relative effect of team strength and stochasticity is not large. In practical terms, this means that if one views a typical game it will be difficult to determine which is the superior team, and essentially impossible if one only views a short segment of the game.

Seen through the lens of coaches, sports commentators, and fans, basketball is a complex sport that requires considerable analysis to understand and respond to its many nuances. As a result, a considerable industry has built up to quantify every aspect of basketball and thereby attempt to improve a team's competitive standing. However, this competitive and evolutionary rat race largely eliminates gross systematic advantages between teams, so that all that remains from a competitive standpoint are small surges and ebbs in performance that arise from the underlying stochasticity of the game. Thus seen through the lens of the theoretical physicist, basketball is merely a random walk (albeit in continuous time and with some additional subtleties) so that all of the observable consequences of the game that are of interest to the quantitative scientist follow from this random-walk description.

We thank Ravi Heugel for initial collaborations on this project. We also thank Aaron Clauset for helpful comments on an earlier version of the manuscript. This work was supported in part by NSF grant DMR0906504.
[1] See, e.g., the extensive compilation in www.shrpsports.com.
[2] F. Mosteller, Amer. Statistician 51, 305 (1997).
[3] See, e.g., J. Albert, J. Bennett, and J. J. Cochran (eds.), Anthology of Statistics in Sports, ASA-SIAM Series on Statistics and Applied Probability 16 (2005).
[4] J. Albert and R. H. Koning, Statistical Thinking in Sports (Taylor and Francis, Boca Raton, 2008).
[5] M. Glickman and S. Evans, J. Quantitative Analysis in Sports 6, Issue 2, Article 5 (2009).
[6] J. Arkes and J. Martinez, J. Quantitative Analysis in Sports 7, Issue 3, Article 13 (2011).
[7] J. Kubatko, D. Oliver, K. Pelton, and D. T. Rosenbaum, J. Quantitative Analysis in Sports 3, Issue 3, Article 1 (2007).
[8] The play-by-play data were obtained from www.basketballvalue.com.
[9] T. Gilovich, R. Vallone, and A. Tversky, Appl. Cognitive Psych. 17, 295 (1985).
[10] S. J. Gould, Full House: The Spread of Excellence from Plato to Darwin (Three Rivers Press, New York, 1996).
[11] S. Miller and R. Weinberg, Sport Psychologist 5, 211 (1991).
[12] P. Everson and P. S. Goldsmith-Pinkham, J. Quantitative Analysis in Sports 4, Issue 2, Article 13 (2008).
[13] D. Dyte and S. R. Clarke, J. Oper. Res. Soc. 51, 993 (2000).
[14] P. Ayton and I. Fischer, Memory & Cognition 32, 1369 (2004).
[15] C. Sire and S. Redner, Eur. Phys. Jour. B 67, 473 (2009).
[16] R. Vergin, J. Sports Behavior 23, Issue 2, 181 (2000).
[17] Y. Durham, J. Hirshleifer, and V. L. Smith, Am. Econ. Rev. 88, 891 (1998).
[18] M. Garfinkel and S. Skaperdas, Handbook of Defense Economics 2, 649 (2007).
[19] A similar result for time intervals between scoring events was discussed in Y. de Saá Guerra, J. M. Martín González, S. Sarmiento Montesdeoca, D. R. Ruiz, N. Arjonilla-López, and J. M. García-Manso, arXiv.org:1108.0779.
[20] P. H. Westfall, Am. Statistician 44, 305 (1990).
[21] R. García-Pelayo, Physica A 384, 143 (2007).
[22] G. E. Uhlenbeck and L. S. Ornstein, Phys. Rev. 36, 823 (1930).
[23] See, e.g., R. F. Probstein, Physicochemical Hydrodynamics, second edition (J. S. Wiley & Sons, New York, 1994).
[24] S. Redner, A Guide to First-Passage Processes (Cambridge University Press, New York, 2001).
[25] R. A. Bradley and M. E. Terry, Biometrika 39, 324 (1952).
[26] W. Feller, An Introduction to Probability Theory and its Applications, Vol. I (Wiley, New York, 1968).
[27] A. C. Thomas, J. Quantitative Analysis in Sports 3, Issue 3, Article 5 (2007).
[28] S. E. Buttrey, A. R. Washburn, and W. L. Price, J. Quantitative Analysis in Sports 7, Issue 3, Article 24 (2011).
SUPPLEMENTAL MATERIAL
Scoring Intervals
The exponential distribution of time intervals between scores (Fig. 3) suggests that scoring is governed by a Poisson process. Under the assumption that scores occur at the empirically-observed rate of $\lambda = 0.033$ plays/sec, the probability that a game has $k$ scoring plays is given by the Poisson distribution, $P(k) = \frac{1}{k!}(\lambda T)^k e^{-\lambda T}$, where $T = 2880$ sec is the game duration. According to the data, the average score of each play is 2.1 points, so that a game that contains $k$ plays will have a total score of $S = 2.1k$. By eliminating $k$ in favor of $S$ in the above Poisson distribution, the probability that a game has a total score $S$ is

\[ P(S) = \frac{1}{2.1}\,\frac{(\lambda T)^{S/2.1}}{(S/2.1)!}\,e^{-\lambda T}. \tag{7} \]

This distribution accurately accounts for the game data shown in Fig. 10.

FIG. 10. Probability $P(S)$ for a total score $S$ (in points) in a single game. Circles are the data, and the solid curve is the Poisson distribution (7).

Additionally, the time intervals between successive scoring events are nearly uncorrelated. To check for such correlations, we use the correlation function between intervals introduced in Eq. (1):

\[ C(n) = \frac{\langle T_k T_{k+n}\rangle - \langle T_k\rangle^2}{\langle T_k^2\rangle - \langle T_k\rangle^2}, \]

where $T_k$ is the $k$th scoring interval in a game. Thus $n = 1$ corresponds to correlations between adjacent intervals, $n = 2$ to next-nearest neighbor intervals, etc. The factor in the denominator normalizes the correlation function so that $C(0) = 1$. If there is little correlation between nearby intervals, then $\langle T_k T_{k+n}\rangle \approx \langle T_k\rangle\langle T_{k+n}\rangle = \langle T_k\rangle^2$ and $C(n)$ will be small. Figure 11 shows that as soon as $n \geq 1$, $C(n) < 0.03$. Thus nearby intervals are nearly uncorrelated. This property holds both for the time intervals between either team scoring and for the intervals between consecutive scores of the same team.

FIG. 11. The correlation function, $C(n)$, versus the interval separation, $n$, for time intervals between either team scoring or the same team scoring.

Thus a team that is perceived to speed up its scoring pace (or to deliberately slow its pace) is ultimately a reflection of statistical fluctuations, rather than a fundamental shift in game play.

Scoring Probabilities

The probabilities for 1, 2, and 3 point baskets are given on the left side of Table I, while the probability distribution of points per play is given on the right. Again, a play is defined as a set of scoring events with no time elapsed between them (such as a 2-point basket followed by a free throw). High-value plays of 5 and 6 points are rare and arise from multiple technical or flagrant fouls.

Points per Basket        Points per Play
1 pt.    33.9%           1 pt.    8.70%
2 pts.   54.6%           2 pts.   73.86%
3 pts.   11.5%           3 pts.   17.28%
                         4 pts.   0.14%
                         5 pts.   0.023%
                         6 pts.   0.0012%

TABLE I. The point value of each basket (left) and each play (right) and their respective frequencies.
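As a numerical check on the quantities above, the following minimal Python sketch recomputes the average points per play from the right-hand column of Table I and evaluates the Poisson forms for $P(k)$ and $P(S)$. The function and constant names are illustrative choices, not part of the original analysis, and the factorial $(S/2.1)!$ is evaluated through the gamma function because $S/2.1$ is generally not an integer (an assumption about how Eq. (7) is meant to be evaluated).

```python
import math

# Points-per-play frequencies from Table I, converted from percent to fractions.
PLAY_FREQ = {1: 0.0870, 2: 0.7386, 3: 0.1728, 4: 0.0014, 5: 0.00023, 6: 0.000012}

RATE = 0.033      # scoring rate in plays per second (value quoted in the text)
T_GAME = 2880.0   # game duration in seconds (48 minutes)


def mean_points_per_play(freq=PLAY_FREQ):
    """Average point value of a play, normalized by the summed frequencies."""
    total = sum(freq.values())
    return sum(points * f for points, f in freq.items()) / total


def poisson_plays(k, rate=RATE, duration=T_GAME):
    """P(k): probability of exactly k scoring plays in one game."""
    mu = rate * duration
    # Work with logarithms to avoid overflow in mu**k and k!.
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def total_score_density(score, points_per_play=2.1, rate=RATE, duration=T_GAME):
    """Eq. (7): P(S), with (S/2.1)! evaluated as gamma(S/2.1 + 1)."""
    mu = rate * duration
    k = score / points_per_play
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1)) / points_per_play
```

With these inputs the expected number of plays is $\lambda T = 0.033 \times 2880 \approx 95$, so the mean total score is about $2.1 \times 95 \approx 200$ points. Calling `total_score_density(s)` over a range of scores traces out the curve that Eq. (7) describes.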
Anti-Persistent Random Walk

We model the evolution of the score between two teams as an anti-persistent random walk. In this picture, two successive scores by the same team correspond to two random-walk steps in the same direction. Averaged over all games, the probability of such an outcome is $q = 0.348$, while the probability that the walk changes direction is $1-q$. Let $P(\Delta, t)$ be the probability that the score difference is $\Delta$ at time $t$. From [21], this probability obeys the recursion

\[ P(\Delta, t+\tau) = qP(\Delta-\ell, t) + qP(\Delta+\ell, t) + [(1-q)^2 - q^2]\,P(\Delta, t-\tau). \tag{8a} \]

To understand this equation, we rewrite it as

\[ P(\Delta, t+\tau) = q\,[P(\Delta-\ell, t) + P(\Delta+\ell, t) - P(\Delta, t-\tau)] + (1-q)\,P(\Delta, t-\tau), \tag{8b} \]

where $\ell$ is the point value of a single score. The second term corresponds to alternating scores so that the difference is $\Delta$ at time $t-\tau$, $\Delta\pm\ell$ at $t$, and finally $\Delta$ again at $t+\tau$. This event occurs with probability $1-q$. The terms in the square bracket correspond to two successive scores by one team. Thus the score difference reaches $\Delta$ at time $t+\tau$ from $\Delta\pm 2\ell$ at $t-\tau$. Therefore, the walk must be at $\Delta\pm\ell$ at time $t$ but not at $\Delta$ at time $t-\tau$. Expanding $P$ to first order in $t$ and second order in $\Delta$ yields

\[ \frac{\partial P}{\partial t} = \frac{q}{1-q}\,\frac{\ell^2}{2\tau}\,\frac{\partial^2 P}{\partial \Delta^2} \equiv D\,\frac{\partial^2 P}{\partial \Delta^2}, \tag{9} \]

where $D$ is the effective diffusion coefficient associated with the score evolution. For $q = \frac{1}{2}$ the score evolution reduces to a simple symmetric random walk, for which the diffusion coefficient is $D_{\rm rw} = \ell^2/(2\tau)$.

Derivation of the Point Streak Probability

To accurately calculate the probability of a point streak of a given length, we must separately incorporate the probabilities of 1, 2, 3, and 4 point plays (we ignore the possibility of higher-point plays because their effect is negligible). Let $w_i$ be the probability that a play is worth $i$ points (Table I). Suppose that a team wins $n$ consecutive plays, with point values $\{v_1, v_2, v_3, \ldots\}$, that result in a streak of $s$ points. Here $v_k$ is the point value of the $k$th play, and the play sequence must obey the constraint $\sum_k v_k = s$. The probability for a streak to last exactly $n$ plays is $q^{n-1}(1-q)$. The probability that these $n$ plays have the scoring sequence $\{v_1, v_2, v_3, \ldots\}$ is given by $\prod_k w_{v_k}$. Because a streak of length $s$ points can involve a variable number of plays, the total probability for a streak of length $s$ is

\[ P(s) = \sum_{n=1}^{s} q^{n-1}(1-q) \sum_{\{v_k\}} \prod_{k} w_{v_k}, \tag{10} \]

where the inner sum is over all allowed sequences $\{v_k\}$ of $n$ consecutive point-scoring events. For example, the probabilities for streaks up to $s = 4$ are:

\[
\begin{aligned}
P(1) &= (1-q)\,w_1 \\
P(2) &= (1-q)\,[w_2 + q w_1^2] \\
P(3) &= (1-q)\,[w_3 + 2q w_2 w_1 + q^2 w_1^3] \\
P(4) &= (1-q)\,[w_4 + 2q w_3 w_1 + q w_2^2 + 3q^2 w_2 w_1^2 + q^3 w_1^4].
\end{aligned}
\tag{11}
\]

Calculating these probabilities directly becomes tedious for large $s$. However, we can calculate these probabilities for $s > 4$ recursively. To do so, we decompose a streak of length $s$ as a streak of $s - v_n$ points, followed by a play that is worth $v_n$ points. The probability of such a play is $q w_{v_n}$. Because the last play can be worth 1, 2, 3, or 4 points, the probability for a streak of length $s$ is given recursively by

\[ P(s) = q\,[w_1 P(s-1) + w_2 P(s-2) + w_3 P(s-3) + w_4 P(s-4)]. \tag{12} \]

From Eqs. (11) and (12), we can calculate $P(s)$ for any $s$ numerically. The resulting values match the empirical data very closely, as shown in Fig. 4.

Fitting the Model to Scoring Data
To fit our model to the NBA game data, we compared the distribution of final score differences predicted by the model with the data. However, since the nature of the scoring has a significantly different character during the last 2.5 minutes of the game (Fig. 7), we compare the model and the data at 45.5 minutes for the optimal value of team-strength variance, $\sigma_X^2 = 0.0083$ (Fig. 12). We also studied the distribution of the number of lead changes during each game; here, a tie is not counted as a lead change. Figure 13 shows the game data for the distribution of lead changes per game, as well as the simulated results for $\sigma_X^2 = 0.0083$. Intriguingly, and consistent with the arcsine law for the fraction of time that a team is in the lead (Fig. 9), the most probable outcome is that no lead changes occur.

FIG. 12. Probability distribution of score differences (in points) at 45.5 minutes: data (symbols) and simulation of $10^4$ seasons with $\sigma_X^2 = 0.0083$ (curve).

FIG. 13. Probability distribution for the number of lead changes per game: data (symbols) and simulation of $10^4$ seasons with $\sigma_X^2 = 0.0083$ (curve).

We consider four statistical measures of a basketball game: (i) the distribution of the difference in scores, (ii) the fraction of game time during which one team is in the lead, (iii) the rank versus winning percentage, and (iv) the number of lead changes. For each of these, we compare the game data with the corresponding simulation result for a given value of the team-strength variance. For each quantity, we quantify the goodness-of-fit between the simulation and the game data by the value $\chi^2$, where $\chi^2$ is defined by

\[ \chi^2 = \sum_x \left(F_E(x) - F_S(x)\right)^2. \tag{13} \]

Here $F_E(x)$ is the empirically-observed function, $F_S(x)$ is the corresponding simulated function, and $x$ is the underlying variable. A small value of $\chi^2$ indicates a good fit between simulation and experiment. Figure 14 shows the values of $\chi^2$ as a function of the variance in team strength, $\sigma_X^2$, for each of the four basic game observables discussed above. The best fits between the data and the simulations all occur in the range $\sigma_X^2 \in [0.00665, 0.00895]$.

FIG. 14. Goodness-of-fit measures, $\chi^2/\min(\chi^2)$, as a function of $\sigma_X^2$ for: the score difference distribution at 45.5 minutes, the number of lead changes per game, the distribution of time that a team is leading, and the winning percentage as a function of rank. Each point is based on simulation of $10^3$ seasons.

To extract a single optimum value for $\sigma_X^2$, we combine the four $\chi^2$ measurements into a single function. Two natural choices are the additive and multiplicative forms:

\[ f_{\rm add} = \sum_{i=1}^{4} \frac{\chi_i^2}{\min(\chi_i^2)}, \qquad f_{\rm mult} = \prod_{i=1}^{4} \frac{\chi_i^2}{\min(\chi_i^2)}, \tag{14} \]

where the sum and product are performed over the four basic game observables, $\chi_i^2$ is associated with the $i$th observable, and $\min(\chi_i^2)$ is its minimum over all $\sigma_X^2$ values. Both $f_{\rm add}$ and $f_{\rm mult}$ have minima at $\sigma_X^2 = 0.0083$. At this value of $\sigma_X^2$, the $\chi_i^2$ value of each observable exceeds its minimum value by no more than a factor of 1.095.
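The goodness-of-fit measure of Eq. (13) and the combined additive and multiplicative forms of Eq. (14) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, each observable is assumed to be sampled on the same grid of candidate variances, and the numbers in the toy example are made up and are not the paper's data.

```python
def chi_squared(empirical, simulated):
    """Eq. (13): sum of squared differences between two curves on a common x grid."""
    if len(empirical) != len(simulated):
        raise ValueError("curves must be sampled at the same x values")
    return sum((fe - fs) ** 2 for fe, fs in zip(empirical, simulated))


def combined_fit(chi2_by_observable):
    """Eq. (14): additive and multiplicative combinations of normalized chi^2.

    chi2_by_observable maps an observable name to a list of chi^2 values,
    one per candidate team-strength variance (same ordering for all observables).
    """
    columns = list(chi2_by_observable.values())
    n_candidates = len(columns[0])
    f_add = [0.0] * n_candidates
    f_mult = [1.0] * n_candidates
    for values in columns:
        if len(values) != n_candidates:
            raise ValueError("every observable needs one chi^2 per candidate")
        smallest = min(values)
        if smallest <= 0:
            raise ValueError("normalization requires a positive minimum chi^2")
        for j, value in enumerate(values):
            ratio = value / smallest  # chi_i^2 / min(chi_i^2)
            f_add[j] += ratio
            f_mult[j] *= ratio
    return f_add, f_mult


# Made-up numbers for illustration only; they are not the paper's data.
toy_variances = [0.006, 0.008, 0.010]
toy_chi2 = {
    "score_difference": [4.0, 2.0, 3.0],
    "lead_changes": [1.5, 1.0, 2.0],
}
f_add, f_mult = combined_fit(toy_chi2)
best_variance = toy_variances[f_add.index(min(f_add))]
```

Dividing each $\chi_i^2$ by its own minimum puts the observables on a common scale before they are combined, so that no single observable dominates merely because its raw $\chi^2$ values are numerically larger.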
