In-Play Prediction Model
7.2 Data Collection
In-play game data are collected from play-by-play data, which record the important events in a basketball match together with the times at which they occur. The important events include points (2 and 3 points), free throws, rebounds, assists, and turnovers. Play-by-play data are available at NBAreference. Figure 7.1 is an illustration of play-by-play data. Each line of the timeline gives the event, its time, and the players involved. All such information was imported into Microsoft Excel spreadsheets. Then, through the use of a few functions, the data were transformed into meaningful numbers. This yields a trace of all basketball data over time, which is a key requirement for in-play prediction.
The play-by-play data for one match involve well over 500 events, and every line requires several types of Excel functions. The resulting data set is therefore much larger than one might expect. The data for in-play prediction consist mainly of score information, and the data necessary for prediction are derived via spreadsheets. Sample data for one season were used to demonstrate in-play prediction.
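The transformation from play-by-play lines to a score trace can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet procedure used here: the event layout `(elapsed_minutes, team, points)`, the function name `score_trace`, and the use of elapsed game minutes (rather than the clock remaining in a quarter) are all assumptions made for the example.

```python
from bisect import bisect_right


def score_trace(events, interval_min=3, game_min=48):
    """Cumulative (home, away) score at each interval boundary.

    events: list of (elapsed_minutes, team, points) tuples, where team is
    "home" or "away". This layout is hypothetical, not the source format.
    """
    events = sorted(events)
    times = [t for t, _, _ in events]

    # Running totals after each scoring event, starting from 0-0.
    totals = [(0, 0)]
    for _, team, pts in events:
        home, away = totals[-1]
        if team == "home":
            totals.append((home + pts, away))
        else:
            totals.append((home, away + pts))

    # Score at 3, 6, ..., 48 minutes: count events at or before each mark.
    trace = []
    for k in range(1, game_min // interval_min + 1):
        idx = bisect_right(times, k * interval_min)
        trace.append(totals[idx])
    return trace
```

With the default arguments, the function returns one score pair per three-minute mark, and the difference between consecutive entries is the unit-interval score used later in the simulation.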
Figure 7.1: Example of play-by-play data.
7.3 Methods
With play-by-play data, we can easily obtain all basic basketball statistics and scores for every play event. Notably, a few basic basketball statistics are useful in advancing our predictions. We use the true shooting percentage (TS%) to fit the score probability distribution. Zak et al. (1979) analysed the logarithm of the score ratio (home team versus away team) and found that the coefficients for field goal shooting percentage (FG%) and free throw shooting percentage (FT%) were dominant in the regression. These were also the most significant factors in the regression analysis shown in Chapter 4, and they remained significant in the analysis of data from other seasons.
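For reference, TS% combines field goal and free throw efficiency in a single number. The sketch below assumes the commonly used definition, TS% = PTS / (2 × (FGA + 0.44 × FTA)); the text does not state which variant was applied, and the function name is illustrative.

```python
def true_shooting_pct(points, fga, fta):
    """True shooting percentage under the common 0.44 free-throw weighting.

    points: points scored; fga: field goal attempts; fta: free throw attempts.
    Returns None when no shots have been attempted yet, which can happen
    early in a unit interval.
    """
    attempts = 2 * (fga + 0.44 * fta)
    if attempts == 0:
        return None
    return points / attempts
```

Because the statistic is computed from running totals, it can be evaluated at every unit time from the play-by-play trace.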
On this premise, we use TS% as the main estimator in our Monte Carlo simulation. Unlike the Markov model in Strumbelj and Vracar (2012), our approach considers a team's quality and the game time to predict exact score information. The pre-match point spread data from bookmakers were adopted as a pseudo team quality factor. In brief, the model divides the score distribution on the basis of game conditions and yields score distribution probabilities. A forward simulation is then run on these distributions, with team quality accounted for through the pre-match betting line.
The simulation uses the probability distribution of all past scores at fixed time intervals. We use score information and statistics from the 2009–2010 NBA regular season (1,230 matches).
Figure 7.2 is an overview of the simulation process. The method is to predict the score in set time divisions. The first step, the estimation of unit scores, depends on the state ($m$), the state probabilities in the table, and the pre-game betting line ($l$). The scores generated from $k_0$ to the end of the fourth quarter are accumulated $n$ times, which gives the score distributions of the home and away teams. The final score is the sum of the fixed scores and the mean values of the simulated score distributions of both teams from $k_0$ to the end of the match.
Let us denote the scores predicted at a specific time $k$ for home team $i$ and away team $j$ by $PS_{i,k+1}$ and $PS_{j,k+1}$. The fixed scores $FS_{i,k}$ and $FS_{j,k}$ are the actual scores of the home and away teams at time $k$.
Figure 7.2: Score simulation process. [Flowchart: fixed score (home, away) → state decision by condition (betting line, time division): LL, LH, HL, HH → randomized score from the state probability function (1,000 simulations) → unit score probability creation (averaged value from the probability distribution) → predicted score = fixed score + unit score.]
Let $\bar{X}_{i,k}$ and $\bar{X}_{j,k}$ at $k$ be the predicted mean values of the $n$ time-based unit interval score distributions $X_{i,k,k+1,m,l}$ and $X_{j,k,k+1,m,l}$. The betting line groups are adjusted such that their sample sizes are as equal as possible. These score probability functions are generated from the score data collected between unit time intervals in the 1,230 previous matches. The $k$ value indicates the specific time from 1 (3 min) to 15 (48 min). At each time point a score is predicted, from 3 minutes to the final 48-minute result. The following equations yield the predicted scores of the home and away teams at time $k+1$:

$$\bar{X}_{i,k} = \frac{1}{n}\sum_{s=1}^{n} X^{(s)}_{i,k,k+1,m,l} \qquad (7.1)$$

$$\bar{X}_{j,k} = \frac{1}{n}\sum_{s=1}^{n} X^{(s)}_{j,k,k+1,m,l} \qquad (7.2)$$

$$PS_{i,k+1} = FS_{i,k} + \bar{X}_{i,k} \qquad (7.3)$$

$$PS_{j,k+1} = FS_{j,k} + \bar{X}_{j,k} \qquad (7.4)$$

where $X^{(s)}$ is the $s$-th of the $n$ simulated draws, $0 \le i \le 30$, $0 \le j \le 30$, $1 \le k \le 15$, $i \ne j$, $m \in \{LL, LH, HL, HH\}$, and $l$ is the group in the betting line range (one of five groups).
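The grouping step can be illustrated as follows. This is a minimal sketch under stated assumptions, not the procedure used to build the tables in this chapter: the record layout `(k, state, betting_line, unit_score)`, the function names, and the default of five equal-frequency groups are all hypothetical choices for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def equal_frequency_edges(lines, n_groups=5):
    """Interior cut points that split `lines` into groups of near-equal size."""
    ordered = sorted(lines)
    return [ordered[len(ordered) * g // n_groups] for g in range(1, n_groups)]


def line_group(line, edges):
    """Index (0 .. len(edges)) of the betting line group containing `line`."""
    return bisect_right(edges, line)


def build_unit_score_table(records, edges):
    """Group observed unit-interval scores by (k, state, line group).

    records: iterable of (k, state, betting_line, unit_score) tuples,
    a hypothetical layout for the historical match data.
    """
    table = defaultdict(list)
    for k, state, betting_line, unit_score in records:
        table[(k, state, line_group(betting_line, edges))].append(unit_score)
    return dict(table)
```

Each list in the resulting table is an empirical score distribution for one combination of time division, shooting state, and betting line group, and can be sampled directly in a Monte Carlo run.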
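The final score is the sum of the fixed score at time $k_0$ and the predicted unit scores from $k_0$ to the final time:

$$PS_{i,\mathrm{final}} = FS_{i,k_0} + \sum_{k=k_0}^{15} \frac{1}{n}\sum_{s=1}^{n} X^{(s)}_{i,k,k+1,m,l} \qquad (7.5)$$

$$PS_{j,\mathrm{final}} = FS_{j,k_0} + \sum_{k=k_0}^{15} \frac{1}{n}\sum_{s=1}^{n} X^{(s)}_{j,k,k+1,m,l} \qquad (7.6)$$

where the inner sum averages the $n$ simulated draws for each unit interval.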
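The forward simulation for one team can be sketched as below. The sketch makes two simplifying assumptions that the text does not confirm: the state and betting line group observed at the starting time are held fixed for all remaining intervals, and each draw is a resample from the historical unit scores of the matching group. The function name, the `dists` lookup keyed by `(k, state, line_group)`, and the parameter names are illustrative.

```python
import random
from statistics import mean


def predict_final_score(fixed_score, k0, state, line_group, dists,
                        n=1000, k_last=15, rng=None):
    """Fixed score at k0 plus the mean simulated unit score of each interval.

    dists maps (k, state, line_group) to a list of historical unit-interval
    scores. The state and line group are held fixed from k0 onward, which is
    an assumption of this sketch.
    """
    rng = rng or random.Random()
    total = fixed_score
    for k in range(k0, k_last + 1):
        samples = dists[(k, state, line_group)]
        draws = rng.choices(samples, k=n)  # n Monte Carlo draws with replacement
        total += mean(draws)               # averaged unit score for interval k
    return total
```

Calling the function once with the home team's table and once with the away team's table gives the pair of predicted final scores. Passing a seeded `random.Random` as `rng` makes a run reproducible.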
As mentioned, all score distributions are classified into quality groups for the Monte Carlo simulation on the basis of team quality. The betting lines of bookmakers represent the difference in the teams' abilities, as indirectly judged by the public or experts. In the score simulation, the score for the next three minutes is the average value of the simulated score distribution, which is selected by both the betting line and one of four states ($m$ = LL, LH, HL, HH) determined by the TS% of both teams. The previous season's average TS% values for home and away games at all unit times, $TS_H$ and $TS_A$, serve as the thresholds that define the states: LL (<$TS_H$, <$TS_A$), LH (<$TS_H$, ≥$TS_A$), HL (≥$TS_H$, <$TS_A$), and HH (≥$TS_H$, ≥$TS_A$), i.e., Low Low, Low High, High Low, and High High, respectively.
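The state decision reduces to two threshold comparisons. The following sketch follows the definitions above, with the first letter referring to the home team and the second to the away team; the function and argument names are illustrative.

```python
def shooting_state(ts_home, ts_away, ts_h_ref, ts_a_ref):
    """Return "LL", "LH", "HL", or "HH" for the current in-play TS% values.

    ts_home, ts_away: in-play TS% of the home and away teams.
    ts_h_ref, ts_a_ref: previous-season average TS% for home and away games.
    A value equal to its threshold counts as High, matching the >= rule.
    """
    home = "H" if ts_home >= ts_h_ref else "L"
    away = "H" if ts_away >= ts_a_ref else "L"
    return home + away
```

For example, a home team shooting below its threshold against an away team shooting above its own is assigned the LH state.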
Figures 7.3–7.6 illustrate the four states, grouped by true shooting percentage, for the three-minute intervals of all time divisions. The mean scores differ markedly between states, depending on TS%. The grey and black bars denote the score distributions of the home and away teams. When the actual in-play TS% of the home or away team is at or above $TS_H$ or $TS_A$, respectively, the simulated score for the next three minutes is typically 4–5 points higher than that drawn from the score distribution of teams whose TS% is below $TS_H$ or $TS_A$.