Many predictor variables are created in SAS based on data imported from ACCESS.
Most are the indicator type: 1 if present in a data field, 0 if not present. Specific jockeys and trainers are examples: the single TrainID field yields 72 individual trainer covariates, and the JockID field yields 24 jockey covariates. Other predictor variables are calculated in SAS and take continuous values, such as wbfOld1 and wbfOld2, which are the win bet fractions for odds1 and odds2 respectively.
3.3.1 Non-Indicator Covariates Created in SAS
See Table 2.1 for statistics on these covariates.
wbf : Win bet fraction of odds: 1 / ( 1 + odds )
wbfAll : Win bet fraction raised to an exponent determined through the Box-Cox method
wbfOld1 : Win bet fraction from odds of previous race
wbfOld2 : Win bet fraction from odds of 2nd race back
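As a small illustration of the wbf definition, the Python sketch below converts odds to win bet fractions. The function name win_bet_fraction and the example odds are assumptions made for illustration only; the covariates themselves are computed in SAS.

```python
def win_bet_fraction(odds):
    """Return the win bet fraction 1 / (1 + odds) for odds quoted 'to 1'."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return 1.0 / (1.0 + odds)


# Hypothetical five-horse race (odds are made up for illustration).
example_odds = [1.0, 2.0, 4.0, 9.0, 9.0]
example_wbf = [win_bet_fraction(o) for o in example_odds]

# With real track odds the fractions in a race sum to more than 1
# (Section 3.3.3); these made-up odds were chosen to behave the same way.
example_total = sum(example_wbf)
```

The same function applied to the odds of the previous race and of the 2nd race back would give values of the kind stored in wbfOld1 and wbfOld2.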
3.3.2 Indicator Type Covariates Created in SAS
The Post Position field yielded five indicator variables of interest, covering the three inside posts (1 to 3) and the two far outside posts: pp1, pp2, pp3, ppOut (far outside post) and ppInOut (the post just to the left of the far outside post). Saving ground (running distance) on the turns is important, since the less distance a horse has to run, the better its chances of a good finish. Post position is a definite factor in getting a horse into a favorable position on the turns. In many two-turn races, such as a mile at Santa Anita and Del Mar, the first turn comes up in less than a furlong, and the inside positions can be an advantage for quick-starting horses, who then save ground on the first turn. However, post position 1 is considered the most dangerous position because of its proximity to the inside rail, where many horse racing accidents have taken place; horses are often pinched between the rail and other horses. The stateBred field yielded indicator variables for seven countries and eight states.
The jockey field is used to create 24 Indicator-type covariates for individual jockeys.
In a similar fashion, 72 Indicator-type covariates for individual trainers were created (see Table 3.1). Other indicator variables included three Claimed indicators from the cl12 field: cl1 (horse claimed in last race), cl2 (claimed 2nd race back), and cl3; two (blinksOn and BlinksOff) from the blinkers field; two (start1st and start2nd) from the numLines field; and two input fields (Lasix1st and notLasix) that were changed to indicator types to facilitate processing.
Table 3.1. Trainer Names and ID Codes
ID Name ID Name ID Name
A Barry Abrams Ag Paul Aguirre AV A. C. Avila
B Bob Baffert Bec Rafael Becerra C Jack Carava
Cad Ruben Cardenas Cec Ben Cecil CJ Julio Canani
Cs James Cassidy CV Vladmir Cerin D Neil Drysdale
DC Caesar Dominguez Dej Jose DeLima DO Craig Dollase
EL Ronald Ellis Eur Peter Eurton F Robert Frankel
FA Jerry Fanning Ga Carla Gaines GL Patrick Gallagher
Gla Mark Glatt Gok Sal Gonzalez Gp Paco Gonzalez
Gre Beau Greely Gut Jorge Guitierrez H Robert B. Hess
HA Mike Harrington Hab Eoin Harty HD Bruce Headley
Hen Dan Hendriks HF David Hofmans Hol Jerry Hollendorfer
Jom Martin F. Jones Kna Steve Knapp Kor Brian Koriner
La David La Croix LE Craig Lewis Ma Michael Machowsky
Ma2 Gary Mandella MC Ronald McAnally MD Richard Mandella
Mii Peter Miller MM Mike Mitchell Mo Henry Moreno
Mog Ed Moger Mul Jeff Mullins Mum Kristin Mulhall
ON Doug O’Neil Paa Christopher Paasch Pei Jorge Periban
Pol Marcelo Polanco Pow Leonard Powell Puy Mike Puype
SA John Sadler SH Sanford Shulman Shc Gary Sherlock
She Art Sherman Shi John Shirreffs Si Clifford Sise
SJ Jenine Sahadi SM Melvin Stute SP William Spawr
Ste Roger Stein Stg Gary Stute TR Eddie Truman
VB Jack Van Berg VD Darrell Vienna Wa Ward Wesley
WK Kathy Walsh WT Ted West Zuc Howard Zucker
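The indicator covariates of Section 3.3.2 can be pictured with a short Python sketch. It is not the SAS code used here: the function names, the "trn_" prefix, and the reading of ppInOut as the post one position inside the far outside post are assumptions made for illustration. The three trainer IDs are taken from Table 3.1.

```python
def post_position_indicators(post, field_size):
    """Return the five post-position indicators for one horse.

    Assumes posts are numbered 1..field_size from the rail outward, so the
    far outside post equals field_size and the post just inside it is
    field_size - 1 (an interpretation of the text, not confirmed by it).
    """
    return {
        "pp1": int(post == 1),
        "pp2": int(post == 2),
        "pp3": int(post == 3),
        "ppOut": int(post == field_size),
        "ppInOut": int(post == field_size - 1),
    }


def id_indicators(value, known_ids, prefix):
    """One 0/1 covariate per known ID; all zeros if value is not listed."""
    return {prefix + code: int(value == code) for code in known_ids}


# Three trainer IDs from Table 3.1; the "trn_" prefix is hypothetical.
trainer_ids = ["B", "MD", "Shi"]
row = {}
row.update(post_position_indicators(post=8, field_size=8))
row.update(id_indicators("MD", trainer_ids, prefix="trn_"))
```

In this sketch a horse whose trainer is not among the listed IDs simply gets zeros for every trainer covariate, which matches the 1-if-present, 0-if-not definition of the indicator type.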
3.3.3 WBF Exponent Found Using Box-Cox Method
The best predictor of a horse’s performance is the odds at which it goes off, as shown by Table 2.2, where the two performance measurements, win percentage and Perf, decrease reading down the table as the odds increase. The powerful betting public, made up of thousands of bettors wagering many thousands and frequently millions of dollars on a single race, is constantly searching for a “bargain” horse: one whose return is better than expected. As in the stock market, there are last-minute “corrections” on horses that appear to have value.
Although the odds are the best predictor, they do not come in an easy-to-use form: odds do not translate directly to probabilities, and the sum of the odds of all the horses in a race has no significance. Inverting the odds to get the win bet fraction, wbf = 1/(odds + 1), is a start, since the win bet fractions in a race would sum to one if there were no House Cut. With the House Cut, which varies due to Breakage, the win bet fractions sum to around 1.20. Thus the win bet fractions indicate how strongly each horse is bet relative to the others. In the early stages of this project, it was noticed that the square root of wbf gave a better fit than wbf itself, so it seemed likely that the best fit would be wbf raised to an optimal exponent. The well-known Box-Cox [12] transformation procedure, based on a maximum likelihood estimation routine, was therefore used to find the optimal exponent for wbf. Notice that in this instance, wbf is the response variable and Perf is the predictor variable. The procedure was performed starting with coarse intervals of 0.1 for the exponent, followed by intervals of 0.01, 0.001, and 0.0001, reaching the limits of accuracy of the SAS Box-Cox procedure. An exponent was thus found to the 4th decimal place (0.1548). A new predictor variable was then created for each horse:
wbfAll = wbf^0.1548.
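The coarse-to-fine search for the exponent can be sketched in Python as follows. This is not the SAS Box-Cox procedure: it is a minimal stand-in that assumes a single predictor, a positive response, and the standard Box-Cox profile log-likelihood. The function names, the starting window of -2 to 2, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def boxcox_loglik(lam, x, y):
    """Profile log-likelihood of exponent lam for regressing y**lam on x."""
    n = len(y)
    if abs(lam) < 1e-12:
        z = [math.log(v) for v in y]  # limiting case of the transform
    else:
        z = [(v ** lam - 1.0) / lam for v in y]
    # Simple linear regression of the transformed response on x.
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    slope = sxz / sxx
    rss = sum((b - mz - slope * (a - mx)) ** 2 for a, b in zip(x, z))
    # Gaussian log-likelihood plus the Jacobian of the transformation.
    return -0.5 * n * math.log(rss / n) + (lam - 1.0) * sum(math.log(v) for v in y)


def refine_exponent(x, y, lo=-2.0, hi=2.0, steps=(0.1, 0.01, 0.001, 0.0001)):
    """Coarse-to-fine grid search for the exponent maximizing the likelihood."""
    best = None
    for step in steps:
        count = int(round((hi - lo) / step))
        grid = [lo + i * step for i in range(count + 1)]
        best = max(grid, key=lambda lam: boxcox_loglik(lam, x, y))
        lo, hi = best - step, best + step  # narrow the window around the best value
    return round(best, 4)


# Synthetic data (not race data): sqrt(y) is linear in x, so the
# search should land close to an exponent of 0.5.
rng = random.Random(0)
x_demo = [float(i) for i in range(1, 51)]
y_demo = [(2.0 + 0.1 * a + rng.gauss(0.0, 0.01)) ** 2 for a in x_demo]
lam_demo = refine_exponent(x_demo, y_demo)
```

Each pass searches only one grid step either side of the previous best value, which mirrors the 0.1, 0.01, 0.001, 0.0001 sequence described above while keeping the number of likelihood evaluations small.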
3.3.4 SAS Regression and Model Selection
The REG procedure in SAS fits a linear regression model by least squares to find estimated coefficients for each predictor variable. The Stepwise, Forward, and Backward Selection processes are used (with a selection criterion of 0.05) and compared to find the best model. These selection processes depended on Mallows’ Cp criterion. The Variance Inflation Factor (VIF) diagnostic is used to check for multicollinearity. After consideration, various covariates were deleted from the final model due to correlation problems and low significance.
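To make the VIF diagnostic concrete, the Python sketch below computes VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors plus an intercept. It is a plain-Python illustration of that textbook definition, not the SAS REG implementation; the helper names and the toy columns are assumptions.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, target):
    """R^2 from least-squares regression of target on columns plus intercept."""
    n = len(target)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * target[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(bj * v for bj, v in zip(beta, row)) for row in design]
    mean = sum(target) / n
    rss = sum((t - f) ** 2 for t, f in zip(target, fitted))
    tss = sum((t - mean) ** 2 for t in target)
    return 1.0 - rss / tss


def vif(columns):
    """Variance inflation factor for each predictor column.

    Perfectly collinear columns are not handled and raise ZeroDivisionError.
    """
    out = []
    for j, col in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        out.append(1.0 / (1.0 - r_squared(others, col)))
    return out


# Toy columns: the first pair is uncorrelated, the second nearly collinear.
vif_uncorrelated = vif([[-1.0, 1.0, -1.0, 1.0], [-1.0, -1.0, 1.0, 1.0]])
vif_collinear = vif([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]])
```

A VIF near 1 indicates a predictor that is almost unrelated to the others, while large values flag the kind of correlation problem that led to covariates being dropped.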
Data Subgroups are run through the same process described above and, if warranted, new predictor variables are created, always of the indicator type since they are specific to the subgroups. Note that in some cases original covariates may be set to 0 when the new covariates are set to 1 to avoid correlation problems.
The regression process is repeated with the new and original covariates. The VIF diagnostic is especially important for checking for correlation between old and new covariates. The standard deviation used for Monte Carlo processing of test results is generated in this step. A Baseline model for testing was created using wbfAll and the number of horses in the race to get a predicted Perf value for each horse. Table 3.2 presents the ANOVA Table and parameter estimates.
Two files containing the predicted perfs for the horses in the Test Dataset were created: the Test Baseline File and the Test Results File. They were then exported to Matlab for Monte Carlo-style processing. The standard error used here is generated in the step described in Section 3.3.4. For this step, horses are grouped together by race. Each horse in each race has a random normal number times the standard deviation added to its predicted perf to simulate the variance in performance as predicted by the Regression Model of Table 3.3. Each race was simulated 100,000 times. The number of simulations in which a particular horse had the highest total was divided by 100,000 to get its estimated probability of winning. The same process was used to get estimated probabilities for 2nd, 3rd, and 4th place.
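The simulation step can be sketched in Python as below. The actual processing was done in Matlab; this version only illustrates the idea of adding normal noise to each predicted perf and counting finishing positions. The function name, the predicted perfs, and the standard deviation of 1.0 are made-up assumptions, and the demonstration uses fewer than 100,000 simulations to keep it quick.

```python
import random


def simulate_race(pred, sd, n_sims=100_000, places=4, seed=None):
    """Estimate finish-position probabilities for one race.

    pred: predicted perf for each horse; sd: common standard deviation.
    Returns probs[k][h] = estimated probability horse h finishes in place k+1.
    """
    rng = random.Random(seed)
    n = len(pred)
    places = min(places, n)
    counts = [[0] * n for _ in range(places)]
    for _ in range(n_sims):
        # Add normal noise to each predicted perf, then rank high to low.
        totals = [p + sd * rng.gauss(0.0, 1.0) for p in pred]
        order = sorted(range(n), key=lambda h: totals[h], reverse=True)
        for k in range(places):
            counts[k][order[k]] += 1
    return [[c / n_sims for c in row] for row in counts]


# Hypothetical predicted perfs for a five-horse race.
demo_probs = simulate_race([3.0, 1.0, 1.0, 0.5, 0.0], sd=1.0, n_sims=20_000, seed=1)
```

Here demo_probs[0] holds the estimated win probabilities and demo_probs[1] to demo_probs[3] hold those for 2nd to 4th place; each row sums to one because exactly one horse fills each position in every simulated race.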
3.5 COMPARING PROBABILITY FILES IN ACCESS
The two probability files, Test Baseline and Test Results, are exported to ACCESS for comparison reports. The final model probabilities, which originated from the estimated regression equation, are compared to the baseline probabilities. Those that differ significantly are separated into two groups: estimated probabilities higher than the baseline’s are considered good bets (“Overlays”), while those lower than the baseline probabilities are “Underlays” (bad bets). Each group is displayed in an odds-based report format. The Results File was generated by applying the Regression Function to each horse in the Test Data, plugging in the Regression Coefficients to get predicted values for perf (pred), as shown in Table 3.3.
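The Overlay/Underlay split can be illustrated with a short Python sketch. The text does not state how "differ significantly" was decided, so the absolute threshold of 0.03 used here is purely hypothetical, as are the function name and the example rows. The comparison itself was carried out in ACCESS.

```python
def classify_bet(model_prob, baseline_prob, threshold=0.03):
    """Label a horse by comparing model and baseline win probabilities.

    threshold is a hypothetical absolute cutoff for "differ significantly";
    the source does not state the rule actually used.
    """
    diff = model_prob - baseline_prob
    if diff >= threshold:
        return "Overlay"
    if diff <= -threshold:
        return "Underlay"
    return "Neither"


# Made-up (horse, model probability, baseline probability) rows.
demo_rows = [("A", 0.30, 0.22), ("B", 0.10, 0.18), ("C", 0.15, 0.16)]
demo_labels = {name: classify_bet(m, b) for name, m, b in demo_rows}
```

Horses labelled "Neither" are those whose model and baseline probabilities are too close to separate, and they would appear in neither report.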