2.2 Data
2.2.4 Extracting attributes for statistical modeling
Observations for each of the 1,714 tournaments have 20 attributes, which implies that the total number of attributes available for calculating winning probabilities in a future golf tournament could be very high. The number of attributes could be high because: (1) all previous information could potentially be used in determining future winning probabilities; (2) observations for each tournament are located in different places in the space spanned by the date and the pro tour dimensions. Some form of dimensionality reduction is needed in order to capture the important information in the dataset more effectively.³
³ The methodology and techniques used for feature extraction and variable transformation in the following come from Tan et al. (2013, Chap. 2).
I turn to the existing literature on sports betting for ideas to reduce the dimensionality of my dataset. The literature on estimating winning probabilities for golfers in golf tournaments contains, to my present knowledge, one article. Shmanske (2005) models winning probabilities for golfers based on summary statistics provided by the PGA Tour. He thus does not directly use past golf results in his model. However, many articles have been written on horse-racing with this focus (see e.g. Bolton & Chapman, 1986; Lessmann et al., 2007; Sung & Johnson, 2012). Many aspects of horse-racing are comparable to golf tournaments: horses compete against other horses of varying quality and form just as golfers compete against other golfers of varying quality and form; each horse runs in many races just as each golfer plays in many tournaments; the courses vary in length; etc. Table 2.5 lists some of the aggregating attributes that have been used in the horse-racing literature in order to reduce dataset dimensionality.
Table 2.5: Attributes used for winning probability estimation in horse-racing
No.  Attribute descriptions
1    Speed rating for the previous race in which the horse ran
2    The average of a horse’s speed rating in its last 4 races; zero when there is no past run
3    Total prize money earnings (finishing first, second or third) to date / Number of races entered
4    The percentage of the races won by the horse in its career
5    The natural logarithm of the normalised final odds probability
Only attributes deemed relevant in the golf context are included. Complete attribute list can be found in Sung & Johnson (2012). First four attributes were proposed by Bolton & Chapman (1986), the last was proposed by Benter (1994).
Table 2.5 contains attributes intended to proxy: (1) a horse’s quality, via e.g. the attributes for prize money earnings and win percentage; (2) the horse’s form, via e.g. the speed rating attributes; (3) potential inside information encapsulated in the odds.
The idea is to incorporate attributes that capture both the underlying, probably slowly time-varying, horse quality and a probably more volatile measure of current form. The attributes given in the table are likely strongly correlated, so each of them captures aspects of both underlying measures.
It is clear from the table that the academics who have used these attributes have made some arbitrary choices with regard to the dimensionality reduction, e.g. averaging the speed rating over the last four races. There is, to my present knowledge, no a priori reason why four is the right number. Furthermore, issues could arise because the time dimension of the dataset has not been incorporated into the attributes. A horse could, for example, have been sick for a year, in which case the four previous races (averaged over in the attribute) would all predate the sickness. The attribute is then unlikely to be a good estimator of the horse’s current quality and form.
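The staleness problem can be illustrated with a minimal Python sketch. The function names, the 365-day window and the race history are hypothetical and serve only as an illustration; they are not taken from the horse-racing literature or from my dataset.

```python
from datetime import date


def avg_last_k(ratings, k=4):
    """Mean of the k most recent ratings; zero when there is no past run."""
    recent = ratings[-k:]
    return sum(recent) / len(recent) if recent else 0.0


def avg_within_days(dated_ratings, as_of, max_age_days=365):
    """Mean of ratings at most max_age_days old; None when nothing is recent.

    The window length is an illustrative assumption, not a recommended value.
    """
    recent = [r for d, r in dated_ratings if 0 < (as_of - d).days <= max_age_days]
    return sum(recent) / len(recent) if recent else None


# Hypothetical horse: four races in spring 2010, then a long absence.
history = [
    (date(2010, 3, 1), 80.0),
    (date(2010, 4, 1), 82.0),
    (date(2010, 5, 1), 84.0),
    (date(2010, 6, 1), 86.0),
]
as_of = date(2011, 9, 1)

stale_average = avg_last_k([rating for _, rating in history])
windowed_average = avg_within_days(history, as_of)
```

Here `stale_average` still reports 83.0 although every race is more than a year old, whereas the date-aware variant returns `None` and thereby signals that no recent information exists.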
I propose two sets of attributes to be used in predicting the winning probabilities for golfers: (1) a set of attributes resembling the attributes used in the literature for horse-race estimation (Table 2.5); (2) a set of attributes with less arbitrary choices with regard to the dimensionality reduction.
I introduce notation for golfers and tournaments in order to make the following easier to read. The dataset contains n tournaments, denoted j = 1, 2, ..., n. In tournament j, m_j golfers compete against each other. These golfers are denoted i = 1, 2, ..., m_j.
Static, arbitrary dimensionality reduction
A set of attributes resembling the attributes used in the literature for horse-race winning probability estimation (see Table 2.5) is created based on the original dataset (described in subsection 2.2.1 and subsection 2.2.2). The idea from the horse-racing literature of including attributes to proxy form and quality is used. The attributes are listed in Table 2.6.
The basic ideas from the economic literature on economic incentives and psychological aspects of the golf tournament (section 2.1.1) are used to create the following two attributes:
1. A substitute for speed rating (from the horse-racing literature, see Table 2.5). The substitute is named score rating and is given by the number of strokes used by golfer i in the first round of tournament j minus the median number of strokes used by the golfers in round 1 of tournament j.
There is a difference between making a good score on a first-tier tour such as the PGA Tour and on a second-tier tour such as the Nationwide Tour. I have analyzed the difference by looking at score ratings for golfers participating in both first-tier and second-tier pro tours in the same calendar year. I find that the score rating is on average 1.43 strokes higher on first-tier than on second-tier pro tours.
I compensate for this difference by adding 1.43 to all score ratings from tournaments on the Champions Tour and the Nationwide Tour.
2. An attribute to proxy a golfer’s ability to perform under pressure. The attribute is named keep cool and is given by the number of wins divided by the number of top 10 positions. I assume that this attribute gives some indication of the golfer’s ability not to choke under pressure.
The pressure on golfers is higher on first-tier tours than on second-tier tours. I make the simplifying assumption that a victory on a first-tier tour counts four times as much as a victory on a second-tier tour.
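The two attributes can be sketched in Python as follows. This is a minimal illustration, not the code used to build the dataset. The function names and the example numbers are hypothetical. I further assume here that top 10 positions enter the keep cool ratio unweighted and that the ratio is set to zero for a golfer without top 10 positions; the text above does not settle either point.

```python
from statistics import median

SECOND_TIER_ADJUSTMENT = 1.43  # strokes added to second-tier score ratings
FIRST_TIER_WIN_WEIGHT = 4      # a first-tier win counts as four second-tier wins


def score_ratings(round1_strokes, second_tier=False):
    """Map golfer -> score rating for one tournament.

    round1_strokes maps each golfer to the strokes used in round 1.
    """
    field_median = median(round1_strokes.values())
    shift = SECOND_TIER_ADJUSTMENT if second_tier else 0.0
    return {g: s - field_median + shift for g, s in round1_strokes.items()}


def keep_cool(first_tier_wins, second_tier_wins, top10_positions):
    """Weighted wins divided by top 10 positions.

    Assumptions: top 10 positions are unweighted, and the value is 0.0
    when the golfer has no top 10 positions.
    """
    if top10_positions == 0:
        return 0.0
    weighted_wins = FIRST_TIER_WIN_WEIGHT * first_tier_wins + second_tier_wins
    return weighted_wins / top10_positions


# Hypothetical round 1 scores; the field median is 71 strokes.
round1 = {"A": 68, "B": 70, "C": 72, "D": 74}
first_tier = score_ratings(round1)
second_tier = score_ratings(round1, second_tier=True)
ratio = keep_cool(first_tier_wins=1, second_tier_wins=2, top10_positions=12)
```

With these numbers golfer A receives a score rating of -3.0 if the tournament is first-tier and about -1.57 if it is second-tier, and the keep cool ratio is 0.5.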
Table 2.6: Attributes set no. 1
Attribute               Attribute description
avg score rating year   The average of a golfer’s score rating* over the last year.
avg keep cool year      The average of a golfer’s wins compared to top 10 positions over the last year.
avg purse 2years        Total prize money earnings in the last two years divided by the number of tournaments entered in the last two years.
ln odds                 The natural logarithm of the normalized final odds probability.
*score rating is given by the number of strokes used by golfer i in round 1 of tournament j minus the median number of strokes used by the golfers in round 1 of tournament j. 1.43 is added to score ratings from second-tier tours to compensate for the difference in level between first-tier and second-tier pro tours.
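A minimal Python sketch of the last two attributes in Table 2.6 is given below. Several choices are assumptions made for illustration only: odds are taken to be decimal odds, the normalization divides each implied probability (1/odds) by the sum over the field, two years are approximated by 730 days, and the attribute is set to zero for a golfer without tournaments in the window. The function names and the numbers are hypothetical.

```python
import math
from datetime import date, timedelta


def avg_purse(results, tournament_start, years=2):
    """Total prize money in the window divided by tournaments entered.

    results is a list of (date, purse) pairs, one per tournament entered.
    Assumption: 0.0 is returned when no tournament falls in the window.
    """
    cutoff = tournament_start - timedelta(days=365 * years)
    window = [purse for d, purse in results if cutoff <= d < tournament_start]
    return sum(window) / len(window) if window else 0.0


def ln_normalized_odds(decimal_odds):
    """Natural log of implied probabilities rescaled to sum to one."""
    implied = {g: 1.0 / o for g, o in decimal_odds.items()}
    total = sum(implied.values())
    return {g: math.log(p / total) for g, p in implied.items()}


# Hypothetical golfer: the 2010 result falls outside the two-year window.
results = [
    (date(2010, 1, 1), 5000.0),
    (date(2011, 3, 1), 0.0),
    (date(2011, 9, 1), 30000.0),
    (date(2012, 4, 1), 60000.0),
]
purse_attribute = avg_purse(results, date(2012, 6, 1))

# Hypothetical three-golfer field whose implied probabilities sum to more than one.
ln_odds = ln_normalized_odds({"A": 2.0, "B": 3.0, "C": 4.0})
```

In this example `purse_attribute` equals 30000.0, since three tournaments with total earnings of 90,000 fall inside the window. Exponentiating the `ln_odds` values gives probabilities that sum to one.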
The list of attributes furthermore includes measures for: previous winnings in GBP; winning percentages. Betfair odds are included for a part of the dataset.
Dynamic dimensionality reduction
I create a new set of attributes (listed in Table 2.7). The set contains the same sort of information as the static set (Table 2.6), but the attributes proposed in this subsection reduce the dimensionality of the original dataset less.
This attribute set contains vectors instead of numbers. For example, avg purse 2years (from Table 2.6) contains one number per golfer per tournament, which averages the winnings over the previous two years. In the attribute set in this subsection, historical purse information is instead captured in a vector, π_ij, with D elements. Each element, π_ij,d, contains the purse won d days prior to the start of tournament j. The following table lists all the attribute vectors.
Table 2.7: Attributes set no. 2
Attribute   Attribute description
π_ij        A vector containing D elements, where the dth element, π_ij,d, specifies the purse won by golfer i, d days prior to the beginning of tournament j.
γ_ij        A vector containing D elements, where the dth element, γ_ij,d, specifies the score rating of golfer i, d days prior to the beginning of tournament j.
w_ij        A vector containing D elements, where the dth element, w_ij,d, is a binary attribute that specifies whether golfer i won a tournament d days prior to the beginning of tournament j.
ψ_ij        A vector containing D elements, where the dth element, ψ_ij,d, is a binary attribute that specifies whether golfer i participated in a tournament d days prior to the beginning of tournament j.
c_ij        A vector containing D elements, where the dth element, c_ij,d, is a binary attribute that specifies whether golfer i ended in the top 10 in a tournament d days prior to the beginning of tournament j.
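The construction of the five vectors in Table 2.7 for one golfer and one tournament can be sketched in Python as below. The sketch rests on assumptions that the table does not settle: each past tournament is represented by a single date, elements for days without a tournament are filled with zeros, and element d is stored at list index d - 1. The function name, the record layout and the data are hypothetical.

```python
from datetime import date


def build_history_vectors(results, tournament_start, D):
    """Return the D-element vectors pi, gamma, w, psi and c for one golfer.

    results is a list of dicts with keys: date, purse, score_rating,
    won, top10. Element d (1-based) is stored at list index d - 1.
    """
    vectors = {
        "pi": [0.0] * D,     # purse won d days before the start
        "gamma": [0.0] * D,  # score rating d days before the start
        "w": [0] * D,        # 1 if the golfer won a tournament
        "psi": [0] * D,      # 1 if the golfer participated
        "c": [0] * D,        # 1 if the golfer ended in the top 10
    }
    for r in results:
        d = (tournament_start - r["date"]).days
        if not 1 <= d <= D:
            continue  # outside the D-day history window
        k = d - 1
        vectors["pi"][k] = r["purse"]
        vectors["gamma"][k] = r["score_rating"]
        vectors["w"][k] = int(r["won"])
        vectors["psi"][k] = 1
        vectors["c"][k] = int(r["top10"])
    return vectors


# Hypothetical history: one result 3 days and one 12 days before the start.
results = [
    {"date": date(2012, 5, 29), "purse": 50000.0, "score_rating": -2.0,
     "won": False, "top10": True},
    {"date": date(2012, 5, 20), "purse": 0.0, "score_rating": 1.5,
     "won": False, "top10": False},
]
vectors = build_history_vectors(results, date(2012, 6, 1), D=7)
```

With D = 7 only the result from three days before the start enters the vectors, so `vectors["psi"]` is `[0, 0, 1, 0, 0, 0, 0]`. Under the zero-fill assumption the participation vector ψ is what separates a score rating of exactly zero from a day on which the golfer did not play.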