Grouped binary data - Categorical responses

Categorical responses

7.5 Grouped binary data

estimated as

\[ e^{\hat\beta_0} = e^{-1.750} = 0.174 . \]

This translates to an estimated probability of a claim of

\[ \hat\pi = \frac{0.174}{1 + 0.174} = 0.148 . \]

The multiplicative effect on the odds of a claim, for a policy being in a category other than the base level, for any of the risk factors, is given in the column “\(e^{\hat\beta}\).” For example, the effect of driver’s age being in age band 1 on the odds of a claim is 1.333, or a 33.3% increase. The estimate of the odds of a claim for a vehicle with driver’s age band 1, and all of the other risk factors at base levels, is therefore 0.174 × 1.333 = 0.232, giving an estimated probability of a claim of 0.232/(1 + 0.232) = 0.188. The effect of any other combination of risk factors is calculated in this way. For example, the estimated odds of a claim on a policy with driver’s age band 1, area A, vehicle body panel van and vehicle value $25 000 to $50 000, is

0.174 × 1.333 × 0.965 × 1.020 × 1.234 = 0.282

giving an estimated probability of a claim of 0.282/(1 + 0.282) = 0.220.
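The arithmetic above can be checked with a short script. This is only a sketch: the function name `odds_to_probability` and the dictionary labels are illustrative, while the numbers are the base odds and multiplicative effects quoted in the text.

```python
def odds_to_probability(odds):
    """Convert the odds of an event into the probability of that event."""
    return odds / (1.0 + odds)


# Base-level odds and multiplicative effects, as quoted in the text.
base_odds = 0.174
effects = {
    "driver's age band 1": 1.333,
    "area A": 0.965,
    "vehicle body panel van": 1.020,
    "vehicle value $25 000 to $50 000": 1.234,
}

# Odds for a combination of risk factors: base odds times each effect.
odds = base_odds
for effect in effects.values():
    odds *= effect

probability = odds_to_probability(odds)
print(round(odds, 3), round(probability, 3))
```

Any other combination of risk factors is handled by changing the entries of `effects`.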

7.5 Grouped binary data

If all explanatory variables are categorical, it is possible to express a data set in grouped form, as discussed in Section 4.8. A group consists of all cases with the same explanatory variable values and may correspond to a homogeneous risk set.

In the case of a binary response, once the data are grouped the observed response is the number of events occurring in the group. This is different from the situation with a continuous response, where the response is the group mean (see the example in Section 4.8). Write

\(m\) = number of groups

\(n_i\) = number of cases (policies) in group \(i\)

\(y_i\) = number of events occurring in group \(i\)

\(\pi_i\) = probability of the event occurring for a case (policy) in group \(i\)

\(n\) = total sample size = \(\sum_{i=1}^{m} n_i\).

Then \(y_i\) is the number of occurrences of the event, out of a possible \(n_i\), where the probability of the event occurring in each case is \(\pi_i\). The observed response is assumed to have a binomial distribution: \(y_i \sim B(n_i, \pi_i)\), where the probability \(\pi_i\) is modeled as a function of explanatory variables. With this approach, differing exposure of individual policies cannot be adjusted for, nor can one include continuous explanatory variables in the model.

Table 7.6. Subset of vehicle insurance claims grouped data

 i    Driver’s age   Area   Vehicle body    Vehicle value ($000’s)   Number of policies n_i   Number of claims y_i
 1    1              A      Bus             <25                        2                        0
 2    1              A      Convertible     <25                        1                        0
 3    1              A      Convertible     25–50                      1                        0
 4    1              A      Convertible     75–100                     2                        0
 5    1              A      Coupe           <25                       18                        2
 6    1              A      Coupe           25–50                      2                        0
 7    1              A      Coupe           75–100                     1                        0
 8    1              A      Hatchback       <25                      554                       47
 9    1              A      Hatchback       25–50                      8                        0
 10   1              A      Hardtop         <25                       19                        2
 11   1              A      Hardtop         25–50                      9                        3
 12   1              A      Hardtop         50–75                      1                        0
 13   1              A      Minicaravan     <25                        3                        2
 14   1              A      Minibus         <25                        4                        0
 23   1              A      Station wagon   <25                      108                       11
 24   1              A      Station wagon   25–50                     86                        5

Vehicle insurance. Consider the explanatory variables driver’s age, area, vehicle body type and vehicle value. There are 6 × 6 × 13 × 6 = 2808 combinations of the levels. Two-thirds of these are empty, leaving just \(m = 929\) non-empty combinations. A subset of the grouped data, for age band 1 and area A, is shown in Table 7.6. For example, group 8 consists of policies having driver’s age band 1, area A, hatchback body type and a value of less than $25 000. There are \(n_8 = 554\) policies in this category, of which \(y_8 = 47\) have a claim. The model estimates are identical to those computed from the individual policy data, without the exposure adjustment.
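The grouping step itself can be sketched as follows. The records below are hypothetical and are not taken from the vehicle insurance data, and the function name `group_binary_data` is an assumption made for illustration. Each group is a distinct combination of the categorical explanatory variables, and the code accumulates the number of policies and the number of claims for each group.

```python
from collections import defaultdict


def group_binary_data(records):
    """Collapse individual (covariates, response) records into groups.

    Returns a dict mapping each covariate combination to [n_i, y_i]:
    the number of cases and the number of events in that group.
    """
    groups = defaultdict(lambda: [0, 0])
    for covariates, claim in records:
        groups[covariates][0] += 1      # n_i: policies in the group
        groups[covariates][1] += claim  # y_i: claims in the group
    return dict(groups)


# Hypothetical individual policies:
# (driver's age band, area, vehicle body, value band), claim indicator (0 or 1)
records = [
    ((1, "A", "Coupe", "<25"), 0),
    ((1, "A", "Coupe", "<25"), 1),
    ((1, "A", "Hatchback", "<25"), 0),
    ((1, "A", "Coupe", "<25"), 0),
    ((1, "A", "Hatchback", "<25"), 1),
]

grouped = group_binary_data(records)
m = len(grouped)                                # number of non-empty groups
n = sum(n_i for n_i, _ in grouped.values())     # total sample size
for key, (n_i, y_i) in grouped.items():
    print(key, n_i, y_i)
```

Only non-empty combinations appear in the result, which mirrors the count of non-empty groups in the text.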

7.6 Goodness of fit for logistic regression

A variety of goodness of fit statistics or methods are available for logistic regression. This section considers some of them.

Deviance. The deviance \(\Delta\) is not a useful measure of goodness of fit of the logistic regression model. To see this, suppose \(\hat\pi_i\) is the predicted probability of success:

\[ \hat\pi_i = \frac{e^{x_i'\hat\beta}}{1 + e^{x_i'\hat\beta}} , \]

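where \(\hat\beta\) is the maximum likelihood estimate. Algebraic manipulation shows that the deviance is (Collett 2003):

\[ \Delta = -2 \sum_{i=1}^{n} \left\{ \hat\pi_i \ln\frac{\hat\pi_i}{1 - \hat\pi_i} + \ln(1 - \hat\pi_i) \right\} . \]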

This depends on the counts \(y_i\) only through the fitted values \(\hat\pi_i\). Hence the deviance is not informative about the goodness of fit of the \(\hat\pi_i\) to the \(y_i\). In addition (Collett 2003), “... the deviance is not even approximately distributed as \(\chi^2\).”

A further problem with the deviance for logistic regression is that its value depends on whether the data are ungrouped or grouped. For the vehicle insurance data, the deviance for the ungrouped fit is 33 624 on 67 828 degrees of freedom. The same fit on the grouped data yields a deviance of 868 on 901 degrees of freedom. The deviance of the individual-level data compares individual responses (0 or 1) with the individual fitted probabilities \(\hat\pi_i\), \(i = 1, \ldots, n\). The deviance of the grouped data fit compares group means \(y_i/n_i\) to fitted group probabilities \(\hat\pi_i\), \(i = 1, \ldots, m\).

Pearson chi-square statistic. This is defined as

\[ \sum_{i=1}^{n} \frac{(y_i - \hat\pi_i)^2}{\hat\pi_i (1 - \hat\pi_i)} . \]

This has the usual form of the square of the difference between observed and expected values, divided by the expected value. The statistic has, approximately, the \(\chi^2_{n-p}\) distribution and is asymptotically equivalent to the deviance (Dobson 2002). Unlike the deviance, this statistic does depend on the actual counts \(y_i\). However, the approximate chi-square distribution can be poor and the statistic is not considered a reliable measure of fit.
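As a minimal sketch, the Pearson chi-square statistic for individual-level binary data can be computed directly from the responses and fitted probabilities. The function name `pearson_chi_square` and the five responses and fitted probabilities below are hypothetical, chosen only to show the calculation; they do not come from a fitted model.

```python
def pearson_chi_square(y, pi_hat):
    """Sum of (y_i - pi_hat_i)^2 / (pi_hat_i * (1 - pi_hat_i)) over all cases."""
    return sum(
        (y_i - p_i) ** 2 / (p_i * (1.0 - p_i))
        for y_i, p_i in zip(y, pi_hat)
    )


# Hypothetical binary responses and fitted probabilities (illustration only).
y = [0, 1, 0, 0, 1]
pi_hat = [0.1, 0.4, 0.2, 0.1, 0.5]

statistic = pearson_chi_square(y, pi_hat)
print(round(statistic, 4))
```

Each fitted probability must lie strictly between 0 and 1, otherwise the denominator is zero.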

SAS notes. The deviance and Pearson chi-square statistic are produced by default by proc genmod.


Table 7.7. Classification table with 0.08 threshold

                        Predicted claim
                        No        Yes      Total
 Actual claim   No      54 196    9 036    63 232
                Yes     3740      884      4624
 Total                  57 936    9 920    67 856

Classification tables and ROC curves. One way of examining the performance of a model for binary data is via a classification table. The fitted probabilities \(\hat\pi_i\) are computed and each case \(i\) is predicted (or classified) as an “event” or “non-event” depending on whether \(\hat\pi_i\) is greater than or less than a given threshold. The resulting 2 × 2 classification table compares actual occurrences to predictions.

To illustrate, consider Table 7.7, constructed using a threshold of 0.08 with the exposure-adjusted logistic regression of vehicle insurance claims. Of the 4624 claims, 884 had \(\hat\pi_i > 0.08\) and are correctly predicted to have a claim. Of the 63 232 policies with no claim, 54 196 are correctly predicted not to have a claim.

Given the classification table, the predictive usefulness of a model is often summarized using the following two measures:

• Sensitivity. This is the relative frequency of predicting an event when an event does take place.

• Specificity. This is the relative frequency of predicting a non-event when there is no event.

Ideally both sensitivity and specificity are near 1. If the threshold is set at 0 then the sensitivity and specificity are 1 and 0, respectively. As the threshold increases, fewer events are predicted, the sensitivity declines and the specificity increases. If the threshold is 1 then the sensitivity and specificity are 0 and 1, respectively.

In the above example a threshold of 0.08 leads to a sensitivity of 884/4624 = 0.19 and a specificity of 54 196/63 232 = 0.86. Thus while the model has good ability to identify policies on which there is no claim, it does not accurately predict situations where there is a claim. The high specificity is partly a consequence of the fact that most policies do not lead to a claim.
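The two measures can be computed from the four cells of a classification table. The sketch below uses the counts of Table 7.7; the function name and the argument names (true positives, false negatives, true negatives, false positives) are labels introduced here for illustration.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from classification table counts.

    tp: events predicted as events        fn: events predicted as non-events
    tn: non-events predicted as non-events  fp: non-events predicted as events
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Cell counts from Table 7.7 (threshold 0.08).
sensitivity, specificity = sensitivity_specificity(
    tp=884, fn=3740, tn=54196, fp=9036
)
print(round(sensitivity, 3), round(specificity, 3))
```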

The ROC (Receiver Operating Characteristic) curve plots the sensitivity against specificity for each threshold. Traditionally one minus the specificity is plotted on the horizontal axis, and sensitivity on the vertical axis. With this orientation of the axes, a value near zero on the x axis (high specificity) generally implies a low value on the y axis (low sensitivity) and vice versa.

Fig. 7.3. ROC curve, exposure-adjusted model for vehicle claim. Horizontal axis: 1 − specificity; vertical axis: sensitivity. Curves shown: model (AUC = 0.662) and diagonal line (AUC = 0.5).

Figure 7.3 shows the ROC curve for the exposure-adjusted vehicle insurance claim model.

All ROC curves start at (0,0) and end at (1,1), as these points correspond to threshold probabilities 1 and 0, respectively. ROC curves increase monotonically. A model with perfect predictive ability has sensitivity and specificity both equal to one, giving a ROC curve consisting of the single point in the top left hand corner. A “good” ROC curve rises quickly to 1: the closer a curve is to the top left hand corner, the better its predictive ability. This is quantified by computing the area under the ROC curve (AUC), as a measure of the model’s predictive ability. The AUC has a maximum of 1. A model with ROC curve equal to the 45° line (AUC = 0.5) has a predictive ability no better than chance. For the exposure-adjusted model, AUC = 0.662. In comparison, the model unadjusted for exposure has AUC = 0.548, indicating weaker predictive ability.
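A ROC curve and its AUC can be sketched in a few lines. The code below is an illustration under stated assumptions, not the method used to produce Figure 7.3: the responses and fitted probabilities are hypothetical, a case is classified as an event when its fitted probability exceeds the threshold, and the area is approximated by the trapezoidal rule over the resulting points.

```python
def roc_curve(y, pi_hat):
    """Return ROC points (1 - specificity, sensitivity), from (0, 0) to (1, 1)."""
    positives = sum(y)
    negatives = len(y) - positives
    # Thresholds run from 1 down to 0, so fewer and then more events are predicted.
    thresholds = sorted(set(pi_hat) | {0.0, 1.0}, reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y_i, p in zip(y, pi_hat) if y_i == 1 and p > t)
        fp = sum(1 for y_i, p in zip(y, pi_hat) if y_i == 0 and p > t)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal-rule area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical responses and fitted probabilities (illustration only).
y = [0, 0, 1, 0, 1, 1]
pi_hat = [0.1, 0.2, 0.3, 0.4, 0.6, 0.8]

points = roc_curve(y, pi_hat)
auc = area_under_curve(points)
print(points[0], points[-1], round(auc, 3))
```

The data must contain at least one event and one non-event, and all fitted probabilities must be strictly positive for the curve to reach (1,1) at threshold 0.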

The concepts of sensitivity and specificity are of particular interest in the area of medical diagnosis. Here an individual case is a patient, the event of interest is whether or not the patient has a particular disease, and the predictors are diagnostic tests. Accurate prediction of the presence of a disease at the individual patient level is critical. In insurance applications, however, prediction of a claim or no claim on an individual policy is rarely the point of statistical modeling. Rather, it is the average prediction which is of interest. For this reason, a model with low sensitivity is often adequate. The model is useful provided it explains the variability in claims behavior, as a function of risk factors. The ROC curve is a sensible means of comparing the performance of differing models.

SAS notes. The classification table, and other output specific to logistic regression, are computed in proc logistic but not in proc genmod. However, the exposure-adjusted model needs to be computed in proc genmod because proc logistic does not allow user-defined link functions. ROC curves and the AUC are demonstrated on page 169.

In document Generalized Linear Models for Insurance Data (Page 117-122)