2 Data to Insights to Decisions
3. In what ways could a predictive analytics model help to address the business problem? For any business problem, there are a number of different analytics
2.3 Designing the Analytics Base Table
Once we have decided which analytics solution we are going to develop in response to a business problem, we need to begin to design the data structures that will be used to build, evaluate, and ultimately deploy the model. This work sits primarily in the Data Understanding phase of the CRISP-DM process (see Figure 1.4) but also overlaps with the Business Understanding and Data Preparation phases (remember that the CRISP-DM process is not strictly linear).
The basic data requirements for predictive models are surprisingly simple. To build a predictive model, we need a large dataset of historical examples of the scenario for which we will make predictions. Each of these historical examples must contain sufficient data to describe the scenario and the outcome that we are interested in predicting. So, for example, if we are trying to predict whether or not insurance claims are fraudulent, we require a large dataset of historical insurance claims, and for each one we must know whether or not that claim was found to be fraudulent.
The basic structure in which we capture these historical datasets is the analytics base table (ABT), a schematic of which is shown in Table 2.1. An analytics base table is a simple, flat, tabular data structure made up of rows and columns. The columns are divided into a set of descriptive features and a target feature.
Although the ABT is the key structure that we use in developing machine learning models, data in organizations is rarely kept in neat tables ready to be used to build predictive models. Instead, we need to construct the ABT from the raw data sources that are available in an organization. These may be very diverse in nature. Figure 2.1 illustrates some of the different data sources that are typically combined to create an ABT.
Figure 2.1
The different data sources typically combined to create an analytics base table.
Before we can start to aggregate the data from these different sources, however, a significant amount of work is required to determine the appropriate design for the ABT. In designing an ABT, the first decision an analytics practitioner needs to make is on the prediction subject for the model they are trying to build. The prediction subject defines the basic level at which predictions are made, and each row in the ABT will represent one instance of the prediction subject—the phrase one-row-per-subject is often used to describe this structure. For example, for the analytics solutions proposed for the motor insurance fraud scenario, the prediction subject of the claim prediction and payment prediction models would be an insurance claim; for the member prediction model, the prediction subject would be a member; and for the application prediction model, it would be an application.
Each row in an ABT is composed of a set of descriptive features and a target feature.
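The one-row-per-subject structure can be illustrated with a minimal sketch in plain Python. The raw tables, field names, and the `build_abt` helper below are all hypothetical and chosen only for illustration; a real ABT would draw on many more sources and features.

```python
# Hypothetical raw sources: a claimant table and a claims table.
claimants = {
    "A": {"age": 34, "occupation": "teacher"},
    "B": {"age": 51, "occupation": "driver"},
}
claims = [
    {"claim_id": "C1", "claimant_id": "A", "amount": 1200.0, "fraud": False},
    {"claim_id": "C2", "claimant_id": "B", "amount": 9800.0, "fraud": True},
    {"claim_id": "C3", "claimant_id": "A", "amount": 300.0, "fraud": False},
]

# Assumed column split: descriptive features plus a single target feature.
DESCRIPTIVE_FEATURES = ["claimant_age", "claimant_occupation", "claim_amount"]
TARGET_FEATURE = "fraud"


def build_abt(claims, claimants):
    """Flatten the raw sources into one row per claim (the prediction subject)."""
    abt = []
    for claim in claims:
        claimant = claimants[claim["claimant_id"]]
        abt.append({
            "claim_id": claim["claim_id"],          # identifies the subject
            "claimant_age": claimant["age"],
            "claimant_occupation": claimant["occupation"],
            "claim_amount": claim["amount"],
            TARGET_FEATURE: claim["fraud"],         # outcome to predict
        })
    return abt


abt = build_abt(claims, claimants)
for row in abt:
    print(row)
```

Because the prediction subject here is a claim, claimant A appears in two rows, once for each claim. If the prediction subject were a member instead, the same raw data would be collapsed to one row per claimant.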
The actual features themselves can be based on any of the data sources within an organization, and defining them can appear to be a mammoth task at first. This task can be made easier by making a hierarchical distinction between the actual features contained in an ABT and a set of domain concepts upon which features are based (see Figure 2.2).
Figure 2.2
The hierarchical relationship between an analytics solution, domain concepts, and descriptive features.
A domain concept is a high-level abstraction that describes some characteristic of the prediction subject from which we derive a set of concrete features that will be included in an ABT. If we keep in mind that the ultimate goal of an analytics solution is to build a predictive model that predicts a target feature from a set of descriptive features, domain concepts are the characteristics of the prediction subject that domain experts and analytics experts believe are likely to be useful in making this prediction. Often, in a collaboration between analytics experts and domain experts, we develop a hierarchy of domain concepts that starts from the analytics solution and proceeds through a small number of levels of abstraction until it reaches concrete descriptive features. Examples of domain concepts include customer value, behavioral change, product usage mix, and customer lifecycle stage.
These are abstract concepts that are understood to be likely important factors in making predictions. At this stage we do not worry too much about exactly how a domain concept will be converted into a concrete feature, but rather try to enumerate the different areas from which features will arise.
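One way to record such a hierarchy while it is being developed is as a nested structure whose leaves are lists of candidate features. The sketch below is only an illustration: the concept names come from the examples above, while the subconcepts, the feature names, and the `walk` helper are hypothetical.

```python
# Hypothetical hierarchy: analytics solution -> domain concepts
# -> (optional) subconcepts -> candidate descriptive features.
hierarchy = {
    "Analytics Solution": {
        "Customer Value": ["avg_monthly_spend"],
        "Behavioral Change": {
            "Usage Trend": ["calls_this_month_vs_last"],
            "Package Change": ["changed_package_recently"],
        },
        "Customer Lifecycle Stage": [],  # features not yet decided
    }
}


def walk(node, path=()):
    """Yield (path, features) for every leaf list in the hierarchy."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from walk(child, path + (name,))
    else:
        yield path, node


leaves = list(walk(hierarchy))
# Concepts that still need concrete features to be designed.
undecided = [path for path, features in leaves if not features]

for path, features in leaves:
    print(" > ".join(path), "->", features)
```

Leaving a concept with an empty feature list reflects the point made above: at this stage the aim is to enumerate the areas from which features will arise, and the exact features can be filled in later.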
Obviously, the set of domain concepts that are important changes from one analytics solution to another. However, there are a number of general domain concepts that are often useful:
Prediction Subject Details: Descriptive details of any aspect of the prediction subject.
Demographics: Demographic features of users or customers such as age, gender, occupation, and address.
Usage: The frequency and recency with which customers or users have interacted with an organization. The monetary value of a customer’s interactions with a service. The mix of products or services offered by the organization that a customer or user has used.
Changes in Usage: Any changes in the frequency, recency, or monetary value of a customer’s or user’s interactions with an organization (for example, has a cable TV subscriber changed packages in recent months?).
Special Usage: How often a user or customer used services that an organization considers special in some way in the recent past (for example, has a customer called a customer complaints department in the last month?).
Lifecycle Phase: The position of a customer or user in their lifecycle (for example, is a customer a new customer, a loyal customer or a lapsing customer?).
Network Links: Links between an item and other related items (for example, links between different customers or different products, or social network links between customers).
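To make the Usage and Changes in Usage concepts from the list above more concrete, the following sketch derives frequency, recency, and monetary value from an interaction log, along with a simple change-in-frequency measure. The log, the window lengths, and the function names are assumptions made for illustration, not a prescribed feature design.

```python
from datetime import date

# Hypothetical interaction log for one customer: (date, amount spent).
interactions = [
    (date(2024, 1, 5), 20.0),
    (date(2024, 2, 17), 35.5),
    (date(2024, 3, 2), 12.0),
]


def usage_features(interactions, as_of, window_days=90):
    """Frequency, recency, and monetary value within a look-back window."""
    recent = [(d, amt) for d, amt in interactions
              if 0 <= (as_of - d).days < window_days]
    if not recent:
        return {"frequency": 0, "recency_days": None, "monetary_value": 0.0}
    return {
        "frequency": len(recent),
        "recency_days": min((as_of - d).days for d, _ in recent),
        "monetary_value": sum(amt for _, amt in recent),
    }


def change_in_frequency(interactions, as_of, window_days=30):
    """Interactions in the latest window minus those in the window before it."""
    ages = [(as_of - d).days for d, _ in interactions]
    current = sum(1 for a in ages if 0 <= a < window_days)
    previous = sum(1 for a in ages if window_days <= a < 2 * window_days)
    return current - previous


as_of = date(2024, 3, 31)
features = usage_features(interactions, as_of)
change = change_in_frequency(interactions, as_of)
print(features, change)
```

Both functions take an explicit `as_of` date so that features for a historical example are computed only from interactions that occurred on or before that date.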
The actual process for determining domain concepts is essentially one of knowledge elicitation: attempting to extract from domain experts the knowledge about the scenario we are trying to model. Often, this process will take place across multiple meetings, involving the analytics and domain experts, where the set of relevant domain concepts for the analytics solution is developed and refined.
2.3.1 Case Study: Motor Insurance Fraud
At this point in the motor insurance fraud detection project, we have decided to proceed with the proposed claim prediction solution, in which a model will be built that can predict the likelihood that an insurance claim is fraudulent. This system will examine new claims as they arise and flag for further investigation those that look like they might be fraud risks. In this instance the prediction subject is an insurance claim, and so the ABT for this problem will contain details of historical claims described by a set of descriptive features that capture likely indicators of fraud, and a target feature indicating whether a claim was ultimately considered fraudulent. The domain concepts in this instance will be concepts from within the insurance domain that are likely to be important in determining whether a claim is fraudulent. Figure 2.3 shows some domain concepts that are likely to be useful in this case. This set of domain concepts would have been determined through consultation between the analytics practitioner and domain experts. The domain concepts include Claimant History, which covers the previous claims of the claimant (such as the different types of claims they have made in the past and the frequency of past claims); Claimant Links, which captures links between the claimant and any other people involved in the claim (for example, the same people being involved in multiple insurance claims together is often an indicator of fraud); and Claimant Demographics, which covers the demographic details of the claimant (such as age, gender, and occupation). Finally, a domain concept, Fraud Outcome, is included to cover the target feature. It is important that this is included at this stage because target features often need to be derived from multiple raw data sources, and the effort that will be involved in this should not be forgotten.
In Figure 2.3 the domain concepts Claimant History and Claimant Links have both been broken down into a number of domain subconcepts. In the case of Claimant History, the domain subconcept of Claim Types explicitly recognizes the importance of designing descriptive features to capture the different types of claims the claimant has previously been involved in, and the Claim Frequency domain subconcept identifies the need to have descriptive features relating to the frequency with which the claimant has been involved in claims. Similarly, under Claimant Links the Links with Other Claims and Links with Current Claim domain subconcepts highlight the fact that the links to or from this claimant can be broken down into links related to the current claim and links relating to other claims. The expectation is that each domain concept, or domain subconcept, will lead to one or more actual descriptive features derived directly from organizational data sources. Together these descriptive features will make up the ABT.
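As a rough illustration of how subconcepts might lead to concrete descriptive features, the sketch below derives one candidate feature each for Claim Frequency, Claim Types, and Links with Other Claims. The claim records, claim type labels, feature names, and the `claimant_features` helper are hypothetical; the features actually used would be agreed with domain experts.

```python
from collections import Counter

# Hypothetical historical claims made before the current claim.
# "people" lists everyone involved in a claim besides the claimant.
history = [
    {"claim_id": "C1", "claimant": "A", "type": "soft tissue", "people": {"B"}},
    {"claim_id": "C2", "claimant": "A", "type": "windscreen", "people": set()},
    {"claim_id": "C3", "claimant": "D", "type": "soft tissue", "people": {"B", "E"}},
]
current = {"claim_id": "C4", "claimant": "A", "type": "soft tissue", "people": {"B"}}


def claimant_features(current, history):
    """Derive candidate features for the claimant of the current claim."""
    own = [c for c in history if c["claimant"] == current["claimant"]]
    type_counts = Counter(c["type"] for c in own)

    # Other claimants' claims that share at least one person with this claim.
    involved = current["people"] | {current["claimant"]}
    linked = [
        c for c in history
        if c["claimant"] != current["claimant"]
        and involved & (c["people"] | {c["claimant"]})
    ]
    return {
        "num_previous_claims": len(own),                                # Claim Frequency
        "num_previous_claims_same_type": type_counts[current["type"]],  # Claim Types
        "num_linked_other_claims": len(linked),                         # Links with Other Claims
    }


row = claimant_features(current, history)
print(row)
```

Each returned value would become one column in the row of the ABT for claim C4, alongside features derived from the other domain concepts and the Fraud Outcome target.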