HP NonStop SQL/MX
Data Mining Guide
Abstract
This manual presents a nine-step knowledge-discovery process, which was developed over a series of data mining investigations. This manual describes the data structures and operations of the NonStop™ SQL/MX approach and implementation.
Product Version
NonStop SQL/MX Release 2.0
Supported Release Version Updates (RVUs)
This publication supports G06.23 and all subsequent G-series releases until otherwise indicated by its replacement publication.
Part Number Published 523737-001 April 2004
Document History
Part Number Product Version Published 424397-001 NonStop SQL/MX Release 1.0 February 2001 523737-001 NonStop SQL/MX Release 2.0 April 2004
HP NonStop SQL/MX Data
Mining Guide
Index Figures Tables
What’s New in This Manual iii Manual Information iii
New and Changed Information iii About This Manual v
Audience v Organization v
Related Documentation vi Notation Conventions viii
1. Introduction
The Traditional Approach 1-1 The SQL/MX Approach 1-2
Data-Intensive Computations Performed in the DBMS 1-2 Use of Built-In DBMS Data Structures and Operations 1-2 The Knowledge Discovery Process 1-3
Defining the Business Opportunity 1-4 Preparing the Data 1-7
Creating the Mining View 1-10 Mining the Data 1-10
Knowledge Deployment and Monitoring 1-11
2. Preparing the Data
Loading the Data 2-2
Creating the Database 2-2
Importing Data Into the Database 2-2 Profiling the Data 2-2
Cardinalities and Metrics 2-3 Transposition 2-3
Quick Profiling 2-5 Defining Events 2-6
Aligning the Data 2-6 Deriving Attributes 2-9
Contents 2. Preparing the Data (continued)
2. Preparing the Data (continued)
Moving Metrics 2-9 Rankings 2-10
3. Creating the Data Mining View
Creating the Single Table 3-2 Pivoting the Data 3-3
4. Mining the Data
Building the Model 4-2
Building Decision Trees 4-2 Checking the Model 4-9
Applying the Model to the Mining Table 4-10 Applying the Model to the Database 4-10 Deploying the Model 4-10
Monitoring Model Performance 4-11
A. Creating the Data Mining Database
B. Inserting Into the Data Mining Database
C. Importing Into the Data Mining Database
Importing Customers Data C-1 Customers Format File C-1 Customers Data File C-1
Importing Account History Data C-2 Account History Format File C-3 Account History Data File C-3
Index
Figures
Figure 4-1. Initial Branches of Decision Tree 4-4 Figure 4-2. Decision Tree for Divorced Branch 4-5 Figure 4-3. Decision Tree for Single Branch 4-6 Figure 4-4. Final Decision Tree 4-9
Tables
What’s New in This Manual
Manual Information
HP NonStop SQL/MX Data Mining Guide
Abstract
This manual presents a nine-step knowledge-discovery process, which was developed over a series of data mining investigations. This manual describes the data structures and operations of the NonStop™ SQL/MX approach and implementation.
Product Version
NonStop SQL/MX Release 2.0
Supported Release Version Updates (RVUs)
This publication supports G06.23 and all subsequent G-series releases until otherwise indicated by its replacement publication.
Document History
New and Changed Information
This publication has been updated to reflect new product names:
•
Since product names are changing over time, this publication might contain both HP and Compaq product names.•
Product names in graphic representations are consistent with the current product interface.•
The technical content of this guide has been updated and reflects the state of the product at the G06.23 RVU.•
Previous versions of the guide used the Object Relational Data Mining (ORDM) approach and architecture. ORDM advocates performing data mining and other parts of the knowledge discovery process against data in the SQL/MX data base. This technique has been updated. Readers are encouraged to perform the dataPart Number Published 523737-001 April 2004
Part Number Product Version Published
424397-001 NonStop SQL/MX Release 1.0 February 2001 523737-001 NonStop SQL/MX Release 2.0 April 2004
What’s New in This Manual New and Changed Information
preparation steps in SQL/MX but reserve the mining or model building for UNIX or Microsoft Windows platforms.
•
All sections of the manual have been updated to reflect the impact of major changes of SQL/MX Release 2.0 (for example, the introduction of SQL/MX tables).•
Introductions to the data preparation steps have been revised and rewritten.•
The DDL statements in Appendix A, B, and C have been updated to use SQL/MXDDL syntax.
•
Appendix A syntax has been removed. Readers can consult the SQL/MXReference Manual for the most current syntax and examples.
About This Manual
This manual presents a nine-step knowledge discovery process, which was developed over a series of data mining investigations. This manual describes the data structures and operations of the NonStop SQL/MX approach and implementation.
Audience
This manual is intended for database administrators and application programmers who are using NonStop SQL/MX to solve data mining problems, either through the SQL conversational interface or through embedded SQL programs.
Organization
The sections listed in Table i describe the knowledge discovery process (or the data mining process) and present examples that carry out the process.
The appendixes listed in Table i provide the syntax for the data mining features of NonStop SQL/MX and the SQL scripts that create the data mining database used in the examples.
Table i. Manual Organization
Section Description
Section 1, Introduction Presents an overview of the knowledge discovery process and the SQL/MX approach to this process. Defines the example business opportunity used in this manual.
Section 2, Preparing the Data Describes the data preparation steps of the knowledge discovery process.
Section 3, Creating the Data Mining View
Describes how to create the mining view. Section 4, Mining the Data Describes the data mining steps of the
knowledge discovery process. Appendix A, Creating the Data
Mining Database
Contains DDL statement scripts that you can use to create the data mining database used in the examples in this manual.
Appendix B, Inserting Into the Data Mining Database
Contains INSERT statement scripts that you can use to populate the data mining
database used in this manual. Appendix C, Importing Into the Data
Mining Database
Contains IMPORT statement scripts that you can use to create the data mining database used in this manual.
About This Manual Related Documentation
Related Documentation
This manual is part of the SQL/MX library of manuals, which includes:
Introductory Guides
SQL/MX Comparison Guide for SQL/MP Users
Describes SQL differences between SQL/MP and SQL/MX.
SQL/MX Quick Start Describes basic techniques for using SQL in the
SQL/MX conversational interface (MXCI). Includes information about installing the sample database. Reference Manuals
SQL/MX Reference Manual Describes the syntax of SQL/MX statements, MXCI
commands, functions, and other SQL/MX language elements.
SQL/MX Connectivity Service Command Reference
Describes the SQL/MX administrative command library (MACL) available with the SQL/MX conversational interface (MXCI).
DataLoader/MX Reference Manual
Describes the features and functions of the DataLoader/MX product, a tool to load SQL/MX databases.
SQL/MX Messages Manual Describes SQL/MX messages.
SQL/MX Glossary Defines SQL/MX terminology.
Programming Manuals
SQL/MX Programming Manual for C and COBOL
Describes how to embed SQL/MX statements in ANSI C and COBOL programs.
SQL/MX Programming Manual for Java
Describes how to embed SQL/MX statements in Java programs according to the SQLJ standard.
SQL/MX Guide to Stored Procedures in Java
Describes how to use stored procedures that are written in Java within SQL/MX.
About This Manual Related Documentation
The following manuals are part of the SQL/MP library of manuals and are essential references for information about SQL/MP Data Definition Language (DDL) and SQL/MP installation and management:
Specialized Guides
SQL/MX Installation and Management Guide
Describes how to plan, install, create, and manage an SQL/MX database. Explains how to use
installation and management commands and utilities.
SQL/MX Query Guide Describes how to understand query execution
plans and write optimal queries for an SQL/MX database.
SQL/MX Data Mining Guide Describes the SQL/MX data structures and
operations to carry out the knowledge-discovery process.
SQL/MX Queuing and Publish/Subscribe Services
Describes how SQL/MX integrates transactional queuing and publish/subscribe services into its database infrastructure.
SQL/MX Report Writer Guide Describes how to produce formatted reports using
data from a NonStop SQL/MX database.
SQL/MX Connectivity Service Manual
Describes how to install and manage the SQL/MX Connectivity Service (MXCS), which enables applications developed for the Microsoft Open Database Connectivity (ODBC) application
programming interface (API) and other connectivity APIs to use SQL/MX.
Online Help
The SQL/MX Online Help consists of:
Reference Help Overview and reference entries from the SQL/MX
Reference Manual.
Messages Help Individual messages grouped by source from the
SQL/MX Messages Manual.
Glossary Help Terms and definitions from the SQL/MX Glossary.
NSM/web Help Context-sensitive help topics that describe how to
use the NSM/web management tool.
Related SQL/MP Manuals
SQL/MP Reference Manual Describes the SQL/MP language elements,
expressions, predicates, functions, and statements.
SQL/MP Installation and Management Guide
Describes how to plan, install, create, and manage an SQL/MP database. Describes installation and management commands and SQL/MP catalogs and files.
About This Manual Notation Conventions
This figure shows the manuals in the SQL/MX library:
Notation Conventions
Hypertext Links
Blue underline is used to indicate a hypertext link within text. By clicking a passage of text with a blue underline, you are taken to the location described. For example:
VST001.vsd Introductory Guides SQL/MX Comparison Guide for SQL/MP Users SQL/MX Quick Start SQL/MX Glossary SQL/MX Messages Manual SQL/MX Reference Manual Reference Manuals SQL/MX Programming Manual for C and COBOL SQL/MX Programming Manual for Java SQL/MX Guide to Stored Procedures in Java Programming Manuals SQL/MX Query Guide SQL/MX Data Mining Guide
Specialized Guides SQL/MX Online Help Data-Loader/MX Reference Manual Messages Help Glossary Help Reference Help SQL/MX Installation and Management Guide SQL/MX Queuing and Publish/ Subscribe Services SQL/MX Report Writer Guide SQL/MX Connectivity Service Manual SQL/MX Connectivity Service Command Reference NSM/web Help
About This Manual General Syntax Notation
This requirement is described under Backup DAM Volumes and Physical Disk Drives on page 3-2.
General Syntax Notation
This list summarizes the notation conventions for syntax presentation in this manual.
UPPERCASE LETTERS. Uppercase letters indicate keywords and reserved words. Type
these items exactly as shown. Items not enclosed in brackets are required. For example:
MAXATTACH
lowercase italic letters. Lowercase italic letters indicate variable items that you supply.
Items not enclosed in brackets are required. For example:
file-name
computer type. Computer type letters within text indicate C and Open System Services (OSS) keywords and reserved words. Type these items exactly as shown. Items not enclosed in brackets are required. For example:
myfile.c
italic computer type. Italic computer type letters within text indicate C and Open System Services (OSS) variable items that you supply. Items not enclosed in brackets are required. For example:
pathname
[ ] Brackets. Brackets enclose optional syntax items. For example:
TERM [\system-name.]$terminal-name INT[ERRUPTS]
A group of items enclosed in brackets is a list from which you can choose one item or none. The items in the list can be arranged either vertically, with aligned brackets on each side of the list, or horizontally, enclosed in a pair of brackets and separated by vertical lines. For example:
FC [ num ] [ -num ] [ text ]
K [ X | D ] address
{ } Braces. A group of items enclosed in braces is a list from which you are required to
About This Manual General Syntax Notation
braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines. For example:
LISTOPENS PROCESS { $appl-mgr-name } { $process-name } ALLOWSU { ON | OFF }
| Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in
brackets or braces. For example:
INSPECT { OFF | ON | SAVEABEND }
… Ellipsis. An ellipsis immediately following a pair of brackets or braces indicates that you
can repeat the enclosed sequence of syntax items any number of times. For example:
M address [ , new-value ]…
[ - ] {0|1|2|3|4|5|6|7|8|9}…
An ellipsis immediately following a single syntax item indicates that you can repeat that syntax item any number of times. For example:
"s-char…"
Punctuation. Parentheses, commas, semicolons, and other symbols not previously
described must be typed as shown. For example:
error := NEXTFILENAME ( file-name ) ; LISTOPENS SU $process-name.#su-name
Quotation marks around a symbol such as a bracket or brace indicate the symbol is a required character that you must type as shown. For example:
"[" repetition-constant-list "]"
Item Spacing. Spaces shown between items are required unless one of the items is a
punctuation symbol such as a parenthesis or a comma. For example:
CALL STEPMOM ( process-id ) ;
If there is no space between two items, spaces are not permitted. In this example, no spaces are permitted between the period and any other items:
$process-name.#su-name
Line Spacing. If the syntax of a command is too long to fit on a single line, each
continuation line is indented three spaces and is separated from the preceding line by a blank line. This spacing distinguishes items in a continuation line from items in a vertical list of selections. For example:
ALTER [ / OUT file-spec / ] LINE
About This Manual Notation for Messages
!i and !o. In procedure calls, the !i notation follows an input parameter (one that passes data
to the called procedure); the !o notation follows an output parameter (one that returns data to the calling program). For example:
CALL CHECKRESIZESEGMENT ( segment-id !i , error ) ; !o
!i,o. In procedure calls, the !i,o notation follows an input/output parameter (one that both
passes data to the called procedure and returns data to the calling program). For example:
error := COMPRESSEDIT ( filenum ) ; !i,o
!i:i. In procedure calls, the !i:i notation follows an input string parameter that has a corresponding parameter specifying the length of the string in bytes. For example:
error := FILENAME_COMPARE_ ( filename1:length !i:i , filename2:length ) ; !i:i
!o:i. In procedure calls, the !o:i notation follows an output buffer parameter that has a
corresponding input parameter specifying the maximum length of the output buffer in bytes. For example:
error := FILE_GETINFO_ ( filenum !i , [ filename:maxlen ] ) ; !o:i
Notation for Messages
This list summarizes the notation conventions for the presentation of displayed messages in this manual.
Bold Text. Bold text in an example indicates user input typed at the terminal. For example:
ENTER RUN CODE ?123
CODE RECEIVED: 123.00
The user must press the Return key after typing the input.
Nonitalic text. Nonitalic letters, numbers, and punctuation indicate text that is displayed or
returned exactly as shown. For example:
Backup Up.
lowercase italic letters. Lowercase italic letters indicate variable items whose values are
displayed or returned. For example:
p-register process-name
About This Manual Notation for Management Programming Interfaces
[ ] Brackets. Brackets enclose items that are sometimes, but not always, displayed. For
example:
Event number = number [ Subject = first-subject-value ]
A group of items enclosed in brackets is a list of all possible items that can be
displayed, of which one or none might actually be displayed. The items in the list can be arranged either vertically, with aligned brackets on each side of the list, or
horizontally, enclosed in a pair of brackets and separated by vertical lines. For example:
proc-name trapped [ in SQL | in SQL file system ]
{ } Braces. A group of items enclosed in braces is a list of all possible items that can be
displayed, of which one is actually displayed. The items in the list can be arranged either vertically, with aligned braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines. For example:
obj-type obj-name state changed to state, caused by { Object | Operator | Service }
process-name State changed from old-objstate to objstate { Operator Request. }
{ Unknown. }
| Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in
brackets or braces. For example:
Transfer status: { OK | Failed }
% Percent Sign. A percent sign precedes a number that is not in decimal notation. The
% notation precedes an octal number. The %B notation precedes a binary number. The %H notation precedes a hexadecimal number. For example:
%005400 %B101111 %H2F
P=%p-register E=%e-register
Notation for Management Programming Interfaces
This list summarizes the notation conventions used in the boxed descriptions of programmatic commands, event messages, and error lists in this manual.
UPPERCASE LETTERS. Uppercase letters indicate names from definition files. Type these
names exactly as shown. For example:
About This Manual Notation for Management Programming Interfaces
lowercase letters. Words in lowercase letters are words that are part of the notation,
including Data Definition Language (DDL) keywords. For example:
token-type
!r. The !r notation following a token or field name indicates that the token or field is required. For example:
ZCOM-TKN-OBJNAME token-type ZSPI-TYP-STRING. !r
!o. The !o notation following a token or field name indicates that the token or field is optional. For example:
1
Introduction
Knowledge discovery is an iterative process involving many query-intensive steps. The challenges of data management in supporting this process efficiently are significant and continue to grow as knowledge discovery becomes more widely used.
Data mining identifies and characterizes interrelationships among multiple variables without requiring a data analyst to formulate specific questions. Software tools look for trends and patterns and flag unusual or potentially interesting ones. Because data mining reveals previously unknown information and patterns, rather than proving or disproving a hypothesis, mining enables knowledge discovery rather than just knowledge verification.
Knowledge discovery is an iterative process involving many query-intensive steps. The challenges of data management in supporting this process efficiently are significant and continue to grow as knowledge discovery becomes more widely used.
This section discusses these approaches to data mining:
•
The Traditional ApproachToday, most data mining is performed in the database by using client tools. This approach is limited because important information might be omitted from the data extract.
•
The SQL/MX ApproachThe SQL/MX approach to knowledge discovery enables you to perform many data intensive tasks in the database itself, rather than on extracts. Examples include statistical sampling, statistical functions, temporal reasoning through sequence functions, cross-table generation, database profiling, and moving-window aggregations.
•
The Knowledge Discovery ProcessIn the SQL/MX approach, fundamental data structures and operations are built into the database management system (DBMS) to support a wide range of knowledge discovery tasks and algorithms. The knowledge discovery process is described as a series of steps that starts with the selection and definition of a business
opportunity, continues through data preparation and modeling, and ends with the deployment of the new knowledge.
The Traditional Approach
Today’s traditional knowledge discovery systems consist of an application program on top of a data source. The main emphasis in these systems is data mining—inventing new techniques and algorithms, proving their statistical soundness, and validating their effectiveness given a suitable problem.
Data should be available in a convenient form, typically a flat file, extracted from an appropriate data source. The knowledge discovery system consists of specific
Introduction The SQL/MX Approach
algorithms that load the entire data set into memory and perform necessary computations.
The extract approach has two major limitations:
•
It does not scale to large data sets because the entire data set is required to fit in memory. Statistical sampling can be used to avoid this limitation. However,sampling is inappropriate in many situations because sampling might cause patterns to be missed, such as those in small groups or those between records.
•
It cannot conveniently manage multiple versions of data across numerousiterations of a typical knowledge discovery investigation. For example, each iteration might require extracting additional data, performing incremental updates, deriving new attributes, and so on.
The SQL/MX Approach
In most enterprise organizations today, database systems are crucial for conducting business. The DBMS systems serve as the transaction processing systems for daily operations and manage data warehouses containing huge amounts of historical information. The validated data in these warehouses is already being used for online analysis and is a natural starting point for knowledge discovery.
The SQL/MX approach identifies fundamental data structures and operations that are common across a wide range of knowledge discovery tasks and builds such structures and operations into the DBMS. The primary advantages of the SQL/MX technology over traditional data mining techniques include:
•
The ability to mine much larger data sets, not only data in flat-file extracts•
Simplified data management•
More complete results•
Better performance and reduced cycle timesThe main features of the SQL/MX approach are summarized next.
Data-Intensive Computations Performed in the DBMS
Tools and applications perform data-intensive data-preparation tasks in the DBMS by using an SQL interface. As a result, you can access the powerful and parallel DBMS data manipulation capabilities in the data preparation stage of the knowledge discovery process.
Use of Built-In DBMS Data Structures and Operations
Fundamental data structures and operations are built into the DBMS to support a wide range of knowledge discovery tasks and algorithms in an efficient and scalable
Introduction The Knowledge Discovery Process
Building these data structures and operations into the DBMS allows mining tasks to be moved into the SQL engine for tighter integration of data and mining operations and for improved performance and scalability.
Adding new primitives, such as moving-window aggregate functions, simplifies queries needed by knowledge discovery tools and applications. This type of query
simplification often results in significant improvements in performance.
The Knowledge Discovery Process
The knowledge discovery process is a nine-step process that starts with the selection and definition of a business opportunity, continues through several data preparation steps and a modeling step, and ends with the deployment of the new knowledge. This subsection describes the first step of that process.
1. Identify and define a business opportunity.
The process begins with the identification and precise specification of a business opportunity.
See Defining the Business Opportunity on page 1-4. 2. Preprocess and load the data for the business opportunity.
Real-world data is often inconsistent and incomplete. The first preparation step is to address these problems by preprocessing the data in various ways—for
example, verifying and mapping the data. Then load the data into your database system.
See Preparing the Data on page 1-7 3. Profile and understand the relevant data.
Generate a variety of statistics such as column unique entry counts, value ranges, number of missing values, mean, variance, and so on.
See Profiling the Data on page 1-7
4. Define events relevant to the business opportunity being explored.
Events are used to align related data in a single set of columns for mining. Example events are life changes, such as getting married or switching jobs, or customer actions, such as opening an account or requesting a credit limit increase. See Defining Events on page 1-8
5. Derive attributes.
For example, customer age can be derived from birth date. Account summary statistics, such as maximum and minimum balances, can be derived from monthly status information.
Introduction Defining the Business Opportunity
6. Create the data mining view.
Transform the data into a mining view, a form in which all attributes about the primary mining entity occur in a single record.
See Creating the Mining View on page 1-10. 7. Mine the data and build models.
Core knowledge discovery techniques are applied to gain insight, learn patterns, or verify hypotheses. The main tasks are either predictive or descriptive in nature. Predictive tasks involve trying to determine what will happen in the future, based upon historical data. Descriptive tasks involve finding patterns describing the data. See Mining the Data on page 1-10.
8. Deploy models.
Deployment can take many different forms. For example, deployment might be as simple as documenting and reporting the results, or deployment might be
embedding the model in an operational system to achieve predictive results. 9. Monitor model performance.
Performance of the model must be monitored for accuracy. When accuracy begins to decline, the model must be updated to fit the current situation.
See Knowledge Deployment and Monitoring on page 1-11.
In Step 1, a business opportunity is identified and defined. In Steps 2 through 6, data mining data is gathered, preprocessed, and organized in a form that is suitable for mining. These steps require the most time in the process. For example, selecting the data is an important step in the process and typically requires the assistance of a data mining expert or subject matter expert who has knowledge of the data to be mined. In Step 7, models are built. In Steps 8 and 9, the models are deployed and monitored. This latter part of the knowledge discovery process focuses on analyzing the data mining view prepared in Steps 2 through 6.
Defining the Business Opportunity
The process begins with the identification and precise specification of a business opportunity. Several factors must be considered when evaluating potential
opportunities:
•
Quantification of the return on investmentWhat is the answer worth? How much money can be saved? How much of a competitive advantage does it offer?
Introduction Defining the Business Opportunity
•
Usability of the resultsMerely identifying patterns is not enough. The opportunity and analysis must be structured so that any interpretation of results obtained develops into deployable business strategy.
•
Political and organizational reactionIn assessing probabilities for organizational resistance, it is helpful to examine similar past efforts and understand why these efforts succeeded or failed.
•
Availability of business analysts and data mining experts and technologyAre data, domain, and mining experts available to participate in the process? Is sufficient technology, both hardware and software, available?
•
Data availabilityDoes preclassified data exist or can it be derived? Do sufficiently large amounts of data exist? Both internal and external data sources should be considered.
•
LogisticsHow difficult is it to collect, extract, and transport the relevant data? Is confidentiality an issue?
Careful consideration of these factors helps to ensure that the opportunity selected is both amenable to data mining and likely to provide significant value.
After an operation is selected, the next task is to specify it precisely. In the scenario of building a model to predict credit card account attrition, the goal is to build a model that will predict, as early as possible, whether a credit card customer will close their
account.
To specify this opportunity precisely, decide on an explicit definition of attrition, such as when a customer calls and closes their account. Another option is implicit—when a customer stops using their card. For simplicity, define attrition as a customer closing their account or maintaining a zero balance for three months.
Another aspect of specifying the opportunity is defining what it means to predict as early as possible when an account will be closed. For this example, choose three months as the prediction window. This window should be long enough to allow the card issuer to take some action to try to retain customers likely to leave, but short enough to capture attrition-related patterns.
The goal is to build a model that will predict, as early as possible, customer attrition.
Example Business Opportunity
The precise specification of our example opportunity is to build a model that will predict at any point in time, based on such things as current account status, account activity, and demographics, whether a credit card customer will close their account in the future. Note that the precise specification of the opportunity might be modified or
Introduction Defining the Business Opportunity
refined later in the knowledge discovery process as more information becomes available.
This manual uses this opportunity scenario to describe the knowledge discovery process and how to implement it. The data set used to illustrate techniques and SQL/MX features consists of two tables: one containing customer information and the other containing account history information. This data set is presented in Appendix A through C of this manual.
A subset of this data set is shown in these tables:
Customers Table
Account History Table
The first table, the Customers table, contains one row for each credit card account and consists of customer demographic information such as marital status, income, and so on. For a large financial institution, a customers table such as this one might contain approximately 10 million rows and 100 columns.
The second table, the Account History table, contains monthly status records, one for each account for each month the account was open over a given time period, and consists of about 200 columns. For this example, suppose the time period is three years. The history table would then contain about 360 million rows, assuming 10 million customers.
Given these parameters, the size of the first table is about 5 GB (10 million rows, 500 bytes in each row), and the size of the second table is about 360 GB (360 million rows, 1000 bytes in each row).
For the example business opportunity, the Status and Balance fields of the Account History table are used to determine if a customer will close their account. If the Status changes from Open to Closed or if the Balance is zero for three consecutive months,
Account Name Marital Status Home Income ...
1234567 Jones, Mary Single Own 65,000
2500000 Abbas, Ali Divorced Rent 32,000
4098124 Kano, Tomoko Divorced Own 44,000
2400000 Lund, Erika Widow Own 28,000
Account Month Status Limit Balance Payment Fin. Chrg ...
1234567 01/03 Open 10,000 1232.50 1232.50 0.00
2500000 07/02 Open 5,000 566.00 32.00 8.00
4098124 10/00 Open 6,000 3200.00 3200.00 0.00
1234567 02/03 Open 10,000 3000.00 3000.00 0.00
Introduction Preparing the Data
then a customer is defined as having left—that is, no longer holds a credit card account.
Preparing the Data
After a business opportunity has been identified and defined, the next task is to prepare a data set for mining. This is done in Steps 2 through 6 of the knowledge discovery process. See The Knowledge Discovery Process on page 1-3.
The first two steps are preprocessing the mining data to make it consistent and then loading the data into a database system. For further information, see Loading the Data on page 2-2.
The next step is to generate a variety of statistics—for example, column unique entry counts, value ranges, number of missing values, mean, variance, and so on. This type of data profile is helpful in gaining an understanding of the data, and this profile also serves as a valuable reference throughout the knowledge discovery process.
Profiling the Data
A profile of the database helps to solve the data mining problem in these ways:
•
To better understand the data•
To decide which columns to use for analysis•
To decide whether to treat attributes as discrete or continuousTypes of Information
The type of information used to create a profile of the data mining view comes from the following elements:
•
Tables in the database•
Table attributes (or columns to be used in the analysis)•
Data types of the table attributes•
Relationships between tables•
Cardinalities of discrete attributes•
Statistics about continuous attributes•
Derived table attributes (or derived columns to be used in the analysis)Determining the derived columns to be constructed requires knowledge of the table attributes and how these attributes relate to the data mining problem. See Preparing the Data on page 1-7 for a full discussion of these elements.
SQL/MX provides the TRANSPOSE clause of the SELECT statement to display the cardinalities of discrete attributes. See Transposition on page 2-3 and the
TRANSPOSE Clause entry in the SQL/MX Reference Manual for details.
Example of Finding Cardinality of Discrete Attributes
The customers table in your data set has Age and Number_Children columns. Both of these attributes are discrete, and you can compute the cardinality of each attribute.
Introduction Preparing the Data
You obtain the cardinality of an attribute, which is the count of the number of unique values for the attribute, by using a COUNT DISTINCT query. For example:
SELECT COUNT(DISTINCT Age) FROM Customers;
or
SELECT COUNT(DISTINCT Number_Children) FROM Customers;
Instead of having to submit a query for each attribute, you can obtain counts for multiple attributes of a table by using the TRANSPOSE clause. For example:
SET NAMETYPE ANSI; SET SCHEMA dmcat.whse;
SELECT ColumnIndex, COUNT(DISTINCT ColumnValue) FROM Customers
TRANSPOSE Age, Number_Children AS ColumnValue KEY BY ColumnIndex GROUP BY ColumnIndex; COLUMNINDEX (EXPR) --- --- 1 17 2 4 --- 2 row(s) selected.
The first row of the result table of the TRANSPOSE clause contains the distinct count for the column Age, and the second row contains the distinct count for the column Number_Children. You can treat the Age values as categories, consisting of age ranges. Similarly, if Number_Children is greater than five, you can put the count into the category for the Number_Children equal to five.
The number of attributes in a TRANSPOSE clause is unlimited.
For further information, see Profiling the Data on page 2-2.
Defining Events
In the scenario considered in this manual, the relevant event is the account holder leaving. This event occurs at different points in time for customers that leave and not at all for customers that stay.
This event must be defined so that account status and activity in the months leading up to a customer leaving can be located and aligned in columns. For example, suppose you create three derived attributes that describe the account balance for each of the
Note. The data types of attributes to be transformed into a single column must be compatible. The data type of the result column is the union compatible data type of the attributes.
Introduction Preparing the Data
three months before a customer leaves, because these attributes are predictors of attrition.
For the customers that do leave, the months leading up to leaving occur at various points in time. For customers that do not leave, these months are chosen to be any three consecutive months in which the account is open.
The information about these months should be aligned for all accounts in a single set of columns, one for each of the three months. Most mining algorithms require a single logical attribute, such as the balance one month before leaving, to be stored in one column in all records, rather than in different columns in different records.
For example, consider this data in a table that contains monthly account balances for each month in the three-year history period:
The balances prior to the event (of the customer leaving) are in different date columns for these accounts, and therefore algorithms that build predictive models are not able to consider this information.
A table organization that allows this information to be considered:
In this table, columns Bal-1 through Bal-3 contain account balances one through three months prior to a customer leaving. Consequently, this information is aligned within a single set of columns and can be considered during model creation.
For further information, see Defining Events on page 2-6.
Deriving Attributes
The next task is to derive attributes that are not relative to events. For example, customer age can be derived from birth date. Part of the challenge of effective data mining is identifying a set of derived attributes that capture key indicators relevant to the business opportunity being explored.
For further information, see Deriving Attributes on page 2-9.
Account ... Bal 08/03 Bal 09/03 Bal 10/03 Bal 11/03 ... Left
1234567 7800.00 3000.00 2870.00 1200.00 Yes (closed)
2500000 0.00 0.00 0.00 0.00 Yes (0 bal)
Account ... Bal 07/02 Bal 08/02 Bal 09/02 Bal 10/02 ... Left
4098124 4817.94 4596.10 4347.63 4069.34 Yes (closed)
Account ... Bal-3 Bal-2 Bal-1 Date Left ... Left
1234567 3000.00 2870.00 1200.00 12/03 Yes (closed)
2500000 0.00 0.00 0.00 11/03 Yes (0 bal)
Introduction Creating the Mining View
Creating the Mining View
The final data preparation step is to transform the data set into a mining view, a form in which all attributes about the main mining entity appear in a single record. The mining entity used in this manual is a credit card account. The data mining challenge is to determine predictors for when a customer will close a credit card account.
Transforming the data set to a single record for each mining entity often involves a pivot operation, in which attributes in multiple rows are collapsed and put into a single row. For example, in the credit card example, the set of history records associated with each account is collapsed to a single record and then appended to the corresponding customer record.
For further information, see Section 3, Creating the Data Mining View. The resulting table looks similar to this:
Mining View
This table contains demographic information from the Customers table, such as marital status and income, and also pivoted columns from the Account History table, such as balances prior to leaving. You use example data set in the data mining step, the next step in the knowledge discovery process.
Mining the Data
In the data mining step, core knowledge discovery techniques are applied to gain insight, learn patterns, or verify hypotheses. The main tasks performed in this step are either predictive or descriptive in nature. Predictive tasks involve trying to determine what will happen in the future, based upon historical data. Descriptive tasks involve finding patterns describing the data.
The task used in this customer scenario is predictive: to build a model to predict attrition of credit card customers based on historical information, such as
demographics and account activity. The most common predictive tasks are:
•
Classification—Classify a case (or record) into one of several predefined classes.•
Regression—Map a case (or record) into a numerical prediction value.Account Mar Status Income Bal-3 Bal-2 Bal-1 Date Left Left
1234567 Single 65,000 3000.00 2870.00 1200.00 12/99 Yes
2500000 Divorced 32,000 0.00 0.00 0.00 11/99 Yes
4098124 Divorced 44,000 4817.94 4596.10 4347.63 10/98 Yes
Introduction Knowledge Deployment and Monitoring
Descriptive tasks involve finding patterns describing the data. The most common are:
•
Database segmentation (clustering)—Map a case into one of several clusters.•
Summarization—Provide a compact description of the data, often in visual form.•
Link analysis—Determine relationships between attributes in a case.•
Sequence analysis—Determine trends over time.You use a variety of algorithms, and the models they produce, to perform these predictive and descriptive tasks.
For example, classification can be done by building a decision tree model, where each branch of the tree is represented by a predicate involving attributes in the mining data set and where each branch is homogeneous with respect to whether the predicate is true or false. The main task in classification is to determine which predicates form the decision tree that predicts the goal. The most common algorithms for classification come from the field of machine learning in computer science.
Typically, the model building step involves the use of client-mining tools that require the interactive participation of the user to guide the investigation. A description of these special-purpose tools is beyond the scope of this manual.
For further information, see Section 4, Mining the Data.
Knowledge Deployment and Monitoring
The last two steps of the knowledge discovery process involve deploying and monitoring discovered knowledge. Deployment can take many different forms. For example, deployment might be as simple as documenting and reporting the results, or deployment might be embedding the model in an operational system to achieve predictive results.
Most data mining tools support model deployment either by applying a model to data within the tool or by exporting a model as executable code, which can then be
embedded and used in applications. In the credit card attrition example, one form of model deployment is to periodically use the model to identify profitable customers that are likely to leave, and then to take some action, such as lowering interest rates or waiving fees, to try to retain these customers.
2
Preparing the Data
Section 1, Introduction identifies and defines a business opportunity, the first step in the knowledge discovery process supported by SQL/MX. This section describes Steps 2 through 5.
1. Identify and define a business opportunity.
2. Preprocess and load the data for the business opportunity.
The first preparation step is to address these problems by preprocessing the data in various ways—for example, verifying and mapping the data. Then load the data into your database system.
See Loading the Data on page 2-2.
3. Profile and understand the relevant data.
Generate a variety of statistics, such as column unique entry counts, value ranges, number of missing values, mean, variance, and so on.
See Profiling the Data on page 2-2.
4. Define events relevant to the business opportunity being explored.
Events are used to align related data in a single set of columns for mining. Example events are life changes, such as getting married or switching jobs, or customer actions, such as opening an account or requesting a credit limit increase. See Defining Events on page 2-6.
5. Derive attributes.
For example, customer age can be derived from birth date. Account summary statistics, such as maximum and minimum balances, can be derived from monthly status information.
See Deriving Attributes on page 2-9. 6. Create the data mining view.
7. Mine the data and build models. 8. Deploy models.
Preparing the Data Loading the Data
Loading the Data
The first step in preparing a data set for mining is loading the data into database tables. Suppose the credit card organization has a customers data warehouse. The customer data and the account history data are stored in this warehouse. In a typical real-world scenario, the warehouse could have millions of records representing millions of customers dating back many years.
Creating the Database
Suppose a data mining database is created consisting of the Customers table and the Account History table described in the previous section.
You can use the DDL scripts included with this manual to create a database to run the examples in this manual. To create the database:
1. Open the .pdf file for this manual.
2. Navigate to Appendix A, Creating the Data Mining Database of this manual, which contains the DDL script that creates the database.
3. On the tool bar, select the Table/Formatted Text Select Tool.
4. Copy and paste from the DDL script, one page at a time, into an OSS text file. 5. Within MXCI (the SQL/MX conversational interface), obey the OSS file you have
created.
Importing Data Into the Database
After the data mining database is created, the warehouse data is imported into the database. In a typical real-world scenario, you would import the data by using some type of database utility—for example, you can use the DataLoader/MP utility to import a large quantity of data into an SQL/MP database. For further information, see the
DataLoader/MX Reference Manual and the SQL/MX Reference Manual for discussions
of the Import Utility.
Alternatively, you can also use INSERT statements to insert values into the data mining database. The INSERT statements for the example in this manual are included in Appendix B, Inserting Into the Data Mining Database.
Profiling the Data
Profiling often begins with the computation of basic information about each attribute. For discrete attributes, this basic information is typically a table of the unique values and a count of how many times each value occurs. However, as cardinality increases, these frequencies become less and less meaningful. For continuous attributes, the approach is to use metrics such as minimum, maximum, mean, and variance.
Preparing the Data Cardinalities and Metrics
Cardinalities and Metrics
For any attribute, one approach to profiling is to run a separate query for each attribute. As an example, consider the following queries, which profile the discrete attribute Marital Status from the Customers table and the continuous attribute Balance from the Account History table.
Example of Discrete Attribute
This query finds the number of discrete values of the Marital Status column of the Customers table:
SELECT marital_status, COUNT(*) FROM customers
GROUP BY marital_status;
Example of Continuous Attribute
This query computes statistical information about the continuous attribute Balance in the Account History table:
SELECT MIN(balance), MAX(balance), AVG(balance), VARIANCE(balance) FROM acct_history;
Transposition
Other than the computation of a few metrics, both of the previous queries require a complete scan of the data. In this way, a table with N attributes requires N queries, resulting in the same number of complete scans. For a wide mining table, this procedure can result in thousands of queries and scans of the data.
Using transposition, SQL/MX can perform the above profiling operations by using a total of only two queries, regardless of the number of attributes to be profiled. Through the TRANSPOSE clause of the SELECT statement, different columns of a source table can be treated as a single output column, enabling similar computations to be
performed on all such source columns.
TRANSPOSE takes each row in the source table and converts each expression listed in the transpose set to an individual output row. Used in this way, TRANSPOSE can compute frequency counts for all discrete attributes in a table in a single query. See the TRANSPOSE Clause entry in the SQL/MX Reference Manual for more information.
Example of Computing Counts for Character Discrete Attributes
This query computes the frequency counts for the discrete attributes Gender, Marital Status, and Home, which are all type character:
SET NAMETYPE ANSI;
Preparing the Data Transposition
SELECT attr, c1, COUNT(*) FROM customers TRANSPOSE ('GENDER', gender),
('HOME', home), ('MARITAL_STATUS', marital_status) AS (attr, c1) GROUP BY attr, c1 ORDER BY attr, c1; ATTR C1 (EXPR) --- --- ---GENDER F 20 GENDER M 22 HOME Own 33 HOME Rent 9 MARITAL_STATUS Divorced 12 MARITAL_STATUS Married 9 MARITAL_STATUS Single 15 MARITAL_STATUS Widow 6 --- 8 row(s) selected.
Because this query produces counts for three different attributes, use the ATTR column to distinguish from which attribute the values are drawn. The C1 column contains the values for these character attributes.
Example of Computing Counts for Character and Numeric Discrete Attributes
This query also shows the transpose clause and illustrates how profiling can be
achieved. The column C2 has been added to the statement because Number_Children has numeric data type.
SELECT attr, c1, c2, COUNT(*) FROM customers TRANSPOSE ('GENDER', gender, null),
('HOME', home, null),
('MARITAL_STATUS', marital_status, null), ('NUMBER_CHILDREN', null, number_children) AS (attr, c1, c2) GROUP BY attr, c1, c2 ORDER BY attr, c1, c2; ATTR C1 C2 (EXPR) --- --- --- ---GENDER F ? 20 GENDER M ? 22 HOME Own ? 33 HOME Rent ? 9 MARITAL_STATUS Divorced ? 12 MARITAL_STATUS Married ? 9 MARITAL_STATUS Single ? 15 MARITAL_STATUS Widow ? 6 NUMBER_CHILDREN ? 0 25 NUMBER_CHILDREN ? 1 4
Preparing the Data Quick Profiling
NUMBER_CHILDREN ? 2 10 NUMBER_CHILDREN ? 3 3 --- 12 row(s) selected.
Because this query produces counts for four different attributes, use the ATTR column to distinguish from which attribute the values are drawn. The C1 column contains the values for the character attributes, and the C2 column contains the values for the numeric attribute.
Example of Computing Statistics for Continuous Attributes
Similarly, a single query using TRANSPOSE can compute the necessary statistics for all continuous attributes. The next query computes the minimum, maximum, mean, and variance for the continuous attributes Customer Credit Limit and Balance, which are both numeric:
SELECT attr, MIN(c1), MAX(c1), AVG(c1), VARIANCE(c1) FROM acct_history
TRANSPOSE (1,cust_limit), (2,balance) AS (attr, c1) GROUP BY attr
ORDER BY attr;
Sample results are:
Sample results are:
By using TRANSPOSE to compute attribute profiles, you gain performance and
scalability advantages. Performance is improved because the data set is scanned only once. In addition, the number of queries is reduced to two: one for discrete attributes and one for continuous attributes. Scalability is enhanced because the amount of data accessed grows linearly with the number of attributes actually profiled.
Quick Profiling
The profiling step is highly iterative, because many different data sources are inspected and evaluated for possible analysis. Getting a quick impression of an
attribute before proceeding to a more detailed profile is often necessary. For example, by quickly estimating cardinality, you can determine whether to treat a column as discrete or continuous. You can make a determination accurately without a scan of every single data element.
Use the SQL/MX sampling feature to:
ATTR MIN(C1) MAX(C1) AVG(C1) VARIANCE(C1)
1 5000.00 40000.00 18225.81 2.01E+008
2 .00 32000.00 2539.12 1.46E+007
ATTR MIN(C1) MAX(C1) AVG(C1) VARIANCE(C1)
1 5000.00 40000.00 20139.86 2.35E+008
Preparing the Data Defining Events
•
Randomly sample source data•
Improve computing efficiency for a profile using a selected sampling percentage•
Reduce both the I/O costs and the CPU costs associated with computing a profile See the SAMPLE Clause of SELECT in the SQL/MX Reference Manual.Defining Events
Events are used to align related data in a single set of columns for mining. Example events are life changes, such as getting married or switching jobs, or customer actions, such as opening an account or requesting a credit limit increase.
The critical event to be defined for the business opportunity described in this manual is the month the customer left—either by closing their account or by maintaining a zero balance for three months. The problem is to align the data so that this event can be derived as an attribute of the mining view.
Aligning the Data
Most mining algorithms and tools require that the input data be arranged so that all the information pertaining to a given entity is contained in a single record. However, in typical raw mining data, observations about a given entity can be stored in separate rows and tables.
For example, the Account History table contains one record per customer per month, summarizing the account status for that customer. The related Customers table contains static information in the form of one row per customer. For this example, the account status information must be reduced to a single row of information for each customer. This data is paired with the static customer information to form the mining view.
Two methods exist for mapping time-dependent data in the mining view. One method is to take a value from a particular month and include that value in the mining view. For example, the checking account balance for January 1998 can be included in the mining view for each customer because the balance is a single value.
Alternatively, a value can be aggregated over a time period to compute a single value for the mining view. An example is the average checking account balance for January 1998 through June 1998.
Absolute and relative methods exist for aligning time-dependent data in the mining view. Specifying an event relative to a customer is often more meaningful than to specify an absolute event, such as a given year and month.
The account balance one month prior to closing an account or the average account balance for six months prior to closing an account are both examples of relative
events. In this type of relative time specification, the actual months selected depend on an event that is different for each customer. Aligning the data by using relative events
Preparing the Data Aligning the Data
is crucial for building models to predict events that occur at different times for each customer.
Example of Aligning Data
This statement creates an SQL/MX table named Close_Temp that contains the
account number, the month the account is considered closed (if not closed, an arbitrary month), and an indicator of whether or not the customer left:
SET SCHEMA mining.whse; CREATE TABLE Close_Temp
( account NUMERIC (7) UNSIGNED NO DEFAULT
NOT NULL
HEADING 'Account Number' ,close_month DATE
NO DEFAULT NOT NULL
HEADING 'Close Month' ,cust_left CHAR(1)
NO DEFAULT ,PRIMARY KEY (account) );
In this query, the source data for the column named Close_Month is defined to be the month the customer left—either by closing their account or by maintaining a zero balance for three months. If the customer did not leave, the month is arbitrarily defined to be a month in the middle of their account history.
INSERT INTO close_temp (SELECT p.account, CASE
WHEN p.close_month2 IS NOT NULL THEN p.close_month2 WHEN p.close_month1 IS NOT NULL THEN p.close_month1
ELSE p.open_month + ((DATE '1999-12-01' - p.open_month)/2) - INTERVAL '16' DAY
END, CASE
WHEN p.close_month2 IS NOT NULL THEN 'Y' WHEN p.close_month1 IS NOT NULL THEN 'Y' ELSE 'N'
END FROM
(SELECT t.account, MAX(t.close_month1),
MAX(t.close_month2), MIN(t.year_month) FROM
(SELECT m.account ,m.year_month, CASE WHEN m.status = 'Closed'
AND OFFSET(m.status,1) = 'Open' AND account = OFFSET(account,1) THEN m.year_month
END,
CASE WHEN ROWS SINCE INCLUSIVE(balance <> 0) = 3.0 AND account = OFFSET(account,2)
Preparing the Data Aligning the Data
THEN m.year_month END
FROM acct_history m
SEQUENCE BY m.account, m.year_month)
t (account, year_month, close_month1, close_month2) GROUP BY t.account)
p (account, close_month1, close_month2, open_month));
The derived attribute Close_Month1 contains the month when a customer explicitly closed their account (the Account Status is marked Closed). The first CASE expression in the inner query uses the OFFSET sequence function to determine the month when an account is closed explicitly.
The derived attribute Close_Month2 contains the month when a customer implicitly closed their account (maintained a zero balance for three months). The second CASE expression in the inner query uses the OFFSET sequence function and the ROWS SINCE INCLUSIVE sequence function to determine the month when an account has a zero balance for three months.
The derived attribute Open_Month is the month when the account was opened. In the CASE expression of the outer query, this month is adjusted to be the month in the middle of the account history. The account history interval is defined to start with the first month the account is open up to the date 1999-12-01.
The derived attribute Close_Month in the Close_Temp table is set to either
Close_Month1 (when a customer explicitly closed their account), Close_Month2 (when a customer maintained a zero balance for three months), or the month in the middle of the Account History interval (when an account is open).
The derived attribute Cust_Left is set to Y if a customer has a zero balance for three months or if the Account Status is marked Closed.
In queries that use sequence functions, note the use of the SEQUENCE BY clause. See SEQUENCE BY Clause and Sequence Functions in the SQL/MX Reference
Manual for more information.
Here are the contents of the Close_Temp table after the preceding row insertion:
Account Number Close Month Cust_Left
1000000 1999-03-01 N 1234567 1999-12-01 Y 2300000 1999-11-01 Y 2400000 1998-12-01 Y 2500000 1999-10-01 Y 2900000 1999-06-01 N 3200000 1999-05-01 Y 3900000 1998-10-01 Y 4098124 1998-10-01 Y
Preparing the Data Deriving Attributes
Deriving Attributes
In the preceding Example of Aligning Data on page 2-7, the derived attributes in the Close_Temp table are Close_Month and Cust_Left. These attributes are critical for the task of building a model that will predict at any point in time, based on such things as current account status, account activity, and customer demographics, whether a credit card customer will leave three months in the future.
To produce good models, the source mining data typically needs to be supplemented with appropriate derived attributes. Typical derived attributes include computing ratios between key quantities, mapping postal codes to average demographics, computing metrics, and computing rankings, percentiles, or quartiles.
Moving Metrics
Moving metrics measure a dynamic behavior in terms of rates of events or trends for a state of condition. In the data mining environment, moving metrics are good predictors for many modeling tasks involving historical or time series data. For example, the moving average of an account balance produces attributes that could be included in the mining view for each customer.
SQL/MX supports a number of sequence functions that you can use to simplify queries and to execute queries more efficiently.
Example Using MOVINGAVG and ROWS SINCE
This query uses the sequence functions MOVINGAVG and ROWS SINCE:
SELECT account, year_month, MOVINGAVG (balance,
ROWS SINCE INCLUSIVE (account <> OFFSET (account,1)) +1, RUNNINGCOUNT(*))
FROM acct_history
SEQUENCE BY account, year_month; ACCOUNT YEAR_MONTH (EXPR)
--- --- 1000000 1998-07-01 3678.67 1000000 1998-08-01 5229.33 1000000 1998-09-01 4253.15 1000000 1998-10-01 5189.86 1000000 1998-11-01 5221.06 4300000 1999-06-01 N 4400000 1999-07-01 Y 4500000 1998-09-01 Y 4600000 1999-12-01 Y 4700000 1999-06-01 N
Preparing the Data Rankings
1000000 1998-12-01 5134.22 1000000 1999-01-01 4572.19 ... ... ... --- 186 row(s) selected.
In this query, the ROWS SINCE INCLUSIVE sequence function is used to limit the moving average window to records for the current customer. The third argument of MOVINGAVG is RUNNINGCOUNT(*), which ensures MOVINGAVG does not include rows before the beginning row.
In practice, similar queries can be used to compute several metrics at the same time, and the results, which conceptually are new columns in the Account History table, can be realized in an auxiliary table. This auxiliary table can then be referenced when computing the mining view.
By using sequence functions, you eliminate the dependency on the number and location of moving averages computed for each customer. Even if customers have different numbers of history records, sequence functions allow the computation of a metric for each customer.
Rankings
Simple rankings provide good predictors for many modeling tasks. An example is the rank of a customer’s average account balance relative to all other customers. This query computes the absolute rank of the average account balance for each customer:
SELECT cid, RUNNINGCOUNT(*), avg_bal FROM
(SELECT account, AVG(balance) FROM acct_history
GROUP BY account) AS t(cid, avg_bal) SEQUENCE BY avg_bal DESC;
CID (EXPR) AVG_BAL
--- --- 4300000 1 6203.33 2900000 2 5184.04 4098124 3 4920.02 2300000 4 4610.28 1000000 5 4067.44 1234567 6 2807.20 ... ... ... --- 14 row(s) selected.
In practice, the results of this type of query are realized in an auxiliary table that can be thought of as an extension to the Customers table. Percentiles and quartiles can also be computed easily with similar queries.
See the Sequence Fnctions entry in the SQL/MX Reference Manual for more information.
3
Creating the Data Mining View
Because data mining often involves executing a series of similar queries before getting satisfying results, it can be helpful to use materialized results of previous queries when answering a new one. Creating a data mining view allows you to access intentionally gathered and permanently stored results of a data mining query.
Creating a data mining view is Step 6 of the knowledge discovery process. 1. Identify and define a business opportunity.
2. Preprocess and load the data for the business opportunity. 3. Profile and understand the relevant data.
4. Define events relevant to the business opportunity being explored. 5. Derive attributes.
6. Create the data mining view.
Transform the data into a mining view, a form in which all attributes about the primary mining entity occur in a single record. This transformation involves:
•
Creating the Single Table•
Pivoting the Data7. Mine the data and build models. 8. Deploy models.
Creating the Data Mining View Creating the Single Table
Creating the Single Table
After computing derived attributes and storing these attributes in auxiliary tables, you create the mining view by combining all the information into a single table with one row for each entity. Continuing with the credit card example, the mining view contains the information in the Customers table along with the auxiliary customer data. In addition, information in the Account History and related tables is also used.
Typically, after the mining view is computed and inserted into a single database table, the data is extracted and loaded into a mining tool for the model building step. The mining data would be extracted via ODBC/MX or the Genus Mining Integrator for NonStop SQL.
Example of Creating the View
The derived attributes consisting of the three balances for the three months prior to a customer leaving are specified in the following SQL/MX CREATE TABLE statement. This view aligns the data around the month of a particular event—account attrition.
SET SCHEMA mining.whse; CREATE TABLE mineview
( account NUMERIC (7) UNSIGNED NO DEFAULT
NOT NULL
HEADING 'Account Number' ,marital_status CHARACTER (8)
DEFAULT NULL
HEADING 'Marital Status' ,home CHARACTER (4)
DEFAULT NULL HEADING 'Home'
,income NUMERIC (8, 2) UNSIGNED DEFAULT NULL HEADING 'Income' ,gender CHAR(1) DEFAULT NULL ,age NUMERIC (3) DEFAULT NULL HEADING 'Age' ,number_children NUMERIC (2) DEFAULT NULL
HEADING 'Number of Children' ,year_month DATE NO DEFAULT NOT NULL ,close_month DATE NO DEFAULT NOT NULL ,balance_close_1 NUMERIC (9,2) NO DEFAULT NOT NULL ,balance_close_2 NUMERIC (9,2) NO DEFAULT
Creating the Data Mining View Pivoting the Data NOT NULL ,balance_close_3 NUMERIC (9,2) NO DEFAULT NOT NULL ,cust_left CHAR(1) NO DEFAULT ,PRIMARY KEY (account) );
Pivoting the Data
All the data in the Customer, Account History, and auxiliary tables must be collapsed to a single row for each customer. Collapsing the data is accomplished by pivoting the data. Data is purged from separate rows for each customer into different columns of a single customer row. For example, the balance one month prior to account closure can be placed in column BALANCE_CLOSE_1, the balance two months prior to account closure in column BALANCE_CLOSE_2, and the balance three months prior to account closure in column BALANCE_CLOSE_3.
To accomplish this pivoting operation, use the OFFSET sequence function to collect data from various months and place the results in a single row.
Example Using OFFSET Sequence Function
This query populates the mining view:
INSERT INTO miningview (SELECT t.account ,c.marital_status ,c.home ,c.income ,c.gender ,c.age ,c.number_children ,t.year_month ,t.close_month ,t.balance_close_1 ,t.balance_close_2 ,t.balance_close_3 ,t.cust_left FROM (SELECT account ,year_month ,close_month
,CASE WHEN year_month = close_month THEN balance END AS balance_close_1
,CASE WHEN year_month = close_month
AND account = OFFSET(account,1) THEN OFFSET(balance, 1)
END AS balance_close_2
,CASE WHEN year_month = close_month
AND account = OFFSET(account,2) THEN OFFSET(balance,2)
Creating the Data Mining View Pivoting the Data
END AS balance_close_3 ,cust_left
FROM acct_history a NATURAL JOIN close_temp m
SEQUENCE BY account, year_month) AS t, customers c WHERE t.balance_close_1 IS NOT NULL AND
t.balance_close_2 IS NOT NULL AND t.balance_close_3 IS NOT NULL AND c.account = t.account);
Sequence functions are used in the preceding query to create a derived table with the various balances for each customer. This derived table has one row per customer that consists of a single copy of the relevant data.
Here are the contents of the Miningview table after the preceding row insertion:
This table continues the mining view table.
Account Number
Marital Status
Home Income Gender Age Number Children ... 1000000 Married Own 175500.00 M 45 3 1234567 Single Own 65000.00 F 34 0 2300000 Divorced Own 137000.00 M 42 2 2400000 Widow Own 28000.00 F 65 0 2500000 Divorced Rent 32000.00 M 23 0 2900000 Divorced Rent 136000.00 F 50 0 3200000