Kerie Wenzlick, Morteza Marzjarani
Saginaw Valley State University and Saginaw Valley State University
7165 New Lothrop Road, New Lothrop, Michigan, 48460, USA 7400 Bay Road, University Center, Michigan, 48710, USA
Keywords: Algorithms ; Tools ; Probability ; Clustering ; Regression Abstract:
Data Mining is a common area to both statisticians and computer scientists. In this research work, we present some user-friendly data-mining tools. Among the data mining tools, we will analyze methods such as probability, regression, decision trees, clustering, etc. A comparison of some of these tools, in terms of features of interest such as speed, space requirement, and performance will also be presented.
Introduction:
The data mining software program that we created follows along with a statistical based Data Mining course at Saginaw Valley State University. The software will show students( who are majoring in areas other than statistics)several different aspects of Data Mining, starting with simple tasks such as
finding the mean, median, mode, and then moving along to more in depth tasks such as linear regression and multiple regression. The software also includes an example on clustering, as well as an example to show the students how the statistical program S.A.S.
works. This paper will go through each of the topics that are covered within our software program, explain their theories, and then elaborate on how the theories were converted into Visual Basic code.
Creating the Anova Table
We read in data from a table to create the Anova Table and then those numbers help us to decide whether to accept or reject the hypothesis. For example, if we have a table with 8 levels of diet, with 3 replications per level, it looks like this:
Diet
1 2 3 4 5 6 7 8
1 10 16 27 23 16 16 25 28
2 18 19 17 18 25 16 23 24
3 8 16 16 28 22 19 30 20
Mean 12 17 20 23 21 17 26 24
Total Mean = 20
The Anova Table looks as shown below:
Source of Variability
SSQ DF Mean Square F Calc
Between Columns SSBc C – 1 MSBc =
SSbc/ (C – 1)
MSBc – MSW Within
Columns
SSW (R-1) * C MSW = SSW / ((R
– 1) * C
Variable Definition
R The number of replications per level. In our case, we have three replications per level, so R = 3.
C The number of levels. We have 8 levels of diet in our table, so C = 8.
SSBc This is the Sum of Squares between columns. Basically, for each column, take the mean, and subtract the total mean of the table. Square each of those answers, and then add all the individual answers up at the end. To finish, multiply that answer by R, and you will have SSBc.
SSW This is the sum of squares within columns. This is rather difficult to compute by itself, so we can use a formula and basic algebra to compute it easier. Use the formula: TSS (Total Sum of Squares) = SSBc (Sum of Squares Between Columns) + SSW (Sum of Squares Within Columns). We already know SSBc, so we are just going to calculate TSS and then use Basic Algebra to get SSW.
To calculate TSS, take each observation (in our case, 24 different
observations), subtract the total mean, and square each answer. Then, total each of the individual answers to get TSS. Then, SSW = TSS - SSBc.
MSBc SSBc / (C-1)
MSW SSW / ((R-1) * C)
Fcalc MSBc / MSW. This is the F Distribution with ( (C-1), (R-1)*C).
Fcalc is the most important factor. If Fcalc > F Distribution, we then can reject the hypothesis. In this case, the hypothesis is that the level of diet does not have an impact on weight gain. So, first of all, we need to find the F Distribution. Using the formula, we need to find F (7,16). So, using the F Distribution table at a 5% level of significance, we get the number 2.66. This is less than the Fcalc number of 3.37, so we can reject the hypothesis and conclude that the level of diet does have an impact on weight gain.
Mean, Median, Mode, Variance, Standard Deviation
Depending on the number of observations, the mean and variance are calculated differently. Below are the formulas used to calculate mean, median, mode, variance, and standard deviation depending on the number of observations.
Small Number of Observations (Rule: Less than 10) Mean Sum of all observations / Number of observations Median (Observations + 1) / 2
Mode Observation that has the most frequency
Variance 1 / (n – 1) * Sum of all (observation – mean) ^ 2 Standard Deviation (Variance) ^ (½)
Large Number of Observations (Rule: Between 10 and 100) Categorize Data
Observation Frequency 13
2 5
Mean (Sum of all ( observation * frequency )) / Sum of all Frequencies Median (Observations + 1) / 2
Mode Observation that has the most frequency
Variance 1 / (n – 1) * Sum of all ( Frequency * (observation – mean) ^ 2) Standard Deviation (Variance) ^ (½)
Classify Data Classes Frequency 1 to 5 4
6 to 10 7
Mean (1 / observations) * Sum of all (Frequency * Mean of Class) Median (Observations + 1) / 2
Mode Observation that has the most frequency
Variance 1 / (n – 1) * Sum of all (Frequency * (Mean of Class – mean) ^ 2 Standard Deviation (Variance) ^ (½)
Regression
Regression is used to measure the degree of association between variables and/or to predict the value of one variable from the knowledge of the values of another variable. We explore the statistical relationship, which is true only on the average.
For example, let’s say that we have sales over the last 8 periods:
116 109 117 112
122 113 108 115
We want to know what to predict for the next sales period. We can use linear regression to do this. So, we need to consider the linear statistical relationship between the 2 variables X and Y, both of which are given and then find the best fitting line to the data.
We call this the least squares line and it is represented by the equation Yc = a + bX.
Here is our data again:
Y X X*Y X^2
116 2 232 4
109 1 109 1
117 3 351 9
112 1 112 1
122 4 488 16
113 2 226 4
108 1 108 1
105 2 230 4
114 2 1856 40
In the equation for the Least Squares Line, we need to figure out values for A and B.
B = Sum of all (X*Y) – (Number of Observations * Mean of X * Mean of Y)
Sum of all (X^2) – Number of Observations * (Mean of X) ^ 2
A = Mean of Y – B * Mean of X
So, with our data, B is equal to 4, A is equal to 106, and the Least Squares Line is: Yc = 106 + 4x. Now we can draw the line and make an educated guess on
what the sales for the next period might be.
Next, we can go and find the confidence intervals for A, B, and Y, so we can say with 95% confidence that we know the values for each of the variables are between the values we find.
To find the confidence intervals, we first need to find the values of a few more quantitiess:
Variable Definition
SSE Sum of All ((Y-Yc)^2)
Syx (Standard Error of Estimate) (SSE/(n-2)) ^ ( ½ ) Sb (Estimate of how different a value of b is
likely to be from B
Syx * (1 / ((Sum of all (Xj)^2 – Observations * (Mean of X) ^2) ^ ( ½ )
Sa (Estimate of how different a value of a is likely to be from A)
Syx * (((1/observations) + (Mean of X)^2/ Sum of all (Xj – Mean of X)^2 ) ^ ( ½ )
Confidence Interval for B b + t 1- theta * Sb Confidence Interval for A
a + t 1- theta * Sa
T 1- theta is a value that can be looked up in the T- Distribution table, with only knowing the degrees of freedom. The degrees of freedom, in our case, are the number of observations – 2. Then, just look up the value in the table, and insert that value into the equation. Once the equations are complete, then we know with a 95% confidence that the value for the next period will be between the set of numbers that our confidence interval comes out with.
Multiple Regressions
We use multiple regressions when there is more than one independent variable. For example:
Y = Job Performance Y = Sales
X1 = Score on Exam 1 X1 = Advertising
X2 = Score on Exam 2 X2 = Number of Sales People
X3 = Score on Exam 3 X3 = Number of Competitors
Therefore, we use equations rather than (x,y) points.
So, our Least Squares Line is carried out to include the number of independent variables that there are.
We have 5 X values, so we account for each of those X values in our equation. The basic theory is close to linear regression, however. We are still trying to accomplish the same purpose of predicting the next value.
Implementing SAS
command to come to a result. This is a very good option to use when checking answers to make sure that your own code is working correctly, or when something is complicated to code. Then, you can just run the data through SAS and have an answer within minutes.
Clustering
if we have data that looks like the following:
X1i X1i^2 X2i X2i^2 X1i*X2i
85 7225 85 7225 7225
80 6400 90 8100 7200
72 5184 95 9025 6840
69 4761 70 4900 4830
75 5625 70 4900 5250
381 29195 410 34150 31345
X1i represents temperature, and X2i represents humidity. The mean of the temperature is 76.2, while the mean of the humidity is 82. Next, we need to find the values of a few variables, and then insert them into a matrix.
F 1 ^2 = (number of observations * Sum of all (X1i) ^ 2 – (Sum of All (X1i)) ^ 2) / 20 = 40.7
F 2 ^2 = (number of observations * Sum of all (X2i) ^ 2 – (Sum of All (X2i)) ^ 2) / 20 = 132.5
F12 = (number of observations * Sum of all (X1i * X2i) – (Sum of All (X1i)) * (Sum of All (X2i)) / 20 = 25.75
Our matrix then looks like this:
F 1 ^2 F12
F12 F 2 ^ 2
We then find the determinant. D = 40.7(132.5) – (25.75)(25.75) = 4729.69
And this inverse of our matrix is:
.028 -.0059
-.0059 .0086
The observation that we want to look at is temperature = 76, and humidity = 72. We want to find the probability that it will be sunny out with those observations.
We need to find X – M = [-2 -10]
Now, we multiply this new matrix by the inverse of the original matrix:
[-.2 -10] *[ .028 -.0059]
-.0059 .0086
=[ .0534 -.0842]
We then multiply the answer by the original observation – mean
[.0534 -.0842] * [ -.2 10]’
= .83752
We are finally down to one number, which we can take and enter into the equation to find out the probability of whether it will be sunny or not with a temperature of 76 degrees and a humidity of 72 percent.
F(x) = ½*3.14 * 1/D * e ^ (- final number/
2) = .001522
F = f(x) * (5/14) = .000544 chance that it will be sunny.
Conclusion:
The topics discussed above are the main points that we are including in our software program to demonstrate data mining theory to future students.
All of the individual programs are linked together with a simple, easy to use switchboard. All the student has to do is click through and run the various programs to see the basic theory of data mining.
A sample run of the menu generated by the program is followed. The code is too long to be included here.
The entire code is available from the second author upon request.
References:
Ronald P. Cody, Jeffery K. Smith(1991): Applied Statistics and the SAS Programming Language, Printice-Hall.
Robert D. Mason, Douglas A. Lind(1990):Statistical Techniques in Business and Economics, Irwin.
SAS User Guides, Basic(1982): SAS Institute.