Statistical Analysis of Big Data Sets
Seemant Ujjain
Statistics and Informatics
Department of Mathematics
Indian Institute of Technology (IIT), Kharagpur
Seemant.ujjain@gmail.com
Project guide: Dr. Jitendra Kumar
Assistant Professor
Institute of Development and Research in Banking Technology (IDRBT)
Road No. 1, Castle Hills, Masab Tank, Hyderabad – 500 057
http://www.idrbt.ac.in/
CONTENTS
Certificate
Declaration
Abstract
1. Introduction
2. Statistical analysis and big data
3. Methodology
4. Numerical Illustration
5. Empirical Example
6. Conclusion
CERTIFICATE
This is to certify that the project report titled Statistical Analysis of Big Data,
submitted by Seemant Ujjain of Integrated M.Sc. 5th year, Dept. of Mathematics,
IIT Kharagpur, is a record of bona fide work carried out by him under my
guidance during the period 8th May 2012 to 6th July 2012 at the Institute of
Development and Research in Banking Technology, Hyderabad.
The project work is a research study, which has been successfully completed as
per the set objectives.
Dr. Jitendra Kumar
Assistant Professor
IDRBT, Hyderabad
DECLARATION
I declare that the summer internship project report titled Statistical Analysis of
Big Data is my own work, conducted under the supervision of Dr. Jitendra
Kumar at the Institute of Development and Research in Banking Technology,
Hyderabad. I have put in 60 days of attendance with my supervisor at IDRBT
and have been awarded a project fellowship. I further declare that, to the best of
my knowledge, the report does not contain any part of any work which has been
submitted for the award of any degree, either in this institute or any other
institute, without proper citation.
Seemant Ujjain
Int. M.Sc. 5th year
Dept. of Mathematics
IIT Kharagpur
ABSTRACT
Big data is generated by multiple known and unknown sources, cannot be held in conventional storage tools, and is continuously growing. These characteristics create hurdles for statisticians and data scientists, because standard analysis requires a fixed data set, and computing tools must likewise cope with the huge volume. The present project deals with the statistical analysis of data that is continuously increasing with some velocity. We manage the velocity using the concept of realization of a time series: the accelerating nature of the data is controlled by realizing the parametric values of successive partitions and fitting a suitable model to them. Modelling the realized parameters gives a better estimate and takes less time than working directly with the voluminous data. A simulation study is carried out, and the same approach is applied to the analysis of daily ATM withdrawals.
INTRODUCTION
Big data often represents multiple, non-random samples of an unknown population whose composition shifts within the short term. Big data is an outgrowth of today's digital environment, which generates data flowing continuously from all directions at unprecedented speed and volume, and which almost always requires cleansing. Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time (http://en.wikipedia.org/wiki/Big_data). Difficulties include capture, storage, search, sharing, analysis, and visualization. The term was coined by IT professionals and market researchers, who argue for massive analytic and storage power, such as terabyte- or petabyte-scale solutions. This study targets the analysis of such accelerating data sets. The vast amount of information stored by most multinational companies and government agencies falls into the category of big data, so any technique that assists in extracting valuable information from such a large chunk of data can become a valuable tool.
STATISTICAL ANALYSIS AND BIG DATA
This project is an attempt to handle such a situation when the data are numeric. The first assumption is that the whole data cannot be stored at one time, so several data sets are considered or generated, where the storage capacity of the system is assumed to be equivalent to the size of each such data set. All of these data sets together are considered the whole "big data". The second assumption concerns the nature of the data: it is numeric and random, and its distribution is taken to be unknown. The project also measures basic statistical properties of the big data.
The approach is justified first under controlled parameter values, simulated under multiple conditions, and then extended to a real data set. Daily withdrawals from several ATMs of a bank serve as the sample data. Following the simulation setup, the individual ATM data sets are analysed one by one and compiled, so that the whole data and the partitioned data can be compared, and basic statistical measures such as mean, median, variance, and mode are obtained for the big data.
METHODOLOGY
The present study aims to analyse big data that is continuously adding information to itself. For this we partition the data into different groups, which may be defined with respect to time, size, etc. Each partition is then a realization of a huge source of information with respect to some specific basis. We justify the approach first by simulation and then apply it to study the ATM withdrawals of a bank.
Case 1: Simulated Data
In the simulation study we generated 100,000 observations, partitioned into 100 groups whose parameters follow an autoregressive model with zero mean. The 100 data sets were generated in an accelerated manner, each consisting of 1,000 observations, under the assumption that the capacity of the computing tool allows us to analyse at most 1,000 observations at a time. A suitable statistical model, here linear regression, is fitted over the first data set; its parameters are stored, the next data set is analysed, and so on. After all 100 data sets, a suitable stochastic model is fitted over the stored parameters. This stochastic model enables forecasting of future parameter values, which helps in speculating on the future trend of the data. Basic statistics such as mean, median, variance, and mode are also calculated for the whole data set. A flowchart is given below.
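The chunk-and-store step described above can be sketched as follows. This is a minimal illustration in Python with synthetic data, not the report's own code: each chunk of at most 1,000 observations is reduced to just the two regression parameters (alpha, beta), so only a short parameter series ever needs to be held in memory.

```python
import numpy as np

CHUNK = 1000  # assumed storage capacity of the computing tool

def fit_chunk(x, y):
    """Ordinary least squares for y = alpha + beta * x on one chunk."""
    beta, alpha = np.polyfit(x, y, 1)  # returns [slope, intercept]
    return alpha, beta

def parameter_series(chunks):
    """Reduce a stream of (x, y) chunks to a short series of (alpha, beta)."""
    return [fit_chunk(x, y) for x, y in chunks]

# Toy demonstration with two synthetic chunks of known parameters:
rng = np.random.default_rng(0)
chunks = []
for true_alpha, true_beta in [(1.0, 2.0), (1.5, 2.5)]:
    x = rng.normal(10, 2.5, CHUNK)
    y = true_alpha + true_beta * x + rng.normal(0, 0.1, CHUNK)
    chunks.append((x, y))

params = parameter_series(chunks)  # two (alpha, beta) pairs, one per chunk
```

The recovered pairs track the true chunk parameters closely, which is what makes the stored parameter series a usable stand-in for the full data.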
Case 2: ATM Data
We applied the technique to real data from randomly chosen ATMs. The whole data set is treated as big data. As it cannot be realized at one time, we divide it into smaller data sets, each of size 1,000. A suitable statistical model is fitted to each such data set and the parameters are stored. A suitable stochastic model is then fitted over the stored parameters. This stochastic model helps in forecasting as well as in analysing the data within the storage capacity of the computing tool. Basic statistics such as mean, median, variance, and mode are also calculated for the whole data set. A flowchart is given below.
NUMERICAL ILLUSTRATION
Simulation: The data are generated under controlled parametric values. Initial values of alpha and beta are provided by the user and used to generate the first data set, with alpha acting as the intercept and beta as the slope. The data set so obtained is fitted with a suitable regression model and the parameters are stored. The next pair of alpha and beta is generated using the recursions below, and every subsequent pair is generated in the same fashion; the course of alpha and beta is shown in the diagram below. Each data set is then obtained using the corresponding pair of alpha and beta as parametric values, modelled with a suitable regression model, and its parameters are stored. Over these stored parameters we fit both a stochastic model and a regression model; the regression parameters are reported as betahat1 and betahat2, and the stochastic model is given as well. We also computed a few basic statistical measures of the whole data, e.g. mean, variance, median, and mode. The median and mode are computed by partitioning the data range into classes and measuring the class frequencies: the smaller the bin size, the greater the precision of the estimated median and mode, but also the greater the computational cost.
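The class-frequency estimates of median and mode mentioned above can be sketched as follows. The function names and toy data are illustrative, not from the report; the point is that one pass over the stream produces a histogram, from which the modal class and the median class are read off at the stated bin-size precision.

```python
from collections import Counter

def binned_counts(stream, bin_size):
    """One pass over the data: count observations falling in each class."""
    counts = Counter()
    n = 0
    for x in stream:
        counts[int(x // bin_size)] += 1
        n += 1
    return counts, n

def estimate_mode(counts, bin_size):
    """Midpoint of the most frequent class, with its frequency."""
    k, freq = max(counts.items(), key=lambda kv: kv[1])
    return (k + 0.5) * bin_size, freq

def estimate_median(counts, bin_size, n):
    """Midpoint of the class containing the (n/2)-th ordered observation."""
    cum = 0
    for k in sorted(counts):
        cum += counts[k]
        if cum >= n / 2:
            return (k + 0.5) * bin_size

data = [1.2, 2.4, 2.6, 2.9, 3.1, 7.8]   # toy stream
counts, n = binned_counts(data, bin_size=1)
```

Halving the bin size roughly doubles the precision of both estimates, at the cost of a proportionally larger frequency table, which is the trade-off noted above.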
Algorithm:
Generation of 100 sets (i refers to the set number):
  alpha(1,i) = k1 * alpha(1,i-1) + norm1(μ1, σ1)
  beta(1,i)  = k2 * beta(1,i-1) + norm2(μ2, σ2)
Within every i-th set, 1000 numbers are generated using:
  Y1(i,j) = norm3(μ3, σ3)
  Y2(i,j) = alpha(1,i) + beta(1,i) * Y1(i,j) + norm4(μ4, σ4)
Constants used:
  alpha(1,1) = 1.32;   norm1(μ1, σ1) = normal(0, 2.5)
  beta(1,1) = 0.213;   norm2(μ2, σ2) = normal(0, 1.5)
  k1 = 0.912;          norm3(μ3, σ3) = normal(10, 2.5)
  k2 = 0.9631;         norm4(μ4, σ4) = normal(0, 2.5)
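The recursions above can be reproduced as a short simulation. The sketch below is not the report's code, and it assumes the second argument of each normal(·,·) is a standard deviation:

```python
import numpy as np

rng = np.random.default_rng(42)
K1, K2 = 0.912, 0.9631

# Parameter paths: alpha and beta each follow an AR(1) recursion.
alpha, beta = [1.32], [0.213]
for i in range(1, 100):
    alpha.append(K1 * alpha[-1] + rng.normal(0, 2.5))
    beta.append(K2 * beta[-1] + rng.normal(0, 1.5))

# One data set of 1000 (Y1, Y2) points per (alpha, beta) pair.
datasets = []
for a, b in zip(alpha, beta):
    y1 = rng.normal(10, 2.5, 1000)               # regressor Y1
    y2 = a + b * y1 + rng.normal(0, 2.5, 1000)   # response Y2
    datasets.append((y1, y2))
```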
For a bin size of 10, the course of alpha and beta is shown below.
[Figure: trajectories of alpha (left) and beta (right) across the 100 data sets]
The parametric values of the models are:
• Betahat1 = [67.2147, 0.6763]
  stats1 = [0.5350, 113.9188, 344.7294] (in order: the R² statistic, the F statistic, and an estimate of the error variance)
• Betahat2 = [76.0739, 2.3588]
  stats2 = [0.8599, 607.7700, 785.9460]
• Stochastic model: A(q)y(t) = C(q)e(t)
• On the alphas: A(q) = 1 - 1.004 q^-1, C(q) = 1 + 0.3029 q^-1
• On the betas: A(q) = 1 - 1.008 q^-1, C(q) = 1 + 0.5046 q^-1
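The report fits full ARMA(1,1) models of the form A(q)y(t) = C(q)e(t) over the stored parameters. As a minimal illustration of the forecasting idea only (an assumed simplification, not the report's method), the sketch below estimates just the AR(1) coefficient by least squares and uses it for a one-step-ahead forecast:

```python
import numpy as np

def fit_ar1(series):
    """Least-squares estimate of a in y(t) = a * y(t-1) + e(t),
    i.e. A(q) = 1 - a q^-1 with the MA part ignored."""
    y = np.asarray(series, dtype=float)
    prev, curr = y[:-1], y[1:]
    return (prev @ curr) / (prev @ prev)

def forecast(series, a, steps=1):
    """Iterate the fitted recursion forward from the last observation."""
    last = series[-1]
    for _ in range(steps):
        last = a * last
    return last

# Toy check: recover a known coefficient from a simulated parameter series.
rng = np.random.default_rng(1)
true_a, y = 0.95, [5.0]
for _ in range(500):
    y.append(true_a * y[-1] + rng.normal(0, 0.1))
a_hat = fit_ar1(y)
```

The estimated coefficient lands close to the true 0.95, and forecasts of the parameter series then stand in for forecasts of the (unstorable) raw data.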
With the help of the stochastic models on the alphas and betas, the whole data can be analysed. The basic statistics for the simulated data are shown below.
Bin size | Real median  | Median (est.) | Mean        | Mean (est.) | Variance (est.) | Mode | Mode frequency
10       | 2.1545e+003  | 2.1583e+003   | 2.0887e+003 | 2.0887e+003 | 8.2857e+005     | 2250 | 49.2000
1        | 2.2274e+003  | 2.2269e+003   | 2.1602e+003 | 2.1602e+003 | 7.3996e+005     | 2579 | 72
5        | 2.1614e+003  | 2.1630e+003   | 2.0963e+003 | 2.0963e+003 | 7.4514e+005     | 2315 | 54.4000
EMPIRICAL EXAMPLE
Real Data:
Method 1: 30,000 observations were divided randomly into 30 sets, each of size 1,000. A suitable regression model was fitted over each such set, and the parameters (alphas and betas) so obtained were stored. A suitable stochastic model was then fitted over the alphas and betas; the model is shown below. The basic statistics for the whole data were also obtained following the same procedure as in the simulation section.
The course of alpha and beta is shown below.
[Figure: trajectories of alpha (left) and beta (right) across the 30 data sets]
The parametric values of the models are:
• Stochastic model: A(q)y(t) = C(q)e(t)
• On the alphas: A(q) = 1 + 0.9952 q^-1, C(q) = 1 + q^-1
• On the betas: A(q) = 1 - 0.9647 q^-1, C(q) = 1 - 0.3449 q^-1
The basic statistics for the real data (method 1) are shown below. The real median, mean, mean (est.), and variance (est.) do not depend on the bin size, so they are the same in every row.

Bin size | Real median | Median (est.) | Mean     | Mean (est.) | Variance (est.) | Mode | Mode frequency
10       | 197.9500    | 205.6430      | 248.5890 | 248.5890    | 5.7465e+004     | 140  | 90.2000
1        | 197.9500    | 198.6753      | 248.5890 | 248.5890    | 5.7465e+004     | 119  | 110
5        | 197.9500    | 201.7952      | 248.5890 | 248.5890    | 5.7465e+004     | 125  | 94
Method 2: The data were divided ATM-wise into a total of 19 sets of different sizes. The same technique as above was applied; the only difference is that the data sets are of unequal size.
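One point worth making explicit for unequal chunk sizes: the whole-data mean and variance can be recovered exactly from per-chunk summaries by weighting each chunk by its size, so ATM-wise sets of different lengths pose no difficulty for these measures. A sketch (an illustration, not part of the report's code):

```python
def summarize(chunk):
    """Per-chunk summary: (size, mean, population variance)."""
    n = len(chunk)
    m = sum(chunk) / n
    v = sum((x - m) ** 2 for x in chunk) / n
    return n, m, v

def combine(summaries):
    """Merge (n_i, mean_i, var_i) summaries into whole-data statistics.

    Uses the law of total variance: overall variance is the size-weighted
    within-chunk variance plus the size-weighted spread of chunk means."""
    n_total = sum(n for n, _, _ in summaries)
    mean = sum(n * m for n, m, _ in summaries) / n_total
    var = sum(n * (v + (m - mean) ** 2) for n, m, v in summaries) / n_total
    return n_total, mean, var

# Toy check with two chunks of different sizes:
a, b = [1.0, 2.0, 3.0], [10.0, 14.0]
n, mean, var = combine([summarize(a), summarize(b)])
```

The combined values match the mean and population variance computed on the concatenated data, so only the tiny summaries ever need to be retained.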
The course of alpha and beta is shown below.
[Figure: trajectories of alpha (left, scale ×10^4) and beta (right) across the 19 data sets]
The parametric values of the models are:
• Stochastic model: A(q)y(t) = C(q)e(t)
• On the alphas: A(q) = 1 - 1.062 q^-1, C(q) = 1 - q^-1 (loss function 1.119e+009, FPE 1.3823e+009)
• On the betas: A(q) = 1 - 0.986 q^-1, C(q) = 1 - 0.8946 q^-1 (loss function 1.57289e+006, FPE 1.92242e+006)
The basic statistics for the real data (method 2) are shown below. As before, the real median, mean, mean (est.), and variance (est.) do not depend on the bin size.

Bin size | Real median | Median (est.) | Mean        | Mean (est.) | Variance (est.) | Mode | Mode frequency
5000     | 197600      | 2.0351e+005   | 2.4479e+005 | 2.4480e+005 | 5.6053e+010     | 1    | 1036
1000     | 197600      | 2.0008e+005   | 2.4479e+005 | 2.4480e+005 | 5.6053e+010     | 1    | 933
10^4     | 197600      | 2.0722e+005   | 2.4479e+005 | 2.4480e+005 | 5.6053e+010     | 1    | 1120
CONCLUSION
As the size of the big is countably infinite and this nature can be realize by the analysis of data till we have sufficient information about the parameters. The parameters are realized by the proper modelling of the parameters. We have taken a sample program in this study, it can be extended for the study of the larger data size as managed by the available computing tools. Main advantage of this study is to obtain the parameters of the big data with certain level of confidence which can be analyzed by simple computing machines. The technique is
successful in modelling the real data and computation of its few important basic statistic measures. Simulation part is not fully completed and study is still in progress.