
UDC 004.67

K.K. Nurlybayeva1, G.T. Balakayeva2

(1 al-Farabi Kazakh National University, Almaty, Kazakhstan, Kalamkas.nurlybayeva@gmail.com;
2 al-Farabi Kazakh National University, Almaty, Kazakhstan)

BIG DATA PROCESSING FOR DECISION MAKING

Abstract. Nowadays there is a growing problem of mining large amounts of data. This article is devoted to describing the methods and techniques aimed at solving these problems. Several Data Mining algorithms are described in the paper. The article examines the latest developments in data analysis, as well as the benefits of analyzing large volumes of data for businesses. It also describes proposals for optimizing data processing systems and integrating them into a single infrastructure for more rapid and "smart" business decision making.

Key words: Big data, Data Mining, regression, classification, association, OLAP.

Nowadays, there is a growing problem concerning the increase of data volumes.

The concept of big data means that the volume of the data exceeds the capacity of conventional information systems. Additional methods and technologies for processing the data are needed when its volume grows beyond terabytes or petabytes. It is clear that the algorithms which are suitable for small amounts of data are not appropriate for handling big data; they are not fast and efficient enough for it.

A lot of information is collected in the data warehouses of the world's enterprises and companies [1]. The volume of information continues to increase each year, and a number of problems remain open:

• Storage of data requires certain financial costs for equipment, maintenance, backups, etc.
• Data processing is becoming more complex and consumes more and more resources.

Nevertheless, big data analysis can be very beneficial for most of the interested parties.

Big data is the source of interest for analysts who make decisions relying upon historical data [2]. They build necessary reports to be able to get the required information for analyzing and decision making. In addition, there are companies that are interested in obtaining benefits from the information stored in systems that process big data.

This article examines the latest developments in data analysis, as well as the benefits of analyzing large volumes of data for businesses. It also describes proposals for optimizing data processing systems and integrating them into a single infrastructure for more rapid and "smart" business decision making.

The increase in information volumes is a consequence of improvements in data recording and service technologies in a variety of fields. The activity of almost every enterprise, whether medical, commercial, industrial, or scientific, is accompanied by the registration of client information. The question is what this information is needed for. A raw flow of information is of no interest without appropriate processing and analysis. Many analytical tools are available nowadays; however, several years ago there was no capability to handle such amounts of data, or it was very expensive [3].

New and evolving analytical processing technologies now make possible what was not possible before. Examples include:

• New systems that are able to process a wide variety of unstructured data.
• Improved analytical capabilities, including predictive and text analytics.
• Operational business intelligence that improves business flexibility by enabling automated real-time actions and intraday decision making.
• Cloud computing services.

The system for processing big data should combine these technologies to enable new solutions that can bring significant benefits to the business [4]. In addition, to handle big data the system should provide a wide range of new analytical technologies and business possibilities.

Examples include technologies such as:

• Design of predictive models
• Fraud detection
• Risk analysis
• On-line analytical processing, etc.

As world experience shows, the storage system and business process management should be reorganized according to the needs of the company. For example, in some cases there is no need for raw data. Therefore, the data which is saved in the database should be preprocessed and transformed. This particular measure will optimize storage space and cost. Data Mining technologies should be implemented in this case.
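As a minimal illustration of such preprocessing, the sketch below (in Python; the function name and the record layout are purely illustrative, not part of any system described in this paper) aggregates raw transaction records into compact per-customer summaries, so that only the summaries need to be stored:

```python
from collections import defaultdict

def aggregate_transactions(records):
    """Aggregate raw (customer, amount) records into per-customer summaries,
    so only the compact summaries need to be stored."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for customer, amount in records:
        totals[customer]["count"] += 1
        totals[customer]["amount"] += amount
    return dict(totals)

raw = [("c1", 10.0), ("c2", 5.0), ("c1", 7.5)]
summary = aggregate_transactions(raw)
# summary["c1"] -> {"count": 2, "amount": 17.5}
```

Three raw rows collapse into two stored summaries here; on terabyte-scale data the same idea can reduce storage by orders of magnitude.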

Mathematical statistics was formerly used as the primary tool for data analysis. However, in connection with the problems associated with big data processing, statistical methods turned out to be insufficient for analysis. Statistical methods are useful mainly for checking hypotheses (verification-driven data mining) and for "rough" exploratory analysis, which is the foundation of online analytical processing (OLAP).

There is a wide range of uses for Data Mining technology. It is used everywhere where data is present. The main areas where it is very important to use Data Mining are marketing, credit scoring, fraud detection, any type of forecasting, etc. [5]. There are six methods of Data Mining that should be mentioned:

• Association
• Sequence
• Classification
• Clustering
• Forecasting
• Regression

Association takes place when several occasions are related to each other. Data Mining technologies allow determining the patterns of associative rules, which can then be used to form the knowledge base of decision making systems.
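As a sketch of how an associative rule can be evaluated (in Python; the basket data are invented for illustration), the two standard measures of a rule A -> B, support and confidence, can be computed as follows:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = support(A union B) / support(A)."""
    joint = set(antecedent) | set(consequent)
    return support(transactions, joint) / support(transactions, antecedent)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
# Rule {bread} -> {milk}: support = 2/4 = 0.5, confidence = 0.5 / 0.75 = 2/3
```

Rules whose support and confidence exceed chosen thresholds are the ones worth adding to a knowledge base; algorithms such as Apriori search the space of itemsets efficiently instead of enumerating them all.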

Sequence appears in the case of a chain of occasions related in time.

Classification aims to solve the problem of assigning a separate occasion to one of the existing classes by determining its class number.
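One of the simplest classification schemes, given here only as an illustrative sketch (the training data are invented), assigns a new occasion the class of its nearest already-labeled example (the 1-nearest-neighbor rule):

```python
def nearest_neighbor_classify(labeled_points, x):
    """Assign x the class of its nearest labeled example (1-NN rule)."""
    def dist2(a, b):
        # squared Euclidean distance between feature vectors
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(labeled_points, key=lambda p: dist2(p[0], x))
    return label

training = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((5.0, 5.0), "B")]
nearest_neighbor_classify(training, (4.5, 4.8))  # -> "B"
```

Real classifiers (decision trees, neural networks, etc.) are more elaborate, but all solve the same problem: mapping a new occasion to one of the known classes.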

Clustering is used to find a finite number of clusters or classes which divide the set of occasions into particular non-intersecting subsets.
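A minimal clustering sketch, assuming Lloyd's k-means algorithm with a deterministic farthest-point initialization (the sample points are invented for illustration):

```python
def kmeans(points, k, iters=20):
    """Partition points into k non-intersecting clusters (Lloyd's algorithm)."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Deterministic init: start from the first point, then repeatedly
    # add the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            clusters[min(range(k), key=lambda j: d2(p, centers[j]))].append(p)
        for j, c in enumerate(clusters):  # move each center to its cluster mean
            if c:
                centers[j] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = kmeans(points, 2)  # two non-intersecting groups of three points each
```

The resulting subsets are non-intersecting by construction: every point is assigned to exactly one nearest center.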

Forecasting is used in every field, for example in order to estimate the future benefits from a new product.

Regression analysis is a statistical process for estimating the relationships among variables [6]. In probability theory and mathematical statistics, it is the dependence of the average value of a random variable on some other variable, or even on several. In contrast to the purely functional dependence y = f(x), where each value of the independent variable x corresponds to a unique value of the dependent variable y, regression dependence implies that each value of the variable x may correspond to different values of y, due to the random nature of the dependence. If to some value x_i there corresponds a set of values {y_i1, y_i2, ..., y_in}, then the dependence of the arithmetic mean

y̅_i = (y_i1 + y_i2 + ... + y_in) / n

on x_i is a regression in the statistical sense.

The study of regression in probability theory is based on the fact that random variables X and Y with a joint probability distribution are connected by a probabilistic dependence: for every fixed value X = x, the value of Y is a random variable with a certain conditional probability distribution depending on the value of x. The regression of Y on X is determined by the conditional expectation of Y, calculated under the condition that X = x: E(Y|x) = u(x). The equation y = u(x) is called the regression equation.

Regression lines have the following remarkable property: among all real functions f(x), the expectation E[(Y - f(X))^2] is minimal for the function f(x) = u(x). This means that the regression of Y on X provides, in this sense, the best representation of Y by the value of X. This property allows regression to be used for predicting the value of Y from X. In other words, if the value of Y is not observed directly and the experiment allows only X to be recorded, then the value u(X) can be used as the predicted value of Y. The simplest case is when the regression dependence of Y on X is linear, for example E(Y|x) = b0 + b1*x, where b0 and b1 are the regression coefficients. In practice, the regression coefficients in the equation y = u(x) are unknown, and they are estimated from the observed data.
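In the linear case E(Y|x) = b0 + b1*x, the coefficients (call them b0 and b1) can be estimated from observations by ordinary least squares. A minimal sketch (in Python; the function name and the sample data are illustrative):

```python
def linear_regression(xs, ys):
    """Estimate b0, b1 in E(Y|x) = b0 + b1*x by ordinary least squares."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope = covariance(x, y) / variance(x); intercept from the means
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                    # exactly y = 1 + 2x
b0, b1 = linear_regression(xs, ys)   # -> (1.0, 2.0)
```

With noisy data the fitted line no longer passes through every point, but by the property above it remains the best predictor of Y from X in the mean-square sense.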


Figure 1. Regression line

Regression is widely used in analytical techniques to solve various business problems, such as forecasting (sales, exchange rates and equity), evaluation of various business indicators for the observed values of other indicators (scoring), identifying relationships between indicators, etc.

Differences of Data Mining from other methods of data analysis

The difference between Data Mining and traditional methods of data analysis (statistical methods) and OLAP (OnLine Analytical Processing) is that the latter mainly focus on the verification of pre-formulated hypotheses (verification-driven data mining) and on "rough" exploratory analysis, which underpins online analytical processing, while one of the main principles of Data Mining is the search for non-obvious relationships. Data Mining tools can find these patterns on their own and can also build their own hypotheses about relationships. Since the formulation of hypotheses about relationships is the most difficult task, the advantage of Data Mining over other methods of analysis is obvious.

Most statistical methods for identifying relationships in data use the concept of averaging over the sample, which leads to operations on non-existent values, whereas Data Mining operates on real values.

OLAP is more suitable for a retrospective understanding of historical data, while Data Mining relies on historical data to answer questions about the future.


Figure 2. Main stages of data processing

Prospects of Data Mining technology

The potential of Data Mining provides a tremendous opportunity for expanding the frontiers of the technology. The development of Data Mining concerns the following areas:

• selection of the types of subject areas that will facilitate the formalization of the solution of the relevant Data Mining tasks relating to these areas;

• establishment of formal languages and logical means by which reasoning will be formalized and automated, in order to solve Data Mining problems in specific subject areas;

• development of Data Mining methods able not only to extract patterns from data, but also to form theories based on empirical data;

• addressing the significant lag of Data Mining tools behind the theoretical achievements in this field.

It is evident that the development of Data Mining technology is directed mostly to areas related to business. In the short term, Data Mining products may become as ordinary and necessary as e-mail and, for example, be used to find the lowest prices on certain goods or the cheapest tickets. In the longer term, the future of Data Mining is really interesting: intelligent agents may find new treatments for various diseases, or even a new understanding of the nature of the universe.

However, Data Mining contains a potential danger: more and more information, including information of a private nature, becomes available through the worldwide network, and more and more knowledge can be extracted from it.

Areas where the use of Data Mining technology is likely to be successful have these features:

• they require decisions based on knowledge;
• they have a changing environment;
• accessible, adequate and meaningful data are available;
• they provide high returns from the right decisions.

There are several points of view on Data Mining nowadays. Supporters of one of them consider it a mirage, distracting from classical analysis. Supporters of the other direction accept Data Mining as an alternative to the traditional approach to analysis. There is also a middle position, which considers the possibility of combining the latest achievements of Data Mining with classical statistical data analysis.

Data Mining technology is constantly evolving and is attracting increasing interest both from the scientific world and from those applying the technology in business.

Integration of new technologies such as Data Mining and others into a single infrastructure will help to achieve more rapid and “smart” business decision making.

REFERENCES

1. Bryant R.E., Katz R.H., Lazowska E.D. Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science, and Society. — 2008. — V. 8.

2. Bollier D. The Promise and Peril of Big Data. — Washington: The Aspen Institute, 2010. — 23 p.

3. Heemink A. Mathematical Theory of Data Processing in Models (Data Assimilation Problems). — 2008. — Vol. 1. — 202 p.

4. Moran W., La Scala B. Measurements in Mathematical Modeling and Data Processing. — 2008. — Vol. 1. — 284 p.

5. Nurlybayeva K.K., Balakayeva G.T. Simulation of Large Data Processing for Smarter Decision Making. AWERProcedia Information Technology and Computer Science. 3rd World Conference on Information Technology (WCIT-2012). — Vol. 03, 2013. — P. 1253-1257.

6. Nurlybayeva K.K., Balakayeva G.T. Processing of Large Amounts of Data on a Credit Scoring Example Using Neural Network Technology. Safety and Security Engineering V. — 2013. — P. 165-171.

Нурлыбаева К.К., Балақаева Г.Т. Big data processing for decision making

Summary (translated from Kazakh). Nowadays there is a growing problem of processing large volumes of data. This article is devoted to describing the methods and techniques aimed at solving these problems. Some intelligent data analysis algorithms are described in the article. The article examines the latest developments in the analysis of large volumes of data, as well as the advantages of such analysis for business. It also describes proposals for optimizing processing systems and integrating them into a single infrastructure for faster "smart" business decision making. Key words: Big data, data processing, regression, classification, association, OLAP.

Нурлыбаева К.К., Балакаева Г.Т. Large volumes of data for decision making

Summary (translated from Russian). Nowadays there is a growing problem of processing large volumes of data. This article is devoted to describing the methods and techniques aimed at solving these problems. Some intelligent data analysis algorithms are described in the article. The article examines the latest developments in data analysis, as well as the advantages of analyzing large volumes of data for business. It also describes proposals for optimizing the data processing system and integrating it into a single infrastructure for faster "smart" business decision making. Key words: Big data, data processing, regression, classification, association, OLAP.
