Analysis on Uncertain Data of Share Market Using Decision Tree and Pruning Algorithm


International Journal of Research in Information Technology (IJRIT)

www.ijrit.com ISSN 2001-5569


Sunit S. Dongare¹, Kiran R. Khandarkar², Sonali S. Wagh³

1PG Student, SYCET Aurangabad Computer Sci. Engg, Dr.BAMU Aurangabad, Maharashtra, India


2Asst. Professor, SYCET Aurangabad Computer Sci. Engg, Dr.BAMU Aurangabad, Maharashtra, India

3Designation, University Department Name, University Name/ Company Name Aurangabad, Maharashtra, India

Abstract

In this paper we focus on data with uncertain information rather than precise, known values.

Uncertain values arise naturally in many applications during data collection, for various reasons. We briefly discuss three categories here: measurement errors, data staleness, and repeated measurements. Under uncertainty, the value of a data item is often represented by multiple values, i.e., a range of values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as mean and median), we find that the accuracy of a decision tree classifier can be much improved if the “complete information” of a data item (taking into account the entropy) is utilized. Since processing pdfs is computationally more costly than processing single values (e.g., averages), entropy calculation and decision tree construction on uncertain data are more CPU demanding than on certain data. We therefore implement pre-pruning techniques that can greatly improve efficiency in terms of time while preserving accuracy.

Keywords:

Decision tree, Pre-pruning, Classification.

1. Introduction

Classification is a classical problem in machine learning and data mining [5]. Given a set of training data tuples, each having a class label and being represented by a feature vector, the task is to algorithmically build a model that predicts the class label of an unseen test tuple based on the tuple’s feature vector. One of the most popular classification models is the decision tree model. Decision trees are popular because they are practical and easy to understand. Rules can also be extracted from decision trees easily. Many algorithms, such as ID3 [2] and C4.5 [3], have been devised for decision tree construction. These algorithms are widely adopted and used in a wide range of applications such as image recognition, medical diagnosis, credit rating of loan applicants, scientific tests, fraud detection, and target marketing. In traditional decision-tree classification, a feature (an attribute) of a tuple is either categorical or numerical. For the latter, a precise and definite point value is usually assumed. In many applications, however, data uncertainty is common. The value of a feature/attribute is thus best captured not by a single point value, but by a range of values giving rise to a probability distribution. A simple way to handle data uncertainty is to abstract probability distributions by summary statistics such as means and variances. We call this approach averaging. Another approach is to consider the complete information carried by the probability distributions to build a decision tree.
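The contrast between averaging and using the complete distribution can be illustrated with a minimal Python sketch. It assumes an uncertain attribute is available as a discrete pdf (sample points with probabilities); the `UncertainValue` type, its method names, and the price figures are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class UncertainValue:
    """A discrete pdf: sample points and their probabilities (hypothetical type)."""
    points: Sequence[float]
    probs: Sequence[float]

    def mean(self) -> float:
        # Averaging: collapse the pdf to its expected value.
        return sum(x * p for x, p in zip(self.points, self.probs))

    def mass_at_or_below(self, z: float) -> float:
        # Distribution-based: probability that the value is <= z.
        return sum(p for x, p in zip(self.points, self.probs) if x <= z)


# Illustrative share price observed as a range of values (made-up numbers).
price = UncertainValue(points=(98.0, 100.0, 104.0), probs=(0.25, 0.5, 0.25))

averaged = price.mean()                    # a single point value
left_mass = price.mass_at_or_below(100.0)  # share of the pdf on one side of a split
```

Averaging keeps one number per attribute, whereas the distribution-based view keeps the probability mass on each side of a candidate split point, which is the information a tree built on the full pdf can exploit.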

Before we delve into the details of our data model and algorithms, let us discuss the sources of data uncertainty and give some examples. Data uncertainty arises naturally in many applications for various reasons. We briefly discuss three categories here: measurement errors, data staleness, and repeated measurements.

2. Related Work

Decision trees are popular because they are practical and easy to understand, and rules can be extracted from them easily. Decision tree algorithms are widely adopted in a wide range of applications such as image recognition, medical diagnosis [1], credit rating of loan applicants, scientific tests, fraud detection, and target marketing. In traditional decision-tree classification, a feature (an attribute) of a tuple is either categorical or numerical. The most challenging task is to construct a decision tree based on tuples with uncertain values. It involves finding a good testing attribute $A_{j_n}$ and a good split point $z_n$ for each internal node $n$, as well as an appropriate probability distribution $P_m$ over the set of class labels $C$ for each leaf node $m$. We describe algorithms for constructing such trees.
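As a rough illustration of the split-point search for one internal node, the following Python sketch scores candidate split points by weighted class entropy when each tuple carries a discrete pdf. It assumes the fractional-tuple idea, in which a tuple contributes its probability mass at or below the split point to the left child and the remainder to the right child. The function names and the toy data are hypothetical, and the sketch assumes the data contains at least two distinct sample points.

```python
import math
from collections import defaultdict


def class_entropy(weights):
    """Shannon entropy (bits) of a dict mapping class label -> fractional count."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total)
                for w in weights.values() if w > 0)


def split_entropy(tuples, z):
    """Weighted entropy after splitting fractional tuples at point z.

    tuples: list of (points, probs, label), a discrete pdf plus a class label.
    """
    left, right = defaultdict(float), defaultdict(float)
    for points, probs, label in tuples:
        p_left = sum(p for x, p in zip(points, probs) if x <= z)
        left[label] += p_left          # mass that falls at or below z
        right[label] += 1.0 - p_left   # remaining mass goes right
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    return (n_left / n) * class_entropy(left) + (n_right / n) * class_entropy(right)


def best_split(tuples):
    """Try every pdf sample point except the largest as a candidate split."""
    candidates = sorted({x for points, _, _ in tuples for x in points})[:-1]
    return min(candidates, key=lambda z: split_entropy(tuples, z))


# Toy data: two uncertain tuples with made-up values and labels.
data = [
    ((1.0, 2.0), (0.5, 0.5), "up"),
    ((3.0, 4.0), (0.5, 0.5), "down"),
]
z_best = best_split(data)
```

Scanning every sample point is the simple exhaustive approach; it shows why tree construction on pdfs costs more than on point values, since the number of candidates grows with the number of samples per pdf.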

Self-learning is a difficult problem in machine learning. However, it would greatly promote the development and application of machine learning theory if we could remove the reliance on prior knowledge in the machine learning process and let the training data control the process independently. Professor G. Y. Wang [2] developed a self-learning method that uses the minimal local certainty of a decision table as the threshold controlling Skowron’s propositional default rule generation algorithm.

3. Proposed System

The existing system already classifies uncertain data accurately, but it is time consuming compared with earlier approaches such as computing the mean. Our proposed system improves time efficiency by using a tree pruning technique, taking advantage of the speed of pre-pruning. The proposed system analyzes share market data that holds uncertain values. We apply pruning to the tree and take the decision from the more convenient and accurate result, with efficiency in terms of time. Finally, the output shows a single value instead of a range of values, which can then be used for further calculation.


The proposed system uses the Decision tree Pre-pruning Self-learning Algorithm (DPSA) to cut unwanted nodes from the tree. Top-down induction of decision trees is arguably the most popular learning regime in classification because it is fast and produces comprehensible output.

However, the accuracy and size of a decision tree depend strongly on the pruning strategy employed by the induction algorithm that is used to form it. Pruning algorithms discard branches of a tree that do not improve accuracy. To achieve this they implement one of two general paradigms: pre-pruning or post-pruning. Pre-pruning algorithms do not literally perform “pruning” because they never prune existing branches of a decision tree: they “prune” in advance by suppressing the growth of a branch if additional structure is not expected to increase accuracy. Post-pruning methods have their equivalent in horticulture: they take a fully grown tree and cut off all the superfluous branches, that is, branches that do not improve predictive performance.
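The pre-pruning idea can be sketched in Python as a stopping test applied while the tree is grown. This is a generic illustration on point-valued data with a fixed information-gain threshold, not the DPSA procedure of [2], which learns its threshold from the data. The function names, the `min_gain` parameter, and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (gain, z) for the best split 'value <= z', or (0.0, None)."""
    base, best = entropy(labels), (0.0, None)
    for z in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= z]
        right = [y for x, y in zip(values, labels) if x > z]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        if base - remainder > best[0]:
            best = (base - remainder, z)
    return best


def grow(values, labels, min_gain=0.1):
    """Grow a tree top-down; pre-prune when the best gain is below min_gain."""
    gain, z = best_threshold(values, labels)
    if z is None or gain < min_gain:
        # Pre-pruning: suppress the branch and emit a majority-class leaf.
        return Counter(labels).most_common(1)[0][0]
    left = [(x, y) for x, y in zip(values, labels) if x <= z]
    right = [(x, y) for x, y in zip(values, labels) if x > z]
    return {
        "z": z,
        "left": grow([x for x, _ in left], [y for _, y in left], min_gain),
        "right": grow([x for x, _ in right], [y for _, y in right], min_gain),
    }


# Toy data: made-up values with two classes.
tree = grow([1.0, 2.0, 3.0, 4.0], ["down", "down", "up", "up"])
```

Because the test runs before a node is expanded, a suppressed branch costs no further entropy calculations, which is where the time saving of pre-pruning comes from.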

As we have seen, the share market updates regularly: share values change continuously over time. We are designing a system that helps in making decisions about investment or divestment. We take data from NSE, BSE and NIFTY, which gives us readings of the stock market. We focus only on the rise and fall of share values and on the reasons behind those movements.

3.1 Share Market

Market share analysis is a part of market analysis and indicates how well a firm is doing in the marketplace compared to its competitors. Researchers have studied spreadsheet and word processing software firms to give a clearer picture of how to determine market share in the software industry. They propose six factors to help estimate the value of market share (1997):

• unit or dollar sales,

• user base (because of the effects of piracy and brand switching),

• market definition (scope of definitions),

• scope of denominator (which other brands are included),

• time frame length,

• product definition (brand, product line, or strategic business unit).

[Figure: flowchart with the boxes Input With Uncertain Value Attribute; Applying Entropy on Range Value; Compute Single Value; Initial Tree Forming Using Single Value; Final Tree Using Pruning Algorithm; Output without Uncertain Value; Give Decision]

Fig. 3.1: Methodology of Decision tree for Uncertain Data

4. Experiment

We have a share market data set that contains uncertainty. In our proposed system the data set contains repeated values, and we work on these repeated values. The proposed system implements a pre-pruning algorithm that limits tree growth from the start, i.e., from the root node of the tree. The Decision tree Pre-pruning Self-learning Algorithm (DPSA) reduces the mathematical calculation by removing nodes that are not necessary for further calculation. Fewer nodes need less time for any further calculation, so efficiency improves. Our hypothesis is that, with the data modeled by the proposed system, efficiency grows by 10% over the existing system, UDT.

Result from Existing system:

The efficiency percentage is

$$\left(\frac{10 - 2}{10}\right) \times 100 = 80.88\%$$

Overall efficiency percentage of UDT method is

$$\left(\frac{27 - 2}{27}\right) \times 100 = 92.59\%$$

Result from proposed system:

The efficiency of proposed system is

$$\left(\frac{10 - 1}{10}\right) \times 100 = 90\%$$

Overall efficiency percentage of DPSA method is

$$\left(\frac{27 - 1}{27}\right) \times 100 = 96.59\%$$

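[Figure: chart comparing DPSA and UDT for the testing process and the training process; axis from 0 to 120]

Fig. 4.1: Result comparing between DPSA and UDT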
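The four percentages in this section share the form ((N − k) / N) × 100. The Python sketch below evaluates that expression for the four (N, k) pairs that appear in the text. The function name and the reading of N as a total count and k as a count of misses are assumptions, since the text does not define them. Evaluated directly, the (10, 2) and (27, 1) cases give 80.00 and 96.30, which differ slightly from the reported 80.88% and 96.59%; the reported figures may rest on inputs not shown in the text.

```python
def efficiency_percent(total: int, misses: int) -> float:
    """Evaluate ((total - misses) / total) * 100 (names are hypothetical)."""
    return (total - misses) / total * 100


# (N, k) pairs as they appear in the text, keyed by an illustrative label.
cases = {
    "UDT": (10, 2),
    "UDT overall": (27, 2),
    "DPSA": (10, 1),
    "DPSA overall": (27, 1),
}

computed = {name: round(efficiency_percent(n, k), 2)
            for name, (n, k) in cases.items()}

for name, value in computed.items():
    print(f"{name}: {value:.2f}%")
```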

5. Conclusions

From the above charts it is observed that UDT and DPSA provide different performance accuracy for the different signals, and that DPSA provides greater performance accuracy than UDT, namely 96.59%. Compared with the existing system, the proposed system provides more accuracy. Hence the present system can be used for further processing.

References

[1] Tsang, Ben Kao, Kevin Y. Yip, Wai-Shing Ho, Sau Dan Lee, “Decision Trees for Uncertain Data”, IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 1, pp. 1-14, Jan. 2011.

[2] De-sheng Yin, Guo-yin Wang, Wu, “A Self-Learning Algorithm for Decision Tree Pre-Pruning”, Third International Conference on Machine Learning and Cybernetics, Shanghai, pp. 2040-2043, 26-29 August 2004.

| Signals | Training Tuples | Accuracy (%) |
|---|---|---|
| Japanese Vowel | 270 | 87.30 |
| Pen-Digit | 7494 | 96.11 |
| Page Block | 5473 | 96.82 |
| Satellite | 4435 | 87.73 |
| Segment | 2310 | 92.91 |
| Vehicle | 846 | 75.09 |
| Breast Cancer | 569 | 95.93 |
| Ionosphere | 351 | 91.69 |
| Glass | 214 | 72.75 |
| Iris | 150 | 96.13 |

Table 4.1: Efficiency percentage of UDT method.

| Signals | Training Signals | Accuracy (%) |
|---|---|---|
| Share market | 100 | 96.59 |

Table 4.2: Efficiency percentage of DPSA method.


[3] Hand Book of Decision Tree and Pre-Pruning.

[4] R. Agrawal, T. Imielinski, and A. N. Swami, “Database mining: A performance perspective”, IEEE Trans. Knowl. Data Eng., vol. 5, no. 6, pp. 914–925, 1993.

[5] Krishna Mohan, Surekha Alokam, MHM Krishna Prasad, “An Efficient Decision Tree For Uncertain Data”, International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 3, pp. 1401-1405, May-Jun 2012.

[6] Nilesh Dalvi, Dan Suciu, “Efficient Query Evaluation on Probabilistic Databases”, Proceedings of the 30th VLDB Conference, Toronto, Canada, pp. 1-5, 2004.

[7] W. Nor Haizan, W. Mohamed, Mohd Najib Mohd Salleh, Abdul Halim Omar, “A Comparative Study of Reduced Error Pruning Method in Decision Tree Algorithms”, IEEE International Conference on Control System, Computing and Engineering, pp. 392-396, 23-25 Nov. 2012, Penang, Malaysia.

[8] Jie Chen, Xizhao Wang, Junhai Zhai, “Pruning Decision Tree Using Genetic Algorithms”, International Conference on Artificial Intelligence and Computational Intelligence, pp. 244-247, 2009.

[9] S. D. Lee, B. Kao, and R. Cheng, “Reducing UK-means to K-means”, in The 1st Workshop on Data Mining of Uncertain Data (DUNE), in conjunction with the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, USA, pp. 1-5, 28 Oct. 2007.