Sequential Sequence Mining Technique in Mammographic Information Analysis Database


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 5, May 2012)


Kiran Amin¹, J. S. Shah²

¹ Head, Department of Computer Engineering & Information Technology, U. V. Patel College of Engineering, Kherva, Gujarat. E-mail: [email protected]

² Head, Department of Computer Engg., L.D. College of Engineering, Ahmedabad. E-mail: [email protected]

Abstract—Sequential Sequence Mining produces long sequences from biomedical data, which opens opportunities for data analysis and knowledge discovery. It provides efficient and scalable methods to extract the sequences of interest from datasets. In this work, a synthetic dataset of mammographic medical images is used, and the Sequential Sequence Mining technique is applied to discover sequences of interest in it. Many algorithms have been developed to find associations in large mammographic datasets. This paper focuses on a Sequential Sequence Mining algorithm for analyzing mammographic datasets.

Keywords—Sequential Sequence Mining, Association Rule Mining, Biomedical Data, Mammographic Image Analysis

I. INTRODUCTION

Mammography is used to detect and classify breast abnormalities, which may be benign or malignant. The synthetic dataset contains 150 malignant, 60 benign, and 70 normal cases. The mammographic images are collected together with clinical and pathology reports. The masses are extracted from the mammograms, and the shape factors [2] representing compactness, fractional concavity, and spiculation index are computed. An association-rule mining method is applied to the dataset, and the Sequential Sequence Mining technique is used to develop and analyze the mammographic sequences. Sequential Sequence Mining makes it possible to identify the long sequences responsible for anomalies occurring in the functioning of the breast.

II. ASSOCIATION RULE MINING

Association rule mining finds interesting relationships among a large set of items [7]. It is a data-mining task used to discover relationships among items in a transactional database [9].

A rule is retained only if it satisfies the minimum support and minimum confidence thresholds [6]. Support and confidence are defined below.

2.1. Support: The support of an association rule is the ratio (in percent) of the records that contain both X and Y to the total number of records in the database: $\mathrm{support}(X \Rightarrow Y) = P(X \cup Y)$.

2.2. Confidence: The confidence is the ratio of the number of records that contain both X and Y to the number of records that contain X: $\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \mathrm{support}(X \cup Y) / \mathrm{support}(X)$.

In brief, an association rule is an expression $X \Rightarrow Y$, where X and Y are sets of items. The meaning of such rules is quite intuitive: given a database D of transactions, where each transaction $T \in D$ is a set of items, $X \Rightarrow Y$ expresses that whenever a transaction T contains X, T probably contains Y as well. Support and confidence are the key parameters used to evaluate the generated association rules.

2.3. Strong Association Rules: For every frequent itemset A, if $B \subset A$, $B \neq \emptyset$, and $\mathrm{support}(A) / \mathrm{support}(B) \geq \mathrm{minconf}$, then we have the association rule $B \Rightarrow (A - B)$. Rules that satisfy both a minimum support threshold (min-sup) and a minimum confidence threshold (min-con) are called strong rules. Strong rules are the key elements obtained from an analysis of all possible rules.
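The definitions of support, confidence, and strong rules can be illustrated with a minimal Python sketch. The function names and the toy transactions are hypothetical and are not taken from the mammographic dataset; support is expressed as a fraction rather than a percentage, and the itemset passed to `strong_rules` is assumed to have non-zero support.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """confidence(X => Y) = support(X u Y) / support(X)."""
    sup_x = support(x, transactions)
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / sup_x if sup_x else 0.0

def strong_rules(a, transactions, minconf):
    """All rules B => (A - B) with B a non-empty proper subset of A
    and support(A) / support(B) >= minconf. A must occur at least once."""
    a = frozenset(a)
    sup_a = support(a, transactions)
    rules = []
    for size in range(1, len(a)):
        for b in combinations(sorted(a), size):
            b = frozenset(b)
            conf = sup_a / support(b, transactions)
            if conf >= minconf:
                rules.append((b, a - b, conf))
    return rules

# Toy transactions (hypothetical, for illustration only).
transactions = [frozenset(t) for t in (["a", "b"], ["a", "b"], ["b"], ["c"])]
print(support({"a", "b"}, transactions))
print(strong_rules({"a", "b"}, transactions, minconf=0.8))
```

In this toy database only the rule a => b passes the confidence threshold, because every transaction containing a also contains b, whereas b also occurs without a.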

Association rule mining proceeds in two steps. First, it finds the frequent itemsets. Second, it finds the strong association rules from the frequent itemsets, that is, the rules that satisfy minimum support and minimum confidence [7]. Many scans of the database are required to find frequent sequences using association rule mining. Algorithms such as Apriori [6], FP-growth [9], and hashing have been developed for this task and have been applied to gene sequences, where association rule mining finds useful patterns in the biological sequence. Using this approach, one gene sequence is used for the induction of a series of target gene sequences.

Among the best-known algorithms for association rule induction is the Apriori algorithm [9]. In this paper, we used the Apriori algorithm in order to discover association rules among the shape features extracted from the mammographic mass regions.

2.4. Apriori Algorithm: The Apriori algorithm [7] uses prior knowledge of frequent itemset properties: the frequent k-itemsets are used to find the (k+1)-itemsets. It uses a join step and a prune step to find the frequent itemsets. First it finds L1, the set of frequent single itemsets. L2 is then found using L1, and L3 is found using L2. Finally, the frequent sequence is found using the Apriori principle.

Apriori algorithm [6]:

    C1 = {candidate 1-itemsets};
    L1 = {c ∈ C1 | c.count ≥ minsup};
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori-gen(Lk-1);
        for all transactions t ∈ D do begin
            Ct = subset(Ck, t);
            for all candidates c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count ≥ minsup}
    end
    Answer = ∪k Lk;
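The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the paper: `minsup` is taken as an absolute count, the join and prune steps of apriori-gen are written with plain set operations, and the four-transaction database is a hypothetical example.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    # C1 and L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {c: n for c, n in counts.items() if n >= minsup}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count the surviving candidates with one scan of the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical toy database.
database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(database, minsup=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each pass of the `while` loop corresponds to one iteration of the `for` loop over k in the pseudocode and requires one scan of the database.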

III. FREQUENT PATTERN GROWTH TREE

Frequent Pattern growth (FP-growth) mines the frequent itemsets using a divide-and-conquer method. It operates on a tree (the FP-tree) and finds the frequent sets. The nodes are arranged so that more frequently occurring items are more likely to share nodes than less frequently occurring ones.

It first finds the frequent 1-itemsets, then the frequent 2-itemsets, and likewise up to the frequent n-itemsets. It builds conditional frequent trees based on subsets of the database. The algorithm is divided into two phases. The first phase uses the FP-tree. The FP-growth method converts the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix.

This method uses the least frequent items as a suffix, which reduces the search cost. When the database is large, it is sometimes unrealistic to construct a main-memory-based FP-tree. The performance study on the FP-tree shows that it is efficient and scalable for mining both long and short frequent patterns.
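The way frequent items come to share nodes can be seen in a small sketch of FP-tree construction. It covers only the tree-building phase, not the recursive mining of conditional trees; the class name `FPNode`, the tie-breaking by item name, and the toy transactions are assumptions made for illustration.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, minsup):
    # First scan: count items and keep those meeting the absolute threshold.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {item: n for item, n in freq.items() if n >= minsup}
    root = FPNode(None)
    # Second scan: insert each transaction with its frequent items ordered
    # from most to least frequent, so common prefixes share nodes.
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, freq

# Hypothetical toy transactions.
toy = [["a", "b"], ["b", "c", "d"], ["a", "b", "d"], ["b", "c"]]
root, freq = build_fp_tree(toy, minsup=2)
b = root.children["b"]
print(b.count, sorted(b.children))
```

Because item b occurs in every toy transaction, all four transactions pass through a single b node directly under the root.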

IV. PARTITION

The Partition algorithm is based on the observation that the frequent sets are normally very few compared with the set of all itemsets. The database is divided into partitions such that each partition fits in main memory. This reduces the number of database scans: each partition is brought into memory, scanned, and the items in that partition are counted. The algorithm is implemented in two phases. The first phase logically divides the database into a number of non-overlapping partitions. The partitions are considered one at a time, and all frequent itemsets for each partition are generated. Hence, if there are n partitions, Phase I of the algorithm takes n iterations. At the end of this phase, these frequent itemsets are merged to generate the set of all potential frequent itemsets. In this step, the local frequent itemsets of the same length from all n partitions are combined to generate the candidate itemsets. The resulting rules should satisfy minimum support and minimum confidence [8]. Many scans are required to find frequent sequences using association rule mining. Algorithms such as Apriori [5], FP-growth [9], and hashing have been developed and applied to gene sequences, where association rule mining finds useful patterns in the biological sequence.

Association rule mining algorithms are also used to find patterns in gene expression data, in which several gene expressions are scanned to find related sequences. One of these algorithms is Apriori [5].
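The two-phase Partition scheme can be sketched as below. This is an illustrative assumption-laden version: the local miner is a brute-force enumeration that is only practical for tiny transactions, the support threshold is a fraction `min_frac`, and the second phase (one more scan to count the merged candidates over the whole database) follows the usual description of the Partition algorithm rather than anything spelled out in this section.

```python
import math
from collections import Counter
from itertools import combinations

def local_frequent(partition, min_frac):
    """Brute-force frequent itemsets of one in-memory partition."""
    counts = Counter()
    for t in partition:
        for size in range(1, len(t) + 1):
            for c in combinations(sorted(t), size):
                counts[frozenset(c)] += 1
    need = math.ceil(min_frac * len(partition))
    return {c for c, n in counts.items() if n >= need}

def partition_mine(transactions, n_parts, min_frac):
    transactions = [frozenset(t) for t in transactions]
    size = math.ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Phase I: one iteration per partition; merge the local frequent itemsets.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_frac)
    # Phase II (assumed): count every candidate over the full database.
    need = math.ceil(min_frac * len(transactions))
    result = {}
    for c in candidates:
        n = sum(c <= t for t in transactions)
        if n >= need:
            result[c] = n
    return result

# Hypothetical toy database, split into two partitions.
database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
global_frequent = partition_mine(database, n_parts=2, min_frac=0.5)
print(len(global_frequent))
```

An itemset that is frequent in the whole database must be frequent in at least one partition, which is why merging the local results yields a complete candidate set.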

V. FEATURE EXTRACTION

Three shape measures, compactness (C), fractional concavity (Fcc), and spiculation index (SI), are used as features:

1. The normalized form of compactness, C, is a simple measure of shape complexity, and is computed as

$C = 1 - \frac{4 \pi A}{P^2}$

where A is the area and P is the perimeter of the contour.

2. Fractional concavity Fcc is the ratio of the cumulative length of the concave portions of the contour to the total length of the contour [3].

3. Spiculation index SI represents the degree of spicularity of a mass contour.
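As a worked illustration of the compactness measure, the sketch below evaluates $C = 1 - 4 \pi A / P^2$ for a contour given as a closed polygon. Representing the contour as an ordered list of (x, y) vertices and computing the area with the shoelace formula are assumptions made here; the paper does not state how A and P are obtained.

```python
import math

def compactness(contour):
    """C = 1 - 4*pi*A / P**2 for a closed polygon given as ordered (x, y) vertices."""
    n = len(contour)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]      # wrap around to close the contour
        twice_area += x0 * y1 - x1 * y0    # shoelace term
        perimeter += math.hypot(x1 - x0, y1 - y0)
    area = abs(twice_area) / 2.0
    return 1.0 - 4.0 * math.pi * area / perimeter ** 2

# A unit square and a 1000-sided approximation of a circle (hypothetical shapes).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
circle = [(math.cos(2 * math.pi * k / 1000), math.sin(2 * math.pi * k / 1000))
          for k in range(1000)]
print(compactness(square), compactness(circle))
```

A circle gives a value close to 0, and the value grows toward 1 as the contour becomes more complex, which is why C is described as a normalized measure of shape complexity.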

VI. QUANTITATIVE ASSOCIATION-RULE MINING

6.1. Data preparation

Quantitative, real-valued features are used in both the association-rule and the Sequential Sequence Mining techniques. The quantitative association-rule problem is mapped into a Boolean association-rule problem. Each shape feature value is extracted numerically in the range [0.0, 1.0].

Each feature is split into 10 ranges, [0.0, 0.1], [0.1, 0.2], [0.2, 0.3], ..., [0.9, 1.0], and intervals are defined for each feature $F_i$ as $\mathrm{Range}_{\min} \leq F_i < \mathrm{Range}_{\max}$. For example, the feature SI = 0.82 belongs to the new feature interval [$\mathrm{Range}_{\min}$ = 0.8, $\mathrm{Range}_{\max}$ = 0.9]. These new feature intervals are organized in the form of transactions, which are used as input for the data-mining and classification algorithms.

The transactions are of the form {Class Label, F1, F2, ..., Fn}, where F1, F2, ..., Fn are the features extracted for a given mammographic mass image, and Class Label marks the mass as benign or malignant. Sample transactions built from the feature intervals are given below:

{Benign, C0.1-0.2, F0.0-0.1, S0.0-0.1}

{Benign, C0.1-0.2, F0.0-0.1, S0.0-0.1}

{Benign, C0.0-0.1, F0.1-0.2, S0.2-0.3}

……

{Malignant, C0.3-0.4, F0.2-0.3, S0.1-0.2}

……

Here, F = Fcc, and S = SI.
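The mapping from real-valued features to interval items can be sketched as follows. The function names are hypothetical, the feature values in the example are invented for illustration, and placing the value 1.0 in the last interval is an assumption, since the half-open definition above leaves that boundary case open.

```python
def interval_label(prefix, value, n_bins=10):
    """Map a value in [0.0, 1.0] to an item such as 'S0.8-0.9'."""
    index = min(int(value * n_bins), n_bins - 1)  # 1.0 goes into the last bin
    low = index / n_bins
    high = (index + 1) / n_bins
    return f"{prefix}{low:.1f}-{high:.1f}"

def to_transaction(class_label, c, fcc, si):
    """Build a transaction {Class Label, C..., F..., S...} from three features."""
    return [class_label,
            interval_label("C", c),
            interval_label("F", fcc),
            interval_label("S", si)]

# Hypothetical feature values, not taken from the dataset.
print(interval_label("S", 0.82))
print(to_transaction("Benign", 0.15, 0.05, 0.03))
```

Each resulting item is a Boolean attribute (the interval is either present in the transaction or not), which is what allows a Boolean association-rule miner to be applied to the quantitative features.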

6.2. Pruning methods

After the association-rule mining algorithm is applied, many rules are generated. However, we might not be interested in all the rules generated. Pruning methods need to be employed to identify the rules that represent knowledge in a useful manner.

1. The major constraint is that only rules that can be used further for classification are considered. Given the transaction model $X \Rightarrow Y$, we are interested in rules of the form "Y is Ci" (classification type: benign or malignant). Such rules are called “interesting rules”.

2. The probabilities and joint probabilities of items and combinations of interest are evaluated to provide thresholds based on their support or confidence values. In this paper, the term “strong rules” is applied to interesting rules that have support $\geq$ 5% and confidence $\geq$ 80%.

Single-feature rules for fractional concavity indicate that if the value is less than 0.4, the mass is most likely benign, whereas if the value is greater than 0.4, the mass is malignant with 100% confidence. For the spiculation index, if the value is less than 0.2, the mass is most likely benign, and if the value is greater than 0.2, the mass is malignant with 100% confidence.
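The two pruning criteria can be combined in a short sketch that keeps only single-feature rules whose consequent is a class label and whose support and confidence reach the 5% and 80% thresholds. The function name, the restriction to single-feature antecedents, and the four toy transactions are assumptions for illustration; they are not the rules or data reported in the paper.

```python
from collections import Counter

CLASSES = {"Benign", "Malignant"}

def strong_class_rules(transactions, min_support=0.05, min_confidence=0.80):
    """Rules 'feature interval => class' that are both interesting and strong."""
    n = len(transactions)
    item_count = Counter()   # occurrences of each feature interval
    pair_count = Counter()   # co-occurrences of (feature interval, class)
    for t in transactions:
        label = next(x for x in t if x in CLASSES)
        for item in t:
            if item not in CLASSES:
                item_count[item] += 1
                pair_count[(item, label)] += 1
    rules = []
    for (item, label), both in sorted(pair_count.items()):
        sup = both / n
        conf = both / item_count[item]
        if sup >= min_support and conf >= min_confidence:
            rules.append((item, label, sup, conf))
    return rules

# Hypothetical toy transactions.
toy = [
    ["Benign", "F0.0-0.1", "S0.0-0.1"],
    ["Benign", "F0.0-0.1", "S0.1-0.2"],
    ["Malignant", "F0.4-0.5", "S0.1-0.2"],
    ["Malignant", "F0.4-0.5", "S0.3-0.4"],
]
rules = strong_class_rules(toy)
for item, label, sup, conf in rules:
    print(f"{item} => {label}  support={sup:.2f} confidence={conf:.2f}")
```

In the toy data the interval S0.1-0.2 occurs once with each class, so both of its rules have 50% confidence and are pruned.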

VII. USING SEQUENTIAL SEQUENCE MINING

Large gene sequences may be found using Sequential Sequence Mining techniques. The technique uses a vertical fragment representation of the database with efficient support counting. It finds long sequences by generating Sequence-A and Sequence-B sequences, in which an item is added at the end of an existing sequence.

When the item is added at the end of the sequence as a new itemset, the result is a Sequence-A sequence. When the item is added to the last itemset of the sequence, the result is a Sequence-B sequence; in this case the index of the added item must be greater than that of the items already in the last itemset.

In this method, the customers' transactions are represented by fragments. For a given sequence, the bit corresponding to a transaction is set to 1 if that transaction contains the last itemset in the sequence and the previous transactions contain all previous itemsets in the sequence (i.e., the customer contains the sequence of itemsets). Generating child sequences from a parent in these two ways is referred to as Sequence-A and Sequence-B.


7.1. The Absolute Support: The absolute support of a sequence $K_p$ in the sequence representation of a database D is defined as the number of sequences $k \in D$ that contain $K_p$, and the relative support is defined as the percentage of sequences $k \in D$ that contain $K_p$.
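Absolute and relative support can be made concrete with a direct, non-bitmap sketch. A customer sequence is taken to contain a pattern when the pattern's itemsets can be matched, in order, to subsets of distinct transactions of that customer; this reading of "contain", the function names, and the toy database are assumptions for illustration.

```python
def contains(customer_sequence, pattern):
    """True if `pattern` (a list of itemsets) occurs in order in the sequence."""
    matched = 0
    for itemset in customer_sequence:
        # Greedily match the next pattern itemset to the earliest transaction.
        if matched < len(pattern) and set(pattern[matched]) <= set(itemset):
            matched += 1
    return matched == len(pattern)

def absolute_support(database, pattern):
    """Number of customer sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database.values())

def relative_support(database, pattern):
    """Percentage of customer sequences that contain the pattern."""
    return 100.0 * absolute_support(database, pattern) / len(database)

# Hypothetical toy database: customer ID -> transactions in time order.
database = {
    1: [{"a", "b"}, {"c"}, {"b"}],
    2: [{"b"}, {"a", "b"}],
    3: [{"a"}, {"b"}, {"c"}],
    4: [{"c"}],
}
pattern = [{"a"}, {"b"}]
print(absolute_support(database, pattern), relative_support(database, pattern))
```

Customer 2 does not contain the pattern because no transaction with b follows the transaction with a, even though both items appear in that customer's history.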

7.2. Generation of Sequences: First, the customers' data are sorted by customer ID and then by transaction ID. Sequence-A and Sequence-B sequences are then generated and tested at each node: at each node n, the support of each Sequence-A sequence and each Sequence-B sequence is computed. If the support of a generated sequence s is greater than or equal to minSup, the sequence is stored as a frequent sequence and the process is repeated on it. If the support of s is less than minSup, the process need not be repeated on s, because by the Apriori principle no child sequence generated from s can be frequent.

If the sequences are arranged in a tree, each sequence in the tree generates Sequence-A sequences and Sequence-B sequences. Thus we can associate two sets with each sequence n in the tree: the set of items considered for Sequence-A extensions of n (k-extensions), denoted $K_n$, and the set of items considered for Sequence-B extensions of n (i-extensions), denoted $I_n$.

If the sequences are considered as nodes of the tree, each node is generated by exactly one Sequence-A or Sequence-B extension. For example, the sequence ({p, q}, {q}) is generated from the sequence ({p, q}) by a Sequence-A extension with {q}. It cannot be generated from the sequence ({p}, {q}) or ({q}, {q}).

7.3. Pruning: We can prune candidate k-extensions and i-extensions [10] of a node n in the tree. The pruning techniques are based on the Apriori principle and aim at minimizing the size of $K_n$ and $I_n$ at each node n.

7.4. Fragment Representation: Fragments are used to represent the data. In the fragment map of an item i, the bit of a transaction x is set to 1 if i appears in x, and to 0 otherwise. To find the fragment of the itemset {i, j}, that is, the transactions in which items i and j appear together, we perform a bitwise AND between the fragments of {i} and {j}.
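The fragment representation and the bitwise AND can be sketched with Python integers used as bit vectors, where bit x of an item's fragment stands for transaction x. The function name and the toy transactions are hypothetical.

```python
def item_fragments(transactions):
    """Return {item: int}, where bit x is 1 if the item appears in transaction x."""
    fragments = {}
    for position, items in enumerate(transactions):
        for item in items:
            fragments[item] = fragments.get(item, 0) | (1 << position)
    return fragments

# Hypothetical toy transactions, in order.
transactions = [{"i", "j"}, {"i"}, {"j"}, {"i", "j"}]
fragments = item_fragments(transactions)

# Fragment of the itemset {i, j}: bitwise AND of the two item fragments.
both = fragments["i"] & fragments["j"]
print(format(fragments["i"], "04b"), format(fragments["j"], "04b"), format(both, "04b"))
print(bin(both).count("1"))  # number of transactions containing both items
```

Counting the 1 bits of the result gives the number of transactions that contain the itemset, so support counting reduces to an AND followed by a bit count.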

7.5. Algorithm:

Input: D, a database of transactions

Output: Frequent sequences of the database

Method:

1. Collect the customers' transaction information from the input data file and store it in an array.

2. Initialize the number of customers, transactions, and items.

3. Store the customers, transactions, and items in the array.

4. Increment the customer count for each new customer. Increment the customer's transaction count for each different transaction of the same customer. Increment the item count for each different item of the same customer.

5. Read the CID, TID, and IID information from the array and set the bit of the appropriate transaction to 1.

6. Read the data and fill in the transaction bits.

7. Read the input file and find the frequent 1-itemsets.

8. Find the maximum number of transactions and the number of customers, and set the minimum support.

9. Find the frequent itemsets.

10. Perform the Sequence-A and Sequence-B processes on the current node.

11. Find the Sequence-A sequences for the next node.

12. Create the Sequence-B sequences for the next node using items whose index is higher than that of the current node. Check whether each is frequent; if it is, store it.

13. Output, index-wise, the Sequence-A sequences whose support is greater than the minimum support threshold.
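The steps above can be sketched as a depth-first search over fragments. This is a simplified illustration under stated assumptions, not the authors' implementation: the database is passed in memory as a dictionary instead of being read from a file, each customer has its own integer fragment, a Sequence-A extension is assumed to consider only the transactions after the first occurrence of the parent, no pruning of $K_n$ and $I_n$ beyond the support test is performed, and all names and the toy data are hypothetical.

```python
def mine_sequences(database, min_sup):
    """database: {cid: [set_of_items, ...]} with transactions in time order.
    Returns {pattern: support}; a pattern is a tuple of item tuples."""
    cids = sorted(database)
    items = sorted({i for seq in database.values() for t in seq for i in t})
    # One fragment per item and customer: bit p is 1 if transaction p has the item.
    frag = {i: [sum(1 << p for p, t in enumerate(database[c]) if i in t)
                for c in cids] for i in items}
    full = [(1 << len(database[c])) - 1 for c in cids]

    def support(bits):
        # A customer supports the pattern if any bit of its fragment is set.
        return sum(1 for b in bits if b)

    def after_first(bits):
        # Per customer: set every bit strictly after the first set bit.
        out = []
        for b, f in zip(bits, full):
            lowest = b & -b
            out.append(f & ~((lowest << 1) - 1) if b else 0)
        return out

    frequent = {}

    def extend(pattern, bits):
        frequent[pattern] = support(bits)
        later = after_first(bits)
        last = pattern[-1]
        for i in items:
            # Sequence-A: item i starts a new itemset in a later transaction.
            new = [s & x for s, x in zip(later, frag[i])]
            if support(new) >= min_sup:
                extend(pattern + ((i,),), new)
            # Sequence-B: item i joins the last itemset (only larger items).
            if i > last[-1]:
                new = [b & x for b, x in zip(bits, frag[i])]
                if support(new) >= min_sup:
                    extend(pattern[:-1] + (last + (i,),), new)

    for i in items:
        if support(frag[i]) >= min_sup:
            extend(((i,),), frag[i])
    return frequent

# Hypothetical toy database: customer ID -> transactions in time order.
database = {
    1: [{"a", "b"}, {"c"}, {"b"}],
    2: [{"b"}, {"a", "b"}],
    3: [{"a"}, {"b"}, {"c"}],
    4: [{"c"}],
}
found = mine_sequences(database, min_sup=2)
for pattern, sup in sorted(found.items()):
    print(pattern, sup)
```

A sequence whose support falls below `min_sup` is never passed to `extend`, which is how the Apriori principle stops the search below an infrequent node.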


VIII. PERFORMANCE

Figure 1 and Figure 2 show the analysis of Sequential Sequence Mining. Figure 1 shows the analysis for various numbers of customers with different support values. Figure 2 shows the mining time taken for different support values. The technique was compared with association rule mining, and the results were found to be better for gene sequences.

IX. CONCLUSION

In this paper, frequent sequences in mammographic sequences, used as transactions, are found by generating Sequence-A and Sequence-B sequences. The algorithm finds the sequential sequences and performs better in the comparison with association rule mining. It generates sequences of various lengths with efficient support counting.

REFERENCES

[1] Alberta Cancer Board, “Screen Test: Alberta Program for the Early Detection of Breast Cancer, 1999/01 Biennial Report,” Edmonton, Alberta, Canada, 2001.

[2] H. Alto, R. M. Rangayyan, R. B. Paranjape, J. E. L. Desautels, and H. Bryant, “An indexed atlas of digital mammograms for computer-aided diagnosis of breast cancer,” vol. 58, no. 5-6, pp. 820-835, 2003.

[3] R. M. Rangayyan, N. R. Mudigonda, and J. E. L. Desautels, “Boundary modelling and shape analysis methods for classification of mammographic masses,” Medical and Biological Engineering and Computing, vol. 38, pp. 487-495, 2000.

[4] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Washington, D.C., USA, 1993, pp. 207-216.

[5] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in Proc. of the 20th Int'l Conf. on VLDB, 1994, pp. 487-499.

[6] C. Gyorodi and R. Gyorodi, “Mining association rules in large databases,” in Proc. of Oradea EMES'02, Oradea, Romania, 2002, pp. 45-50.

[7] J. Pei, J. Han, and W. Wang, “Mining sequential patterns with constraints in large databases,” in Proc. of CIKM'02, 2002, pp. 18-25.

[8] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in Proc. of ACM SIGMOD, 2000.

[9] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, “Dynamic itemset counting and implication rules for market basket data,” in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Tucson, Arizona, USA, 1997, pp. 255-264.