Efficient Astronomical Data Classification on Large-Scale Distributed Systems

(1)

Large-Scale Distributed Systems

Cheng-Hsien Tang1_{, Min-Feng Wang}1_{, Wei-Jen Wang}1_{, Meng-Feng Tsai}1,_,

Yuji Urata2_{, Chow-Choong Ngeow}2_{, Induk Lee}2_{, Kuiyun Huang}3_,

and Wen-Ping Chen2

1 _{Department of Computer Science and Information Engineering,}

National Central University, Taiwan

2 _{Institute of Astronomy, National Central University, Taiwan} 3 _{Academia Sinica Institute of Astronomy and Astrophysics, Taiwan}

Abstract. Classification of different kinds of space objects plays an im-portant role in many astronomy areas. Nowadays the classification pro-cess can possibly involve a huge amount of data. It could take a long time for processing and demand many resources for computation and storage. In addition, it may also take much effort to train a qualified ex-pert who needs to have both the astronomy domain knowledge and the capability to manipulate the data. This research intends to provide an efficient, scalable classification system for astronomy research. We im-plement a dynamic classification framework and system using support vector machines (SVMs). The proposed system is based on a large-scale, distributed storage environment, on which scientists can design their analysis processes in a more abstract manner, instead of an awkward and time-consuming approach which searches and collects related subset of data from the huge data set. The experimental results confirm that our system is scalable and efficient.

Keywords: Classiﬁcation, Support Vector Machine, Distributed System, Data Center.

1 Introduction

Classification of space objects is a common, important technique in many as-tronomy research areas. For example, analyzing the galaxies usually involves classification using each space object’s appearance, magnitude, moving behav-ior, and many other characteristics. Most traditional classification methods focus on handling a small region of data store in one machine, and apply a single pro-cess to analyze the data. When those methods encounter an extremely large size of data, the analysis performance inevitably drops down dramatically. Unfortu-nately, we have already seen large data collections in terabyte-scale or even in petabyte-scale in the past decade, such as astronomy [1], high energy physics [2],

_{Corresponding author.}

R.-S. Chang et al. (Eds.): GPC 2010, LNCS 6104, pp. 430–440, 2010. c

(2)

and aircraft engine diagnostics [3]. Thus scalable, eﬃcient classiﬁcation methods are very important to today’s science research.

The Multiple Classifier System (MCS) [4] is one popular solution to divide data into small chunks, and then to classify the data chunks in parallel with mul-tiple similar tools. Many dynamic classification selection methods, such as [5, 6], only partially support MCS because some critical problems remain unsolved. First, these methods require gathering and combining different datasets into one complete set, and then use some static analysis processes to split the complete set into appropriate chunks. The splitting process should create diversity of chunks for better accuracy. This process is time-consuming because huge-size datasets are frequently exchanged and transferred among computing hosts and storage systems. Nowadays scientists have seen tera-byte or peta-byte of dataset con-stantly being transferred from one site to another site across different countries. Secondly, those dynamic classification selection methods are usually designed for face identification and brain wave measurement. The data characteristics can be very restrictive and thus they cannot be directly reused in other MCS applications.

The goal of this research is to provide a basic MCS platform for distributed, large-scale data storage systems. We provide a prototype of a decentralized MCS framework that supports classification with multiple classifiers. This design can reduce both data extraction time and computing loads in a distributed large-scale data center for astronomy research. The proposed approach uses LS-SVM [7] as the basic classifier, and applies aDivide and Conquer Classifier Selectionmethod to train and select proper classifiers on a distributed environment. Our approach then combines the trained classifiers into one complete classification model. In comparison with other MCS systems, our framework can initiate local computa-tion of dataset processing on each local site. In order to integrate the distributed information on each local site, we provide a portal that contains necessary meta-data and query management tools for usability and interoperability. This allows the users to perform analysis on a more abstract level without learning data management on each local sites and the whole distributed environment.

The rest of the paper is organized as follows: In section 2 we provide the related work that has been done by other researchers. Section 3 presents our architecture, while in section 4 we discuss the methodology of the research. In section 5 we present our preliminary results. Conclusions and future work are described in section 6.

2 Related Work

Many researchers have proposed several kinds of multiple classifier systems (MCSs) for different problems in the literature. Zhu et al. design an attributed-oriented dynamic classifier selection method to mining stream data [6]. They use statistical information of attributes to split the evaluation set into disjoint sub-sets, and then evaluate the classification accuracy of each base classifier based on these subsets during training. Finally, they select the corresponding subsets

(3)

using attribute values, and select the base classifier with the highest classifica-tion accuracy during tests. Woods et al. propose a dynamic classifier selecclassifica-tion method based on local feature surrounding an unknowing test sample [5]. In this approach, the local accuracy estimation of individual classifier can help de-termine where a classifier performs most reliability in feature space. In [8], the dynamic classifier integration framework is proposed to overcome the diversity of application domains of pattern recognition. They consider the local expertise of each classifier, and combine the advantages of classification fusion and classi-fier selection to find the best single classiclassi-fier. In order to provide an overview of MCS, Ranawana et al. explains the basic principles of MCS design [4].

Scalability is a serious problem for classifier systems using support vector machine (SVM) because the total number of used support vectors could be very large. Therefore, Tran et al. propose a clustering support vector machine method [9]. Their method is to assign all the data of each class to K groups using the K-means algorithm, and then train the SVM based on the central vec-tors of each group. Woodsend et al. also propose a SVM which supports parallel computing [10]. The method distributes full data sets evenly amongst the pro-cessors, use an interior point method to give efficient optimization, and utilize Cholesky decomposition to give good numerical stability. Hush et al. study the convergence properties of a general class of decomposition algorithms for sup-port vector machines (SVMs) [11], develop a model algorithm for decomposition, and prove the necessary and sufficient conditions for stepwise improvement of the algorithm. Collobert et al. propose a parallel mixture method of multiple SVMs [12], in which a single set of SVMs is trained on subsets of the training set, and a neural network is used to provide a class prediction and to assign samples to different subsets. Due to the cost of the memory limitation and the computation time, Chang et al. develop a PSVM algorithm [13] to improve the scalability of SVMs. The approach loads the essential data to each machine for parallel computing, and usse matrix theory for memory reduction. Ali et al. de-velop a gird-based distributed SVM algorithm [14] which integrates data mining algorithm and Parallel SVM using MPI to improve the computing performance.

3 System Architecture

The proposed system is a general-purpose multiple SVM classifier system which is based on a cluster-based data center, where data are too huge to be stored at one single site (see Figure 1). Since data are inherently distributed, an integrated interface (portal) is required to simplify users’ requests. The requests are then mapped into several independent query commands and issued to each site for preliminary processing. Then the computing results, such as classifier accuracy and rule models, are collected and filtered by middleware services, and finally formalized by the integrated interface. The functionality provided by the data center can be abstracted to four layers with several system sub-components (see Figure 2): the Virtual Observatory Portal Layer, the Query Management Layer, the Data Execution & Store Layer, and the Full Source Storage Layer.

(4)

Fig. 1.A cluster-based data center which supports distributed classiﬁcation

Fig. 2.The proposed system which can be abstracted to four layers with several system sub-components

(5)

3.1 Virtual Observatory Portal

The Virtual Observatory Portal layer is the entry point of the proposed system for users. It contains a user interface with related metalog which stores the metadata of the source data and provides necessary information which is related to the data. It enables the users to perform an overall search over the distributed data. Furthermore, users can submit queries with some basic functions through the portal. This layer also provides an interface to control the process of SVM classiﬁers training by assigning the groups of the training datasets.

3.2 Query Management Layer

The Query management layer contains four modules: the metadata table, the query dispatcher, the DCS table, and the data receiver service. The metadata table maintains related information of the computing nodes in the data center, such as boundary, type, replica, and other necessary information to determine job scheduling. The metadata table also maintains the historical data of user queries and status. The historical data can be used to optimize the system performance (e.g., adjusting the data replication strategy). The query dispatcher receives user queries from the portal and breaks them into several sub-queries. It then spreads the commands to the computing nodes with the desired data. The DCS (Dynamic Classifier Selection) table maintains the accuracy information of all classifiers when a distributed classification job is issued. The DCS table also maintains the information of proper classifiers of each groups based on some given criteria, such as an accuracy threshold and the top n classifiers. The rules of the selected classifiers are then combined into one complete model for the classification request. The data receiver is responsible to collect the partial results from computing nodes and to deliver the results to the portal.

3.3 Data Execution and Store Layer

The data execution & store layer is constructed by several clusters, which main-tain different kinds of data for different purposes. There are three components in a computing node to perform local query and classification: the SVM classifier, the workflow control module, and the storage system built by a multi-dimension data warehouse management system [15].

The SVM classifier utilizes the technique of Support Vector Machine, a ma-chine learning algorithm derived from statistical learning theory. SVM has been widely used in many application domains. It is advantageous to use SVM in our system because SVM is relatively easier to implement and can deal with differ-ent kind of data without too much modification. A workflow control module is provided for the data warehouse based management system on the bottom level which splits a user query into several fundamental queries, and rearranges the queries into a new order to improve execution efficiency. A special node, called the general-purpose node, works independent of all clusters in the proposed sys-tem. It provides a global classifier that communicates with other computing nodes, receives a subset of data of interest, and generates rules based on the

(6)

data. The global classiﬁer guarantees a lower bound of accuracy through global dataset.

The storage system uses a data warehouse architecture. It is derived from [15] of which the schema is designed for a centralized system. To improve the schema into decentralize schema, we use the data partitioning and allocation strategy in [16]. This approach enables parallel queries on a distributed multi-dimension data warehouse management system.

3.4 Full Source Storage

The full source storage system maintains the source data and provides data to the data execution & store layer when needed. Compared to the full source storage system, the storage system at the data execution & store layer only stores the data which researchers are interested in. In other words, the full source storage system is independent of the data execution & store layer, on which each computing node only maintains a subset of the source data. Notice that the cost of maintaining a full copy of source data in the system is high because it could consume too much space (in tera-bytes or even in peta-bytes) and network bandwidth. Therefore, the full source storage system only stores raw data. When the full source storage system receives a data request, it will extract, transform, and load the source data to the storage system at the data execution & store layer.

4 Methodology

The size of today’s astronomical data is usually extremely large, and thus a single-processor classification method barely handles the scale of data. There-fore, we employ a parallel approach on a distributed storage system to deal with the data using a multiple SVM classifier system, as well as a distributed, dynamic classifier selection method calledthe Divide and Conquer Classifier Se-lection (DC-CS) service to find the best classifiers and rule models to improve classification accuracy.

The main procedure of the DC-CS service is shown as follows:

1. The user deﬁnes the criteria for training set selection from the portal.

2. When the query dispatcher receives a user request, it decomposes the request into several small sub-queries, and sends them to the computing nodes with the data of interest.

3. When a computing node receives a query, the workﬂow control module starts computing the best execution strategy by rearranging the execution order of sub-queries.

4. Once a computing node receives a query from the query dispatcher, the sys-tem extracts data from the local storage, puts the local training data sets into SVM classiﬁers, and generates a classiﬁcation model.

5. All involved computing nodes randomly select n% (based on users’ rules) of data from the training set, send them to the global classiﬁer on the general-purpose computing node. The global classiﬁer does not obtain any data from

(7)

the full source storage system. As soon as it receives all required data, it begins the same training procedure as what the other computing nodes do. Notice that the training procedure in Steps 4 and 5 can execute in parallel.

6. After the training and testing procedures ﬁnish, each computing node sends the accuracy information of each groups to the DCS table.

7. The DCS table maintains the accuracy information of all classifiers for each group, including the global classifier. After the accuracy information of each classifier is received, the DCS table chooses the best classifiers of each group defined based on users’ selection criteria. The DCS table then combines the selected classifiers of each group into one complete classification model.

8. The system obtains the complete classification model and stores the selected classifiers for future use. Other training data sets can be used to refine the classification model to provide better classification accuracy.

The difference between the DC-CS service and other DCS algorithms is that the traditional algorithms (i.e. [5, 6]) need to: (1) collect all the training sets, (2) to do some static analysis in advance, and (3) to cut data into chunks. The large amount of data in each computing nodes thus could consume many resources if a traditional DCS algorithm is applied. On the contrary, the DC-CS service can have better efficiency by evenly distributing the workload and yet maintain acceptable classification accuracy. Furthermore, the global classifier can also provide a lower-bound accuracy in the worst case. The strategy makes the DC-CS service a practical solution in a large-scale, distributed environment. A good data replication and replacement strategy can improve system perfor-mance, especially when the scale of data is huge. A bad method, on the other hand, could lead the system to an unrecoverable state when accidents happen (i.e. power failure). Our system uses the full-replica strategy (see Figure 3), which duplicates all the data in each node to some other nodes. Each dataset in the system has two replicas. The first replica will be placed at the neighbor node with the closest distance to the source dataset. The second one is placed at a randomly selected node in another cluster, if any. A node only communicates with the other nodes with replicas, thus reducing the communication overhead among computing nodes. Another reason for taking a full-replica strategy rather than using a block-interleaved approach (i.e., splitting data into several small blocks and duplicating the blocks) is that each computing node can have a larger continuous data set. Thus the SVM classifier could have more overlapped train-ing data and test data if the size of each local traintrain-ing data set is too small to use. For example, we can use a threshold which is less than n% of the size of the total training datasets, where n is the total number of the computing nodes. To have better accuracy, the replica on the nearest node is activated to provide extra datasets.

5 Experimental Results

In this section, we evaluate the performance of the proposed system using two diﬀerent kinds of real astronomical data: theIPP diﬀ image log and theMOPS

(8)

data. The expriments use ten Intel quad-core computers with 8G memory for parallel computing, and two storage servers to provide the source data. All com-puting nodes are connected via Fast Ethernet (100 MBit/s full duplex) switches. The IPP (Image processing pipeline) diff image log is generated from the Pan-STARRS project [17], in which telescopes scan the whole sky and create images. The diff image log provides difference data of related images within a month to show the magnitude variation in different bands for each observed sky object. The MOPS (The Moving Object Processing System) data is also produced by the Pan-STARRS project. It is a collection of datasets of moving objects (i.e. asteroids) in the sky. There are over 380,000 observed objects until now, and the size will grow up to six times as big as the current data size when the Pan-STARRS observatory starts working. To show the efficiency and accuracy of the proposed system, we use the single SVM classifier method as the comparison basis. The single SVM classifier method uses a simple greedy algorithm on a dis-tributed environment. It selects the node with the largest local subset, acquires the missing part of data from other nodes, and performs a model training using the full training set.

Figure 4 shows the experimental results using diﬀerent sizes of training sets of the IPP data. The astronomy experts mark the objects in diﬀ image log based on magnitude and clarity of objects within 5 groups: good detections, possible good detections, unknown objects, possible bad detections, and bad detections. Figure 4(a) shows the total data transfer size between nodes during

(9)

Fig. 4.Performance evaluation using the IPP data

(10)

classification. It confirms that the single classifier needs to collect the whole data to a single computing node and requires more data migration. The DC-CS service, on the contrary, only needs to transfer randomly picked data to the global classifier and takes less bandwidth. Figure 4(b) shows that the accuracy of the proposed system is usually a little bit lower than the single classifier because the datasets are decomposed into smaller subsets. The figure also indicates that both methods can have the same accuracy if the data size is large. Figure 4(c) shows the total execution time from query submission to computation completion. The proposed DC-CS method is obviously a better choice because it is very scalable. Figure 4(d) shows the accuracy-time ratio, which is defined as follows:

Accuracy/Execution T ime(using DC−CS)

Accuracy/Execution T ime(using Single Classifier) (1) Figure 5 shows the results using diﬀerent sizes of training sets of the MOPS data. Each moving object in the MOPS data contains 6 main attributes, which can be used to classify objects into diﬀerent groups that have similar state [18]. The experimental results in Figure 5 are similar to the results in Figure 4, except that the average accuracy is lower because the data groups of MOPS data are more complicated.

All the experiments show that the accuracy using a single classifier could be a little higher than other classifier selection methods because it considers the full training sets. However, the accuracy of both methods is almost the same if the size of the datasets is large. The DC-CS method is obviously more efficient, and it can be ten times as fast as the single classifier method when the data size is large enough. The experimental results also show that our system can efficiently handle heterogeneous data.

6 Conclusions and Future Work

In this paper, we proposed and implemented a multiple classifier framework and system in a data center for astronomy research. We developed the DC-CS service, a multiple-classifier framework to provide fast and stable classification in a distributed environment. The global classifier of the DC-CS service ensures the classification accuracy in the worst case. We used real astronomical data to do several experiments, and the results showed that the proposed system is scalable, efficient, and accurate. The possible directions of our future work include: (1) expanding the system into a larger and more distributed storage system, (2) optimizing the system performance, and (3) integrating the services into a SOA architecture.

Acknowledgements

This research was sponsored by National Science Council, Taiwan under the grant 98-2219-E-002-022.

(11)

References

1. SDSS: Sloan digital sky survey (2009),http://www.sdss.org

2. PPDG: Particle physics data grid project (2009),http://www.ppdg.net

3. Austin, J., Davis, R., Fletcher, M., Jackson, T., Jessop, M., Liang, B., Pasley, A.: Dame: Searching large data sets within a grid-enabled engineering application. In: Proceedings of the IEEE, March 2005, pp. 496–509 (2005)

4. Ranawana, R., Palade, V.: Multi-classiﬁer systems: Review and a roadmap for developers. International Journal of Hybrid Intelligent Systems 3(1), 35–61 (2006) 5. Woods, K., Kegelmeyer Jr., W.P., Bowyer, K.: Combination of multiple classi-ﬁers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4), 405–410 (1997)

6. Zhu, X., Wu, X., Yang, Y.: Effective classification of noisy data streams with attribute-oriented dynamic classifier selection. Knowledge and Information Sys-tems 9(3), 339–363 (2006)

7. Suykens, J.A.K., Vandewalle, J.: Least squares support vector machine classiﬁers. Neural Processing Letters 9(3), 293–300 (1999)

8. Kim, E., Ko, J.: Dynamic classiﬁer integration method. Multiple Classiﬁer Systems, 97–107 (2005)

9. Tran, Q.A., Zhang, Q.L., Li, X.: Reduce the number of support vectors by using clustering techniques. Machine Learning and Cybernetics 2, 1245–1248 (2003) 10. Woodsend, K., Gondzio, J.: High-performance parallel support vector machine

training. Parallel Scientiﬁc Computing and Optimization 27, 83–92 (2008) 11. Hush, D., Scovel, C.: Polynomial-time decomposition algorithms for support vector

machines. Machine Learning 51(1), 51–71 (2003)

12. Collobert, R., Bengio, S., Bengio, Y.: A parallel mixture of svms for very large scale problems. Neural Computation 14(5), V1105–V1114 (2002)

13. Chang, E., Zhu, K., Wang, H., Bai, H., Li, J., Qiu, Z., Cui, H.: Psvm: Paralleliz-ing support vector machines on distributed computers. In: Advances in Neural Information Processing Systems, vol. 20 (2007)

14. Meligy, A., Al-Khatib, M.: A grid-based distributed svm data mining algorithm. European Journal of Scientiﬁc Research 27(3), 313–321 (2009)

15. Tang, C.H., Yu, C.H., Shen, C.H., Tsai, M., Wang, W.J., Chang, Z.W., Chen, W.P.: A system design for terabyte-scale, distributed multidimensional data management and analysis in the taos project. In: Proceedings for the HPC Asia and APAN 2009, pp. 336–342 (2009)

16. Nguyen, T.M.: Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval DevelopmentXInnovative Methods and Applications. Information Science Reference (2009)

17. pan STARRS: The panoramic survey telescope and rapid response system (2009),

http://pan-STARRS.ifa.hawaii.edu

18. Bendjoya, Philippe, Zappal: Asteroid family identiﬁcation. in asteroids iii, pp. 613– 618. University of Arizona Press (2002)