Traditional Centralised Computing - Support Vector Machine (SVM) aggregation modelling for spat

Traditionally in a centralised computing model we set up monitoring stations that monitor our environment through various sensor devices, which then transfer data to

one powerful centralised computer. To conduct an analysis based on prediction the air pollution status in the future is determined (Varshney & Mohan, 2005). However,

there are a few drawbacks associated with this type of centralised computing explained earlier in chapter 1. These drawbacks provided us with research opportunities to

explore and improve decision making capability using a distributed scenario for the air pollution problem through knowledge sharing. The next section proposes motivation

for the proposed research method.

2.8.1 Big Data

Machine learning algorithms have to keep pace with the latest continuous develop-

ments, application demands and solutions to problems where, data sets have become larger and larger. In such large data sets, mining for speciﬁc patterns is always a

putation experiment. The solution to this problem was proposed by (Moretti, Stein-

haeuser, Thain, & Chawla, 2008), and was accomplished by dividing the data and computation using a distributed ensemble of classiﬁers. In that research, a method

was proposed that used abstraction for scalable data mining to address the above challenges. The technical implementation used distributed cloud computing multiple

models and considered several performance aspects including work load balancing

and system conﬁguration settings. The performance and scalability of ensembles was tested with a wide variety of data sets and algorithms on a condor-based pool along

with Chirp for handling storage of data.

The divide and conquer strategy was used for the SVM ensemble approach. The

three stage SVM ensemble algorithm (Yang & Shi, 2010) was proposed to improve prediction accuracy and generalisation performance of the hectic time series data. In

the ﬁrst stage, a fuzzy c-means algorithm was implemented to divide the input data set into subsets. In the second stage, the best ﬁt partitioned subsets were constructed

via SVMs with composite kernels and their hyper-parameters were developed through particle swarm optimisation (PSO). In the ﬁnal stage, a fuzzy synthesis algorithm

was applied to combine the outputs of various sub-models to obtain the ﬁnal output. Experiment results on hectic time-series data showed that the proposed three stage

SVM ensemble algorithm had good performance when compared with other existing algorithms in this research for the time series prediction task.

Increasing the number of self-directed data-sources results in concerns for distributed knowledge discovery and data mining (Visalakshi & Thangavel, 2008). Dis-

tributed data sets were clustered with the help of a distributed clustering algorithm. In this algorithm, each object was assigned to multiple clusters (membership

weights sum to one for these collective assignments). A new soft clustering algorithm (Visalakshi & Thangavel, 2008) was proposed that modiﬁes the existing distributed

K-Means algorithm. This new proposed algorithm was capable of clustering multiple homogeneous data sources, which were distributed over various local sites and were

obtained by combining local clustering results. To cluster local data sets, the fuzzy c-means algorithm was deployed where, the centroids of individual data sets formed

an ensemble. Local centroids were clustered using the K-Means algorithm. The ex-

periment results depicted better performance than conventional centralised the fuzzy c-means clustering algorithm.

2.8.2 SVM Ensemble

In a wide range of applications, the combination of classiﬁers reduces the classiﬁcation

errors. SVM ensembles with bagging have shown better performance in classiﬁcation (Alham, Li, Liu, Ponraj, & Qi, 2012) compared with a single SVM. Due to the

presence of large replicated data sets, the training process of the SVM ensemble is computationally intensive. In order to solve this problem, a MapReduce based Dis-

tributed SVM Ensemble (MRESVM) algorithm was proposed for image annotation. In this study a training dataset was re-sampled based on bootstrapping and SVM. It

was trained on each data set in parallel, using a cluster of computers. Experiment

results showed that the proposed algorithm reduced the training time signiﬁcantly and achieved higher classiﬁcation accuracy (Alham et al., 2012).

A novel online ensemble learning algorithm (Canzian, Zhang, & van der Schaar, 2015) was proposed. It consisted of various distributed local learners that analysed

different streams of data related to an event which needed to be classified. Each learner used a local classifier to make local predictions. The local predictions were

then collected by each learner and combined by a majority of voting rule to produce the ﬁnal prediction. This is one approach to combining and collecting of data from

multiple sources for decision making.

One can argue that merging data to a single location and performing a series of

merges and joins, could result in the production of a single flat file. Furthermore, af- ter randomising and subsampling this file, some algorithms can be used for problem

solving using this data. However, in reality this approach is not an eﬃcient way of problem solving, including issues with computational cost (time) and complexity as

well as potentially vast data storage requirements (Canzian et al., 2015). Moreover, in some cases, it may be practically impossible owing to real life constraints including

tions. This is one reason why modern techniques demand that agencies attempt to

integrate their databases and analytical techniques. Such integration reduces overall complexity and can lead to the production of speedier results.

The method in the literature that is closely aligned with our study is the online ensemble learning technique. This method circumvents the requirement for data

storage as data are processed on arrival. Additionally, the method is optimised for

online learning and minimises data processing overheads. An optimisation of this method is the use of perceptron weighted majority (PWC) (Canzian et al., 2015).

Several authors indicate the potential of these approaches for further investigation (H. Wang, Fan, Yu, & Han, 2003; Masud, Gao, Khan, Han, & Thuraisingham, 2009;

Street & Kim, 2001; Avidan, 2007).

Our work diﬀers from these techniques because: (i) our work requires the data set

to be a subsample, and (ii) we process using accurate local classiﬁers and an accurate combination method. We diﬀerentiate from previous works in other ways. Firstly, air

pollution data are distributed, stored and collected at various monitoring stations. We considered a distributed SVM ensemble approach for decision support for solving

air pollution problems. This approach has not been documented. Secondly, the SVM ensemble works on the strategy of divide and conquer, which is deployed in our

research for decomposition of a complex air pollution quality problem into simpler local sub problems. The reason for adopting this approach is threefold: (i) it can deal

with huge image sizes and long-term historic spatio-temporal data using SVMs, and (ii) we can improve the decision making performance by analysing each monitoring

station robustly owing to the beneﬁts of scalability, and (iii) it will help to improve our timely decision making owing to the incremental learning approach.

In document Support Vector Machine (SVM) aggregation modelling for spatio-temporal air pollution analysis (Page 51-54)