Strategies for Approximate String Matching on an MPI Heterogeneous System Environment
Panagiotis D. Michailidis and Konstantinos G. Margaritis
Parallel and Distributed Processing Laboratory Department of Applied Informatics, University of Macedonia
156 Egnatia str., P.O. Box 1591, 54006 Thessaloniki, Greece {panosm,kmarg}@uom.gr
http://macedonia.uom.gr/~ {panosm,kmarg}
Abstract. In this paper, we present three parallel approximate string matching methods on a parallel architecture with heterogeneous work- stations to gain supercomputer power at low cost. The first method is the static master-worker with uniform distribution strategy, the second one is the dynamic master-worker with allocation of subtexts and the third one is the dynamic master-worker with allocation of text pointers. Fur- ther, we propose a hybrid parallel method that combines the advantages of static and dynamic parallel methods in order to reduce the load imbal- ance and communication overhead. This hybrid method is based on the following optimal distribution strategy: the text collection is distributed proportional to workstation’s speed. We evaluated the performance of four methods with clusters 1, 2, 4, 6 and 8 heterogeneous workstations.
The experimental results demonstrate that the dynamic allocation of text pointers and hybrid methods achieve better performance than the two original ones.
1 Introduction
Approximate string matching is one of the main problems in classical string algorithms, with applications to information and multimedia retrieval, compu- tational biology, pattern recognition, Web search engines and text mining. It is defined as follows: given a large text collection t = t
1t
2...t
nof length n, a short pattern p = p
1p
2...p
mof length m and a maximal number of errors allowed k, we want to find all text positions where the pattern matches the text up to k errors. Errors can be substituting, deleting, or inserting a character.
In the on-line version of the problem, it is possible to preprocess the pattern but not the text collection. The classical solution involves dynamic program- ming and needs O(mn) time [14]. Recently, a number of sequential algorithms improved the classical time consuming one; see for instance the surveys [7,11].
Some of them are sublinear in the sense that they do not inspect all the charac- ters of the text collection.
D. Kranzlm¨uller et al. (Eds.): Euro PVM/MPI 2002, LNCS 2474, pp. 432–440, 2002.
Springer-Verlag Berlin Heidelberg 2002c
We are particularly interested in information retrieval, where current free text collections is normally so very large that even the fastest on-line sequential algorithms are not practical, and therefore the parallel and distributed process- ing becomes necessary. There are two basic methods to improve the performance of approximate string matching on large text collections: one is based on the fine- grain parallelization of the approximate string matching algorithm [2,12,13,6,4,5]
and the other is based on the distribution of the computation of character com- parisons on supercomputers or network of workstations. As far as the second method, is concerned distributed implementations of approximate string match- ing algorithm are not available in the literature. However, we are aware of few attempts for implementing other similar problems on a cluster of workstations.
In [3] a exact string matching implementation have been proposed and results are reported on a transputer based architecture. In [9,10] a exact string match- ing algorithm was parallelized and modeled on a homogeneous platform giving positive experimental results. Finally, in [5,16] presented parallelizations of a bi- ological sequence analysis algorithm on a homogenous cluster of workstations and on an Intel iPSC/860 parallel computer respectively. However, the general efficient algorithms for the master-worker paradigm on heterogeneous clusters have been widely developed in [1].
The main contribution of this work is three low-cost parallel approximate string matching approaches that can search in very large free textbases on inex- pensive cluster of heterogeneous PCs or workstations running Linux operating system. These approaches are based on master-worker model using static and dynamic allocation of the text collection. Further, we propose a hybrid parallel approach that combines the advantages of three previous parallel approaches in order to reduce the load imbalance and communication overhead. This hybrid approach is based on the following optimal distribution strategy: the text collec- tion is distributed proportional to workstation’s speed. The four approaches are implemented using the MPI library [15] over a cluster of heterogeneous worksta- tions. To the best of our knowledge, this is the first attempt the implementation of approximate string matching application using static and dynamic load bal- ancing strategies on a network of heterogeneous workstations.
2 MPI Master-Worker Implementations of Approximate String Matching
We follow master-worker programming model to develop our parallel and dis- tributed approximate string matching implementations under MPI library [15].
2.1 Static Master-Worker Implementation
In order to present the static master-worker implementation we make the follow-
ing assumptions: First, the workstations are numbered from 0 to p − 1, second,
the documents of our text collection are distributed among the various work-
stations and stored on their local disks and finally, the pattern and the number
of errors k are stored in the main memory to all workstations. The partitioning strategy of this approach is to partition the entire text collection into a number of the subtext collections according to the number of workstations allocated.
The size of each subtext collection should be equal to the size of the text col- lection divided by the number of allocated workstations. Therefore, the static master-worker implementation that is called P1 is composed of four phases. In first phase, the master broadcasts the pattern string and the number of errors k to all workers. In second phase, each worker reads its subtext collection from the local disk in the main memory. In third phase, each worker performs character comparisons using a local sequential approximate string matching algorithm to generate the number of occurrences. In fourth phase, the master collects the number of occurrences from each worker.
The advantage of this simple approach is low communication overhead. This advantage was achieved, a priori, by the search computation, assigning each worker to search its own subtext independently without have to communicate with the other workers or the master. However, the main disadvantage is the pos- sible load imbalance because of the poor partitioning technique. In other words, there is a significant idle time for faster or more lightly loaded workstations on a heterogeneous environment.
2.2 Dynamic Master-Worker Implementations
In this subsection, we implement two versions of the dynamic master-worker model. The first version is based on the dynamic allocation of subtexts and the second one is based on the dynamic allocation of text pointers.
Dynamic Allocation of Subtexts The dynamic master-worker strategy that
we adopted is a known parallelization strategy and is known as ”workstation
farm”. Before, we present the dynamic implementation we make the following
assumption: the entire text collection is stored on the local disk of the master
workstation. The dynamic master-worker implementation that is called P2 is
composed of six phases. In first phase, the master broadcasts the pattern string
and the number of errors k to all workers. In second phase, the master reads from
the local disk the several chunks of the text collection. The size of each chunk (sb)
is an important parameter which can be affect the overall performance. More
specifically, this parameter is directly related to the I/O and communication
factors. We selected several sizes of each chunk in order to find the best perfor-
mance as we presented in our experiments [8]. In third phase, the master sends
the first chunks of the text collection to corresponding worker workstations. In
fourth phase, each worker workstation performs a sequential approximate string
matching algorithm between the corresponding chunk of text and the pattern in
order to generate the number of occurrences. In fifth phase, each worker sends
the number of occurrences back to master workstation. In sixth phase, if there
are still any chunks of the text collection left, the master reads and distributes
next chunks of the text collection to workers and loops back to fourth phase.
The advantage of this dynamic approach is low load imbalance, while the disadvantage is higher inter-workstation communication overhead.
Dynamic Allocation of Text Pointers Before, we present the dynamic im- plementation with the text pointers we make the following assumptions: First, the complete text collection is stored on the local disks of all workstations and second, the master workstation has a text pointer that shows the current position in the text collection. The dynamic allocation of text pointers that is called P3 is composed of six phases. In first phase, the master broadcasts the pattern string and the number of errors k to all workers. In second phase, the master sends the first text pointers to corresponding workers. In third phase, each worker reads from the local disk the sb characters of text starting from the pointer that re- ceives. In fourth phase, each worker performs a sequential approximate string matching procedure between the corresponding chunk of text and the pattern in order to generate the number of occurrences. In fifth phase, each worker sends the result back to master. In sixth phase, if the text pointer does not reach the end of the text, then master updates the text pointers for the next position of next chunks of text and sends the pointers to workers and loops back to third phase.
The advantage of this simple implementation is that reduces the inter work- station communication overhead since each workstation in this scheme has an entire copy of the text collection on the local disk. However, this scheme requires more local space (or disk) requirements, but the size of the local disk in parallel and distributed architectures is large enough.
2.3 Hybrid Master-Worker Implementation
Here, we develop a hybrid master-worker implementation that combines the advantages of static and dynamic approaches in order to reduce the load imbal- ance and communication overhead. This implementation is based on the optimal distribution strategy of the text collection that is performed statically. In the following subsection, we describe the optimal text distribution strategy and its implementation.
Text Distribution and Load Balancing To avoid the slowest workstations to determine the parallel string matching time, the load should be distributed proportionally to the capacity of each workstation. The goal is to assign the same amount of time, which may not correspond to the same amount of the text collection.
A balanced distribution is achieved by a static load distribution made prior to the execution of the parallel operation. To achieve a good balanced distribu- tion among heterogeneous workstations, the amount of text distributed to each workstation should be proportional to its processing capacity compared to the entire network:
l
i= S
i p−1j=0
S
j(1)
where S
jis the speed of the workstation j. Therefore, the amount of the text collection that is distributed to each workstation M
i(1 ≤ i ≤ p) is l
i∗n, where n is the length of the complete text collection.
The hybrid implementation that is called P4 is same as the P1 implementa- tion but we use the optimal distribution method instead of the uniform distri- bution one. The four entire parallel implementations are constructed so that al- ternative sequential approximate string matching algorithms can be substituted quite easily [7,11]. In this paper, we use the classical SEL dynamic programming algorithm [14].
3 Experimental Results
In this section, we discuss the experimental results for the performance of four parallel and distributed algorithms. These algorithms are implemented in C pro- gramming language using the MPI library [15].
3.1 Experimental Environment
The target platform for our experimental study is a cluster of heterogeneous workstations connected with 100 Mb/s Fast Ethernet network. More specifi- cally, the cluster consists of 4 Pentium MMX 166 MHz with 32 MB RAM and 6 Pentium 100 MHz with 64 MB RAM. A Pentium MMX is used as master work- station. The average speeds of the two types of workstations, Pentium MMX and Pentium, for the four implementations are listed in Table 1. The MPI im- plementation used on the network is MPICH version 1.2. During all experiments, the cluster of workstations was dedicated. Finally, to get reliable performance results 10 executions occurred for each experiment and the reported values are the average ones. The text collection we used was composed of documents, which were portion of the various web pages.
3.2 Experimental Results
In this subsection, we present the experimental results concluding from two sets of experiments. For the first experimental setup, we study the performance of four master-worker implementations P1, P2, P3 and P4. For the second ex- perimental setup, we examine the scalability issue of our implementations by doubling the text collection.
Table 1. Average speeds (in chars per sec) of the two types of workstations
Application Pentium MMX Pentium P1,P4 3693675.74 2200079.448
P2 3753988.176 2194043.93
P3 3737505.55 2197613.22
Comparing the Four Types of Approximate String Matching Imple- mentations Before we present the results for four methods, we determined from the extensive experimental study [8] that the block size nearly sb=100,000 characters produces optimal performance for two dynamic master-worker meth- ods P2 and P3, later experiments are all performed using this optimal value for the P2 and P3. Further, from [8] we observed that the worst performance is obtained for very small and large values of block size. This is because small values of block size increase the inter-workstation communication, while large values of block size produce poorly balanced load.
Figures 1 and 2 show the execution times and the speedup factors with re- spect to the number of workstations respectively. It is important to note that the execution times and the speedups, which are plotted in Figures 1 and 2 are result of average for five pattern lengths (m=5, 10, 20, 30 and 60) and four values of the number of errors ( k=1, 3, 6 and 9). The speedup of a heteroge- neous computation is defined as the ratio of the sequential execution time on the fastest workstation to the parallel execution time across the heterogeneous cluster. To have a fair comparison in terms of speedup, one defines the system computing power, which considers the power available instead of the number of workstations. The system computing power defines as follows:
p−1i=0
S
i/S
0for p workstations used, where S
0is the speed of the master workstation.
As we have expected, performance results show that the P1 implementation using static load balancing strategy is less effective than the other three imple- mentations in case of heterogeneous network. This fact due to the presence of waiting time associated to communications. In other words, the slowest work- station is always the latest one in string matching computation. Further, the P2 implementation using dynamic allocation of subtexts produces better results than the P1 one in case of heterogeneous cluster. Finally, the experimental results show that the P3 and P4 implementations seem to have the best performance compared with the others in case of heterogeneous cluster. These implementa- tions give smaller execution times and higher speedups than in case of using
10 20 30 40 50 60 70 80 90 100
1 2 3 4 5 6 7 8
Time (in seconds)
Number of workstations SEL search algorithm, n=13MB and k=3
P1 P2P3 P4
5 10 15 20 25 30 35 40
1 2 3 4 5 6 7 8
Time (in seconds)
Number of workstations SEL search algorithm, n=13MB and m=10
P1 P2P3 P4
Fig. 1. Experimental execution times (in seconds) for text size of 13MB and k=3
using several pattern lengths (left) and m=10 using several values of k (right)
1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6
1 2 3 4 5 6 7 8
Speedup
Number of workstations SEL search algorithm, n=13MB and k=3
P1P2 P3 P4
1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6
1 2 3 4 5 6 7 8
Speedup
Number of workstations SEL search algorithm, n=13MB and m=10
P1P2 P3 P4
Fig. 2. Speedup of parallel approximate string matching with respect to the number of workstations for text size of 13MB and k=3 using several pattern lengths (left) and m=10 using several values of k (right)
the P1 and P2 ones when the network becomes heterogeneous, i.e. after the 3rd workstation.
We now examine the performance of the P2, P3 and P4 parallel implemen- tations. From the results, we see a clear reduction in the computation time of the algorithm when we use the three parallel implementations. For instance, with k=3 and several pattern lengths, we reduce the average computation time from 95.085 seconds in the sequential version to 17.206, 16.176 and 16.040 sec- onds in the distributed implementations P2, P3 and P4 respectively using 8 workstations. In other words, from the Figure 1 we observe that for constant total text size there is an expected inverse relation between the parallel execu- tion times and the number of workstations. Further, the three master-worker implementations achieve reasonable speedups for all workstations. For example, with k=3 and several pattern lengths, we had an increasing speedup curves up to about 5.52, 5.86 and 5.91 in distributed methods P2, P3 and P4 respectively on the 8 workstations which had the computing power of 5.92, 5.93 and 5.97.
Scalability Issue To study the scalability of three proposed parallel implemen-
tations P2, P3 and P4, we setup the experiments in the following way. We simple
double the old text size two times. This new text collection is around 27MB. Re-
sults from these experiments have been depicted in Figures 3 and 4. The results
show that the three parallel implementations still scales well though the prob-
lem size has been increased two times (i.e. doubling the text collection). The
average execution times for k=3 and several pattern lengths similarly decrease
to 34.063, 32.298 and 32.150 seconds for the P2, P3 and P4 implementations
respectively when the number of workstations have been added to 8. Moreover,
speedup factors of three methods also linearly increase when the workstations
are increased. Finally, the best performance results are obtained with the P3
and P4 load balancing methods.
20 40 60 80 100 120 140 160 180 200
1 2 3 4 5 6 7 8
Time (in seconds)
Number of workstations SEL search algorithm, n=27MB and k=3
P2P3 P4
10 20 30 40 50 60 70 80 90
1 2 3 4 5 6 7 8
Time (in seconds)
Number of workstations SEL search algorithm, n=27MB and m=10
P2P3 P4
Fig. 3. Experimental execution times (in seconds) for text size of 27MB and k=3 using several pattern lengths (left) and m=10 using several values of k (right)
1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6
1 2 3 4 5 6 7 8
Speedup
Number of workstations SEL search algorithm, n=27MB and k=3
P2 P3 P4
1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6
1 2 3 4 5 6 7 8
Speedup
Number of workstations SEL search algorithm, n=27MB and m=10
P2 P3 P4