A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

(1)

Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

Panagiotis D. Michailidis and Konstantinos G. Margaritis

Parallel and Distributed Processing Laboratory Department of Applied Informatics, University of Macedonia

156 Egnatia str., P.O. Box 1591, 54006 Thessaloniki, Greece {panosm,kmarg}@uom.gr

http://macedonia.uom.gr/~ {panosm,kmarg}

Abstract. In this paper, we present three parallel approximate string matching methods on a parallel architecture with heterogeneous work- stations to gain supercomputer power at low cost. The first method is the static master-worker with uniform distribution strategy, the second one is the dynamic master-worker with allocation of subtexts and the third one is the dynamic master-worker with allocation of text pointers. Fur- ther, we propose a hybrid parallel method that combines the advantages of static and dynamic parallel methods in order to reduce the load imbal- ance and communication overhead. This hybrid method is based on the following optimal distribution strategy: the text collection is distributed proportional to workstation’s speed. We evaluated the performance of four methods with clusters 1, 2, 4, 6 and 8 heterogeneous workstations.

The experimental results demonstrate that the dynamic allocation of text pointers and hybrid methods achieve better performance than the two original ones.

1 Introduction

Approximate string matching is one of the main problems in classical string algorithms, with applications to information and multimedia retrieval, compu- tational biology, pattern recognition, Web search engines and text mining. It is deﬁned as follows: given a large text collection t = t

1

t

2

...t

_n

of length n, a short pattern p = p

1

p

2

...p

m

of length m and a maximal number of errors allowed k, we want to ﬁnd all text positions where the pattern matches the text up to k errors. Errors can be substituting, deleting, or inserting a character.

In the on-line version of the problem, it is possible to preprocess the pattern but not the text collection. The classical solution involves dynamic program- ming and needs O(mn) time [14]. Recently, a number of sequential algorithms improved the classical time consuming one; see for instance the surveys [7,11].

Some of them are sublinear in the sense that they do not inspect all the charac- ters of the text collection.

D. Kranzlm¨uller et al. (Eds.): Euro PVM/MPI 2002, LNCS 2474, pp. 432–440, 2002.

Springer-Verlag Berlin Heidelberg 2002c

(2)

We are particularly interested in information retrieval, where current free text collections is normally so very large that even the fastest on-line sequential algorithms are not practical, and therefore the parallel and distributed process- ing becomes necessary. There are two basic methods to improve the performance of approximate string matching on large text collections: one is based on the ﬁne- grain parallelization of the approximate string matching algorithm [2,12,13,6,4,5]

and the other is based on the distribution of the computation of character com- parisons on supercomputers or network of workstations. As far as the second method, is concerned distributed implementations of approximate string match- ing algorithm are not available in the literature. However, we are aware of few attempts for implementing other similar problems on a cluster of workstations.

In [3] a exact string matching implementation have been proposed and results are reported on a transputer based architecture. In [9,10] a exact string match- ing algorithm was parallelized and modeled on a homogeneous platform giving positive experimental results. Finally, in [5,16] presented parallelizations of a bi- ological sequence analysis algorithm on a homogenous cluster of workstations and on an Intel iPSC/860 parallel computer respectively. However, the general eﬃcient algorithms for the master-worker paradigm on heterogeneous clusters have been widely developed in [1].

The main contribution of this work is three low-cost parallel approximate string matching approaches that can search in very large free textbases on inex- pensive cluster of heterogeneous PCs or workstations running Linux operating system. These approaches are based on master-worker model using static and dynamic allocation of the text collection. Further, we propose a hybrid parallel approach that combines the advantages of three previous parallel approaches in order to reduce the load imbalance and communication overhead. This hybrid approach is based on the following optimal distribution strategy: the text collec- tion is distributed proportional to workstation’s speed. The four approaches are implemented using the MPI library [15] over a cluster of heterogeneous worksta- tions. To the best of our knowledge, this is the ﬁrst attempt the implementation of approximate string matching application using static and dynamic load bal- ancing strategies on a network of heterogeneous workstations.

2 MPI Master-Worker Implementations of Approximate String Matching

We follow master-worker programming model to develop our parallel and dis- tributed approximate string matching implementations under MPI library [15].

2.1 Static Master-Worker Implementation

In order to present the static master-worker implementation we make the follow-

ing assumptions: First, the workstations are numbered from 0 to p − 1, second,

the documents of our text collection are distributed among the various work-

stations and stored on their local disks and ﬁnally, the pattern and the number

(3)

of errors k are stored in the main memory to all workstations. The partitioning strategy of this approach is to partition the entire text collection into a number of the subtext collections according to the number of workstations allocated.

The size of each subtext collection should be equal to the size of the text col- lection divided by the number of allocated workstations. Therefore, the static master-worker implementation that is called P1 is composed of four phases. In ﬁrst phase, the master broadcasts the pattern string and the number of errors k to all workers. In second phase, each worker reads its subtext collection from the local disk in the main memory. In third phase, each worker performs character comparisons using a local sequential approximate string matching algorithm to generate the number of occurrences. In fourth phase, the master collects the number of occurrences from each worker.

The advantage of this simple approach is low communication overhead. This advantage was achieved, a priori, by the search computation, assigning each worker to search its own subtext independently without have to communicate with the other workers or the master. However, the main disadvantage is the pos- sible load imbalance because of the poor partitioning technique. In other words, there is a signiﬁcant idle time for faster or more lightly loaded workstations on a heterogeneous environment.

2.2 Dynamic Master-Worker Implementations

In this subsection, we implement two versions of the dynamic master-worker model. The ﬁrst version is based on the dynamic allocation of subtexts and the second one is based on the dynamic allocation of text pointers.

Dynamic Allocation of Subtexts The dynamic master-worker strategy that

we adopted is a known parallelization strategy and is known as ”workstation

farm”. Before, we present the dynamic implementation we make the following

assumption: the entire text collection is stored on the local disk of the master

workstation. The dynamic master-worker implementation that is called P2 is

composed of six phases. In ﬁrst phase, the master broadcasts the pattern string

and the number of errors k to all workers. In second phase, the master reads from

the local disk the several chunks of the text collection. The size of each chunk (sb)

is an important parameter which can be aﬀect the overall performance. More

speciﬁcally, this parameter is directly related to the I/O and communication

factors. We selected several sizes of each chunk in order to ﬁnd the best perfor-

mance as we presented in our experiments [8]. In third phase, the master sends

the ﬁrst chunks of the text collection to corresponding worker workstations. In

fourth phase, each worker workstation performs a sequential approximate string

matching algorithm between the corresponding chunk of text and the pattern in

order to generate the number of occurrences. In ﬁfth phase, each worker sends

the number of occurrences back to master workstation. In sixth phase, if there

are still any chunks of the text collection left, the master reads and distributes

next chunks of the text collection to workers and loops back to fourth phase.

(4)

The advantage of this dynamic approach is low load imbalance, while the disadvantage is higher inter-workstation communication overhead.

Dynamic Allocation of Text Pointers Before, we present the dynamic im- plementation with the text pointers we make the following assumptions: First, the complete text collection is stored on the local disks of all workstations and second, the master workstation has a text pointer that shows the current position in the text collection. The dynamic allocation of text pointers that is called P3 is composed of six phases. In first phase, the master broadcasts the pattern string and the number of errors k to all workers. In second phase, the master sends the first text pointers to corresponding workers. In third phase, each worker reads from the local disk the sb characters of text starting from the pointer that re- ceives. In fourth phase, each worker performs a sequential approximate string matching procedure between the corresponding chunk of text and the pattern in order to generate the number of occurrences. In fifth phase, each worker sends the result back to master. In sixth phase, if the text pointer does not reach the end of the text, then master updates the text pointers for the next position of next chunks of text and sends the pointers to workers and loops back to third phase.

The advantage of this simple implementation is that reduces the inter work- station communication overhead since each workstation in this scheme has an entire copy of the text collection on the local disk. However, this scheme requires more local space (or disk) requirements, but the size of the local disk in parallel and distributed architectures is large enough.

2.3 Hybrid Master-Worker Implementation

Here, we develop a hybrid master-worker implementation that combines the advantages of static and dynamic approaches in order to reduce the load imbal- ance and communication overhead. This implementation is based on the optimal distribution strategy of the text collection that is performed statically. In the following subsection, we describe the optimal text distribution strategy and its implementation.

Text Distribution and Load Balancing To avoid the slowest workstations to determine the parallel string matching time, the load should be distributed proportionally to the capacity of each workstation. The goal is to assign the same amount of time, which may not correspond to the same amount of the text collection.

A balanced distribution is achieved by a static load distribution made prior to the execution of the parallel operation. To achieve a good balanced distribu- tion among heterogeneous workstations, the amount of text distributed to each workstation should be proportional to its processing capacity compared to the entire network:

l

i

= S

_i

_p−1

j=0

S

_j

(1)

(5)

where S

j

is the speed of the workstation j. Therefore, the amount of the text collection that is distributed to each workstation M

i

(1 ≤ i ≤ p) is l

i

∗n, where n is the length of the complete text collection.

The hybrid implementation that is called P4 is same as the P1 implementa- tion but we use the optimal distribution method instead of the uniform distri- bution one. The four entire parallel implementations are constructed so that al- ternative sequential approximate string matching algorithms can be substituted quite easily [7,11]. In this paper, we use the classical SEL dynamic programming algorithm [14].

3 Experimental Results

In this section, we discuss the experimental results for the performance of four parallel and distributed algorithms. These algorithms are implemented in C pro- gramming language using the MPI library [15].

3.1 Experimental Environment

The target platform for our experimental study is a cluster of heterogeneous workstations connected with 100 Mb/s Fast Ethernet network. More speciﬁ- cally, the cluster consists of 4 Pentium MMX 166 MHz with 32 MB RAM and 6 Pentium 100 MHz with 64 MB RAM. A Pentium MMX is used as master work- station. The average speeds of the two types of workstations, Pentium MMX and Pentium, for the four implementations are listed in Table 1. The MPI im- plementation used on the network is MPICH version 1.2. During all experiments, the cluster of workstations was dedicated. Finally, to get reliable performance results 10 executions occurred for each experiment and the reported values are the average ones. The text collection we used was composed of documents, which were portion of the various web pages.

3.2 Experimental Results

In this subsection, we present the experimental results concluding from two sets of experiments. For the ﬁrst experimental setup, we study the performance of four master-worker implementations P1, P2, P3 and P4. For the second ex- perimental setup, we examine the scalability issue of our implementations by doubling the text collection.

Table 1. Average speeds (in chars per sec) of the two types of workstations

Application Pentium MMX Pentium P1,P4 3693675.74 2200079.448

P2 3753988.176 2194043.93

P3 3737505.55 2197613.22

(6)

Comparing the Four Types of Approximate String Matching Imple- mentations Before we present the results for four methods, we determined from the extensive experimental study [8] that the block size nearly sb=100,000 characters produces optimal performance for two dynamic master-worker meth- ods P2 and P3, later experiments are all performed using this optimal value for the P2 and P3. Further, from [8] we observed that the worst performance is obtained for very small and large values of block size. This is because small values of block size increase the inter-workstation communication, while large values of block size produce poorly balanced load.

Figures 1 and 2 show the execution times and the speedup factors with re- spect to the number of workstations respectively. It is important to note that the execution times and the speedups, which are plotted in Figures 1 and 2 are result of average for five pattern lengths (m=5, 10, 20, 30 and 60) and four values of the number of errors ( k=1, 3, 6 and 9). The speedup of a heteroge- neous computation is defined as the ratio of the sequential execution time on the fastest workstation to the parallel execution time across the heterogeneous cluster. To have a fair comparison in terms of speedup, one defines the system computing power, which considers the power available instead of the number of workstations. The system computing power defines as follows:

_p−1

i=0

S

_i

/S

0

for p workstations used, where S

0

is the speed of the master workstation.

As we have expected, performance results show that the P1 implementation using static load balancing strategy is less eﬀective than the other three imple- mentations in case of heterogeneous network. This fact due to the presence of waiting time associated to communications. In other words, the slowest work- station is always the latest one in string matching computation. Further, the P2 implementation using dynamic allocation of subtexts produces better results than the P1 one in case of heterogeneous cluster. Finally, the experimental results show that the P3 and P4 implementations seem to have the best performance compared with the others in case of heterogeneous cluster. These implementa- tions give smaller execution times and higher speedups than in case of using

10 20 30 40 50 60 70 80 90 100

1 2 3 4 5 6 7 8

Time (in seconds)

Number of workstations SEL search algorithm, n=13MB and k=3

P1 P2P3 P4

5 10 15 20 25 30 35 40

1 2 3 4 5 6 7 8

Time (in seconds)

Number of workstations SEL search algorithm, n=13MB and m=10

P1 P2P3 P4

Fig. 1. Experimental execution times (in seconds) for text size of 13MB and k=3

using several pattern lengths (left) and m=10 using several values of k (right)

(7)

1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6

1 2 3 4 5 6 7 8

Speedup

P1P2 P3 P4

1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6

1 2 3 4 5 6 7 8

Speedup

P1P2 P3 P4

Fig. 2. Speedup of parallel approximate string matching with respect to the number of workstations for text size of 13MB and k=3 using several pattern lengths (left) and m=10 using several values of k (right)

the P1 and P2 ones when the network becomes heterogeneous, i.e. after the 3rd workstation.

We now examine the performance of the P2, P3 and P4 parallel implemen- tations. From the results, we see a clear reduction in the computation time of the algorithm when we use the three parallel implementations. For instance, with k=3 and several pattern lengths, we reduce the average computation time from 95.085 seconds in the sequential version to 17.206, 16.176 and 16.040 sec- onds in the distributed implementations P2, P3 and P4 respectively using 8 workstations. In other words, from the Figure 1 we observe that for constant total text size there is an expected inverse relation between the parallel execu- tion times and the number of workstations. Further, the three master-worker implementations achieve reasonable speedups for all workstations. For example, with k=3 and several pattern lengths, we had an increasing speedup curves up to about 5.52, 5.86 and 5.91 in distributed methods P2, P3 and P4 respectively on the 8 workstations which had the computing power of 5.92, 5.93 and 5.97.

Scalability Issue To study the scalability of three proposed parallel implemen-

tations P2, P3 and P4, we setup the experiments in the following way. We simple

double the old text size two times. This new text collection is around 27MB. Re-

sults from these experiments have been depicted in Figures 3 and 4. The results

show that the three parallel implementations still scales well though the prob-

lem size has been increased two times (i.e. doubling the text collection). The

average execution times for k=3 and several pattern lengths similarly decrease

to 34.063, 32.298 and 32.150 seconds for the P2, P3 and P4 implementations

respectively when the number of workstations have been added to 8. Moreover,

speedup factors of three methods also linearly increase when the workstations

are increased. Finally, the best performance results are obtained with the P3

and P4 load balancing methods.

(8)

20 40 60 80 100 120 140 160 180 200

1 2 3 4 5 6 7 8

Time (in seconds)

P2P3 P4

10 20 30 40 50 60 70 80 90

1 2 3 4 5 6 7 8

Time (in seconds)

P2P3 P4

Fig. 3. Experimental execution times (in seconds) for text size of 27MB and k=3 using several pattern lengths (left) and m=10 using several values of k (right)

1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6

1 2 3 4 5 6 7 8

Speedup

P2 P3 P4

1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6

1 2 3 4 5 6 7 8

Speedup

P2 P3 P4

Fig. 4. Speedup of parallel approximate string matching with respect to the number of workstations for text size of 27MB and k=3 using several pattern lengths (left) and m=10 using several values of k (right)

4 Conclusions

In this paper, we have presented four parallel and distributed approximate string matching implementations and the performance results on a low-cost cluster of heterogeneous workstations. We have observed from this extensive study that the P3 and P4 implementations produce better performance results in terms execution times and speedups than the others. Higher gains in performance are expected for a larger number of varying speed workstations in the network.

Variants of the approximate string matching algorithm can directly be im-

plemented on a cluster of heterogeneous workstations using the four text dis-

tribution methods reported here. We plan to develop a theoretical performance

model in order to conﬁrm the experimental behaviour of four implementations

on a heterogeneous cluster. Further, this model can be used to predict the ex-

ecution time and similar performance metrics for the four approximate string

matching implementations on larger clusters and problem sizes.

(9)

References

1. O. Beaumont, A. Legrand and Y. Robert, The master-slave paradigm with het- erogeneous processors, Report LIP RR-2001-13, 2001. 433

2. H. D. Cheng and K. S. Fu, VLSI architectures for string matching and pattern matching, Pattern Recognition, vol. 20, no. 1, pp. 125-141, 1987. 433

3. J. Cringean, R. England, G. Manson and P. Willett, Network design for the implementation of text searching using a multicomputer, Information Processing and Management, vol. 27, no. 4, pp. 265-283, 1991. 433

4. D. Lavenier, Speeding up genome computations with a systolic accelerator, SIAM News, vol. 31, no. 8, pp. 6-7, 1998. 433

5. D. Lavenier and J. L. Pacherie, Parallel processing for scanning genomic data- bases, in Proc. PARCO’97, pp. 81-88, 1997. 433

6. K. G. Margaritis and D. J. Evans, A VLSI processor array for flexible string matching, Parallel Algorithms and Applications, vol. 11, no. 1-2, pp. 45-60, 1997.

433 7. P. D. Michailidis and K. G. Margaritis, On-line approximate string searching al- gorithms: Survey and experimental results, International Journal of Computer Mathematics, vol. 79, no. 8, pp. 867-888, 2002. 432, 436

8. P. D. Michailidis and K. G. Margaritis, Performance evaluation of load balancing strategies for approximate string matching on a cluster of heterogeneous work- stations, Tech. Report, Dept. of Applied Informatics, University of Macedonia, 2002. 434, 437

9. P. D. Michailidis and K. G. Margaritis, String matching problem on a cluster of personal computers: Experimental results, in Proc. of the 15th International Conference Systems for Automation of Engineering and Research, pp. 71-75, 2001.

433 10. P. D. Michailidis and K. G. Margaritis, String matching problem on a cluster of personal computers: Performance modeling, in Proc. of the 15th International Conference Systems for Automation of Engineering and Research, pp. 76-81, 2001.

433 11. G. Navarro, A guided tour to approximate string matching, ACM Computer Sur- veys, vol. 33, no. 1, pp. 31-88, 2001. 432, 436

12. N. Ranganathan and R. Sastry, VLSI architectures for pattern matching, Inter- national Journal of Pattern Recognition and Artificial Intelligence, vol. 8, no. 4, pp. 815-843, 1994. 433

13. R. Sastry, N. Ranganathan and K. Remedios, CASM: a VLSI chip for approxi- mate string matching, IEEE Transactions on Pattern Analysis and Machine In- telligence, vol. 17, no. 8, pp. 824-830, 1995. 433

14. P. H. Sellers, The theory and computations of evolutionaly distances: pattern recognition, Journal of Algorithms, vol. 1, pp. 359-373, 1980. 432, 436

15. M. Snir, S. Otto, S. Huss-Lederman, D. W. Walker and J. Dongarra, MPI: The complete reference, The MIT Press, Cambridge, Massachusetts, 1996. 433, 436 16. T. K. Yap, O. Frieder and R. L. Martino, Parallel computation in biological se-