pList Design and Implementation - The STAPL Parallel Container Framework

In this section, we describe the pList modules used for storage and data distribution information.

bContainer: For the STAPL pList, we use the STL list as the container of the bContainer. Most pList methods will ultimately be executed on the bContainer using the bContainer’s corresponding methods. For example, pList insert will ultimately invoke the STL list insert method. The pList bContainer can also be provided by the user so long as insertions and deletions never invalidate iterators, and the bContainer provides the required domain interface (see below). Additional requirements concern the expected performance of the methods (e.g., insertions and deletions should be constant-time operations).

The pList has a global view of all of the bContainers and knows the order between them in order to provide a unique traversal of all its data. For this reason each bContainer is identified by a globally unique BCID. For static or less dynamic – in terms of number of bContainers – pContainers such as pArray or associative containers, the BCID can be a simple integer. The pList, however, needs a BCID that allows for fast dynamic operations. During the splice operation, bContainers from a pList instance need to be integrated efficiently into another pList instance while maintaining the uniqueness of their BCIDs. For these reasons, the BCID for the pList bContainers is currently defined as follows:

typedef std::pair<plist_bcontainer*, location_identifier> BCID;

Global Identifiers (GID): Performance and uniqueness considerations similar to those of the bContainer identifier, and the list guarantee that iterators are not invalidated when elements are added or deleted, lead us to use the following definition for the pList GID.

typedef std::pair<std::list<>::iterator, BCID> gid;

Since the BCID is unique, the GID is unique as well. With the above definition for GID, the pList can uniquely identify each of its elements and access them independent of their physical location.
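The two definitions above can be tried out sequentially. The following is a minimal sketch, assuming an int element type and a plain integer for location_identifier; make_gid, local_element and gid_survives_updates are hypothetical helpers, not part of the framework. It shows why the list guarantee matters: a GID taken before unrelated insertions and deletions still refers to the same element afterwards.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <utility>

// Hypothetical stand-ins for the framework's types (not the STAPL definitions).
using location_identifier = int;
using plist_bcontainer = std::list<int>;
using BCID = std::pair<plist_bcontainer*, location_identifier>;
using GID = std::pair<plist_bcontainer::iterator, BCID>;

// Build a GID for the element at `it` inside bContainer `bc` on location `loc`.
inline GID make_gid(plist_bcontainer& bc, plist_bcontainer::iterator it,
                    location_identifier loc) {
  return GID(it, BCID(&bc, loc));
}

// Dereference a GID. Only meaningful on the location that owns the bContainer.
inline int& local_element(const GID& g) { return *g.first; }

// Returns true if a GID taken before other insertions and deletions
// still refers to the same element afterwards.
inline bool gid_survives_updates() {
  plist_bcontainer bc = {10, 20, 30};
  GID g = make_gid(bc, std::next(bc.begin()), 0);  // refers to 20
  bc.push_front(5);                // insertion elsewhere in the list
  bc.insert(g.first, 15);          // insertion directly before the element
  bc.erase(std::prev(bc.end()));   // deletion of another element (30)
  return local_element(g) == 20 && g.second.first == &bc;
}
```

The iterator half of the GID is only dereferenced on the owning location; on any other location the BCID half is what tells the framework where to send the request.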

Domain: The domain interface for the pList is provided by the pList bContainers. The pList domain is a union of all domains corresponding to individual bContainers. The union domain doesn’t replicate any data from the pList but it stores pointers to its bContainers.
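A union domain of this kind can be sketched as an ordered collection of non-owning pointers. The code below is an illustration under assumptions: union_domain, its members and flatten are hypothetical names, the element type is int, and everything runs on one location.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

using bcontainer = std::list<int>;  // hypothetical stand-in for a pList bContainer

// Hypothetical union domain: ordered, non-owning pointers to bContainers.
// No element is copied into the domain.
struct union_domain {
  std::vector<const bcontainer*> parts;

  // Total number of elements, summed over the bContainers.
  std::size_t size() const {
    std::size_t n = 0;
    for (const bcontainer* p : parts) n += p->size();
    return n;
  }

  // Visit every element in global order: bContainer order first,
  // then the order inside each bContainer.
  template <typename F>
  void for_each(F f) const {
    for (const bcontainer* p : parts)
      for (int x : *p) f(x);
  }
};

// Collect the elements of the domain into a vector, in traversal order.
inline std::vector<int> flatten(const union_domain& d) {
  std::vector<int> out;
  d.for_each([&out](int x) { out.push_back(x); });
  return out;
}
```

Because the domain holds pointers only, an element added to a bContainer is visible through the domain without any update to the domain itself.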

[Figure 37: four bContainers (bCont) distributed over Location 0 and Location 1, shown once for a pList with blocked mapping and once for a pList with cyclic mapping.]

Fig. 37. Different partitions and mappings for pList.

Data Distribution: The data distribution manager for a pList uses a partition and a partition-mapper to describe how the data will be grouped and mapped on locations. The pList specializes the partition mapper to take advantage of the fact that the location identifier is embedded in the bContainer identifier.

The pList uses a dynamic partition that can maintain an arbitrary number of bContainers and elements per location. The partition constructor can take an optional argument, which is the number of desired bContainers and it will allocate them balanced across locations. The allocation can be done in a blocked fashion or in a cyclic fashion as depicted in Figure 37. Subsequent insert and delete operations may lead to imbalanced distributions of elements in the bContainer. The pList provides a method for this situation to redistribute the data so that elements are rebalanced across locations.
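The difference between the two allocation schemes comes down to how a bContainer index is turned into a location. The functions below are a sketch with hypothetical names (blocked_location, cyclic_location); they assume bContainers are numbered from 0 and say nothing about how the framework stores the mapping.

```cpp
#include <cassert>
#include <cstddef>

// Blocked mapping: consecutive bContainers are placed on the same location.
// The block size is the number of bContainers per location, rounded up.
inline std::size_t blocked_location(std::size_t i, std::size_t num_bc,
                                    std::size_t num_locs) {
  assert(num_locs > 0 && i < num_bc);
  std::size_t block = (num_bc + num_locs - 1) / num_locs;
  return i / block;
}

// Cyclic mapping: bContainers are dealt out to locations round-robin.
inline std::size_t cyclic_location(std::size_t i, std::size_t num_locs) {
  assert(num_locs > 0);
  return i % num_locs;
}
```

With four bContainers on two locations, the blocked scheme yields locations 0, 0, 1, 1 and the cyclic scheme yields 0, 1, 0, 1.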

pView: The pList currently supports sequence pViews that provide an iterator type and begin() and end() methods. A pView can be partitioned into sub-views. By default the partition of a pList pView matches the subdivision of the list in bContainers, thus allowing random access to portions of the pList. This allows parallel algorithms to achieve good scalability as shown in Section D.

pList Container: A typical implementation of a pList method that operates at the element level is included in Figure 38 and uses the invoke skeleton introduced in Chapter IV, Section 6. The run-time cost of the method has three constituents: the time to decide the location and the bContainer where the element will be added (Figure 8, lines 5-15), the communication time to get/send the required information (Figure 8, line 10), and the time to perform the operation on a bContainer (Figure 8, line 17).

The complexity of constructing a pList of N elements is O(M + log P), where M is the maximum number of elements in a location. The log P term is due to a fence at the end of the constructor to guarantee the pList is in a consistent state. The complexities of the element-wise methods are O(1). Multiple concurrent invocations of such methods may be partially serialized due to concurrent thread-safe accesses to common data. The size and empty methods imply reductions and the complexity is O(log P), while clear is a broadcast plus the deletion of all elements in each location, so the complexity is O(M + log P). This analysis relies on the pList bContainer to guarantee that allocation and destruction are linear time operations and size, insert, erase and push_back/push_front are constant time operations.
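The log P term of size and empty can be pictured as a pairwise tree reduction over the per-location element counts. The sketch below simulates such a reduction sequentially; tree_sum and reduction_result are hypothetical names, and the framework performs its reductions through its run-time system, not through this code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of the simulated reduction: the global size and the number of
// communication rounds a tree of this shape would need.
struct reduction_result {
  std::size_t total;
  std::size_t rounds;
};

// Pairwise tree reduction of per-location sizes, simulated on one machine.
// In round k, entry i absorbs entry i + 2^k, so the number of rounds is
// the ceiling of log2(P) for P locations.
inline reduction_result tree_sum(std::vector<std::size_t> v) {
  std::size_t rounds = 0;
  for (std::size_t stride = 1; stride < v.size(); stride *= 2, ++rounds)
    for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride)
      v[i] += v[i + stride];
  return {v.empty() ? 0 : v[0], rounds};
}
```

For eight locations the simulation needs three rounds, and for five locations it also needs three, which matches the logarithmic growth in P.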

The pList also provides methods to rearrange data in bulk. These methods are splice and split to merge lists together and split lists, respectively.

The signature of the pList splice method is:

void pList::splice(iter pos, pList& pl [, iter it1, iter it2]);

where iter stands for an iterator type, pos is an iterator of the calling pList, pl is another pList, and the optional iterators it1 and it2 are iterators pointing to elements of pl. splice removes from pl the portion enclosed by it1 and it2 and inserts it at pos. By default it1 denotes the begin of pl and it2 the end.

plist::pcontainer_sequence::insert(gid, val) {
  this->m_dist->invoke(MP_INSERT_ELEMENT,
    boost::bind(&partition_type::insert_element, gid, val),
    boost::bind(&partition_type::where_insert_element, gid));
}

Fig. 38. pList method implementation.

The complexity of splice depends on the number of bContainers included within it1 and it2. If it1 or it2 points to elements between bContainers, then new bContainers are generated in constant time using sequential list splice. Since the global begin and global end of the pList are replicated across locations, the operation requires a broadcast if either of them is modified.
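The step in which a bContainer is cut at an iterator can be sketched with the sequential list splice. The code assumes an int element type; cut_at and iterator_survives_cut are hypothetical helpers. One caveat: since C++11 the range form of std::list::splice between two distinct lists is specified as linear in the number of moved elements, so the constant-time bound stated above presumes a list implementation that relinks nodes without recounting them.

```cpp
#include <cassert>
#include <iterator>
#include <list>

using bcontainer = std::list<int>;  // hypothetical stand-in for a pList bContainer

// Cut `bc` at `cut`: the elements in [cut, end) leave `bc` and form a new
// bContainer. List nodes are relinked; no element is copied.
inline bcontainer cut_at(bcontainer& bc, bcontainer::iterator cut) {
  bcontainer tail;
  tail.splice(tail.begin(), bc, cut, bc.end());
  return tail;
}

// Checks that an iterator taken before the cut still refers to its element
// after the element has moved into the new bContainer.
inline bool iterator_survives_cut() {
  bcontainer bc = {1, 2, 3, 4};
  bcontainer::iterator it = std::next(bc.begin(), 2);  // refers to 3
  bcontainer tail = cut_at(bc, it);
  return bc.size() == 2 && tail.size() == 2 && *it == 3 && tail.front() == 3;
}
```

Iterators to the moved elements stay valid, which is the property the GID definition relies on when bContainers are rearranged.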

split is also a member method of pList that splits one pList into two. It is a parallel method that is implemented based on splice with the following signature:

void pList::split(iterator pos, pList& other_plist);

When pList.split(pos, other_plist) is invoked, the part of pList starting at pos and ending at pList.end() is appended at the end of other_plist. The complexity of split is analogous to the complexity of splice.
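A sequential analogue of split can be built from two splice calls on a two-level structure. This is a sketch under assumptions: plist_sketch models a pList as a std::list of bContainers on a single location, the position is passed as a bContainer iterator plus an element iterator (the real method takes one pList iterator), and all names are hypothetical.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <utility>
#include <vector>

using bcontainer = std::list<int>;            // hypothetical bContainer
using plist_sketch = std::list<bcontainer>;   // hypothetical single-location pList

// Move the part of `self` that starts at element `pos` (inside bContainer
// `bc`) to the end of `other`.
inline void split(plist_sketch& self, plist_sketch::iterator bc,
                  bcontainer::iterator pos, plist_sketch& other) {
  // Cut the bContainer holding pos so that pos starts a bContainer of its own.
  bcontainer tail;
  tail.splice(tail.begin(), *bc, pos, bc->end());
  plist_sketch::iterator next = std::next(bc);
  if (!tail.empty()) other.push_back(std::move(tail));
  // Every later bContainer moves as a whole.
  other.splice(other.end(), self, next, self.end());
  // Drop the bContainer that was cut if nothing is left in it.
  if (bc->empty()) self.erase(bc);
}

// Collect all elements in traversal order, for inspection.
inline std::vector<int> flatten(const plist_sketch& p) {
  std::vector<int> out;
  for (const bcontainer& b : p) out.insert(out.end(), b.begin(), b.end());
  return out;
}
```

Only the bContainer that contains pos is cut; all bContainers after it are relinked without touching their elements, which is why the cost follows the number of bContainers involved.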

D. Performance Evaluation

In this section, we evaluate the scalability of the pList methods. We compare pList and pVector performance, evaluate generic pAlgorithms (p_generate and p_partial_sum) on pList, pArray and pVector, and evaluate an Euler tour implementation using pList.

[Figure 39: four plots of execution time (sec) against number of processors. (a) CRAY4: Local Methods, 128 to 16384 processors, series push_anywhere, push_anywhere async, insert, insert async. (b) CRAY4: Insert Remote, 128 to 16384 processors, series insert 1%, insert async 1%, insert 2%, insert async 2%. (c) Panel title not recovered, 1 to 128 processors, same series as (a). (d) P5-cluster: Insert Remote, 1 to 128 processors, same series as (b).]

Fig. 39. Execution times for pList methods.
