New Variable Modifiers for Productive Programming

2.8 Summary

3.1.1 Worker-Local Data

3.1.1.2 New Variable Modifiers for Productive Programming

The direct use of thePlaceLocalHandleand WorkerLocalHandleclasses as currently implemented is messy, as it requires the definition and initialization of each instance using ‘boilerplate’ template code. To support productive programming, we propose that the definition and initialization of PlaceLocalHandle and WorkerLocalHandle

be supported by new variable modifiers for the X10 language: placeLocal and

workerLocal. For example:

1 val t_worker:WorkerLocalHandle[T];

2 t_worker = new WorkerLocalHandle[T]( () => new T(params) );

could be written more simply as: 1 workerLocal t_worker:T;

2 t_worker = new T(params);

The addition of keywords to a language is not without cost: each keyword adds to the cognitive load for programmers learning and using the language. This cognitive load must be weighed against the alternative. In this case, the alternative is verbose boilerplate code that obscures the important information about the variable: its type and the operation used to initialize its local instances.

There is a symmetry in the relationship between workerLocal and async and

placeLocal andat. Code within anasyncactivity must assume that it is working a different instance of a workerLocal value, and furthermore it may assume that no other activity will read or modify that instance for the duration of the activity. Similarly, code within anatstatement must assume that it is working on a different instance of aplaceLocal value to its enclosing scope (and of course it may assume that code running at other places will not read or modify that instance). However, unlikeworkerLocaldata,placeLocaldata are not thread safe. Other worker threads at the place may access theplaceLocalvalue, so possible concurrent accesses must be protected byatomicblocks.

The proposed new keywords are under discussion with the X10 core design team and have not been incorporated into the language.

3.1.2 Visualizing Task Locality In A Work Stealing Runtime

An optimal scheduler should place tasks on processing elements so as to maximize reuse of cached data. For X10 this means that as far as possible, activities that are ‘nearby’ in the sense of sharing data should be executed by the same worker thread.

Divide-and-conquer style programs which share data between related subtasks tend to exhibit good locality when run on work-stealing runtimes. However, work stealing has been observed to exhibit poor locality for other applications, including iterative data-parallel loops, due to random stealing [Acar et al., 2000]. It may therefore be useful for the programmer to know the actual locality achieved by the work-stealing runtime during program execution. To support the profiling and visualization of task locality in X10, we propose a new annotation: @ProfileLocality.

§3.1 Task Parallelism 45

Applying the@ProfileLocalityannotation to anasyncstatement causes the runtime to collect the following information for each activity that executes the statement:

• place ID;

• worker thread ID; and

• one or more locality variables specified by the programmer.

On termination of the program, the collected information for the full list of activities is printed in tabular format to the standard output stream. The@ProfileLocality

annotation could be implemented by a simple compiler transformation, which would have the advantage that instrumentation and logging could be controlled using a compiler flag. For the purposes of evaluating the approach for this thesis, the transformation was performed by hand.

The following example code demonstrates the use of this facility to visualize the locality of activities in a one-dimensional domain decomposition. The code simply computes the histogram of an array of integers, by processing each element of the array in a separate activity.1

1 public static def compute(data:Rail[Int], numBins:Int) {

2 val bins = new Rail[Int](numBins);

3 for (i in 0..(data.size-1)) @ProfileLocality(i) async {

4 val b = data(i) % numBins;

5 atomic bins(b)++;

6 }

7 return bins;

8 }

The annotation on line 3 associates each activity generated by the loop with the locality variable i, which is simply the loop index. On termination, the program prints the locality variableifollowed by the place ID and worker ID for each activity as follows: # Histogram__closure__1 # i place worker 0 0 1 1 0 3 2 0 2 3 0 4 4 0 2 5 0 2 6 0 2 ...

The above output shows that the array element with index i=0 was processed at place 0 by worker thread 1; array elementi= 1 was processed at place 0 by worker

Figure 3.2: Locality of activity-worker mapping for Histogram benchmark (n=2000, X10_NTHREADS=4).

Figure 3.3: Locality of activity-worker mapping for Histogrambenchmark with divide-and- conquer loop transformation (n=2000, X10_NTHREADS=4).

thread 3, and so on. These data can then be visualized using standard plotting tools. For example, figure3.2shows the locality of activities generated by executing Histogramfor an array of 2000 elements on four worker threads at a single place. The activities executed by each thread are represented by a unique color.

It is apparent from the diagram that both load balancing and locality are poor: most activities are executed by one particular worker thread, and the remaining activities are finely and randomly divided between the other three worker threads.

To see the value of such a visualization, contrast figure3.2with figure3.3. In X10 version 2.3, the X10 compiler was enhanced to include an optional loop transformation which converts aforloop over an integer index to a divide-and-conquer pattern of activities, using recursive bisection. This significantly improves locality when using work stealing with simple for loops. Figure 3.3 shows the locality of activities generated byHistogramcompiled using the divide-and-conquer loop transformation. The activities are now more evenly divided between worker threads, and each worker executes large contiguous chunks of the loop, meaning that its reads from thedata

array have a unit stride, rather than being scattered across memory.

Chapters4and5will show how visualizing the locality of activity execution can be used to inform parallel algorithm design for more complex scientific codes.

3.2 Active Messages

An active message combines data transfer and computation in a single message [von Eicken et al.,1992]. Active messages are the core mechanism for point-to-point communications in X10. The following sections describe performance improvements for the implementation of active messages in X10, and a generalization of active messages to collective communications between activities executing at different places.

In later chapters, these improvements will be demonstrated in the context of two scientific applications. Firstly, chapter4will describe the use offinish/ateach collective active messages to share work between places in a quantum chemistry application. Secondly, chapter5will demonstrate the use ofTreeCollectingFinish

to perform an energy summation over a group of places in a molecular dynamics application. Both applications make use of improvements to the serialization of active

§3.2 Active Messages 47

messages described in the next section.

In document X10 for high performance scientific computing (Page 62-65)