Parallel execution - Node js Design Patterns Casciaro, Mario [PDF][StormRG] pdf

There are some situations where the order of the execution of a set of asynchronous tasks is not important and all we want is just to be notified when all those running tasks are completed. Such situations are better handled using a parallel execution flow, as shown in the following diagram:

This may sound strange if we consider that Node.js is single threaded, but if we remember what we discussed in Chapter 1, Node.js Design Fundamentals, we realize that even though we have just one thread, we can still achieve concurrency, thanks to the nonblocking nature of Node.js. In fact, the word parallel is used improperly in this case, as it does not mean that the tasks run simultaneously, but rather that their execution is carried out by an underlying nonblocking API and interleaved by the event loop.

As we know, a task gives control back to the event loop when it requests a new asynchronous operation, allowing the event loop to execute another task. The proper word to use for this kind of flow is concurrency, but we will still use the word parallel.

The following diagram shows how two asynchronous tasks can run in parallel in a Node.js program:

[Diagram: Event Loop timeline with Main, Task 1, and Task 2; legend: Call, Return]

In the previous image, we have a Main function that executes two asynchronous tasks:

1. The Main function triggers the execution of Task 1 and Task 2. As these trigger an asynchronous operation, they immediately return the control back to the Main function, which then returns it to the event loop.

2. When the asynchronous operation of Task 1 is completed, the event loop gives control to it. When Task 1 completes its internal synchronous processing as well, it notifies the Main function.

3. When the asynchronous operation triggered by Task 2 is completed, the event loop invokes its callback, giving the control back to Task 2. At the end of Task 2, the Main function is again notified. At this point, the Main function knows that both Task 1 and Task 2 are complete, so it can continue its execution or return the results of the operations to another callback.
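The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: task(), main(), and the log array are hypothetical names, and setTimeout() stands in for a real nonblocking operation.

```javascript
// Hypothetical names throughout; setTimeout() simulates a nonblocking API.
const log = [];

// Starts an asynchronous operation and returns immediately.
function task(name, delayMs, callback) {
  log.push(name + ' started');
  setTimeout(function () {
    // The event loop gives control back to the task here.
    log.push(name + ' completed');
    callback(); // notify Main
  }, delayMs);
}

function main(callback) {
  let pending = 2;
  function notify() {
    // Main continues only when both tasks have reported completion.
    if (--pending === 0) {
      callback();
    }
  }
  task('Task 1', 10, notify);
  task('Task 2', 30, notify);
  log.push('Main returns control to the event loop');
}

main(function () {
  console.log(log.join('\n'));
});
```

Both "started" entries are logged before either "completed" entry, which shows that Main does not wait for Task 1 before triggering Task 2.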

In short, this means that in Node.js, we can execute in parallel only asynchronous operations, because their concurrency is handled internally by the nonblocking APIs. In Node.js, synchronous (blocking) operations cannot run concurrently unless their execution is interleaved with an asynchronous operation, or deferred with setTimeout() or setImmediate(). We will see this in more detail in Chapter 6, Recipes.
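As a rough illustration of the deferral idea, the following sketch splits a synchronous computation into chunks and yields to the event loop between chunks with setImmediate(). The processInChunks() helper and the order array are hypothetical names introduced only for this example; it is a minimal sketch, not the technique covered in Chapter 6.

```javascript
// Hypothetical helper: sums `items` in chunks of `chunkSize`, deferring each
// chunk with setImmediate() so that other pending work can be interleaved.
const order = [];

function processInChunks(label, items, chunkSize, callback) {
  let index = 0;
  let sum = 0;

  function nextChunk() {
    const end = Math.min(index + chunkSize, items.length);
    while (index < end) {
      sum += items[index++]; // synchronous (blocking) work
    }
    order.push(label); // record which job ran this chunk
    if (index < items.length) {
      return setImmediate(nextChunk); // defer the remaining work
    }
    callback(null, sum);
  }

  // Defer the first chunk too, so the callback is never invoked synchronously.
  setImmediate(nextChunk);
}

processInChunks('A', [1, 2, 3, 4], 2, function (err, sum) {
  console.log('A', sum);
});
processInChunks('B', [10, 20, 30, 40], 2, function (err, sum) {
  console.log('B', sum);
});
```

With two chunks per job, the order array should end up alternating between A and B instead of listing all of A's chunks first.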

Web spider version 3

Our web spider application seems like a perfect candidate to apply the concept of parallel execution. So far, our application is executing the recursive download of the linked pages in a sequential fashion. We can easily improve the performance of this process by downloading all the linked pages in parallel.

To do that, we just need to modify the spiderLinks() function to make sure to spawn all the spider() tasks at once, and then invoke the final callback only when all of them have completed their execution. So let's modify our spiderLinks() function as follows:

function spiderLinks(currentUrl, body, nesting, callback) {
  if(nesting === 0) {
    return process.nextTick(callback);
  }

  var links = utilities.getPageLinks(currentUrl, body);
  if(links.length === 0) {
    return process.nextTick(callback);
  }

  var completed = 0, errored = false;
  function done(err) {
    if(err) {
      errored = true;
      return callback(err);
    }
    if(++completed === links.length && !errored) {
      return callback();
    }
  }

  // Start every spider() task at once; done() counts the completions
  links.forEach(function(link) {
    spider(link, nesting - 1, done);
  });
}

Let's explain what we changed. As we mentioned earlier, the spider() tasks are now started all at once. This is possible by simply iterating over the links array and starting each task without waiting for the previous one to finish:

links.forEach(function(link) {
  spider(link, nesting - 1, done);
});

Then, the trick to make our application wait for all the tasks to complete is to provide the spider() function with a special callback, which we call done(). The done() function increases a counter when a spider task completes. When the number of completed downloads reaches the size of the links array, the final callback is invoked:

function done(err) {
  if(err) {
    errored = true;
    return callback(err);
  }
  if(++completed === links.length && !errored) {
    callback();
  }
}

With these changes in place, if we now try to run our spider against a web page, we will notice a huge improvement in the speed of the overall process, as every download is carried out in parallel without waiting for the previous link to be processed.
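One detail worth noting in done() as written: if more than one task fails, callback(err) is invoked once per failure, because the errored flag only guards the success path. The following is a minimal sketch of one way to guarantee a single invocation, assuming that is the behavior we want. createDone() is a hypothetical helper, not part of the spider code, and the timers only simulate task completions.

```javascript
// Hypothetical helper: returns a done() function that counts completions
// and invokes `callback` at most once.
function createDone(total, callback) {
  let completed = 0;
  let finished = false;

  return function done(err) {
    if (finished) {
      return; // ignore late completions and repeated errors
    }
    if (err) {
      finished = true;
      return callback(err);
    }
    if (++completed === total) {
      finished = true;
      return callback();
    }
  };
}

// Usage with three simulated tasks, two of which fail.
const done = createDone(3, function (err) {
  console.log('final callback:', err ? err.message : 'ok');
});

setTimeout(function () { done(new Error('first failure')); }, 5);
setTimeout(function () { done(new Error('second failure')); }, 10);
setTimeout(function () { done(); }, 15);
```

In this sketch only the first failure reaches the final callback; the later error and the late success are dropped.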

The pattern

Also, for the parallel execution flow, we can extract our nice little pattern, which we can adapt and reuse for different situations. We can represent a generic version of the pattern with the following code:

var tasks = [...];
var completed = 0;
tasks.forEach(function(task) {
  task(function() {
    if(++completed === tasks.length) {
      finish();
    }
  });
});

function finish() {
  //all the tasks completed
}

With small modifications, we can adapt the pattern to accumulate the results of each task into a collection, to filter or map the elements of an array, or to invoke the finish() callback as soon as one or a given number of tasks complete (this last situation in particular is called competitive race).
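Two of these adaptations can be sketched as follows. The parallel(), race(), and delayed() functions are hypothetical helpers written for this example, and each task is assumed to accept a Node.js-style callback(err, result); this is one possible shape, not a definitive implementation.

```javascript
// Runs all tasks at once and accumulates their results in task order.
function parallel(tasks, finish) {
  const results = new Array(tasks.length);
  let completed = 0;
  let finished = false;

  if (tasks.length === 0) {
    return process.nextTick(finish, null, results);
  }
  tasks.forEach(function (task, index) {
    task(function (err, result) {
      if (finished) {
        return;
      }
      if (err) {
        finished = true;
        return finish(err);
      }
      results[index] = result; // store by index, not by completion order
      if (++completed === tasks.length) {
        finished = true;
        finish(null, results);
      }
    });
  });
}

// Competitive race: finish as soon as the first task completes.
function race(tasks, finish) {
  let finished = false;
  tasks.forEach(function (task) {
    task(function (err, result) {
      if (finished) {
        return;
      }
      finished = true;
      finish(err, result);
    });
  });
}

// Simulated asynchronous task that yields `value` after `delayMs`.
function delayed(value, delayMs) {
  return function (callback) {
    setTimeout(function () { callback(null, value); }, delayMs);
  };
}

const sampleTasks = [delayed('slow', 30), delayed('fast', 10)];
parallel(sampleTasks, function (err, results) { console.log(results); });
race(sampleTasks, function (err, winner) { console.log(winner); });
```

Storing each result at the index of its task keeps the output in the same order as the input, even though the tasks complete in a different order.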

Pattern (unlimited parallel execution): run a set of asynchronous tasks in parallel by spawning them all at once, and then wait for all of them to complete by counting the number of times their callbacks are invoked.

Fixing race conditions in the presence of
