The push and pull functions provide straightforward mechanisms for fine-grained control of data movement. The push function accepts a dictionary and an optional list of targets as arguments and updates the targets' namespaces with that dictionary.
The pull function accepts a name and a list of targets and returns a list containing the value of that name on each of those targets (wrapped in an AsyncResult when the call is non-blocking, as in the session below). For example:
In [43]: dv.push({"hello":"world"})
Out[43]: <AsyncResult: _push>
In [44]: dv["hello"]
Out[44]: ['world', 'world', 'world', 'world']
In [45]: dv.push({"hello":"dolly"}, [2,3])
Out[45]: <AsyncResult: finished>
In [46]: dv["hello"]
Out[46]: ['world', 'world', 'dolly', 'dolly']
In [48]: ar = dv.pull("hello", [1,2])
In [49]: ar
Out[49]: <AsyncResult: finished>
In [50]: ar.get()
Out[50]: ['world', 'dolly']
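The semantics of push and pull can be sketched without a running cluster. The following is a toy, in-process model only: ToyView is a hypothetical class written for this illustration (it is not part of ipyparallel), and each "engine" is assumed to be nothing more than a dictionary. It replays the session above.

```python
# Toy stand-in for a DirectView: one plain dict per "engine" namespace.
# ToyView is hypothetical and runs entirely in-process; it is not ipyparallel.

class ToyView:
    def __init__(self, n_engines):
        self.namespaces = [{} for _ in range(n_engines)]

    def _resolve(self, targets):
        # No targets means "every engine", mirroring the optional argument.
        return range(len(self.namespaces)) if targets is None else targets

    def push(self, ns, targets=None):
        """Update the namespace of each target with the dictionary ns."""
        for t in self._resolve(targets):
            self.namespaces[t].update(ns)

    def pull(self, name, targets=None):
        """Return the value bound to name on each target, in target order."""
        return [self.namespaces[t][name] for t in self._resolve(targets)]


view = ToyView(4)
view.push({"hello": "world"})           # every engine gets hello = "world"
view.push({"hello": "dolly"}, [2, 3])   # only engines 2 and 3 are updated
all_values = view.pull("hello")
some_values = view.pull("hello", [1, 2])
print(all_values)    # ['world', 'world', 'dolly', 'dolly']
print(some_values)   # ['world', 'dolly']
```

Unlike the real functions, this model is synchronous, so there is no AsyncResult to wait on.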
Imports
Consider the following straightforward computation of a hailstone sequence, with a random sleep thrown in:
import time
import random

def f(n):
    curr = n
    tmp = 0
    time.sleep(random.random())
    while curr != 1:
        tmp = tmp + 1
        if curr % 2 == 1:
            curr = 3 * curr + 1
        else:
            curr = curr // 2  # integer division keeps curr an int
    return tmp
Calling f from the IPython command line works as expected:
In [15]: f(28)
Out[15]: 18
Running it in parallel also looks smooth:
In [7]: !ipcluster start -n 4 &
Opening the Toolkit – The IPython API
2015-12-29 11:35:37.619 [IPClusterStart] Starting Controller with LocalControllerLauncher
2015-12-29 11:35:38.624 [IPClusterStart] Starting 4 Engines with LocalEngineSetLauncher
2015-12-29 11:36:08.952 [IPClusterStart] Engines appear to have started successfully
In [8]: from ipyparallel import Client
In [10]: c = Client()
In [11]: dv = c[:]
In [12]: %run hail2.py
In [13]: ar = dv.map(f, range(1, 11))
In [14]: ar
Out[14]: <AsyncMapResult: finished>
The problem first becomes apparent when attempting to access the results:
In [17]: ar[0]
[0:apply]:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<string> in <module>()
<remotefunction.py> in <lambda>(f, *sequences)
    229         if self._mapping:
    230             if sys.version_info[0] >= 3:
--> 231                 f = lambda f, *sequences: list(map(f, *sequences))
    232             else:
    233                 f = map
<hail2.py> in f(n)
      5     curr = n
      6     tmp = 0
----> 7     time.sleep(random.random())
      8     while curr != 1:
      9         tmp = tmp + 1
NameError: name 'time' is not defined
One could (and should) always check the metadata to determine whether an error has occurred:
In [21]: ar.metadata[0].status
Out[21]: 'error'
In [22]: ar.metadata[0].error
Out[22]: ipyparallel.error.RemoteError('NameError', "name 'time' is not defined")
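The same check can be applied to every task at once. The sketch below assumes only what the session above shows, namely that each metadata entry exposes status and error attributes. The helper name failed_tasks is hypothetical, and the fake_ar object is a stand-in so that the example runs without a cluster; its 'ok' status value for a successful task is an assumption.

```python
from types import SimpleNamespace


def failed_tasks(async_result):
    """Return (index, error) pairs for every task whose status is 'error'."""
    return [
        (i, md.error)
        for i, md in enumerate(async_result.metadata)
        if md.status == "error"
    ]


# Stand-in for an AsyncResult (hypothetical data): first task failed,
# second task succeeded.
fake_ar = SimpleNamespace(metadata=[
    SimpleNamespace(status="error",
                    error=NameError("name 'time' is not defined")),
    SimpleNamespace(status="ok", error=None),
])

failures = failed_tasks(fake_ar)
for index, error in failures:
    print(index, repr(error))
```

With a real result object, the call would be failed_tasks(ar) before any attempt to index into ar.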
However, the real question is: why did the error happen? The answer is that when IPython starts an engine, it does not copy the environment of the Hub to the engine. Running the program in the interactive session imports the libraries into the interactive session only. Calling the function on an engine does not import the library into the engine.
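The effect of a missing import in a separate namespace can be mimicked locally. The sketch below only approximates what happens on an engine: it rebinds a function's code to a fresh globals dictionary (the names nap, engine_globals, and remote_nap are invented for this illustration), which lacks the modules imported in the current session.

```python
import builtins
import time
import types


def nap():
    time.sleep(0)  # the global name 'time' is looked up at call time
    return "rested"


# A fresh globals dict plays the role of an engine namespace: it has the
# builtins but none of the modules imported in this session.
engine_globals = {"__builtins__": builtins}
remote_nap = types.FunctionType(nap.__code__, engine_globals, "nap")

try:
    remote_nap()
    error_message = None
except NameError as exc:
    error_message = str(exc)
print(error_message)

# Once the module exists in that namespace, the same code runs normally.
engine_globals["time"] = time
result = remote_nap()
print(result)
```

The local nap works throughout because its own globals contain time; only the copy bound to the empty namespace fails.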
The solution is to import modules both locally and on the engines at the same time using the sync_imports context manager. This forces the import to happen on all active engines in the DirectView. The following lines fix the problem:
In [23]: with dv.sync_imports():
   ....:     import time
   ....:
importing time on engine(s)
In [24]: with dv.sync_imports():
   ....:     import random
   ....:
importing random on engine(s)
In [25]: ar = dv.map(f, range(1, 11))
In [26]: ar
Discussion
An alternative way to achieve the same results without using sync_imports is to move the import statements into the function body. Both approaches have their advantages and drawbacks:
sync_imports
• Advantages: There is centralized control of the environment. It clarifies what modules are being imported.
• Drawbacks: This moves the import away from where it is used. It imports a module on every engine, even if only some use it.
Function body
• Advantages: It is easier to determine what libraries are actually in use. This eases refactoring.
• Drawbacks: This violates the PEP 8 style guide. It is less efficient on repeated function calls.
At bottom, the conflict between the two approaches is one of centralization versus localization. This conflict crops up whenever a system is complex enough to require configuration. There is no single solution that will be right for every project. The best that can be done is to settle on an approach and use it consistently while also keeping in mind any problems that the approach brings with it.
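For comparison, the function-body alternative discussed above might look like the following sketch of the hailstone function. Nothing here depends on a cluster; the only changes from the earlier listing are the location of the imports and the use of integer division.

```python
def f(n):
    # Imports live in the function body, so they run wherever f runs,
    # including on an engine that has never imported these modules.
    import random
    import time

    curr = n
    tmp = 0
    time.sleep(random.random())
    while curr != 1:
        tmp = tmp + 1
        if curr % 2 == 1:
            curr = 3 * curr + 1
        else:
            curr = curr // 2  # integer division keeps curr an int
    return tmp


steps = f(28)
print(steps)  # 18
```

The import statements are executed on every call, which is the efficiency drawback noted above, although after the first call they amount to a lookup of the already loaded module.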
LoadBalancedView
A LoadBalancedView uses a scheduler to execute jobs one at a time, but without blocking. The basic concept is that a LoadBalancedView can be treated as if it were a single engine, except that instead of waiting for the engine to finish before submitting the next job, one can simply submit another immediately. The scheduler is responsible for determining when and where each job will run.
The effect of this on function execution is straightforward:
• For apply* functions, the scheduler provides an engine, and f is executed on that engine with *args and **kwargs.
• For map* functions, *sequences is iterated over, and f, the next item from *sequences, and **kwargs are sent to the engine provided by the scheduler.
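The submit-without-waiting model described above can be illustrated by analogy with the standard library. In this sketch, ThreadPoolExecutor stands in for the scheduler and its engines; it is not ipyparallel, where the corresponding view would be obtained from Client.load_balanced_view(). The function job and its timings are made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time


def job(n):
    time.sleep(0.01 * n)  # pretend to do work of varying length
    return n * n, threading.current_thread().name


with ThreadPoolExecutor(max_workers=4) as pool:
    # apply-style: hand over f and its arguments; the pool picks a worker.
    single = pool.submit(job, 3)  # returns at once with a future
    # map-style: one task per item of the sequence, results kept in order.
    mapped = list(pool.map(job, range(1, 6)))
    single_value, single_worker = single.result()

squares = [value for value, _ in mapped]
workers = {worker for _, worker in mapped}
print(squares)          # [1, 4, 9, 16, 25]
print(sorted(workers))  # which workers ran the tasks is up to the pool
```

As with a LoadBalancedView, the caller never names a worker, so the set of workers that actually ran the tasks can differ from run to run.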
Chapter 5
Data movement
The data movement functionality of DirectView depends on knowing the engine on which each process is running. A LoadBalancedView allows the scheduler to assign each process to an engine, so this information is not available. As such, the data movement functionality of DirectView is not available in a LoadBalancedView. Data movement in a LoadBalancedView requires external mechanisms, such as ZeroMQ or MPI. These mechanisms provide alternative ways of determining process location (for example, MPI's rank indicator) or a means of specifying a connection endpoint (such as ZeroMQ's use of port numbers).
Using MPI and ZeroMQ for data movement is a big enough topic to warrant a chapter on its own. See Chapter 4, Messaging with ZeroMQ and MPI, for more details.
Imports
The situation with imports is the same in a LoadBalancedView and a DirectView.
Summary
In this chapter, we saw a variety of useful features provided by IPython. While no single feature is of game-changing importance, each provides the right tool for its job.
IPython's timing utilities, whether through the utils.timing or the timeit and prun magics, provide a quick and easy way to measure application performance.
The AsyncResult class provides more than a variety of different methods of obtaining results from asynchronous jobs. Metadata about the results is also available, allowing the developer to access important information such as when a job was started and its error status.
Given this data about jobs, the Client class provides access to job-control functionality. In particular, queues can be accessed and jobs and engines can be stopped based on their status.
A Client object can be used to obtain a View object (either a DirectView or a LoadBalancedView). Views are the primary mechanism by which jobs are started on engines.
DirectView works as a multiplexer. It has a set of engines (targets) and it does the same thing to all of its engines. This allows the developer to treat a set of engines as if they were a single entity. An important capability of a DirectView is its ability to make things "the same" on all of its target engines. In particular, DirectView provides mechanisms for data movement (a global dictionary, scatter-gather, and push-pull) and for managing the environment (the sync_imports context manager).
A LoadBalancedView also allows multiple engines to be treated as a single engine. In this case, however, the conceptual model is that of a single, nonblocking engine rather than a set of engines. The controller can feed multiple jobs to the LoadBalancedView without blocking. For its part, the LoadBalancedView depends on a scheduler to actually execute the jobs. This does not change the environmental management situation with respect to a DirectView, but it does necessitate an external data movement tool, such as MPI or ZeroMQ.
While this chapter briefly outlined some of the more useful features of IPython's libraries, IPython cannot do everything itself. In the next chapter, we will take a look at how additional languages can be combined with Python to further expand the range of available tools.