Section 3.4 defines COMPSs as a programming model with no APIs, while the runtime described in the previous section provides an API with two functions. To close the gap between application programming and the runtime interface, the proposed framework provides a mechanism that instruments the code written by the developer during the building and packaging of the application.
This mechanism consists in adding a step to the four-step Android building process (described in Section 3.4) after the Java Builder and before the bundling of the application: the COMPSs Instrumenter. During the first two steps, the building process completes the application code with all the necessary auxiliary classes, and the third step, the Java Builder, compiles it to Java bytecode. Using Javassist, a library for Java bytecode editing, the framework can modify the application classes as done for the regular COMPSs version – leaving aside the differences in the APIs of both runtimes. The framework scans all the classes of the application – not the classes within libraries on which the application depends – mainly looking for four code patterns that require instrumentation:
• Calls to CE methods. The instrumentation has to replace them with executeTask invocations, passing as parameters the name of the invoked method, the class to which the method belongs, whether the call is on an instance or not, and the list of parameters. If the method is not statically called, the instrumented code includes the callee object as a parameter. If the method returns a value, the instrumented code creates an empty instance of the same class to use as a future object and also adds it to the list of parameters.
• Calls to non-CE methods on an object instance. Prior to the execution of the method, the runtime needs to check whether the object is the result of a task and, if that is the case, synchronize its value. To that end, the instrumented code calls the accessValue method of the runtime API before executing the method. It always assumes that the body of the method modifies the content of the object.
• Constructors of Java utility classes to interact with files. Through these classes, the application reads/updates the content of a file that might be accessed by a task; thus, they require the same treatment as calls to non-CE methods on an object instance. Before the execution, the runtime needs to synchronize the content of the file through the accessValue method. Depending on the actions that the class allows on the file content, the instrumentation indicates whether the access modifies the value or not.
• Calls to non-CE methods whose code has not been instrumented (black-box methods). Besides the callee object, the values passed as parameters also require synchronization. In instrumented methods, the synchronization of parameter values is instrumented in the body of the method; for non-instrumented methods, the runtime cannot delay this verification beyond the method invocation. For each parameter, the code adds an invocation of the accessValue method to synchronize the accessed values. Again, the instrumentation assumes that the body of the method modifies the content of the object.
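The effect of the first two patterns can be sketched in plain Java. The text only names the two runtime methods, executeTask and accessValue; the signatures used here, the RuntimeStub class that merely logs the calls instead of running anything asynchronously, and the Counter class are all assumptions made for illustration, not the framework's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the runtime API; the signatures are assumptions.
class RuntimeStub {
    static final List<String> log = new ArrayList<>();

    // Records the task submission; a real runtime would run it asynchronously.
    static void executeTask(String method, String declaringClass,
                            boolean hasTarget, Object... params) {
        log.add("executeTask " + declaringClass + "." + method
                + " target=" + hasTarget + " params=" + params.length);
    }

    // Records the synchronization request and returns the object unchanged.
    static <T> T accessValue(T value, boolean isWritten) {
        log.add("accessValue written=" + isWritten);
        return value;
    }
}

// Hypothetical application class.
class Counter {
    int value;
    void increment() { value++; }   // assume the developer selected this as a CE
    int get() { return value; }     // a regular, non-CE method
}

class InstrumentationSketch {
    // Code as the developer writes it.
    static int original(Counter c) {
        c.increment();
        return c.get();
    }

    // Roughly what the instrumented version would do.
    static int instrumented(Counter c) {
        // CE call on an instance: the callee object travels as a parameter.
        RuntimeStub.executeTask("increment", "Counter", true, c);
        // Non-CE call: synchronize first, pessimistically assuming a write.
        return RuntimeStub.accessValue(c, true).get();
    }
}
```

Because the stub never executes the task, calling instrumented only shows which runtime calls are issued and in which order; it does not reproduce the result of the original code.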
The instrumented code replaces the original classes of the application, and the building process continues as usual for a regular Android application: the code is converted into Dalvik bytecode and bundled into the APK file.
Besides the code instrumentation, the COMPSs Instrumenter needs to modify the AndroidManifest that is bundled along with the application logic to include the runtime service as an application component and request the permissions that it requires to operate. Namely, it requests the Internet access permission, for interacting and exchanging data with the surrogate nodes, and the permission to write to external storage, for using the SD card of the mobile device to store intermediate values and release the memory from that burden. As discussed in Chapter 8, this thesis considers securing the submission of these intermediate data values over the network; however, it dismisses applying any encryption mechanism automatically to protect data at rest. Figure 4.5 illustrates the differences between the original manifest and the extended manifest.
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    android:versionCode="1"
    android:versionName="1.0"
    package="es.bsc.mobile.apps.bs" >
    <uses-sdk android:minSdkVersion="8" android:targetSdkVersion="21" />
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE" />
    <application
        android:allowBackup="true"
        android:debuggable="false"
        android:icon="@drawable/ic_launcher"
        android:label="@string/app_name" >
        <activity
            android:label="@string/app_name"
            android:name="es.bsc.mobile.apps.bs.MainActivity" >
            <intent-filter>
                <action android:name="android.intent.action.MAIN" />
                <category android:name="android.intent.category.LAUNCHER" />
            </intent-filter>
        </activity>
        <service
            android:name="es.bsc.mobile.runtime.service.RuntimeService"
            android:process=":newprocess" />
    </application>
</manifest>
Figure 4.5: AndroidManifest file extended by the COMPSs Instrumenter. The additional elements are highlighted over the gray code, which corresponds to the original Android Manifest.
4.4 Summary
This chapter gives a glimpse of the solution proposed in this dissertation to build mobile applications that offload computation onto remote resources to improve their performance. Developers write the application leveraging an extended version of the COMPSs programming model. The extension allows developers to declare multiple implementations for one Core Element; thus, when the application invokes any of the implementations, the programming model creates a new task, and the designated device executes the most suitable implementation given the features of its hardware and the characteristics of the task.
As with the original COMPSs programming model, a runtime system runs along with the application to orchestrate the execution on the underlying infrastructure. This runtime has an API with two methods: one for submitting tasks, executeTask; and a second one for registering data accesses from the application code, accessValue. To holistically manage the available resources for all the applications running on the mobile device, the runtime is split into two parts. Running in the same process as the application, one part of the runtime handles the invocations to the API, creates asynchronous tasks and detects the dependencies caused by accesses to the private data values of the application, such as objects. The second part of the runtime, named Orchestrator, manages the aspects of the execution shared among applications; for instance, detecting the dependencies among tasks due to accesses to files or running the tasks on the computing devices. This second part of the runtime runs as a service in an independent process; all the COMPSs applications running on the mobile device contact the same instance of the service. To manage the available resources, the Orchestrator groups the computational devices into Computing Platforms according to the mechanisms required to provide the processing elements with the necessary input values, launch the task execution avoiding resource oversubscription and fetch the results back from them. The Offload Decision Engine evaluates the temporal, energetic and economic costs of running a task on each Computing Platform, and decides which of them executes it. Then, each Computing Platform decides internally on which resources, at which moment and with which implementation that task runs. For the different processes and Computing Platforms to share data values, each process deploys an instance of a distributed key-value store that asynchronously replies to queries about data values: the Data Manager.
For the code provided by the developer to invoke the runtime system, the programming environment has to instrument the application, monitoring data accesses and replacing the calls to the methods selected as Core Elements with calls to the runtime. This instrumentation happens while building the distribution package of the application. The framework extends the default Android build process with an additional step that replaces the original code of the application with the instrumented one, attaches the runtime system to the bundle and updates all the necessary configuration files.
Part III
Exploitation of Local Computing Resources
Chapter 5
CPU Exploitation
The first step towards fully exploiting a Mobile Cloud environment is to make proper use of the computing elements within the mobile device. Of all the devices embedded in a mobile, CPU cores are the most natural choice for standard developers to host the computation. Current mobiles are equipped with multi-core CPUs; to benefit from their computing capacity, programmers have to deal manually with the management of multiple Java threads or learn the details of the classes that the Android framework provides for concurrent execution. In addition to this management, developers need to face the concerns related to the logic of the application already discussed in Chapter 3: partitioning the application and orchestrating the execution of such parts on the cores of the CPU.
COMPSs releases application developers from these concerns and passes the responsibility to the runtime system. Upon detecting a task, the Offload Decision Engine picks one of the available platforms to host the computation according to the execution forecasts provided by such platforms and requests the selected one to execute the task. At this point, the task may still have some pending data dependencies on previous tasks that need to complete before it can run. Computing platforms must hold the execution of tasks with pending dependencies until they are resolved. For that purpose, they monitor the state of the global execution to detect the creation of such data values and notice when a blocked task becomes dependency-free.
The number of processing elements assigned to a Computing Platform is limited, and the number of tasks to run in parallel on the platform is likely to be larger than that. To avoid overloading the devices and harming their performance, platforms need to plan the execution of all the tasks on the available resources over time, guaranteeing that all the necessary data will be in place before they start running. When the time scheduled for a task to run arrives, the platform triggers its execution on the processing element and monitors it. At the end of the execution, the platform retrieves from the processing element any relevant value generated or updated during the task execution (i.e., the values for INOUT and OUT parameters) and publishes their existence. Thus, other tasks can fetch their value from the Data Manager to run on the same processing element, on another processing element within the same platform, or on another platform.
In short, a platform is responsible for providing a forecast of the end time, energy consumption and cost of running a task. Once the Offload Decision Engine assigns a task to the platform, the platform monitors the existence of the data values that the task requires to run and obtains them, schedules task executions on the available processing elements, submits the executions to the actual resources, and collects their results to make them available to other tasks.
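These responsibilities can be pictured as a small Java interface. This is only a sketch: the interface, the Forecast fields and their units, and in particular the weighted-sum policy used to compare forecasts are assumptions introduced here; the text does not specify how the Offload Decision Engine combines time, energy and cost.

```java
import java.util.List;

// Hypothetical forecast triple; field names and units are assumptions.
class Forecast {
    final long endTimeMs;
    final double energyJoules;
    final double cost;

    Forecast(long endTimeMs, double energyJoules, double cost) {
        this.endTimeMs = endTimeMs;
        this.energyJoules = energyJoules;
        this.cost = cost;
    }
}

// Hypothetical view of a Computing Platform: it forecasts and it executes.
interface ComputingPlatform {
    String name();
    Forecast forecast(String coreElement);  // predicted end time, energy and cost
    void submit(String coreElement);        // the platform takes over the task
}

// Test double that always returns the same forecast and counts submissions.
class FixedForecastPlatform implements ComputingPlatform {
    private final String name;
    private final Forecast fixed;
    int submitted;

    FixedForecastPlatform(String name, Forecast fixed) {
        this.name = name;
        this.fixed = fixed;
    }

    public String name() { return name; }
    public Forecast forecast(String coreElement) { return fixed; }
    public void submit(String coreElement) { submitted++; }
}

class OffloadDecisionSketch {
    // Illustrative policy only: the lowest weighted sum of the three costs wins.
    static ComputingPlatform assign(List<ComputingPlatform> platforms, String coreElement,
                                    double wTime, double wEnergy, double wCost) {
        ComputingPlatform best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (ComputingPlatform p : platforms) {
            Forecast f = p.forecast(coreElement);
            double score = wTime * f.endTimeMs + wEnergy * f.energyJoules + wCost * f.cost;
            if (score < bestScore) {
                bestScore = score;
                best = p;
            }
        }
        if (best != null) {
            best.submit(coreElement);
        }
        return best;
    }
}
```

With weights (1, 0, 0) the sketch picks the platform that forecasts the earliest end time; with (0, 0, 1) it picks the cheapest one.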
5.1 CPU Platform
For the runtime to support task executions on the CPU, it requires a computing platform able to orchestrate the execution of tasks on its cores: the CPU Platform. Upon the reception of a new task from the Offload Decision Engine, the CPU Platform submits the task to an internal Scheduler which orchestrates the execution of the tasks assigned to the platform on the available resources. The Scheduler plans not only the execution of the tasks but also all the necessary transfers to obtain the proper value of the accessed data values from other processes or nodes.
The first thing that the Scheduler does with a newly submitted task is to check whether there are any pending dependencies. For that purpose, it queries the local Data Manager about the existence of every datum used as input for the task. If the value exists, the Data Manager replies to the query directly; otherwise, it registers the query and delays the existence notification until the value is created.
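The existence query described above behaves like a one-shot subscription. The following minimal sketch assumes string identifiers for data versions and Runnable callbacks; the class and method names are hypothetical and do not come from the actual Data Manager.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical existence registry; names and types are assumptions.
class ExistenceRegistry {
    private final Set<String> existing = new HashSet<>();
    private final Map<String, List<Runnable>> pending = new HashMap<>();

    // Replies at once if the datum exists; otherwise registers the query.
    void requestExistence(String dataId, Runnable onExists) {
        synchronized (this) {
            if (!existing.contains(dataId)) {
                pending.computeIfAbsent(dataId, k -> new ArrayList<>()).add(onExists);
                return;
            }
        }
        onExists.run();  // callbacks run outside the lock
    }

    // Called when a task publishes the datum; wakes up the delayed queries.
    void markCreated(String dataId) {
        List<Runnable> waiting;
        synchronized (this) {
            existing.add(dataId);
            waiting = pending.remove(dataId);
        }
        if (waiting != null) {
            for (Runnable callback : waiting) {
                callback.run();
            }
        }
    }
}
```

A Scheduler built on such a registry could count, per task, how many notifications are still missing and treat the task as dependency-free when the count reaches zero.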
Once it has received the notification, the Scheduler can decide to trigger the retrieval of the actual values needed to run the task. When that time arrives, the platform contacts the local Data Manager again, requesting the value associated with the corresponding version of the datum. If the Data Manager already contains the value, it immediately notifies the presence of the value so that the task execution can use it; otherwise, it registers the query and fetches the value from a remote source. Once the Data Manager receives the value, it stores it and notifies the platform of its presence.
Often, the body of a task modifies the value of a datum passed as a parameter; however, the Data Manager needs to preserve the original value so that tasks running later on the same process can read it. When the platform requests a value for an INOUT parameter, the Data Manager clones the value stored for the initial version of the datum, and the task uses the copy as a parameter; thus, the task modifies the clone and leaves the stored value untouched. The Data Manager delays the value presence notification of INOUT parameters until the respective copy operation finishes.
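The copy-on-INOUT behaviour can be illustrated with a tiny versioned store. The sketch assumes that values are int arrays and that each version of a datum has its own string key; both choices, like the class and method names, are simplifications made here and not details taken from the text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical versioned store; int[] values and string keys are assumptions.
class VersionedStore {
    private final Map<String, int[]> values = new HashMap<>();

    void store(String versionId, int[] value) {
        values.put(versionId, value);
    }

    // IN parameter: the task only reads, so the stored value can be shared.
    int[] getForRead(String versionId) {
        return values.get(versionId);
    }

    // INOUT parameter: the task receives a copy and the stored value stays intact.
    int[] getForUpdate(String versionId) {
        return values.get(versionId).clone();
    }
}
```

A task that updates the datum works on the array returned by getForUpdate, and the platform later stores that array under the key of the new version, so readers of the previous version still see the original content.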
Eventually, the Scheduler notices that the local Data Manager can provide the values to run the task on the cores. From that point on, the Scheduler can decide to launch the execution of any of the implementations of the CE at any moment. The platform has a pool of independent Java threads, whose size is configurable, continually polling the Scheduler for a task to run. When a thread gets a task from the scheduler, it gathers all the input values of the task and invokes the method corresponding to the selected implementation using reflection. At the end of the method execution, the thread collects the result of the method and the values of all the parameters – possibly modified – and notifies the platform of the end of the execution so that it stores values for the new data values on the Data Manager. At this point, the Data Manager notifies the existence of the just-stored values so that the Scheduler processes the new dependency-free tasks.
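The thread pool and the reflective invocation can be sketched as follows. The sketch makes several assumptions that the text does not state: threads block on a queue instead of polling, tasks map to static methods, and the ReadyTask, WorkerPoolSketch and Kernels classes are hypothetical names.

```java
import java.lang.reflect.Method;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical description of a task whose input values are already local.
class ReadyTask {
    final Class<?> owner;
    final String methodName;
    final Class<?>[] paramTypes;
    final Object[] args;
    volatile Object result;
    final CountDownLatch done = new CountDownLatch(1);

    ReadyTask(Class<?> owner, String methodName, Class<?>[] paramTypes, Object[] args) {
        this.owner = owner;
        this.methodName = methodName;
        this.paramTypes = paramTypes;
        this.args = args;
    }

    // Blocks until a worker has finished the task and returns its result.
    Object awaitResult() {
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result;
    }
}

class WorkerPoolSketch {
    private final BlockingQueue<ReadyTask> ready = new LinkedBlockingQueue<>();

    // Starts a configurable number of worker threads.
    WorkerPoolSketch(int size) {
        for (int i = 0; i < size; i++) {
            Thread worker = new Thread(this::loop, "cpu-worker-" + i);
            worker.setDaemon(true);
            worker.start();
        }
    }

    void submit(ReadyTask task) {
        ready.add(task);
    }

    private void loop() {
        try {
            while (true) {
                ReadyTask task = ready.take();  // waits for the next ready task
                try {
                    Method m = task.owner.getDeclaredMethod(task.methodName, task.paramTypes);
                    m.setAccessible(true);
                    task.result = m.invoke(null, task.args);  // static method assumed
                } catch (ReflectiveOperationException e) {
                    task.result = e;  // a failure is reported as the result
                } finally {
                    task.done.countDown();  // end-of-execution notification
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

// Hypothetical task implementation.
class Kernels {
    static int sum(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }
}
```

In the runtime described in the text, the completion notification would also carry the possibly modified parameter values so that the platform can store them on the Data Manager; the sketch only returns the method result.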
Figure 5.1 depicts the 8-step process involved in the execution of a task on the CPU Platform. Steps 1 and 2 represent the existence queries and notifications done for each parameter, and steps 3 and 4 are the presence queries/transfer requests for the values and their corresponding notifications. Step 5 illustrates the submission of the task to one of the executing threads; step 6, the actual execution of the task on the CPU core; and step 7, the notification of the task completion to the Scheduler. Upon the reception of the notification, the platform stores all the updated/new values on the Data Manager, as represented by step 8.
[Diagram for Figure 5.1. Labelled components: Mobile Device, Runtime Process, Task Executor, Data Manager, CPU Platform, Offload Decision Engine, Scheduler, Thread Pool. Steps numbered 1 to 8.]
Figure 5.1: Architecture of the CPU platform illustrating the flow involving a task execution.
The current policy for scheduling the transfers consists in requesting all the data values of a task at the same time, as soon as the Scheduler realizes that the task is dependency-free. For task executions, the Scheduler follows a FIFO policy that takes the moment when all the parameters of the task are present on the local Data Manager as the moment when the task enters the Scheduler.
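A consequence of this policy is that tasks run in the order in which their parameters become present, not in the order in which they were submitted. The minimal sketch below illustrates it with hypothetical names; counting the missing parameters per task is an assumption about the bookkeeping, not a documented detail.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical FIFO queue ordered by the moment a task becomes runnable.
class FifoReadyQueue {
    private final Map<String, Integer> missing = new HashMap<>();
    private final Deque<String> ready = new ArrayDeque<>();

    // Registers a task together with the number of values not yet present.
    void addTask(String taskId, int missingParams) {
        if (missingParams == 0) {
            ready.addLast(taskId);
        } else {
            missing.put(taskId, missingParams);
        }
    }

    // Called once for every value of the task that becomes present locally.
    void parameterPresent(String taskId) {
        Integer left = missing.get(taskId);
        if (left == null) {
            return;  // unknown task or already runnable
        }
        if (left == 1) {
            missing.remove(taskId);
            ready.addLast(taskId);  // the task enters the FIFO queue now
        } else {
            missing.put(taskId, left - 1);
        }
    }

    // Next task to run, or null if none is runnable.
    String next() {
        return ready.pollFirst();
    }
}
```

For example, if t1 is submitted first but waits for two values while t2 waits for one, and the value for t2 arrives first, then t2 runs before t1.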
Last, but not least, the platform needs to provide the forecasts for the execution. To make these predictions, the platform takes into account information related to the available resources, like the number of cores and their power consumption; to the underlying infrastructure, such as the speed of the network or data transfer fees; to the task to run, such as the CE, the implementations able to run on the platform or the size of the parameters; and to the pending workload of the platform.
Android directly provides some of these values, such as the power consumption of the mobile components (cores, screen, network interfaces, ...). The runtime can infer other values from the current state and standards; for instance, the network protocol in use and the strength of the signal give an estimate of the speed of the network. Values like data fees or the number of available cores require the user to set them up. However, information such as the time required to run a specific task implementation on the CPU requires application-specific knowledge that developers do not provide.
For the runtime to obtain this information, the CPU Platform profiles the execution of each task to collect data about the duration of the execution and the sizes of the input and output parameters. At the end of the task, the runtime adds the measured values to a statistical analysis of the historical values obtained throughout all the executions of the application. Currently, this analysis consists in keeping track of the highest, lowest and average values, but it could include other measures such as the standard deviation. Unlike the execution time, which depends on the selected implementation and resources, the input and output data sizes are a feature shared by all the implementations of the CE regardless of the platform and resources running it. For this reason, the runtime keeps a record of the measures common to all the platforms, the Core Profiles, and each platform owns a data structure to register the execution time of each implementation on its resources, the Implementation Profiles. The CPU Platform assumes that all the cores of the CPU have the same characteristics; that is, an implementation behaves the same regardless of the CPU core running it. Therefore, the data structure groups the profiles only by the implementation executed. To support executions on systems with a heterogeneous computing architecture coupling different types of processor cores, such as ARM big.LITTLE [6], the application user should set up the runtime to use a different platform to manage the cores of each type of processor and use a thread affinity mechanism to bind the execution threads of each platform to the cores of the specific processor.
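The per-implementation statistics can be kept with constant memory by updating the lowest, highest and average values incrementally. The sketch below is an assumption about how such a structure might look; the class names, the string keys for implementations and the use of milliseconds are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Running lowest, highest and average of a series of measurements.
class RunningStats {
    private long count;
    private double lowest = Double.POSITIVE_INFINITY;
    private double highest = Double.NEGATIVE_INFINITY;
    private double mean;

    void register(double value) {
        count++;
        lowest = Math.min(lowest, value);
        highest = Math.max(highest, value);
        mean += (value - mean) / count;  // incremental average
    }

    double lowest() { return lowest; }
    double highest() { return highest; }
    double average() { return mean; }
}

// Hypothetical Implementation Profiles of one platform, grouped by implementation.
class ImplementationProfiles {
    private final Map<String, RunningStats> byImplementation = new HashMap<>();

    void registerExecution(String implementation, double millis) {
        byImplementation.computeIfAbsent(implementation, k -> new RunningStats())
                        .register(millis);
    }

    RunningStats statsOf(String implementation) {
        return byImplementation.get(implementation);
    }

    // Implementation with the lowest average execution time, or null if none.
    String fastestOnAverage() {
        String best = null;
        double bestAverage = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, RunningStats> entry : byImplementation.entrySet()) {
            if (entry.getValue().average() < bestAverage) {
                bestAverage = entry.getValue().average();
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Adding the standard deviation mentioned in the text would only require one more running accumulator in RunningStats.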
Considering the average execution time observed, the CPU Platform can find out which is the fastest implementation for each task. Expecting that all the tasks submitted to the platform waiting for execution require the average execution time to run, the platform can estimate when