Workload Management System Extensions for
gLite
Max Berger1, Thomas Fahringer1, Andr´as L´aszl´o2, Attila Kert´esz3, and Dezs˝o Horv´ath2
1 Leopold-Franzens-Universit¨at Innsbruck 2
MTA KFKI Rszecske- s Magfizikai Kutatintzet 3 MTA SZTAKI Computer and Automation Research Institute
Abstract. The gLite Grid middleware supports execution of jobs on dis-tributed resources through its Workload Management System. However, running large grid applications and scientific experiments still requires manual management. In this paper we describe the new WMSX tool. WMSX wraps the gLite workload management to provide job manage-ment and ease-of-use. We describe our extensions for job managmanage-ment with automatic result retrieval, live debugging, easier run of parame-ter studies, pre- and post-execution scripts, and lightweight workflow support. The WMSX system has been sucessfully used to submit and manage a large number of Grid jobs.
1
Introduction
Grid computing [4] can federate geographically and institutionally distributed resources for coordinated problem solving based on shared resources across dif-ferent domains. Grid middleware software abstracts the physical grid from the user. It hides the complexity of the grid from the user and allows easy access to job submission, execution monitoring, and data handling.
The gLite [8] distribution is the Grid middleware produced and used by the Enabling Grids for E-scienE (EGEE) [7] project. It is based on a collection of previous research projects, combined to deliver a production quality software distribution [9]. The gLite middleware is currently deployed on hundreds of sites as part of the EGEE project and enables global science in a number of disciplines, notably serving the LHC Computing Grid (LCG) project. At this time the EGEE project enables access to 41.000 CPUs and 5 Petabytes of storage [1], making it one of the largest Grids in the world.
The gLite Workload Management System (WMS) [3] supports the execution of computational jobs on remote computational resources in the Grid. To sub-mit a job it must first be described using the Job Description Language (JDL) [13]. JDL contains the description of the job characteristics and requirements. It defines input data, output data, an executable to use, and requirements for the computational node. When a user submits the job, the description and the input data are uploaded to the WMS. The user will receive a JobID which iden-tifies the job in execution. The job is scheduled to a remote site, where it will
Fig. 1.Architecture of the gLite Workload Management System.
eventually be executed. During the life of the job, a user can repeatedly query its current status. When the execution is finished, the results can be retrieved from the WMS and stored on the users system. To keep track of a submitted job, a user must manually keep track of the assigned JobID.
The WMS system, as currently implemented in gLite, shows several draw-backs: JobIDs must be managed by the user, the black box philosophy makes it difficult to debug failed jobs, and the WMS has great support for single jobs, but has only limited support for workflows.
In this paper we describe a novel tool, WMSX, which aims to resolve these issues. The WMSX tools wraps the gLite WMS to provide easier job management for the end-user. It manages job submissions, provides debugging support, and provides support for simple workflows. Its main objective is ease-of-use.
The remainder of this paper is organized as follows: Section 2 describes the Workload Management System and some other extending systems in more de-tail. Section 3 introduces a new extension for the management of complex jobs. Section 4 draws conclusions and gives an outlook.
2
Related work
2.1 gLite WMS
The Workload Management System (WMS) consists of multiple Grid middleware components responsible for the distribution and management of jobs across Grid resources [2]. The WMS is composed of the Workload Manager (WM) and the job Logging and Bookkeeping Service (L&B). It uses other gLite services for data management, policy handling, execution, and information collection. Figure 3 shows the overall architecture of the Workload Management System.
Although WMS takes care of the actual execution, it does not provide much help for the user in job handling. Users have to manually keep track of the JobIDs. They have to manually check if the job is done, and if so, manually request the results to be transferred back to their personal machine. This works well for a limited number of jobs. However, in practice users may have hundreds
or thousands of jobs running on the grid, which requires a more sophisticated job handling.
2.2 Ganga
Ganga (Gaudi /Athena and Grid Alliance) [10] is a frontend for job definition and management, implemented in Python. It maps job description and execution to python classes. Ganga allows switching between testing on a local batch system and large-scale processing on Grid resources. Ganga can be run interactively or used in a python script.
The Ganga system provides job management. It keeps track of the submitted jobs, their status, and is able to retrieve results. It uses an embedded database to keep the information about jobs between runs, so that a user can log out, log back in later and still has access to the same managed jobs. Ganga will fail when multiple instances are run on the same machine, as the access to the embedded database is not properly locked.
The Ganga system provides full support for managing submitted jobs. How-ever, the lack of concurrency support makes it difficult to use it in automated environments. Also, requiring python usage adds to the reluctance of some po-tential users.
2.3 DIANE
DIANE [11] is a job execution control framework for parallel scientific applica-tions. It provides automatic load balancing, fine-grained scheduling and failure recovery. Its default scheduling plug-in algorithm is developed for jobs without inter-task communication. It is designed to be extended with workflow capabil-ity. DIANA uses a master-worker communication model. A master server will submit generic workers onto the Grid, which will then communicate back to the master to acquire the actual computational tasks to be executed.
Although DIANE works well for jobs without interdependencies, the support for workflows is currently limited. As it is based on Ganga, it inherits the lack of concurrency support. The use of generic workers may result in better perfor-mance of the DIANA application but it is unfair towards other Grid applications, which have to be schedules in the regular use the job queues.
2.4 Other Systems
Many other systems, such as Taverna [12], Triana [5], or Askalon [6], provide workflow and job management for the Grid. However, these tools provide their own execution engine and contain some overlapping functionality with gLite. Most of these tools access the grid on a lower level, and are currently incompati-ble with gLite. Rather than porting one of these large systems a more lightweight approach was chosen.
1. Write JDL
2. Submit job to WMS 3. Record JobID
4. Check for job status (fre-quently)
5. Wait until job is done 6. Retrieve output
Fig. 2.Managing a single job man-ually with the standard WMS.
1. Write JDL
2. Submit job to WMSX
Fig. 3. Managing a single job au-tomatically with WMSX.
Fig. 4.Architectural placement of WMSX
3
WMSX
In this section we will introduce a new toolkit (WMSX) which extends the gLite Workload Management system with additional functionality. The existing sys-tems showed limitation in job management, job debugging support, or workflow handling. Some of the tools where not flexible enough or limited to a particular programming or Grid environment, or did not work concurrently. The WMSX (Workload Management System eXtensions) system was developed to overcome these limitations.
The WMSX system takes care of job submission management. It main goal is to simplify the use of the existing gLite middleware, while still being lightweight an non-intrusive. Figures 2 and 3 show a comparison of manual job management with WMS alone versus automatic job management with WMSX.
3.1 Architecture
Figure 4 gives an overview of the WMSX architecture and its placement in the overall Grid middleware architecture. WMSX provides a layer between the user and the existing workload management system, providing additional functional-ity. The WMSX system consists of two applications, the WMSX controller and the WMSX provider.
Fig. 5.The Black box philosophy does not return any output if the job fails.
Fig. 6.WMSX usesstdoutfor con-tinuous feedback, providing output even if the job fails.
The WMSX controller is a separate application which keeps running as long as there are jobs to manage. It contains the interface to the actual WMS back-end. Job submissions are passed to WMS. The JobIDs are kept in an internal database and made available to the user. The status of the jobs is tracked and recorded. When a job is done, its results can be automatically retrieved, allow-ing the user to concentrate on the application rather than the grid specifics. The WMSX controller supports gLite and a local back-end for use during develop-ment. As there is exactly one controller running per user, the WMSX system has full support for concurrent use.
The WMSX requestor provides an interface for the user. It consists of a command line tool, which forwards the requests to a running WMSX controller instance. The requestor itself is lightweight: It performs basic sanity checks and forwarding the requests to the controller. The requestor can be called from com-mand line scripts to be used for automated execution.
The WMSX system is capable of handling many jobs at the same time. It can be limited to submit a given number of jobs for execution at the same time, to prevent overloading the gLite WMS. It will automatically submit the remaining jobs once previous jobs are done. This self-limitation allows reducing an overloading effect caused by bugs in the submission.
3.2 Debugging
The Grid philosophy treats a submitted job as a black box. Once a job is sub-mitted, information is available where the job is executed. Also, it is possible to learn what Grid state a job is in. However, it is impossible to trace the state of the execution of the actual job contents. The results of a job are only avail-able when a job is finished successfully. This works very well for successful jobs. However, when a job fails the output is unavailable. Figure 5 shows an example of a failed execution.
For debugging purposes it would be interesting to know what happened dur-ing the execution of the job. When a job produces no output, we would like to know where it stopped and why. For this, a trace of the job execution up to and including the failure point is necessary.
experiment1.tar.gz experiment2.tar.gz experiment2.wjdl arglist.txt
Fig. 7. Example of directory con-tents for a parameter study.
experiment1 a experiment1 b experiment1 c experiment2 1 1 experiment2 1 2
Fig. 8.Sample contents of the pa-rameter filearglist.txt.
To provide debugging support we must first distinguish between two types of job output: The actual job output (data output), and the job debugging output. If the execution of the job was successful, only the data output is needed. The debug output is only relevant if the job execution fails. When examining existing program run on the Grid, most follow the following pattern: Data output is available as result files, while debugging output is send to stdout or stderr.
The simple solution therefore seems to just redirect stdout to a file and retrieve it along with the job output. However, this method suffers from two problems: First, the job output is not available. If a job fails, disappears, or hangs forever, the job output cannot be retrieved by the user. Second, some programs produce a lot of debug output. Enabling all debug output by default may very quickly overfill the available disk space on the worker node, or result in a lot of overhead when transferring the result back.
The gLite WMS provides support for interactive jobs. When running an interactive job, a TCP connection is opened to the original hosts where the job was submitted from. This connection can be used to send data to the worker node to be used asstdin, and to receive the contents ofstdout. When submitting an interactive job, an X-window is opened by default, displaying the output and giving the user the possibility to enter input. On request, a listener process can be started to provide the streams as UNIX pipes. While this functionality does not fit the black box philosophy of the Grid, it is valuable for debugging.
Rather than providing a pipe, WMSX providesstdout as a file. WMSX at-taches itself to the listener process, and reads the data stream immediately. The data is stored in a file as soon as it is received. It uses the same file name and directory as it would when submitting the job non-interactively. Since the data is written immediately, a user is able to trace the application during its execu-tion. As figure 6 shows, the data is always available, even when the job fails or never finishes. Using the same file name and directory allows switching between debugging and normal execution by changing a single parameter.
3.3 Parameter Studies
Many computational experiments can be classified as parameter studies. In a parameter study the same application is run multiple times with slightly different input parameters or input data. Managing a large parameter study requires
keeping track of all submitted jobs and manually checking which runs where successful and which runs to repeat. WMSX provides support for automatic management of large parameter study jobs.
Ideally a parameter study would be defined through its application and a list of arguments. The user wants to specify which application to run, and then pro-vide a list of arguments or parameters for this application. The job management system should then ensure that every single job is run successfully.
WMSX supports these parameter studies through the creation of special application bundles and an parameter file. Figures 7 and 8 show an example application. An application bundle is a.tar.gz file containing the application and a script for its execution. The parameter file contains a list of applications to run and their parameters, thus allowing to run different applications in the same study. For each job a job wrapper script is created. The job wrapper unpacks the application and executes it with the given parameters.
Additional information for an application can be given by adding awjdlfile. Not all applications follow the default pattern for their execution. The generation of the job wrapper and its contents must be customizable. WMSX will therefore look for a file with the application name and thewjdl extension. Thewjdl file contains a list of key-value pairs. It uses the same notation as the JDL file used by gLite. It adds additional attributes for specifying features such as the name of the binary to call, or additional software requirements on the target machine. When an argument list file is submitted, WMSX keeps track of the submitted jobs. It will create a work directory where it will automatically retrieve the result of the individual jobs. A researcher may look at the different results once they are available and associate them with the application and parameters. The internal handling and use of JobIDs is completely hidden from the end-user, allowing focusing on the results rather than the execution mechanics.
3.4 Pre-execution and post-execution
WMSX supports pre- and post-execution scripts to support automation of input generation and result dissemination.
A pre-exec script is executed just before the job is actually submitted to the WMS. This allows the script to generate additional input which may be needed for execution. If the job is part of a parameter study, the pre-exec script will receive the name of the application and its parameters as input, thus allowing using this information in data generation.
The post-exec script is executed after a job is completed and its results are received. The script has full access to the result data, and the name and parameters of the applications in case of a parameter study. The script may now execute any possible dissemination of the results specified by the user.
While the Grid application is run on a worker node, pre-execution and post-execution are run on the users own machine with full access rights of the user. A post-exec script could use the user rights to send notifications via email, or store the results in an AFS directory.
Fig. 9.Example of an equation solver workflow.
3.5 Arbitrary workflows
WMSX also allows the execution of Grid workflows. The mechanisms presented so far provide support for single jobs and for parallel jobs. However, many scien-tific experiments require certain jobs to be executed in a given order, or certain jobs to be executed based on a given condition. A workflow system will provide support for such behavior.
A full description of a workflow would require a workflow description lan-guage, such as AGWL used in the Askalon system. This language would also require proper tool support for both creation of workflows and their execution. As this full support is beyond the scope of this project, a lightweight approach to workflow description was taken.
In many applications is it sufficient to examine the results to decide if another step is necessary, and which step this should be. An example is a linear equation solver: After each iteration the result is checked for precision. If the required precision is reached, the execution is stopped. If the precision is not yet reached, another step is necessary. Figure 9 shows an example workflow.
The if-decision is supported through the post-exec script in WMSX. If the post-exec script returns a non-zero value, a new job is created. An exit value of zero means that the execution has finished.Stdout is kept free for debugging purposes. In the linear equation example the post-exec script would check for the required precision.
WMSX resolved the what-decision by executing a chain script. If the post-exec script returned that there are more jobs to post-execute, a chain script will be run. This script will provide the name of the application to run next onstdout, thus allowing to add additional jobs for the execution. In the linear equation example the chain script would restart the linear equation solver with the new parameters.
The chain script is able to output more than one application to run next. In this case, multiple applications would be run next, thus supporting a workflow split. An example use is a particle simulation for particle collisions. Particle collisions create multiple particles, each of which can be analyzed individually. Rather than a chain of events this type produces a tree of events.
Although this model is lightweight, it already gives the ability to run arbitrary workflows. It supports running jobs in sequence. It also allows to re-run jobs with new parameters, adding the ability to support loops in workflows. This extends the standard model, which did not support cycles. The model is still simple enough so that no special tools are needed.
4
Conclusions
In this paper we have presented a new tool for management of job workload for gLite. The tool supports management of submitted jobs, debugging of jobs, parameter studies, pre-execution and post-execution, and even arbitrary work-flows. The software is freely available at http://www.grid.kfki.hu/twiki/ bin/view/RmkiGrid/WmsxSoftware and has been used successfully to obtain results for a physics PhD thesis, where parameter studies where frequently re-quired.
While some of the functionality is similar to existing application, they lack support for concurrent use or arbitrary workflows. The lightweight approach allows embedding the application in custom scripts.
This work was supported by the European Union, partially through Marie-Curie TOK 50925, EGEE-II INFSO-RI-031688, and CoreGrid IST-004265.
References
1. Enabling grids for E-sciencE (EGEE).
2. EGEE middleware architecture. Technical report, CERN, July 2005.
3. P. Andreetto, S. Borgia, A. Dorigo, A. Gianelle, M. Mordacchini, M. Sgaravatto, L. Zangrando, S. Andreozzi, V. Ciaschini, CD Giusto, et al. Practical approaches to grid workload and resource management in the EGEE project. Proceedings of the International Conference on Computing in High Energy Physics (CHEP2004), Interlaken, Switzerland, 2004.
4. F. Berman, G. Fox, and A. Hey, editors. Grid Computing: Making the Global Infrastructure a Reality. Wiley, New York, NY, USA, 2003.
5. David Churches, Gabor Gomb´as, Andrew Harrison, Jason Maassen, Craig Robin-son, Matthew Shields, Ian J. Taylor, and Ian Wang. Programming scientific and distributed workflow with triana services.Concurrency and Computation: Practice and Experience, 18(10):1021–1037, 2006.
6. Thomas Fahringer, Alexandru Jugravu, Sabri Pllana, Radu Prodan, Clovis Ser-agiotto Junior, and Hong-Linh Truong. ASKALON: A Tool Set for Cluster and Grid Computing. Concurrency and Computation: Practice and Experience, 17(2-4), 2005. http://www.askalon.org.
7. F. Gagliardi, B. Jones, F. Grey, M.E. B´egin, and M. Heikkurinen. Building an infrastructure for scientific grid computing: status and goals of the EGEE project. Philosophical Transactions: Mathematical, Physical and Engineering Sci-ences, 363(1833):1729–1742, 2005.
8. E. Laure, S. M. Fisher, A. Frohner, C. Grandi, P. Kunszt, A. Krenek, O. Mulmo, F. Pacini, F. Prelz, J. White, M. Barroso, P. Buncic, F. Hemmer, A. Di Meglio, and A. Edlund. Programming the grid with gLite. Computational Methods in Science and Technology, 12(1):33–45, 2006.
9. E. Laure, F. Hemmer, F. Prelz, S. Beco, S. Fisher, M. Livny, L. Guy, M. Barroso, P. Buncic, P. Kunszt, et al. Middleware for the next generation Grid infrastructure. proceedings of CHEP, Interlaken, Switzerland, 2004.
10. A. Maier, F. Brochu, U. Egede, J. Elmsheuser, B. Gaidioz, K. Harrison, B. Koblitz, H.C. Lee, D. Liko, J. Moscicki, A. Muraru, V. Romanovsky, A. Soroko, and C.L. Tan. Ganga - an optimiser and front-end for grid job submission. Second EGEE User Forum, May 2007.
11. J.T. Moscicki. Efficient job handling in the grid: short deadline, interactivity, fault tolerance and parallelism. EGEE User Forum, March 2006.
12. Thomas M. Oinn, R. Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole A. Goble, Antoon Goderis, Duncan Hull, Dar-ren Marvin, Peter Li, Phillip W. Lord, Matthew R. Pocock, Martin Senger, Robert Stevens, Anil Wipat, and Chris Wroe. Taverna: lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience, 18(10):1067–1100, 2006.
13. F. Pacini. JDL attributes specification. Technical report, November 2007. EGEE Document EGEE-JRA1-TEC-590869-JDL-Attributes-v0-9.