8.2 Scheduling algorithms
8.3.1 Technical Background
Instead of building our own Grid infrastructure for testing, development and pilot appli- cations, we have decided to opt for using the powerful Grid infrastructure made available
8.3. PROTOTYPE IMPLEMENTATION 68 in Czech Republic by the METACenter project [78]. The META Center project was en- hanced during year 2003 by a new project called Distributed Data Storage (DiDaS) [30] incorporating new distributed storage based on an Internet Backplane Protocol (IBP) [5]. Such storage infrastructure can be efficiently used for implementation and deployment of DEE system, however as the scheduling system in use is not capable of scheduling jobs with respect to location of the data and there is also neither data location optimization nor prefetch functionality and our data-intensive application requires at least some of them for optimal performance, we had to enhance the underlying infrastructure.
As a part of the Grid infrastructure, the METACenter provides large computational power in the form of IA32 Linux PC clusters that are being rapidly expanded each year because of the cost-efficiency of this solution.
As follows from the discussion in Section 8.2, we need some globally accessible dis- tributed data storage for transient storage of source, intermediate, and target data, that provides high enough performance to supply data to processing and that supports data replicas. The filesystem shared across these PC clusters is based either on a rather slow globally accessible AFS filesystem supporting read-only administrator controlled replicas (several terabytes of storage are available) or on a somewhat faster site-local NFS which doesn’t support data replicas, has its own problems such as broken support for sharing files larger than 2 GB and a capacity on order of only a few tens of gigabytes available. Therefore we need a different means of storage for processing large volumes of data.
The IBP uses soft-consistency model with a time-limited allocation thus operating in best effort mode. A basic atomic unit of the IBP is a byte array providing an abstraction independent of the physical device the data is stored on. An IBP depot (server), which provides set of byte arrays for storage, is the basic building block of the IBP infrastructure offering disk capacity. By mid-2004 the IBP data depots were present in all cluster locations as well as distributed across other locations in the Czech academic network providing total capacity of over 14 TB.
The PBSPro [88] scheduling system is used for job scheduling across the whole META - Center cluster infrastructure. The PBSPro supports queue based scheduling as well as properties that can be used for constraining where a job may be run based on user re- quirements. These properties are static and defined on per-node basis. Under ideal cir- cumstances the PBSPro is capable of advance reservations and estimate of time when a specified node is available for scheduling new jobs unless a priority job is submitted. The latter feature requires cooperation with users that submit their jobs as the PBSPro needs an estimate of processing time provided by the job owner for each job—otherwise the maxi- mum time for the specified queue is used and this results in non-realistic estimate of when a specified processing node will be available.
For video processing we use thetranscodetool [55]. As discussed in Section 7.2.2,
this tool is unable to directly produce RealMedia format, which is one of few streaming formats with strong multi-platform support and which is also the format of choice for our pilot applications, the CESNET video archive and the Masaryk University lecture archive. Therefore we also need to use Helix Producer [72] to create the required target format. The Helix Producer needs raw video with PCM sound as input file and as this format is rarely the format of the input video, we use thetranscodefor pre-processing data for the Helix
Producer.
8.3.2 Architecture
As shown in Figure 8.6, the architecture of the Distributed Encoding Environment com- prises several components and component groups:
User interface The basic functionality of the user interface is for user to provide input information on the job. The basic information is usually the input file (media), job
User Interface Job Logging Job Preparation Local Scheduler Interfaces Job Submission Job Monitoring
Media Processing Tools
Media Analyzer Media Spli!er Media Processor Media Merger Distributed Storage
FIGURE8.6: Distributed Encoding Environment architecture and components.
chunk size, target file/media, and target format and its parameters. If multiple pro- cessing infrastructures are available, the user may also select which one will be used for processing. When job monitoring module is available, the user interface may pro- vide visualization of the job progress and informs user about problems encountered. Job preparation This module steers the preparation of the parallel job. It invokes media analyzer to find source format and its parameters, performs splitting job into chunks automatically or based on user specification if provided, prepares job files and in- vokes the job submission procedure to pass the jobs to local job scheduling system on the computing facilities.
Local scheduler interface The local scheduler interface group provides one mandatory and one optional module: mandatory job submission interface to send the jobs for computation on computing facilities and optionally also job monitoring interface so that user can monitor overall job status using user interface. Shall the system support the proposed scheduling model (Section 8.3.4), the job submission interface should provide not only the “write only” access for job submission, but it should be also ca- pable of reporting when specified resource will be available for scheduling according to the local scheduler tsched_free
p (with all the limitations discussed above).
Job logging facility The job logging facility takes care of keeping permanent track of the job status, results and especially error conditions. This component can be used only if job monitoring interface is used.
Media processing tools The media processing group contains four modules: media ana- lyzer for source media/format analysis, media splitter for splitting the source media into the job chunks, media processor, which is the actual processing, tool and media merger for merging resulting media chunks.
Distributed storage Any distributed storage system that the media processing tools can interact with. It is desirable to have a system that can use data replicas for optimizing
8.3. PROTOTYPE IMPLEMENTATION 70 performance and that supports data location hinting for user to be capable of spec- ifying data location, so that advanced functionality of the scheduling model can be utilized.
The workflow in this architecture works as follows: a user specifies his jobs using the user interface. The job preparation module analyzes the source data using media analyzer and if an unsupported format is found, the user is notified and the processing is termi- nated. When supported format is found, the source media is split into smaller chunks either based on chunk size specified or degree of parallelism specified by the user or even using predefined split points provided by the user. The splitting can be done though job submission interface or locally—the local splitting is sometimes desired since it avoids the job enqueuing latency, which can be quite long on heavily loaded systems.
The job preparation module then creates job specification for each individual media chunk and sends it via job submission interface to the local job submission system on the computing resource using the scheduling mechanism described in Section 8.3.4. It prepares the last job that merges resulting chunks into the target file or media. This job is executed only if all the chunk-processing jobs finish successfully.
If the job monitoring interface is present, all the jobs are monitored throughout their lifetime and the results are gathered by the user interface. Also, if any error situation is found, the user is notified both using the user interface and using the job logging facility. Security Considerations. The Grid environment provides strong security via Grid Secu- rity Infrastructure (GSI) [22] based on X.509 certificates. Other projects have developed enhanced security architectures based on GSI and each computing Grid usually has some security infrastructure for authentication, authorization, and accounting (AAA) readily available. Because DEE is designed for Grid environment, it relies on Grid infrastructure for the AAA functionality.