
CHAPTER 3. DESIGN AND DEVELOPMENT OF GRID BANKING SERVICE

3.2 Discrete-event system model for the simulation of the work of AliEn WMS and

The functioning of the AliEn infrastructure is vital for the ALICE collaboration, so introducing changes to the software or the configuration always requires thorough analysis. Before a change can be deployed on the infrastructure used by the ALICE collaboration, one has to make sure that it will not hinder the work of AliEn users and Monte-Carlo data production operators, or the functioning of AliEn sites. The banking service is no exception. Every time a VO admin deems it necessary to change a configuration parameter (e.g. job nominal price, maximum number of jobs a user is allowed to run in parallel), or a site admin decides to change a site configuration parameter (e.g. price of the CPU resource provided by the site), it is first necessary to make sure that the change will not negatively affect any AliEn user or site, and second it is very desirable to be able to anticipate the effect of the change before it is deployed on the real infrastructure. To make it possible to study the effects of changes to the AliEn banking configuration parameters, a simulation model of the AliEn Workload Management System (WMS) has been developed.

An implementation of the developed model, a simulation program, takes as input a so-called workload, which is a list containing information about jobs (e.g. the arrival time of the job, the user who submitted the job), as well as a list containing information about sites (e.g. number of worker nodes available on the site, SI2k conversion ratio of the site worker nodes) and a list containing information about users (e.g. maximum number of jobs a user is allowed to run in parallel, amount of money available to the user). The program then mimics the work of the AliEn job management system by assigning jobs to the sites in a way similar to the real system, and records important parameters (e.g. start time of the jobs, completion time of the jobs) in the database. The simulation finishes when all the jobs have been processed by the program.

By modifying the simulation parameters one can study the produced effects by analyzing the data which is collected in the database during the simulation.

The number of basic elements which constitute the AliEn WMS is very large (tens of thousands of computers, hundreds of services, etc.) and their interconnection (network links between the machines on the sites, network links between different sites, etc.) is extremely complex, so overall it is a very sophisticated system. That is why in the developed model the basic constituents of the system are grouped into interacting subsystems at an appropriately high level of abstraction, and the whole system is viewed as a union of those high-level components, the so-called entities of the system. The model allows describing and simulating the work of the AliEn WMS as a discrete-event system [120].

In discrete-event simulation a system is modelled in terms of 1) the entities that represent the constituents of the system, 2) its state at each point in time and 3) the activities and events that cause the system state changes. Below we describe what the entities of the simulated system are, how the state of the system is defined and what are the events and activities which cause the changes of the system state.

In discrete-event simulation systems entities are defined as 'objects of interest in the simulated system'. The entities in our simulated system are AliEn Grid jobs, sites and users with corresponding group accounts in the banking service.

The properties used to describe entities are called attributes. In the case of AliEn sites the attributes are parameters which represent the site configuration (e.g. the number of worker nodes, the SI2k ratio), in the case of jobs they represent job parameters (e.g. job duration), and in the case of users the attributes are parameters associated with the users (e.g. the name of the group account in the banking service to which the user belongs). The job, site and user attributes used in the model are presented below.

The jobs are described using the following attributes:

• Job_Id – Identification number of the job

• Job_User – User who submitted the job

• Job_Site – Site where the job has been sent for execution

• Job_Price – Bid of the user who submitted the job

• Job_SI2k – Number of SI2k units the job is going to consume

• Job_Arrival_Time – Job arrival time

• Job_Start_Time – Time when the job was sent for execution

• Job_Run_Time – Job running time

• Job_End_Time – Time when the job finished

• Job_Status – Job status

• Job_Priority – Effective priority of the job

Sites are described using the following attributes:

• Site_Name – Name of the site

• Site_SI2k_Ratio – SI2k conversion ratio of the site worker nodes

• Site_Price – Price of the unit of CPU resource the site provides

• Site_Max_Jobs – Maximum number of jobs the site can concurrently run

• Site_Job_Request_Rate – Site job request rate (seconds)

• Site_Running_Jobs – Number of jobs running on the site

• Site_Account_Name – Name of the site account in the banking service

• Site_Account_Balance – The amount of money available on the site banking account

The attributes which are used to describe the users are:

• User_Name – Name of the user

• User_Account_Name – Name of the group account in the banking service to which a given user is assigned

• User_Account_Balance – The amount of money available on the user's banking account

• User_Running_Jobs – Number of running jobs belonging to a user

• User_Waiting_Jobs – Number of waiting jobs belonging to a user

• User_Max_Parallel_Jobs – Maximum number of jobs a user is allowed to run in parallel

• User_Priority – The value of the priority variable which is used during the computed priority calculation process (Section 3.1)
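As an illustration only, the three entity types and their attributes listed above could be represented by simple record types. The sketch below is a minimal one under stated assumptions: the field types, the default values and the placeholder status 'INACTIVE' are not specified in the text and are chosen here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: int
    job_user: str
    job_price: float                     # bid of the user who submitted the job
    job_si2k: float                      # SI2k units the job is going to consume
    job_arrival_time: int
    job_site: Optional[str] = None       # filled in when the job is matched to a site
    job_start_time: Optional[int] = None
    job_run_time: Optional[int] = None
    job_end_time: Optional[int] = None
    job_status: str = "INACTIVE"         # assumed placeholder for a job not yet arrived
    job_priority: float = 0.0            # effective priority


@dataclass
class Site:
    site_name: str
    site_si2k_ratio: float
    site_price: float                    # price of the unit of CPU resource
    site_max_jobs: int
    site_job_request_rate: int           # seconds between job requests
    site_account_name: str
    site_account_balance: float = 0.0
    site_running_jobs: int = 0


@dataclass
class User:
    user_name: str
    user_account_name: str
    user_account_balance: float
    user_max_parallel_jobs: int
    user_priority: float
    user_running_jobs: int = 0
    user_waiting_jobs: int = 0


# Example records; all values are invented for illustration.
job = Job(job_id=1, job_user="alice", job_price=2.0, job_si2k=3600.0, job_arrival_time=0)
site = Site("SiteA", site_si2k_ratio=2.0, site_price=1.5, site_max_jobs=10,
            site_job_request_rate=60, site_account_name="site_a")
```

Attributes that only acquire a value during the simulation (e.g. Job_Site, Job_End_Time) are given defaults, so that a record can be created from the input data alone.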

In discrete-event simulation systems the state of the system is defined by the collection of variables which are necessary to describe the system at any time. In our case system state variables are the attributes of the entities plus the following:

• System_Waiting_Jobs – Overall number of waiting jobs

• System_Running_Jobs – Overall number of running jobs

The activities and events which cause the system state change are:

• Event_Job_New – Arrival of a new job

• Event_Job_End – Completion of a running job

• Event_Priority_Calculate – (Re)calculation of the priorities of the jobs waiting in the queue

The mechanism for advancing simulation time and guaranteeing that all events occur in correct chronological order is based on the Event Scheduling/Time Advance (ESTA) algorithm [120]. The algorithm uses a so-called Future Event List (FEL), which contains the event notices for events that have been scheduled to occur at a future time. Scheduling a future event means that at the instant an activity begins (e.g. a job is sent for execution to the site) its duration is computed (e.g. job duration is computed based on the number of SI2k units a job needs and the site SI2k conversion ratio) and the corresponding end-activity event (e.g. job completion), together with its event time, is placed on the future event list.

To begin the simulation it is necessary to define the input data: the configuration of the sites participating in the simulation, the settings associated with the users participating in the simulation, as well as the parameters of the jobs which the simulation program will need to process.
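As an illustration of the event scheduling and time advance mechanism described above, the following minimal sketch keeps event notices in a priority queue ordered by event time and advances the clock from one notice to the next. It is a generic next-event sketch, not the simulation program itself: the class name `FutureEventList`, the function names, the dictionary keys and the event name `Job_Start` are assumptions introduced for the example.

```python
import heapq
import itertools


class FutureEventList:
    """Event notices ordered by event time; ties keep insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker so payloads are never compared

    def schedule(self, time, event_type, payload=None):
        heapq.heappush(self._heap, (time, next(self._counter), event_type, payload))

    def pop_next(self):
        time, _, event_type, payload = heapq.heappop(self._heap)
        return time, event_type, payload

    def __len__(self):
        return len(self._heap)


def run(fel, handlers):
    """Advance the clock to each imminent event until the FEL is empty."""
    trace = []
    while len(fel):
        clock, event_type, payload = fel.pop_next()
        trace.append((clock, event_type, payload))
        handlers[event_type](clock, payload, fel)
    return trace


def on_job_start(clock, job, fel):
    # When the activity begins, compute its duration and schedule its end event.
    duration = job["si2k"] / job["site_si2k_ratio"]
    fel.schedule(clock + duration, "Event_Job_End", job)


def on_job_end(clock, job, fel):
    pass  # a full model would update the system state here


fel = FutureEventList()
fel.schedule(5, "Job_Start", {"id": 2, "si2k": 100.0, "site_si2k_ratio": 10.0})
fel.schedule(0, "Job_Start", {"id": 1, "si2k": 200.0, "site_si2k_ratio": 10.0})
trace = run(fel, {"Job_Start": on_job_start, "Event_Job_End": on_job_end})
print([(t, e, p["id"]) for t, e, p in trace])
```

Although the second job starts later, it finishes first (at time 15 rather than 20), and the event list delivers the two completion events in that chronological order.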

For each site participating in the simulation values of the following attributes have to be defined:

• Site_Name

• Site_SI2k_Ratio

• Site_Price

• Site_Start_Time

• Site_Max_Jobs

• Site_Account_Name

• Site_Account_Balance

For each user participating in the simulation values of the following attributes have to be defined:

• User_Name

• User_Account_Name

• User_Account_Balance

• User_Max_Parallel_Jobs

• User_Priority

The jobs have to be defined using the following attributes:

• Job_Id

• Job_User

• Job_Price

• Job_SI2k

• Job_Arrival_Time

Once the values of the input parameters are given, the simulation can be started. It is based on three interacting components: the Simulation Site Manager (SSM), the Simulation Job and Banking Manager (SJBM) and the Simulation Data Manager (SDM). The operational details of these components and the role of each of them in the simulation process are described in the following paragraphs of this section.

The SSM is the key component of the simulation. It is responsible for advancing the simulation clock as well as for managing the objects which mimic the work of AliEn sites (so-called site objects). Upon start-up it sets the simulation clock to zero and creates the list of site objects. Each site object represents one AliEn site and is initialized using the site configuration data provided as input. Every site object has an associated Future Event List (FEL) which can contain two types of events: either a job request (Event_Job_Request) or a job completion (Event_Job_End). Initially each site object's FEL contains one job request event scheduled at the time equal to the start time of the site. At each step of the iteration the site manager does the following:

• Reports the value of the simulation clock to the SJBM

• Checks the FELs of the active site objects (a site object is considered active if the value of its Site_Start_Time attribute is less than or equal to the current value of the simulation clock). If it finds an event which is scheduled to happen at the time equal to the current value of the simulation clock, it takes the appropriate action: it conveys either a job request or a job completion report from the site object to the SJBM

• Conveys the job start messages received from the SJBM (if any) to the site object to which the message is addressed and instructs it to update its FEL, i.e. to calculate the duration of the job (job duration = Job_SI2k/Site_SI2k_Ratio) and schedule an Event_Job_End event at the time equal to the sum of the current simulation clock value and the calculated job duration

• Conveys 'no job' messages received from the SJBM (if any) to the site object to which the message is addressed and instructs it to update its FEL, i.e. to schedule an Event_Job_Request at the time equal to the sum of the current simulation clock value and the value of the Site_Job_Request_Rate attribute of the site

• Reports all events (e.g. job assignment to the site) which have taken place during that iteration to the SDM

• Recalculates the values of the Site_Running_Jobs attribute for all site objects and reports them to the SDM
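The two FEL updates performed by a site object, on a job start message and on a 'no job' message, can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `SiteObject`, its method names and the use of a plain list of `(time, event, job_id)` tuples for the FEL are illustrative choices, and the sketch deliberately models nothing beyond the two updates described in the steps above.

```python
class SiteObject:
    """Hypothetical site object holding its own future event list (FEL)."""

    def __init__(self, name, si2k_ratio, job_request_rate, start_time):
        self.name = name
        self.si2k_ratio = si2k_ratio
        self.job_request_rate = job_request_rate
        # Initially the FEL contains one job request at the site's start time.
        self.fel = [(start_time, "Event_Job_Request", None)]

    def due_events(self, clock):
        """Remove and return the events scheduled exactly at the given clock value."""
        due = [event for event in self.fel if event[0] == clock]
        self.fel = [event for event in self.fel if event[0] != clock]
        return due

    def on_job_start(self, clock, job_id, job_si2k):
        # job duration = Job_SI2k / Site_SI2k_Ratio
        duration = job_si2k / self.si2k_ratio
        self.fel.append((clock + duration, "Event_Job_End", job_id))

    def on_no_job(self, clock):
        # Ask again after Site_Job_Request_Rate seconds.
        self.fel.append((clock + self.job_request_rate, "Event_Job_Request", None))


site = SiteObject("SiteA", si2k_ratio=2.0, job_request_rate=60, start_time=0)
first = site.due_events(0)        # the initial job request
site.on_no_job(0)                 # no matching job: retry at t = 60
second = site.due_events(60)
site.on_job_start(60, job_id=7, job_si2k=240.0)   # ends at 60 + 240/2 = 180
```

In this example the site requests a job at time 0, is told that no job is available, requests again at time 60 and receives a job of 240 SI2k units, whose completion is scheduled at time 180.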

The SJBM mimics the work of the AliEn central job management services (including the banking service). It is responsible for maintaining the list of jobs and distributing them to the simulated sites, as well as for maintaining the users' and sites' bank accounts. It maintains its own FEL which can contain two types of events: Event_Job_New, which indicates the arrival of a new job, and Event_Priority_Calculate, which indicates the need to (re)calculate the priorities of the jobs which are waiting for execution.

During initialization the SJBM loads the input data concerning users and jobs, marks all jobs as inactive, adds an Event_Priority_Calculate event to the FEL at time zero, and iterates through the list of jobs, scheduling an Event_Job_New event for each job at the time equal to the value of its Job_Arrival_Time attribute. After initialization it starts waiting for messages from the SSM. It can receive three types of messages from the SSM:

• The first type is the simulation clock value update message. In this case the SJBM checks the FEL and, if there are events scheduled at the time equal to the received value of the simulation clock, takes the appropriate action according to the type of event. Event_Job_New causes the SJBM to mark the jobs whose arrival time is equal to the received value of the simulation clock as active (i.e. set the value of their Job_Status attribute to 'WAITING'), update the User_Waiting_Jobs attribute of the users concerned, and update the value of the System_Waiting_Jobs variable. Event_Priority_Calculate causes the SJBM to recalculate the effective priority of the jobs which have status 'WAITING' (according to the algorithm described in Section 3.1) as well as to schedule the next Event_Priority_Calculate event.

• The second type is the job request message. In this case the SJBM sorts the jobs which have status 'WAITING' according to their effective priorities and tries to find a job matching the site from which the job request originated. The SJBM makes the reasonable assumption (since in AliEn the package installation is done automatically) that all the sites have the necessary packages installed and satisfy the hardware requirements of the jobs, so the only parameter which matters during the matching process is the price of the site, which has to be less than or equal to the bid of the job. Once it finds a job it sends it to the matched site for execution. After that it updates the Job_Status (set to 'RUNNING') and Job_Site (set to the name of the matched site) attributes of the job, updates the values of the User_Waiting_Jobs and User_Running_Jobs attributes of the user who submitted the job, as well as the values of the System_Running_Jobs and System_Waiting_Jobs variables. Otherwise (in case it does not find a matching job) it sends a 'no job' message to the SSM.

• The third type of message which the SJBM expects is the job completion report message. This message causes the update of the Job_End_Time (set to the current value of the clock), Job_Run_Time (Job_End_Time – Job_Start_Time) and Job_Status (set to 'DONE') attributes.

Messages of this type also cause the SJBM to update the values of the attributes associated with the banking service, imitating a money transfer by changing the values of the User_Account_Balance (decreased by Job_Price*Job_SI2k) and Site_Account_Balance (increased by Job_Price*Job_SI2k) attributes.

The SJBM reports to the SDM the information about all events and about all entities whose attribute values have changed.
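The handling of job request and job completion messages described above can be illustrated with a short sketch. It rests on several assumptions not stated in the text: jobs are plain dictionaries, a larger effective priority value means the job is served earlier, and balances are kept in dictionaries keyed by user and site name. The function names `match_job` and `settle` are hypothetical.

```python
def match_job(waiting_jobs, site_price):
    """Return the highest-priority waiting job whose bid covers the site price.

    Returns None when no job matches, which corresponds to the 'no job' message.
    Assumption: a larger priority value is served first.
    """
    for job in sorted(waiting_jobs, key=lambda j: j["priority"], reverse=True):
        if site_price <= job["price"]:
            return job
    return None


def settle(job, user_balances, site_balances):
    """Imitate the money transfer on job completion: Job_Price * Job_SI2k."""
    amount = job["price"] * job["si2k"]
    user_balances[job["user"]] -= amount
    site_balances[job["site"]] += amount
    return amount


# Example data; all values are invented for illustration.
waiting = [
    {"id": 1, "user": "alice", "price": 1.0, "si2k": 100.0, "priority": 5.0},
    {"id": 2, "user": "bob", "price": 3.0, "si2k": 50.0, "priority": 2.0},
]
users = {"alice": 1000.0, "bob": 1000.0}
sites = {"SiteA": 0.0}

job = match_job(waiting, site_price=2.0)   # job 1 bids too little, so job 2 matches
job["site"] = "SiteA"
job["status"] = "RUNNING"

paid = settle(job, users, sites)           # on the job completion report
no_match = match_job(waiting, site_price=10.0)
```

In the example the job with the highest priority is skipped because its bid is below the site price, so the second job is matched; on completion 3.0 * 50.0 = 150.0 units are moved from the user's balance to the site's balance.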


Simulation Site Manager (SSM)

- Manages internal simulation clock

- Maintains the list of site objects

Simulation Job and Banking Manager (SJBM)

- Maintains the list of jobs

- Maintains bank accounts

Simulation Data Manager (SDM)

- Records the simulation log

Figure 3.2. Components of discrete-event system simulation

The SDM service collects information about the state of the system from the SSM and the SJBM. For each value of the simulation clock it creates a snapshot of the system state by recording all the system state variables.

The simulation stops when all the jobs which were provided as input have been processed. After the simulation stops the SDM will contain comprehensive data about the simulated system.