The Grid Model - Decentralised economic resource allocation for computational grids

4.2.1 Resources and Jobs

A Resource in our model is a representation of a computational resource that can be scheduled to run Jobs. For simplicity in this investigation, we ignore other types of resource, such as network or storage.

A Resource is modelled by a unit capacity, which represents the amount of work it is capable of carrying out simultaneously. Given that parallelisation is the dominant factor in achieving greater performance, this is meant to loosely represent a number of processors. We make several abstractions here.

1. Hardware comes in many different types and with different performance characteristics. Given the growing virtualisation of computing resources, as well as for simplicity, we abstract these details away. Thus all processors are modelled as being equivalent in throughput, with the number of parallel units being the primary performance metric. This trend can be seen in the hardware world: CPU clock speeds have hit a wall, and manufacturers are focusing on multi-CPU chips.

2. Software also varies greatly. Again, virtualisation technologies are quickly making this a non-issue, as users can bundle their application load in the operating system of their choice. So we assume a Resource can provide whatever software platform a particular application requires.

Closely linked with a Resource, a Job represents an amount of work to be carried out on a grid. This is modelled with a fixed unit size of work and a fixed duration. The size is meant to represent a Job’s minimum resource requirements and applies directly to a Resource’s capacity. A Job consumes its size of a Resource’s capacity when executing on that Resource, for a length of time equal to its duration. A Job that has a larger size than a Resource’s free capacity cannot be executed on that Resource. This abstraction is made for simplicity, but overlooks two key challenges in this area of grid systems.

1. It is technically challenging to correctly estimate the actual execution time of a given grid Job, and this is an active research area [99] [38]. The user can provide an estimate of a Job’s requirements, but it is not guaranteed to be accurate. However, this problem is not our focus in this work, so we assume that a Job describes itself accurately and will run at that size for that duration. It is relatively simple to add a degree of run-time variance to the Job model at a later date, along with an appropriate penalty model for providing a bad estimate or overrunning. This is discussed further in Section 7.3.

2. In reality, the “size” of a Job is variable. A large Job can still be run on a small Resource, but will take a very long time to complete. Likewise, a small Job can potentially be scaled up and completed faster on a large Resource, although this is harder than the former, as the Job may not be easy to parallelise. This introduces a completion time into the equation when considering which Resource to run on. While this is a feasible option for the model, it has been omitted from the basic model for simplicity. Also, our primary focus is on “on demand” allocation (see Section 2.4). This means that it is reasonable to assume that the Job description exactly represents the amount of Resource needed, no more and no less, at that particular time. Again, further ideas for modelling this aspect of grid Jobs are discussed in Section 7.3.

For both Jobs and Resources, we do not consider storage or network requirements (e.g. bandwidth) and availability in our model. We assume that there is enough of each to allow the Job to execute successfully. This is a reasonable assumption given that our stated focus is on computational grids, not storage grids. Possible ways to include these types of resource are discussed in Section 7.3.
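The Resource and Job abstractions described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under the stated simplifications; the class and method names (Job, Resource, free_capacity, can_run, start) are our own assumptions and are not taken from the simulator itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Job:
    size: int      # units of capacity the Job occupies while it runs
    duration: int  # number of time units the Job runs for


@dataclass
class Resource:
    capacity: int                                 # total units of parallel work
    running: list = field(default_factory=list)   # Jobs currently executing

    @property
    def free_capacity(self) -> int:
        # Capacity not consumed by currently running Jobs.
        return self.capacity - sum(job.size for job in self.running)

    def can_run(self, job: Job) -> bool:
        # A Job larger than the free capacity cannot be executed here.
        return job.size <= self.free_capacity

    def start(self, job: Job) -> None:
        if not self.can_run(job):
            raise ValueError("insufficient free capacity")
        self.running.append(job)


resource = Resource(capacity=8)
resource.start(Job(size=5, duration=10))
print(resource.free_capacity)                      # 3
print(resource.can_run(Job(size=4, duration=2)))   # False
```

Note that size and duration are fixed per Job, so the sketch has no notion of a Job stretching or shrinking to fit a Resource, in line with the simplification discussed above.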

4.2.2 Grid Sites

Our model uses the notion of a grid Site to capture the idea of diverse domains of ownership of resources as well as physical location. It represents potentially many individual resources under one autonomous administrative domain. These resources are not necessarily physically co-located but are owned by the same group (or virtual organisation, in Grid terms). It is assumed that multiple resources at a Site have a faster connection between themselves than exists between separate Sites, and thus can be represented as a single aggregated Resource whose capacity is the sum of the capacities of all resources at that Site.
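As a small illustration of the aggregation step, the capacity of a Site's single Resource is just the sum of the capacities of the machines it owns. The function name and the machine sizes below are hypothetical, chosen only for the example.

```python
def aggregate_site_capacity(resource_capacities):
    """Capacity of the single aggregated Resource that stands for a Site."""
    return sum(resource_capacities)


# Hypothetical Site owning three machines with 16, 32 and 8 units of capacity.
site_capacity = aggregate_site_capacity([16, 32, 8])
print(site_capacity)  # 56
```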

The Site is the basic location of our simulation and the basic node in our network (see Section 4.4). A Site contains a Resource and a Seller for handling the negotiations about the resource.

A Site also has a Server object to model the performance and communication costs between it and other Sites. The Server is a software system assumed to be running on dedicated hardware (i.e. not the Resource itself). This allows us to model the execution of grid Jobs (the actual performance of which we are less interested in) separately from the execution of the allocation system (which we are more interested in). The Server’s performance is modelled by sampling the execution time needed to process each message from a stochastic distribution (see Section 4.7.1). The Server processes a single message at a time, with a wait queue holding the messages still to be processed.

This is intended as a general measure of the Server’s ability to process requests from other Servers, not necessarily the performance of a specific machine. It may be run on a cluster or larger machine, but this is simply modelled by increasing the overall performance of the Server, rather than introducing more complex hardware models, with a similar rationale to the section above on our Resource model.
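The Server model described above behaves like a single-server queue: one message is processed at a time, and messages arriving while the Server is busy wait their turn. The sketch below illustrates this under two assumptions that are ours rather than the model's: messages are served in arrival order, and service times are exponentially distributed (the actual distribution is the one given in Section 4.7.1). The function and parameter names are hypothetical.

```python
import random


def simulate_server(arrival_times, mean_service_time, seed=0):
    """Return the completion time of each message on a one-at-a-time Server.

    Messages are handled in arrival order; a message that arrives while
    the Server is busy waits in the queue until the Server is free.
    """
    rng = random.Random(seed)
    server_free_at = 0.0
    completions = []
    for arrival in sorted(arrival_times):
        start = max(arrival, server_free_at)  # wait if the Server is busy
        # Assumption: exponential service time; the real distribution may differ.
        service = rng.expovariate(1.0 / mean_service_time)
        server_free_at = start + service
        completions.append(server_free_at)
    return completions


completions = simulate_server([0.0, 0.1, 0.2, 5.0], mean_service_time=0.5)
print(completions)
```

In this sketch, a Server running on a cluster or larger machine would be represented simply by a smaller mean_service_time, rather than by any additional hardware detail.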

A Site also acts as an entry point to the grid network for Buyers, allowing them to communicate with other traders at other Sites. So any particular Site will have a number of currently active Buyers utilising that Site’s Server to find an allocation.

The management of a Job by a Seller in our simplified model consists of starting the Job when successfully allocated and finishing it when completed (i.e. it has run for its specified duration). In reality, the Seller has to do much more, including setting up the Job, monitoring its status (and reporting it to the Job’s user) during execution, and cleaning up and returning the results when completed. We assume this to be automatic in our model, as it is not part of the problem of resource allocation that we are looking at.

The actual local scheduling algorithm used by our Resource is very simple. If a Resource has enough free capacity for a Job at the current time, it reports being capable of scheduling it and, if it is told to do so, simply executes it straight away and removes it when completed. Given that we are focusing on “on demand” grids, and not doing any advance reservation or completion time modelling, this is as complex as it needs to be.
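The local scheduling rule just described can be sketched as follows. This is a minimal illustration, not the simulator's implementation: the class name, the method names and the use of a heap to track finishing Jobs are our own assumptions. A Job either starts immediately or is refused; there is no queueing and no advance reservation.

```python
import heapq


class OnDemandResource:
    """Runs a Job immediately if it fits; never queues or reserves ahead."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self._finishing = []  # min-heap of (finish_time, size)

    def _release_finished(self, now):
        # Remove every Job whose duration has elapsed by time `now`.
        while self._finishing and self._finishing[0][0] <= now:
            _, size = heapq.heappop(self._finishing)
            self.used -= size

    def can_schedule(self, size, now):
        self._release_finished(now)
        return size <= self.capacity - self.used

    def schedule(self, size, duration, now):
        # Start the Job straight away, or refuse it if it does not fit.
        if not self.can_schedule(size, now):
            return False
        self.used += size
        heapq.heappush(self._finishing, (now + duration, size))
        return True


resource = OnDemandResource(capacity=4)
print(resource.schedule(size=3, duration=5, now=0))  # True: starts immediately
print(resource.schedule(size=2, duration=1, now=1))  # False: only 1 unit free
print(resource.schedule(size=2, duration=1, now=5))  # True: first Job has finished
```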

4.2.3 Other Abstractions and Assumptions

Our model assumes perfect execution of Jobs and no Resource or trader failure. Fault-tolerance is an important part of any realistic grid allocation solution, but is not necessary for a first exploration of a decentralised economic system. Real economic systems suggest many possible methods of handling a breach of contract, but this is outside the scope of this initial investigation. A discussion of extending the model to include fault-tolerance can be found in Section 7.3.1.
