Considerations - Master Thesis. Evaluating Performance of Hadoop Distributed File System

These two models developed are quite generic and represent the entire HDFS architecture. In addition, the first one gives only performance measures of time required for an action without considering other measures, like queue lengths or throughput. This is why we developed a new model, focused on the analysis of the issue shown in section 3.2.2 that involves the minimal set of components needed for the problem. This model will be shown and used in the analysis in next chapters.

Figure 3.1: Graphical representation of HDFS modelled with CPN. Top Left: Complete System, Top Right:

Client, Bottom Left: Namenode, Bottom Right: Datanode

4

Modelling Formalism

As seen in previous chapter, several approaches to representation of systems for performance evaluation exists. As said, existing models of HDFS doesn’t fit with requirements of the analysis that has been set for solving the identified problem. In this work other different modelling formalism have been used for a simpler representation and more practical analysis. The core of all the analysis are M/M/1 queues, whose theory is applied for predicting average system usage.

The model described in next chapter will make use of PEPA_k, an extension of PEPA language. In following sections these languages are explained.

4.1 PEPA

Performance Evaluation Process Algebra [21] is a modelling language devel-oped by Jane Hillston. This language is inspired by existing process algebras, like CCS, with the objective of integrating performance analysis in system design. P EP A takes advantage of the features of process algebras and incor-porates the performance modelling features for specifying a stochastic process.

The main advantage from process algebras is the compositionality. Compo-sitionality allows to specify model in term of smaller components, like in the model that is being shown in next chapter, and combine this components for building more complex systems.

4.1.1 Language outline

As said before, in PEPA a system is described as cooperating components.

These components represents specific parts of a system and represent the active units. Components can be atomic or composed itself by other compo-nents.

Components has a behavior defined by the activities in which it can en-gage. An activity is composed by an action type and an activity rate. With action type we represent the possible actions that the component being mod-elled can do. With activity rate, instead, we represent the duration of an activity, which is an exponentially distributed random variable that it is rep-resented by a single real number parameter. An activity is so reprep-resented by the couple(α, r) where α is the action type and r is the activity rate.

When an activity, i.e. a = (α, r), is enabled this will wait for a certain time determined by the rate r and then the action α is executed. If several activities are enabled, each one will wait for the time determined by its own rate. An activity may be preempted, or aborted, if another action has been completed first.

4.1.2 Syntax and semantics

The syntax of PEPA is defined as follows:

P ::= (α, r).P | P + Q | P

BC

_L Q| P/L |A More detail are explained below:

Prefix (α, r).P the component (α, r).P first enables and activity with type α and then, at the end of the waiting time, behaves as component P;

Choice P + Q the component may behave either as P or as Q. The choice enables all the current activities of both P and Q. The first activity that completes determines which process the component will behave as;

Cooperation P

BC

_L Q P and Q are synchronized on the set of action types specified in L. The actions specified in L can be enabled only if both P andQ are ready to enable them. If an activity has an unspecified rate>, the component is passive with respect to that action type. This means that the component doesn’t contribute to the work for that action (i.e.

a message passing channel);

Hiding P/L the component behaves as P but the actions specified in L are not visible to external viewers;

Constant A^def= P defines a constant A that behaves as process P.

When multiple activities are enabled at the same time, a race condition determines the behavior of the model. All the activities are proceeding, but only the “fastest” is succeeding. Due to the nature of random variables de-termining the duration of the activities, the “fastest” cannot be determined in advance for every execution. This race condition leads to a probabilistic branching, so we can state that a certain behavior is more or less likely than another. If all the enabled activities share the same rate, they have equal probability to be the “fastest”, but if rates are different, is more likely that the “fastest” is the activity with the higher rate. Any other activity will be preempted. This means that the activity will be aborted, or more precisely, interrupted.

We said before that an activity can have an unspecified rate. This activity must be shared with another component of the system where this activity has a specified rate, which determines the rate of this shared activity. In case of multiple passive activities ready to be enabled, these passive actions must have an assigned weight. This weight is a natural number that defines the probability of that passive activity of being enabled. The higher is the weight, the more likely the passive action can be enabled. By default weight is 1.

The operational semantics of PEPA is shown in figure 4.1.

4.1.3 Underlying Markov Process

With the derivation graph of a PEPA model is possible to generate the rep-resentation of the modelled system as a stochastic process. Since the rate of activities is assumed to be exponentially distributed, the resulting stochastic process is a continuous time Markov process. With this CTMC is possible to solve the model and find steady state distribution. Steady state distribution permits to derive the probability of the model operating as a certain compo-nent of the PEPA model, when stability has been reached. This means that we can obtain the probability that, observing the system at random after it has been running some time, it is behaving in a certain way. From this prob-abilities is possible to obtain measures like throughput, average delay time and queue lengths.

Prefix

Figure 3.1: Operational Semantics of PEPA

For any activity instance its activity rate is the product of the apparent rate of the action type in this component and the probability, given that an activity of this type occurs, that it is this instance that completes. This leads to the following rule:

E−−−→ E^(α,r¹⁾ ^� F−−−→ F^(α,r²⁾ ^�

On the basis of the semantic rules PEPA can be defined as a labelled multi-transition system. In general a labelled transition system (S, T,{→ | t ∈ T }) is a system defined by^t a set of states S, a set of transition labels T and a transition relation→ ⊆ S × S for each^t t∈ T . In a multi-transition system the relation is replaced by a multi-relation in which the number of instances of a transition between states is recognised. Thus PEPA may be

Figure 4.1: PEPA operational semantics

In document Master Thesis. Evaluating Performance of Hadoop Distributed File System (Page 30-36)