Statistical Methodologies - Characterizing the spatial and temporal diversity of the Internet T

4.2 Preliminaries

4.2.5 Statistical Methodologies

In this section, we introduce the statistical techniques applied in this chapter and the following one. First, we present goodness-of-fit techniques that allow us to model the Internet end-hosts distribution and to assess the spatial diversity of the measurements and the temporal aggregation needed for stability. Second, we give a brief introduction to the ANOVA methodology, which allows us to measure the impact that factors such as the source network, country and direction have on the response variable, in this case the measured items (flows, packets, bytes).

Goodness-of-fit techniques

In order to find a suitable model for the traffic destinations, we perform visual inspection first (Section 4.3). This visualization can only give us some insight on the shape of the distribution, and this is not sufficient for hypothesis testing. To this end, we adopt a goodness-of-fit technique over a hypothesized distribution. In our case, the hypothesized distribution is a Zipf-Mandelbrot distribution [KV07, RB89] and the goodness-of-fit test is well-known χ² test [DS86, Chapter 3].

The underlying idea of Person’s χ² test was to reduce the problem of good-ness of fit of a general distribution to the testing fit to a multinomial distribution by dividing the support of the distribution in cells and comparing the observed values with the expected ones in each cell. In this way, to test if a random sam-ple X1, X2, · · · , Xn, with n ≥ 30, follows a specified distribution with parameter Θ ∈ R^r, F (x; Θ) (null hypothesis), one starts by partitioning the support of the observations into M buckets, namely C₁, C₂, · · · , C_M, where M ≥ 5. Let us de-note by O_i the number of observations that lie in the bucket C_i, then under the null hypothesis O_i follows a binomial distribution with parameters n and p_i, that is:

O_i ∼ B(n; p_i) (4.1)

with p_i being the probability of choosing the bucket C_i under the null hypothesis pi =

dF (x, Θ) (4.2)

4.2. Preliminaries 63

It is recommended that approximately the same number of observations lie in each bucket, and also that each bucket has at least 3 observations. With these assumptions, it is reasonable to think that O_i− E_i, being E_i the expected number of observations in cell C_i (E_i = np_i), is a good measure of distance between the observed data and its theoretical distribution. It can be shown [DS86, Chapter 3]

that under the null hypothesis, the statistic

χ² = XM

i=1

(O_i− E_i)²

E_i (4.3)

follows a χ² distribution. The number of degrees of freedom depends on the number of constraints placed on the data. If the parameter Θ is known, then χ² has d = M − 1 degrees of freedom. In contrast, if Θ is estimated from the observations, the degrees of freedom of χ² are d = M − 1 − r. The estimation of the parameter Θ should be performed by Maximum Likelihood (ML) for the former assumptions to be true. Finally, the null hypothesis of goodness of fit is rejected if χ² has a large value, e.g. if χ² > χ²_d,α, being χ²_d,α the percentile (1-α) of a χ² distribution with d degrees of freedom.

Analysis of Variance

ANOVA is a widely used statistical methodology whereby the observed variance of a given response or dependent variable is split into explanatory factors. ANOVA provides a way to determine if such factors have any importance in explaining the variability of a response variable, and to which extent. ANOVA performs a contrast using the ratio between the adjusted sum of squares of samples that belong to each factor level, i.e., intra-level samples, and the total, inter-level sam-ples. Such ratio is shown to follow a Snedecor’s F distribution, provided that the samples fulfill the following hypothesis: First, they must be necessarily indepen-dent; second, they must be fairly Gaussian; and third, all of them must share the same intra-level variance (i.e., exhibit homoscedasticity). However, the results of ANOVA are generally accepted provided that the number of elements in each group are similar (Balanced ANOVA), and there is a non-excessive deviation from the homoscedasticity assumption [OA84, GPS72].

The null hypothesis supports the homogeneity of means within each factor.

Basically, it contrasts, according to a given pre-defined significance level α (typ-ically α = 0.05), whether or not the intra-level variance values can be explained due to the randomness of measurements (generally, experimental errors) and not to differences in the population when grouped by categories (or levels). If so, the null hypothesis cannot be rejected and the factor used to build the groups is sta-tistically non-significant. Otherwise, the factor explains enough variance and it is considered as significant.

Following this, the simplest ANOVA univariate model for a response variable y with an only significant factor α is given by:

y_iu= µ + α_i+ ²_iu (4.4)

where yiu represents the u^th observation on the i^th level (i = 1, 2, . . . , I levels), µ represents the overall mean response (or intercept). On the other hand, αi refers to the effect due to the i^th level of factor α and ²iu is the deviation, random or experimental error, in the u^thsample on the i^th level. We also note thatP_I

i=1α_i = 0.

The resulting model in case of two significant factor is:

y_iju = µ + α_i+ β_j + (αβ)_ij + ²_iju (4.5)

and so forth in case of more than two factors. In this latter case, α_i and β_j represent the effect due to the i^th and j^th levels of factors α and β respectively.

Similarly, (αβ)_ij represents the interactions between i^th level of factor α and j^th level of factor β. Finally, ²_iju represents the deviation in u^th sample to the overall mean of the samples within i^th level of factor α and j^th level of factor β. Again, note that P_I

i=1α_i = 0 and P_J

j=1β_j = 0 being J the total number of levels of factor β. The reader is referred to [DC74, Jai91] for further details on the ANOVA methodology.

In document Characterizing the spatial and temporal diversity of the Internet Traffic: a capacity planning application to the RedIRIS network (Page 90-93)