• No results found

4.2 Preliminaries

4.2.5 Statistical Methodologies

In this section, we introduce the statistical techniques applied in this chapter and the following one. First, we present goodness-of-fit techniques that allow us to model the Internet end-hosts distribution and to assess the spatial diversity of the measurements and the temporal aggregation needed for stability. Second, we give a brief introduction to the ANOVA methodology, which allows us to measure the impact that factors such as the source network, country and direction have on the response variable, in this case the measured items (flows, packets, bytes).

Goodness-of-fit techniques

In order to find a suitable model for the traffic destinations, we perform visual inspection first (Section 4.3). This visualization can only give us some insight on the shape of the distribution, and this is not sufficient for hypothesis testing. To this end, we adopt a goodness-of-fit technique over a hypothesized distribution. In our case, the hypothesized distribution is a Zipf-Mandelbrot distribution [KV07, RB89] and the goodness-of-fit test is well-known χ2 test [DS86, Chapter 3].

The underlying idea of Person’s χ2 test was to reduce the problem of good-ness of fit of a general distribution to the testing fit to a multinomial distribution by dividing the support of the distribution in cells and comparing the observed values with the expected ones in each cell. In this way, to test if a random sam-ple X1, X2, · · · , Xn, with n ≥ 30, follows a specified distribution with parameter Θ ∈ Rr, F (x; Θ) (null hypothesis), one starts by partitioning the support of the observations into M buckets, namely C1, C2, · · · , CM, where M ≥ 5. Let us de-note by Oi the number of observations that lie in the bucket Ci, then under the null hypothesis Oi follows a binomial distribution with parameters n and pi, that is:

Oi ∼ B(n; pi) (4.1)

with pi being the probability of choosing the bucket Ci under the null hypothesis pi =

Z

Ci

dF (x, Θ) (4.2)

4.2. Preliminaries 63

It is recommended that approximately the same number of observations lie in each bucket, and also that each bucket has at least 3 observations. With these assumptions, it is reasonable to think that Oi− Ei, being Ei the expected number of observations in cell Ci (Ei = npi), is a good measure of distance between the observed data and its theoretical distribution. It can be shown [DS86, Chapter 3]

that under the null hypothesis, the statistic

χ2 = XM

i=1

(Oi− Ei)2

Ei (4.3)

follows a χ2 distribution. The number of degrees of freedom depends on the number of constraints placed on the data. If the parameter Θ is known, then χ2 has d = M − 1 degrees of freedom. In contrast, if Θ is estimated from the observations, the degrees of freedom of χ2 are d = M − 1 − r. The estimation of the parameter Θ should be performed by Maximum Likelihood (ML) for the former assumptions to be true. Finally, the null hypothesis of goodness of fit is rejected if χ2 has a large value, e.g. if χ2 > χ2d,α, being χ2d,α the percentile (1-α) of a χ2 distribution with d degrees of freedom.

Analysis of Variance

ANOVA is a widely used statistical methodology whereby the observed variance of a given response or dependent variable is split into explanatory factors. ANOVA provides a way to determine if such factors have any importance in explaining the variability of a response variable, and to which extent. ANOVA performs a contrast using the ratio between the adjusted sum of squares of samples that belong to each factor level, i.e., intra-level samples, and the total, inter-level sam-ples. Such ratio is shown to follow a Snedecor’s F distribution, provided that the samples fulfill the following hypothesis: First, they must be necessarily indepen-dent; second, they must be fairly Gaussian; and third, all of them must share the same intra-level variance (i.e., exhibit homoscedasticity). However, the results of ANOVA are generally accepted provided that the number of elements in each group are similar (Balanced ANOVA), and there is a non-excessive deviation from the homoscedasticity assumption [OA84, GPS72].

The null hypothesis supports the homogeneity of means within each factor.

Basically, it contrasts, according to a given pre-defined significance level α (typ-ically α = 0.05), whether or not the intra-level variance values can be explained due to the randomness of measurements (generally, experimental errors) and not to differences in the population when grouped by categories (or levels). If so, the null hypothesis cannot be rejected and the factor used to build the groups is sta-tistically non-significant. Otherwise, the factor explains enough variance and it is considered as significant.

Following this, the simplest ANOVA univariate model for a response variable y with an only significant factor α is given by:

yiu= µ + αi+ ²iu (4.4)

where yiu represents the uth observation on the ith level (i = 1, 2, . . . , I levels), µ represents the overall mean response (or intercept). On the other hand, αi refers to the effect due to the ith level of factor α and ²iu is the deviation, random or experimental error, in the uthsample on the ith level. We also note thatPI

i=1αi = 0.

The resulting model in case of two significant factor is:

yiju = µ + αi+ βj + (αβ)ij + ²iju (4.5)

and so forth in case of more than two factors. In this latter case, αi and βj represent the effect due to the ith and jth levels of factors α and β respectively.

Similarly, (αβ)ij represents the interactions between ith level of factor α and jth level of factor β. Finally, ²iju represents the deviation in uth sample to the overall mean of the samples within ith level of factor α and jth level of factor β. Again, note that PI

i=1αi = 0 and PJ

j=1βj = 0 being J the total number of levels of factor β. The reader is referred to [DC74, Jai91] for further details on the ANOVA methodology.