Simulation Modeling and Network Forecasting

4.4 Applicability of Web Page Classification

4.4.2 Simulation Modeling and Network Forecasting

We next compare traffic generated using ground-truth and classified labels. We use the ns-2 network simulator to simulate the web browsing behavior of 400 web users whose traffic is aggregated on a shared 1 Gbps link; the traffic for each individual web page download is taken from our data set. Each user behaves independently and visits web pages at random. The inter-arrival time between web page downloads by a given user is Gaussian distributed with a mean of 30 s and a standard deviation of 15 s. This distribution is chosen for simplicity and is adequate for our purpose of comparing distributions.
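As a rough illustration of the arrival process described above, the following Python sketch draws per-user download start times from a Gaussian with the stated mean and standard deviation. The function names are hypothetical, and redrawing non-positive samples is an assumption: the text does not say how such draws are handled.

```python
import random

NUM_USERS = 400      # number of simulated web users (from the text)
MEAN_GAP_S = 30.0    # mean inter-arrival time in seconds (from the text)
STD_GAP_S = 15.0     # standard deviation in seconds (from the text)


def sample_gap(rng: random.Random) -> float:
    """Draw one inter-arrival time; redraw non-positive values (assumption)."""
    while True:
        gap = rng.gauss(MEAN_GAP_S, STD_GAP_S)
        if gap > 0:
            return gap


def download_start_times(rng: random.Random, duration_s: float) -> list[float]:
    """Start times of one user's page downloads within [0, duration_s)."""
    times = []
    t = sample_gap(rng)
    while t < duration_s:
        times.append(t)
        t += sample_gap(rng)
    return times


def build_schedule(duration_s: float, seed: int = 0) -> dict[int, list[float]]:
    """One independent schedule per user, each with its own random stream."""
    return {
        user: download_start_times(random.Random(seed * NUM_USERS + user), duration_s)
        for user in range(NUM_USERS)
    }
```

Giving each user a separate random stream keeps the users independent of one another, which matches the behavior described in the text.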

Table 4.8: Comparing Feature Distributions from Classified Web Pages and Ground Truth Labels

| Feature | Statistical Test | Label | p-value (KNN/LDA) |
|---|---|---|---|
| Number of TCP Connections | Ranked Sum | Mobile web page | .6066/.0230 |
| Number of TCP Connections | Kolmogorov-Smirnov | Mobile web page | .7969/1.728×10^-9 |
| Number of TCP Connections | Ranked Sum | Traditional web page | .9998/1.66×10^-136 |
| Number of TCP Connections | Kolmogorov-Smirnov | Traditional web page | 1.000/1.846×10^-49 |
| Number of TCP Connections | Ranked Sum | Video web page | .8764/1.13×10^-8 |
| Number of TCP Connections | Kolmogorov-Smirnov | Video web page | 1.000/8.1467×10^-12 |
| Number of TCP Connections | Ranked Sum | Non-video web page | .9583/.4879 |
| Number of TCP Connections | Kolmogorov-Smirnov | Non-video web page | 1.000/.2954 |
| Number of TCP Connections | Ranked Sum | Computers | .6405/3.26×10^-6 |
| Number of TCP Connections | Kolmogorov-Smirnov | Computers | .8773/3.211×10^-12 |
| Number of TCP Connections | Ranked Sum | Business | .8193/7.9043×10^-11 |
| Number of TCP Connections | Kolmogorov-Smirnov | Business | .9828/4.5604×10^-1 |
| Number of TCP Connections | Ranked Sum | Shopping | .6660/.0019 |
| Number of TCP Connections | Kolmogorov-Smirnov | Shopping | .9997/1.8681×10^-10 |
| Number of TCP Connections | Ranked Sum | News | .7675/0.0115 |
| Number of TCP Connections | Kolmogorov-Smirnov | News | .6255/3.4481×10^-6 |
| Number of TCP Connections | Ranked Sum | Homepage | .3018/3.2032×10^-74 |
| Number of TCP Connections | Kolmogorov-Smirnov | Homepage | .9828/8.3124×10^-9 |
| Number of TCP Connections | Ranked Sum | Search | .9262/2.88×10^-106 |
| Number of TCP Connections | Kolmogorov-Smirnov | Search | .6814/4.5578×10^-18 |
| Number of TCP Connections | Ranked Sum | Clickable Content | .5667/6.9477×10^-31 |
| Number of TCP Connections | Kolmogorov-Smirnov | Clickable Content | .4800/1.0665×10^-28 |
| Number of Bytes | Ranked Sum | Mobile web page | .3133/1.449×10^-4 |
| Number of Bytes | Kolmogorov-Smirnov | Mobile web page | .4775/1.5063×10^-19 |
| Number of Bytes | Ranked Sum | Traditional web page | .8597/6.413×10^-84 |
| Number of Bytes | Kolmogorov-Smirnov | Traditional web page | .9999/4.6465×10^-89 |
| Number of Bytes | Ranked Sum | Video web page | .9173/.4364 |
| Number of Bytes | Kolmogorov-Smirnov | Video web page | 1.00/4.2076×10^-4 |
| Number of Bytes | Ranked Sum | Non-video web page | .9924/.3151 |
| Number of Bytes | Kolmogorov-Smirnov | Non-video web page | 1.00/.7021 |
| Number of Bytes | Ranked Sum | Computers | .7127/6.0806×10^-13 |
| Number of Bytes | Kolmogorov-Smirnov | Computers | .9999/2.122×10^-9 |
| Number of Bytes | Ranked Sum | Business | .9440/6.2451×10^-10 |
| Number of Bytes | Kolmogorov-Smirnov | Business | .9941/2.573×10^-1 |
| Number of Bytes | Ranked Sum | Shopping | .2248/.0045 |
| Number of Bytes | Kolmogorov-Smirnov | Shopping | .9821/4.933×10^-13 |
| Number of Bytes | Ranked Sum | News | .9108/3.1335×10^-5 |
| Number of Bytes | Kolmogorov-Smirnov | News | .5558/9.5609×10^-10 |
| Number of Bytes | Ranked Sum | Homepage | .1847/4.9872×10^-33 |
| Number of Bytes | Kolmogorov-Smirnov | Homepage | .8981/3.3592×10^-17 |
| Number of Bytes | Ranked Sum | Search | .6982/6.0573×10^-99 |
| Number of Bytes | Kolmogorov-Smirnov | Search | .4978/1.4042×10^-5 |
| Number of Bytes | Ranked Sum | Clickable Content | .2476/0.0765 |

The download of each web page itself is simulated using TMIX, which provides a source-level traffic generation interface in ns-2 [Weigle et al., 2006]. Specifically, we provide this tool with the TCP/IP trace of a web page download (selected randomly from the 100,350 downloads we collect in Section 4.1). TMIX then derives from the trace application-level descriptors of the corresponding traffic sources, including request sizes, response sizes, user think times, and server processing times. It then generates corresponding traffic in ns-2 by reproducing these source-level events. Thus, this tool allows us to faithfully produce realistic source-level behavior for each web page download. We use this traffic generation methodology in the context of the forecasting application below.
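To make the notion of source-level descriptors more concrete, the following Python sketch derives request sizes, response sizes, server processing times, and think times from the packets of a single TCP connection. It is a simplified illustration and not TMIX's actual algorithm: the packet tuple format, the `Exchange` class, and the rule of merging consecutive same-direction payloads into one application data unit are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    request_bytes: int
    response_bytes: int
    server_time_s: float  # last request packet to first response packet
    think_time_s: float   # last response packet to next request (0.0 if none)


def derive_exchanges(packets: list[tuple[float, str, int]]) -> list[Exchange]:
    """packets: time-ordered (timestamp_s, direction, payload_bytes) tuples
    for one connection, where direction is "c2s" or "s2c" (assumed format)."""
    # Merge consecutive same-direction data packets into application data units.
    units = []  # each unit: [direction, total_bytes, first_ts, last_ts]
    for ts, direction, size in packets:
        if size == 0:
            continue  # pure ACKs carry no application data
        if units and units[-1][0] == direction:
            units[-1][1] += size
            units[-1][3] = ts
        else:
            units.append([direction, size, ts, ts])

    # After merging, units alternate in direction; pair each request with
    # the response that follows it. A trailing unanswered request is dropped.
    exchanges = []
    i = 0
    while i + 1 < len(units):
        req, resp = units[i], units[i + 1]
        if req[0] != "c2s":
            i += 1  # skip a leading server-side unit
            continue
        think = units[i + 2][2] - resp[3] if i + 2 < len(units) else 0.0
        exchanges.append(Exchange(req[1], resp[1], resp[2] - req[3], think))
        i += 2
    return exchanges
```

A generator could then replay each `Exchange` in order: send the request bytes, wait the server time, send the response bytes, and wait the think time before the next request.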

Figure 4.9: Cumulative distribution of aggregate throughput (Mbps) for the mobile model (a) and the video model (b). Panel (a) plots Baseline, Alternative Model 1, and Alternative Model 2, each with GT and ML labels; panel (b) plots Baseline and Alternative Model 1, each with GT and ML labels.

Modeling Growth in Mobile Web Usage

We first construct a baseline model, in which each user visits a mobile-optimized web page 20% of the time and a traditional page 80% of the time; nearly 20% of current web traffic is considered mobile [Mob, a]. The TMIX input for each user is obtained by randomly selecting a mobile (or traditional) page download from our set of 100,350 downloads. We conduct two experiments, in which the mobile or traditional pages are selected based on either ground-truth (GT) labels or KNN labels (ML). The throughput on the 1 Gbps aggregated link is observed every 1 ms, and its cumulative distribution is plotted in Figure 4.9(a).
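The throughput measurement described above can be sketched as follows: bytes seen on the link are accumulated in 1 ms bins, converted to Mbps, and summarized by an empirical cumulative distribution. The function names and the `(timestamp_s, size_bytes)` packet format are hypothetical; the original measurements come from ns-2, whose trace format is not reproduced here.

```python
from bisect import bisect_right


def throughput_samples_mbps(packets, duration_s, bin_s=0.001):
    """Per-bin throughput in Mbps from (timestamp_s, size_bytes) records."""
    num_bins = int(round(duration_s / bin_s))
    bytes_per_bin = [0] * num_bins
    for ts, size in packets:
        idx = int(ts / bin_s)
        if 0 <= idx < num_bins:
            bytes_per_bin[idx] += size
    # bytes -> bits, divided by the bin length, scaled to megabits per second
    return [b * 8 / bin_s / 1e6 for b in bytes_per_bin]


def empirical_cdf(samples):
    """Return a function x -> fraction of samples that are <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf
```

For example, two 125-byte packets in the same 1 ms bin amount to 2000 bits per millisecond, that is, 2 Mbps for that bin.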

We next conduct two sets of experiments that incorporate growth in mobile traffic. In the first set, referred to as alternate model 1, we envision the scenario in which all users increase their reliance on mobile devices. Specifically, in this model, each user visits a mobile-optimized web page 50% of the time (labelled using either GT or ML). In the second set of experiments, we envision growth in the number of users that rely solely on mobile devices. In this model, referred to as alternate model 2, we retain the behavior of the 400 baseline users, but simulate an additional 200 users that browse only (GT or ML-identified) mobile-optimized web pages (100% of the time).
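The three user populations described above (baseline, alternate model 1, and alternate model 2) can be summarized in a short Python sketch. The names `make_users`, `group_by_label`, and `pick_download` are hypothetical; the labels passed to `group_by_label` stand for either the ground-truth (GT) or the classified (ML) labels, depending on the experiment.

```python
import random


def make_users(model: str) -> list[float]:
    """Per-user probability of visiting a mobile-optimized page."""
    if model == "baseline":
        return [0.2] * 400                  # 20% mobile, 80% traditional
    if model == "alternate1":
        return [0.5] * 400                  # every user shifts to 50% mobile
    if model == "alternate2":
        return [0.2] * 400 + [1.0] * 200    # 200 extra mobile-only users
    raise ValueError(f"unknown model: {model}")


def group_by_label(download_ids, labels):
    """Group download ids by label ("mobile" or "traditional")."""
    groups = {"mobile": [], "traditional": []}
    for download_id, label in zip(download_ids, labels):
        groups[label].append(download_id)
    return groups


def pick_download(rng: random.Random, p_mobile: float, groups):
    """Choose the page type for one visit, then a random download of that type."""
    label = "mobile" if rng.random() < p_mobile else "traditional"
    return label, rng.choice(groups[label])
```

Running the same population once with GT-derived groups and once with ML-derived groups gives the two traffic mixes that each experiment compares.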

The distribution of the aggregate throughput for each of the forecasting experiments is also plotted in Figure 4.9(a). We find that the distributions yielded by the ground-truth (GT) and the classified (ML) labels are quite similar to each other. We also apply the hypothesis tests mentioned earlier and confirm that the distributions are statistically equivalent. This is true for the baseline traffic, as well as for each of the forecasted alternative models. This confirms that web page classification, based only on anonymized TCP/IP headers, can be used to effectively conduct traffic modeling studies involving mobile web traffic.
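As an illustration of how two throughput distributions might be compared, the following Python sketch implements a two-sample Kolmogorov-Smirnov test, one of the tests listed in Table 4.8. It is a sketch under stated assumptions rather than the procedure actually used: the p-value relies on the standard asymptotic approximation, and the function names are hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:  # the supremum is attained at an observed value
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value for statistic d with sample sizes n, m."""
    ne = n * m / (n + m)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return 1.0  # the series below converges poorly for tiny lam
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return max(0.0, min(1.0, 2.0 * total))


def ks_test(a, b):
    """Return (statistic, approximate p-value) for two samples."""
    d = ks_statistic(a, b)
    return d, ks_pvalue(d, len(a), len(b))
```

A large p-value means the test finds no evidence that the GT-based and ML-based throughput samples come from different distributions; a very small one indicates that they differ.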

Modeling Growth in Video Streaming

We use the same approach as above to construct a baseline model, in which a user downloads pages with video traffic 20% of the time, and an alternate model 1, in which users access video-based web pages 50% of the time. Please note that these percentages were chosen arbitrarily to compare the distributions of the different models. The aggregate throughput is plotted in Figure 4.9(b). We find that, as before, the distributions obtained by relying on classified (ML) labels are nearly identical to the ones derived using ground-truth (GT) labels. This is true even when the forecasted traffic drives the network to nearly full utilization (alternate model 1).

We emphasize that our intention is not to make forecasting claims, but simply to illustrate that our classification work can very well facilitate such traffic modeling applications.
