

In document Sanders_unc_0153D_17177.pdf (Page 106-113)

3.3 Impact of Browser on Web Page Traffic Features

3.3.1 Does the source HTML differ, when the same URL is downloaded by different

Windows 7 operating system. The Kruskal-Wallis test for the HTML-based features yields 8 features that have p-value < .05 across browser platforms; in fact, these p-values are generally less than 10^-3. These 8 statistically significant features are: the number of “label” tags, the number of “tr” tags, the number of “table” tags, the number of “td” tags, the number of “style” tags, the number of “legend” tags, javascript length (i.e., the number of characters present between script tags), and the number of different words. Upon further analysis of our data, we find that these statistically significant features correspond to the following:

• Differences in javascript: We find that many content providers, such as soundcloud.com and bing.com (particularly image search results), use different javascript code that is suited for different browsers. These javascript-related differences were identified using the number of different words feature. We find that some javascript methods are implemented differently across browser platforms and/or have conditional statements that branch for different client browser platforms. For example, soundcloud.com uses conditional statements that take the client platform into account during javascript execution to determine whether HLS (HTTP Live Streaming) is supported by the client platform. Alternatively, Figure 3.1 shows an example where a Youtube.com page has javascript that is browser-specific: here the Chrome javascript for loading a video appears to be HTML5-based while the Firefox javascript appears to be flash-based (this flash content is identified by the “swf” references in Figure 3.1). It is known that if different client platforms are not taken into account, rendering differences across browsers can occur when the same source HTML is processed; for example, target.com has differences in rendered tables across browsers despite referencing the same source code that renders that portion of the page.

• Ads: We also observe browser-dependent “ads” that attempt to get a user to download a particular browser or app. For example, Yahoo.com recommends that users update to the latest version of Firefox for client platforms that are not Firefox, whereas target.com recommends that Chrome users download the Target mobile app. These ads seem to be attempts to get users to utilize software that is fully supported by the content provider.
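The tag-count and javascript-length features named above can be illustrated with a short sketch. This is a minimal example using Python's standard html.parser module, not the extraction code used in this study; the names FeatureParser, extract_features, and TRACKED_TAGS are hypothetical, and counting only start tags is an assumption about how a tag is counted.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags named in the text as statistically significant across browsers.
TRACKED_TAGS = ("label", "tr", "table", "td", "style", "legend")


class FeatureParser(HTMLParser):
    """Counts start tags and the characters found between script tags."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.javascript_length = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.tag_counts[tag] += 1
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Text inside <script>...</script> is delivered here as raw data.
        if self._in_script:
            self.javascript_length += len(data)


def extract_features(html):
    """Return a dict of HTML-based features for one page (assumed layout)."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    features = {f"num_{tag}_tags": parser.tag_counts[tag] for tag in TRACKED_TAGS}
    features["javascript_length"] = parser.javascript_length
    return features
```

Running extract_features on the source HTML saved for each browser gives one feature vector per (URL, browser) pair, which is the kind of input a per-feature statistical test needs.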

Figure 3.2 plots the cumulative distribution of the number of different words between Chrome and each of the IE, Firefox, and Opera browsers. Please note that web pages that are the same across browsers will have a number of different words value of 0; hence, this feature for the baseline curve (Chrome vs Chrome) is always 0. This plot shows that over 50% of source HTML pages do not have any differences across browsers (i.e., a value of 0 for the number of different words feature). In fact, approximately 75% of web pages differ by fewer than 50 words. This result shows that while there are some statistically significant features across browser platforms, many web pages are not influenced by client platforms at all.
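As an illustration of how such a curve could be computed, the sketch below counts different words between two copies of a page and evaluates an empirical cumulative distribution. The exact definition of the number of different words feature is not given in this section, so the definition used here (the size of the multiset symmetric difference of whitespace-separated tokens) is an assumption, and both function names are hypothetical.

```python
from collections import Counter


def num_different_words(html_a, html_b):
    """Count tokens that appear in one page but not the other.

    Assumed definition: multiset symmetric difference of the
    whitespace-separated tokens of the two HTML sources.
    """
    words_a = Counter(html_a.split())
    words_b = Counter(html_b.split())
    only_a = words_a - words_b  # tokens left over in page A
    only_b = words_b - words_a  # tokens left over in page B
    return sum(only_a.values()) + sum(only_b.values())


def fraction_at_or_below(values, threshold):
    """Empirical CDF: fraction of values that are <= threshold."""
    return sum(1 for v in values if v <= threshold) / len(values)
```

With one num_different_words value per URL (Chrome vs another browser), fraction_at_or_below(values, 0) gives the share of pages with identical source HTML, and fraction_at_or_below(values, 49) gives the share that differ by fewer than 50 words.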

Figure 3.1: Example where javascript is different for different browsers (Chrome vs Firefox).

3.3.2 Does the source HTML differ, when the same URL is downloaded by different browser version?

We also briefly consider the impact that browser version may have on source HTML pages. Our data collection methodology, which downloads a web page using a browser, must be modified for this browser version analysis because it is difficult to use multiple versions of the same browser on a single device. We instead use the urllib2 python library to make HTTP requests for each web page listed in Appendix 3 and extract features from the corresponding HTTP responses (i.e., the source HTML) [Documentation]. Please note that the urllib2 library is used because the User-Agent field in the HTTP request header can be modified such that the server will believe that the request originated from the client platform of interest (i.e., operating system, browser type, browser version, etc). This approach, however, cannot be used to approximate the actual web page traffic downloaded when a browser processes source HTML. This is because different browsers have different source code and may generate different web page traffic despite referring to the same HTML.[8] Hence, the only way to measure web page traffic generated by a browser is to download the page using said browser. Thus, we do not consider web page traffic features in this analysis.
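The User-Agent override described above can be sketched as follows. This is an illustrative example rather than the collection script used in the study: it uses urllib.request, the Python 3 successor to urllib2, and the User-Agent strings, the USER_AGENTS table, and the function names are assumptions chosen for illustration.

```python
import urllib.request

# Illustrative User-Agent strings for Windows 7 (assumed, not taken from the study).
USER_AGENTS = {
    "firefox_33": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
    "firefox_17": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0",
}


def build_request(url, user_agent):
    """Build an HTTP request that claims to come from the given browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent, timeout=30):
    """Download the source HTML returned for this User-Agent."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Decode leniently; real pages may declare other encodings.
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_html once per User-Agent for the same URL yields the pair of HTML sources to compare. Only the request header changes, so this captures server-side differences and none of the traffic a real browser would generate while rendering the page.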

[8] Understanding whether browsers indeed generate different traffic is part of the goal of this study.

Figure 3.2: Cumulative distribution of number of different words across browser as compared against Chrome. (x-axis: Number of Different Words Across Browser Compared to Chrome, 0 to 800; y-axis: Fraction of Web Pages, 0 to 1; curves: Vs Chrome (Baseline), Vs IE, Vs Opera, Vs Firefox.)

We consider four pairs of browsers on Windows 7: 1) Firefox v 33.0 and Firefox v 17.0; 2) Internet Explorer v 11.0 and Internet Explorer v 9.0; 3) Chrome v 38.0.2125.122 and Chrome v 33.0.1750.154; and 4) Opera v 25.0.1614.68 and Opera v 12.16. When comparing two samples of web pages, a sample comprised of the four outdated browsers vs a sample of the four up-to-date browsers, our statistical test yields 13 statistically significant features. The most notable features that are not also influenced by browser platform (say, Internet Explorer vs Firefox) are the number of script tags and the number of HTML5 tags. With respect to the number of script tags, we observe similar differences in scripting behavior as we did with the differences in browser platform. With respect to the number of HTML5 tags, we observe that there tend to be more HTML5-related tags for the latest browser versions as compared to the older versions. We believe this occurs because HTML5 is the newest and current version of HTML and is likely only supported on newer browsers.
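For readers unfamiliar with the statistical test applied to each feature, the following is a minimal pure-Python sketch of the tie-corrected Kruskal-Wallis H statistic. It is not the analysis code used in this study; the function names are hypothetical, and significance is judged here by comparing H against tabulated 5% chi-square critical values, an approximation that is only reasonable when each group is not too small.

```python
from collections import Counter

# Upper 5% critical values of the chi-square distribution, keyed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def kruskal_wallis_h(*groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for the groups."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Assign each distinct value the average of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    correction = 1 - sum(t**3 - t for t in Counter(pooled).values()) / (n**3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def is_significant_at_05(h, num_groups):
    """Compare H with the chi-square critical value for num_groups - 1 d.o.f."""
    return h > CHI2_CRITICAL_05[num_groups - 1]
```

For the browser version analysis there are two groups (outdated vs up-to-date), giving 1 degree of freedom; a comparison across four browser platforms gives 3. Each feature, such as the number of script tags, would be tested separately with its per-page values split into groups.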

We also observe cases where content providers treat outdated or unsupported browsers in the following ways. First, the content provider can respond to the HTTP request, but provide a warning to the user that their browser needs to be updated (e.g., zillow.com and soundcloud.com) — this may also result in failed HTTP requests. Second, the content provider can respond to the HTTP request by sending source HTML that is compatible with the user’s browser. This is explained next.

We find multiple instances where browser version has an impact on some of the HTML tag-based features. Figure 3.3 shows Google search results that were requested and rendered using an outdated Opera browser, while Figure 3.4 shows Google search results using an up-to-date Chromium-based Opera browser; these web pages are displayed differently for these different versions of Opera. The observed differences are almost purely stylistic with respect to image size and the visibility of URLs on images, though there are some images present in the new version that are not present in the original.

Figure 3.3: Search result page for old version of Opera.

Figure 3.4: Search result page for new version of Opera.

We next show a different example of when a web server responds with a web page for an outdated browser. Here, the HTTP request is for a mobile web page of a product on Amazon.com. Figure 3.5 shows that when a mobile web page is requested using an up-to-date mobile device and browser (an iPhone in particular), the request is satisfied as expected. When we make the same request for a mobile web page using an outdated Firefox browser on a laptop, we also get the same mobile web page, though we do not observe an ad for downloading an app. This web page is shown in Figure 3.6. Figure 3.7 shows that when the same request is made to Amazon.com using an up-to-date Firefox browser on a laptop, we get a different mobile web page that clearly represents the same product shown in Figure 3.5. It is clear that these downloaded web pages are both (i) mobile-optimized web pages and (ii) different, where the version of the page shown in Figure 3.7 appears to be an older mobile web page design than the page shown in Figure 3.5.

Figure 3.5: Product web page for Amazon site as rendered using Safari browser on iPhone.

Figure 3.6: Product web page for Amazon site as rendered using an old Firefox Browser.

We conclude two things from these observations: 1) mobile web pages may sometimes be used to fulfill HTTP requests to outdated browsers (we observe similar behavior for yahoo.com and att.com; please refer to Figure 3.8 for an example of a mobile page being returned for an outdated Firefox browser and Figure 3.9 for the normal up-to-date version); and 2) interesting and unexpected quirks exist for some HTTP requests that are influenced by browser choice.[9] The impact that browser version has on web page downloads is important for web crawling tools because (i) web crawlers may be used for years without receiving any significant upgrades and (ii) content providers may respond to known web-crawler User-Agents in a manner that results in errors or downloading data that is limited (in a manner similar to mobile web pages) [Notess, 2002].

[9] Please note that the significant differences discussed here are primarily true for browser version analysis for Opera and Firefox.

Figure 3.7: Product web page for Amazon site as rendered using a new Firefox Browser.

Implications of HTML-based differences: We found that source HTML can be influenced by browser type and version. A summary of the implications of our analysis is below:

• Our results show that web pages can differ across client platforms. Thus, any measurement study or web-related application that relies on source HTML, such as web page archival, should verify that the web pages it visits are not influenced by the browser that is used; otherwise, the downloaded source HTML may be biased toward a particular browser, which will limit the scope of the analysis.

• We find that most of the differences observed across browser platforms correspond to compatibility issues across browser type and version or browser-specific ads. In fact, most of the significant differences that we observe involve HTML tags that impact the way the page is displayed. These features do not tend to influence the number of web objects that are referenced by the page. Thus, it is not expected that these differences will have a dramatic impact on the TCP/IP and HTTP headers that correspond to these web pages.

• We find that approximately 50% of web pages differ, to some degree, across browser platform. While this is a large fraction of web pages, we are aware that time is a possible factor that may bias our results; we investigate this factor in Section 3.7. Nevertheless, this observation is important in the context of this dissertation because it provides a baseline for determining whether source HTML is the only cause for any possible differences in traffic observed across browsers. Specifically, if more than 50% (i.e., the approximate percentage of web pages that have different source HTML) of web page traffic differs across browsers, we can confidently state that there are cases where choice of browser influences web page traffic when the source HTML is the same.

Figure 3.8: Example where mobile web page is returned for an old desktop Firefox browser.
