The World Wide Web, often abbreviated as WWW or “the Web”, is an application that allows documents and other resources to be linked together and accessed via the Internet [28]. Prior to the 1990s, computer networks were organized in a loose confederation and connected by a backbone network known as the NSFNET [28]. In the early 1990s, the NSFNET started to allow commercial activities to take place, which created demand for an application that facilitated the exchange of information over the Internet. Between 1988 and 1991 [28], Tim Berners-Lee fulfilled this demand by inventing the Web. The Web provides a platform on which many popular Internet services, such as online shopping, social media, and file sharing, are built. In 1995 [28], the NSFNET was decommissioned and replaced with commercial Internet Service Providers (ISPs), leading to the birth of the modern Internet. This section details the Web and the components that are relevant to this thesis.
2.3.1 HTTP
The HyperText Transfer Protocol (HTTP) is the Web’s application-layer protocol. HTTP is implemented using a client-server architecture, meaning that client programs (Web browsers) communicate with server programs (Web servers) by exchanging messages. HTTP defines the structure of these messages as well as how they are exchanged. Users enter information about the resource they wish to request into the Web browser, and the Web browser then translates this information into an HTTP request. The request is then sent to and handled by the Web server, which responds with an appropriate HTTP response.
HTTP requests are sent over the Internet using the server’s IP address. However, humans have difficulty remembering long strings of numbers, so most Web servers are referred to by domain names instead. Domain names are memorable names (such as google.com) that map to an IP address (or a set of IP addresses) that belongs to the server. Translation between the domain name (entered by the user) and the IP address (used by the network layer) for the Web server is carried out using the Domain Name System (DNS).
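The name-to-address translation described above can be sketched with Python's standard socket module. This is a minimal illustration rather than a description of how any particular browser resolves names; the helper name resolve_ipv4 is hypothetical, and the addresses returned for a real domain depend on the resolver and on network access.

```python
import socket


def resolve_ipv4(hostname: str) -> list[str]:
    """Return the distinct IPv4 addresses the system resolver reports for hostname."""
    # getaddrinfo asks the operating system's resolver, which uses DNS
    # (or local configuration such as a hosts file) to translate the name.
    results = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in results:
        ip = sockaddr[0]  # sockaddr is (address, port) for IPv4
        if ip not in addresses:
            addresses.append(ip)
    return addresses


# Example call (needs network access; the result varies by location and time):
#   resolve_ipv4("google.com")
```

A domain name may map to several addresses, which is why the helper returns a list rather than a single value.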
Individual resources on the Web are identified using Uniform Resource Locators (URLs). A URL is made up of two parts: the domain name and the resource path [28]. The domain name specifies the server that the resource is stored on. The resource path specifies the location of the resource on the server. For example, in the URL http://ucalgary.ca/pages/homepage.html, ucalgary.ca is the domain name and /pages/homepage.html is the resource path.
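As a small illustration, the example URL above can be split into its domain name and resource path using Python's standard urllib.parse module. The variable names are chosen here for readability. Note that urlsplit also reports the leading http prefix (the scheme) as a separate field.

```python
from urllib.parse import urlsplit

url = "http://ucalgary.ca/pages/homepage.html"
parts = urlsplit(url)

scheme = parts.scheme        # protocol prefix before "://"
domain_name = parts.hostname  # server that stores the resource
resource_path = parts.path    # location of the resource on that server

print(scheme, domain_name, resource_path)
```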
When Web browsers make HTTP requests, they must send the message in a valid HTTP format. HTTP requests must start with a request line. The request line contains three fields: the method field, the URL field, and the version field. The method field indicates the action that the client is requesting from the server. The URL field indicates where the resource is located. The version field indicates which version of HTTP the browser is using. The request line can then be followed by a number of subsequent optional lines called header lines. The request may also include a body that is used to transfer data. These requests are sent using TCP and are typically directed to port 80.
Figure 2.5 shows an example of an HTTP request. In this example, the HTTP message uses GET, which is the most popular HTTP method [28]. This method requests the resource in the URL field, in this case /pages/homepage.html, from the server. The version field tells the server that the client is using HTTP version 1.1. The subsequent header lines indicate the host that the request is being made to and the user-agent (the type of Web browser) that the client is using. Other popular methods are the POST method, which sends data to the server, and the HEAD method, which requests metadata about the resource in the URL field.
GET /pages/homepage.html HTTP/1.1
Host: www.ucalgary.ca
User-Agent: Chrome/1.0
Figure 2.5: Example HTTP GET request.
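The request in Figure 2.5 can be assembled programmatically as a sketch of the message format. The helper build_request is hypothetical and covers only a request line plus header lines with no body. On the wire, each line ends with a carriage return and line feed, and an empty line marks the end of the headers.

```python
def build_request(
    method: str, path: str, headers: dict[str, str], version: str = "HTTP/1.1"
) -> bytes:
    """Assemble a body-less HTTP request: request line, header lines, blank line."""
    request_line = f"{method} {path} {version}"  # method, URL, and version fields
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # Every line is terminated by CRLF; the extra CRLF ends the header section.
    message = "\r\n".join([request_line, *header_lines]) + "\r\n\r\n"
    return message.encode("ascii")


request = build_request(
    "GET",
    "/pages/homepage.html",
    {"Host": "www.ucalgary.ca", "User-Agent": "Chrome/1.0"},
)
print(request.decode("ascii"))
```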
Upon receiving an HTTP request, the Web server will issue a response. HTTP responses must start with a status line. The status line has three fields: the version field, the status code, and the status message. The version field indicates the HTTP version being used by the server, the status code indicates the outcome of the request, and the status message provides information about the status code. The status line can be followed by a number of optional header fields and an optional body.
A sample HTTP response is depicted in Figure 2.6. The status line indicates that the Web server is using HTTP version 1.1, and that the request was “OK”, meaning that it was completed properly. The header lines indicate that the Web server being used by the host is Apache/1.0, the requested resource is an HTML file, and the size of the HTML file is 1,234 bytes. The contents of the file are then given in the body of the response. In this case, the request was fulfilled without issue, but there are a variety of status codes that are used in different circumstances, such as when errors occur.
HTTP/1.1 200 OK
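The structure of a response can be illustrated by splitting one into its status line, header fields, and body. The sketch below is an assumption-laden example: the helper parse_response is hypothetical, the header names Server, Content-Type, and Content-Length are the standard fields that correspond to the description above, and the body is a short placeholder (so its length is 13 bytes rather than 1,234).

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into status-line fields, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")  # blank line separates headers and body
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: version, status code, status message.
    version, status_code, status_message = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return version, int(status_code), status_message, headers, body


# Placeholder response; the body and its length are illustrative assumptions.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Apache/1.0\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
)

version, code, message, headers, body = parse_response(sample)
print(version, code, message, headers["content-type"], len(body))
```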
2.3.2 HTTPS
HTTPS provides security in addition to the basic HTTP functionality. HTTPS is implemented by sending HTTP requests and responses through an encrypted connection [39]. The secure connection is established using a secure transport-layer protocol such as Transport Layer Security (TLS) or its predecessor, Secure Sockets Layer (SSL). HTTPS provides authentication, privacy, and integrity to HTTP communications. Authentication ensures that requests are being sent to the intended recipient. HTTPS enables “privacy” by preventing others on the network from being able to view an individual’s HTTP requests and responses. HTTPS provides “integrity” by preventing others from altering the messages. HTTPS requests are typically sent to port 443 instead of port 80.
Though the contents of HTTP communications can be observed by monitoring a network, HTTPS communications are encrypted. This means that although metadata related to HTTPS communications can be observed, the contents cannot be determined without the encryption key.
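The layering of HTTP over an encrypted connection can be sketched with Python's standard ssl and socket modules. This is a minimal client-side illustration under stated assumptions: the helper names make_client_context and fetch_status_line are hypothetical, and calling fetch_status_line requires network access to a server that accepts HTTPS on port 443.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Create a TLS context that verifies the server's certificate and hostname."""
    # The default context enables certificate verification (authentication);
    # TLS itself supplies encryption (privacy) and tamper detection (integrity).
    return ssl.create_default_context()


def fetch_status_line(host: str, path: str = "/", port: int = 443) -> str:
    """Send a GET request over TLS and return the status line of the response."""
    context = make_client_context()
    with socket.create_connection((host, port), timeout=10) as tcp_sock:
        # Wrap the TCP connection in TLS before any HTTP bytes are sent.
        with context.wrap_socket(tcp_sock, server_hostname=host) as tls_sock:
            request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
            tls_sock.sendall(request.encode("ascii"))
            response = b""
            while b"\r\n" not in response:  # read until the status line is complete
                chunk = tls_sock.recv(4096)
                if not chunk:
                    break
                response += chunk
    return response.split(b"\r\n", 1)[0].decode("iso-8859-1")
```

The HTTP request itself is unchanged from the plain-text case; only the transport beneath it differs, which is why an observer on the network sees connection metadata but not the message contents.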
2.3.3 Content Delivery Networks
A Content Delivery Network (CDN) is a collection of distributed servers that replicate Web resources [28]. The aim of a CDN is to provide Web users with better availability (by presenting alternatives when servers fail), better performance (by routing users to servers that are closer to them), and better scalability (by dividing the requests from users amongst the CDN). CDNs can be run by the content provider, such as Netflix or YouTube, or by third-party organizations that sell CDN services, such as Akamai and Cloudflare.
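The availability and performance goals described above can be illustrated with a toy replica-selection routine. This is only a sketch and not how any particular CDN works: the Replica class, the server names, and the use of measured latency as a stand-in for "closeness" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    name: str           # hypothetical server label
    latency_ms: float   # assumed proxy for how close the server is to the user
    healthy: bool = True


def choose_replica(replicas: list[Replica]) -> Optional[Replica]:
    """Pick the lowest-latency healthy replica, or None if every replica is down."""
    candidates = [r for r in replicas if r.healthy]  # availability: skip failed servers
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.latency_ms)  # performance: prefer the closest


replicas = [
    Replica("edge-a", 12.0),
    Replica("edge-b", 48.0),
    Replica("edge-c", 95.0),
]
print(choose_replica(replicas).name)
```

If the closest replica is marked unhealthy, the same call falls back to the next closest one, which mirrors the failover behaviour described in the text.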