2.5 The World Wide Web
2.5.2 Hypertext Transfer Protocol
HTTP is the foundation of data communication for the Web. It functions as a request-response protocol in the client-server computing model. A web browser is a typical example of a client, and an application running on a computer that hosts a website (a set of web documents) is a typical example of a server. The client submits an HTTP request message to the server, and the server returns a response message to the client. The response contains status information about the request and may also contain the requested content in its message body. In this thesis, we use the term agent to refer to clients such as web browsers, web crawlers, mobile apps and any other software that accesses and consumes the Web.
HTTP is a protocol that follows the REST (Representational State Transfer) [62] architectural style. A REST-style architecture conventionally consists of clients and servers. Clients initiate requests to servers; servers process requests and return appropriate responses. Requests and responses are built around the transfer of representations of resources. A resource can be essentially any coherent and meaningful concept that may be addressed. A representation of a resource is typically a document that captures the current or intended state of that resource. Fielding [62] summarises the characteristics of the REST style as follows:
Client-Server : A client is a triggering process; a server is a reactive process. Clients make requests that trigger reactions from servers.
Stateless : Each request from client to server must contain all of the information necessary to understand the request and cannot take advantage of any stored context on the server. The session state is therefore kept entirely on the client.
Cache : Data within a response to a request are implicitly or explicitly labeled as cacheable or non-cacheable. If a response is cacheable, then a client cache is given the right to reuse that response data for later, equivalent requests.
Uniform Interface : REST is defined by four interface constraints: identification of resources (e.g. URIs); manipulation of resources through representations (e.g. modifying the resource via an HTML web page); self-descriptive messages (enough information to describe how to process the message, e.g. its encoding); and hypermedia as the engine of application state (HATEOAS), that is, clients make state transitions only through actions that are dynamically identified within hypermedia by the server (e.g. by hyperlinks).
Layered System : Unless specialised, a client has no need to be aware of whether it is connected directly to the end server or to an intermediary along the way. Intermediary servers may improve system scalability by enabling load-balancing and by providing shared caches. They may also enforce security policies.
Code-On-Demand : As an optional constraint, it allows client functionality to be extended by downloading and executing code in the form of applets or scripts.
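The Stateless and Cache constraints above can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the `/items` resource, the `server_handle` function and the `Client` class are not part of HTTP or of any library. The server function receives everything it needs in each call, the paging position (the session state) lives on the client, and the client reuses responses that the server labels as cacheable.

```python
from dataclasses import dataclass, field

# Hypothetical server-side resource used only for this sketch.
ITEMS = ["a", "b", "c", "d", "e"]


@dataclass(frozen=True)
class Response:
    status: int
    body: str
    cacheable: bool  # explicit cacheability label on the response


def server_handle(path: str, offset: int, limit: int) -> Response:
    """Stateless handler: the request carries all information it needs."""
    if path != "/items":
        return Response(404, "", cacheable=False)
    page = ITEMS[offset:offset + limit]
    return Response(200, ",".join(page), cacheable=True)


@dataclass
class Client:
    offset: int = 0                             # session state kept on the client
    cache: dict = field(default_factory=dict)   # client-side response cache
    server_calls: int = 0                       # counts real round trips

    def fetch_page(self, limit: int = 2) -> str:
        key = ("/items", self.offset, limit)
        if key in self.cache:                   # reuse a cacheable response
            return self.cache[key].body
        self.server_calls += 1
        response = server_handle("/items", self.offset, limit)
        if response.cacheable:
            self.cache[key] = response
        return response.body

    def next_page(self, limit: int = 2) -> None:
        self.offset += limit                    # state transition on the client
```

Because the server keeps no per-client context, any server instance (or an intermediary, as in the Layered System constraint) could answer the same request identically.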
HTTP defines methods to indicate the desired action to be performed on the identified resource. HTTP/1.1 defines eight methods, namely OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE and CONNECT. The discussion here focuses on GET, POST and PUT, the three most frequently used methods on the Web. The GET method retrieves a representation of the resource identified by the Request-URI. The POST method is used to request that the destination server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI (e.g. when one fills in an HTML form and submits it to transfer data to the server). The PUT method requests that the enclosed entity be stored under the supplied Request-URI (e.g. for uploading data directly to the server). The top part of Figure 2.2 shows a sample GET request for http://dbpedia.org/resource/Tim_Berners-Lee from a Mozilla 5.0 web browser user agent.
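The difference between GET, POST and PUT described above can be made concrete with a toy in-memory sketch. The `ResourceStore` class, its method names and the example URIs are hypothetical and exist only for illustration; the returned status codes are a simplification of what a real server might send.

```python
class ResourceStore:
    """Toy model of a server's resources, keyed by Request-URI."""

    def __init__(self):
        self._resources = {}
        self._next_id = 1

    def get(self, uri):
        # GET: return the representation stored under the Request-URI.
        if uri in self._resources:
            return 200, self._resources[uri]
        return 404, None

    def put(self, uri, entity):
        # PUT: store the entity under exactly the supplied Request-URI,
        # creating the resource or replacing its current state.
        created = uri not in self._resources
        self._resources[uri] = entity
        return (201 if created else 200), uri

    def post(self, uri, entity):
        # POST: accept the entity as a new subordinate of the Request-URI;
        # the server, not the client, chooses the new resource's URI.
        child_uri = f"{uri.rstrip('/')}/{self._next_id}"
        self._next_id += 1
        self._resources[child_uri] = entity
        return 201, child_uri
```

In this sketch, repeating the same PUT leaves the store in the same state, whereas repeating the same POST creates a further subordinate resource each time.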
An HTTP response from a server consists of an HTTP status code and an HTTP entity [67]. An HTTP status code is one of a finite set of codes that give the user agent information about the server's HTTP response itself. For example, HTTP 200 means that the request was successful, while 404 means that the user agent requested data that was not found on the server. The HTTP entity is the information transferred as the payload of a request or response. The bottom part of Figure 2.2 shows an example of an HTTP response with status code 303 See Other. The content type names the format (the formal language) of the entity body and can be given explicitly in an HTTP request or response. In the example, the content type is text/html, so the user agent interprets the HTTP entity body as HTML. Content types in HTTP are ‘Internet Media Types’, which can be used with any Internet protocol [124] and are not restricted to HTTP or MIME (Multipurpose Internet Mail Extensions, an email standard; Internet media types were originally called MIME types) [66]. A media type, for example text/html, consists of a type and a subtype separated by a slash. IANA (Internet Assigned Numbers Authority) registers the Internet media types.
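As a minimal sketch of how a user agent might read the status code and content type of a response, the following splits a raw response into its status line, headers and body, and then splits a media type into type, subtype and parameters. The function names `parse_response` and `parse_media_type` are hypothetical, and the parsing is deliberately simplified (for instance, it ignores repeated headers and quoted parameter values).

```python
def parse_response(raw: str):
    """Split a raw HTTP response into (status code, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Status line, e.g. "HTTP/1.1 303 See Other"
    _version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


def parse_media_type(value: str):
    """Split e.g. 'text/html; charset=UTF-8' into type, subtype, parameters."""
    media, *params = [part.strip() for part in value.split(";")]
    type_, _, subtype = media.partition("/")
    parameters = dict(p.split("=", 1) for p in params if "=" in p)
    return type_.lower(), subtype.lower(), parameters


# A shortened version of the response shown in Figure 2.2.
RAW_RESPONSE = (
    "HTTP/1.1 303 See Other\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Location: http://dbpedia.org/page/Tim_Berners-Lee\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

status, reason, headers, body = parse_response(RAW_RESPONSE)
media_type = parse_media_type(headers["content-type"])
```

For a 303 response such as this one, the user agent would typically issue a new GET request to the URI given in the Location header.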
http://dbpedia.org/resource/Tim_Berners-Lee

GET /resource/Tim_Berners-Lee HTTP/1.1
Host: dbpedia.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) Gecko/20100101 Firefox/8.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive

HTTP/1.1 303 See Other
Date: Mon, 13 Feb 2012 21:18:36 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Server: Virtuoso/06.03.3131 (Linux) x86_64-generic-linux-glibc25-64 VDB
Location: http://dbpedia.org/page/Tim_Berners-Lee
Content-Length: 0
Figure 2.2: An HTTP Request and Response from a server
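The request in the top part of Figure 2.2 can be assembled from a URI and a set of headers roughly as follows. This is a simplified sketch: the helper name `build_get_request` is hypothetical, and only a subset of the headers from the figure is used.

```python
from urllib.parse import urlsplit


def build_get_request(url: str, headers: dict) -> str:
    """Serialise a GET request for `url` as an HTTP/1.1 request message."""
    parts = urlsplit(url)
    # The request line carries only the path (and query); the host name
    # travels in the mandatory Host header.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [f"GET {path} HTTP/1.1", f"Host: {parts.netloc}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Header lines end with CRLF; an empty line terminates the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


request = build_get_request(
    "http://dbpedia.org/resource/Tim_Berners-Lee",
    {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Connection": "keep-alive",
    },
)
```

The `q` values in the Accept header express the agent's relative preference among media types, which the server may use when choosing the representation to return.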