HTML Parsing and DOM tree construction

3.3 The Rendering Engine

3.3.2 HTML Parsing and DOM tree construction

In this section, we briefly describe how the HTML parser differs from most other parsers. We also give an example of how a small HTML document is translated into a DOM tree. For more details about the browser's HTML parsing process, which is beyond the scope of this thesis, we refer the reader to the literature [4]. The vocabulary and syntax of HTML [36] are defined in specifications created by the W3C.

HTML Grammar and Parsing

The syntax of a language can usually be defined formally using a notation such as Backus-Naur Form (BNF, also called Backus Normal Form), which is used to describe context-free grammars. However, the conventional parsing techniques do not apply to HTML, because HTML cannot easily be described by a context-free grammar. The browser does use traditional context-free parsers for XML, JavaScript and CSS. It might strike some readers as odd that XML and HTML parsers are inherently different, seeing as the languages are closely related and there even exists an XML variant of HTML, XHTML. The biggest difference between the two is that HTML was designed to be more “forgiving” of strictly invalid syntax, while XML employs a stricter and more demanding syntax. When a browser detects that certain closing tags are missing, it may add them implicitly without raising an error. This forgiving nature is one of the main reasons why HTML is so popular: mistakes are often repaired automatically, which eases the work of the author of the HTML document. The downside is that it makes it difficult to write a formal grammar, and it may cause (and has caused) additional security vulnerabilities.

Because a context-free grammar requires strict interpretation, HTML needs another formal definition format. The format used for defining HTML is the Document Type Definition (DTD), which is not a context-free grammar. This format is used to define languages of the SGML family and contains definitions for all allowed elements, their attributes and their hierarchy. The recommended doctype declaration list can be found at [37]. Because the doctype influences how the browser processes an HTML document, it is strongly advised to add a correct doctype declaration when authoring HTML or XHTML documents.2 The doctype must be exact in spelling and case to have the desired effect.
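The contrast between strict XML parsing and forgiving HTML processing can be illustrated with a minimal sketch. It uses the Python standard library rather than a browser: `xml.etree.ElementTree` stands in for a strict XML parser, and `html.parser.HTMLParser` for a lenient HTML tokenizer (it is not an HTML5 tree builder and does not repair the document). The sample markup and the helper names `TagCollector`, `parse_as_xml` and `parse_as_html` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Sample markup with a missing </p> closing tag (chosen for this sketch).
MARKUP = "<div><p>unclosed paragraph</div>"


class TagCollector(HTMLParser):
    """Records start and end tags as the lenient tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse_as_xml(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def parse_as_html(markup):
    """Tokenize the markup leniently and return the tag events seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


xml_ok = parse_as_xml(MARKUP)
html_events = parse_as_html(MARKUP)
print("well-formed XML:", xml_ok)
print("HTML tag events:", html_events)
```

The XML parser rejects the input because the closing tags do not match, whereas the HTML tokenizer simply reports the tags it encounters and leaves any repair to a later tree construction step.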

We also note that browsers can switch parsing context while parsing an HTML document. In some situations an element switches the parsing context to XML; once parsing of this element is complete, the context is switched back to HTML. This will become apparent in both the examples and the experiments of section 4.6. We will not discuss the specifics of the parsing algorithms here;3 the reader is again referred to [4] for additional details. We will, however, list the reasons why HTML cannot be parsed using regular top-down or bottom-up parsers:

1. As mentioned earlier, the forgiving nature of the language.

2. In addition to the forgiving nature of the language, most browsers implement error tolerance to support well-known cases of invalid HTML.

3. The parsing process is reentrant. In HTML, script elements can add extra tokens to the input while the document is being parsed, so the parsing process can actually modify its own input. A script can do this with the document.write() JavaScript method; document.createElement(), by contrast, adds nodes to the DOM tree directly rather than to the parser input.
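The reentrancy in the last item can be sketched with a toy tokenizer whose input buffer grows while it is being consumed. This is only an illustration under strong assumptions: the body of a `<script>` element is treated as markup that a script writes into the input stream (as document.write() does in browsers), no JavaScript is executed, and the names `tokenize_reentrant`, `SCRIPT` and `TOKEN` are hypothetical.

```python
import re

# Assumption for this sketch: the body of <script>...</script> is the markup
# that the script would write into the input stream.
SCRIPT = re.compile(r"<script>(.*?)</script>", re.S)
# A token is either a tag or a run of text.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize_reentrant(source):
    """Tokenize source, letting script elements inject markup into the input."""
    buffer = source
    pos = 0
    tokens = []
    while pos < len(buffer):
        script = SCRIPT.match(buffer, pos)
        if script:
            # Splice the written markup in right after the script element,
            # so the tokenizer reads it next: the input itself has changed.
            end = script.end()
            buffer = buffer[:end] + script.group(1) + buffer[end:]
            pos = end
            continue
        token = TOKEN.match(buffer, pos)
        if token is None:
            # A lone "<" without a closing ">": treat the rest as text.
            tokens.append(buffer[pos:])
            break
        tokens.append(token.group(0))
        pos = token.end()
    return tokens, buffer


SOURCE = "<p>a<script><b>injected</b></script>c</p>"
tokens, final_buffer = tokenize_reentrant(SOURCE)
print(tokens)
print(len(SOURCE), len(final_buffer))
```

The token list contains the injected `<b>` element even though it never appeared as markup outside the script in the original source, and the final buffer is longer than the original input. A conventional top-down or bottom-up parser assumes that the input stays fixed during parsing.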

Browser error tolerance is an important feature of HTML and is closely related to the topics discussed here; we discuss it in the next section.

Browser Error Tolerance

Browser error tolerance is, as discussed before, one of the features that make HTML such a popular standard. Modern browsers never throw an “Invalid Syntax” error on an HTML page; they merely fix any invalid content in the requested document and render it as best they can.4 The HTML file in listing 3.1 is an example of very badly written HTML code. It breaks many rules: <badtag> is not a standard tag, the <p> and <div> elements are nested invalidly, and the closing </html> tag is left out altogether. The browser does not throw any errors but fixes the invalid document so that it can be displayed in the browsing window. A large part of the parser code is thus devoted to fixing HTML coding mistakes. Listing 3.2 shows the same HTML code as seen in Firebug [38], after it has been fixed by Firefox.

2 We will often not include a doctype declaration in the examples of this thesis ourselves; however, we advise declaring the doctype when developing HTML documents for public (or business) use.

3 They consist of the tokenization and tree construction algorithms and are specified in the HTML5 specification.

4 We must note that browsers can report parsing errors on XML files; when the parsing context is switched to XML inside an HTML document, however, they will not. The parsing context will instead be switched back to HTML.

    <html>
    <badtag>
    </badtag>
    <div>
    <p>
    </div>
    Bad HTML
    </p>

Listing 3.1: Example markup: very badly written HTML document.

    <html>
    <head></head>
    <body>
    <badtag> </badtag>
    <div>
    <p> </p>
    </div>
    Bad HTML
    <p></p>
    </body>
    </html>

Listing 3.2: Example markup: Firefox output after reading the badly written HTML.

Error handling was quite consistent across browsers even before it was part of any HTML specification; it evolved cross-browser over the years. With the introduction of HTML5, error handling is finally being standardised to some extent. The HTML5 specification includes an extended section titled “An introduction to error handling and strange cases in the parser” [36]. For the first time since the start of HTML development, a specification exists that also defines what browsers should do when they deal with badly formed documents.

Because WebKit summarizes the error handling requirements nicely in the comment at the beginning of its HTML parser class, we quote this comment here.

“The parser parses tokenized input into the document, building up the document tree. If the document is well-formed, parsing it is straightforward.

Unfortunately, we have to handle many HTML documents that are not well-formed, so the parser has to be tolerant about errors.

We have to take care of at least the following error conditions:

1. The element being added is explicitly forbidden inside some outer tag. In this case we should close all tags up to the one, which forbids the element, and add it afterwards.

2. We are not allowed to add the element directly. It could be that the person writing the document forgot some tag in between (or that the tag in between is optional). This could be the case with the following tags: HTML HEAD BODY TBODY TR TD LI (did I forget any?).

3. We want to add a block element inside to an inline element. Close all inline elements up to the next higher block element.

4. If this doesn’t help, close elements until we are allowed to add the element or ignore the tag.”
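A much simplified version of this kind of recovery can be sketched with a stack of open elements. The sketch below is not WebKit's algorithm and not the HTML5 tree construction algorithm; it implements only two assumed rules: an end tag closes every open element up to and including its match, and an end tag without a match is ignored. The class name `TolerantTreeBuilder` and the list-based node format are hypothetical, and the input is the markup of listing 3.1.

```python
from html.parser import HTMLParser


class TolerantTreeBuilder(HTMLParser):
    """Toy tree builder: element nodes are [tag, children], text nodes are str."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.stack = [self.root]  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        open_tags = [node[0] for node in self.stack[1:]]
        if tag not in open_tags:
            return  # stray end tag: ignore it instead of raising an error
        # Close every open element up to and including the matching one.
        while self.stack.pop()[0] != tag:
            pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1][1].append(text)


def build(markup):
    """Parse markup with the toy rules above and return the document node."""
    builder = TolerantTreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


BAD_HTML = "<html>\n<badtag>\n</badtag>\n<div>\n<p>\n</div>\nBad HTML\n</p>\n"
tree = build(BAD_HTML)
print(tree)
```

With these rules the `</div>` tag implicitly closes the open `<p>`, the text ends up directly under `<html>`, and the stray `</p>` is dropped. Firefox goes further, as listing 3.2 shows: it also inserts the implied `<head>` and `<body>` elements and turns the stray `</p>` into an empty paragraph.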

Document Object Model

Parsing the HTML document results in the DOM tree: a parse tree of DOM element and attribute nodes. It is the object representation of the HTML document and the interface of HTML elements to other modules, such as the JavaScript module. The root of the tree is the “Document” object.

The DOM has an almost one-to-one relation to the markup. Listing 3.3 gives a simple HTML document, and figure 3.3 shows the corresponding DOM tree.5

    <html>
    <body>
    <p>
    Hello World
    </p>
    <div> <img src="example.png" /></div>
    </body>
    </html>

Listing 3.3: Example markup: a simple HTML document.
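The near one-to-one relation between markup and tree can be made concrete with a small sketch that turns the markup of listing 3.3 into a tree and prints it as an indented outline. It assumes a simplified node model: the names `Node`, `DomBuilder`, `outline` and the `VOID_ELEMENTS` set are hypothetical, whitespace-only text is discarded, and no implied elements such as `<head>` are inserted, so the result is only an approximation of the DOM a browser would build.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements that never have children or a closing tag (partial list, assumed
# sufficient for this sketch).
VOID_ELEMENTS = {"img", "br", "hr", "input", "meta", "link"}


@dataclass
class Node:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


class DomBuilder(HTMLParser):
    """Builds a simplified DOM-like tree rooted at a Document node."""

    def __init__(self):
        super().__init__()
        self.document = Node("Document")
        self.open_elements = [self.document]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self.open_elements[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(node)

    def handle_endtag(self, tag):
        # Only well-nested end tags are handled; no error recovery here.
        if len(self.open_elements) > 1 and self.open_elements[-1].name == tag:
            self.open_elements.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.open_elements[-1].children.append(Node("#text", {"data": text}))


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    label = node.name
    if node.attributes:
        label += " " + " ".join(f'{k}="{v}"' for k, v in node.attributes.items())
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


LISTING_3_3 = (
    "<html>\n<body>\n<p>\nHello World\n</p>\n"
    '<div> <img src="example.png" /></div>\n</body>\n</html>\n'
)

builder = DomBuilder()
builder.feed(LISTING_3_3)
builder.close()
tree_lines = outline(builder.document)
print("\n".join(tree_lines))
```

Each element in the markup appears as exactly one node in the outline, with the attribute of `<img>` attached to its node and the text placed under `<p>`.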

The DOM is also specified by the W3C; the DOM technical reports can be found in [39].
