The characters you can enter into an XML document are tab, carriage return, line feed, and any of the legal characters belonging to the Unicode character set (or the equivalent ISO/IEC 10646 character set), which includes characters for all the world’s written languages. (For more information on these character sets and the specific characters you can use in XML, see the section “2.2 Characters” in the XML specification at http://www.w3.org/TR/REC-xml.)
An XML file can represent, or encode, the Unicode characters in different ways. For example, if the file uses the encoding scheme known as UTF-8, it represents a capital A as the number 65 stored in 8 bits (41 in hexadecimal). However, if it uses the encoding scheme known as UTF-16, it represents a capital A as the number 65 stored in 16 bits (0041 in hexadecimal).
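The byte values described above can be checked with a short script. The following is a minimal sketch in Python, which is not used in the original text and is chosen here only for illustration. The helper name hex_bytes is hypothetical, and the big-endian form of UTF-16 is assumed so that the bytes appear in the same order as the 0041 quoted above.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as a hexadecimal string."""
    return text.encode(encoding).hex()


# Capital A is the number 65 (hexadecimal 41) in both schemes;
# only the number of bits used to store it differs.
print(hex_bytes("A", "utf-8"))      # 41
print(hex_bytes("A", "utf-16-be"))  # 0041

# A character outside the ASCII set needs two bytes in UTF-8.
print(hex_bytes("ñ", "utf-8"))      # c3b1
```

The last line suggests why the choice of encoding starts to matter once a document contains characters outside the ASCII set.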
If you save your XML document in a plain text format using Notepad or
78 XML Step by Step
Suppose, however, that you want to be able to type characters that aren’t in the ASCII set directly into your element character data or your attribute values, such as the á and ñ in the following element:
<AUTHOR>Vicente Blasco Ibáñez</AUTHOR>
In this case, you must do two things:
1 Make sure that the XML file is encoded using a scheme that the XML processor can understand. All conforming XML processors must be able to handle UTF-8 and UTF-16 encoded files, so try to use one of these schemes. Some XML processors, however, support additional encoding schemes you can use.
To create your XML document, you must use a word processor or other program that can create text files in which all characters are uniformly encoded in a supported scheme. For example, you can create a UTF-8 encoded XML document by opening or creating it in Microsoft Word 2002, and then saving the file by choosing the Save As command from the File menu, selecting Plain Text (*.txt) in the Save As Type drop-down list in the Save As dialog box, clicking the Save button, and then in the File Conversion dialog box selecting the Unicode (UTF-8) encoding scheme. (In Word 2000, you need to select Encoded Text (*.txt) in the Save As Type drop-down list rather than Plain Text (*.txt).)
The Microsoft Notepad editor supplied with some versions of Windows also lets you select the encoding scheme when you save a file.
2 If your XML document is encoded in a scheme other than UTF-8 or UTF-16, you must specify the name of the scheme by including an encoding declaration in the XML declaration, immediately following the version information. For example, the following encoding declaration indicates that the file is encoded using the ISO-8859-1 scheme:
<?xml version="1.0" encoding="ISO-8859-1" ?>
(If you also include a standalone document declaration, as described in the sidebar “The standalone Document Declaration” on page 159, it must go after the encoding declaration.) If the XML processor can’t handle the specified encoding scheme, it will generate a fatal error.
Also, if your XML document references an external DTD subset (described in Chapter 5) or an external parsed entity (described in Chapter 6), and if the file containing the subset or entity uses an encoding scheme other than UTF-8 or UTF-16, you must include a text declaration at the very beginning of the file. A text declaration is similar to an XML declaration, except that the version information is optional, the encoding declaration is mandatory, and it can’t include a standalone document declaration. Here’s an example:
<?xml version="1.0" encoding="ISO-8859-1" ?>
(In an external parsed entity, the text declaration is not part of the entity’s replacement text that gets inserted by an entity reference.)
You can also insert non-ASCII characters into any XML document, regardless of its encoding, by using character references as discussed in “Inserting Character References” on page 153.
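The effect of the encoding declaration, and the character-reference alternative, can be tried out with a small experiment. The sketch below assumes Python and its standard xml.etree.ElementTree parser, neither of which is part of the original text. The helper author_text and the three sample byte strings are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

NAME = "Vicente Blasco Ibáñez"

# ISO-8859-1 bytes with a matching encoding declaration.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1" ?>'
    f"<AUTHOR>{NAME}</AUTHOR>"
).encode("iso-8859-1")

# The same ISO-8859-1 bytes without a declaration. With no declaration
# and no byte order mark, the parser assumes UTF-8.
undeclared = f"<AUTHOR>{NAME}</AUTHOR>".encode("iso-8859-1")

# Pure ASCII bytes: the two non-ASCII characters are written as
# character references (225 is á, 241 is ñ).
with_references = b"<AUTHOR>Vicente Blasco Ib&#225;&#241;ez</AUTHOR>"


def author_text(data: bytes):
    """Return the AUTHOR text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


print(author_text(declared))         # Vicente Blasco Ibáñez
print(author_text(undeclared))
print(author_text(with_references))  # Vicente Blasco Ibáñez
```

In this sketch the undeclared variant is rejected, because its bytes are not valid UTF-8. The declared variant and the character-reference variant both yield the same author name.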
The XML specification’s support for the Unicode character set allows you to freely include characters belonging to any written language. It might also be important to tell the application that handles your document the specific language used for the text in a particular element. For example, the application might need to know the language of the text in order to display it properly on the screen or to check its spelling. XML reserves an attribute named xml:lang for this purpose. (The xml: indicates that this attribute belongs to the xml namespace. Because this namespace is predefined, you don’t have to declare it. See “Using Namespaces” on page 69.) To specify the language of the text in a particular element (the text in the element’s character data as well as its attribute values), include an xml:lang attribute specification in the element’s start-tag, assigning it an identifier for the language, as in the following example elements:
<!-- This element contains U.S. English text: -->
<TITLE xml:lang="en-US">The Color Purple</TITLE>
For a description of the official language identifiers you can assign to xml:lang, see the section “2.12 Language Identification” in the XML specification at http://www.w3.org/TR/REC-xml. The xml:lang attribute specification applies to the element in which it occurs and to any nested elements, unless it is overridden by another xml:lang attribute specification in a nested element. To indicate the language of the text throughout your entire document, just include xml:lang in the document element.
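The scoping rule just described (a setting applies to nested elements until a nested xml:lang overrides it) is something the application has to work out for itself. The following is a minimal sketch of how it might do so, assuming Python and its standard xml.etree.ElementTree parser, which are not part of the original text. The INVENTORY and BOOK elements, the German title, and the function effective_languages are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the predefined xml: prefix as this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document: the document element sets a default
# language, and one nested element overrides it.
SAMPLE = """\
<INVENTORY xml:lang="en-US">
  <BOOK>
    <TITLE>The Color Purple</TITLE>
  </BOOK>
  <BOOK xml:lang="de">
    <TITLE>Der Richter und sein Henker</TITLE>
  </BOOK>
</INVENTORY>
"""


def effective_languages(element, inherited=None):
    """Yield (tag, language) for an element and all of its descendants.

    An element's own xml:lang wins; otherwise it takes the language
    in effect for its parent.
    """
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


for tag, lang in effective_languages(ET.fromstring(SAMPLE)):
    print(tag, lang)
# INVENTORY en-US
# BOOK en-US
# TITLE en-US
# BOOK de
# TITLE de
```

The parser only hands over the attribute values. The inheritance logic lives entirely in the application code.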
The xml:lang attribute doesn’t affect the behavior of the XML processor.
The processor merely passes the attribute specification on to the application, which can use the value as appropriate. The XML specification doesn’t say how the xml:lang setting must be used.
When you get to Chapters 5 and 7 on creating valid documents, keep in mind that in a valid document the xml:lang attribute must be defined just like any other attribute. (This will make sense when you read those chapters.) For instance, in a DTD you could define this attribute as in the following example attribute-list declaration:
<!ATTLIST TITLE xml:lang NMTOKEN #REQUIRED>