Markup and Core Concepts
This is probably the most important chapter in the book, as it describes
the fundamental building blocks of all XML-derived languages: elements,
attributes, entities, and processing instructions. It explains what a document
is, and what it means to say it is well-formed or valid. Mastering these
concepts is a prerequisite to understanding the many technologies,
applications, and software related to XML.
How do we know so much about the syntactical details
of XML? It's all described in a technical document maintained by the W3C, the
XML recommendation (http://www.w3.org/TR/2000/REC-xml-20001006). It's not light
reading, and most users of XML won't need it, but you may be curious to know
where this is coming from. For those interested in the standards process and
what all the jargon means, take a look at Tim Bray's interactive, annotated
version of the recommendation at http://www.xml.com/axml/testaxml.htm.
Anatomy of a Document
Example 2-1 shows a bite-sized XML example. Let's take a look.
Example 2.1. A Small XML Document
<message>Don't forget to recharge K-9
<emphasis>twice a day</emphasis>.
Also, I think we should have his
bearings checked out. See you soon
(or late). I have a date with
It's a goofy example, but perfectly acceptable XML. XML lets you name the
parts anything you want, unlike HTML, which limits you to predefined tag
names. XML doesn't care how you're going to use the document, how it will
appear when formatted, or even what the names of the elements mean. All that
matters is that you follow the basic rules for markup described in this
chapter. This is not to say that matters of organization aren't important,
however. You should choose element names that make sense in the context of the
document, instead of random things like signs of the zodiac. This is more for
your benefit and the benefit of the people using your XML application than
for the benefit of the XML processor.
This example, like all XML, consists of content interspersed with markup
symbols. The angle brackets (<>) and the names they enclose are called
tags. Tags demarcate and label the parts of the
document, and add other information that helps define the structure. The text
between the tags is the content of the document, raw information that may be
the body of a message, a title, or a field of data. The markup and the content
complement each other, creating an information entity with partitioned,
labeled data in a handy package.
Although XML is designed to be relatively readable by humans, it isn't
intended to create a finished document. In other words, you can't open up just
any XML-tagged document in a browser and expect it to be formatted
nicely. XML is really meant as a way to hold content so that, when combined
with other resources such as a stylesheet, the document becomes a finished
product with style and polish.
We'll look at how to combine a stylesheet with an XML document to generate
formatted output in Chapter 4, "Presentation: Creating the End Product". For
now, let's just imagine what it might look like with a simple stylesheet
applied. For example, it could be rendered as shown in Example 2-2.
Example 2.2. The Memorandum, Formatted with a Stylesheet
Don't forget to recharge K-9 twice a day.
Also, I think we should have his bearings checked out.
See you soon (or late). I have a date with some Daleks...
From: The Doctor
The rendering of this example is purely speculative at this point. If we
used some other stylesheet, we could format the same memo a different way. It
could change the order of elements, say by displaying the From: line above the
message body. Or it could compress the message body to a width of 20
characters. Or it could go even further by using different fonts, creating a
border around the message, causing parts to blink on and off--whatever you
want. The beauty of XML is that it doesn't put any restrictions on how you
present the document.
Let's look closely at the markup to discern its structure. As Figure 2-1
demonstrates, the markup tags divide the memo into regions, represented in the
diagram as boxes containing other boxes. The first box contains a special
declarative prolog that provides administrative information about the
document. (We'll come back to that in a moment.) The other boxes are called
elements. They act as containers and labels of text.
The largest element, labeled <time-o-gram>, surrounds all the other
elements and acts as a package that holds together all the subparts. Inside it
are specialized elements that represent the distinct functional parts of the
document. Looking at this diagram, we can say that the major parts of a <time-o-gram> are the destination (<to>), the sender (<from>), a message teaser (<subject>), and the message body (<message>). The last is the most complex,
mixing elements and text together in its content. So we can see from this
example that even a simple XML document can harbor several levels of structure.
Figure 2.1. Elements in the memo document
A Tree View
Elements divide the document into its constituent parts. They can contain
text, other elements, or both. Figure 2-2 breaks out the hierarchy of elements
in our memo. This diagram, called a tree because of
its branching shape, is a useful representation for discussing the
relationships between document parts. The black rectangles represent the seven
elements. The top element (<time-o-gram>)
is called the root element. You'll often hear it
called the document element, because it encloses all
the other elements and thus defines the boundary of the document. The
rectangles at the end of the element chains are called leaves, and represent the actual content of the document.
Every object in the picture with arrows leading to or from it is a node.
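To make the tree view concrete, here is a minimal sketch that parses a memo shaped like ours and prints its element hierarchy. It uses Python's standard xml.etree.ElementTree module. The memo text is an assumed, shortened reconstruction: the values of <to> and <subject> are placeholders, not taken from the example above.

```python
import xml.etree.ElementTree as ET

# Assumed, shortened reconstruction of the memo; the <to> and <subject>
# values are placeholders chosen for illustration.
MEMO = """<time-o-gram pri="important">
<to>Sarah</to>
<subject>Reminder</subject>
<message>Don't forget to recharge K-9
<emphasis>twice a day</emphasis>.</message>
<from>The Doctor</from>
</time-o-gram>"""

def outline(element, depth=0):
    """Return one indented line per element, starting with the given element."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(MEMO)
tree_lines = outline(root)
print("\n".join(tree_lines))

# The attribute hangs off the root element as a name/value pair.
print("pri =", root.get("pri"))
```

The indentation of each printed name mirrors its depth in the tree, with the root element at the left margin.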
Figure 2.2. Tree diagram of the memo
There's one piece of Figure 2-2 that we haven't yet mentioned: the box on
the left labeled pri. It was inside the <time-o-gram> tag, but here we see it
branching off the element. This is a special kind of content called an attribute that provides additional information about an
element. Like an element, an attribute has a label (pri) and some content (important).
You can think of it as a name/value pair contained in the <time-o-gram> element tag. Attributes are
used mainly for modifying an element's behavior rather than holding data;
later processing might print "High Priority" in large letters at the top of
the document, for example.
Now let's stretch the tree metaphor further and think about the diagram as
a sort of family tree, where every node is a parent or a child (or both) of
other nodes. Note, though, that unlike a family tree, an XML element has only
one parent. With this perspective, we can see that the root element (a
grizzled old <time-o-gram>) is the
ancestor of all the other elements. Its children are the four elements
directly beneath it. They, in turn, have children, and so on until we reach
the childless leaf nodes, which contain the text of the document and any empty
elements. Elements that share the same parent are said to be siblings.
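The family relationships can be explored in code as well. The sketch below is one possible approach, again using ElementTree on a simplified stand-in for the memo (the text values are placeholders). ElementTree does not record parent links, so the sketch builds its own child-to-parent map.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the memo; text values are placeholders.
root = ET.fromstring(
    "<time-o-gram><to>Sarah</to><subject>Reminder</subject>"
    "<message>Recharge K-9 <emphasis>twice a day</emphasis>.</message>"
    "<from>The Doctor</from></time-o-gram>"
)

# ElementTree does not store parent links, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def siblings(element):
    """Elements that share this element's parent, excluding the element itself."""
    parent = parent_of.get(element)
    return [] if parent is None else [e for e in parent if e is not element]

emphasis = root.find("message/emphasis")
children = [child.tag for child in root]
sibling_tags = [e.tag for e in siblings(root.find("subject"))]

print(children)                   # children of the root element
print(parent_of[emphasis].tag)    # the single parent of <emphasis>
print(sibling_tags)               # siblings of <subject>
```

Because every element appears as a key at most once, the map also demonstrates that an element has only one parent; the root has none.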
Every node in the tree can be thought of as the root of a smaller subtree.
Subtrees have all the properties of a regular tree, and the top of each
subtree is the ancestor of all the descendant nodes below it. We will see in
Chapter 6, "Transformation: Repurposing Documents", that an XML document can be
processed easily by breaking it down into smaller subtrees and reassembling
the result later. Figure 2-3 shows some examples of subtrees in our <time-o-gram> example.
Figure 2.3. Some subtrees
And that's the 10-minute overview of XML. The power of XML is its
simplicity. In the rest of this chapter, we'll talk about the details of the
markup.
The Document Prolog
Somehow, we need to tip off the world that our document is marked up in
XML. If we leave it to a computer program to guess, we're asking for trouble.
A lot of markup languages look similar, and when you add different versions to
the mix, it becomes difficult to tell them apart. This is especially true for
documents on the World Wide Web, where there are literally hundreds of
different file formats in use.
The top of an XML document is graced with special information called the
document prolog. At its simplest, the prolog merely
says that this is an XML document and declares the version of XML being used.
But the prolog can hold additional information that nails down such details
as the document type definition being used, declarations of special pieces of
text, the text encoding, and instructions to XML processors.
Let's look at a breakdown of the prolog, and then we'll examine each part
in more detail. Figure 2-4 shows an XML document. At the top is an XML
declaration (1). After this is a document type declaration (2) that links to a
document type definition (3) in a separate file. This is followed by a set of
declarations (4). These four parts together comprise the prolog (6), although
not every prolog will have all four parts. Finally, the root element (5)
contains the rest of the document. This ordering cannot be changed: if there
is an XML declaration, it must be on the first line; if there is a document
type declaration, it must precede the root element.
Figure 2.4. A Document with a prolog and a root
Let's take a closer look at our <time-o-gram> document's prolog, shown here
in Example 2-3. Note that because we're examining the prolog in more detail,
the numbers in Example 2-3 aren't the same as those in Figure 2-4.
Example 2.3. A Document Prolog
<?xml version="1.0" encoding="utf-8"?>               (1)
<!DOCTYPE time-o-gram                                (2)
  PUBLIC "-//LordsOfTime//DTD TimeOGram 1.8//EN"     (3)
  "http://www.lordsoftime.org/DTDs/timeogram.dtd"    (4)
[                                                    (5)
  <!ENTITY sj "Sarah Jane">                          (6)
  <!ENTITY me "Doctor Who">
]>                                                   (7)
(1) The XML declaration describes some of the most general properties of
the document, telling the XML processor that it needs an XML parser to
interpret this document.
(2) The document type declaration describes the
root element type, in this case <time-o-gram>, and (on lines 3 and 4)
designates a document type definition (DTD) to
control markup structure.
(3) The identity code, called a public identifier,
specifies the DTD to use.
(4) A system identifier specifies the location of
the DTD. In this example, the system identifier is a URL.
(5) This is the beginning of the internal subset,
which provides a place for special declarations.
(6) Inside this internal subset are two entity
declarations.
(7) The end of both the internal subset (]) and the
document type declaration (>) complete the
prolog.
Each of these terms is described in more detail later in this chapter.
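If you want to see how a parser reports the pieces of a prolog, the following sketch feeds one to Python's xml.parsers.expat module and records what the handlers receive. The closing ]> and the empty root element are added here so that the fragment forms a complete document; the parser is not asked to fetch the DTD, so the URL is never contacted.

```python
import xml.parsers.expat

# The prolog from the example, completed with "]>" and an empty root
# element (both added here) so that it parses as a whole document.
DOC = """<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE time-o-gram
  PUBLIC "-//LordsOfTime//DTD TimeOGram 1.8//EN"
  "http://www.lordsoftime.org/DTDs/timeogram.dtd"
[
  <!ENTITY sj "Sarah Jane">
  <!ENTITY me "Doctor Who">
]>
<time-o-gram/>"""

doctype = {}
entities = {}

def on_doctype(name, system_id, public_id, has_internal_subset):
    # Called once for the document type declaration.
    doctype.update(root=name, system=system_id, public=public_id,
                   internal_subset=bool(has_internal_subset))

def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
    # Called once per entity declaration in the internal subset.
    entities[name] = value

parser = xml.parsers.expat.ParserCreate()
parser.StartDoctypeDeclHandler = on_doctype
parser.EntityDeclHandler = on_entity
parser.Parse(DOC, True)

print(doctype)
print(entities)
```

The handler arguments line up with the numbered parts: the root element type, the public identifier, the system identifier, and the entity declarations inside the internal subset.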
The XML declaration
The XML declaration is an announcement to the XML processor
that this document is marked up in XML. Its form is shown in Figure 2-5. The
declaration begins with the five-character delimiter <?xml (1), followed by some number of property
definitions (2), each of which has a property name (3) and value in quotes
(4). The declaration ends with the two-character closing delimiter ?> (5).
Figure 2.5. XML declaration syntax
There are three properties that you can set:
version
Sets the version number. Nearly all documents use XML 1.0, so the
value is almost always 1.0. However, as new versions are approved, this property
tells the XML processor which version to use. You should always define
this property in your prolog.
encoding
Defines the character encoding used in the document, such as US-ASCII or iso-8859-1. If
the file is stored in an encoding other than UTF-8 or UTF-16, which a parser
assumes by default (for example, a legacy encoding for Japanese Katakana or
Cyrillic text), you should declare this
property. Otherwise, it's okay to leave it out. Character encodings are
explained in Chapter 7, "Internationalization".
standalone
Tells the XML processor whether there are any other files to load. For
example, you would set this to no if there are
external entities (see "Entities: Placeholders for Content" later in this
chapter) or a DTD to load in addition to the document's main file. If you
know that the file can stand on its own, setting standalone="yes" can improve downloading performance.
This parameter is explained in more detail in Chapter 5, "Document Models: A
Higher Level of Control".
Some examples of well-formed XML declarations are:
<?xml version='1.0' encoding='US-ASCII' standalone='yes'?>
<?xml version = '1.0' encoding= 'iso-8859-1' standalone ="no"?>
The declaration itself is optional, as are the encoding and standalone
properties. If you do include a declaration, however, it must contain the
version property, and the properties must appear in the order version,
encoding, standalone. The parameter names must be lowercase, and all
values must be quoted with either double or single quotes.
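As a quick check of how these declarations are read, here is a small sketch using Python's xml.parsers.expat module. The function name read_declaration and the <memo/> root element are made up for this illustration.

```python
import xml.parsers.expat

def read_declaration(document):
    """Return (version, encoding, standalone) from a document's XML declaration,
    or None if the document has no declaration."""
    found = {}

    def on_decl(version, encoding, standalone):
        # expat reports standalone as 1 (yes), 0 (no), or -1 (not specified).
        found["decl"] = (version, encoding, {1: "yes", 0: "no"}.get(standalone))

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_decl
    parser.Parse(document, True)
    return found.get("decl")

full = read_declaration(
    b"<?xml version='1.0' encoding='US-ASCII' standalone='yes'?><memo/>")
spaced = read_declaration(
    b'<?xml version = "1.0" encoding= "iso-8859-1" standalone ="no"?><memo/>')
minimal = read_declaration(b"<?xml version='1.0'?><memo/>")
missing = read_declaration(b"<memo/>")

print(full)
print(spaced)
print(minimal)
print(missing)
```

Extra whitespace around the equals signs and the choice of quote character make no difference to what the parser reports.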
The document type declaration
The second part of the prolog is the document type
declaration. This is where you can specify various parameters such as entity
declarations, the DTD to use for validating the document, and the name of the
root element. By referring to a DTD, you are requesting that the parser
compare the document instance to a document model, a process called validity checking. Checking the validity of your document
is optional, but it is useful if you need to ensure that the document follows
predictable patterns and includes required data. See Chapter 5, "Document
Models: A Higher Level of Control" for detailed information on DTDs and validity
checking.
The syntax for a document type declaration is shown in Figure 2-6. The
declaration starts with the literal string <!DOCTYPE
(1) followed by the root element (2), which is the first XML element to appear
in the document and the one that contains the rest of the document. If you are
using a DTD with the document, you need to include the URI of the DTD (3)
next, so the XML processor can find it. After that comes the internal subset
(5), which is bound on either side by square brackets (4 and 6). The
declaration ends with a closing >.
Figure 2.6. Document type declaration syntax
The internal subset provides a place to put various declarations for use in
your document, as we saw in Figure 2-4. These declarations might include
entity definitions, and parts of DTDs. The internal subset is the only place
where you can put these declarations within the document itself.
The internal subset is used to augment or redefine the declarations found
in the external subset. The external subset is the
collection of declarations existing outside the document, like in a DTD. The
URI you provide in the document type declaration points to a file containing
these external declarations. Internal and external subsets are optional.
Chapter 5, "Document Models: A Higher Level of Control" explains internal and
external subsets in more detail.
Elements: The Building Blocks of XML
Elements are parts of a document. You can separate a document into parts so
they can be rendered differently, or used by a search engine. Elements can be
containers, with a mixture of text and other elements. This element contains
only text:
<flooby>This is text contained inside an element</flooby>
and this element contains both text and elements:
<outer>this is text<inner>more
text</inner>still more text</outer>
Some elements are empty, and contribute information by their position and
attributes. There is an empty element inside this example:
<outer>an element can be empty: <nuttin/></outer>
Figure 2-7 shows the syntax for a container element. It begins with a start
tag (1) consisting of an angle bracket (<) followed by a name (2). The
start tag may contain some attributes (3) separated by whitespace, and it ends
with a closing angle bracket (>). An attribute defines a property of the
element and consists of a name (4) joined by an equals sign (=) to a value in quotes (5). An element can have any number
of attributes, but no two attributes can have the same name. Following the
start tag is the element's content (6), which in turn is followed by an end
tag (7). The end tag consists of an opening angle bracket, a slash, the
element's name, and a closing bracket. The end tag has no attributes, and the
element name must match the start tag's name exactly.
Figure 2.7. Container element syntax
As shown in Figure 2-8, an empty element (one with no content) consists of
a single tag (1) that begins with an opening angle bracket (<) followed by
the element name (2). This is followed by some number of attributes (3), each
of which consists of a name (4) and a value in quotes (5), and the element
ends with a slash (/) and a closing angle bracket.
Figure 2.8. Empty element syntax
An element name must start with a letter or an underscore, and can contain
any number of letters, numbers, hyphens, periods, and underscores. Element
names can include accented Roman characters; letters from alphabets such as
Cyrillic, Greek, Hebrew, Arabic, Thai, Hiragana, Katakana, and Devanagari; and
ideograms from Chinese, Japanese, and Korean. The colon symbol is used in
namespaces, as explained in "Namespaces: Expanding Your Vocabulary," so avoid
using it in element names that don't use a namespace. Space, tab, newline,
equals sign, and any quote characters are separators for element names,
attribute names, and attribute values, so they are not allowed either. Some
valid element names are: <Bob>, <chapter.title>, <THX-1138>, or even <_>. XML names are case-sensitive, so <Para>, <para>, and <pArA> are three different elements.
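The naming rule can be approximated with a short pattern. The sketch below is deliberately limited to ASCII, so it rejects the accented and non-Roman letters that XML also allows, and it leaves out the colon; the function name looks_like_xml_name is made up for this illustration.

```python
import re

# ASCII-only approximation of the naming rule described above. Real XML
# names also allow many non-ASCII letters, which this sketch ignores.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def looks_like_xml_name(name):
    """True if name starts with a letter or underscore and continues with
    letters, digits, periods, hyphens, or underscores."""
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ["Bob", "chapter.title", "THX-1138", "_", "1st", "two words", "a=b"]:
    print(candidate, looks_like_xml_name(candidate))

# Names are case-sensitive, so these are three distinct names.
distinct = len({"Para", "para", "pArA"})
print(distinct)
```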
There can be no space between the opening angle bracket and the element
name, but adding extra space anywhere else in the element tag is okay. This
allows you to break an element across lines to make it more readable. For
example:
<crewmember
  class="rower"
  >Dronicus Laborius</crewmember
>
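There are two rules about the positioning of start and end tags:
The end tag must come after the start tag.
An element's start and end tags must both reside in the same parent.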
To understand the second rule, think of elements as boxes. A box can sit
inside or outside another box, but it can't protrude through the box without
making a hole in the side. Thus, the following example of overlapping elements
is not allowed:
<a>Don't <b>do</a> this!</b>
Untangled, properly nested elements are okay.
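A parser enforces the box rule for you. In this sketch, Python's ElementTree rejects the overlapping markup and accepts a nested rewrite of it. The nested version is written for this illustration, and is_well_formed is a helper name invented here.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """True if the parser accepts the markup, False if it reports a parse error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

overlapping = "<a>Don't <b>do</a> this!</b>"
nested = "<a>Do <b>this</b> instead.</a>"   # illustrative rewrite

print(is_well_formed(overlapping))
print(is_well_formed(nested))
```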
Anything in the content that is not an element is text, or character data. The text can include any character in the
character set that was specified in the prolog. However, some characters must
be represented in a special way so as not to confuse the parser. For example,
the left angle bracket (<) is reserved for element tags. Including it
directly in content causes an ambiguous situation: is it the start of an XML
tag or is it just data? Here's an example:
<foo>x < y</foo> yikes!
To resolve this conflict, you need to use a special code in place of the
offending character. For the left angle bracket, the code is &lt;. (The equivalent code for the right
angle bracket is &gt;.) So we can rewrite
the above example like this:
<foo>x &lt; y</foo>
Such a substitution is known as an entity
reference. We'll describe entities and entity references in "Entities:
Placeholders for Content".
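When markup is generated by a program, the substitution is usually done by a library routine rather than by hand. Here is a minimal sketch using escape from Python's xml.sax.saxutils module, followed by a round trip through the parser to show that the original character comes back.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "x < y"

# escape() replaces &, <, and > with entity references.
markup = "<foo>" + escape(raw) + "</foo>"
print(markup)

# Parsing turns the reference back into the character it stands for.
parsed_text = ET.fromstring(markup).text
print(parsed_text)

# Without the substitution, the parser refuses the document.
try:
    ET.fromstring("<foo>" + raw + "</foo>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)
```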
In XML, all characters are preserved as a matter of course, including the
white-space characters space, tab, and newline; compare this to programming
languages such as Perl and C, where whitespace characters are essentially
ignored. In markup languages such as HTML, multiple sequential spaces are
collapsed by the browser into a single space, and lines can be broken anywhere
to suit the formatter. XML, on the other hand, keeps all space characters by
default.
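You can watch the parser keep whitespace intact with a short experiment. The <verse> element below is invented for the purpose; the last step shows that collapsing space, as an HTML browser would, is something an application has to do for itself.

```python
import xml.etree.ElementTree as ET

markup = "<verse>  roses   are red,\n\tviolets are blue  </verse>"

# The parser hands back every space, tab, and newline unchanged.
text = ET.fromstring(markup).text
print(repr(text))

# Collapsing runs of whitespace is a separate, explicit step.
collapsed = " ".join(text.split())
print(repr(collapsed))
```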
XML Is Not HTML
If you've had some experience writing HTML documents, you should pay close
attention to XML's rules for elements. Shortcuts you can get away with in HTML
are not allowed in XML. Some important changes you should take note of are:
Element names are case-sensitive in XML. HTML allows you to write tags in
whatever case you want.
In XML, container elements always require both a start and an end tag. In
HTML, on the other hand, you can drop the end tag in some cases.
Empty XML elements require a slash before the right bracket (i.e., <example/>), whereas HTML uses a lone start tag
with no final slash.
XML elements treat whitespace as part of the content, preserving it
unless they are explicitly told not to. But in HTML, most elements throw
away extra spaces and line breaks when formatting content in the browser.
Unlike many HTML elements, XML elements are based strictly on function, and
not on format. You should not assume any kind of formatting or presentational
style based on markup alone. Instead, XML leaves presentation for stylesheets,
which are separate documents that map the elements to styles.
Attributes: More Muscle for Elements
Sometimes you need to convey more information about an element than its
name and content can express. The use of attributes lets you describe details
about the element more clearly. An attribute can be used to give the element a
unique label so it can be easily located, or it can describe a property about
the element, such as the location of a file at the end of a link. It can be
used to describe some aspect of the element's behavior or to create a subtype.
For example, in our <time-o-gram> earlier
in the chapter, we used the attribute pri to
identify it as having a high priority. As shown in Figure 2-9, an attribute
consists of a property name (1), an equals sign (2), and a value in quotes
(3).
Figure 2.9. Attribute syntax
An element can have any number of attributes, as long as each has a unique
name. Here is an element with three attributes:
<kiosk music="bagpipes" color="red" id="page-81527">
Attributes are separated by spaces. They must always follow the element
name, but they can be in any order. The values must be in single (') or double (") quotes. If the
value contains quotes, use the opposite kind of quote to contain it.
If you prefer, you can replace the quote with the entity reference &apos; for a single quote or &quot; for a double quote.
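Both quoting styles produce the same attribute value once parsed. The sketch below checks this with ElementTree, then uses quoteattr from xml.sax.saxutils to quote a value that contains both kinds of quote. The <note> element and its text attribute are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Two spellings of one value: opposite quotes, or an entity reference.
with_single = ET.fromstring("""<note text='She said "hello"'/>""")
with_entity = ET.fromstring('<note text="She said &quot;hello&quot;"/>')
same_value = with_single.get("text") == with_entity.get("text")
print(same_value)

# quoteattr() chooses a safe quoting style for an arbitrary value,
# including one that contains both kinds of quote.
value = 'It\'s a "date"'
tag = "<note text=" + quoteattr(value) + "/>"
round_trip = ET.fromstring(tag).get("text")
print(round_trip == value)
```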
An element can contain only one occurrence of each attribute. So the
following is not allowed:
<!-- Wrong -->
<team person="sue" person="joe" person="jane">
Here are some possible alternatives. Use one attribute to hold all the
values:
<team persons="sue joe jane">
Use multiple attributes:
<team person1="sue" person2="joe" person3="jane">
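A parser will confirm that the repeated attribute is an error, and the single-attribute alternative is easy to take apart in code. This is a sketch using ElementTree; the tags are written as empty elements so that each one forms a complete document, and parses is a helper name invented here.

```python
import xml.etree.ElementTree as ET

def parses(markup):
    """True if the markup is accepted by the parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Repeating an attribute name in one tag is rejected outright.
duplicate_ok = parses('<team person="sue" person="joe" person="jane"/>')
print(duplicate_ok)

# One attribute holding a space-separated list is split by the application.
team = ET.fromstring('<team persons="sue joe jane"/>')
members = team.get("persons").split()
print(members)
```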
Attribute values can be constrained
to certain types if you use a DTD. One type is ID,
which tells XML that the value is a unique identifier code for the element. No
two elements in a document can have the same ID.
Another type, IDREF, is a reference to an ID. Let's demonstrate how these might be used. First, there
is an element somewhere in the document with an ID-type
attribute.
Elsewhere, there is an element that refers to it:
<description>This nut is compatible with <partref
If you use a DTD with your document, you can actually assign the ID and IDREF types to particular
attributes and your XML parser will enforce the syntax of the value, as well
as warn you if the IDREF points to a nonexistent
element or if the ID doesn't have a unique value. We
talk more about these attributes in Chapter 3, "Connecting Resources with
Links".
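Python's ElementTree does not validate against a DTD, but the two checks just described are easy to imitate by hand. In the sketch below, the catalog document, the attribute names id and idref, and the identifier values are all assumptions made for illustration; a validating parser would take the attribute types from the DTD instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: the attribute names "id" and "idref" and the
# identifier values are assumptions made for this illustration.
CATALOG = """<catalog>
  <part id="bolt-1">Bolt</part>
  <part id="nut-1">Nut</part>
  <description>This nut is compatible with <partref idref="bolt-1"/>.</description>
  <description>See also <partref idref="washer-9"/>.</description>
</catalog>"""

def check_references(root):
    """Return (ids used more than once, idrefs that match no id)."""
    seen, duplicates = set(), []
    for element in root.iter():
        identifier = element.get("id")
        if identifier is None:
            continue
        if identifier in seen:
            duplicates.append(identifier)
        seen.add(identifier)
    dangling = [e.get("idref") for e in root.iter()
                if e.get("idref") is not None and e.get("idref") not in seen]
    return duplicates, dangling

duplicates, dangling = check_references(ET.fromstring(CATALOG))
print("duplicate ids:", duplicates)
print("dangling references:", dangling)
```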
Another way a DTD can restrict attributes is by creating an allowed set of
values. You may want to use an attribute called day that can have one of seven values: "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", or "Sunday". The DTD can then tell an XML parser to
reject any value not on that list, e.g., day="Halloween" is invalid. For a more detailed explanation
of attribute types, see Chapter 5, "Document Models: A Higher Level of
Control".
Reserved Attribute Names
Some attribute names have been
set aside for special purposes by the XML working group. These attributes are
reserved for XML's use and begin with the prefix xml:.
The names xml:lang and xml:space are defined for XML Version 1.0. Two
other names, xml:link and xml:attribute, are defined by XLink, another
standard that complements XML and defines how elements can link to one
another. These special attribute names are described here:
xml:lang
Classifies an element by the language of its content. For example, xml:lang="en" describes an element as having English
content. This is useful for creating conditional text, which is content
selected by an XML processor based on criteria such as what language the
user wants to view a document in. We'll return to this topic in Chapter 7,
"Internationalization".
xml:space
Specifies whether whitespace
should be preserved in an element's content. If set to "preserve", any XML processor displaying the
document should honor all newlines, spaces, and tabs in the element's
content. If it is set to "default", then the
processor can do whatever it wants with whitespace (i.e., it sets its own
default). If the xml:space attribute is
omitted, the processor preserves whitespace by default. Thus, if you want to
compress whitespace in an element, set the attribute xml:space="default" and make sure you are using
an XML processor whose default is to remove extra whitespace.
xml:link
Signals to an XLink processor that an element is a link element. For
information on how to use this attribute, see Chapter 3, "Connecting
Resources with Links".
xml:attribute
In addition to xml:link, XLink relies on
a number of attribute names. But to prevent conflict with other potential
uses of those attributes, XLink defines the xml:attribute attribute, which allows you to
"remap" those special attributes. That is, you can say, "When XLink is
looking for an attribute called title, I
want you to use the attribute called linkname instead." This attribute is also
discussed in more detail in Chapter 3, "Connecting Resources with Links".
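Here is a rough sketch of how an application might use xml:lang to select conditional text and read an xml:space setting. It uses Python's ElementTree, which reports attributes with the reserved xml: prefix under a fixed namespace URI. The <greetings> document and the function name text_for are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this fixed namespace URI.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

doc = ET.fromstring(
    '<greetings>'
    '<p xml:lang="en">Hello</p>'
    '<p xml:lang="fr">Bonjour</p>'
    '<pre xml:space="preserve">  keep   this  </pre>'
    '</greetings>'
)

def text_for(root, language):
    """Text of the elements whose xml:lang matches the requested language."""
    return [e.text for e in root.iter() if e.get(XML_NS + "lang") == language]

english = text_for(doc, "en")
french = text_for(doc, "fr")
space_mode = doc.find("pre").get(XML_NS + "space")

print(english, french, space_mode)
```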
Namespaces: Expanding Your Vocabulary
What happens when you want to include elements or attributes from different document types?
For example, you might want to put an equation encoded in the MathML language inside an XML
document. You can't combine multiple DTDs for a single document,
unfortunately, but no one says you have to use a DTD in XML. If you can
survive without a DTD (and most browsers will tolerate documents without
them), you can use a feature of XML called namespaces.
A namespace is a group of element and attribute
names. You can declare that an element exists within a particular namespace
and that it should be validated against that namespace's DTD. By appending a
namespace prefix to an element or attribute name, you tell the parser which
namespace it comes from.
Imagine, for example, that the English language is divided into namespaces
corresponding to conceptual topics. We'll take two of these, say hardware and
food. The topic hardware contains words such as hammer and bolt, while food
has words like fruit and meat. Both namespaces contain the word nut, which has
a different meaning in each context even though it's spelled the same in both.
It really is two different words with the same name, but how can we express
that fact without causing a namespace clash?
This same problem can occur in XML, where two XML objects in different
namespaces can have the same name, resulting in ambiguity about where they
came from. The solution is to have each element or attribute specify which
namespace it comes from by including the namespace as a prefix.
The syntax for this qualified element name is
shown in Figure 2-10. A namespace prefix (1) is joined by a colon (2) to the
local name of the element or attribute (3).
Figure 2.10. Qualified name syntax
Figure 2-11 illustrates how an element, <nut>, must be treated to use the versions
from both the hardware and food namespaces.
Figure 2.11. Qualifying an element's namespace with prefixes
Namespaces aren't useful only for preventing name clashes. More generally,
they help the XML processor sort out different groups of elements for
different treatments. Returning to the MathML example, the elements from
MathML's namespace must be treated differently from regular XML elements. The
browser needs to know when to enter "math equation mode" and when to be in
"regular XML mode." Namespaces are crucial for the browser to switch modes.
In another example, the transformation language XSLT (see Chapter 6,
"Transformation: Repurposing Documents") relies on namespaces to distinguish
between XML objects that are data, and those that are instructions for
processing the data. The instructional elements and attributes have an xsl: namespace prefix. Anything without a namespace prefix
is treated as data in the transformation process.
A namespace must be declared in the document before you can use it. The
declaration is in the form of an attribute inside an element. Any descendants
of that element become part of the namespace. Figure 2-12 shows the syntax for
a namespace declaration. It starts with the keyword xmlns (1) to alert the XML parser that this attribute is a
namespace declaration. This is followed by a colon, then a namespace prefix
(2), an equals sign, and finally a URL in quotes (3).
Figure 2.12. Namespace declaration syntax
If the namespace prefix bob isn't to your liking,
you can use any name you want, as long as it observes the element-naming
rules. As a result, b, bobs-company, or wiggledy.piggledy
are all acceptable names. Be careful not to use prefixes like xml, xsl, or other names reserved by
XML and related languages.
The value of the xmlns: attribute is a
URL, usually belonging to the organization that maintains the namespace. The
XML processor isn't required to do anything with the URL, however. There
doesn't even have to be a document at the location it points to. Specifying
the URL is a formality to provide additional information about the namespace,
such as who owns it and what version you're using.
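To see a declaration at work, here is a small sketch that parses a document using both kinds of nut. The namespace URLs and the <order> element are made up for this illustration, and nothing is fetched from them. Python's ElementTree reports each qualified name as the URL in braces followed by the local name.

```python
import xml.etree.ElementTree as ET

# The namespace URLs are made up for illustration; nothing is fetched from them.
DOC = """<order xmlns:hardware="http://example.com/ns/hardware"
       xmlns:food="http://example.com/ns/food">
  <hardware:nut>hex, 10 mm</hardware:nut>
  <food:nut>cashew</food:nut>
</order>"""

root = ET.fromstring(DOC)

# Same local name, different namespaces: the parser keeps them apart.
tags = [child.tag for child in root]
print(tags)

# The prefix is only shorthand. A search may use any prefix it likes,
# as long as it maps to the same URL.
ns = {"hw": "http://example.com/ns/hardware"}
hardware_nut = root.find("hw:nut", ns).text
print(hardware_nut)
```

The lookup with the prefix hw succeeds even though the document spells it hardware, which shows that the URL, not the prefix, identifies the namespace.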
Any element in the document can contain a namespace declaration. Most
often, the root element will contain the declarations used in the document,
but that's not a requirement. You may find it useful to limit the scope of a
namespace to a region inside the document by declaring the namespace in a
deeper element. In that case, the namespace applies only to that element and
its descendants.
Here's an example of a document combining two namespaces, myns and eq:
<myns:date>March 4, 2001</myns:date>
<myns:subject>Effects of Caffeine on Psychokinetic Ability</myns:subject>
<myns:abstract>The experiment consists of a subject, a can of
caffeinated soda, and a goldfish tank. The ability to make a
goldfish turn in a circle through the power of a human's mental
control is given by the well-known equation:
where P is the probability it will turn in a given time interval,
m is the mental acuity of the fish, M is the mental acuity of
the subject, and d is the distance between
fish and subject.</myns:abstract>
We can declare one of the namespaces to be the default by omitting the
colon (:) and the name from the xmlns
attribute. Elements in the default namespace don't need the
namespace prefix, resulting in clearer markup (unprefixed attributes, by
contrast, are not placed in the default namespace):
<date>March 4, 2001</date>
<subject>Effects of Caffeine on Psychokinetic Ability</subject>
<abstract>The experiment consists of a subject, a can of
caffeinated soda, and a goldfish tank. The ability to make a
goldfish turn in a circle through the power of a human's mental
control is given by the well-known equation:
where P is the probability it will turn in a given time interval,
m is the mental acuity of the fish, M is the mental acuity
of the subject, and d is the distance between
fish and subject.</abstract>
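A parser treats the prefixed and default forms as the same name. The following sketch compares them using ElementTree; the namespace URL is hypothetical, and only a single <date> element is used to keep the comparison short.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URL; only the identity of the string matters.
prefixed = ET.fromstring(
    '<myns:date xmlns:myns="http://example.org/journal">March 4, 2001</myns:date>')
defaulted = ET.fromstring(
    '<date xmlns="http://example.org/journal">March 4, 2001</date>')
unqualified = ET.fromstring('<date>March 4, 2001</date>')

# With or without a prefix, the element ends up with the same qualified name.
same_name = prefixed.tag == defaulted.tag
print(prefixed.tag, defaulted.tag, same_name)

# An element with no declaration in scope belongs to no namespace at all.
print(unqualified.tag)
```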
Namespaces can be a headache if used in conjunction with a DTD. It would
be nice if the parser ignored any elements or attributes from another
namespace, so your document would validate under a DTD that had no knowledge
of the namespace. Unfortunately, that is not the case. To use a namespace
with a DTD, you have to rewrite the DTD so it knows about the elements in
the namespace.
Another problem with namespaces is that they don't import a DTD or any
other kind of information about the elements and attributes you're using. So
you can actually make up your own elements, add the namespace prefix, and
the parser will be none the wiser. This makes namespaces less useful for
those who want to constrain their documents to conform to a DTD.
For these and other reasons, namespaces are a point of contention among
XML planners. It's not clear what will happen in the future, but something
needs to be done to bridge the gap between structure enforcement and
namespaces.
Entities: Placeholders for Content
With the basic parts of XML markup defined, there is one more component we
need to look at. An entity is a placeholder for
content, which you declare once and can use many times almost anywhere in the
document. It doesn't add anything semantically to the markup. Rather, it's a
convenience to make XML easier to write, maintain, and read.
Entities can be used for different reasons, but they always eliminate an
inconvenience. They do everything from standing in for impossible-to-type
characters to marking the place where a file should be imported. You can
define entities of your own to stand in for recurring text such as a company
name or legal boilerplate. Entities can hold a single character, a string of
text, or even a chunk of XML markup. Without entities, XML would be much less
useful.
You could, for example, define an entity w3url to
represent the W3C's URL. Whenever you enter the entity in a document, it will
be replaced with the text http://www.w3.org/.
Figure 2-13 shows the different kinds of entities and their roles. The two
major entity types are parameter entities and general entities. Parameter
entities are used only in DTDs, so we'll describe them in Chapter 5,
"Document Models: A Higher Level of Control". In this section, we'll focus on
the other type, general entities. General entities
are placeholders for any content that occurs at the level of or inside the
root element of an XML document.
Figure 2.13. Taxonomy of entities
An entity consists of a name and a value. When an XML parser begins to
process a document, it first reads a series of declarations, some of which define entities by
associating a name with a value. The value is anything from a single character
to a file of XML markup. As the parser scans the XML document, it encounters
entity references, which are special markers derived
from entity names. For each entity reference, the parser consults a table in
memory for something with which to replace the marker. It replaces the entity
reference with the appropriate replacement text or markup, then resumes
parsing just before that point, so the new text is parsed too. Any entity
references inside the replacement text are also replaced; this process repeats
as many times as necessary.
Figure 2-14 shows that there are two kinds of syntax for entity references.
The first, consisting of an ampersand (&), the
entity name, and a semicolon (;), is for general
entities. The second, distinguished by a percent sign (%) instead of the ampersand, is for parameter entities.
Figure 2.14. Syntax for entity references
The following is an example of a document that declares three general
entities and references them in the text:
<!DOCTYPE message SYSTEM "/xmlstuff/dtds/message.dtd"
[
<!ENTITY client "Mr. Rufus Xavier Sasperilla">
<!ENTITY agent "Ms. Sally Tashuns">
<!ENTITY phone "<number>617-555-1299</number>">
]>
<message>
<opening>Dear &client;</opening>
<body>We have an exciting opportunity for you! A set of
ocean-front cliff dwellings in Pi&#241;ata, Mexico have been
renovated as time-share vacation homes. They're going fast! To
reserve a place for your holiday, call &agent; at &phone;.
Hurry, &client;. Time is running out!</body>
</message>
The entities &client;, &agent;, and &phone; are declared in the internal subset
of this document and referenced in the <message> element. A fourth entity, &#241;, is a numbered character entity that
represents the character ñ. This entity is referenced but not declared; no
declaration is necessary because numbered character entities are implicitly
defined in XML as references to Unicode characters, whatever encoding the document uses. (For
more information about character sets, see Chapter 7, "Internationalization".)
The XML parser simply replaces the entity with the correct character.
The previous example looks like this with all the entities resolved:
<!DOCTYPE message SYSTEM "/xmlstuff/dtds/message.dtd">
<message>
<opening>Dear Mr. Rufus Xavier Sasperilla</opening>
<body>We have an exciting opportunity for you! A set of
ocean-front cliff dwellings in Piñata, Mexico have been
renovated as time-share vacation homes. They're going fast! To
reserve a place for your holiday, call Ms. Sally Tashuns at
<number>617-555-1299</number>.
Hurry, Mr. Rufus Xavier Sasperilla. Time is running out!</body>
</message>
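You can watch a parser carry out this substitution. The sketch below feeds a shortened version of the message to Python's standard ElementTree parser. It is an illustration under two assumptions: the body text is abbreviated, and the reference to the external DTD is left out so that the sketch needs no other files.

```python
import xml.etree.ElementTree as ET

# Shortened version of the message; the external DTD reference is omitted
# so the sketch is self-contained.
DOCUMENT = """<!DOCTYPE message [
<!ENTITY client "Mr. Rufus Xavier Sasperilla">
<!ENTITY agent "Ms. Sally Tashuns">
<!ENTITY phone "<number>617-555-1299</number>">
]>
<message>
<opening>Dear &client;</opening>
<body>Call &agent; at &phone;. Hurry, &client;, to Pi&#241;ata!</body>
</message>"""

message = ET.fromstring(DOCUMENT)

# Text entities become ordinary character data.
opening = message.find("opening").text

# The markup inside the phone entity is parsed too, so <number> is a real element.
number = message.find("body/number").text

# All character data inside <body>, with every entity resolved.
body_text = "".join(message.find("body").itertext())

print(opening)
print(number)
print(body_text)
```

Notice that the application never sees the entity references themselves. By the time the tree is built, only the replacement text and markup remain.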
All entities (besides predefined ones) must be declared before they are
used in a document. Two acceptable places to declare them are in the internal
subset, which is ideal for local entities, and in an external DTD, which is
more suitable for entities shared between documents. If the parser runs across
an entity reference that hasn't been declared, either implicitly (a predefined
entity) or explicitly, it can't insert replacement text in the document
because it doesn't know what to replace the entity with. This error prevents
the document from being well-formed.
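A quick experiment shows the rule in action. This is a minimal sketch using Python's standard ElementTree parser; the helper name well_formedness_error and the <note> element are invented for the illustration.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(xml_text):
    """Return the parser's error message, or None if the text parses cleanly."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None

# Declared in the internal subset: the reference can be resolved.
declared = '<!DOCTYPE note [<!ENTITY agent "Ms. Sally Tashuns">]><note>&agent;</note>'

# Same reference with no declaration anywhere: the parser has nothing to insert.
undeclared = "<note>&agent;</note>"

# Predefined entities need no declaration.
predefined = "<note>&amp;</note>"

for label, text in [("declared", declared), ("undeclared", undeclared), ("predefined", predefined)]:
    print(label, "->", well_formedness_error(text))
```

Only the undeclared reference should produce an error message; the other two documents parse without complaint.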
Entities that contain a single character are called, naturally, character entities. These fall into several groups:
- Predefined character entities
Some characters cannot be used in the text of an XML document because
they conflict with the special markup delimiters. For example, angle
brackets (<>) are used to delimit element tags.
The XML specification provides the following predefined
character entities, so you can express these characters safely:
- Numbered character entities
XML supports Unicode, a huge character set with tens of thousands of
different symbols, letters, and ideograms. You should be able to use any
Unicode character in your document. The problem is how to enter a nonstandard
character from a keyboard with fewer than 100 keys, or how to represent one
in a text-only editor display. One solution is to use a numbered character entity, an entity whose name is of
the form #n, where n is a number that represents the character's
position in the Unicode character set.
The number in the name of the
entity can be expressed in decimal or hexadecimal format. For example, a
lowercase c with a cedilla (ç) has the Unicode code point 231. It can be
represented in decimal as &#231; or in
hexadecimal as &#xE7;. Note that the hexadecimal
version is distinguished with an x as the prefix to the number. Character
numbers can run as high as 1,114,111 (hexadecimal 10FFFF), although XML rules
out a few code points, including zero and most other control characters.
We'll discuss character sets and encodings in more detail in Chapter 7.
- Named character entities
The problem with numbered character entities is that they're hard to
remember: you need to consult a table every time you want to use a special
character. An easier way to remember them is to use mnemonic entity names.
These named character entities use easy-to-remember
names for references like &THORN;, which stands
for the Icelandic capital thorn character (Þ).
Unlike the predefined and numeric character entities, you do have to
declare named character entities. In fact, they are technically no different
from other general entities. Nevertheless, it's useful to make the
distinction, because large groups of such entities have been declared in DTD
modules that you can use in your document. An example is ISO-8879, a
standardized set of named character entities including Latin, Greek, Nordic,
and Cyrillic scripts, math symbols, and various other useful characters
found in European documents.
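The three groups of character entities can be compared side by side. The sketch below uses Python's standard ElementTree parser; the element name <t> and the helper text_of are invented for the illustration, and the THORN entity is declared by hand in the internal subset rather than pulled in from an ISO 8879 module.

```python
import xml.etree.ElementTree as ET

def text_of(xml_text):
    """Parse a one-element document and return its character data."""
    return ET.fromstring(xml_text).text

# Predefined character entities: no declaration needed.
predefined = text_of("<t>&lt; &gt; &amp; &apos; &quot;</t>")

# Numbered character entities: decimal and hexadecimal forms of the same character.
decimal = text_of("<t>&#231;</t>")
hexadecimal = text_of("<t>&#xE7;</t>")

# Named character entities must be declared before they can be used.
named = text_of('<!DOCTYPE t [<!ENTITY THORN "&#222;">]><t>&THORN;</t>')

print(predefined)
print(decimal == hexadecimal)
print(named)
```

The named entity is defined in terms of a numbered one, which is how the standard entity sets work as well: the mnemonic name is a layer of convenience over the code point.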
Entity values aren't limited to a single character, of course. The more
general mixed-content entities have values of
unlimited length and can include markup as well as text. These entities fall
into two categories: internal and external. For internal
entities, the replacement text is defined in the entity declaration; for
external entities, it is located in another file.
Internal mixed-content entities are most often used to stand in for
oft-repeated phrases, names, and boilerplate text. Not only is an entity
reference easier to type than a long piece of text, but it also improves
accuracy and maintainability, since you only have to change an entity once for
the effect to appear everywhere. The following example proves this point:
<!DOCTYPE press-release SYSTEM "http://www.dtdland.org/dtds/reports.dtd"
[
<!ENTITY bobco "Bob's Bolt Bazaar, Inc.">
]>
<title>&bobco; Earnings Report for Q3</title>
<par>The earnings report for &bobco; in fiscal
quarter Q3 is generally good. Sales of &bobco; bolts increased 35%
over this time a year ago.</par>
<par>&bobco; has been supplying high-quality bolts to contractors
for over a century, and &bobco; is recognized as a leader in the
construction-grade metal fastener industry.</par>
The entity &bobco; appears in the
document five times. If you want to change something about the company name,
you only have to enter the change in one place. For example, to make the name
appear inside a <companyname> element,
simply edit the entity declaration:
<!ENTITY bobco
"<companyname>Bob's Bolt Bazaar, Inc.</companyname>">
When you include markup in entity declarations, be sure not to use the
predefined character entities (e.g., &lt; and &gt;). The parser knows to read the markup as an entity
value because the value is quoted inside the entity declaration. Exceptions to
this are the quote-character entity &quot; and the single-quote character entity
&apos;. If they would conflict with the
entity declaration's value delimiters, then use the predefined entities, e.g.,
if your value is in double quotes and you want it to contain a double quote.
Entities can contain entity references, as long as the entities being
referenced have been declared previously. Be careful not to include references
to the entity being declared, or you'll create a circular pattern that may get
the parser stuck in a loop. Some parsers will catch the circular reference,
but it is an error.
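Here is a small experiment with nested and circular entity references, sketched with Python's standard ElementTree parser (which is built on Expat). The entity names signoff, chicken, and egg are invented for the illustration, and other parsers may word the error differently.

```python
import xml.etree.ElementTree as ET

# One entity refers to another: perfectly legal.
NESTED = """<!DOCTYPE doc [
<!ENTITY company "Bob's Bolt Bazaar, Inc.">
<!ENTITY signoff "Sincerely, &company;">
]>
<doc>&signoff;</doc>"""

# Two entities refer to each other: expansion could never finish.
CIRCULAR = """<!DOCTYPE doc [
<!ENTITY chicken "&egg;">
<!ENTITY egg "&chicken;">
]>
<doc>&chicken;</doc>"""

nested_text = ET.fromstring(NESTED).text

try:
    ET.fromstring(CIRCULAR)
    circular_error = None
except ET.ParseError as err:
    circular_error = str(err)

print(nested_text)
print(circular_error)
```

This parser notices the loop and reports an error instead of expanding forever, but it is safer not to count on every tool being so careful.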
Sometimes you may need to create an entity for such a large amount of mixed
content that it is impractical to fit it all inside the entity declaration. In
this case, you should use an external entity, an
entity whose replacement text exists in another file. External entities are
useful for importing content that is shared by many documents, or that changes
too frequently to be stored inside the document. They also make it possible to
split a large, monolithic document into smaller pieces that can be edited in
tandem and that take up less space in network transfers. Figure 2-15
illustrates how fragments of XML and text can be imported into a document.
Figure 2.15. Using external entities to import XML and
External entities effectively break a document into multiple physical
parts. However, all that matters to the XML processor is that the parts
assemble into a perfect whole. That is, all the parts in their different
locations must still conform to the well-formedness rules. The XML parser
stitches up all the pieces into one logical document; with the correct markup,
the physical divisions should be irrelevant to the meaning of the document.
External entities are a linking mechanism. They connect parts of a document
that may exist on other systems, far across the Internet. The difference from
traditional XML links (XLinks) is that for external entities, the XML
processor must insert the replacement text at the time of parsing. See Chapter
3, "Connecting Resources with Links" for other kinds of links.
External entities must always be declared, so the parser knows where to
find the replacement text. In the following example, a document declares the
three external entities &part1;, &part2;, and &part3; to hold its content:
The syntax just shown for declaring an external entity uses the keyword SYSTEM followed by a quoted string containing a filename.
This string is called a system identifier and is used
to identify a resource by location. The quoted string is actually a URL, so
you can include files from anywhere on the Internet. For example:
<!ENTITY catalog SYSTEM "http://www.bobsbolts.com/catalog.xml">
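To see what a parser does with a system identifier, consider the following sketch, which uses the Expat bindings in Python's standard library. It rests on several assumptions: the file names part1.xml and part2.xml are hypothetical, their contents are kept in a dictionary rather than on disk, and the <book> and <chapter> elements are invented for the illustration.

```python
from xml.parsers import expat

# Stand-ins for files on disk or on the network; the names are hypothetical.
FILES = {
    "part1.xml": b"<chapter>First part</chapter>",
    "part2.xml": b"<chapter>Second part</chapter>",
}

DOCUMENT = b"""<!DOCTYPE book [
<!ENTITY part1 SYSTEM "part1.xml">
<!ENTITY part2 SYSTEM "part2.xml">
]>
<book>&part1;&part2;</book>"""

def parse_with_external_entities(document):
    """Return (element names seen, system identifiers requested)."""
    elements, requested = [], []
    parser = expat.ParserCreate()

    def on_start(name, attributes):
        elements.append(name)

    def on_external_entity(context, base, system_id, public_id):
        # Expat only reports the system identifier; fetching the text is our job.
        requested.append(system_id)
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(FILES[system_id], True)
        return 1  # a nonzero value tells Expat the entity was handled

    parser.StartElementHandler = on_start
    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(document, True)
    return elements, requested

elements, requested = parse_with_external_entities(DOCUMENT)
print(elements)
print(requested)
```

The application sees one continuous stream of elements, even though the text came from three physical sources. In a real program the handler would open a file or URL instead of looking in a dictionary, and it should be cautious about which locations it is willing to fetch.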
The system identifier suffers from the same drawback as all URLs: if the
referenced item is moved, the link breaks. To avoid that problem, you can use
a public identifier in the entity declaration. In theory, a public identifier
will endure any location shuffling and still fetch the correct resource. For
<!ENTITY faraway PUBLIC "-//BOB//FILE Catalog//EN"
Of course, for this to work, the XML processor has to know how to use
public identifiers, and it must be able to find a catalog that maps them to
actual locations. In addition, there's no guarantee that the catalog is up to
date. A lot can go wrong. Perhaps for this reason, the public identifier must
be accompanied by a system identifier (here, "http://www.bobsbolts.com/catalog.xml"). If the XML
processor for some reason can't handle the public identifier, it falls back on
the system identifier. Most web browsers in use today can't deal with public
identifiers, so perhaps the backup is a good idea.
The last kind of entity discussed in this chapter is the unparsed entity. This kind of entity holds content that
should not be parsed because it contains something other than text and would
likely confuse the parser. Unparsed entities are used to import graphics,
sound files, and other non-character data.
The declaration for an unparsed entity looks similar to that of an external
entity, with some additional information at the end. For example:
<!DOCTYPE doc [
<!ENTITY mypic SYSTEM "photos/erik.gif" NDATA GIF>
<para>Here's a picture of me:</para>
This declaration differs from an external entity declaration in that there
is an NDATA keyword following the system path
information. This keyword tells the parser that the entity's content is in a
special format, or notation, other than the usual
parsed mixed content. The NDATA keyword is followed by
a notation identifier that specifies the data format.
In this case, the entity is a graphic file encoded in the GIF format, so the
word GIF is appropriate.
The notation identifier must be declared in a separate notation declaration, which is a complex affair discussed
in Chapter 5, "Document Models: A Higher Level of Control". GIF and other
notations are not built into XML, and an XML processor may not know what to do
with them. At the very least, the parser will not blindly load the entity's
content and attempt to parse it, which offers some protection from errors.
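The following sketch shows what a parser reports for an unparsed entity, using the Expat bindings in Python's standard library. The notation declaration and its system identifier "image/gif" are assumptions added so that the example is complete; they are not taken from this chapter.

```python
from xml.parsers import expat

# The NOTATION declaration and its identifier are assumptions for illustration.
DOCUMENT = b"""<!DOCTYPE doc [
<!NOTATION GIF SYSTEM "image/gif">
<!ENTITY mypic SYSTEM "photos/erik.gif" NDATA GIF>
]>
<doc><para>Here's a picture of me:</para></doc>"""

def declared_unparsed_entities(document):
    """Return (notations, unparsed entities) as reported by the parser."""
    notations, entities = [], []
    parser = expat.ParserCreate()

    def on_notation(name, base, system_id, public_id):
        notations.append((name, system_id))

    def on_unparsed_entity(name, base, system_id, public_id, notation):
        # The parser passes along the location and notation; it never opens the file.
        entities.append((name, system_id, notation))

    parser.NotationDeclHandler = on_notation
    parser.UnparsedEntityDeclHandler = on_unparsed_entity
    parser.Parse(document, True)
    return notations, entities

notations, entities = declared_unparsed_entities(DOCUMENT)
print(notations)
print(entities)
```

The file photos/erik.gif does not even have to exist for this to run, which illustrates the point: the parser only hands the name, location, and notation to the application and leaves the rest to it.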
Elements, attributes, namespaces, and entities are the most important
markup objects, but they are not the end of the story. Other markup objects
including comments, processing instructions, and CDATA sections shield content
from the parser in various ways, allowing you to include specialized
Comments are notes in the document that are not interpreted by the parser.
If you're working with other people on the same files, these messages can be
invaluable. They can be used to identify the purpose of files and sections to
help navigate a cluttered document, or simply to communicate with each other.
XML provides a special kind of markup for such notes, called a comment. The syntax for comments is shown in
Figure 2-17.
Figure 2.17. Syntax for comments
A comment starts with four characters: an open angle bracket, an
exclamation point, and two dashes (1). It ends with two dashes and a closing
angle bracket (3). In between these delimiters goes the content to be ignored
(2). The comment can contain almost any kind of text you want, including
spaces, newlines, and markup. However, since two dashes in a row (--) are used
to tell the parser when a comment begins and ends, they can't be placed anywhere
inside the comment. This means that instead of using dashes to create an
easily visible line, you should use another symbol like an equals sign (=) or
an underscore (_):
Good: <!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - -->
Bad: <!-- -- Don't do this! -- -->
Comments can go anywhere in your document except before the XML declaration
and inside tags. An XML parser ignores comments completely, so this piece of
XML:
<p>The quick brown fox jumped<!-- test -->over the lazy dog.
The quick brown <!-- test --> fox jumped over the lazy dog. The<!--
-->quick brown fox
jumped over the lazy dog.</p>
becomes this, after the parser has removed the comments:
<p>The quick brown fox jumpedover the lazy dog.
The quick brown fox jumped over the lazy dog. Thequick brown fox
jumped over the lazy dog.</p>
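The same behavior can be checked with a parser. This is a minimal sketch using the Expat bindings in Python's standard library; the helper names character_data and is_well_formed are invented for the illustration.

```python
from xml.parsers import expat

def character_data(xml_text):
    """Return all character data in the document, with comments dropped."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_text, True)
    return "".join(chunks)

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        expat.ParserCreate().Parse(xml_text, True)
    except expat.ExpatError:
        return False
    return True

text = character_data("<p>The quick brown fox jumped<!-- test -->over the lazy dog.</p>")
spaced_dashes_ok = is_well_formed("<p><!-- - - - - - - --></p>")
double_dash_ok = is_well_formed("<p><!-- -- Don't do this! -- --></p>")
inside_tag_ok = is_well_formed('<p <!-- not here --> id="x"/>')

print(text)
print(spaced_dashes_ok, double_dash_ok, inside_tag_ok)
```

The comment vanishes without leaving a space behind, so the words on either side run together. The double dash and the comment inside a tag are both rejected.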
Since comments can contain markup, they can be used to "turn off" parts of
a document. This is valuable when you want to remove a section temporarily,
keeping it in the file for later use. In this example, a region of code is
commented out:
<p>Our store is located at:</p>
<address>59 Sunspot Avenue</address>
<address>210 Blather Street</address>
When using this technique, be careful not to comment out any comments,
i.e., don't put comments inside comments. Since they contain double dashes in
their delimiters, the parser will complain when it gets to the inner
comment.
If you use markup characters frequently in your text, you may find it tedious
to use the predefined entities &lt;, &gt;, and &amp;. They require extra typing and are generally
hard to read in the markup. There's another way to type lots of forbidden
characters, however: the CDATA section.
CDATA is an acronym for "character data," which just means "not markup."
Essentially, you're telling the parser that this section of the document
contains no markup and should be treated as regular text. The only thing that
cannot go inside a CDATA section is the ending delimiter (]]>). For that, you have to resort to a predefined
entity and write it as ]]&gt;.
The CDATA section syntax is shown in Figure 2-18. A CDATA section begins
with the nine-character delimiter <![CDATA[ (1), and
it ends with the delimiter ]]> (3). The content of
the section (2) may contain markup characters (<,
>, and &) but they are
ignored by the XML processor.
Figure 2.18. CDATA section syntax
Here's an example of a CDATA section in action:
<para>Then you can say <![CDATA[if (&x < &y)]]> and be done
CDATA sections are most convenient when used over large areas, say the size
of a small computer program. If you use them a lot for small pieces of text,
your document will become hard to read, so you'd be better off using entity
references.
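A CDATA section and the equivalent entity references give an application exactly the same text, as the following sketch suggests. It uses Python's standard ElementTree parser, and the wording of the sample paragraphs is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The same sentence written two ways; the wording is illustrative only.
with_cdata = "<para>Then you can say <![CDATA[if (&x < &y)]]> and move on.</para>"
with_entities = "<para>Then you can say if (&amp;x &lt; &amp;y) and move on.</para>"

cdata_text = ET.fromstring(with_cdata).text
entity_text = ET.fromstring(with_entities).text

# Outside a CDATA section, the sequence ]]> has to be written with an entity.
escaped_text = ET.fromstring("<para>a ]]&gt; b</para>").text

print(cdata_text)
print(cdata_text == entity_text)
print(escaped_text)
```

Since the result is identical, the choice between the two forms is purely a matter of which one keeps the source document easier to read.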
Presentational information should be kept out of a document whenever
possible. Still, there may be times when you don't have any other option, for
example, if you need to store page numbers in the document to facilitate
generation of an index. This information applies only to a specific XML
processor and may be irrelevant or misleading to others. The prescription for
this kind of information is a processing instruction.
It is a container for data that is targeted toward a specific XML processor.
Processing instructions (PIs) contain two pieces of information: a target
keyword and some data. The parser passes processing instructions up to the
next level of processing. If the processing instruction handler recognizes the
target keyword, it may choose to use the data; otherwise, the data is
discarded. How the data will help processing is up to the developer.
Figure 2-19 shows the PI syntax. A PI starts with a two-character delimiter
(1) consisting of an open angle bracket and a question mark (<?), followed by a target (2),
an optional string of characters that is the data portion of the PI (3), and a
closing delimiter (4), consisting of a question mark and closing angle bracket
(?>).
Figure 2.19. Processing instruction syntax
"Funny," you say, "PIs look a lot like the XML declaration." You're right:
the XML declaration can be thought of as a processing instruction for all XML
processors that broadcasts general information about the document.
The target is a keyword that an XML processor uses to determine whether the
data is meant for it or not. The keyword doesn't have to mean anything in
particular; it need not be, say, the name of the software that will use it. More than one program can
use a PI, and a single program can accept multiple PIs. It's sort of like
posting a message on a wall saying, "The party has moved to the green house,"
and people interested in the party will follow the instructions, while those
who aren't will ignore them.
The PI can contain any data except the combination ?>, which would be interpreted as the closing delimiter.
Here are some examples of valid PIs:
<?flubber pg=9 recto?>
<?xyz stop: the presses?>
If there is no data string, the target keyword itself can function as the
data. A forced line break is a good example. Imagine that there is a long
section heading that extends off the page. Rather than relying on an automatic
formatter to break the title just anywhere, we want to force it to break in a
specific place.
Here is what a forced line break would look like:
<title>The Confabulation of Branklefitzers <?lb?>in a Portlebunky
Frammins <?lb?>Without Denaculization of <?lb?>Crunky Grabblefooties
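The following sketch shows how processing instructions reach an application, using the Expat bindings in Python's standard library. The wrapper element <doc> and the shortened title are assumptions made so that the example is a complete document.

```python
from xml.parsers import expat

# The <doc> wrapper and the shortened title are illustrative assumptions.
DOCUMENT = """<doc>
<?flubber pg=9 recto?>
<?xyz stop: the presses?>
<title>The Confabulation of Branklefitzers <?lb?>in a Portlebunky</title>
</doc>"""

def processing_instructions(xml_text):
    """Return a list of (target, data) pairs in document order."""
    found = []
    parser = expat.ParserCreate()

    def on_pi(target, data):
        # The parser does not interpret the data; it just passes it along.
        found.append((target, data))

    parser.ProcessingInstructionHandler = on_pi
    parser.Parse(xml_text, True)
    return found

pis = processing_instructions(DOCUMENT)
targets = [target for target, data in pis]
line_breaks = [target for target, data in pis if target == "lb"]

print(pis)
print(line_breaks)
```

A formatter that understands the lb target would act on it, while a program that cares only about flubber would skip it. The parser itself treats all three alike.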
XML gives you considerable power to choose your own element types and
invent your own grammars to create custom-made markup languages. But this
flexibility can be dangerous for XML parsers if they don't have some minimal
rules to protect them. A parser dedicated to a single markup language such as
an HTML browser can accept some sloppiness in markup, because the set of tags
is small and there isn't much complexity in a web page. Since XML processors
have to be prepared for any kind of markup language, a set of ground rules is
necessary.
These rules are very simple syntax constraints. All tags must use the
proper delimiters; an end tag must follow a start tag; elements can't overlap;
and so on. Documents that satisfy these rules are said to be well-formed. Some of these rules are listed here.
The first rule is that an element containing text or elements must have
start and end tags.
An empty element's tag must have a slash (/) before the end bracket.
All attribute values must be in quotes.
Elements may not overlap.
<a>A good <b>nesting</b>
<a>This is <b>a poor</a>
Isolated markup characters may not appear in parsed content. These include
<, ]]>, and &.
<equation>5 &lt; 2</equation>
<equation>5 < 2</equation>
A final rule stipulates that element names may start only with letters and
underscores, and may contain only letters, numbers, hyphens, periods, and
underscores. Colons are allowed for namespaces.
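These rules are easy to test, because a conforming parser refuses any document that breaks them. The sketch below runs one good and one bad sample for each rule through Python's standard ElementTree parser. The samples are invented for the illustration and are not the chapter's own examples.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

# (sample, expected verdict) pairs; the samples are illustrative.
SAMPLES = [
    ("<p>text</p>", True),                             # start and end tags
    ("<p>text", False),                                # missing end tag
    ("<br/>", True),                                   # empty element with slash
    ('<a href="x">link</a>', True),                    # quoted attribute value
    ("<a href=x>link</a>", False),                     # unquoted attribute value
    ("<a>A good <b>nesting</b> example</a>", True),    # properly nested
    ("<a>This is <b>a poor</a> one</b>", False),       # overlapping elements
    ("<equation>5 &lt; 2</equation>", True),           # escaped markup character
    ("<equation>5 < 2</equation>", False),             # isolated markup character
    ("<_ok-name.1/>", True),                           # legal element name
    ("<1abc/>", False),                                # name starts with a digit
]

results = [(text, is_well_formed(text)) for text, expected in SAMPLES]

for text, verdict in results:
    print("ok " if verdict else "bad", text)
```

A check like this says nothing about whether the document is valid against a DTD. It only confirms that the markup obeys the basic syntax rules.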
Why All the Rules?
Web developers who cut their teeth on HTML will notice that XML's syntax
rules are much more strict than HTML's. Why all the hassle about well-formed
documents? Can't we make parsers smart enough to figure it out on their own?
Let's look at the case for requiring end tags in every container element. In
HTML, end tags can sometimes be omitted, leaving it up to the browser to
decide where an element ends:
<p>This is a paragraph.
<p>This is also a paragraph.
This is acceptable in HTML because there is no ambiguity about the <p> element. HTML doesn't allow a <p> to reside inside another <p>, so it's clear that the two are siblings. All
HTML parsers have built-in knowledge of HTML, referred to as a grammar. In XML, where the grammar is not set in stone,
ambiguity can result:
<blurbo>This is one element.
<blurbo>This is another element.
Is the second <blurbo> a sibling or a
child of the first? You can't tell because you don't know anything about that
element's content model. XML doesn't require you to use a grammar-defining
DTD, so the parser can't know the answer either. Because XML parsers have to
work in the absence of grammar, we have to cut them some slack and follow the
rules.
These days, more and more software vendors are claiming that their products
are "XML-compliant." This sounds impressive, but is it really something to be
excited about? Certainly, well-formed XML guarantees some minimum standards
for data quality; however, that isn't the whole story. XML is not itself a
language, but a set of rules for designing markup languages. Therefore, until
you see what kind of language the vendors have created for their products, you
should greet such claims with cautious optimism.
The truth is, many XML-derived markup languages are atrocious. Often,
developers don't put much thought into the structure of the document data, and
their markup ends up looking like the same disorganized native data files with
different tags. A good markup language has a thoughtful design, makes good use
of containers and attributes, names objects clearly, and has a logical
Here's a case in point. A well-known desktop publishing program can output
its data as XML. However, it has a serious problem that limits its usefulness:
the hierarchical structure is very flat. There are no sections or divisions to
contain paragraphs and smaller sections; all paragraphs are on the same level,
and section heads are just glorified paragraphs. Compare that to an XML
language such as DocBook (see "XML Application: DocBook" later in this
chapter), which uses nested elements to represent relationships: that is, to
make it clear that regions of text are inside particular sections. This
information is important for setting up styles in stylesheets or doing
Another markup language is used for encoding marketing information for
electronic books. Its design flaw is an unnecessarily obscure and unhelpful
element-naming scheme. Elements used to hold information such as the ISBN or
the document title are named <A5>, <B2>, or <C1>. These names have nothing to do with the
purpose of the elements, whereas element names like <isbn> and <title> would have been easily
Elements are the first consideration for a good markup language. They can
supply a lot of information in different ways:
The name inside the start and end tags of an element distinguishes it
from other types and gives XML programs a handle for processing. These names
should be representations of the element's purpose in the document and
should be readable by humans as well as machines. Choose names that are as
descriptive and recognizable as possible, like <model> or <programlisting>. Follow the convention of
all-lowercase letters and avoid alternating cases (e.g., <OrderedList>), as people will forget when
to use which case. Resist the urge to use generic element types that could
hold almost anything. And anyone who chooses nonsensical names like <XjKnpl> or <J-9> should be taken outside and pelted
An element's content can include characters, elements, or a mixture of
both. Elements inside mixed content modify the character data (for example,
labeling a word for emphasis), and are called inline elements. Other elements are used to divide a
document into parts, and are often called components or blocks. In
character data, whitespace is usually significant, unlike in HTML and other
The position of an element inside another element is important. The order
of elements is always preserved, so a sequence of items such as a numbered
list can be expressed. Elements, often those without content, can be used to
mark a place in text; for example, to insert a graphic or footnote. Two
elements can mark a range of text when it would be inconvenient to span that
range with a single element.
The element's ancestors can contribute information as well. For example,
a <title> is formatted differently when
it is inside a <chapter>, <section>, or <table>, with different typefaces and
sizes. Stylesheets can use the information about ancestor elements to decide
how to process an element.
Elements can be categorized by their source or purpose using namespaces.
In XSLT, for example, the xsl namespace elements are
used to control the transformation process, while other elements are merely
data for producing the result tree. Some web browsers can handle documents
with multiple namespaces, such as Amaya's support of MathML equations
within HTML pages. In both cases, the namespace helps the XML processor
decide how to process the elements.
The second consideration for a good markup language is the use of
attributes. Use them sparingly, because they tend to clutter up markup--but do
use them when you need them. An attribute conveys specific information about
an element that helps specify its role in the document. It should not be used
to hold content. Sometimes, it's hard to decide between an attribute or a
child element. Here are some rough guidelines.
Use an element when:
The content is more than a few words long. Some XML parsers may have an
upper limit to how many characters an attribute can contain, and long
attribute values are hard to read.
Order matters. Attribute order in an element is ignored, but the order of
elements is significant.
The information is part of the content of the document, not just a
parameter to adjust the behavior of the element. In the case that an XML
processor cannot handle your document (perhaps if it does not support your
stylesheet completely), attributes are not displayed, while the contents of
an element are displayed as-is. If this happens, at least your document will
still be decipherable if you've used an element instead of an attribute.
Use an attribute when:
The information modifies the element in a subtle way that would affect
processing, but is not part of the content. For example, you may want to
specify a particular kind of bullet for a bulleted list:
You want to restrict the value. Using a DTD, you can ensure that an
attribute is a member of a set of predefined values.
The information is a unique identifier or a reference to an identifier in
another element. XML provides special mechanisms for testing identifiers in
attributes to ensure that links are not broken. See "Internal Linking with
ID and IDREF" in Chapter 3, "Connecting Resources with Links" for more on
this type of linking.
Processing instructions should be used as little as possible. They
generally hold noncontent information that doesn't pertain to any one element
and is used by a particular XML processor. For example, PIs can be used to
remember where to break a page for a printed copy, but would be useless for a
web version of the document. It's not a good idea for a markup language to
rely too heavily on PIs.
Doubtless you will run across good and bad examples of XML markup, but you
don't have to make the same mistakes yourself. Strive to put as much thought
as possible into your design.
An XML application is a markup language derived
from XML rules, not to be confused with XML software applications, called
XML processors in this book. An XML application is
often a standard in its own right, with a publicly available DTD. One such
application is DocBook, a markup language for technical documentation.
DocBook is a large markup language consisting of several hundred elements.
It was developed by a consortium of companies and organizations to handle a
wide variety of technical documentation tasks. DocBook is flexible enough to
encode everything from one-page manuals to multiple-volume sets of books.
Today, DocBook enjoys a large base of users, including open source developers
and publishers. Details about the DocBook standard can be found in Appendix B,
"A Taxonomy of Standards".
Example 2-4 is an instance of a DocBook document, in this case a product
instruction manual. (Actually, it uses a DTD called "Barebones DocBook," a
similar but much smaller version of DocBook described in Chapter 5, "Document
Models: A Higher Level of Control".) Throughout this example are numbered
markers corresponding to comments appearing at the end.
Example 2.4. A DocBook Document
<?xml version="1.0" encoding="utf-8"?> (1)
<!DOCTYPE book SYSTEM "/xmlstuff/dtds/barebonesdb.dtd" (2)
<!ENTITY companyname "Cybertronix">
<!ENTITY productname "Sonic Screwdriver 9000">
<title>&productname; User Manual</title> (4)
<!-- Note to author: maybe put a picture here? -->
The information in this manual is available in the following forms:
Instant telepathic injection
Ink on compressed, dead, arboreal matter
Cuneiform etched in clay tablets
The &productname; is sold in galactic pamphlet boutiques or wherever
&companyname; equipment can be purchased. For more information, or
to order a copy by hyperspacial courier, please visit our universe-wide
Web page at <systemitem (7)
While <emphasis>every</emphasis> (8)
effort has been taken to ensure the accuracy and
usefulness of this guide, we cannot be held responsible for the
occasional inaccuracy or typographical error.
<chapter id="intro"> (9)
Congratulations on your purchase of one of the most valuable tools in
the universe! The &companyname; &productname; is
equipment no hyperspace traveller should be without. Some of the
myriad tasks you can achieve with this device are:
Pick locks in seconds. Never be locked out of your tardis
again. Good for all makes and models including Yale, Dalek, and
Spot-weld metal, alloys, plastic, skin lesions, and virtually any
Rid your dwelling of vermin. Banish insects, rodents, and computer
viruses from your time machine or spaceship.
Slice and process foodstuffs from tomatoes to brine-worms. Unlike a
knife, there is no blade to go dull.
Here is what satisfied customers are saying about their &companyname;
Should we name the people who spoke these quotes? --Ed.
<quote>It helped me escape from the prison planet Garboplactor VI. I
wouldn't be alive today if it weren't for my Cybertronix 9000.</quote>
<quote>As a bartender, I have to mix martinis <emphasis>just
right</emphasis>. Some of my customers get pretty cranky if I slip
up. Luckily, my new sonic screwdriver from Cybertronix is so accurate,
it gets the mixture right every time. No more looking down the barrel
of a kill-o-zap gun for this bartender!</quote>
<title>Mastering the Controls</title>
<xref linkend="controls-diagram"/> is a diagram of the parts of your
<figure id="controls-diagram"> (11)
<title>Exploded Parts Diagram</title>
<xref linkend="controls-table"/> (12)
lists the function of the parts labeled in the diagram.
<table id="controls-table"> (13)
<entry>Decoy Power Switch</entry>
Looks just like an on-off toggle button, but only turns on a small
flashlight when pressed. Very handy when your &productname; is misplaced
and discovered by primitive aliens who might otherwise accidentally
<entry><emphasis>Real</emphasis> Power Switch</entry>
An invisible fingerprint-scanning capacitance-sensitive on/off switch.
<entry>The <quote>Z</quote> Twiddle Switch</entry>
We're not entirely sure what this does. Our lab testers have had
various results from teleportation to spontaneous
liquification. <emphasis role="bold">Use at your own risk!</emphasis>
A note to arthropods: Stop forcing your inflexible appendages to adopt
un-ergonomic positions. Our new claw-friendly control template is
<title>Why a decoy?</title>
Talk about the Earth's Tunguska Blast of 1908 here.
<title>The View Screen</title>
The view screen displays error messages and warnings, such as a
(low battery) message.<footnote> (15)
The advanced model now uses a direct psychic link to the user's
visual cortex, but it should appear approximately the same as the more
primitive liquid crystal display.
</footnote> When your &productname; starts up, it should
show a status display like this:
BATT: 1.782E8 V
TEMP: 284 K
FREQ: 9.32E3 Hz
Your &productname; is capable of generating tremendous amounts of
energy. For that reason, any old battery won't do. The power source is
a tiny nuclear reactor containing a piece of ultra-condensed plutonium
that provides up to 10 megawatts of power to your device. With a
half-life of over 20 years, it will be a long time before a
replacement is necessary.
Following are notes about Example 2-4:
1. The XML declaration states that this file contains an XML document
conforming to Version 1.0 of the XML specification, and that the UTF-8
character encoding should be used (see Chapter 7, "Internationalization", for
more about character sets). The standalone property is not
mentioned, so the default value of "no" will be used.
2. This document type declaration does three things. First, it tells us
that <book> will be the root element. Second, it associates a DTD with the
document, specifying the location /xmlstuff/dtds/barebonesdb.dtd. Third, it
declares two general entities in the document's internal subset of
declarations. These entities will be used throughout the document wherever
the company name or product name appears. If in the future the product's name
is changed or the company is bought out, the author needs only to update the
values in the entity declarations.
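To see why declaring a name once pays off, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. It parses a cut-down version of Example 2-4 and reads back the title after the parser has expanded the entity. The function name render_title is hypothetical, and the external DTD reference is left out on purpose, so the sketch relies only on the internal subset.

```python
import xml.etree.ElementTree as ET

def render_title(product_name: str) -> str:
    """Parse a tiny document and return its title with entities expanded."""
    # Only the entity declaration changes; the body of the document does not.
    doc = f"""<!DOCTYPE book [
  <!ENTITY companyname "Cybertronix">
  <!ENTITY productname "{product_name}">
]>
<book><title>&productname; User Manual</title></book>"""
    root = ET.fromstring(doc)
    # The parser has already replaced &productname; with its value.
    return root.findtext("title")

print(render_title("Sonic Screwdriver 9000"))
print(render_title("Sonic Screwdriver 9500"))
```

Changing the one declaration changes every place the entity is referenced, which is the maintenance benefit the note describes.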
3. The <book> element is the
document root, the element that contains all the content. It begins a
hierarchy that includes a <preface> and
<chapter>, followed by some sections
labeled <sect1>, then <sect2>, and so on, down to the level of
paragraphs and lists. Only two <chapter>s are shown in the example, but in
a real document they would be followed by additional chapters, each with its
own sections and paragraphs, etc.
4. Notice that all the major components (preface, chapter, sections)
start with a <title> element. This is
an example of how an element can be used in different contexts. In a
formatted copy of this document, the titles in different levels will be
rendered differently, some large and others small. A stylesheet will use the
hierarchical information (i.e., what is the ancestor of this <title>) to determine how to format it.
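A rough illustration of that idea, assuming a made-up mapping from parent element to type size, might look like the following Python sketch. The names SIZES and title_styles are hypothetical, and a real stylesheet language would express the same rule declaratively rather than in a loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical style rule: the size of a <title> depends on its parent.
SIZES = {"book": "huge", "chapter": "large", "sect1": "medium"}

def title_styles(root):
    """Return (title text, size) pairs in document order."""
    styled = []
    for parent in root.iter():          # visit every element, root included
        for child in parent:            # look only at direct children
            if child.tag == "title":
                styled.append((child.text, SIZES.get(parent.tag, "small")))
    return styled

doc = ET.fromstring(
    "<book><title>User Manual</title>"
    "<chapter><title>Introduction</title>"
    "<sect1><title>Mastering the Controls</title></sect1>"
    "</chapter></book>"
)
for text, size in title_styles(doc):
    print(size, text)
```

The same element name yields three different treatments because the rule looks at where the title sits, not at the title itself.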
5. A <para> is an example of a block
element, which means that it starts on a new line and contains a mixture of
character data and elements that are bound in a rectangular region.
6. This element begins a bulleted list of items. If this were a numbered
list (for instance, <orderedlist>
instead of <itemizedlist>), we would
not have to insert the numbers as content. The XML formatter would do that
for us, simultaneously preserving the order of <listitem>s and automatically generating
numbers according to the stylesheet's settings. This is another example of
an element (<listitem>) that is treated
differently based on which element it appears in.
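The following Python sketch shows, under simplified assumptions, how a formatter could generate the labels itself. The function format_list is hypothetical, and the choice of "1." and "*" as labels stands in for whatever the stylesheet would actually specify.

```python
import xml.etree.ElementTree as ET

def format_list(xml_text: str) -> list:
    """Render a DocBook-style list as plain text lines."""
    lst = ET.fromstring(xml_text)
    lines = []
    for number, item in enumerate(lst.findall("listitem"), start=1):
        # Gather the item's text and collapse runs of whitespace.
        text = " ".join("".join(item.itertext()).split())
        # The label depends on the containing element, not on the item.
        label = f"{number}." if lst.tag == "orderedlist" else "*"
        lines.append(f"{label} {text}")
    return lines

items = (
    "<listitem><para>Pick locks in seconds.</para></listitem>"
    "<listitem><para>Rid your dwelling of vermin.</para></listitem>"
)
print(format_list(f"<orderedlist>{items}</orderedlist>"))
print(format_list(f"<itemizedlist>{items}</itemizedlist>"))
```

The list items are identical in both calls; only the enclosing element changes, and with it the generated labels.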
7. This <systemitem> element is an
example of an inline element that modifies text within the flow. In this
case, it labels its contents as a URL to a resource on the Internet. The XML
processor can use this information both to apply style (make it appear
different from surrounding text) and, in certain media such as a
computer display, to turn it into a link that the user can click to view the
resource.
8. Here's another inline element, this time encoding its contents as text
requiring emphasis, perhaps turning it bold or italic.
9. The <chapter> element has an ID attribute because we may want
to add a cross-reference to it somewhere in the text. A cross-reference is an
empty element like this:
<xref linkend="idref"/>
where idref is the value of the referenced
element's ID. In this case, it
might be <xref linkend="intro"/>.
When the document is formatted, this cross-reference element is replaced
with text such as "Chapter 1, 'Introduction'".
10. This block element contains a comment meant as a note to someone on
the editorial team. It will be formatted so it stands out, perhaps appearing
in a lighter shade. When the book goes to press, a different stylesheet will
be used that prevents these <comment>
elements from being printed.
11. This <figure> element contains a
graphic and its caption. The <graphic>
element is a link (see Chapter 3, "Connecting Resources with Links") to a
graphic file, which the XML processor will have to import for
12. Here's an example of a cross-reference in action. It references a <table> element (the linkend attribute and the <table>'s ID attribute are the same). This is an ID-IDREF
link, which is described in Chapter 3, "Connecting Resources with Links". The
formatter will replace the <xref>
element with text such as "Table 2-1". Now, if you read the sentence again
and substitute that text for the cross-reference element, it makes sense,
right? One reason to use a cross-reference element like this instead of just
writing "Table 2-1" is that if the table is moved to another chapter, the
formatter will update the text automatically.
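As a sketch of how a formatter might resolve such references, the Python code below walks a small document, numbers the possible targets, and looks up each linkend. The names LABELS and resolve_xrefs are hypothetical, and the simple per-document counter is an assumption; a real formatter would likely number by chapter to produce text like "Table 2-1".

```python
import xml.etree.ElementTree as ET

# Hypothetical label text for each kind of cross-reference target.
LABELS = {"chapter": "Chapter", "figure": "Figure", "table": "Table"}

def resolve_xrefs(root):
    """Return the generated text for every <xref>, in document order."""
    counters = {}  # how many elements of each kind have been seen
    targets = {}   # ID value -> generated label
    for elem in root.iter():
        if elem.tag in LABELS:
            counters[elem.tag] = counters.get(elem.tag, 0) + 1
            if "id" in elem.attrib:
                label = f"{LABELS[elem.tag]} {counters[elem.tag]}"
                targets[elem.get("id")] = label
    # Each linkend must match some element's ID (a missing one raises KeyError).
    return [targets[x.get("linkend")] for x in root.iter("xref")]

doc = ET.fromstring(
    '<book><chapter id="intro">'
    '<para><xref linkend="controls-table"/> lists the parts.</para>'
    '<figure id="controls-diagram"/>'
    '<table id="controls-table"/>'
    '</chapter></book>'
)
print(resolve_xrefs(doc))
```

Because the label is computed from the target's position, inserting another table ahead of this one changes the generated text without anyone editing the sentence that contains the cross-reference.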
13. This is how a table with eight rows and two columns would be
marked up in DocBook. The first row, appearing in a <thead>, is the head of the table.
14. The <errorcode> element is an
inline tag, but in this case does not denote special formatting (although we
can choose to format it differently if we want to). Instead, it labels a
specific kind of item: an error code used in a computer program. DocBook is
full of elements for special computer terms: for example, <filename>, <function>, and <guimenuitem>, which are used as inline elements.
We want to mark up these items in detail because there is a strong
possibility someone might want to search the book for a particular kind of
item. You can always plug a keyword into a search engine and it will fetch
the matches for you, but if you can constrain the search to the content of
<errorcode> elements, you are much more
likely to receive only a relevant match, rather than a homonym in the wrong
context. For example, the keyword string occurs in
many programming languages, and can be anything from part of a method name
to a data type. Searching an entire book on Java for it would return
hundreds of matches, so to narrow your search you could specify
that the term must be contained within a certain element, such as <type>.
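Here is a small Python sketch of the difference between a plain keyword search and one constrained to an element. The function find_in_element and the sample paragraph are invented for illustration; only the element names come from the discussion above.

```python
import xml.etree.ElementTree as ET

def find_in_element(root, tag, keyword):
    """Return the elements named `tag` whose text contains `keyword`."""
    return [
        elem for elem in root.iter(tag)
        if keyword in "".join(elem.itertext())
    ]

doc = ET.fromstring(
    "<para>A <type>string</type> holds text; see the substring method "
    "and error <errorcode>E42</errorcode>.</para>"
)

# Plain full-text search: counts every occurrence, relevant or not.
full_text = "".join(doc.itertext())
print(full_text.count("string"))

# Constrained search: only occurrences marked up as a data type.
print(len(find_in_element(doc, "type", "string")))
```

The full-text search also counts the hit inside "substring", while the constrained search returns only the occurrence that the author marked up as a type.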
15. Here, we've inserted a footnote. The <footnote> element acts as both a container
of text and a marker, labeling a specific point for special processing. When
the document is formatted, that point becomes the location of a footnote
symbol such as an asterisk (*). The contents of the footnote are moved
somewhere else, probably to the bottom of the page.
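The double role of the footnote element can be sketched in Python as follows. The function extract_footnotes is hypothetical, and the bracketed number used as the footnote symbol is an assumption standing in for whatever mark the stylesheet would choose.

```python
import xml.etree.ElementTree as ET

def extract_footnotes(para):
    """Return (body text with markers, list of footnote texts)."""
    notes = []
    parts = [para.text or ""]
    for child in para:
        if child.tag == "footnote":
            # Container role: lift the contents out of the flow.
            notes.append(" ".join("".join(child.itertext()).split()))
            # Marker role: leave a symbol at the original position.
            parts.append(f"[{len(notes)}]")
        else:
            parts.append("".join(child.itertext()))
        # Text that follows the child element stays in the body.
        parts.append(child.tail or "")
    return "".join(parts), notes

para = ET.fromstring(
    "<para>A (low battery) message.<footnote><para>The advanced model "
    "uses a psychic link.</para></footnote> The display shows status.</para>"
)
body, notes = extract_footnotes(para)
print(body)
print(notes)
```

The body keeps only a marker where the footnote stood, and the collected notes can then be placed wherever the page layout calls for them.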
16. A <screen> is defined to
preserve all whitespace (spaces, tabs, newlines), since computer programs
often contain extra space to make them more readable. XML preserves
whitespace in any element unless told not to. DocBook tells XML processors
to disregard extra space in all but a few elements, so when the document is
formatted, paragraphs lose extra spaces and justify correctly, while screens
and program listings retain their extra spaces.
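A formatter's whitespace policy along these lines could be sketched in Python like this. The set PRESERVE and the function render are hypothetical; in practice the list of space-preserving elements comes from the DTD or the stylesheet, not from a hard-coded set.

```python
import xml.etree.ElementTree as ET

# Assumed list of elements whose whitespace must survive formatting.
PRESERVE = {"screen", "programlisting"}

def render(elem):
    """Return the element's text, collapsing whitespace unless preserved."""
    text = "".join(elem.itertext())
    if elem.tag in PRESERVE:
        return text                  # keep spaces, tabs, and newlines
    return " ".join(text.split())    # collapse runs of whitespace

para = ET.fromstring("<para>Your   device\n   starts up.</para>")
screen = ET.fromstring("<screen>BATT: 1.782E8 V\nTEMP:   284 K</screen>")

print(render(para))
print(render(screen))
```

The paragraph comes out as a single evenly spaced line, while the screen keeps its line break and its column alignment.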
That's a quick snapshot of DocBook in action. For more information about
this popular XML application, check out the description in Appendix B, "A
Taxonomy of Standards".