In Depth
A Document Type Definition (DTD) is the foundation from which
XML documents are created. You can create your own DTDs or use predefined DTDs
from a wide variety of industries, such as investing, software development, and
scientific research. Whether you create your own DTD or use a predefined DTD,
you need to know what a DTD is, how it’s constructed, how it works, and how it
drives the development of XML documents.
NOTE: DTDs are not required. You can build XML documents that don’t have DTDs
and still use them in many XML applications. We nevertheless recommend that you
build DTDs for most documents. If the XML processor uses a validating parser,
you must supply a DTD for your document.
Creating a DTD is one of the first steps in structuring your XML
documents properly. DTDs provide the XML parsers with information, and they
provide you with a guide to building standardized XML documents. In this
chapter, we show you how to read, understand, and create DTDs. We also give you
criteria to determine when an external DTD is needed and when to use an
internal DTD. In addition, we explore why you might want to use previously
defined DTDs and let you know when you should consider building your own DTD.
NOTE: The W3C is working on replacing DTDs with XML Schema. See the W3C site at
www.w3.org/XML/Schema.html for more information.
What Is a DTD?
A DTD defines what markup components you can use to describe the
different parts of a document. It includes a set of element and attribute
declarations and the entities, notations, and comments you want to use to
describe your data. In addition, the DTD outlines how you can and cannot use
these components, what kind of content they can contain, and whether they are
required or optional pieces of the document. Basically, a DTD is a set of rules
that you use to guide your production of XML documents. Usually, a DTD has been
well thought out and tested to be sure it can be used to accurately describe a
particular kind of content, such as a stock ticker entry or a mathematical
equation.
The easiest way to explain what a full DTD looks like and how it
works is to quickly examine one. The following is an example of an internal DTD
for a relatively simple document and is meant to demonstrate how a DTD looks and
operates:
<!DOCTYPE doc [
<!ELEMENT doc (subject, date, address, memo)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT memo (#PCDATA)>
<!ENTITY publisher "The Coriolis Group">
]>
The DTD is contained within the <!DOCTYPE doc [ ]> marker, which specifies that
the doc element is the document element. A document element is simply the XML
element that contains all the other elements and content in a document. Inside
the DTD are five element declarations, each contained within an <!ELEMENT>
marker, and one entity declaration, contained within an <!ENTITY> marker.
All of these declarations and markers are described in detail later in the
chapter. Before we delve into the different components of the DTD, let’s look
at Listing 5.1 to see how the DTD is actually used to drive the structure of
information in a document.
Listing 5.1 Using a DTD to drive the structure of
information in a document.
<?xml version="1.0" encoding="UTF-8"
standalone="no"?>
<!DOCTYPE doc [
<!ELEMENT doc (subject, date, address, memo)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT memo (#PCDATA)>
<!ENTITY publisher "The Coriolis Group">
]>
<doc>
<subject>XML Black Book Memo</subject>
<date>July 5, 2000</date>
<address>
14455 N. Hayden Road,
Suite 220, Scottsdale, AZ 85260
</address>
<memo>
This memo is to alert
you that the new version
of the XML Black Book
has now been printed. Published
by &publisher;,
this book outlines everything you
need to know about XML.
</memo>
</doc>
The constructs in the DTD are reflected in the document itself.
Of course, knowing what information you should include in each construct (such
as including a valid date in the date element)
requires some knowledge of the purpose and design goals of the DTD.
Understanding how to use a DTD to guide document development and using the DTD
correctly are two different things. Currently, we’re only concerned with how to
interpret a DTD and use it to describe data.
Let’s take a closer look at our example DTD to get a feeling for
what each line of code does in terms of defining the document. Here’s a quick
rundown of the most important components of this document:
• <?xml version="1.0" encoding="UTF-8"
standalone="no"?>—The XML declaration specifies that the
document is an XML document. It’s the first line of instructions that is sent
to the parser. For more specific information about the XML declaration, see
“The XML Declaration” section later in this chapter.
• <!DOCTYPE doc [—The DOCTYPE declaration indicates that the DTD
components are about to be defined. The components can then be defined within
the declaration, called an internal DTD,
as in our example DTD. Or, the declaration can point to a set of components
contained in an external file, called an external
DTD. In some cases, a DOCTYPE declaration may include both internal and
external DTD components. Later sections in the chapter include more information
on internal and external DTDs. The declaration also specifies the doc element as the document element.
• <!ELEMENT doc (subject, date, address, memo)>—Defines the
list of elements for the root doc
element. This element declaration tells the parser that the root element doc must contain the child elements subject, date, address, and memo and that they must appear in this
order within the document. If they don’t, any tool that attempts to process the
document and validate it against the DTD will return an error.
• <!ELEMENT subject (#PCDATA)>—Defines the element subject and specifies that the element
will contain parsed character data: text that the parser scans for markup, such
as entity references, rather than passing it through untouched. Chapter 6
discusses the kind of content
an element can contain in more detail.
• <!ELEMENT date (#PCDATA)>—Defines the element date and specifies that the element
will contain parsed character data.
• <!ELEMENT address (#PCDATA)>—Defines the element address and specifies that the element
will contain parsed character data.
• <!ELEMENT memo (#PCDATA)>—Defines the element memo and specifies that the element
will contain parsed character data.
• <!ENTITY publisher "The Coriolis Group">—Defines
the parsed entity, publisher, and
specifies that the value for this entity is The Coriolis Group. When a processor encounters the entity
reference in a document, it replaces the entity name (&publisher;) with the entity value, The Coriolis Group.
• ]>—Indicates the end of the DTD.
• <doc>…</doc>—Contains the XML document that is driven
by the DTD. See Table 5.1 for a rundown of how the document content reflects
the DTD components.
Table 5.1 How DTD components are reflected in a document.

DTD Component                             Document Component           Description
<!ELEMENT doc (subject, date,             <doc>…</doc>                 Element named doc that must contain
  address, memo)>                                                      one instance each of the subject,
                                                                       date, address, and memo elements
<!ELEMENT subject (#PCDATA)>              <subject>Today’s Memo        subject element that contains text
                                          </subject>
<!ELEMENT date (#PCDATA)>                 <date>July 5, 2000</date>    date element that contains text
<!ELEMENT address (#PCDATA)>              <address>14455 N. Hayden     address element that contains text
                                          Road, Suite 220,
                                          Scottsdale, AZ 85260
                                          </address>
<!ELEMENT memo (#PCDATA)>                 <memo>…</memo>               memo element that contains text
<!ENTITY publisher "The Coriolis Group">  &publisher;                  Parsed entity that equates the code
                                                                       &publisher; in a document with the
                                                                       phrase The Coriolis Group
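To see what the ordering rule in the doc element declaration means in practice, consider the following sketch of a document that is well formed but not valid. It reuses the element declarations from Listing 5.1; the deliberately misordered content is our own illustration, and the exact wording of the resulting error depends on the validating parser you use.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE doc [
<!ELEMENT doc (subject, date, address, memo)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT memo (#PCDATA)>
]>
<doc>
  <!-- Not valid: date comes before subject, and address is missing -->
  <date>July 5, 2000</date>
  <subject>XML Black Book Memo</subject>
  <memo>The new version has now been printed.</memo>
</doc>
```

A non-validating parser accepts this document because every tag is properly nested and closed. A validating parser rejects it because the children of doc don’t match the declared sequence.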
A DTD drives the way you build XML documents. It’s as simple as
that. The actual components of a DTD may exist directly inside a document or
may be included in a reference file linked to the document. Regardless of the
physical location of the DTD components, a DTD is considered part of an XML
document. The DTD defines the parameters of the document, and the document
holds data in a way that conforms to those parameters. To work with DTDs, you
need to be familiar with their terminology and constructs. The next several
sections of the chapter review the terminology and concepts associated with
DTDs, and include cross-references to those chapters that cover the full
functionality of these components in all realms of XML.
Declarations
A declaration defines
the specifics about the elements, attributes, entities, and notations you want
to use to describe content in your document, as well as any special
instructions you might want to pass to the XML processor. A DTD is composed
entirely of declarations. When you learn to build and read DTDs, you’re really
learning to build and read declarations.
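As a rough illustration of the four kinds of declarations named above, the following sketch places one or two of each in a small internal DTD. All of the names used here (memo, logo, priority, src, gif, coriolis-logo, and the file logo.gif) are hypothetical and chosen only for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE memo [
<!-- Element declarations: memo mixes text with optional logo elements -->
<!ELEMENT memo (#PCDATA | logo)*>
<!ELEMENT logo EMPTY>
<!-- Attribute declarations: an enumerated attribute with a default value,
     and a required attribute that names an unparsed entity -->
<!ATTLIST memo priority (low | normal | high) "normal">
<!ATTLIST logo src ENTITY #REQUIRED>
<!-- Notation declaration: identifies the format of non-XML data -->
<!NOTATION gif SYSTEM "image/gif">
<!-- Entity declarations: one parsed internal entity, one unparsed external entity -->
<!ENTITY publisher "The Coriolis Group">
<!ENTITY coriolis-logo SYSTEM "logo.gif" NDATA gif>
]>
<memo priority="high">Published by &publisher;. <logo src="coriolis-logo"/></memo>
```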
The first two lines of most XML documents contain two
declarations: the XML declaration and the document type declaration (DOCTYPE
declaration). Unlike other declarations (element, attribute, entity, and
notation), the XML declaration and the DOCTYPE declaration don’t construct the
document. They don’t explain the structural role of any particular element or
attribute. Instead, they tell a processor what standards to use, what type of
document is being processed, and where the DTD that actually drives the
document is stored. Looking once again at the first two lines of code in the
sample document from the previous section, you see the XML declaration and the
start of the DOCTYPE declaration:
<?xml version="1.0" encoding="UTF-8"
standalone="no"?>
<!DOCTYPE doc [
NOTE: There’s a difference between the terms DTD (Document Type Definition) and
DOCTYPE declaration (document type declaration). The DTD is the set of
declarations that guide a document’s development. The DOCTYPE declaration
either holds those declarations itself or points to their location in an
external file, so it connects the DTD to the document. Think of it as the glue
that binds the document to the definition that explains the document.
The XML Declaration
The first line in the example document (Listing 5.1) is called the XML
declaration, and it tells the processor to use version 1.0 of the XML
specification to process the document as an XML document. At the time of
writing, version 1.0 is the only version of the specification; as new versions
become available, the version number lets a processor determine whether it can
process the document. The second part of the XML declaration specifies the
character encoding used for the document. In this example, we use UTF-8, an
8-bit encoding of Unicode and the encoding a parser assumes when none is
declared and the file carries no byte order mark. We could also specify a
16-bit encoding such as UTF-16 or, depending on the parser, a variety of other
character-encoding schemes. The “Parts of a DTD” section later in this chapter
provides further insight into the issues of character encoding, and Chapter 9
discusses the topic in even more detail.
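As a small sketch of declaring a different encoding, the document below announces ISO-8859-1 (Latin-1). It assumes that your parser supports that encoding, which is common but not guaranteed, because processors are only required to read UTF-8 and UTF-16. The fax element is a hypothetical name.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<fax>Café menu attached.</fax>
```

The declaration only describes the file; it doesn’t convert it. The file must actually be saved in ISO-8859-1, or the accented character will be misread or rejected.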
Related solution: Declaring a Text-Encoding Scheme for Entities
The final piece of information that we include in our XML declaration is called
the standalone document declaration. It states whether an XML document stands
alone or whether it relies on markup declarations stored outside the document.
If the value is set to "yes", the document is self-sufficient: nothing declared
in an external DTD changes the content that the processor passes on to an
application. In other words, a standalone XML document contains all the
pertinent information within itself.
A value of "yes" therefore tells an XML processor that it can process the
document correctly without reading external declarations (such as DTD
information included in an external file). A value of "no" tells the processor
that external declarations may affect the document and should be processed.
Setting the value to "no" doesn’t mean that you must include external
references, only that the processor should accept and process any that are
noted in the document. If you omit the standalone declaration and the document
has external declarations, the processor assumes a value of "no".
When do you set the standalone declaration to "no" and when do you set it to
"yes"? As we’ve mentioned, if your document relies on an external DTD, for
example for entity declarations or default attribute values, you need to set
the value to "no". You can set the value to "yes" if you aren’t using external
references at all and the only entities you use are declared inside the
document or are the predefined entities that are part of the XML language: the
ampersand, greater-than and less-than symbols, apostrophe, and quotation mark.
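The following sketch shows a document that can safely declare itself standalone: its only declarations sit in the internal DTD, and the entities it references are either declared there or predefined by XML. The note element is a hypothetical name used for illustration.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE note [
<!ELEMENT note (#PCDATA)>
<!ENTITY publisher "The Coriolis Group">
]>
<note>Published by &publisher;. Predefined entities such as &amp; and &lt; need no declaration.</note>
```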
The DOCTYPE Declaration
The second line of code is called the DOCTYPE declaration, and it’s used to
associate the XML document with its corresponding DTD. Following <!DOCTYPE is
the name of the document element. In the case of an internal DTD, the list of
declarations that make up the DTD follows the document element name in the
declaration and is contained within brackets ([ ]). If you’re referencing an
external DTD, you simply identify the location of the DTD in the declaration
immediately following the document element name. If you’re combining internal
and external DTDs, you first point to the external DTD and then use brackets to
enclose the internal declarations. The next section, “Storing DTDs,” discusses
the ins and outs of DTD location and references in detail.
NOTE: If you’re going to include a DTD in your document, place the DOCTYPE
declaration after the XML declaration but before the first element (the
document element). If you place it anywhere else in the document, an XML
processor will reject the document with an error. For more information on
DTD-less documents, see the section “To DTD or Not to DTD” later in the
chapter.
Storing DTDs
As we mentioned earlier in the chapter, the declarations that make up a DTD can
be stored within the document (internally) or in a separate file (externally).
An internal DTD places all the declarations that make up a document’s DTD
inside the document itself. Internal DTDs are specified with the following code
within the DOCTYPE declaration:
<!DOCTYPE document_element
[ DTD goes here]>
In the example we’ve worked with so far in the chapter, the DTD is stored
internally:
<?xml version="1.0" encoding="UTF-8"
standalone="no" ?>
<!DOCTYPE doc [
<!ELEMENT doc (subject, date, address, memo)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT memo (#PCDATA)>
<!ENTITY publisher "The Coriolis Group">
]>
External DTDs are also referenced within the DOCTYPE declaration, but the
actual declarations that make up the DTD are stored in an external file. The
name of the external DTD doesn’t have to correspond in any way to the name of
the XML document, and by convention it carries the three-letter extension .dtd.
A descriptive name still helps: if your XML document is named bizmemo.xml, you
might want to name the DTD memo.dtd, instead of something like xyz.dtd.
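As a sketch of how the two files might look, assume that memo.dtd and bizmemo.xml sit in the same directory and that memo.dtd holds the declarations from the memo example. The external file contains only the declarations, with no DOCTYPE wrapper around them:

```xml
<!-- memo.dtd: declarations only -->
<!ELEMENT doc (subject, date, address, memo)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT memo (#PCDATA)>
<!ENTITY publisher "The Coriolis Group">
```

The document then points to that file from its DOCTYPE declaration:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE doc SYSTEM "memo.dtd">
<doc>
  <subject>XML Black Book Memo</subject>
  <date>July 5, 2000</date>
  <address>14455 N. Hayden Road, Suite 220, Scottsdale, AZ 85260</address>
  <memo>Published by &publisher;.</memo>
</doc>
```

The standalone value is "no" here because the publisher entity is declared outside the document.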
Storing your DTDs externally is beneficial because a single DTD
can be used and reused with literally thousands of documents. This is one of
XML’s most powerful features. Having the ability to create a single DTD to
encompass a variety of documents that are based on a common structure makes it
possible for you to create standardized ways to store and describe information
for presentation across a variety of devices.
Public and System DTDs
Within a DOCTYPE declaration, you can identify an external DTD
reference as a public DTD or a system DTD. You can use publicly
available DTDs that have already been defined for a particular need, or you can
use your own locally developed DTD. When you use a publicly available DTD, you
need to use the keyword PUBLIC
within the DOCTYPE declaration when you specify the DTD. When you use your own
DTD, you need to use the keyword SYSTEM.
Here’s an example of a DOCTYPE declaration that references the public XHTML DTD
(XHTML is a reformulation of HTML under the umbrella of XML; Chapter 18 is
devoted to the subject):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
The public identifier structure used by XML for publicly available DTDs is the
same structure used for Standard Generalized Markup Language (SGML) public
identifiers. If the specified entity or DTD is an International Organization
for Standardization (ISO) standard, the identifier starts with ISO. If it’s not
an ISO standard, but the standard is officially approved by a standards body,
the identifier starts with a plus sign (+); if it’s not officially approved by
a standards body, the identifier starts with a minus sign (-). Following that
prefix are two forward slashes (//) and then the owner of the DTD. If we
dissect the DOCTYPE declaration in the preceding example, we find from the
leading minus sign that the DTD is not an officially approved standard and that
the W3C owns it. The next field, DTD XHTML 1.0 Strict, names the DTD, and the
final field, EN, indicates that the DTD is written in English. The quoted
string that follows the public identifier is the system identifier, a URI from
which the processor can retrieve the DTD, in this case on the W3C site at
www.w3.org.
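The breakdown can be sketched field by field in a comment, as shown below. The system identifier used here is the one published in the XHTML 1.0 Recommendation, and the minimal html document that follows the comment is our own filler so that the example forms a complete document.

```xml
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
  Fields of the public identifier, separated by //:
    "-"                     not an officially approved standard
    "W3C"                   owner of the DTD
    "DTD XHTML 1.0 Strict"  name of the DTD
    "EN"                    language the DTD is written in (English)
-->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Public identifier example</title></head>
  <body><p>Hello.</p></body>
</html>
```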
Now, let’s see an example of what a DOCTYPE declaration looks like when you
specify a DTD of your own by its location:
<!DOCTYPE book SYSTEM
"http://www.site.com/dtds/book.dtd">
This declaration includes just the keyword SYSTEM followed by the location and file name of the DTD. The SYSTEM keyword allows you to specify
the location of any DTD (not just DTDs that are recognized standards) using a
URI.
Combining Internal and External, Public
and System DTDs
The actual DTD that guides your document may be a combination of
an internal DTD and an external DTD, as well as a system-specific DTD and a
publicly available DTD. As long as the DTDs are syntactically correct and are
written according to the XML specification, you can mix and match DTDs all you
like. If a publicly available DTD meets most, but not all, of your needs, you
can reference it as an external DTD in your DOCTYPE declaration and then add
your own declarations in an internal DTD to augment the public DTD. This
example combines the external XHTML DTD with the sample memo DTD from earlier
in the chapter:
<?xml version="1.0" encoding="UTF-8"
standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd"
[
<!ELEMENT doc (subject, date, address, memo)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT memo (#PCDATA)>
<!ENTITY publisher "The Coriolis Group">
]>
The document element is html,
and the complete DTD for this document combines all the declarations in the
XHTML DTD with the declarations listed in the internal memo DTD. The most
common use of this feature is not to mix and match the elements and attributes
from multiple DTDs into a single DTD, but is instead to define entities that
are specific to the document within the internal DTD. Chapter 9 discusses the
ins and outs of entity placement in more detail.
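The more common pattern described above might look like the following sketch. It assumes a hypothetical external file, memo.dtd, that holds the element declarations from the memo example, and it adds a single document-specific entity (edition, also a hypothetical name) in the internal DTD.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE doc SYSTEM "memo.dtd" [
<!-- Entities specific to this one document; the structure comes from memo.dtd -->
<!ENTITY edition "second edition">
]>
<doc>
  <subject>XML Black Book Memo</subject>
  <date>July 5, 2000</date>
  <address>14455 N. Hayden Road, Suite 220, Scottsdale, AZ 85260</address>
  <memo>The &edition; has now been printed.</memo>
</doc>
```

Because the processor reads the internal declarations first, and the first declaration of an entity is the one that counts, an entity declared this way also overrides an entity of the same name in the external DTD.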
WARNING! Although you can use the DOCTYPE declaration to combine declarations
from several DTD sources into a single DTD, you may want to think twice before
you do so. A DTD developer usually builds all the components in the DTD to work
well together. If you introduce new elements from another DTD into the fray,
you may find that the combined collection of components doesn’t work well
together. If a DTD isn’t meeting your needs, consider reengineering it, rather
than simply adding to it, so that all the components remain compatible.
To DTD or Not to DTD
XML documents don’t necessarily require a DTD to work well, unlike SGML
documents, which always require a DTD. Because XML was fashioned to work within
the constraints of the Web, retrieving a DTD is sometimes impractical because
of bandwidth limitations. Often, developers will work with a sample set of data
and build elements and attributes around that data to get a good idea of what
constructs will be needed to describe similar kinds of data. They then work
backward from the DTD-less markup they’ve created to build a functional DTD.
So, how do you know when you need a DTD and when you don’t? And how do you know
when you should make it internal and when you should make it external? Several
factors will help you decide:
• Structurally complex documents more often have external DTDs. With
an external DTD, you can create a certain amount of standardization. Doing so
makes the document more coherent, because DTD users will have to follow certain
rules.
• The more people and data involved, the more useful a DTD is. Remember that the
whole point of a DTD is to help predefine data structures so you can achieve
consistency across documents that contain that data. The more data you have,
and the more people who need to share your documents, the harder it is to
maintain that consistency. If everyone has a standard to work from, most of
your consistency problems disappear (barring human error, of course, but that’s
what validating parsers are for).
• Small documents don’t require external DTDs. Unless you want to
standardize every single bit of data in your company, you should consider not
creating a DTD for simple correspondence (for example, one-page memos or
faxes).
• Some Internet-oriented documents don’t require DTDs. If you have
bandwidth constraints, an external DTD may cause more bandwidth overhead.
• Non-validating XML processors don’t require a DTD. When you use a
non-validating XML processor that only checks for well-formedness, you don’t
need an external DTD.
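For comparison, here is a sketch of a DTD-less document of the kind the last three points describe. The element names (fax, to, from, message) and the sender and recipient names are hypothetical. A non-validating processor checks only that the document is well formed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<fax>
  <to>Production Department</to>
  <from>Editorial</from>
  <message>The proofs are ready for review.</message>
</fax>
```

With no DOCTYPE declaration there is nothing to download and nothing to validate against, which is also why nothing stops a later fax from using different element names.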
Although we’ve given you guidelines to help you determine when
you should include a DTD and when doing so isn’t required, you really should
consider creating an external DTD for complex documents and documents with data
that will be shared among many users. DTDs that you create and keep in a
separate file are not only reusable, but also easier to manage, update, and
change. Because the DTD lives in its own file, you can also restrict who is
allowed to change it.
External vs. Internal DTDs
After you’ve decided to create a DTD, you need to determine if
it should be stored internally or externally. The size of the document is only
one consideration. Other factors to take into consideration are reusability of
the DTD and portability of the document.
Internal DTDs
When you create a document, the first thing you should ask
yourself is whether the document needs to be self-contained. A self-contained
document can be moved from system to system without losing components. You can
use the document on your local system without being connected to the Internet.
You can also copy a self-contained document to a disk or other removable media
and carry it with you.
Another reason for placing the DTD information within a single file is to cut
down on the amount of processing time and the amount of bandwidth required to
load, parse, and display the file. Sometimes, it’s more efficient to place the
DTD within the document so the XML processor only has to read one file, not
two, three, or more, to display the information. Finally, when you use an
internal DTD, you create a self-contained file that carries everything needed
to validate it. Any XML processor can process it without having to look for an
external DTD.
External DTDs
Although they add a certain amount of overhead, processing time,
and bandwidth requirements, it’s usually better to use external DTDs. External
DTDs offer many benefits, especially in the areas of managing, updating, and
editing documents. Here are just a few reasons you would want to use an
external DTD:
• If you use an external DTD,
you can use public DTDs. A public DTD may have all the capabilities you
need. Instead of reinventing the wheel, you can use a DTD that someone else has
already built, which means your document structure will be standard. Updates to
a public DTD are picked up the next time your documents are processed against
it, without any change to the documents themselves.
• With small documents, you can
focus on content. Instead of worrying about the structure of a small
document, you can focus on creating the content. You can use an externally
stored DTD so you don’t have to worry about putting all the information about
the document structure within a small document. For quick documents that need
to match a particular structure, an external DTD is preferable.
• External DTDs provide for better document management. With external
DTDs, you can easily create a set of documents that define rules for specific
needs. Then, you can edit and update the DTD as required without having to open
the XML content document, much as you would when reformatting a style sheet.
Also, instead of entering the same information again and again, you enter the
information once. You don’t have to worry about whether you’ve entered the same
element name in a variety of documents.