perfectxml.com
Chapter 5 from XML Black Book 2nd Edition (The Coriolis Group)

Chapter 5

DTDs in XML

Page 1 of 3. Goto Page 2 | 3


In Depth

A Document Type Definition (DTD) is the foundation from which XML documents are created. You can create your own DTDs or use predefined DTDs from a wide variety of industries, such as investing, software development, and scientific research. Whether you create your own DTD or use a predefined DTD, you need to know what a DTD is, how it’s constructed, how it works, and how it drives the development of XML documents.

NOTE: DTDs are not required. You can build XML documents that don’t have DTDs and still use them in many XML applications. We recommend that you build DTDs for most documents. If the XML processor uses a validating parser, you need to supply a DTD for your document.

Creating a DTD is one of the first steps in structuring your XML documents properly. DTDs provide the XML parsers with information, and they provide you with a guide to building standardized XML documents. In this chapter, we show you how to read, understand, and create DTDs. We also give you criteria to determine when an external DTD is needed and when to use an internal DTD. In addition, we explore why you might want to use previously defined DTDs and let you know when you should consider building your own DTD.

NOTE: The W3C is working on replacing DTDs with XML Schema. See the W3C site at www.w3.org/XML/Schema.html for more information.

What Is a DTD?

A DTD defines what markup components you can use to describe the different parts of a document. It includes a set of element and attribute declarations and the entities, notations, and comments you want to use to describe your data. In addition, the DTD outlines how you can and cannot use these components, what kind of content they can contain, and whether they are required or optional pieces of the document. Basically, a DTD is a set of rules that you use to guide your production of XML documents. Usually, a DTD has been well thought out and tested to be sure it can be used to accurately describe a particular kind of content, such as a stock ticker entry or a mathematical equation.

The easiest way to explain what a full DTD looks like and how it works is to quickly examine one. The following is an example of an internal DTD for a relatively simple document and is meant to demonstrate how a DTD looks and operates:

 

<!DOCTYPE doc [

<!ELEMENT doc (subject, date, address, memo)>

<!ELEMENT subject (#PCDATA)>

<!ELEMENT date (#PCDATA)>

<!ELEMENT address (#PCDATA)>

<!ELEMENT memo (#PCDATA)>

<!ENTITY publisher "The Coriolis Group">

]>

 

The DTD is contained within the <!DOCTYPE doc [ ]> marker, which specifies that the doc element is the document element. A document element is simply the XML element that contains all the other elements and content in a document. Inside the DTD are five element declarations, each contained within an <!ELEMENT> marker, and one entity declaration, contained within an <!ENTITY> marker. All of these declarations and markers are described in detail later in the chapter. Before we delve into the different components of the DTD, let’s look at Listing 5.1 to see how the DTD is actually used to drive the structure of information in a document.

Listing 5.1  Using a DTD to drive the structure of information in a document.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE doc [

<!ELEMENT doc (subject, date, address, memo)>

<!ELEMENT subject (#PCDATA)>

<!ELEMENT date (#PCDATA)>

<!ELEMENT address (#PCDATA)>

<!ELEMENT memo (#PCDATA)>

<!ENTITY publisher "The Coriolis Group">

]>

<doc>

<subject>XML Black Book Memo</subject>

<date>July 5, 2000</date>

<address>

   14455 N. Hayden Road, Suite 220, Scottsdale, AZ 85260

</address>

<memo>

   This memo is to alert you that the new version

   of the XML Black Book has now been printed. Published

   by &publisher;, this book outlines everything you

   need to know about XML.

</memo>

</doc>

 

The constructs in the DTD are reflected in the document itself. Of course, knowing what information you should include in each construct (such as including a valid date in the date element) requires some knowledge of the purpose and design goals of the DTD. Understanding how to use a DTD to guide document development and using the DTD correctly are two different things. Currently, we’re only concerned with how to interpret a DTD and use it to describe data.

Let’s take a closer look at our example DTD to get a feeling for what each line of code does in terms of defining the document. Here’s a quick rundown of the most important components of this document:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>—The XML declaration specifies that the document is an XML document. It’s the first line of instructions that is sent to the parser. For more specific information about the XML declaration, see “The XML Declaration” section later in this chapter.

  <!DOCTYPE doc [—The DOCTYPE declaration indicates that the DTD components are about to be defined. The components can then be defined within the declaration, called an internal DTD, as in our example DTD. Or, the declaration can point to a set of components contained in an external file, called an external DTD. In some cases, a DOCTYPE declaration may include both internal and external DTD components. Later sections in the chapter include more information on internal and external DTDs. The declaration also specifies the doc element as the document element.

  <!ELEMENT doc (subject, date, address, memo)>—Defines the list of elements for the root doc element. This element declaration tells the parser that the root element doc must contain the child elements subject, date, address, and memo and that they must appear in this order within the document. If they don’t, any tool that attempts to process the document and validate it against the DTD will return an error.

  <!ELEMENT subject (#PCDATA)>—Defines the element subject and specifies that the element will contain parsed character data, or data that a processor will try to process and treat as XML information. Chapter 6 discusses the kind of content an element can contain in more detail.

  <!ELEMENT date (#PCDATA)>—Defines the element date and specifies that the element will contain parsed character data.

  <!ELEMENT address (#PCDATA)>—Defines the element address and specifies that the element will contain parsed character data.

  <!ELEMENT memo (#PCDATA)>—Defines the element memo and specifies that the element will contain parsed character data.

  <!ENTITY publisher "The Coriolis Group">—Defines the parsed entity, publisher, and specifies that the value for this entity is The Coriolis Group. When a processor encounters the entity reference in a document, it replaces the entity name (&publisher;) with the entity value, The Coriolis Group.

]>—Indicates the end of the DTD.

  <doc>…</doc>—Contains the XML document that is driven by the DTD. See Table 5.1 for a rundown of how the document content reflects the DTD components.

 

Table 5.1  How DTD components are reflected in a document.

DTD Component                                   Document Component                Description

<!ELEMENT doc (subject, date, address, memo)>   <doc>…</doc>                      Element named doc that must contain
                                                                                  one instance each of the subject, date,
                                                                                  address, and memo elements

<!ELEMENT subject (#PCDATA)>                    <subject>XML Black Book           subject element that contains text
                                                Memo</subject>

<!ELEMENT date (#PCDATA)>                       <date>July 5, 2000</date>         date element that contains text

<!ELEMENT address (#PCDATA)>                    <address>14455 N. Hayden Road,    address element that contains text
                                                Suite 220, Scottsdale, AZ 85260
                                                </address>

<!ELEMENT memo (#PCDATA)>                       <memo>…</memo>                    memo element that contains text

<!ENTITY publisher "The Coriolis Group">        &publisher;                       Parsed entity that equates the code
                                                                                  &publisher; in a document with the
                                                                                  phrase The Coriolis Group
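
To see the ordering rule for the doc element at work, consider a variation on Listing 5.1. The following sketch is our own illustration, not one of the chapter’s listings: it reuses the same DTD but swaps the subject and date elements. The document is still well formed, but a validating parser should report an error, because the content of doc no longer matches (subject, date, address, memo).

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE doc [
<!ELEMENT doc (subject, date, address, memo)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT memo (#PCDATA)>
<!ENTITY publisher "The Coriolis Group">
]>
<doc>
<!-- Invalid: date appears before subject, but the DTD requires subject first -->
<date>July 5, 2000</date>
<subject>XML Black Book Memo</subject>
<address>14455 N. Hayden Road, Suite 220, Scottsdale, AZ 85260</address>
<memo>Published by &publisher;.</memo>
</doc>
```

Moving the subject element back above the date element makes the document valid again. A non-validating parser would accept either version, because both are well formed.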

 

A DTD drives the way you build XML documents. It’s as simple as that. The actual components of a DTD may exist directly inside a document or may be included in a reference file linked to the document. Regardless of the physical location of the DTD components, a DTD is considered part of an XML document. The DTD defines the parameters of the document, and the document holds data in a way that conforms to those parameters. To work with DTDs, you need to be familiar with their terminology and constructs. The next several sections of the chapter review the terminology and concepts associated with DTDs, and include cross-references to those chapters that cover the full functionality of these components in all realms of XML.

Declarations

A declaration defines the specifics about the elements, attributes, entities, and notations you want to use to describe content in your document, as well as any special instructions you might want to pass to the XML processor. A DTD is composed entirely of declarations. When you learn to build and read DTDs, you’re really learning to build and read declarations.

The first two lines of most XML documents contain two declarations: the XML declaration and the document type declaration (DOCTYPE declaration). Unlike other declarations (element, attribute, entity, and notation), the XML declaration and the DOCTYPE declaration don’t construct the document. They don’t explain the structural role of any particular element or attribute. Instead, they tell a processor what standards to use, what type of document is being processed, and where the DTD that actually drives the document is stored. Looking once again at the two lines of code in the sample document from the previous section, you see the XML declaration and the DOCTYPE declaration:

 

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE doc [

 

NOTE: There’s a difference between the terms DTD (Document Type Definition) and DOCTYPE declaration (document type declaration). The DTD holds all the declarations that guide a document’s development. The DOCTYPE declaration holds the DTD itself, or points to the DTD’s location in an external file. The DOCTYPE declaration points the processor to the DTD—it connects the DTD to the document. Think of it as the glue that binds the document to the definition that explains the document. Therefore, within the DOCTYPE declaration, you find the components of or links to the DTD.

The XML Declaration

The first line in the example document (Listing 5.1) is called the XML declaration, and it tells the processor to use version 1.0 of the XML specification to process the document as an XML document. Currently, version 1.0 is the only version of the specification; but as new versions become available, you can indicate which processor applications can and cannot process the document based on the version of XML the processor recognizes. The second part of the XML declaration specifies what kind of character encoding will be used for the document. In this example, we use UTF-8, the 8-bit Unicode character-encoding scheme that XML processors assume by default when a document gives no other indication of its encoding, but we could also specify UTF-16 (16-bit) or, depending on the parser, a 32-bit scheme or a variety of other character-encoding schemes. The “Parts of a DTD” section later in this chapter provides further insight into the issues of character encoding, and Chapter 9 discusses the topic in even more detail.

Related solution: Declaring a Text-Encoding Scheme for Entities

The final piece of information that we include in our XML declaration is called the standalone document declaration. It specifies whether an XML document stands alone or whether it relies on an external declaration. For example, if the value is set to "yes", the document is self-sufficient and doesn’t depend on anything declared in an external DTD. In other words, a standalone XML document contains all the pertinent information within itself.

A value of "yes" also tells an XML processor that it can skip any declarations stored in external references (such as DTD information included in an external file) without changing the content of the document. A value of "no" tells the processor that it should process external declarations, and "no" is also the value assumed when the document has external declarations but omits the standalone declaration. When you set the standalone declaration to "no", you’re simply telling the XML processor that the document can reference any external declarations. However, doing so doesn’t mean that you must include external references; it means only that the processor should accept and process any that are noted in the document. That is why Listing 5.1 can specify "no" even though its DTD is entirely internal.

When do you set the standalone declaration to "no" and when do you set it to "yes"? As we’ve mentioned, if your document relies on external declarations, such as entities or default attribute values defined in an external DTD, you need to set the value to "no". You can set the value to "yes" if you aren’t using external references at all, or if the only entities you use are declared in the internal DTD or are the general entities specified as part of the XML language, such as those for ampersands, greater-than or less-than symbols, apostrophes, or quotation marks.
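
As a small illustration of a document that can safely declare itself standalone, consider the following sketch. The note element and its content are hypothetical and are not part of the memo example. Everything the document needs is declared in the internal DTD, and the only entity references are to entities built into XML.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE note [
<!-- Every declaration the document needs is right here in the internal DTD -->
<!ELEMENT note (#PCDATA)>
]>
<!-- &lt;, &gt;, and &amp; are predefined by XML, so they need no declaration -->
<note>Use &lt;doc&gt; as the document element &amp; keep the memo short.</note>
```

Because nothing outside the file affects its content, this document can be copied to another system and processed there without any accompanying files.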

The DOCTYPE Declaration

The second line of code is called the DOCTYPE declaration, and it’s used to associate the XML document with its corresponding DTD. Following <!DOCTYPE is the name of the document element. In the case of an internal DTD, the list of declarations that make up the DTD follows the document element in the declaration and is contained within brackets ([ ]). If you’re referencing an external DTD, you simply identify the location of the DTD in the declaration immediately following the document element. If you’re combining internal and external DTDs, you first point to the external DTD and then use brackets to include the internal DTD declarations. The next section, “Storing DTDs,” discusses the ins and outs of DTD location and references in detail.

NOTE: If you’re going to include a DTD in your document, include the DOCTYPE declaration after the XML declaration but before the first element (the document element). If you try to place the DTD any place else in the document, an XML processor will spit out the document along with an error message telling you to rearrange your document. For more information on DTD-less documents, see the section “To DTD or Not to DTD” later in the chapter.

Storing DTDs

As we mentioned earlier in the chapter, the declarations that make up a DTD can be stored within the document (internally) or in a separate file (externally). Internal DTDs combine all the declarations that comprise a document’s DTD. Internal DTDs are specified with the following code within the DOCTYPE declaration:

 

<!DOCTYPE document_element [ DTD goes here]>

 

In the example we’ve worked with so far in the chapter, the DTD is stored internally:

 

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>

<!DOCTYPE doc [

<!ELEMENT doc (subject, date, address, memo)>

<!ELEMENT subject (#PCDATA)>

<!ELEMENT date (#PCDATA)>

<!ELEMENT address (#PCDATA)>

<!ELEMENT memo (#PCDATA)>

<!ENTITY publisher "The Coriolis Group">

]>

 

External DTDs are also referenced within the DOCTYPE declaration, but the actual declarations that make up the DTD are stored in an external file. The name of the external DTD doesn’t have to correspond in any way to the name of the XML document, and by convention it carries the three-letter extension .dtd. A descriptive name is still the better choice: if your XML document is named bizmemo.xml, you might want to name the DTD memo.dtd, instead of something like xyz.dtd.
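
As a rough sketch of this arrangement, suppose the memo declarations from earlier in the chapter are moved into a file named memo.dtd that sits in the same directory as bizmemo.xml (the shared directory and the relative file reference are our assumptions). The external file holds only the declarations themselves, without the surrounding <!DOCTYPE doc [ ]> marker:

```xml
<!-- memo.dtd: the memo declarations, stored in a file of their own -->
<!ELEMENT doc (subject, date, address, memo)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT memo (#PCDATA)>
<!ENTITY publisher "The Coriolis Group">
```

The document, bizmemo.xml, then points to that file from its DOCTYPE declaration by using the SYSTEM keyword, which the “Public and System DTDs” section explains:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE doc SYSTEM "memo.dtd">
<doc>
<subject>XML Black Book Memo</subject>
<date>July 5, 2000</date>
<address>14455 N. Hayden Road, Suite 220, Scottsdale, AZ 85260</address>
<memo>Published by &publisher;, this book outlines everything you need to know about XML.</memo>
</doc>
```

The standalone value is "no" here because the publisher entity is declared in the external file. A parser that doesn’t read external DTDs may not be able to resolve &publisher;.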

Storing your DTDs externally is beneficial because a single DTD can be used and reused with literally thousands of documents. This is one of XML’s most powerful features. Having the ability to create a single DTD to encompass a variety of documents that are based on a common structure makes it possible for you to create standardized ways to store and describe information for presentation across a variety of devices.

Public and System DTDs

Within a DOCTYPE declaration, you can identify an external DTD reference as a public DTD or a system DTD. You can use publicly available DTDs that have already been defined for a particular need, or you can use your own locally developed DTD. When you use a publicly available DTD, you need to use the keyword PUBLIC within the DOCTYPE declaration when you specify the DTD. When you use your own DTD, you need to use the keyword SYSTEM. Here’s an example of a DOCTYPE declaration that references the public XHTML DTD (XHTML is a reformulation of HTML under the umbrella of XML; Chapter 18 is devoted to the subject):

 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

 

The public identifier structure used by XML for publicly available DTDs is the same structure used for Standard Generalized Markup Language (SGML) public identifiers. If the specified entity or DTD is an International Organization for Standardization (ISO) standard, you must start the identifier with ISO. If it’s not an ISO standard, but the standard is officially approved by a standards body, start the identifier with a plus sign (+); if it’s not officially approved by a standards body, start the identifier with a minus sign (-). Following this prefix are two forward slashes (//) and then the owner of the DTD; further pairs of slashes separate the name of the DTD and a code for its language. If we dissect the DOCTYPE declaration in the preceding example, we find that the DTD specified is not an officially approved standard (the identifier starts with a minus sign) and that the W3C owns this DTD. We also find that the name of the DTD is DTD XHTML 1.0 Strict, that it’s written in English (EN), and that it’s located on the W3C Web site at www.w3.org in a DTD directory.
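
To make the pattern concrete, here is how a public identifier for the memo DTD from earlier in the chapter might be put together. This is only a sketch: the owner string, the DTD name, and the www.example.com address are invented for illustration and don’t refer to a real, published DTD. The fragment shows the DOCTYPE declaration alone, not a complete document.

```xml
<!-- Reading the public identifier field by field:
     "-"                   the DTD is not approved by a standards body
     "The Coriolis Group"  the owner of the DTD
     "DTD Memo 1.0"        the name of the DTD
     "EN"                  the language the DTD is written in -->
<!DOCTYPE doc PUBLIC "-//The Coriolis Group//DTD Memo 1.0//EN"
  "http://www.example.com/dtds/memo.dtd">
```

The second quoted string is the location of the DTD, which gives a processor somewhere to look if it doesn’t recognize the public identifier.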

Now, let’s see an example of what a DOCTYPE declaration looks like when you specify a DTD of your own by its location:

 

<!DOCTYPE book SYSTEM "http://www.site.com/dtds/book.dtd">

 

This declaration includes just the keyword SYSTEM followed by the location and file name of the DTD. The SYSTEM keyword allows you to specify the location of any DTD (not just DTDs that are recognized standards) using a URI.

Combining Internal and External, Public and System DTDs

The actual DTD that guides your document may be a combination of an internal DTD and an external DTD, as well as a system-specific DTD and a publicly available DTD. As long as the DTDs are syntactically correct and are written according to the XML specification, you can mix and match DTDs all you like. If a publicly available DTD meets most, but not all, of your needs, you can reference it as an external DTD in your DOCTYPE declaration and then add your own declarations in an internal DTD to augment the public DTD. This example combines the external XHTML DTD with the sample memo DTD from earlier in the chapter:

 

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd"

[

<!ELEMENT doc (subject, date, address, memo)>

<!ELEMENT subject (#PCDATA)>

<!ELEMENT date (#PCDATA)>

<!ELEMENT address (#PCDATA)>

<!ELEMENT memo (#PCDATA)>

<!ENTITY publisher "The Coriolis Group">

]>

 

The document element is html, and the complete DTD for this document combines all the declarations in the XHTML DTD with the declarations listed in the internal memo DTD. The most common use of this feature is not to mix and match the elements and attributes from multiple DTDs into a single DTD, but is instead to define entities that are specific to the document within the internal DTD. Chapter 9 discusses the ins and outs of entity placement in more detail.
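
The more common case, adding a document-specific entity on top of an external DTD, might look like the following sketch. It assumes that a file named memo.dtd holds the element declarations for the memo example; the edition entity is a hypothetical addition of ours.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE doc SYSTEM "memo.dtd" [
<!-- Document-specific entity added on top of the external declarations -->
<!ENTITY edition "2nd Edition">
]>
<doc>
<subject>XML Black Book Memo</subject>
<date>July 5, 2000</date>
<address>14455 N. Hayden Road, Suite 220, Scottsdale, AZ 85260</address>
<memo>The &edition; of the XML Black Book has now been printed.</memo>
</doc>
```

The structure of the document still comes entirely from the external DTD. The internal DTD contributes only the entity, so other documents that share memo.dtd are unaffected.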

WARNING! Although you can use the DOCTYPE declaration to combine declarations from several DTD sources into a single DTD, you may want to think twice before you do so. A DTD developer usually builds all the components in the DTD to work well together. If you introduce new elements from another DTD into the fray, you may find that the combined collection of components doesn’t work well together. If a DTD isn’t meeting your needs, consider reengineering it instead of simply adding to it to ensure that all the components are compatible.

To DTD or Not to DTD

XML documents don’t necessarily require a DTD to work well, unlike SGML documents, which always require a DTD. Because XML was fashioned to work within the constraints of the Web, sometimes a DTD simply doesn’t work because of bandwidth limitations. Often, developers will work with a sample set of data and build elements and attributes around that data to get a good idea of what constructs will be needed to describe similar kinds of data. They then work backward from the DTD-less markup they’ve created to build a functional DTD.

So, how do you know when you need a DTD and when you don’t? And how do you know when you should make it internal and when you should make it external? Several factors will help you decide:

  Structurally complex documents more often have external DTDs. With an external DTD, you can create a certain amount of standardization. Doing so makes the document more coherent, because DTD users will have to follow certain rules.

The more people and data involved, the more useful a DTD is. Remember that the whole point of a DTD is to help predefine data structures so you can achieve consistency across documents that contain that data. The more data you have and the more people there are who need to share the documents you have to deal with, the harder it is to maintain that consistency. If everyone has a standard to work from, your consistency problems will be effectively solved (barring human error, of course—but that’s what validating parsers are for).

Small documents don’t require external DTDs. Unless you want to standardize every single bit of data in your company, you should consider not creating a DTD for simple correspondence (for example, one-page memos or faxes).

Some Internet-oriented documents don’t require DTDs. If you have bandwidth constraints, an external DTD may cause more bandwidth overhead.

Non-validating XML processors don’t require a DTD. When you use a non-validating XML processor that only checks for well-formedness, you don’t need an external DTD.
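
For comparison, here is a sketch of what the memo from Listing 5.1 might look like with no DTD at all. This version is our own illustration. Because no DTD declares the publisher entity, the publisher’s name has to be written out; an undeclared &publisher; reference would keep the document from being well formed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<subject>XML Black Book Memo</subject>
<date>July 5, 2000</date>
<address>14455 N. Hayden Road, Suite 220, Scottsdale, AZ 85260</address>
<!-- With no DTD there is no publisher entity, so the name is spelled out -->
<memo>Published by The Coriolis Group, this book outlines everything you need to know about XML.</memo>
</doc>
```

A non-validating processor can read this document as is, but nothing stops an author from omitting the date element or rearranging the elements.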

Although we’ve given you guidelines to help you determine when you should include a DTD and when doing so isn’t required, you really should consider creating an external DTD for complex documents and documents with data that will be shared among many users. When you create and keep DTDs in a separate file, they’re not only reusable, but also easier to manage, update, and change. You can also prevent people from changing the DTDs.

External vs. Internal DTDs

After you’ve decided to create a DTD, you need to determine if it should be stored internally or externally. The size of the document is only one consideration. Other factors to take into consideration are reusability of the DTD and portability of the document.

Internal DTDs

When you create a document, the first thing you should ask yourself is whether the document needs to be self-contained. A self-contained document can be moved from system to system without losing components. You can use the document on your local system without being connected to the Internet. You can also place a self-contained document on a disk or cartridge drive and carry it with you. And any XML processor can process it without having to look for an accompanying DTD.

Another reason for placing the DTD information within a single file is to cut down on the amount of processing time and the amount of bandwidth required to load, parse, and display the file. Sometimes, it’s more efficient to place the DTD within the document so the XML processor only has to read one file—not two, three, or more—to display the information. Finally, when you use an internal DTD, you create a self-contained file that is both valid and complete. Any XML processor can process it without having to look for an external DTD.

External DTDs

Although they add a certain amount of overhead, processing time, and bandwidth requirements, it’s usually better to use external DTDs. External DTDs offer many benefits, especially in the areas of managing, updating, and editing documents. Here are just a few reasons you would want to use an external DTD:

If you use an external DTD, you can use public DTDs. A public DTD may have all the capabilities you need. Instead of reinventing the wheel, you can use a DTD that someone else has already built, which means your document structure will be standard. Updates to a public DTD are automatically incorporated within your documents.

With small documents, you can focus on content. Instead of worrying about the structure of a small document, you can focus on creating the content. You can use an externally stored DTD so you don’t have to worry about putting all the information about the document structure within a small document. For quick documents that need to match a particular structure, an external DTD is preferable.

  External DTDs provide for better document management. With external DTDs, you can easily create a set of documents that define rules for specific needs. Then, you can edit and update the DTD as required without having to open the XML content document, much as you would when reformatting a style sheet. Also, instead of entering the same information again and again, you enter the information once. You don’t have to worry about whether you’ve entered the same element name in a variety of documents.


