The in-memory tree produced by a DOM parse is convenient if you are going to access
document content out of order. That is, if your program needs to rearrange the incoming
data for its output, or needs to move around the document and select data in
random-access order, the DOM document tree provides what you need. You can search the
tree and pull out what you want without regard to where it appeared in the input
document. One disadvantage of DOM is that a large document takes up a lot of space,
because the entire document is held in memory. With modern operating systems, however,
the document would need to be extremely large before this became a problem. DOM is also
more complicated to use than SAX: because it supports random access to the stored data,
its API is necessarily more complex. In exchange, it can be used to do much more. For
more details about how DOM works, see Chapters 3, 6, and 7.
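To make the idea of out-of-order access concrete, here is a minimal sketch that uses
the standard JAXP DocumentBuilder. The sample document, the class name DomRandomAccess,
and its helper methods are made up for illustration, and getTextContent() comes from
DOM Level 3, so it assumes a reasonably recent JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomRandomAccess {
    // Hypothetical sample document, used only for illustration.
    static final String SAMPLE =
        "<library>"
      + "<book id=\"b1\"><title>First</title></book>"
      + "<book id=\"b2\"><title>Second</title></book>"
      + "<book id=\"b3\"><title>Third</title></book>"
      + "</library>";

    // Parse the whole document into an in-memory DOM tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Walk the titles backward, which a one-pass streaming reader cannot do.
    static List<String> titlesInReverse(Document doc) {
        NodeList titles = doc.getElementsByTagName("title");
        List<String> result = new ArrayList<>();
        for (int i = titles.getLength() - 1; i >= 0; i--) {
            result.add(titles.item(i).getTextContent());
        }
        return result;
    }

    // Jump straight to one book by its id attribute; returns null if absent.
    static String titleOf(Document doc, String id) {
        NodeList books = doc.getElementsByTagName("book");
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            if (id.equals(book.getAttribute("id"))) {
                return book.getElementsByTagName("title").item(0).getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(titlesInReverse(doc));
        System.out.println(titleOf(doc, "b2"));
    }
}
```

Because the whole tree is in memory, both queries run against the same Document
without reading the input a second time. That is the convenience, and also the memory
cost, described above.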
Internally in JAXP, the DOM parser actually uses SAX as its lexical scanner. That
is, a SAX parser reads the document and breaks it down into a stream of its
components, and the DOM software takes this stream and constructs a tree
from it. This is why it is best to understand SAX before trying to get a
clear idea of how JAXP DOM works. Although you may never use SAX directly, it's a
good idea to know how it works and how the incoming document is broken down. At
the very least, you will need to be familiar with the meaning of its error messages and
how to process them in your application. For more details, see Chapters 3, 4, and 5.
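As a rough illustration of how SAX breaks a document down and reports errors, the
following sketch records each callback as a string. The class SaxEventTrace and its
event labels are invented for this example; the handler methods themselves are the
standard ones from org.xml.sax.helpers.DefaultHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventTrace {
    // Records every callback the parser makes, in the order it makes them.
    static class TraceHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may deliver one run of text in several chunks;
            // this sketch simply records each non-blank chunk.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);
            }
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            // Well-formedness problems arrive here before parsing stops.
            events.add("fatal:line " + e.getLineNumber());
            throw e;
        }
    }

    // Parse the text and return the recorded events, including any fatal error.
    static List<String> trace(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TraceHandler handler = new TraceHandler();
        try {
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Already recorded by fatalError(); nothing more to do here.
        }
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(trace("<note><to>Ann</to></note>"));
        System.out.println(trace("<note><to>Ann</note>"));
    }
}
```

The first call shows the stream of start, text, and end events that a tree builder
would consume. The second shows that a mismatched tag surfaces as a fatalError
callback carrying a line number, the kind of message an application has to be
prepared to handle even if it only ever works with DOM.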
SGML
SGML stands for Standard Generalized Markup Language. This is the parent markup
language of XML and HTML, both of which were derived from SGML for more specific
purposes. Included in the 500-page SGML specification document is a definition of the
system for organizing and tagging elements in a document. It became a standard of
the International Organization for Standardization (ISO) in 1986, but it had
actually been in use for some time before that. It was designed to manage large documents
so that they could be frequently changed and also printed. The language definition
is large and difficult to implement in full, which is what led to the simpler XML
and HTML.
XML works well as a subset of SGML because the full complexity of SGML isn't
necessary to do all of the tagging and transforming that needs to be done. Being a
practical subset makes it much easier to write a parser for XML. Because the language
is less complex, XML documents are smaller and easier to create than
SGML documents would be. For example, where SGML always requires the presence
of a DTD, in XML the DTD is largely optional. If you are going to validate the
correctness of an XML document, the DTD is necessary, but otherwise it can be omitted.
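The optional nature of the DTD can be seen with a short sketch built on the JAXP
DocumentBuilderFactory. The class DtdOptional, its method, and the three sample
documents are assumptions made for this example; setValidating() and
setErrorHandler() are the standard JAXP calls.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DtdOptional {
    // Hypothetical sample documents, used only for illustration.
    static final String DTD =
        "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";
    static final String NO_DTD = "<note><to>Ann</to></note>";
    static final String VALID = DTD + "<note><to>Ann</to></note>";
    static final String INVALID = DTD + "<note><from>Bob</from></note>";

    // Parse the text and count the validity errors the parser reports.
    static int countValidationErrors(String xml, boolean validating)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating); // turn DTD validation on or off
        DocumentBuilder builder = factory.newDocumentBuilder();
        final int[] errors = {0};
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { errors[0]++; }
            public void fatalError(SAXParseException e) throws SAXException {
                throw e; // not well formed: give up
            }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        // No DTD and no validation: the document parses without complaint.
        System.out.println(countValidationErrors(NO_DTD, false));
        // With a DTD and validation on, a conforming document is clean
        // and a nonconforming one produces errors.
        System.out.println(countValidationErrors(VALID, true));
        System.out.println(countValidationErrors(INVALID, true));
    }
}
```

Validity problems are reported through the error() callback rather than thrown, so
the application decides whether a document that is well formed but invalid is
acceptable.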
XML is a bit closer to SGML than HTML is. For one thing, HTML is filled
with ambiguities because it allows things like an opening tag that has no closing tag to
match it. This prevents any attempts to standardize HTML because a parser cannot