The Book of SAX: The Simple API for XML
By: W. Scott Means and Michael A. Bodie
The Book of SAX includes everything XML and Java developers need to write SAX applications. Specific examples show how to use SAX to solve XML parsing problems that are impractical to address with tree-based technologies, including real-time parsing, very large documents, and high-performance applications. The authors guide readers through the development of picoSAX, a functioning SAX 2.0 XML parser.
ISBN: 1886411778
Excerpted with permission from The Book of SAX, by W. Scott Means and Michael A. Bodie. ©2002 
Book Review:
The Book of SAX is an excellent tutorial and reference guide for all Java SAX developers. The first part
of this book (nine chapters) focuses on teaching SAX programming with Java, and the second part is the complete
SAX 2.0 reference guide.
The book begins with an introduction to SAX and compares it with DOM.
The second chapter shows the steps required to
build an XML parsing application using SAX. The third chapter focuses on error handling with SAX. The next
three chapters (4, 5, and 6) discuss topics such as character encoding issues, the Locator interface, capturing
DTD information, and namespace support in SAX. Chapter 7 discusses the XMLFilter interface in detail.
Chapter 8 outlines the steps required to migrate SAX 1.0 applications to SAX 2.0.
The book also shows how to write a simple SAX parser in Java. This parser is called picoSAX and is
described in chapter 9.
Chapter 10 is an excellent SAX 2.0 API reference that covers every SAX 2.0 interface along with its methods and
properties.
The final chapter (11) covers the deprecated SAX 1.0 API.
In summary, this is a must-have book for all Java SAX developers.
Chapter 2: A Complete SAX Application
This chapter walks you through the process of writing your very first SAX application.
This chapter discusses the following topics:
What Is SAX?
In reality, SAX is nothing more than a set of Java class and interface descriptions that document a system for writing event-driven XML applications. The
SAX specification (along with the source code for a set of Java interfaces and
classes) lives on its own Web site (www.saxproject.org) and is still maintained
and extended by the members of the XML-DEV mailing list.
But to actually write a SAX application, you need an XML parser that has a
concrete implementation of the various interfaces and classes that make up the
org.xml.sax and org.xml.sax.helpers Java packages.
Where Do I Get It?
Before we get into the nitty-gritty details of writing a SAX application, you'll
need to install an XML parser distribution that supports SAX. Although SAX
was originally designed for Java programmers, implementations of the SAX
interfaces have been written for many other programming languages such as
Perl, C++, and Visual Basic. For the rest of this book, we'll assume that you're
writing Java applications that use the Xerces parser distribution that is supported by the Apache Foundation's XML Project (http://xml.apache.org).
After downloading the Xerces package from the Apache XML Project Web
site, you'll need to follow the installation instructions that go with the specific
version you will be using. The examples in this book have been tested with
Xerces version 1.4.0, but they should work with any other parser that supports
the SAX 2.0 standard. The installation process for 1.4.0 amounts to no more
than unzipping the package in a directory on the local hard drive using either
the Java jar utility or the unzipping program of your choice (we use WinZip 8.0
on our Windows machines).
You will also need a copy of the Java 2 SDK to compile and execute your
SAX application. The examples in this book were compiled using the JDK version 1.2.2 that is installed by default by the Borland JBuilder version 3.0 Java
Integrated Development Environment (IDE).
What Version Number Was That?
Observant readers will notice that the version numbers for the products mentioned in this section are (in some cases) one or two major revision numbers
out of date. This is actually a conscious decision (if a somewhat unconventional one). After reviewing the additional features and bug fixes that were
available for newer versions of these products, the trouble spent upgrading
seemed to outweigh the value that would be gained by using the more recent
version. Since we are probably using older versions of these tools than you are,
the examples should remain compatible with the more recent versions you have installed. If we had used
the most recent versions, the reverse would probably not have been true.
The examples in this book were developed using the Borland JBuilder
Foundation 3.0 IDE. Although it is lacking in a few areas, it is an excellent
value for the money (the lowest-end version of JBuilder is available for download
from Borland at no cost). It can be obtained from the Borland Web site
(www.borland.com). Although an IDE (such as JBuilder) can certainly make software
development less tedious, there is no requirement that you use one. The
examples in this book could just as well have been developed using a basic text
editor and command-line tools.
Configuring Your Environment
The following instructions assume that you're using the Xerces parser. If you're
not, you'll need to follow the instructions that come with your parser to add the
org.xml.sax and org.xml.sax.helpers packages to your system's Java CLASSPATH
environment variable.
If you are using Xerces, you should add the xerces.jar file in the root of the
distribution directory to your system's CLASSPATH environment variable. This can
be done through the System applet of the Control Panel for most versions of
Windows. This will allow java.exe (the JDK Java runtime) to locate the SAX
classes at runtime without your needing to supply their location on the command line.
Basic SAX Application Flow
Every SAX application needs to go through the same basic steps to process
XML documents:
1. Obtain a reference to an object that implements the XMLReader interface.
2. Create an instance of an application-specific object that implements one or
   more of the various SAX *Handler interfaces (DTDHandler, ContentHandler, and
   ErrorHandler).
3. Register the object instance with the XMLReader object so that it will receive
   notifications as XML parsing events occur.
4. Call the XMLReader.parse() method for each XML document that needs to
   be processed by the application. The object instance that was registered in
   step 3 will receive notifications progressively as the document is parsed.
It is up to the application-specific object (or objects) to track and process
the information that is delivered via the various event notification methods that
it implements. For example, an application that wants to strip markup out of
an XML document and leave only the text content would need to implement
the ContentHandler interface and implement its specific processing in the
characters() event callback method. Even though the setup for every SAX application is almost identical, the data structures and algorithms that process the
event notifications will vary widely depending on what the application is
designed to do.
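The markup-stripping example described above can be sketched as follows. This is only an illustration, not code from the book: the class name TextStripper and its strip() method are hypothetical, and the XMLReader is obtained from the JAXP SAXParserFactory that ships with the JDK (an assumption made so the sketch runs without a separate Xerces download).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: keeps the text content of a document and drops the markup. */
class TextStripper extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    /** The parser may deliver one text run in several calls, so always append. */
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    /** Runs the four setup steps against an in-memory document and returns its text. */
    static String strip(String xml) {
        try {
            // Step 1: obtain an XMLReader (assumption: the JDK's JAXP factory).
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Step 2: create the application-specific handler object.
            TextStripper handler = new TextStripper();
            // Step 3: register it to receive content notifications.
            reader.setContentHandler(handler);
            // Step 4: parse; the callbacks fire while this call is running.
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.text.toString();
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(strip("<note><to>Reader</to>, hello <b>SAX</b>!</note>"));
    }
}
```

Only characters() is overridden here; every other notification falls through to the empty DefaultHandler implementations.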
The XShell Sample Application
To illustrate the various features (and pitfalls) of SAX, we will be developing a
complex example throughout the tutorial chapters. This example will implement
a simple command-line shell interface (similar to command.com or cmd.exe) where
the commands are encoded as XML elements. Keep in mind that our goal is not
to build a full-featured command-line interpreter, but to show off the various
features of SAX that make it ideally suited to interactive XML applications.
Whenever a trade-off between application functionality and clarity of SAX usage
needs to be made, SAX will always come out the winner.
The full source code for this application (as well as other goodies) is available
at this book's Web site (www.bookofsax.com).
Throughout the book, we will call attention to complementary information that can be found at the Web site through the use of the Web site icon.
For now, we will define the XShell application as a Java program that
accepts zero or more XML document URIs from the command line, parses
each given document, and executes the recognized batch commands within. If
no URIs are given on the command line, the XML commands will be read from
the System.in InputStream. Input will be batched one line at a time, until a
blank line is detected. The batch will then be parsed using the same algorithm
that is used to parse documents read from URIs. Input will be terminated by an
empty batch (no XML content).
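The interactive batching rule just described (collect lines until a blank line, stop on an empty batch) might look something like the sketch below. The BatchReader class and its readBatch() method are hypothetical names introduced for illustration; the book's own doCLI() implementation is not shown in this excerpt and may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: groups input lines into batches separated by blank lines. */
class BatchReader {
    /**
     * Reads lines until a blank line or the end of input.
     * Returns the accumulated batch, or null for an empty batch,
     * which is the signal to stop reading.
     */
    static String readBatch(BufferedReader in) {
        StringBuilder batch = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null && !line.trim().isEmpty()) {
                batch.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return batch.length() == 0 ? null : batch.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for System.in so the sketch runs unattended.
        BufferedReader in = new BufferedReader(
            new StringReader("<batch>\n<echo>hi</echo>\n</batch>\n\n\n"));
        String batch;
        while ((batch = readBatch(in)) != null) {
            System.out.print(batch); // a real shell would parse the batch here
        }
    }
}
```

Each non-null batch could then be wrapped in an InputSource (for example, via a StringReader) and handed to the same parsing code that handles documents read from URIs.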
Listing 2-1 shows the basic skeleton of the XShell class, including the main()
method that acts as the command-line entry point.
Listing 2-1: The XShell class skeleton
import java.io.IOException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XShell extends DefaultHandler {
    . . .
    /**
     * Command line entry point.
     */
    public static void main(String[] args)
    {
        // create a new XShell session object and set the output to go to stdout
        XShell xbSession = new XShell(System.out);
        try {
            if (args.length > 0) {
                // user provided document URIs on the command line
                for (int i = 0; i < args.length; i++) {
                    InputSource is = new InputSource(args[i]);
                    xbSession.doBatch(is);
                }
            } else {
                // no command line argument provided, read from stdin
                xbSession.doCLI(System.in);
            }
        } catch (IOException ioe) {
            System.err.println(ioe);
        } catch (SAXException se) {
            System.err.println(se);
        }
    }
Using the XMLReader Interface
The primary entry point to any SAX implementation is the XMLReader interface.
This interface contains methods for:
- Controlling how the underlying XML parser will operate (validating versus
  nonvalidating, and so on).
- Enabling and disabling specific SAX features (such as namespace
  processing).
- Registering object instances to receive XML parsing notifications (via the
  *Handler interfaces).
- Initiating the parsing process on a specific document URI or input source
  (via the parse() methods).
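A minimal sketch of these capabilities in use is shown below. The two feature identifiers are the standard SAX 2.0 names for namespace processing and validation; the ReaderSetup class and rootName() method are hypothetical, and the reader comes from the JDK's JAXP factory rather than from Xerces directly (an assumption made to keep the sketch self-contained).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: configures an XMLReader before parsing. */
class ReaderSetup {
    // Standard SAX 2.0 feature identifiers.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    /** Returns the root element of the document as "{namespaceURI}localName". */
    static String rootName(String xml) {
        final String[] root = new String[1];
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Control parser operation and enable or disable SAX features.
            reader.setFeature(NAMESPACES, true);
            reader.setFeature(VALIDATION, false);
            // Register an object to receive parsing notifications.
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (root[0] == null) {
                        root[0] = "{" + uri + "}" + localName;
                    }
                }
            });
            // Initiate parsing on an input source.
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return root[0];
    }

    public static void main(String[] args) {
        System.out.println(rootName("<x:cmd xmlns:x='urn:example'/>"));
    }
}
```

Features must be set before parse() is called; SAX does not allow most of them to change while a parse is in progress.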
Before an application can use the XMLReader interface, it must first obtain a
reference to an object that implements it. How the XMLReader interface is supported
is a decision that is left up to the implementers of the particular SAX
distribution. For instance, the Xerces package supplies a class called
org.apache.xerces.parsers.SAXParser that implements the XMLReader interface.
Any application that uses Xerces to provide SAX support can simply create a
new instance of the SAXParser class and use it immediately.
The SAX specification does define a special helper class (from the
org.xml.sax.helpers package) called XMLReaderFactory that is intended to act as
a class factory for XMLReader instances. It has two static methods for creating a
new XMLReader object instance:
XMLReaderFactory.createXMLReader();
XMLReaderFactory.createXMLReader(String className);
Of course, both of these methods require that the name of the class
that supports the XMLReader interface be known in advance: the second takes it as an argument, and the first reads it from the org.xml.sax.driver system property. Since all that
these methods do is use the class name given to create a new object instance,
this helper class is, at best, of questionable utility.
The XShell constructor obtains a reference to an XMLReader class instance
by directly instantiating the SAXParser class from the Xerces package inside its
constructor:
m_xrShell = (XMLReader)new org.apache.xerces.parsers.SAXParser();
Now that we have an XMLReader object instance to work with, we can register
our class to receive XML parse callback notifications.
Implementing the ContentHandler Interface
The most interesting interface in the SAX package, from the application
author's standpoint, is the ContentHandler interface. This interface provides
notification of:
- documents starting and ending
- element start and end tags recognized
- character data found
- namespace prefixes going in and out of scope
- processing instructions found
- ignorable whitespace read
For normal XML applications, XML element and text content are the most
important types of data that will be processed. That is why almost every SAX
application will want to implement the ContentHandler interface and register to
receive its notifications. It is a rather large interface, however, and many applications
will neither need nor want to receive all of the notifications it offers.
To save application writers from having to implement half a dozen methods
that they have no intention of using, SAX provides the DefaultHandler class.
This class provides a default empty implementation
of all of the callback interfaces supported by SAX 2.0, including the EntityResolver, DTDHandler,
ContentHandler, and ErrorHandler interfaces. By deriving your SAX application
object from DefaultHandler, you can immediately register to receive any of the
callback notifications without writing empty methods to satisfy the various Java
interface specifications. The XShell class is derived from the DefaultHandler
class directly, like this:
public class XShell extends DefaultHandler { . . .
Then, in the XShell constructor, it registers itself to receive ContentHandler
notifications using the XMLReader object instance it just created:
m_xrShell = (XMLReader)new org.apache.xerces.parsers.SAXParser();
m_xrShell.setContentHandler(this);
Now that the XShell instance is ready to receive content notifications, the application is ready to parse an XML document.
Parsing and Processing
Whether the XML command-line instructions come from an external document
or from System.in, both cases end up calling the doBatch() method:
/**
 * Given an InputSource object, parses and executes the commands given.
 */
public void doBatch(InputSource is) throws IOException
{
    try {
        m_xrShell.parse(is);
    } catch (SAXException se) {
        // we just eat it, we already notified the user
    }
}
This very simple method calls the parse() method of the XMLReader instance
that was created by the XShell constructor. The parse() method does not return
until either the document referenced by the passed InputSource has been
completely parsed, or a fatal parsing error occurs. In either case, no indication of
the result is returned to the application by the parse() method. Error handling
is covered thoroughly in Chapter 3.
After the XML parser has been invoked, the SAX interface begins calling
notification methods for the object instances that were registered by the
set*Handler() methods of the XMLReader interface. To help you understand
the sequence in which notifications are sent by SAX, we've developed a simple
utility called com.bookofsax.SAXAnimate that prints the interface and method
names of every SAX callback for a given XML document. Given the following
simple XML document:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<element xmlns="http://namespaces.bookofsax.com/null">
<message>Hello, SAX.</message>
</element>
</root>
The SAXAnimate utility generates the following list of notification calls:
ContentHandler.setDocumentLocator()
ContentHandler.startDocument()
ContentHandler.startElement()
ContentHandler.characters()
ContentHandler.startPrefixMapping()
ContentHandler.startElement()
ContentHandler.characters()
ContentHandler.startElement()
ContentHandler.characters()
ContentHandler.endElement()
ContentHandler.characters()
ContentHandler.endElement()
ContentHandler.endPrefixMapping()
ContentHandler.characters()
ContentHandler.endElement()
ContentHandler.endDocument()
By reading through this trace and matching it to the content in the XML
document, you should be able to more easily understand how SAX sequences
notifications to ensure that your application has an accurate picture of the XML
document at any point in time. For instance, the namespace prefix mapping
notification is sent before the startElement() notification for the element that declares it, since elements
and attributes can use a namespace as soon as it is declared.
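A trace like the one above can be reproduced with a small handler that records each callback as it arrives. The sketch below is not the book's SAXAnimate utility; CallbackTracer and trace() are hypothetical names, and the reader is created through the JDK's JAXP factory with namespace awareness switched on (an assumption, since prefix-mapping events are reported only when namespace processing is enabled).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a callback tracer: records the name of every notification. */
class CallbackTracer extends DefaultHandler {
    final List<String> calls = new ArrayList<>();

    @Override public void setDocumentLocator(Locator locator) { calls.add("setDocumentLocator"); }
    @Override public void startDocument() { calls.add("startDocument"); }
    @Override public void endDocument() { calls.add("endDocument"); }
    @Override public void startPrefixMapping(String prefix, String uri) { calls.add("startPrefixMapping"); }
    @Override public void endPrefixMapping(String prefix) { calls.add("endPrefixMapping"); }
    @Override public void characters(char[] ch, int start, int length) { calls.add("characters"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        calls.add("startElement(" + localName + ")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        calls.add("endElement(" + localName + ")");
    }

    /** Parses the given document and returns the recorded callback sequence. */
    static List<String> trace(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for prefix-mapping events
            XMLReader reader = factory.newSAXParser().getXMLReader();
            CallbackTracer tracer = new CallbackTracer();
            reader.setContentHandler(tracer);
            reader.parse(new InputSource(new StringReader(xml)));
            return tracer.calls;
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><element xmlns='http://namespaces.bookofsax.com/null'>"
                   + "<message>Hello, SAX.</message></element></root>";
        for (String call : trace(xml)) {
            System.out.println(call);
        }
    }
}
```

The number of characters() entries can vary between parsers, because a parser is free to split a run of text across several calls; the relative order of the other notifications is what matters.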
Now, given this understanding, how will the XShell application process notifications to ensure that it will have the appropriate information available when it
needs it? How much (or how little) information needs to be preserved from the
data that is delivered by the ContentHandler interface?
To answer this question, we will need to define exactly how the XShell will
function. The pseudo-code for the application is:
while more-documents-to-process or more-interactive-commands
    while more-child-elements-of-root
        read-complete-child-element
        if element-name-recognized then
            execute-named-command
        end if
    end while
end while
It's a bit simplistic, but that's basically all that most command-line interpreters do.
The additional complication that we will face is that the read-complete-child-element step is not as
straightforward as it seems. As we can see
from the output of the SAXAnimate class, the information we need will be passed
to us a little bit at a time. It is our job to determine how much and which of the
given content we are to keep, and how to store it.
The hierarchical nature of XML data makes the stack a very popular structure for
storing arbitrary levels of document information. The simplest possible
notification sequence for a simple leaf element that contains some character
data is:
ContentHandler.startElement()
ContentHandler.characters()
ContentHandler.endElement()
Based on this, our program will need to preserve the data that is passed in
by the startElement() and characters() notifications, and act on that data when
the corresponding endElement() notification is received. The method signature
for the startElement() notification is:
public void startElement(java.lang.String namespaceURI,
                         java.lang.String localName,
                         java.lang.String qName,
                         Attributes atts)
Of all of this information, the only part we really need to preserve is the set of
attributes. The endElement() method will be given the same namespaceURI,
localName, and qName parameters when the element close tag is parsed (or
immediately, if it is an empty element). Since elements can be nested, we will
need to use the java.util.Stack object to keep track of the attributes that are
valid for the element we are currently processing. As long as the attributes are
pushed on the stack in the startElement() method and popped off in the
endElement() method, the stack is always guaranteed to be accurate. To support
this, we will need to add two members to the XShell class and write a
startElement() method that looks like this:
/**
 * Attribute stack.
 */
Stack m_stAtts = new Stack();
/**
 * Accumulated character data on stack.
 */
Stack m_stChars = new Stack();
. . .
/**
 * Sets up attribute and character data stacks for processing later by the
 * endElement() method.
 */
public void startElement(java.lang.String namespaceURI,
    java.lang.String localName, java.lang.String qName, Attributes atts)
{
    // save the attributes for this element, for use later in endElement()
    m_stAtts.push(new AttributesImpl(atts));
    // create new, empty string buffer to capture character data
    m_stChars.push(new StringBuffer());
}
Look at the startElement() method; the first push operation actually
pushes a new AttributesImpl object rather than a reference to the object that
was passed in as a parameter. The object that was passed in is not guaranteed to
be valid outside the scope of the startElement() method. This is generally true
for any of the data that is passed to a SAX notification method. Anything that
you need to preserve will need to be copied to a new object instance that your
program controls. In this case, the AttributesImpl object is an object provided
in the org.xml.sax.helpers package. It implements a copy constructor that can be
used to duplicate the contents of an object that supports the Attributes interface for later use.
The second push operation provides a new, empty Java StringBuffer object
instance that will be used to capture any character data this element might contain. Once again, since elements can be nested, a new StringBuffer instance
must be created for every element open tag. The characters() notification
method appends any character data it receives to the StringBuffer on the top
of the m_stChars stack.
/**
 * Appends new characters to the StringBuffer on top of the
 * character data stack.
 */
public void characters(char[] ch, int start, int length)
{
    ((StringBuffer)m_stChars.peek()).append(ch, start, length);
}
Finally, when the endElement() notification is received, we can examine the
element name and decide what action, if any, to take. At this point, we are saving
all of the element content for every element we encounter. We are going on
the assumption that this content is valid and will be used later. One possible
future optimization would be to set flags and ignore attribute and character data
for unrecognized elements, but since we will be expanding our application to
incorporate additional commands in the future, we will capture everything now
and deal with it later. The endElement() notification method looks like this:
/**
 * Pops saved data off of stack and executes the given
 * command.
 */
public void endElement(String namespaceURI, String localName, String qName)
{
    Attributes atts = (Attributes)m_stAtts.pop();
    String strCharData = ((StringBuffer)m_stChars.pop()).toString();
    // dispatch the message, based on the localName (XML tag name)
    if (localName.equals("echo")) {
        doEcho(atts, strCharData);
    } else if (localName.equals("dir")) {
        doDir(atts, strCharData);
    } else if (localName.equals("cd")) {
        doCD(atts, strCharData);
    }
}
This method recognizes the three command elements that the shell currently
supports (echo, dir, and cd) and dispatches the cached attribute and character
data to the worker methods that implement the actual shell logic. In the interest
of space, we will not show the implementations of all three of the doEcho(),
doDir(), and doCD() methods here. They are available (along with the rest of the
application source code) on this book's Web site. We will show the implementation
of the doCD() method because it shows the basic usage of the
Attributes interface.
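Putting the three callbacks together, the stack technique can be exercised end to end with a compact, self-contained sketch. This is not the XShell source: MiniShell, run(), and the label attribute are hypothetical, only the echo command is handled, generics and StringBuilder replace the raw Stack and StringBuffer used in the listings, and the parser comes from the JDK's JAXP factory rather than Xerces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical cut-down shell: collects the output of echo commands. */
class MiniShell extends DefaultHandler {
    private final Stack<Attributes> attsStack = new Stack<>();
    private final Stack<StringBuilder> charsStack = new Stack<>();
    final List<String> output = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Copy the attributes: the parser may reuse the object it passed in.
        attsStack.push(new AttributesImpl(atts));
        // One fresh buffer per open element, because elements can nest.
        charsStack.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        charsStack.peek().append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        Attributes atts = attsStack.pop();
        String text = charsStack.pop().toString();
        if ("echo".equals(localName)) {
            // "label" is an invented attribute, used only for this illustration.
            String label = atts.getValue("", "label");
            output.add((label == null ? "" : label) + text);
        }
    }

    /** Parses a batch document and returns what the echo commands produced. */
    static List<String> run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is populated
            XMLReader reader = factory.newSAXParser().getXMLReader();
            MiniShell shell = new MiniShell();
            reader.setContentHandler(shell);
            reader.parse(new InputSource(new StringReader(xml)));
            return shell.output;
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<batch><echo label='# '>one</echo><dir/><echo>two</echo></batch>"));
    }
}
```

Because every push in startElement() is matched by a pop in endElement(), the top of each stack always belongs to the element currently being closed, however deeply the commands are nested.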
Accessing Attribute Values Using the Attributes Interface
The Attributes interface provides access to the XML attributes that are included
(either directly or through implicit values in the DTD) in an element's open tag.
There are two basic methods for accessing attribute values: by name or by index.
It is illegal to have two attributes on a single element with the same qualified
name (meaning that the namespace URIs and local parts match). Therefore,
retrieving attribute values by name is very common when processing XML documents.
The Attributes interface provides the getType() method for getting an
attribute's type (CDATA, ID, IDREF, and so on), and the getValue() method for
getting an attribute's string value. Both methods can use either the
attribute's qualified name or its namespace URI and local name. Since no guarantees are made
that attribute values will be made available in the same order in which they
appear in the element tag, referencing by name is the only safe way to ensure
that your application is reading the attribute you intend.
When enumerating attributes, however, the getLength(), getQName(), and
getValue() methods can be used to get information about all of the attributes
that are available for a particular element. The Attributes interface is not completely
symmetric, and certain methods are accessible only using an attribute's
index (such as the getURI() method for retrieving the attribute's namespace
URI). The getIndex() methods are provided to perform a reverse lookup from
an attribute's qualified name or namespace/local part to its corresponding
index number.
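The access styles described above can be compared side by side in a short sketch. AttributeProbe and inspect() are hypothetical names, and the cd element with path and mode attributes is an invented test input, not the book's doCD() implementation. The sketch assumes a non-namespace-aware JAXP parser (the factory default), so elements and attributes are identified by their qualified names.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: reads the attributes of a cd element by name and by index. */
class AttributeProbe extends DefaultHandler {
    String pathValue;
    String pathType;
    String pathUri;
    final Map<String, String> all = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"cd".equals(qName)) {
            return;
        }
        // Access by qualified name: independent of attribute order.
        pathValue = atts.getValue("path");
        pathType = atts.getType("path");
        // Enumeration by index: visits every attribute on the element.
        for (int i = 0; i < atts.getLength(); i++) {
            all.put(atts.getQName(i), atts.getValue(i));
        }
        // Reverse lookup, then an accessor that exists only in indexed form.
        int index = atts.getIndex("path");
        if (index >= 0) {
            pathUri = atts.getURI(index);
        }
    }

    /** Parses the document and returns the populated probe. */
    static AttributeProbe inspect(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            AttributeProbe probe = new AttributeProbe();
            reader.setContentHandler(probe);
            reader.parse(new InputSource(new StringReader(xml)));
            return probe;
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AttributeProbe probe = inspect("<cd path='/tmp' mode='quiet'/>");
        System.out.println(probe.pathValue + " " + probe.pathType + " " + probe.all);
    }
}
```

The values are copied into plain strings inside startElement() because the Attributes object itself should not be used after the callback returns.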
Many of these methods are not particularly useful until an application
becomes namespace aware. Information about how SAX deals with namespaces
and qualified names is thoroughly covered in Chapter 6, "Namespace Support."
Wrapping Up
We have now built a completely functional SAX application. It parses documents,
handles notification events, and even does (some) useful work. Download
the application (complete with source code) from this book's Web site and try
the test script you find there. Try running it in interactive mode and enter some
invalid XML. You'll probably find that the default error handling leaves something
to be desired. We'll tackle that next, in Chapter 3, "Handling Errors."