perfectxml.com
 


The Book of SAX: The Simple API for XML
By: W. Scott Means and Michael A. Bodie

The Book of SAX includes everything XML and Java developers need to write SAX applications. Specific examples show how to use SAX to solve XML parsing problems that are impractical to address with tree-based technologies, including real-time parsing, very large documents, and high-performance applications. The authors guide readers through the development of picoSAX, a functioning SAX 2.0 XML parser.

ISBN: 1886411778

Excerpted with permission from The Book of SAX, by W. Scott Means and Michael A. Bodie. ©2002 No Starch Press


Book Review:

The Book of SAX is an excellent tutorial and reference guide for all Java SAX developers. The first part of this book (nine chapters) focuses on teaching SAX programming with Java, and the second part is the complete SAX 2.0 reference guide.

The book begins with an introduction to SAX and compares it with DOM. The second chapter shows the steps required to build an XML parsing application using SAX. The third chapter focuses on error handling with SAX. The next three chapters (4, 5, and 6) discuss topics such as character encoding issues, the Locator interface, capturing DTD information, and namespace support in SAX. Chapter 7 discusses the XMLFilter interface in detail. Chapter 8 outlines the steps required to migrate SAX 1.0 applications to SAX 2.0.

The book also shows how to write a simple SAX parser in Java. This parser is called picoSAX and is described in chapter 9.

Chapter 10 is an excellent SAX 2.0 API reference and discusses each and every SAX 2.0 interface and its methods and properties.

The final chapter (11) in the book talks about the deprecated SAX 1.0 API.

In summary, a must-have book for all Java SAX developers.

Chapter 2: A Complete SAX Application

This chapter walks you through the process of writing your very first SAX application.

This chapter discusses the following topics:
    

What Is SAX?

In reality, SAX is nothing more than a set of Java class and interface descriptions that document a system for writing event-driven XML applications. The SAX specification (along with the source code for a set of Java interfaces and classes) lives on its own Web site (www.saxproject.org) and is still maintained and extended by the members of the XML-DEV mailing list. But to actually write a SAX application, you need an XML parser that has a concrete implementation of the various interfaces and classes that make up the org.xml.sax and org.xml.sax.helpers Java packages.
    

Where Do I Get It?

Before we get into the nitty-gritty details of writing a SAX application, you’ll need to install an XML parser distribution that supports SAX. Although SAX was originally designed for Java programmers, implementations of the SAX interfaces have been written for many other programming languages such as Perl, C++, and Visual Basic. For the rest of this book, we’ll assume that you’re writing Java applications that use the Xerces parser distribution that is supported by the Apache Foundation’s XML Project (http://xml.apache.org).

After downloading the Xerces package from the Apache XML Project Web site, you’ll need to follow the installation instructions that go with the specific version you will be using. The examples in this book have been tested with Xerces version 1.4.0, but they should work with any other parser that supports the SAX 2.0 standard. The installation process for 1.4.0 amounts to no more than unzipping the package in a directory on the local hard drive using either the Java jar utility or the unzipping program of your choice (we use WinZip 8.0 on our Windows machines).

You will also need a copy of the Java 2 SDK to compile and execute your SAX application. The examples in this book were compiled using the JDK version 1.2.2 that is installed by default by the Borland JBuilder version 3.0 Java Integrated Development Environment (IDE).

What Version Number Was That?

Observant readers will notice that the version numbers for the products mentioned in this section are (in some cases) one or two major revision numbers out of date. This is actually a conscious decision (if a somewhat unconventional one). After reviewing the additional features and bug fixes that were available for newer versions of these products, the trouble spent upgrading seemed to outweigh the value that would be gained by using the more recent version. Since we may be using an older version of these tools than you are, they should be forward compatible with more recent versions. If we had used the most recent versions, the reverse would probably not have been true.

The examples in this book were developed using the Borland JBuilder Foundation 3.0 IDE. Although it was lacking in a few areas, it is an excellent value for the money (the lowest-end version of JBuilder is available for download from Borland at no cost). It can be obtained from the Borland Web site (www.borland.com). Although an IDE (such as JBuilder) can certainly make software development less tedious, there is no requirement that you use one. The examples in this book could just as well have been developed using a basic text editor and command-line tools.
    

Configuring Your Environment

The following instructions assume that you’re using the Xerces parser. If you’re not, you’ll need to follow the instructions that come with your parser to add the org.xml.sax and org.xml.sax.helpers packages to your system’s Java CLASSPATH environment variable.

If you are using Xerces, you should add the xerces.jar file in the root of the distribution directory to your system’s CLASSPATH environment variable. This can be done through the System applet of the Control Panel for most versions of Windows. This will allow java.exe (the JDK Java runtime) to locate the SAX classes at runtime without your needing to supply their location on the command line.
    

Basic SAX Application Flow

Every SAX application needs to go through the same basic steps to process XML documents:
  1. Obtain a reference to an object that implements the XMLReader interface.
  2. Create an instance of an application-specific object that implements one or more of the various SAX *Handler (DTDHandler, ContentHandler, and ErrorHandler) interfaces.
  3. Register the object instance with the XMLReader object so that it will receive notifications as XML parsing events occur.
  4. Call the XMLReader.parse() method for each XML document that needs to be processed by the application. The object instance that was registered in step 3 will receive notifications progressively as the document is parsed.
It is up to the application-specific object (or objects) to track and process the information that is delivered via the various event notification methods that it implements. For example, an application that wants to strip markup out of an XML document and leave only the text content would need to implement the ContentHandler interface and implement its specific processing in the characters() event callback method. Even though the setup for every SAX application is almost identical, the data structures and algorithms that process the event notifications will vary widely depending on what the application is designed to do.
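The four steps above, applied to the markup-stripping example, can be sketched as follows. This is a minimal illustration rather than code from the book: the StripMarkup class and its extractText() method are hypothetical names, and the XMLReader is obtained through the JDK's bundled JAXP SAXParserFactory (an assumption made so the sketch runs without a separate Xerces download).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example class: strips markup and keeps only the text content. */
class StripMarkup extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks, so always append.
        text.append(ch, start, length);
    }

    static String extractText(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        // Step 1: obtain an XMLReader (assumption: via JAXP, not Xerces directly).
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Step 2: create the application-specific handler object.
        StripMarkup handler = new StripMarkup();
        // Step 3: register it to receive content notifications.
        reader.setContentHandler(handler);
        // Step 4: parse; characters() is called as the document is read.
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints "Hello, SAX."
        System.out.println(extractText("<note>Hello, <b>SAX</b>.</note>"));
    }
}
```

Only characters() is overridden here; every other callback falls through to the empty implementations inherited from DefaultHandler.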
    

The XShell Sample Application

To illustrate the various features (and pitfalls) of SAX, we will be developing a complex example throughout the tutorial chapters. This example will implement a simple command-line shell interface (similar to command.com or cmd.exe) where the commands are encoded as XML elements. Keep in mind that our goal is not to build a full-featured command-line interpreter, but to show off the various features of SAX that make it ideally suited to interactive XML applications. Whenever a trade-off between application functionality and clarity of SAX usage needs to be made, SAX will always come out the winner.
The full source code for this application (as well as other goodies) is available at this book’s Web site (www.bookofsax.com). Throughout the book, we will call attention to complementary information that can be found at the Web site through the use of the Web site icon.
For now, we will define the XShell application as a Java program that accepts zero or more XML document URIs from the command line, parses each given document, and executes the recognized batch commands within. If no URIs are given on the command line, the XML commands will be read from the System.in InputStream. Input will be batched one line at a time until a blank line is detected. The batch will then be parsed using the same algorithm that is used to parse documents read from URIs. Input will be terminated by an empty batch (no XML content).
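The interactive batching rule described above (collect lines until a blank line, stop on an empty batch) could look roughly like this. It is a sketch under our own assumptions, not the book's doCLI() implementation; the BatchReader class and readBatch() method are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical helper; not the book's doCLI() implementation. */
class BatchReader {
    /**
     * Reads lines until a blank line or end of input and returns them as one
     * batch. An empty string means "empty batch", which ends the session.
     */
    static String readBatch(BufferedReader in) throws IOException {
        StringBuilder batch = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null && !line.trim().isEmpty()) {
            batch.append(line).append('\n');
        }
        return batch.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulated console input: two batches, then an empty batch.
        BufferedReader in = new BufferedReader(
                new StringReader("<echo>one</echo>\n\n<dir/>\n<cd/>\n\n\n"));
        String batch;
        while (!(batch = readBatch(in)).isEmpty()) {
            // A real shell would wrap the batch in an InputSource and parse it.
            System.out.print("batch:\n" + batch);
        }
    }
}
```

Returning the batch as a String keeps the sketch easy to test; it can be handed to the parser by wrapping it in a StringReader and an InputSource.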

Listing 2-1 shows the basic skeleton of the XShell class, including the main() method that acts as the command-line entry point.

Listing 2-1: The XShell class skeleton
 public class XShell extends DefaultHandler { 
	. . . 
	/** 
	* Command line entry point. 
	*/ 
	public static void main(String[] args) 
	{ 
		// create a new XShell session object and set the output to go to stdout 
		XShell xbSession = new XShell(System.out); 
		
		try { 
			if (args.length > 0) { 
				// user provided document URIs on the command line 
				for (int i = 0; i < args.length; i++) {
					InputSource is = new InputSource(args[i]);

					xbSession.doBatch(is);
				}
			} else { 
				// no command line argument provided, read from stdin 
				xbSession.doCLI(System.in); 
			} 
		} catch (IOException ioe) { 
			System.err.println(ioe); 
		} catch (SAXException se) { 
			System.err.println(se); 
		} 
	} 

    

Using the XMLReader Interface

The primary entry point to any SAX implementation is the XMLReader interface. This interface contains methods for:
  • Controlling how the underlying XML parser will operate (validating versus nonvalidating, and so on).
  • Enabling and disabling specific SAX features (such as namespace processing).
  • Registering object instances to receive XML parsing notifications (via the *Handler interfaces).
  • Initiating the parsing process on a specific document URI or input source (via the parse() methods).
Before an application can use the XMLReader interface, it must first obtain a reference to an object that implements it. How the XMLReader interface is supported is a decision that is left up to the implementers of the particular SAX distribution. For instance, the Xerces package supplies a class called org.apache.xerces.parsers.SAXParser that implements the XMLReader interface. Any application that uses Xerces to provide SAX support can simply create a new instance of the SAXParser class and use it immediately.

The SAX specification does define a special helper class (from the org.xml.sax.helpers package) called XMLReaderFactory that is intended to act as a class factory for XMLReader instances. It has two static methods for creating a new XMLReader object instance:

	XMLReaderFactory.createXMLReader(); 
	XMLReaderFactory.createXMLReader(String className); 
Of course, both of these methods require that the name of the class that supports the XMLReader interface be known in advance: the first reads it from the org.xml.sax.driver system property, and the second takes it as an argument. Since all that these methods do is use that class name to create a new object instance, this helper class is, at best, of questionable utility.

The XShell constructor obtains its XMLReader instance by directly instantiating the SAXParser class from the Xerces package:
	m_xrShell = (XMLReader)new org.apache.xerces.parsers.SAXParser();
Now that we have an XMLReader object instance to work with, we can register our class to receive XML parse callback notifications.
    

Implementing the ContentHandler Interface

The most interesting interface in the SAX package, from the application author’s standpoint, is the ContentHandler interface. This interface provides notification of:
  • documents starting and ending
  • element start and end tags recognized
  • character data found
  • namespace prefixes going in and out of scope
  • processing instructions found
  • ignorable whitespace read
For normal XML applications, XML element and text content are the most important types of data that will be processed. That is why almost every SAX application will want to implement the ContentHandler interface and register to receive its notifications. It is a rather large interface, however, and many applications will neither need nor want to receive all of the notifications it offers.

To save application writers from having to implement half a dozen methods that they have no intention of using, SAX provides the DefaultHandler class. This class provides a default empty implementation of all of the callback interfaces supported by SAX 2.0, including the EntityResolver, DTDHandler, ContentHandler, and ErrorHandler interfaces. By deriving your SAX application object from DefaultHandler, you can immediately register to receive any of the callback notifications without writing empty methods to satisfy the various Java interface specifications. The XShell class is derived from the DefaultHandler class directly, like this:
	public class XShell extends DefaultHandler { . . .
Then, in the XShell constructor, it registers itself to receive ContentHandler notifications using the XMLReader object instance it just created:

	m_xrShell = (XMLReader)new org.apache.xerces.parsers.SAXParser();
	m_xrShell.setContentHandler(this);
Now that the XShell instance is ready to receive content notifications, the application is ready to parse an XML document.
    

Parsing and Processing

Whether the XML command-line instructions come from an external document or from System.in, both cases end up calling the doBatch() method:

	/** 
	* Given an InputSource object, parses and executes the commands given. 
	*/ 
	public void doBatch(InputSource is) throws IOException 
	{ 
		try { 
			m_xrShell.parse(is); 
		} catch (SAXException se) { 
			// we just eat it, we already notified the user
		} 
	} 
This very simple method calls the parse() method of the XMLReader instance that was created by the XShell constructor. The parse() method does not return until either the document referenced by the passed InputSource has been completely parsed, or a fatal parsing error occurs. In either case, no indication of the result is returned to the application by the parse() method. Error handling is covered thoroughly in Chapter 3.
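Because parse() has no return value, the only way a caller learns that parsing failed is through an exception. The following sketch illustrates that behavior; the ParseOutcome class and parsesCleanly() method are hypothetical names, and the reader again comes from the JDK's JAXP factory rather than from Xerces directly.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Hypothetical example: turns the outcome of parse() into a boolean. */
class ParseOutcome {
    static boolean parsesCleanly(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        try {
            // parse() is declared void: a normal return means the whole
            // document was read without a fatal error.
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException se) {
            // A fatal parsing error (such as malformed XML) arrives here.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parsesCleanly("<root/>"));  // well-formed
        System.out.println(parsesCleanly("<root>"));   // missing close tag
    }
}
```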

After the XML parser has been invoked, the SAX interface begins calling notification methods for the object instances that were registered by the set*Handler() methods of the XMLReader interface. To help you understand the sequence in which notifications are sent by SAX, we’ve developed a simple utility called com.bookofsax.SAXAnimate that prints the interface and method names of every SAX callback for a given XML document. Given the following simple XML document:
	<?xml version="1.0" encoding="UTF-8"?> 
	<root> 
	<element xmlns="http://namespaces.bookofsax.com/null"> 
	<message>Hello, SAX.</message> 
	</element> 
	</root> 
The SAXAnimate utility generates the following list of notification calls:

	ContentHandler.setDocumentLocator()
	ContentHandler.startDocument()
	ContentHandler.startElement()
	ContentHandler.characters()
	ContentHandler.startPrefixMapping()
	ContentHandler.startElement()
	ContentHandler.characters()
	ContentHandler.startElement()
	ContentHandler.characters()
	ContentHandler.endElement()
	ContentHandler.characters()
	ContentHandler.endElement()
	ContentHandler.endPrefixMapping()
	ContentHandler.characters()
	ContentHandler.endElement()
	ContentHandler.endDocument()
By reading through this trace and matching it to the content in the XML document, you should be able to more easily understand how SAX sequences notifications to ensure that your application has an accurate picture of the XML document at any point in time. For instance, the namespace prefix mapping notification is sent before the startElement() notification for the element that declares it, since elements and attributes can use a namespace as soon as it is declared.
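A trace like the one above can be produced with a handler that simply records each callback as it arrives. The sketch below is written in the spirit of SAXAnimate but is not the book's utility: the CallbackTracer class is a hypothetical name, it records only the ContentHandler callbacks that appear in the trace, and it uses the JDK's JAXP factory with namespace awareness switched on so that prefix-mapping events are delivered.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical tracer in the spirit of SAXAnimate; not the book's utility. */
class CallbackTracer extends DefaultHandler {
    final List<String> calls = new ArrayList<>();

    @Override public void setDocumentLocator(Locator locator) { calls.add("setDocumentLocator"); }
    @Override public void startDocument() { calls.add("startDocument"); }
    @Override public void endDocument() { calls.add("endDocument"); }
    @Override public void startPrefixMapping(String prefix, String uri) { calls.add("startPrefixMapping"); }
    @Override public void endPrefixMapping(String prefix) { calls.add("endPrefixMapping"); }
    @Override public void characters(char[] ch, int start, int length) { calls.add("characters"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        calls.add("startElement " + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        calls.add("endElement " + localName);
    }

    static List<String> trace(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for prefix-mapping events
        XMLReader reader = factory.newSAXParser().getXMLReader();
        CallbackTracer tracer = new CallbackTracer();
        reader.setContentHandler(tracer);
        reader.parse(new InputSource(new StringReader(xml)));
        return tracer.calls;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n<element xmlns=\"http://namespaces.bookofsax.com/null\">\n"
                + "<message>Hello, SAX.</message>\n</element>\n</root>";
        trace(xml).forEach(System.out::println);
    }
}
```

A parser is free to split a run of text across several characters() calls, so the number of characters entries may differ between parsers even though the relative order of the other callbacks does not.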

Now, given this understanding, how will the XShell application process notifications to ensure that it will have the appropriate information available when it needs it? How much (or how little) information needs to be preserved from the data that is delivered by the ContentHandler interface? To answer this question, we will need to define exactly how the XShell will function. The pseudo-code for the application is:
	while more-documents-to-process or more-interactive-commands
		while more-child-elements-of-root
			read-complete-child-element
			if element-name-recognized then
				execute-named-command
			end if
		end while
	end while
It’s a bit simplistic, but that’s basically all that most command-line interpreters do. The additional complication that we will face is that the read-complete-child-element step is not as straightforward as it seems. As we can see from the output of the SAXAnimate class, the information we need will be passed to us a little bit at a time. It is our job to determine how much and which of the given content we are to keep, and how to store it.

The hierarchical nature of XML data makes the stack a very popular structure for storing arbitrary levels of document information. The simplest possible notification sequence for a simple leaf element that contains some character data is:

ContentHandler.startElement()
ContentHandler.characters()
ContentHandler.endElement()
Based on this, our program will need to preserve the data that is passed in by the startElement() and characters() notifications, and act on that data when the corresponding endElement() notification is received. The method signature for the startElement() notification is:

public void startElement(java.lang.String namespaceURI,
		java.lang.String localName,
		java.lang.String qName,
		Attributes atts)
Of all of this information, the only part we really need to preserve is the set of attributes. The endElement() method will be given the same namespaceURI, localName, and qName parameters when the element close tag is parsed (or immediately, if it is an empty element). Since elements can be nested, we will use a java.util.Stack object to keep track of the attributes that are valid for the element we are currently processing. As long as the attributes are pushed on the stack in the startElement() method and popped off in the endElement() method, the stack is always guaranteed to be accurate. To support this, we will need to add two members to the XShell class and write a startElement() method that looks like this:

	/**
	* Attribute stack.
	*/
	Stack m_stAtts = new Stack();
	
	/**
	* Accumulated character data on stack.
	*/
	Stack m_stChars = new Stack();
	
	. . .
	/**
	* Sets up attribute and character data stacks for processing later by the
	* endElement() method.
	*/
	public void startElement(java.lang.String namespaceURI,
		java.lang.String localName, java.lang.String qName, Attributes atts)
	{
		// save the attributes for this element, for use later in endElement()
		m_stAtts.push(new AttributesImpl(atts));

		// create new, empty string buffer to capture character data
		m_stChars.push(new StringBuffer());
	}
Look at the startElement() method; the first push operation actually pushes a new AttributesImpl object rather than a reference to the object that was passed in as a parameter. The object that was passed in is not guaranteed to be valid outside the scope of the startElement() method. This is generally true for any of the data that is passed to a SAX notification method. Anything that you need to preserve will need to be copied to a new object instance that your program controls. In this case, the AttributesImpl object is an object provided in the org.xml.sax.helpers package. It implements a copy constructor that can be used to duplicate the contents of an object that supports the Attributes interface for later use.
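The effect of the copy constructor can be seen in isolation with a small sketch. Here an AttributesImpl stands in for the parser-owned object, and changing it afterward simulates a parser reusing that object for the next element; the AttributesCopyDemo class, the snapshot() method, and the "path" attribute are hypothetical names chosen for illustration.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical demo of copying attributes that the parser may later reuse. */
class AttributesCopyDemo {
    /** Takes an independent snapshot using the AttributesImpl copy constructor. */
    static AttributesImpl snapshot(Attributes atts) {
        return new AttributesImpl(atts);
    }

    public static void main(String[] args) {
        // Stand-in for the object a parser passes to startElement().
        AttributesImpl parserOwned = new AttributesImpl();
        parserOwned.addAttribute("", "path", "path", "CDATA", "/tmp");

        AttributesImpl saved = snapshot(parserOwned);

        // Simulate the parser overwriting its object for the next element.
        parserOwned.setValue(0, "/home");

        System.out.println(saved.getValue("path"));       // prints /tmp
        System.out.println(parserOwned.getValue("path")); // prints /home
    }
}
```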

The second push operation provides a new, empty Java StringBuffer object instance that will be used to capture any character data this element might contain. Once again, since elements can be nested, a new StringBuffer instance must be created for every element open tag. The characters() notification method appends any character data it receives to the StringBuffer on the top of the m_stChars stack.

	/**
	* Appends new characters to the StringBuffer on top of the
	* character data stack.
	*/
	public void characters(char[] ch, int start, int length)
	{
		((StringBuffer)m_stChars.peek()).append(ch, start, length);
	}
Finally, when the endElement() notification is received, we can examine the element name and decide what action, if any, to take. At this point, we are saving all of the element content for every element we encounter. We are going on the assumption that this content is valid and will be used later. One possible future optimization would be to set flags and ignore attribute and character data for unrecognized elements, but since we will be expanding our application to incorporate additional commands in the future, we will capture everything now and deal with it later. The endElement() notification method looks like this:

	/**
	* Pops saved data off of stack and executes the given
	* command.
	*/
	public void endElement(String namespaceURI, String localName, String qName)
	{
		Attributes atts = (Attributes)m_stAtts.pop();
		String strCharData = ((StringBuffer)m_stChars.pop()).toString();

		// dispatch the message, based on the localName (XML tag name)
		if (localName.equals("echo")) {
			doEcho(atts, strCharData);
		} else if (localName.equals("dir")) {
			doDir(atts, strCharData);
		} else if (localName.equals("cd")) {
			doCD(atts, strCharData);
		}
	}
This method recognizes the three command elements that the shell currently supports (echo, dir, and cd) and dispatches the cached attribute and character data to the worker methods that implement the actual shell logic. In the interest of space, we will not show the implementations of all three of the doEcho(), doDir(), and doCD() methods here. They are available (along with the rest of the application source code) on this book’s Web site. We will show the implementation of the doCD() method because it does show the basic usage of the Attributes interface.
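Put together, the three callbacks form a pattern that can be tried out in a few dozen lines. The following is a reduced, hypothetical miniature of the handler, not the XShell source: the MiniShell class recognizes only echo and collects its text in a list instead of calling doEcho(), it swaps in ArrayDeque and StringBuilder for Stack and StringBuffer, and it obtains its reader from the JDK's JAXP factory.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical miniature of the XShell handler; only echo is recognized. */
class MiniShell extends DefaultHandler {
    private final Deque<Attributes> attsStack = new ArrayDeque<>();
    private final Deque<StringBuilder> charsStack = new ArrayDeque<>();
    final List<String> echoed = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        attsStack.push(new AttributesImpl(atts)); // copy: atts is parser-owned
        charsStack.push(new StringBuilder());     // fresh buffer per open tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text always belongs to the innermost open element.
        charsStack.peek().append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        attsStack.pop(); // unused by echo, but popped to keep the stacks in step
        String text = charsStack.pop().toString();
        if (localName.equals("echo")) {
            echoed.add(text);
        }
    }

    static List<String> run(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        XMLReader reader = factory.newSAXParser().getXMLReader();
        MiniShell shell = new MiniShell();
        reader.setContentHandler(shell);
        reader.parse(new InputSource(new StringReader(xml)));
        return shell.echoed;
    }

    public static void main(String[] args) throws Exception {
        // prints [hello, world]
        System.out.println(run("<batch><echo>hello</echo><dir/><echo>world</echo></batch>"));
    }
}
```

Because each open tag gets its own buffer, text inside a nested element never leaks into its parent's buffer.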
    

Accessing Attribute Values Using the Attributes Interface

The Attributes interface provides access to the XML attributes that are included (either directly or through implicit values in the DTD) in an element’s open tag. There are two basic methods for accessing attribute values: by name or by index.

It is illegal to have two attributes on a single element with the same qualified name (meaning that the namespace URIs and local parts match). Therefore, retrieving attribute values by name is very common when processing XML documents. The Attributes interface provides the getType() method for getting an attribute’s type (CDATA, ID, IDREF, and so on), and the getValue() method for getting an attribute’s string value. Both methods can use either the attribute’s qualified name or its namespace URI and local name. Since no guarantees are made that attribute values will be made available in the same order in which they appear in the element tag, referencing by name is the only safe way to ensure that your application is reading the attribute you intend.

When enumerating attributes, however, the getLength(), getQName(), and getValue() methods can be used to get information about all of the attributes that are available for a particular element. The Attributes interface is not completely symmetric, and certain methods are accessible only using an attribute’s index (such as the getURI() method for retrieving the attribute’s namespace URI). The getIndex() methods are provided to perform a reverse lookup from an attribute’s qualified name or namespace/local part to its corresponding index number.
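Both access styles can be exercised without a parser by filling an AttributesImpl by hand. This is a sketch only: the AttributeAccess class, its two helper methods, and the sample attribute names and namespace URI are hypothetical and do not come from the XShell source.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical helpers showing by-index and by-name attribute access. */
class AttributeAccess {
    /** Enumerates by index using getLength(), getQName(), getValue(), getType(). */
    static List<String> describe(Attributes atts) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < atts.getLength(); i++) {
            out.add(atts.getQName(i) + "=" + atts.getValue(i) + " (" + atts.getType(i) + ")");
        }
        return out;
    }

    /** getURI() takes only an index, so look the index up by name first. */
    static String namespaceOf(Attributes atts, String qName) {
        int index = atts.getIndex(qName); // -1 when no such attribute exists
        return index < 0 ? null : atts.getURI(index);
    }

    public static void main(String[] args) {
        // Sample data; the names and the urn:example namespace are made up.
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "path", "path", "CDATA", "/tmp");
        atts.addAttribute("urn:example", "mode", "x:mode", "CDATA", "fast");

        System.out.println(describe(atts));              // every attribute, by index
        System.out.println(atts.getValue("path"));       // by qualified name
        System.out.println(atts.getValue("urn:example", "mode")); // by URI + local name
        System.out.println(namespaceOf(atts, "x:mode")); // reverse lookup, then getURI()
    }
}
```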

Many of these methods are not particularly useful until an application becomes namespace aware. Information about how SAX deals with namespaces and qualified names is thoroughly covered in Chapter 6, “Namespace Support.”
    

Wrapping Up

We have now built a completely functional SAX application. It parses documents, handles notification events, and even does (some) useful work. Download the application (complete with source code) from this book’s Web site and try the test script you find there. Try running it in interactive mode and enter some invalid XML. You’ll probably find that the default error handling leaves something to be desired. We’ll tackle that next, in Chapter 3, “Handling Errors.”




