perfectxml.com
  You are here: home Info Bank Articles » Write a Simple XML Parser Saturday, 23 February 2008
 


Write a Simple XML Parser

Ever run into a problem where data is delivered in XML, but the client has to be lightweight and cannot afford the baggage of a DOM or SAX parser? An example is an applet running on a web page. Most widely accepted applets use Java version 1.1.x. What do you do when you need to parse an XML document in such an applet? My answer: write a lightweight XML parser yourself. Let's look at the kinds of parsers and decide on a design for our lightweight parser.

Types of parsers

  1. DOM: This kind of parser reads the complete XML file into memory and creates a tree structure. Because the whole XML file is loaded into memory, this can become quite memory intensive.

  2. SAX: These parsers read the XML file sequentially as a stream and generate an event each time an element is found. The application handles these events and takes action when it finds the elements it needs. SAX parsers are stream-based, event-based push parsers. A SAX parser is much more lightweight than a DOM parser, but it is still overkill for small apps and applets.

  3. Poll Parser: This kind of parser looks for elements in an XML file and drills down to the specific elements that the application is interested in. This is very effective when the XML is small and no XML validation is required.

In our design, we will exploit the way XML is structured and make certain assumptions for the time being. Once we have a simple XML parser that works for us, we can extend it to handle special cases later. Let's examine the structure of an XML file first:

Sample XML file:

<?xml version="1.0"?>
<!DOCTYPE STUDENTS [
<!ELEMENT STUDENTS (STUDENT*)>
<!ELEMENT STUDENT (NAME, AGE, CLASS)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT AGE (#PCDATA)>
<!ELEMENT CLASS (#PCDATA)>
]>


<STUDENTS>
        <STUDENT>               
                <NAME>HillyBilly</NAME> 
                <AGE>19</AGE> 
                <CLASS>10</CLASS> 
        </STUDENT>

        <STUDENT>               
                <NAME>Captain Kirk</NAME> 
                <AGE>20</AGE> 
                <CLASS>10</CLASS> 
        </STUDENT>

</STUDENTS>

Notice that in an XML file, everything exists as a tree structure. That means each element either contains child elements or holds data. For example, the STUDENT tag contains three child nodes, namely NAME, AGE and CLASS. Similarly, the STUDENTS tag contains zero or more STUDENT tags (STUDENT* in the DTD). This tree structure makes it easy to drill down to a particular element. We will exploit this structure to write our parser.

Parser Limitations

Let's lay down certain limitations as well. Since XML can get much more complex than the example selected, we need to know what our parser will not handle. Our parser has the following limitations:
  1. Only complete XML tags are supported, i.e. the parser does not support empty elements, which are terminated in the start tag itself, e.g. <AGE />. This is equivalent to writing <AGE></AGE>. Due to the simple nature of our parser, let's support only the latter form.

  2. No support for attributes: XML tags may have associated attributes, which we will not support in this parser, e.g. <CLASS division="B">10</CLASS>. Because the parser searches for the literal start tag <CLASS>, it will not match this element at all: it does not discover 10 as the value of the CLASS tag, and it cannot be queried for the division attribute, because it has no idea about it.
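Both limitations come down to literal string matching. The following is a minimal sketch, assuming the parser looks for the exact text "<" + tagName + ">" (as the implementation further down does). The class LimitationDemo and the helper hasLiteralBeginTag are hypothetical names used only for illustration.

```java
// Sketch: why literal tag matching misses attributes and empty elements.
class LimitationDemo
{
	// Hypothetical helper: true if the literal begin tag <tagName> occurs in xml.
	static boolean hasLiteralBeginTag(String xml, String tagName)
	{
		return xml.indexOf("<" + tagName + ">") != -1;
	}

	public static void main(String[] args)
	{
		// Plain tag: the literal text <CLASS> is present.
		System.out.println(hasLiteralBeginTag("<CLASS>10</CLASS>", "CLASS"));
		// Tag with an attribute: <CLASS division="B"> is not the text <CLASS>.
		System.out.println(hasLiteralBeginTag("<CLASS division=\"B\">10</CLASS>", "CLASS"));
		// Empty element: <AGE /> is not the text <AGE>.
		System.out.println(hasLiteralBeginTag("<AGE />", "AGE"));
	}
}
```

Only the first call prints true. In the other two cases the begin tag is never found, so a parser built on this kind of matching simply returns no value for the element.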

Now that we have laid down the limitations and had a look at a sample XML file to parse, we can start on the design and implementation of this parser.

Implementation

We need to implement just one method. Yes, that's right, just ONE. If we look closely at the way XML is structured, we are only ever concerned with the value of a given element or elements. For example, in the above XML file, the value of the STUDENTS tag is a list of STUDENT records. In other words, the value of the STUDENTS tag is the text enclosed within <STUDENTS> and </STUDENTS>. Once we have this text, each STUDENT record is the text that comes between <STUDENT> and </STUDENT>, and so on down to the leaf nodes NAME, AGE and CLASS.

Therefore, in order to parse an XML file, we need a single function that returns the text that comes between the start and end of any given tag. By calling it repeatedly, we can narrow down to the value contained in the tag we want, which is the data we can process. Using this single function, it is possible to drill down to the data contained by any tag in an XML file. Let's define the prototype of the function:

public static Vector getXMLTagValue(String xmlFileString, String tagName) throws Exception;


The above function takes the XML file, or a subset of an XML file, as a String, along with the name of the tag to extract. The values of all tags in the XML String that match the given tag name are extracted and added to the Vector as Strings. Therefore, to extract the list of STUDENT tags in the XML file, one writes the following code:

// First get the XML file in a string.
String xmlFile = getXMLFile("x.xml");
Vector v = getXMLTagValue(xmlFile, "STUDENT");

This will give us the complete list of STUDENT tag values as Strings filled in the Vector. To drill down to a particular STUDENT tag, we need to extract each element of the Vector. Each element of the Vector can now be used as the xmlFile string for further calls to getXMLTagValue, in order to drill down further.
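The getXMLFile call above is not defined in this article. Here is a minimal sketch of what such a helper might look like, assuming the XML lives in a local file and using only classes available in Java 1.1. The class name XMLFileReader is hypothetical, and an applet would more likely read from a stream opened on a URL than from the local file system.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class XMLFileReader
{
	// Hypothetical helper: reads the whole file into one String,
	// ending every line with a single '\n'.
	static String getXMLFile(String fileName) throws IOException
	{
		StringBuffer contents = new StringBuffer();
		BufferedReader reader = new BufferedReader(new FileReader(fileName));
		try
		{
			String line;
			while((line = reader.readLine()) != null)
			{
				contents.append(line).append('\n');
			}
		}
		finally
		{
			// Always release the file handle, even if reading fails
			reader.close();
		}
		return contents.toString();
	}

	public static void main(String[] args) throws IOException
	{
		// Usage: java XMLFileReader x.xml
		System.out.println(getXMLFile(args[0]));
	}
}
```

Reading the whole file into memory is acceptable here only because the parser is meant for small XML files.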

Now suppose I want to extract the age of the first STUDENT. I can do the following:

// elementAt returns an Object, so cast it back to a String first.
Vector ageV = getXMLTagValue((String) v.elementAt(0), "AGE");
System.out.println("Age is  : " + ageV.elementAt(0));

As you can see, this function is powerful enough to extract all values from an XML file, within the limitations we have decided on. Let's get on to the code of the function.

package jr.util.xml;

import java.util.*;

public class XMLParser
{
	/**
	 * Returns, as Strings in a Vector, the text enclosed by every
	 * begin/end tag pair with the given name found in xml.
	 * Pass only the name of the section, for example "QUESTION".
	 * Throws an Exception if a begin tag has no matching end tag.
	 */
	public static Vector getXMLTagValue(String xml, String section) throws Exception
	{
		Vector v = new Vector();
		String beginTagToSearch = "<" + section + ">";
		String endTagToSearch = "</" + section + ">";

		// Look for the first occurrence of the begin tag
		int index = xml.indexOf(beginTagToSearch);

		while(index != -1)
		{
			// The value starts right after the begin tag
			int valueStart = index + beginTagToSearch.length();

			// Look for the end tag that follows this begin tag
			// DOES NOT HANDLE <section Blah /> or nested tags of the same name
			int lastIndex = xml.indexOf(endTagToSearch, valueStart);

			// Make sure there is no error
			if(lastIndex == -1)
				throw new Exception("Parse Error");

			// Extract the substring and add it to our list of tag values
			v.addElement(xml.substring(valueStart, lastIndex));

			// Try it again. Continue the search in the part of the string
			// which is not processed yet, i.e. after the end tag.
			index = xml.indexOf(beginTagToSearch, lastIndex + endTagToSearch.length());
		}

		return v;
	}
}
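To see the drill-down end to end, here is a small sketch that walks every STUDENT record in the sample XML. It repeats a condensed variant of getXMLTagValue so that it compiles on its own. The class StudentListDemo and the helper firstValue are hypothetical names that are not part of the parser.

```java
import java.util.Vector;

class StudentListDemo
{
	// Condensed variant of getXMLTagValue, repeated so this sketch is self-contained.
	static Vector getXMLTagValue(String xml, String section) throws Exception
	{
		Vector v = new Vector();
		String beginTag = "<" + section + ">";
		String endTag = "</" + section + ">";
		int index = xml.indexOf(beginTag);
		while(index != -1)
		{
			int valueStart = index + beginTag.length();
			int lastIndex = xml.indexOf(endTag, valueStart);
			if(lastIndex == -1)
				throw new Exception("Parse Error");
			v.addElement(xml.substring(valueStart, lastIndex));
			index = xml.indexOf(beginTag, lastIndex + endTag.length());
		}
		return v;
	}

	// Hypothetical helper: first value of tagName in xml, trimmed, or "" if absent.
	static String firstValue(String xml, String tagName) throws Exception
	{
		Vector values = getXMLTagValue(xml, tagName);
		return values.isEmpty() ? "" : ((String) values.elementAt(0)).trim();
	}

	public static void main(String[] args) throws Exception
	{
		String xml =
			"<STUDENTS>"
			+ "<STUDENT><NAME>HillyBilly</NAME><AGE>19</AGE><CLASS>10</CLASS></STUDENT>"
			+ "<STUDENT><NAME>Captain Kirk</NAME><AGE>20</AGE><CLASS>10</CLASS></STUDENT>"
			+ "</STUDENTS>";

		// Each element of the Vector is the inner text of one STUDENT record
		Vector students = getXMLTagValue(xml, "STUDENT");
		for(int i = 0; i < students.size(); i++)
		{
			String student = (String) students.elementAt(i);
			System.out.println(firstValue(student, "NAME")
				+ ", age " + firstValue(student, "AGE")
				+ ", class " + firstValue(student, "CLASS"));
		}
	}
}
```

Running it prints one line per student: "HillyBilly, age 19, class 10" followed by "Captain Kirk, age 20, class 10". Searching for STUDENT does not accidentally match the enclosing STUDENTS tag, because the begin tag being searched for includes the closing ">".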

Summary

This simple XML parser can be used to extract values from an XML file when the file is small and the application cannot take the baggage of a SAX or DOM parser. It is a nice XML parser to use for applets which receive information in XML format. Because of its limitations, it will not work with every XML file. It is therefore most effective when both the XML file and the client application are written by the same developer or team.

Please let me know if there are any more limitations to this parser or if you are successful in using this in your project. I can be contacted at anandh@JavaReference.com


About the Author
Anand is a Senior Software Engineer at Veritas Software Corporation. He has a Master's degree in Computer Science from the University of Pune, India. Anand started his career as a C++ programmer before shifting his focus to Java. He is a server-side buff, and believes server-side Java is it. Over the years, Anand has designed and implemented numerous projects in Java, C++ and Visual Basic. His primary interests are networking and server-side technologies, primarily J2EE. Anand is also interested in teaching and uses his spare time to teach Java. He can be reached at anandh@JavaReference.com.

  


©2004 perfectxml.com. All rights reserved.