XML and PHP
Vikram Vaswani

XML and PHP is a clear, concise guide to the synergies between XML and PHP, many of which are not immediately visible to intermediate developers. XML and PHP demonstrates how PHP and XML can be combined to build cutting-edge Web applications. It includes detailed explanations of PHP's XML extensions, together with illustrations of using PHP to parse, validate and transform XML markup, traverse XML data trees, exchange data between Web applications, overlay remote procedure calls over HTTP, and use free open-source tools to add new capabilities to your PHP/XML applications.


Chapter 3: PHP and the Document Object Model (DOM)

"A fool sees not the same tree that a wise man sees."

~William Blake

If you've been paying attention, you now know the basics of parsing XML with PHP. As Chapter 2, "PHP and the Simple API for XML (SAX)," demonstrated, it's pretty simple: whip up some XML data and mix in a few callback functions. It's a simple yet effective recipe, and one that can be used to great effect for the rapid development of XML-based applications.

That said, although the event-driven approach to XML parsing is certainly popular, it's not the only option available. PHP also allows you to parse XML using the Document Object Model (DOM), an alternative technique that allows developers to create and manipulate a hierarchical tree representation of XML data for greater flexibility and ease of use.

In this chapter, this tree-based approach is explored in greater detail. First it is put under the microscope to see exactly how it works and then PHP's implementation of the DOM is introduced. The various methods exposed by PHP to simplify interaction with the DOM are also examined, together with examples and code listings that demonstrate its capabilities.

Both tree- and event-based approaches have significant advantages and disadvantages, and these can impact your choice of technique when implementing specific projects. To that end, this chapter also includes a brief discussion of the pros and cons of each approach in the hope that it will assist you in making the right choice for a particular project.

Let's get started!

Document Object Model (DOM)

The Document Object Model (DOM) is a standard interface to access and manipulate structured data.

As the name suggests, it does this by modeling, or representing, a document as a hierarchical tree of objects.

The W3C's DOM specification defines a number of different objects to represent the different structures that appear within an XML document. For example, elements are represented by an Element object, whereas attributes are represented by Attr objects.

Each of these different object types exposes specific methods and properties. Element objects expose a tagName property containing the element name and getAttribute() and setAttribute() methods for attribute manipulation, whereas Attr objects expose a value property containing the value of the particular attribute. These methods and properties can be used by the application layer to navigate and process the DOM tree, exploit the relationships between the different branches of the tree, and extract information from it.

The very first specification of the DOM (DOM Level 1) appeared on the W3C's web site in October 1998, and simply specified the "core" features of the DOM—the basic objects and the interfaces to them. The next major upgrade, DOM Level 2, appeared in November 2000; it examined the DOM from the perspective of core functions, event handling, and document traversal. DOM Level 3, which is currently under development, builds on past work, and incorporates additions and changes from other related technologies (XPath, abstract schemas, and so on).

As a standard interface to structured data, the DOM was designed from the get-go to be platform- and language-independent. It can be (and is) used to represent structured HTML and XML data, with DOM (or DOM-based) implementations currently available for Java, JavaScript, Python, C/C++, Visual Basic, Delphi, Perl, SMIL, SVG, and PHP. (The PHP implementation is discussed in detail in the next section.)

In order to better understand how the DOM works, consider Listing 3.1.

Listing 3.1 A Simple XML Document

<?xml version="1.0"?>
<sentence>What a wonderful profusion of colors and smells in the market
<vegetable color='green'>cabbages</vegetable>, <vegetable
color='red'>tomatoes</vegetable>, <fruit color='green'>apples</fruit>,
<vegetable color='purple'>aubergines</vegetable>, <fruit
color='yellow'>bananas</fruit></sentence>

Once a DOM parser chewed on this document, it would spit out the tree structure shown in Figure 3.1.

Figure 3.1 A DOM tree.

As you can see, the parser returns a tree containing multiple nodes linked to each other by parent-child relationships. Developers can then write code to move around the tree, access node properties, and manipulate node content.

This approach is in stark contrast to the event-driven approach you studied in Chapter 2, "PHP and the Simple API for XML (SAX)." A SAX parser progresses sequentially through a document, firing events based on the tags it encounters and leaving it to the application layer to decide how to process each event. A DOM parser, on the other hand, reads the entire document into memory, and builds a tree representation of its structure; the application layer can then use standard DOM interfaces to find and manipulate individual nodes on this tree, in a non-sequential manner.
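To make the "whole tree in memory" idea concrete, here is a minimal sketch that loads a shortened version of Listing 3.1 and lists the nodes directly under the root. It is not the PHP 4 API this chapter covers: it assumes a PHP 5 or later build with the bundled DOM extension (the DOMDocument class), and the $summary variable is only an illustrative name.

```php
<?php
// Assumption: PHP 5+ DOM extension (DOMDocument), not the PHP 4 functions used in this chapter.
// Shortened version of the document in Listing 3.1.
$xml = "<?xml version='1.0'?>"
     . "<sentence>What a wonderful profusion of colors and smells in the market "
     . "<vegetable color='green'>cabbages</vegetable>, "
     . "<fruit color='green'>apples</fruit></sentence>";

// the parser reads the complete document and builds the tree in memory
$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    die("Error parsing XML");
}

// once the tree exists, the root's children can be visited in any order
$summary = array();
foreach ($doc->documentElement->childNodes as $node) {
    if ($node->nodeType == XML_ELEMENT_NODE) {
        $summary[] = "element:" . $node->nodeName;
    } else if ($node->nodeType == XML_TEXT_NODE) {
        $summary[] = "text";
    }
}

echo implode(" | ", $summary) . "\n";
```

The text between the tags shows up as nodes in its own right, alongside the vegetable and fruit elements; that is the parent-child structure the figure describes.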

PHP and the DOM

PHP 4.0 comes with a primitive, though effective, implementation of the DOM, based on the libxml library. Created by Daniel Veillard, libxml (http://www.xmlsoft.org/) is a modular, standards-compliant C library that provides XML parsing capabilities to the GNOME project (http://www.gnome.org/).

If you're using a stock PHP binary, it's quite likely that you'll need to recompile PHP to add support for this library to your PHP build. (Detailed instructions for accomplishing this are available in Appendix A, "Recompiling PHP to Add XML Support.")

Under Construction

If you're planning on using PHP's DOM extension in your development activities, be warned that this extension is still under development and is, therefore, subject to change without notice. Consequently, DOM code that works with one version of PHP may need to be rewritten or retested with subsequent versions.

Note also that the examples in this chapter have been tested with the DOM extension that ships with PHP 4.1.1, and are not likely to work with earlier versions because PHP's DOM implementation underwent some fairly radical changes between the release of PHP 4.0.6 and PHP 4.1.1. If you're using an earlier PHP build, you might want to upgrade to PHP 4.1.1 in order to try out the examples in this chapter.

A Simple Example

When PHP parses an XML document, it creates a hierarchical tree structure (mirroring the structure of the document) that is composed of objects. Each of these objects has standard properties and methods, and you can use these properties and methods to traverse the object tree and access specific elements, attributes, or character data.

The best way to understand how this works is with a simple example. Take a look at Listing 3.2, which demonstrates the basic concepts of this technique by traversing a DOM tree to locate a particular type of element, and print its value.

Listing 3.2 Traversing a DOM Tree

<?php

// XML data
$xml_string = "<?xml version='1.0'?>
<sentence>What a wonderful profusion of colors and smells in the market - 
   <vegetable color='green'>cabbages</vegetable>, 
   <vegetable color='red'>tomatoes</vegetable>, 
   <fruit color='green'>apples</fruit>, 
   <vegetable color='purple'>aubergines</vegetable>, 
   <fruit color='yellow'>bananas</fruit>
</sentence>";

// create a DOM object from the XML data
if(!$doc = xmldoc($xml_string))
{
   die("Error parsing XML");
}

// start at the root
$root = $doc->root();

// move down one level to the root's children
$children = $root->children();

// iterate through the list of children
foreach ($children as $child)
{
  // if <vegetable> element   
  if ($child->tagname == "vegetable")
  {
     // go down one more level
     // get the text node
     $text = $child->children();
     // print the content of the text node
     echo "Found: " . $text[0]->content . "<br>"; 
  }
}

?>

Let's go through Listing 3.2 step-by-step:

  1. The first order of business is to feed the parser the XML data, so that it can generate the DOM tree. This is accomplished via the xmldoc() function, which accepts a string of XML as argument, and creates a DOM object representing the XML data. (You can use xmldocfile() to parse an XML file instead of a string. Check out Listing 3.5 for an example.) The following lines of code create a DOM object, and assign it to the PHP variable $doc:

     if(!$doc = xmldoc($xml_string))
     {
        die("Error parsing XML");
     }

  2. This newly created DOM object has certain properties and methods. One of the most important ones is the root() method, which returns an object representing the document's root element. The following line of code returns an object representing the document element, and assigns it to the PHP variable $root:

     $root = $doc->root();

  3. This returned node is itself an object, again with properties and methods of its own. These methods and properties provide information about the node, and its relationship to other nodes in the tree: its name and type, its parent, and its children. However, the elements I'm looking for aren't at this level; they're one level deeper. And so I used the root node's children() method to obtain a list of the nodes below it in the document hierarchy:

     $children = $root->children();

  4. This list of child nodes is returned as an array containing both text and element nodes. All I need to do now is iterate through this node list, looking for vegetable elements. As and when I find these, I dive one level deeper into the tree to access the corresponding character data and print it (this is a snap, given that each text node has a content property).

     foreach ($children as $child)
     {
       // if <vegetable> element
       if ($child->tagname == "vegetable")
       {
          // go down one more level
          // get the text node
          $text = $child->children();
          // print the content of the text node
          echo "Found: " . $text[0]->content . "<br>";
       }
     }

    When this script runs, it produces the following output:

    Found: cabbages
    Found: tomatoes
    Found: aubergines

As Listing 3.2 demonstrates, DOM tree traversal takes place primarily by exploiting the parent-child relationships that exist between the nodes of the tree. After traversal to a particular depth has been accomplished, node properties can be used to extract all required information from the tree.
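For readers without a PHP 4.1 build at hand, the same traversal can be sketched against the DOM extension bundled with PHP 5 and later. This is an assumption-laden equivalent, not the chapter's API: documentElement, childNodes, nodeName, and firstChild stand in for root(), children(), tagname, and the first entry of the children array, and $found is an illustrative variable.

```php
<?php
// Assumption: PHP 5+ DOM extension; the chapter itself uses xmldoc(), root() and children().
$xml_string = "<?xml version='1.0'?>
<sentence>What a wonderful profusion of colors and smells in the market -
   <vegetable color='green'>cabbages</vegetable>,
   <vegetable color='red'>tomatoes</vegetable>,
   <fruit color='green'>apples</fruit>,
   <vegetable color='purple'>aubergines</vegetable>,
   <fruit color='yellow'>bananas</fruit>
</sentence>";

// create a DOM object from the XML data
$doc = new DOMDocument();
if (!$doc->loadXML($xml_string)) {
    die("Error parsing XML");
}

// start at the root and move down one level
$root = $doc->documentElement;
$found = array();

foreach ($root->childNodes as $child) {
    // keep only <vegetable> elements, skipping text nodes and <fruit>
    if ($child->nodeType == XML_ELEMENT_NODE && $child->nodeName == "vegetable") {
        // go down one more level: the first child is the text node
        $found[] = $child->firstChild->nodeValue;
    }
}

foreach ($found as $name) {
    echo "Found: " . $name . "<br>";
}
```

The shape of the code is unchanged: parse, take the root, walk its children, and descend one level for the character data.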

Traversing the DOM with PHP's DOM Classes

Because PHP's DOM parser works by creating standard objects to represent XML structures, an understanding of these objects and their capabilities is essential to using this technique effectively. This section examines the classes that form the blueprint for these objects in greater detail.

DomDocument Class

A DomDocument object is typically the first object created by the DOM parser when it completes parsing an XML document. It may be created by a call to xmldoc():

$doc = xmldoc("<?xml version='1.0'?><element>potassium</element>");

Or, if your XML data is in a file (rather than a string), you can use the xmldocfile() function to create a DomDocument object:

$doc = xmldocfile("element.xml");

Treading the Right Path

If you're using Windows, you'll need to give xmldocfile() the full path to the XML file. Don't forget to include the drive letter!

When you examine the structure of the DomDocument object with print_r(), you can see that it contains basic information about the XML document—including the XML version, the encoding and character set, and the URL of the document:

DomDocument Object
(
  [name] => 
  [url] => 
  [version] => 1.0
  [standalone] => -1
  [type] => 9
  [compression] => -1
  [charset] => 1
)

Peekaboo!

You'll notice that many examples in this book (particularly in this chapter) use the print_r() function to display the structure of a particular PHP variable. In case you're not familiar with this function, you should know that it provides an easy way to investigate the innards of a particular variable, array, or object. Use it whenever you need to look inside an object to see what makes it tick; and, if you're feeling really adventurous, you might also want to take a look at the var_dump() and var_export() functions, which provide similar functionality.

Each of these properties provides information on some aspect of the XML document:

  • name—Name of the XML document

  • url—URL of the document

  • version—XML version used

  • standalone—Whether or not the document is a standalone document

  • type—Integer corresponding to one of the DOM node types (see Table 3.1)

  • compression—Whether or not the file was compressed

  • charset—Character set used by the document

The application can use this information to make decisions about how to process the XML data—for example, as Listing 3.3 demonstrates, it may reject documents based on the version of XML being used.

Listing 3.3 Using DomDocument Properties to Verify XML Version Information

<?php

// XML data
$xml_string = "<?xml version='1.0'?><element>potassium</element>";

// create a DOM object
if (!$doc = xmldoc($xml_string))
{
   die("Error in XML");
}
// version check
else if ($doc->version > 1.0)
{
   die("Unsupported XML version");
}
else
{
   // XML processing code here
}

?>
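As a point of comparison, the document-level details shown above can also be read in later PHP versions. The sketch below assumes the PHP 5+ DOM extension, where the properties are named xmlVersion, xmlEncoding, and xmlStandalone rather than version, charset, and standalone; $info and $supported are illustrative names.

```php
<?php
// Assumption: PHP 5+ DOM extension (DOMDocument), not the PHP 4 DomDocument class described here.
$xml_string = "<?xml version='1.0' encoding='UTF-8' standalone='yes'?><element>potassium</element>";

$doc = new DOMDocument();
if (!$doc->loadXML($xml_string)) {
    die("Error in XML");
}

// document-level details taken from the XML declaration
$info = array(
    'version'    => $doc->xmlVersion,
    'encoding'   => $doc->xmlEncoding,
    'standalone' => $doc->xmlStandalone,
    'type'       => $doc->nodeType,
);

// version_compare() compares version strings instead of casting them to floats
$supported = !version_compare($info['version'], "1.0", ">");
if (!$supported) {
    die("Unsupported XML version");
}

print_r($info);
```

Comparing with version_compare() is a design choice of this sketch; Listing 3.3 compares the version against 1.0 directly.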

In addition to the properties described previously, the DomDocument object also comes with the following methods:

  • root()—Returns a DomElement object representing the document element

  • dtd()—Returns a DTD object containing information about the document's DTD

  • add_root()—Creates a new document element, and returns a DomElement object representing that element

  • dumpmem()—Dumps the XML structure into a string variable

  • xpath_new_context()—Creates an XPathContext object for XPath evaluation

While parsing XML data, you'll find that the root() method is the one you use most often, whereas the add_root() and dumpmem() methods come in handy when you're creating or modifying an XML document tree in memory (discussed in detail in the "Manipulating DOM Trees" section).

X Marks the Spot

In case you're wondering, XPath, or the XML Path Language, provides an easy way to address specific parts of an XML document. The language uses directional axes, coupled with conditional tests, to create node collections matching a specific criterion, and also provides standard constructs to manipulate these collections.

PHP's XPath implementation is discussed in detail in the upcoming section titled "Traversing the DOM with PHP's XPath Classes."

In Listing 3.4, the variable $fruit contains the root node (the element named fruit).

Listing 3.4 Accessing the Document Element via the DOM

<?php

// create a DomDocument object
$doc = xmldoc("<?xml version='1.0' encoding='UTF-8' standalone='yes'?><fruit>watermelon</fruit>");

// root node
$fruit = $doc->root();

?>

To DTD or Not to DTD

The dtd() method of the DomDocument object creates a DTD object, which contains basic information about the document's Document Type Definition. Here's what it looks like:

Dtd Object
(
  [systemId] => weather.dtd
  [name] => weather
)

This DTD object exposes two properties: the systemId property reveals the filename of the DTD document, whereas the name property contains the name of the document element.
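The same two pieces of information can be sketched with the PHP 5+ DOM extension, where the document type is exposed through the doctype property instead of a dtd() method. That substitution is an assumption of this sketch, and the body of the weather document below is hypothetical, since the original weather.xml is not shown.

```php
<?php
// Assumption: PHP 5+ DOM extension; the <city> content is a made-up stand-in document.
$xml = "<?xml version='1.0'?>"
     . "<!DOCTYPE weather SYSTEM 'weather.dtd'>"
     . "<weather><city>Bombay</city></weather>";

$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    die("Error in XML document");
}

// the document type declaration, if the document has one
$dtd = $doc->doctype;

// name of the document element, as declared in the DOCTYPE
$name = $dtd->name;
// system identifier, here the filename of the DTD
$systemId = $dtd->systemId;

echo "name: $name, systemId: $systemId\n";
```

Nothing here validates the document against the DTD; the sketch only reads what the DOCTYPE declaration states.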

DomElement Class

The PHP parser represents every element within the XML document as an instance of the DomElement class, which makes it one of the most important in this lineup. When you view the structure of a DomElement object, you see that it has two distinct properties that represent the element name and type, respectively. You'll remember from Listing 3.2 that these properties can be used to identify individual elements and extract their values. Here is an example:

DomElement Object
(
  [type] => 1
  [tagname] => vegetable
)

A special note should be made here of the type property, which indicates the type of node under discussion. This type property contains an integer value mapping to one of the parser's predefined node types. Table 3.1 lists the important types.

Table 3.1 DOM Node Types

Integer   Node type                 Description
-------   -----------------------   ----------------------
1         XML_ELEMENT_NODE          Element
2         XML_ATTRIBUTE_NODE        Attribute
3         XML_TEXT_NODE             Text
4         XML_CDATA_SECTION_NODE    CDATA section
5         XML_ENTITY_REF_NODE       Entity reference
7         XML_PI_NODE               Processing instruction
8         XML_COMMENT_NODE          Comment
9         XML_DOCUMENT_NODE         XML document
12        XML_NOTATION_NODE         Notation

If you plan to use the type property within a script to identify node types (as I will be doing shortly in Listing 3.5), you should note that it is considered preferable to use the named constants rather than their corresponding integer values, both for readability and to ensure stability across API changes.
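A short sketch of why the named constants read better than bare integers: the lookup below maps each constant from Table 3.1 to its description and applies it to a small document. It assumes the PHP 5+ DOM extension, which defines the same constant names; describe_node_type() and the sample note document are hypothetical.

```php
<?php
// Assumption: PHP 5+ DOM extension; describe_node_type() is an illustrative helper.
function describe_node_type($type)
{
    // keyed by named constant, so no magic numbers appear in the code
    $names = array(
        XML_ELEMENT_NODE       => "Element",
        XML_ATTRIBUTE_NODE     => "Attribute",
        XML_TEXT_NODE          => "Text",
        XML_CDATA_SECTION_NODE => "CDATA section",
        XML_ENTITY_REF_NODE    => "Entity reference",
        XML_PI_NODE            => "Processing instruction",
        XML_COMMENT_NODE       => "Comment",
        XML_DOCUMENT_NODE      => "XML document",
        XML_NOTATION_NODE      => "Notation"
    );
    return isset($names[$type]) ? $names[$type] : "Other";
}

$doc = new DOMDocument();
if (!$doc->loadXML("<note><!-- reminder -->buy <item>milk</item><?print now?></note>")) {
    die("Error parsing XML");
}

// label every child of the document element
$labels = array();
foreach ($doc->documentElement->childNodes as $node) {
    $labels[] = describe_node_type($node->nodeType);
}

echo implode(", ", $labels) . "\n";
```

A test such as `$node->nodeType == XML_COMMENT_NODE` states its intent directly, which `$node->nodeType == 8` does not.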

The DomElement object also exposes a number of useful object methods:

  • children()—Returns an array of DomElement objects representing the children of this node

  • parent()—Returns a DomElement object representing the parent of this node

  • attributes()—Returns an array of DomAttribute objects representing the attributes of this node

  • get_attribute()—Returns the value of an attribute of this node

  • new_child()—Creates a new DomElement object, and attaches it as a child of this node (note that this newly created node is placed at the end of the existing child list)

  • set_attribute()—Sets the value of an attribute of this node

  • set_content()—Sets the content of this node

Again, the two most commonly used ones are the children() and attributes() methods, which return an array of DomElement and DomAttribute objects, respectively. The get_attribute() method can be used to return the value of a specific attribute of an element (refer to Listing 3.8 for an example), whereas the new_child(), set_attribute(), and set_content() methods are used when creating or modifying XML trees in memory, and are discussed in detail in the section entitled "Manipulating DOM Trees."

Note that PHP's DOM implementation does not currently offer any way of removing an attribute previously set with the set_attribute() method.

Choices

Most of the object methods discussed in this chapter can also be invoked as functions by prefixing the method name with domxml_ and passing a reference to the object as the first function argument. The following snippets demonstrate this:

<?php
// these two are equivalent
$root1 = $doc->root();
$root2 = domxml_root($doc);

// these two are equivalent
$children1 = $root1->children();
$children2 = domxml_children($root2);
?>

Listing 3.5 demonstrates one of these in action by combining the children() method of a DomElement object with a recursive function and HTML's unordered lists to create a hierarchical tree mirroring the document structure (similar in concept, though not in approach, to Listing 2.5). At the end of the process, a count of the total number of elements encountered is displayed.

Listing 3.5 Representing an XML Document as a Hierarchical List

<?php

// XML file
$xml_file = "letter.xml";

// parse it
if (!$doc = xmldocfile($xml_file))
{
   die("Error in XML document");
}

// get the root node
$root = $doc->root();

// get its children
$children = get_children($root);

// element counter
// start with 1 so as to include document element
$elementCount = 1;

// start printing
print_tree($children);

// this recursive function accepts an array of nodes as argument,
// iterates through it and prints a list for each element found
function print_tree($nodeCollection)
{
   global $elementCount;

   // iterate through array
   echo "<ul>";

   for ($x=0; $x<sizeof($nodeCollection); $x++)
   {
     // add to element count
     $elementCount++;
     
     // print element as list item
     echo "<li>" . $nodeCollection[$x]->tagname;

     // go to the next level of the tree
     $nextCollection = get_children($nodeCollection[$x]);

     // recurse!
     print_tree($nextCollection);

   }

   echo "</ul>";
}

// function to return an array of children, given a parent node
function get_children($node)
{
   $temp = $node->children();
   $collection = array();

   // iterate through children array
   for ($x=0; $x<sizeof($temp); $x++)
   {
     // filter out all nodes except elements
     // and create a new array
     if ($temp[$x]->type == XML_ELEMENT_NODE)
     {
        $collection[] = $temp[$x];
     }
   }

   // return array containing child nodes
   return $collection;
}

echo "Total number of elements in document: $elementCount";
?>

Listing 3.5 is fairly easy to understand. The first step is to obtain a reference to the root of the document tree via the root() method; the element children of that node, collected by the get_children() helper, serve as the starting point for the recursive print_tree() function. This function prints each element in the list it receives, fetches that element's own element children, and then calls itself to process the next level of nodes in the tree. The process continues until all the nodes in the tree have been exhausted. An element counter is used to track the number of elements found, and to display a total count of all the elements in the document.
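The recursion can be sketched more compactly as a function that returns the list markup and updates a counter passed by reference. This is a variation under stated assumptions, not the listing's code: it uses the PHP 5+ DOM extension, element_list() is a hypothetical helper, and the inline letter document is a made-up stand-in for letter.xml, whose contents are not shown.

```php
<?php
// Assumption: PHP 5+ DOM extension; element_list() and the sample XML are illustrative.
function element_list($parent, &$count)
{
    $items = "";
    foreach ($parent->childNodes as $node) {
        // filter out all nodes except elements
        if ($node->nodeType != XML_ELEMENT_NODE) {
            continue;
        }
        $count++;
        // recurse to print the next level of the tree inside this list item
        $items .= "<li>" . $node->nodeName . element_list($node, $count) . "</li>";
    }
    // emit a <ul> only when this level actually contains elements
    return $items == "" ? "" : "<ul>" . $items . "</ul>";
}

$doc = new DOMDocument();
if (!$doc->loadXML("<letter><to>Joe</to><body><para>Hi</para><para>Bye</para></body></letter>")) {
    die("Error in XML document");
}

// start with 1 so as to include the document element
$elementCount = 1;
$list = element_list($doc->documentElement, $elementCount);

echo $list . "\n";
echo "Total number of elements in document: $elementCount\n";
```

Passing the counter by reference avoids the global variable; that is a choice of this sketch and does not change what is counted.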

DomText Class

Character data within an XML document is represented by the DomText class. Here's what it looks like:

DomText Object
(
   [type] => 3
   [content] => cabbages
)

The type property represents the node type (XML_TEXT_NODE in this case, as can be seen from Table 3.1), whereas the content property holds the character data itself. In order to illustrate this, consider Listing 3.6, which takes an XML-encoded list of country names, parses it, and puts that list into a PHP array.

Listing 3.6 Using DomText Object Properties to Retrieve Character Data from an XML Document

<?php

// XML data
$xml_string = "<?xml version='1.0'?>
<earth>
   <country>Albania</country>
   <country>Argentina</country>
   <!-- and so on -->
   <country>Zimbabwe</country>
</earth>";

// create array to hold country names
$countries = array();

// create a DOM object from the XML data
if(!$doc = xmldoc($xml_string))
{
   die("Error parsing XML");
}

// start at the root
$root = $doc->root();

// move down one level to the root's children
$nodes = $root->children();

// iterate through the list of children
foreach ($nodes as $n)
{
   // for each <country> element
   // get the text node under it
   // and add it to the $countries[] array
   $text = $n->children();
   if ($text[0]->content != "")
   {
      $countries[] = $text[0]->content; 
   }
}

// uncomment this line to see the contents of the array
// print_r($countries);
?>

Fairly simple—a loop is used to iterate through all the <country> elements, adding the character data found within each to the global $countries array.

Taking up Space

It's important to remember that XML, unlike HTML, does not ignore whitespace, but treats it as literal character data. Consequently, if your XML document includes whitespace or line breaks, PHP's DOM parser identifies them as text nodes, and creates DomText objects to represent them. This is a common cause of confusion for DOM newbies, who are often stumped by the "extra" nodes that appear in their DOM tree.
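The "extra" nodes are easy to see by counting them. The sketch below assumes the PHP 5+ DOM extension rather than this chapter's API, and uses a two-country version of the list from Listing 3.6; $blank and $countries are illustrative names.

```php
<?php
// Assumption: PHP 5+ DOM extension (DOMDocument).
$xml_string = "<?xml version='1.0'?>
<earth>
   <country>Albania</country>
   <country>Zimbabwe</country>
</earth>";

$doc = new DOMDocument();
if (!$doc->loadXML($xml_string)) {
    die("Error parsing XML");
}

$nodes = $doc->documentElement->childNodes;
$total = $nodes->length;

$blank = 0;
$countries = array();
foreach ($nodes as $n) {
    if ($n->nodeType == XML_TEXT_NODE && trim($n->nodeValue) == "") {
        // a text node holding only line breaks and indentation
        $blank++;
    } else if ($n->nodeType == XML_ELEMENT_NODE) {
        $countries[] = $n->firstChild->nodeValue;
    }
}

echo "$total child nodes, $blank of them whitespace-only\n";
```

Two elements produce five child nodes, because the indentation before, between, and after them is character data too. Checking the node type before reading the content keeps such nodes out of the result.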

DomAttribute Class

A call to the attributes() method of the DomElement object generates an array of DomAttribute objects, each of which looks like this:

DomAttribute Object
(
   [name] => color
   [value] => green
)

The attribute name can be accessed via the name property, and the corresponding attribute value can be accessed via the value property. Listing 3.7 demonstrates how this works by using the value of the color attribute to highlight each vegetable or fruit name in the corresponding color.

Listing 3.7 Accessing Attribute Values with the DomAttribute Object

<?php

// XML data
$xml_string = "<?xml version='1.0'?>
<sentence>
What a wonderful profusion of colors and smells in the market - <vegetable
color='green'>cabbages</vegetable>, <vegetable color='red'>tomatoes</vegetable>,
<fruit color='green'>apples</fruit>, <vegetable
color='purple'>aubergines</vegetable>, <fruit color='yellow'>bananas</fruit> </sentence>";

// parse it
if (!$doc = xmldoc($xml_string))
{
   die("Error in XML document");
}

// get the root node
$root = $doc->root();

// get its children
$children = $root->children();

// iterate through child list
for ($x=0; $x<sizeof($children); $x++)
{
   // if element node
   if ($children[$x]->type == XML_ELEMENT_NODE)
   {
      // get the text node under it
      $text = $children[$x]->children();
      $cdata = $text[0]->content;

      // check its attributes to see if "color" is present
      $attributes = $children[$x]->attributes();

      if (is_array($attributes) && ($index = is_color_attribute_present($attributes)))
      {
         // if it is, colorize the element content
         echo "<font color=" . $index . ">" . $cdata . "</font>";
      }
      else
      {
         // else print it as is
         echo $cdata;
      }
   }
   // if text node
   else if ($children[$x]->type == XML_TEXT_NODE)
   {
      // simply print the content
      echo $children[$x]->content;
   }
}

// function to iterate through attribute list
// and return the value of the "color" attribute if available
function is_color_attribute_present($attributeList)
{
   // default return value when no "color" attribute exists
   $color = false;

   foreach($attributeList as $attrib)
   {
      if ($attrib->name == "color")
      {
         $color = $attrib->value;
         break;
      }
   }

   return $color;
}

?>

There is, of course, a simpler way to do this—just use the DomElement object's get_attribute() method. Listing 3.8, which generates equivalent output to Listing 3.7, demonstrates this alternative (and much shorter) approach.

Listing 3.8 Accessing Attribute Values (a Simpler Approach)

<?php

// XML data
$xml_string = "<?xml version='1.0'?>
<sentence>
What a wonderful profusion of colors and smells in the market - <vegetable
color='green'>cabbages</vegetable>, <vegetable color='red'>tomatoes</vegetable>,
<fruit color='green'>apples</fruit>, <vegetable
color='purple'>aubergines</vegetable>, <fruit color='yellow'>bananas</fruit> </sentence>";

// parse it
if (!$doc = xmldoc($xml_string))
{
   die("Error in XML document");
}

// get the root node
$root = $doc->root();

// get its children
$children = $root->children();

// iterate through child list
for ($x=0; $x<sizeof($children); $x++)
{
   // if element node
   if ($children[$x]->type == XML_ELEMENT_NODE)
   {
      // get the text node under it
      $text = $children[$x]->children();
      $cdata = $text[0]->content;

      // check to see if element contains the "color" attribute
      if ($children[$x]->get_attribute("color"))
      {
         // "color" attribute is present, colorize text
         echo "<font color=" . $children[$x]->get_attribute("color") . ">" .
$cdata . "</font>";
      }
      else
      {
         // otherwise just print the text as is
         echo $cdata;
      }
   }
   // if text node
   else if ($children[$x]->type == XML_TEXT_NODE)
   {
      // print content as is
      echo $children[$x]->content;
   }
}

?>
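For comparison, the direct-lookup approach might look as follows in later PHP versions. The sketch assumes the PHP 5+ DOM extension, where the counterparts of get_attribute() are getAttribute() and hasAttribute(); the shortened sentence and the $out variable are illustrative.

```php
<?php
// Assumption: PHP 5+ DOM extension; the sample sentence is a shortened, made-up variant.
$xml_string = "<sentence>Fresh <vegetable color='green'>cabbages</vegetable> and <fruit>figs</fruit></sentence>";

$doc = new DOMDocument();
if (!$doc->loadXML($xml_string)) {
    die("Error in XML document");
}

$out = "";
foreach ($doc->documentElement->childNodes as $node) {
    if ($node->nodeType == XML_ELEMENT_NODE) {
        // character data inside the element
        $cdata = $node->firstChild->nodeValue;

        if ($node->hasAttribute("color")) {
            // "color" attribute is present, colorize text
            $out .= "<font color=\"" . $node->getAttribute("color") . "\">" . $cdata . "</font>";
        } else {
            // otherwise keep the text as is
            $out .= $cdata;
        }
    } else if ($node->nodeType == XML_TEXT_NODE) {
        // plain text between elements is copied through unchanged
        $out .= $node->nodeValue;
    }
}

echo $out . "\n";
```

Asking the element for one named attribute removes the need to loop over the whole attribute list, which is the point Listing 3.8 makes.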

A Composite Example

Now that you know how it works, how about seeing how it plays out in real life? This example takes everything you learned thus far, and uses that knowledge to construct an HTML file from an XML document.

I'll be using a variant of the XML invoice (Listing 2.21) from Chapter 2, adapting the SAX-based approach demonstrated there to the new DOM paradigm. As you'll see, although the two techniques are fundamentally different, they can nonetheless achieve a similar effect. Listing 3.9 is the marked-up invoice.

Listing 3.9 An XML Invoice (invoice.xml)

<?xml version="1.0"?>

<invoice>

   <customer>
      <name>Joe Wannabe</name>
      <address>
         <line>23, Great Bridge Road</line>
         <line>Bombay, MH</line>
         <line>India</line>
      </address>
   </customer>

   <date>2001-09-15</date>

   <reference>75-848478-98</reference>
   
   <items>
      <item cid="AS633225">
         <desc>Oversize tennis racquet</desc>
         <price>235.00</price>
         <quantity>1</quantity>
         <subtotal>235.00</subtotal>
      </item>
   
      <item cid="GT645">
         <desc>Championship tennis balls (can)</desc>
         <price>9.99</price>
         <quantity>4</quantity>
         <subtotal>39.96</subtotal>
      </item>

      <item cid="U73472">
         <desc>Designer gym bag</desc>
         <price>139.99</price>
         <quantity>1</quantity>
         <subtotal>139.99</subtotal>
      </item>

      <item cid="AD848383">
         <desc>Custom-fitted sneakers</desc>
         <price>349.99</price>
         <quantity>1</quantity>
         <subtotal>349.99</subtotal>
      </item>
   </items>

   <delivery>Next-day air</delivery>
   
</invoice>

Listing 3.10 parses the previous XML data to create an HTML page, suitable for printing or viewing in a browser.

Listing 3.10 Formatting an XML Document with the DOM

<html>
<head>
<basefont face="Arial">
</head>
<body bgcolor="white">

<font size="+3">Sammy's Sports Store</font>
<br>
<font size="-2">14, Ocean View, CA 12345, USA
http://www.sammysportstore.com/</font>
<p>
<hr>
<center>INVOICE</center>
<hr>

<?php

// arrays to associate XML elements with HTML output
$startTagsArray = array(
'CUSTOMER' => '<p> <b>Customer: </b>',
'ADDRESS' => '<p> <b>Billing address: </b>',
'DATE' => '<p> <b>Invoice date: </b>',
'REFERENCE' => '<p> <b>Invoice number: </b>',
'ITEMS' => '<p> <b>Details: </b> <table width="100%" border="1" cellspacing="0"
cellpadding="3"><tr><td><b>Item description</b></td><td><b>Price</b></td><td><b>
Quantity</b></td><td><b>Sub-total</b></td></tr>',
'ITEM' => '<tr>',
'DESC' => '<td>',
'PRICE' => '<td>',
'QUANTITY' => '<td>',
'SUBTOTAL' => '<td>',
'DELIVERY' => '<p> <b>Shipping option:</b> ',
'TERMS' => '<p> <b>Terms and conditions: </b> <ul>',
'TERM' => '<li>'
);

$endTagsArray = array(
'LINE' => ',',
'ITEMS' => '</table>',
'ITEM' => '</tr>',
'DESC' => '</td>',
'PRICE' => '</td>',
'QUANTITY' => '</td>',
'SUBTOTAL' => '</td>',
'TERMS' => '</ul>',
'TERM' => '</li>'
);

// array to hold sub-totals
$subTotals = array();

// XML file
$xml_file = "/home/sammy/invoices/invoice.xml";

// parse document
$doc = xmldocfile($xml_file);

// get the root node
$root = $doc->root();

// get its children
$children = $root->children();

// start printing
print_tree($children);

// this recursive function accepts an array of nodes as argument,
// iterates through it and:
// - marks up elements with HTML
// - prints text as is
function print_tree($nodeCollection)
{
   global $startTagsArray, $endTagsArray, $subTotals;

   foreach ($nodeCollection as $node)
   {
      // how to handle elements
      if ($node->type == XML_ELEMENT_NODE)
      {
         // print HTML opening tags
         echo $startTagsArray[strtoupper($node->tagname)];

         // recurse
         $nextCollection = $node->children();
         print_tree($nextCollection);

         // once done, print closing tags
         echo $endTagsArray[strtoupper($node->tagname)];
      }

      // how to handle text nodes
      if ($node->type == XML_TEXT_NODE)
      {
         // print text as is
         echo($node->content);
      }

      // PI handling code would come here
      // this doesn't work too well in PHP 4.1.1
      // see the sidebar entitled "Process Failure"
      // for more information
   }
}

// this function gets the character data within an element
// it accepts an element node as argument
// and dives one level deeper into the DOM tree
// to retrieve the corresponding character data
function getNodeContent($node)
{
   $content = "";
   $children = $node->children();

   if ($children)
   {
      foreach ($children as $child)
      {
         $content .= $child->content;
      }
   }

   return $content;
}

?>

Figure 3.2 shows what the output looks like.

Figure 3.2 Sammy's Sports Store invoice.

As with the SAX example (refer to Listing 2.23), the first thing to do is define arrays to hold the HTML markup for specific tags; in Listing 3.10, this markup is stored in the $startTagsArray and $endTagsArray variables.

Next, the XML document is read by the parser, and an appropriate DOM tree is generated in memory. An array of objects representing the first level of the tree—the children of the root node—is obtained and the function print_tree() is called. This print_tree() function is a recursive function, and it forms the core of the script.

The print_tree() function accepts a node list as argument, and iterates through this list, examining each node and processing it appropriately. As you can see, the function is set up to perform specific tasks, depending on the type of node:

  • If the node is an element, the function looks up the $startTagsArray and $endTagsArray variables, and prints the corresponding HTML markup.

  • If the node is a text node, the function simply prints the contents of the text node as is.

Additionally, if the node is an element, the print_tree() function obtains a list of the element's children—if any exist—and proceeds to call itself with that node list as argument. And so the process repeats itself until the entire tree has been parsed.

As Listing 3.10 demonstrates, this technique provides a handy way to recursively scan through a DOM tree and perform different actions based on the type of node encountered. You can use this technique to count, classify, and process the different types of elements encountered (Listing 3.5 demonstrated a primitive element counter), or even to construct a new tree from the existing one.
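The listings in this chapter use the DOM XML functions that shipped with PHP 4. If you are working with PHP 5 or later, the same recursive walk can be sketched with the bundled DOM extension instead. Treat the following as a rough illustration rather than a drop-in replacement: the sample XML, the two tag maps, and the render_node() function are all my own inventions for this sketch, and it builds a string instead of echoing as it goes.

```php
<?php
// Sketch only: assumes PHP 5 or later with the bundled DOM extension,
// not the PHP 4 DOM XML functions used in Listing 3.10.

// Hypothetical sample data, not the invoice from Listing 3.9.
$xml = "<?xml version='1.0'?><items><item><desc>Skis</desc><price>120</price></item></items>";

// Illustrative tag-to-HTML maps in the spirit of $startTagsArray/$endTagsArray.
$startTags = array('ITEMS' => '<table>', 'ITEM' => '<tr>', 'DESC' => '<td>', 'PRICE' => '<td>');
$endTags   = array('ITEMS' => '</table>', 'ITEM' => '</tr>', 'DESC' => '</td>', 'PRICE' => '</td>');

// Recursively converts a node and its descendants into an HTML string.
function render_node($node, $startTags, $endTags)
{
    if ($node->nodeType == XML_TEXT_NODE) {
        // text nodes are emitted as is (escaped for HTML output)
        return htmlspecialchars($node->nodeValue);
    }
    if ($node->nodeType != XML_ELEMENT_NODE) {
        // comments, PIs and other node types are skipped in this sketch
        return '';
    }

    $key  = strtoupper($node->nodeName);
    $html = isset($startTags[$key]) ? $startTags[$key] : '';

    // recurse into the children, one level deeper each time
    foreach ($node->childNodes as $child) {
        $html .= render_node($child, $startTags, $endTags);
    }

    $html .= isset($endTags[$key]) ? $endTags[$key] : '';
    return $html;
}

$doc = new DOMDocument();
$doc->loadXML($xml);
$html = render_node($doc->documentElement, $startTags, $endTags);
echo $html;
```

Returning a string rather than printing directly is a design choice, not a requirement; it just makes the function easier to reuse and to check.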

Process Failure

If you've been paying attention, you will have noticed that the XML invoice in Listing 3.9 is not exactly the same as the one shown in Listing 2.21. Listing 2.21 included an additional processing instruction (PI), a call to the PHP function displayTotal(), which is missing in Listing 3.9.

Why? Because the DOM extension that ships with PHP 4.1.1 has trouble with processing instructions, and tends to barf all over the screen when it encounters one. Later (beta) versions of the extension do, however, include a fix for the problem.

Traversing the DOM with PHP's XPath Classes

The DOM classes discussed in the previous section are more than adequate for most common tasks, but getting used to them can take a while.

Additionally, for long and complex documents containing a large number of elements and/or levels, progressing from node to node in an iterative manner can often be tedious. You can use a recursive function to simplify the process, but you'll still have to write a fair amount of code to create and manipulate node collections, which are at different levels of the tree.

It's precisely to simplify this process that PHP also comes with a couple of XPath classes. XPath, as you may already know, is an addressing mechanism for XML documents, designed to allow XML document authors to quickly access node collections on the basis of both location and condition.

A discussion of XPath is beyond the scope of this book, so I'll assume that you already know the basics of axes, predicates, and node tests. In case you don't, you might want to brush up on the basics before proceeding with this section. Go to the companion web site for this book (http://www.xmlphp.com/ or http://www.newriders.com) to find a list of reference material to get you started.

XPathContext and XPathObject Classes

PHP's XPath classes add flexibility to the DOM parser by freeing developers from the standard parent-child paradigm when constructing node collections. The XPath classes allow developers to quickly build node collections matching specific criteria—for example, every third element or every element containing the attribute shape=square—with scant regard for their position in the hierarchical document tree.

The XPathContext class is used to set up a context for all XPath evaluations, and is created by a call to the xpath_new_context() function. This function must be passed a reference to a DomDocument object. For example:

<?php

// create a DomDocument object
$doc = xmldoc($xml_string);

// create an XPath context
$xpath = $doc->xpath_new_context();

?>

If you think this doesn't look very interesting, you're right—it's not. The XPathContext object merely sets up a context for all future XPath evaluations. These XPath evaluations usually result in instances of the XPathObject class, which are far more interesting.

An instance of the XPathObject class may be created with a call to the xpath_eval() method of the XPathContext object, which requires an XPath address for evaluation. If the XPath evaluates successfully, xpath_eval() returns an instance of the XPathObject class containing a collection of nodes matching the specified XPath expression. Take a look at Listing 3.11, which uses an XPath address to isolate all the vegetable elements in the document:

Listing 3.11 Creating Node Collections with XPath

<?php

// XML data
$xml_string = "<?xml version='1.0'?>
<sentence>What a wonderful profusion of colors and smells in the market - 
  <vegetable color='green'>cabbages</vegetable>, 
  <vegetable color='red'>tomatoes</vegetable>, 
  <fruit color='green'>apples</fruit>, 
  <vegetable color='purple'>aubergines</vegetable>,
  <fruit color='yellow'>bananas</fruit>
</sentence>";

$doc = xmldoc($xml_string);

// create an XPath context
$xpath = $doc->xpath_new_context();

// get all the "vegetable" elements
$vegetables = $xpath->xpath_eval("//vegetable");

// uncomment the next line to see the node collection
// print_r($vegetables);
?>

When you examine the structure of the XPathObject object instance with print_r(), here's what you'll see:

XPathObject Object
(
  [type] => 1
  [nodeset] => Array
    (
      [0] => DomElement Object
        (
          [type] => 1
          [tagname] => vegetable
        )

      [1] => DomElement Object
        (
          [type] => 1
          [tagname] => vegetable
        )

      [2] => DomElement Object
        (
          [type] => 1
          [tagname] => vegetable
        )

    )

)

As you can see, the object contains an array of DomElement objects, representing the element nodes matching the XPath expression. These DomElement objects can now be accessed and manipulated using standard class methods and properties.
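For readers on PHP 5 or later, here is a hedged sketch of the same query using the bundled DOM extension, where a DOMXPath object stands in for the XPath context and query() hands back a DOMNodeList instead of an XPathObject. The $names array is my own addition, there only to show that each matched node can be read like any other element.

```php
<?php
// Sketch only: assumes PHP 5+ with the bundled DOM extension
// (DOMDocument, DOMXPath), not the PHP 4 API shown in Listing 3.11.
$xml = "<?xml version='1.0'?>
<sentence>What a wonderful profusion of colors and smells in the market -
  <vegetable color='green'>cabbages</vegetable>,
  <vegetable color='red'>tomatoes</vegetable>,
  <fruit color='green'>apples</fruit>,
  <vegetable color='purple'>aubergines</vegetable>,
  <fruit color='yellow'>bananas</fruit>
</sentence>";

$doc = new DOMDocument();
$doc->loadXML($xml);

// DOMXPath plays the role of the XPath context
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList of matching nodes
$vegetables = $xpath->query("//vegetable");

$names = array();
foreach ($vegetables as $node) {
    // each item is a DOMElement; textContent holds its character data
    $names[] = $node->textContent . " (" . $node->getAttribute("color") . ")";
}

echo implode(", ", $names);
```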

A Composite Example

In order to demonstrate just how powerful XPath can be, consider the following situation. In a research project conducted to study the effect of temperature on bacterial culture growth, researchers publish their findings as XML data. Listing 3.12 contains a sample of this data.

Listing 3.12 A Compilation of Experiment Readings (data.xml)

<?xml version="1.0"?>
<project id="49">

   <!-- data for 3 cultures: Alpha, Beta and Gamma, tested at temperatures
   ranging from 10C to 50C -->
   <!-- readings indicate cell counts 4 hours after start of experiment -->

   <record>
      <culture>Alpha</culture>
      <temperature>10</temperature>
      <reading>25000</reading>
   </record>

   <record>
      <culture>Beta</culture>
      <temperature>10</temperature>
      <reading>4000</reading>
   </record>

   <record>
      <culture>Alpha</culture>
      <temperature>10</temperature>
      <reading>23494</reading>
   </record>

   <record>
      <culture>Alpha</culture>
      <temperature>20</temperature>
      <reading>21099</reading>
   </record>

   <record>
      <culture>Gamma</culture>
      <temperature>40</temperature>
      <reading>768</reading>
   </record>

   <record>
      <culture>Gamma</culture>
      <temperature>10</temperature>
      <reading>900</reading>
   </record>

   <!-- snip -->

</project>

It now becomes necessary to compile this raw data into an easily understandable table so that the results can be analyzed. Ideally, what is required is a two-dimensional table displaying the temperature scale on the Y-axis and the culture type on the X-axis. Each cell, at the intersection of a temperature and a culture, should contain the average of all readings made for that culture at that temperature.

With XPath, this is a snap to accomplish. Listing 3.13 demonstrates the script.

Listing 3.13 Creating Node Collections with XPath

<html>
<head>
<basefont face="Arial">
</head>
<body bgcolor="white">
<?php

// XML file
$xml_file = "data.xml";

// parse document
$doc = xmldocfile($xml_file) or die("Could not read file!");

// create arrays to hold culture/temperature list 
$cultures = array();
$temperatures = array();

// create XPath context
$xpath = $doc->xpath_new_context();

// get a list of "culture" nodes
$obj = $xpath->xpath_eval("//culture");
$nodeset = $obj->nodeset;

// ...and create an array containing 
// the names of all available cultures
for ($x=0; $x<sizeof($nodeset); $x++)
{
   $children = $nodeset[$x]->children();
   $cultures[] = $children[0]->content;
}

// strip out duplicates
$cultures = array_unique($cultures);

// do the same thing for temperature points
$obj = $xpath->xpath_eval("//temperature");
$nodeset = $obj->nodeset;

for ($x=0; $x<sizeof($nodeset); $x++)
{
   $children = $nodeset[$x]->children();
   $temperatures[] = $children[0]->content;
}

$temperatures = array_unique($temperatures);

// sort both arrays
natsort($temperatures);
natsort($cultures);
?>
<table border="1" cellspacing="5" cellpadding="5">

<tr>
   <td>&nbsp;</td>
<?php
// first row of table, print culture names
foreach($cultures as $c)
{
   echo "<td>$c</td>";
}
?>
</tr>

<?php   
foreach($temperatures as $t)
{
   // create as many rows as there are temperature points
   echo "<tr>";
   echo "<td>$t</td>";

  // for each intersection (culture, temperature)
  // print average of available readings
  foreach($cultures as $c)
  {
     echo "<td>" . intersection($t, $c) . "</td>";
  }
   
   echo "</tr>";
}
?>

</table>

<?php
// this function collects all readings for
// a particular culture/temperature
// totals them and averages them
function intersection($temperature, $culture)
{
   // get a reference to the XPath context
   global $xpath;
   
   // set up variables to hold total and frequency
   $total = 0;
   $count=0;
   
   // get a list of "reading" nodes
   // for records with culture c and temperature t
   $obj = $xpath->xpath_eval("//record[culture='" . $culture . "' and temperature='" . $temperature . "']/reading");

   // if XPath evaluation successful
   if ($obj)
   {
      $nodeset = $obj->nodeset;

      // iterate through nodeset
      if (is_array($nodeset))
      {
         // add the readings
         foreach ($nodeset as $reading)
         {
            $children = $reading->children();
            $total += $children[0]->content;
            $count++;
         }
      }
   }

   // and then average them
   if ($count > 0)
   {
      return $total/$count;
   }
   return 0;
}
?>

</body>
</html>

I've used three different XPath expressions here. The first two are used to create a list of available cultures and temperature points; these are required for the row and column headings of the table. The third XPath returns a list of nodes matching a specific culture and temperature. Now, all I need to do is add the readings associated with each of these nodes to reach a total number, and divide that total number by the number of nodes (readings) to obtain an average cell count.
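To make the third expression a little more concrete, here is a small sketch of the averaging step on its own, written against the DOM extension bundled with PHP 5 and later rather than the PHP 4 functions in Listing 3.13. The average_reading() function name is hypothetical, the records are a subset of Listing 3.12, and the sketch assumes the culture and temperature values never contain quote characters.

```php
<?php
// Sketch only: assumes PHP 5+ with the bundled DOM extension.
// The records below are a subset of Listing 3.12.
$xml = "<?xml version='1.0'?>
<project id='49'>
  <record><culture>Alpha</culture><temperature>10</temperature><reading>25000</reading></record>
  <record><culture>Beta</culture><temperature>10</temperature><reading>4000</reading></record>
  <record><culture>Alpha</culture><temperature>10</temperature><reading>23494</reading></record>
  <record><culture>Alpha</culture><temperature>20</temperature><reading>21099</reading></record>
</project>";

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Returns the average of all readings for one culture at one temperature,
// or 0 if there are no matching records.
function average_reading($xpath, $culture, $temperature)
{
    // NOTE: values are interpolated directly into the expression, so this
    // sketch assumes they contain no quote characters
    $expr  = "//record[culture='$culture' and temperature='$temperature']/reading";
    $nodes = $xpath->query($expr);

    if ($nodes === false || $nodes->length == 0) {
        return 0;
    }

    // add the readings, then divide by the number of readings
    $total = 0;
    foreach ($nodes as $reading) {
        $total += (float) $reading->textContent;
    }
    return $total / $nodes->length;
}

$alpha10 = average_reading($xpath, "Alpha", 10);
$beta20  = average_reading($xpath, "Beta", 20);

echo "Alpha at 10C: $alpha10\n";
echo "Beta at 20C: $beta20\n";
```

With this data, the two Alpha readings at 10C (25000 and 23494) average out to 24247, and the empty Beta/20C intersection falls back to 0.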

Figure 3.3 shows what the output looks like.

Figure 3.3 Statistical analysis with XPath.

This kind of thing comes in particularly handy when you need to perform statistical analysis of sampling data; it provides a simple and easy way to bring together different elements of each sample, perform calculations on these elements, and relate them to each other in two or three dimensions. XPath's conditional expressions are a boon here—although you can certainly do the same thing without XPath (and I encourage you to try, just so you have a better understanding of the difference), the process would be far more tedious.

Manipulating DOM Trees

The preceding section discussed traversing an already-extant DOM tree, demonstrating how the nodes of the tree can be processed in a recursive manner. That's not all you can do with the DOM, though; it's also possible to programmatically construct DOM trees from scratch, or modify existing tree structures, and save the result as one or more XML documents. This section discusses the details.

Creating New DOM Trees

If you go back to the section dealing with PHP's DOM classes, you'll see that both the DomDocument and DomElement objects include functions to create new documents, nodes, and attributes. The first of these is the new_xmldoc() method of the DomDocument object, which constructs and returns a new instance of the DomDocument object.

After a DomDocument instance is available, it becomes possible to add new element and text nodes with the add_root() and new_child() methods. And why stop at elements? The set_attribute() method allows you to define and add attributes to specific elements as well. The following code snippet (see Listing 3.14) demonstrates this by creating a complete XML document tree on the fly with the add_root() and new_child() methods:

Listing 3.14 Creating an XML Document Tree

<?php

// create DomDocument object
$doc = new_xmldoc("1.0");

// add root node
$root = $doc->add_root("article");

// set attribute for root node
$root->set_attribute("id", "567");

// add children to the root
$title = $root->new_child("title", "Goat milk for dinner can cause insomnia");
$author = $root->new_child("author", "K. Kittle");

// note how I can programmatically generate node values!
$date = $root->new_child("date", date("d-M-Y", mktime()));

// dump the tree as a string
echo $doc->dumpmem();
?>

After the tree is constructed to your satisfaction, you need to output it, either for display or storage. The DomDocument object's dumpmem() method returns a representation of the current tree as a string. You can then format it for printing, save it to a file, or transmit it to another agent.
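If you're on PHP 5 or later, a roughly equivalent tree can be built with the bundled DOM extension. The following is only a sketch under that assumption: add_text_child() is a helper I made up for the occasion (it is not part of any API), and saveXML() plays the part that dumpmem() plays in Listing 3.14.

```php
<?php
// Sketch only: assumes PHP 5+ with the bundled DOM extension.
$doc = new DOMDocument("1.0");

// create the root element, attach it to the document, set an attribute
$root = $doc->createElement("article");
$doc->appendChild($root);
$root->setAttribute("id", "567");

// Hypothetical helper (not part of the DOM API): adds a child element
// holding a single text node; createTextNode() escapes special characters.
function add_text_child($doc, $parent, $name, $text)
{
    $child = $doc->createElement($name);
    $child->appendChild($doc->createTextNode($text));
    $parent->appendChild($child);
    return $child;
}

add_text_child($doc, $root, "title", "Goat milk for dinner can cause insomnia");
add_text_child($doc, $root, "author", "K. Kittle");
add_text_child($doc, $root, "date", date("d-M-Y"));

// saveXML() returns the tree as a string
$xml = $doc->saveXML();
echo $xml;
```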

Here, Pretty!

Note that if you intend to print the dynamically generated DOM tree, it might be a good idea to run your own formatting functions on it first to pretty it up a little. This is because dumpmem() outputs the document tree as a single string, without formatting or indenting it; in the case of long and/or complex XML documents, it can be fairly difficult to read.
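As an aside for readers on PHP 5 or later: the bundled DOM extension can do the indenting for you, so a home-grown formatter may not be needed there. This sketch assumes that extension and uses a made-up snippet of XML; it has nothing to do with how dumpmem() behaves in PHP 4.

```php
<?php
// Sketch only: assumes PHP 5+ with the bundled DOM extension.
// Hypothetical, unformatted input.
$compact = "<?xml version='1.0'?><article id='567'><title>Goat milk</title><author>K. Kittle</author></article>";

$doc = new DOMDocument();

// discard whitespace-only text nodes so the serializer is free to re-indent
$doc->preserveWhiteSpace = false;
$doc->loadXML($compact);

// ask the serializer to put nested elements on their own indented lines
$doc->formatOutput = true;

$pretty = $doc->saveXML();
echo $pretty;
```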

The ability to construct new DOM trees on the fly comes in particularly handy if you need to build a customized DOM tree from another data source. This data source may be a text file that needs to be parsed, a database that needs to be queried, or even another XML document tree that needs to be pruned or combined with other data. Consider Listing 3.15, which uses MySQL database records to construct an XML book catalog and display it to the user.

Listing 3.15 Constructing a DOM Tree from a MySQL Resultset

<?php

// create DomDocument object
$doc = new_xmldoc("1.0");

// add root node
$root = $doc->add_root("collection");

// query database for records
$connection = mysql_connect("localhost", "us8749", "secret") or die ("Unable to connect!");
mysql_select_db("db633") or die ("Unable to select database!");
$query = "SELECT id, title, author, price FROM books";
$result = mysql_query($query) or die ("Error in query: $query. " . mysql_error());

// iterate through resultset
while($row = mysql_fetch_object($result))
{
   $record = $root->new_child("record", "");
   $record->set_attribute("id", $row->id);
   $record->new_child("title", $row->title);
   $record->new_child("author", $row->author);
   $record->new_child("price", $row->price);
}

// close connection
mysql_close($connection);

// dump the tree as a string
echo $doc->dumpmem();
?>

Nothing too complicated here: I'm connecting to the database, extracting the ID, title, author, and price of each book, and creating an XML document from the result. After the document tree has been created in memory, I can either display it (which is what I've done) or save it to a file (demonstrated in Listing 3.17).
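The mysql_* functions used in Listing 3.15 were removed in PHP 7, so here is a sketch of just the tree-building loop for newer versions, using the bundled DOM extension. To keep it self-contained, a plain array named $rows stands in for the database resultset; the records in it are made up, and in a real script they would come from whatever database API you use.

```php
<?php
// Sketch only: assumes PHP 5+ with the bundled DOM extension.
// $rows stands in for a database resultset; the records are made up.
$rows = array(
    array("id" => 1, "title" => "Cats & Dogs", "author" => "A. Writer", "price" => "9.99"),
    array("id" => 2, "title" => "XML Basics", "author" => "B. Author", "price" => "24.50"),
);

$doc  = new DOMDocument("1.0");
$root = $doc->appendChild($doc->createElement("collection"));

foreach ($rows as $row) {
    // one <record> element per row, keyed by its id attribute
    $record = $root->appendChild($doc->createElement("record"));
    $record->setAttribute("id", (string) $row["id"]);

    foreach (array("title", "author", "price") as $field) {
        $el = $record->appendChild($doc->createElement($field));
        // createTextNode() escapes characters such as & and <
        $el->appendChild($doc->createTextNode($row[$field]));
    }
}

$xml = $doc->saveXML();
echo $xml;
```

Adding the values as text nodes matters with database content, because a title such as "Cats & Dogs" would otherwise produce markup that is not well-formed.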

Manipulating Existing DOM Trees

It's also possible to use the functions described previously to modify an existing DOM tree. Consider the XML document in Listing 3.16, which contains the outline for a book chapter.

Listing 3.16 A Book Chapter Marked up in XML (ch9.xml)

<?xml version="1.0"?>

<chapter id="9">

   <!-- chapter 9 of a really bad pulp fiction novel -->

   <title>Luke Gets Angry</title>

   <para>As the black-suited warriors swarmed off the HUMVEE, Luke turned to Jo
   and said quietly, "Don't go anywhere. I'll just be a minute."</para>

   <para>The first warrior reached Luke and aimed a roundhouse kick at his head.
   Luke ducked easily, twisting under the leg and breaking with a sharp crack.
   The warrior moaned and tumbled backwards. Luke grinned. "Bring it on, " he
   hollered.</para>

   <para>The second soldier approached more cautiously. Moving carefully, he
   crept up behind Luke and leaped at him. Sensing movement, Luke moved aside
   at the last moment, knocked the soldier unconscious with a well-placed
   punch and stripped him of his portable grenade launcher. A few seconds
   later, the HUMVEE was in flames, and the soldiers had fled in panic.</para>

   <para>Luke laughed crazily. He was just beginning to enjoy himself.</para>

</chapter>

Now, let's suppose the author decides that "Luke" is actually a pretty wimpy name for the lead character. Instead, he decides to go with "Crazy Dan," which has a much more macho ring to it. Because he's already nine chapters into the book, he needs to change every previous occurrence of "Luke" to "Crazy Dan." All he needs to do is write a PHP program to construct a DOM tree from the XML file, scan through it for every occurrence of "Luke," alter it to "Crazy Dan," and save the result to a new file (see Listing 3.17).

Search and Destroy

I know, I know, he could use any text editor's search-and-replace function. But this chapter's about the DOM, smart guy.

Listing 3.17 Performing a Search-and-Replace Operation on a DOM Tree

<?php
// XML file
$xml_file = "/tmp/ch9.xml";

// parse document
if(!$doc = xmldocfile($xml_file))
{
   die("Error in XML document");
}

// get the root
$root = $doc->root();

// children of the root
$children = $root->children();
   
// start traversing the tree
search_and_replace($children, "Luke", "Crazy Dan");   

// all done, save the new tree to a file
// or display it if file write not possible
if (is_writable(dirname($xml_file)))
{
   $filename = dirname($xml_file) . "/_new_" . basename($xml_file);
   $fp = fopen($filename,"w+");
   fputs($fp,$doc->dumpmem());
   fclose($fp);
}
else
{
   echo $doc->dumpmem();
}

// this is a recursive function to traverse the DOM tree
// when it finds a text node, it will look for the search string and replace with
// the replacement string   
function search_and_replace($nodeCollection, $search, $replace)
{
   for ($x=0; $x<sizeof($nodeCollection); $x++)
   {

     if ($nodeCollection[$x]->type == XML_ELEMENT_NODE)
     {
        // if element, it may contain child text nodes
        // go one level deeper
        $nextCollection = $nodeCollection[$x]->children();
        search_and_replace($nextCollection, $search, $replace);
     }
     else if ($nodeCollection[$x]->type == XML_TEXT_NODE)
     {
        // if text node, perform replacement
        $str = str_replace($search, $replace, $nodeCollection[$x]->content);

        // remember to write the value of the text node back to the tree!
        $nodeCollection[$x]->set_content($str);
     }
   }
}
?>

This example is similar to Listing 3.10, in that it too uses a recursive function to process the DOM tree. In this case, though, the recursive function limits its activities to two types of nodes: element nodes and text nodes. If the node is an element node, I ignore it, and call the recursive function again to move one level deeper into the tree; if it's a text node, I scan it for a match to the search string, substitute the replacement text, and write the new string back to the tree.

After the process has concluded, the new DOM tree is written to a file (or, in the event that the directory is not accessible, displayed to the user).

If you examine the resulting output, you'll notice one interesting thing about the set_content() method—it automatically replaces special characters (such as the double quotation marks in Listing 3.16) with the corresponding XML entities (in this case, &quot;).
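The search-and-replace pass in Listing 3.17 can also be sketched without an explicit recursive function. The version below assumes PHP 5 or later with the bundled DOM extension, and uses the XPath expression //text() to collect every text node in one go; the XML is a cut-down version of Listing 3.16. How special characters are serialized may differ from the set_content() behavior noted above, so check the output before relying on it.

```php
<?php
// Sketch only: assumes PHP 5+ with the bundled DOM extension.
// A cut-down version of the chapter in Listing 3.16.
$xml = "<?xml version='1.0'?>
<chapter id='9'>
   <title>Luke Gets Angry</title>
   <para>Luke laughed crazily. He was just beginning to enjoy himself.</para>
</chapter>";

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// select every text node in the document, at any depth
foreach ($xpath->query("//text()") as $textNode) {
    // assigning to nodeValue writes the new string back into the tree
    $textNode->nodeValue = str_replace("Luke", "Crazy Dan", $textNode->nodeValue);
}

$result = $doc->saveXML();
echo $result;
```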

Going Native

You may sometimes come across situations that require you to convert raw XML markup into native data structures such as variables, arrays, or custom objects. For these situations, PHP's DOM parser includes a very specialized little function named xmltree().

The xmltree() function parses an XML string, and constructs a hierarchical tree of PHP objects representing the structured markup. This tree includes many of the same objects you've become familiar with—instances of the DomDocument, DomElement, DomText, and DomAttribute objects.

xmltree() provides an easy way to quickly see the structure of a complete XML document. For the moment, though, that's all it's useful for; it's not possible to write the tree back to a file, or to memory, after manipulating it.

Note also that, as of this writing, xmltree() only accepts an XML string. You cannot pass it a file name or file reference.

DOM or SAX?

Now that you've seen (and hopefully understood) the two most common approaches to parsing XML with PHP, you're probably wondering: Which one do I use?

It's a good question, and one that doesn't have a one-size-fits-all answer. Both DOM and SAX approaches have advantages and disadvantages, and your choice of technique must depend on the type of data being parsed, the requirements of your application, and the constraints under which you are operating.

The SAX approach is linear: It processes XML structures as it finds them, generating events and leaving the event handlers to decide what to do with each structure. The advantage of this approach is that it is resource-friendly; because SAX does not build a tree representation of the document in memory, it can parse XML data in chunks, processing large amounts of data with very little impact on memory. This also translates into better performance; if your document structure is simple, and the event handlers don't have anything too complicated to do, SAX-based applications will generally offer a speed advantage over DOM-based ones.

The downside, though, is an inability to move around the document in a non-linear manner. SAX does not maintain any internal record of the relationships between the different nodes of an XML document (as the DOM does), making it difficult to create customized node collections or to traverse the document in a non-sequential manner. The only way around this with SAX is to create your own custom object model, and map the document elements into your own custom structures, a process that adds to complexity and can possibly degrade performance.

Where SAX flounders, though, the DOM shines. The DOM creates a tree representation of the document in memory, making it possible to easily travel from one node to another, or even access the same node repeatedly (again, not something you can do easily in SAX). This tree representation is based on a standard, easy-to-understand model, making it easier to write code to interact with it.

This flexibility does, however, come with an important caveat. Because the DOM builds a tree in memory, DOM processing cannot begin until the document has been fully parsed (SAX, on the other hand, can begin parsing a document even if it's not all available immediately). This reduces a developer's ability to "manage" the parsing process by feeding data to the parser in chunks, and also implies greater memory consumption and consequent performance degradation.

Consequently, the choice of technique depends significantly on the type of constraints the application will be performing under, and the type of processing it will be expected to carry out. For systems with limited memory, SAX is a far more efficient approach. On the other hand, complex data-processing requirements can benefit from the standard object model and API of the DOM.
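To make the point about feeding data to the parser in chunks a little more tangible, here is a rough sketch using PHP's bundled XML parser functions (the same family covered in Chapter 2). The chunks are hypothetical and deliberately split in awkward places to simulate data arriving piecemeal; the closures used as handlers assume PHP 5.3 or later.

```php
<?php
// Sketch only: assumes PHP 5.3+ with the bundled XML parser functions.
// Hypothetical data, split into arbitrary chunks to simulate a stream.
$chunks = array(
    "<?xml version='1.0'?><project><record><reading>25",
    "000</reading></record><record><reading>4000</rea",
    "ding></record></project>",
);

$recordCount = 0;

$parser = xml_parser_create();

xml_set_element_handler(
    $parser,
    // start-element handler: count <record> elements as they stream past
    function ($parser, $name, $attrs) use (&$recordCount) {
        // element names arrive upper-cased because case folding is on by default
        if ($name == "RECORD") {
            $recordCount++;
        }
    },
    // end-element handler: nothing to do in this sketch
    function ($parser, $name) {
    }
);

// feed the parser one chunk at a time; the third argument to xml_parse()
// tells it whether this is the final piece of the document
$last = count($chunks) - 1;
foreach ($chunks as $i => $chunk) {
    if (!xml_parse($parser, $chunk, $i === $last)) {
        die("XML error: " . xml_error_string(xml_get_error_code($parser)));
    }
}

echo "Records seen: $recordCount\n";
```

The handlers fire while later chunks have yet to arrive, which is exactly what a tree-based parser cannot offer: the DOM approach needs the whole document before any processing can start.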

Summary

This chapter demonstrated an alternative approach to the event-based method of parsing XML data. The Document Object Model (DOM) builds a representation of the document structure in memory, and provides the application layer with a standard API to traverse and manipulate this tree.

PHP implements the DOM via a number of standard classes; this chapter examined these classes in detail, together with examples of how they can be used to format XML data, construct XML documents on the fly, and create customized representations of XML data.

Finally, the concluding section of this chapter examined the pros and cons of the SAX and DOM approaches, comparing them with each other to help developers choose the right approach for their specific requirements.

In the next chapter, I will be discussing Extensible Stylesheet Language Transformations (XSLT), which provides developers with yet another option when it comes to formatting and processing XML data.
