perfectxml.com
 


        


XML for Absolute Beginner - A Short Tutorial

  • Introduction
    Welcome to the wonderful world of Extensible Markup Language (XML). This short article is an effort to explain what XML is and what its benefits are, and to briefly describe some of the standards evolving around it.


  • A Practical Problem
    I have two credit cards from two different credit unions and a checking account at a bank. Almost every day I log on to their Web sites to check the account statements, and I spend 30 to 40 minutes (I do lots of shopping!) consolidating them to see where I stand. Being a smart programmer, I wrote a Visual Basic application that connected to all three Web sites, downloaded the account statements in HTML format, parsed them, and generated a nice merged report for me. A few days later one credit union announced its newly redesigned Web site, and my application stopped working immediately because the data was now tagged inside totally different HTML. It took me a day to update and test the code that parses hundreds of lines of HTML text just to get at the account data buried in the page. I was happy again until my bank redesigned its Web site, making my application useless one more time.

    All three Web sites have my account data, so why isn't there an easy way to get to it and use it the way I want? I then realized that it's all because of HTML, which is focused only on presentation (how the page should look) rather than on the structure and content of the data. It would be really great to separate the actual data content from the presentation. The benefits include easy access to the data, the ability to change the presentation easily, data available in multiple formats, easy searching and querying of the data, and so on.

    This is where XML comes into the picture. XML, like HTML, is a markup language, so here too you'll have to deal with tags. Unlike HTML, however, XML does not have a fixed set of tags (hence it is extensible!). One more important difference is that XML is case-sensitive: in an XML document <account>, <Account> and <ACCOUNT> are three different tags. Let's now have a slightly more formal introduction to XML.


  • What is XML?
    Extensible Markup Language (XML) is a powerful mechanism for data exchange. It is an ideal solution for transferring structured data from server to client, server to server, or application to application. The W3C is the organization in charge of the development and maintenance of most Web standards. XML, like HTML, is one of the standards defined and developed by the W3C. Today, XML 1.0 (Second Edition) is an official W3C Recommendation. For a complete list of XML-related standards and their status, check my previous article, Standards Reference.

    Now that XML has been formally introduced, let's have a look at what an XML document looks like.
    <?xml version="1.0" ?>
    <!--My first XML document -->
       <BankAccount>
            <Number>1234</Number>
            <Name>Darshan Singh</Name>
            <Type>Checking</Type>
            <OpenDate>11/04/1974</OpenDate>
            <Balance>25382.20</Balance>
       </BankAccount>

    Simply copy and paste the above nine lines into a text editor such as Notepad and save the file as c:\1.xml. Now open this file (c:\1.xml) in Internet Explorer 5.0 or later to see what your first XML document looks like.

    Internet Explorer uses MSXML and a default stylesheet (both discussed later) to display a nice tree-like view of XML documents.
    First XML document as viewed in Internet Explorer


  • Document Structure
    An XML document has a hierarchical structure with two sections: the prolog and the body. The document starts with an optional prolog, which consists of the XML declaration statement (<?xml version...), processing instructions, the document type declaration and comments. We'll look at processing instructions (PIs) and document type declarations later. Comments in an XML document are exactly the same as in HTML: they hold descriptions or inline documentation, begin with <!-- and end with -->.

    If an XML document has an <?xml version="1.0" ...?> declaration statement, it must be the first line in the document, without even a space preceding it. The XML declaration statement identifies the current document as an XML document and has three attributes: version (compulsory), encoding and standalone (both optional). XML documents are text documents. Here "text" does not refer to 7-bit ASCII, but rather to Unicode. The use of Unicode allows XML documents to be created in international languages. The encoding attribute in the XML declaration identifies which encoding is used to represent the characters in the document. In the above example, the XML declaration statement only defines the compulsory version attribute, which for now can have only 1.0 as its value. The standalone attribute can have the value "yes" or "no", telling the processor whether or not the XML document depends on any external references. Here is one example of an XML declaration statement:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
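
    To see these declaration rules in action, here is a minimal sketch in Python. Python and its standard xml.etree.ElementTree module are my choice for illustration only; nothing else in this article depends on them, and the <Name> documents below are made up for the example. The sketch shows the encoding attribute guiding how the bytes are decoded, and a parser rejecting a declaration that is not at the very start of the document.

```python
import xml.etree.ElementTree as ET

# The encoding attribute tells the parser how to decode the bytes that follow.
# "Jos\u00e9" is "Jose" with an accented e, stored here as ISO-8859-1 bytes.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><Name>Jos\u00e9</Name>'
).encode("iso-8859-1")
name = ET.fromstring(latin1_doc)
decoded_text = name.text  # back to a Unicode string

# The declaration must be the very first thing in the document;
# even a single leading space is an error.
try:
    ET.fromstring(b' <?xml version="1.0"?><Name>Darshan Singh</Name>')
    leading_space_accepted = True
except ET.ParseError:
    leading_space_accepted = False

print(decoded_text, leading_space_accepted)
```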

    Let's now move our focus to the rest of the XML document, the body. As described earlier, XML documents have a hierarchical (tree-like) structure. An XML document has one (and always only one) root element, also called the document element. All other elements are nested inside this element. In the above example, BankAccount is the document element, with Number, Name, etc. as child elements. Elements are delimited by a start tag (e.g. <Type>) and an end tag (</Type>). In HTML it is not always necessary to have end tags (browsers take care of that), but XML is strict here and needs every tag to be properly closed. Tags must also be properly nested: <A><B><C></C></B></A> is legal, but <A><B><C></B></C></A> is not. No overlapping tags are allowed. And once again, remember that XML documents are case-sensitive.

    Elements can also have attributes, which are essentially name-value pairs inside the start tag (e.g. <Student RollNo="2">). XML requires all attribute values to be enclosed within quotation marks, either ' or ".
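
    As a small illustration, here is how attributes appear once a parser has read them. This is only a sketch using Python's standard xml.etree.ElementTree module (my choice, not something the article relies on); the Grade attribute and the element content are hypothetical additions to the <Student> example.

```python
import xml.etree.ElementTree as ET

# Both kinds of quotation marks are allowed around attribute values.
# Grade and the text "Present" are made up for this example.
student = ET.fromstring("<Student RollNo=\"2\" Grade='A'>Present</Student>")

roll_no = student.get("RollNo")     # look up one attribute by name
attributes = dict(student.attrib)   # all attributes as name-value pairs

print(roll_no, attributes)
```

    Note that attribute values always come back as text; it is up to the application to convert "2" to a number if it needs one.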

    There are some (actually five) characters that have special meaning: <, >, ', ", and &. If you wish to use any of these as data (and not for markup), you should escape them using &lt;, &gt;, &apos;, &quot;, and &amp; respectively.
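
    You rarely need to do this escaping by hand. The following sketch assumes Python and its standard xml.sax.saxutils module (an illustration of mine, not part of the original toolset of this article); the <Title> element and its text are made up. It escapes all five characters and then lets a parser turn them back into the original text.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# escape() handles &, < and > by default; the two quote characters
# are passed in as extra replacements.
QUOTES = {"'": "&apos;", '"': "&quot;"}

raw_text = 'Tom & Jerry\'s <"show">'
escaped_text = escape(raw_text, QUOTES)

# A parser resolves the five predefined entities back to the characters.
title = ET.fromstring("<Title>" + escaped_text + "</Title>")
round_trip_text = title.text

print(escaped_text)
```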

    If an XML document adheres to the above rules, it is called a "well-formed" XML document.
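
    A quick way to check the rules above is to hand a document to any XML parser and see whether it complains. The sketch below assumes Python's standard xml.etree.ElementTree module (chosen here only for illustration); is_well_formed is a helper name I made up, and the sample documents are taken from or modeled on the examples in this section.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "properly nested": "<A><B><C></C></B></A>",
    "overlapping tags": "<A><B><C></B></C></A>",
    "missing end tag": "<A><B></A>",
    "case mismatch": "<account></Account>",
    "two root elements": "<A></A><B></B>",
    "unquoted attribute": "<Student RollNo=2></Student>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, "->", "well-formed" if ok else "not well-formed")
```

    Only the first sample passes; each of the others breaks exactly one of the rules described above.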

    A Document Type Definition, or DTD, can be used to describe the structure of an XML document. It lets you specify what the document element should be, the parent-child relationships, the element attributes and their default values, etc. An XML document can refer to an external DTD and/or include one inline.

    If an XML document is well-formed and also conforms to the DTD it refers to, it is said to be a "valid" XML document.

    All right, now we have data stored in an XML document. What next? If you want to navigate or manipulate this data, you can use what are called XML parsers.


  • Parsers
    XML parsers, also called XML processors, provide you with a standard API to navigate and manipulate XML documents. In other words, if you have some XML documents and need to get data out of them, change it, or modify the document structure, you don't have to write code to load and parse the XML files yourself. Instead you can use an XML parser, which will load the document and give you access to it as objects. For a complete list of available parsers, check our Software section.

    Some parsers load an entire XML document into memory and give you access to the document as nodes and elements; these are called DOM-based parsers. If you have very large XML files, this method is not preferred, since it needs a lot more memory and resources (the entire document is loaded as a tree in memory), and it is not possible to cancel the load once it has started. There is another kind of standard API, known as SAX or Simple API for XML, which is based on an event model. SAX-based parsers do not load the entire document at once; they read the document sequentially and send events to the application, and it's up to the application to decide how to respond to those events.
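
    The difference between the two styles is easiest to see side by side. The sketch below reads the balance from the bank account document shown earlier, first with a DOM-based parser and then with a SAX-based one. It assumes Python's standard xml.dom.minidom and xml.sax modules, which I picked only for illustration (the article itself uses MSXML); BalanceHandler is a hypothetical class name.

```python
import xml.sax
from xml.dom import minidom

ACCOUNT_XML = b"""<?xml version="1.0" ?>
<!--My first XML document -->
<BankAccount>
    <Number>1234</Number>
    <Name>Darshan Singh</Name>
    <Type>Checking</Type>
    <OpenDate>11/04/1974</OpenDate>
    <Balance>25382.20</Balance>
</BankAccount>"""

# DOM style: the whole document is loaded as a tree, then navigated.
document = minidom.parseString(ACCOUNT_XML)
balance_node = document.getElementsByTagName("Balance")[0]
dom_balance = balance_node.firstChild.data

# SAX style: the parser reads sequentially and calls back on each event.
class BalanceHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_balance = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "Balance":
            self.in_balance = True

    def characters(self, content):
        # Text may arrive in several pieces, so collect them all.
        if self.in_balance:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "Balance":
            self.in_balance = False

handler = BalanceHandler()
xml.sax.parseString(ACCOUNT_XML, handler)
sax_balance = "".join(handler.chunks)

print(dom_balance, sax_balance)
```

    The DOM version is shorter because the tree does the bookkeeping for you; the SAX version has to track where it is in the document, but it never holds more than the current piece of data in memory.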

    Coming back to my original "account statements" problem discussed at the beginning of the article: my credit unions and bank agreed to return the data in XML format when I log on to their Web sites from my Visual Basic application. The credit unions and the bank defined an XML structure/schema and promised not to change the structure of the XML documents. Now the only thing I had to do was load those documents using an XML parser and get to the data directly, which was very easy (just 5-6 lines of Visual Basic code). The only remaining problem was that the three XML documents had different structures, so I would have to write code to deal with three different XML documents (holding similar data). There was an easy solution to this: XSLT.


  • Transformation
    XSLT, or XSL Transformations, is once again a W3C standard, used to transform XML documents from one format to another. Using XSLT you can transform source XML documents into other XML documents, into HTML documents (for browsers) or into plain text.

    What does it mean when I say "using XSLT"? It means that you'll have to learn a declarative language (in contrast to procedural languages such as C++ or JavaScript) and write stylesheets. The stylesheets are once again XML documents, containing rules (templates) and actions. And as with XML parsers, there are stylesheet processors available that you can readily use in your application. An XSLT processor loads the source XML document as a tree, reads the stylesheet and applies the templates to generate the result document. XSLT makes use of another standard, XPath.

    So, I wrote two XSLT stylesheets and applied them to the XML documents returned from the bank and credit union Web sites, and now I have to deal with only a single XML document structure.

    Here is an example XSLT stylesheet for the bank account XML document shown above. This sample stylesheet transforms the source XML into HTML.


    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
       <xsl:output method="html" version="4.0" indent="yes" />

       <xsl:template match="/">
         <HTML>
         <BODY>
         <TABLE BORDER="1" bgcolor="#EEEEEE">
            <xsl:apply-templates select="BankAccount" />
         </TABLE>
         </BODY>
         </HTML>
       </xsl:template>

       <!-- One table row per account; the cells appear in this fixed order -->
       <xsl:template match="BankAccount">
         <TR>
           <xsl:apply-templates select="Number" />
           <xsl:apply-templates select="Name" />
           <xsl:apply-templates select="Type" />
           <xsl:apply-templates select="OpenDate" />
           <xsl:apply-templates select="Balance" />
         </TR>
       </xsl:template>

       <!-- Each child element becomes one table cell holding its text -->
       <xsl:template match="Number">
         <TD STYLE="font-size:12pt; font-family:serif">
           <xsl:apply-templates />
         </TD>
       </xsl:template>

       <xsl:template match="Name">
         <TD STYLE="font-size:12pt; font-family:serif">
           <xsl:apply-templates />
         </TD>
       </xsl:template>

       <xsl:template match="Type">
         <TD STYLE="font-size:12pt; font-family:serif">
           <xsl:apply-templates />
         </TD>
       </xsl:template>

       <xsl:template match="OpenDate">
         <TD STYLE="font-size:12pt; font-family:serif">
           <xsl:apply-templates />
         </TD>
       </xsl:template>

       <xsl:template match="Balance">
         <TD STYLE="font-size:12pt; font-family:serif">
           <xsl:apply-templates />
         </TD>
       </xsl:template>

    </xsl:stylesheet>

    We won't go into the details of the above stylesheet. If you wish to learn more about XSLT, Wrox Press has published an excellent book, XSLT Programmer's Reference, and you can read a sample chapter from it right here at perfectxml.com.

    If you wish to try out this stylesheet, simply save it as c:\1.xsl. Start Visual Basic 6.0 and create a new Standard EXE project, add a reference (Project | References) to Microsoft XML, v3.0 (MSXML3.DLL), write the following lines in the Form_Load() method, and step through the code (F8) to understand it better.

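    ' Two DOM documents: one for the source XML, one for the stylesheet
    Dim xmlobj As New MSXML2.DOMDocument30
    Dim xslobj As New MSXML2.DOMDocument30
    Dim strOutput As String

    ' Load synchronously so that Load returns only after parsing is complete
    xmlobj.async = False
    xslobj.async = False

    xmlobj.Load "c:\1.xml"
    xslobj.Load "c:\1.xsl"

    ' Apply the stylesheet and get the result back as a string
    strOutput = xmlobj.transformNode(xslobj)

    MsgBox strOutput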



    The above Visual Basic code loads the XML and XSL files, uses DOMDocument's transformNode method to apply the transformation, and gets the transformed text back as a string. See how easy it is to write and use stylesheets.
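
    If the idea of templates being "applied" still feels abstract, the following toy sketch may help. It is not XSLT and not an XSLT processor: it is a hand-written Python imitation of what the sample stylesheet does, using only the standard xml.etree.ElementTree and html modules. All the function names and the TEMPLATES table are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

ACCOUNT_XML = b"""<?xml version="1.0" ?>
<BankAccount>
    <Number>1234</Number>
    <Name>Darshan Singh</Name>
    <Type>Checking</Type>
    <OpenDate>11/04/1974</OpenDate>
    <Balance>25382.20</Balance>
</BankAccount>"""

CELL_STYLE = "font-size:12pt; font-family:serif"

def cell_template(element):
    # Plays the role of the templates matching Number, Name, Type, etc.
    return '<TD STYLE="%s">%s</TD>' % (CELL_STYLE, escape(element.text or ""))

def bank_account_template(element):
    # Plays the role of the template matching BankAccount.
    cells = []
    for tag in ("Number", "Name", "Type", "OpenDate", "Balance"):
        child = element.find(tag)
        if child is not None:  # a missing child is simply skipped
            cells.append(apply_templates(child))
    return "<TR>" + "".join(cells) + "</TR>"

# One rule per element name, like the match attributes in the stylesheet.
TEMPLATES = {
    "BankAccount": bank_account_template,
    "Number": cell_template,
    "Name": cell_template,
    "Type": cell_template,
    "OpenDate": cell_template,
    "Balance": cell_template,
}

def apply_templates(element):
    # Find the rule for this element and let it produce the output.
    return TEMPLATES[element.tag](element)

def root_template(document_element):
    # Plays the role of the template matching "/".
    return ('<HTML><BODY><TABLE BORDER="1" bgcolor="#EEEEEE">'
            + apply_templates(document_element)
            + "</TABLE></BODY></HTML>")

page = root_template(ET.fromstring(ACCOUNT_XML))
print(page)
```

    A real stylesheet processor does the rule matching for you using XPath patterns, which is why the XSLT version needs no dispatch table or recursion code of its own.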

    We said that this would be a short tutorial, and hopefully by now you have some idea of what XML is, how you can use it, and what XML parsers and XSLT are. So now it's time for a summary.


  • Summary
    XML, or Extensible Markup Language, is a W3C standard that offers a new way to exchange structured data. This article was a brief introduction to XML. Many other standards are being developed around and for XML. At perfectxml.com, we have categorized all of these under our focus sections, so be sure to visit them all and keep yourself up to date with this wonderful world of extensibility.

    By the way, my "account statements" application is working great, and I am sure it will never break again, thanks to XML!

    If you have any questions or comments, feel free to contact the author of this article at darshan@perfectxml.com.


