Sample chapter from:
XSLT 2.0 Web Development
Reproduced from XSLT 2.0 Web Development (Kirsanov) by permission of Prentice Hall PTR. ISBN 0131406353, copyright 2004. All rights reserved. See www.phptr.com for more information.
This chapter is a practical complement for
I cannot claim to cover everything: Your web site may well contain
unique elements that won't fit common schemes. Here, only the
most general and frequently used constructs are covered, and the approaches
described in this chapter may not be optimal for all
situations. Many examples are given, but rather than copy them
over, try to use the reasoning behind these examples to analyze your
The first part of the chapter deals with markup constructs commonly used
in page documents, including headings, paragraphs and
paragraph-like elements, links, images and other non-XML objects,
In this and subsequent sections, we look at the informational
core of a web page, stored in its own source document (page
document). Peripheral components such as navigation, parameters
of the site environment, and metadata are stored in the master
document, the subject of
Every XML document has a root element, and since we're talking
about page documents here, there's no reason not to call this
element page. Its attributes and children are the
natural place to store the
In addition to its primary content, each page document includes certain metadata. Some of it may end up as a visible part of the web page, some may be hidden in HTML metadata constructs (keywords and descriptions in meta elements), some may be used during transformation but not included in the resulting HTML code, and some may not be used at all except for reference or source annotation purposes. Common examples of metadata include page creation date, change log, author(s) and editor(s), copyright and licensing information, and the language of the page.
Note that only information specific to this particular
page must be stored in it; if some metadata bits are shared by more
than one page, their proper place is in the master document
The most important piece of metadata is the page's unique
identifier used to resolve internal links (
Page coordinates. The same applies to the information on the position occupied by this page
in the site's hierarchy. As we'll see later (
Everything else. Any other page metadata is normally stored in the page document. Simple values can be stored in attributes of the page's root element. More complex constructs that require their own elements can be placed either directly under the root or inside an umbrella parent element (e.g., metadata) that is a child of the root element.
You'll likely need some intermediate structural layers between
Sections or blocks? The traditional document hierarchy - sections, subsections, subsubsections, and so on - is not often seen on the Web. Instead, information is more commonly broken into relatively small blocks with few or no hierarchical relations between them. Different sites may call these blocks "stories," "blurbs," "columns," "modules," "writeups," and myriad other names.
Whichever of these names seems most intuitive to you is
the one to use as your block construct's element type
name. Contentwise, a block is a unit with mostly fixed structure
that may include both obligatory (e.g., heading and body) and
optional (e.g., icon, heading links, author byline) components
in both parallel and sequential (
<block id="unique" icon="block_icon" type="story">
  <head link="address">Block heading</head>
  <subhead>Optional block subheading</subhead>
  <p>And here goes a paragraph of text.</p>
  <p>Possibly one more paragraph.</p>
  <author>An optional author byline</author>
</block>
Block types. It is likely that you will have more than one type of block construct - for example, front page news blocks, subpage body blocks, and ad blocks. In the simplest case, everything on the page can be treated as one big block, so the page's root element can be considered the root element of a block.
Different types of blocks will likely have many common structural features - in part because they all belong to one site with its common information architecture and visual design. Only if different types of blocks have clearly distinct structures can you use different element types for them; otherwise it is best to use the same generic element type (e.g., block) with different values of the type attribute. This provides two major benefits:
Your validation code will be simpler to write and maintain.
Management of orthogonal content (
In general, analogous but different structures should only differ by a minimum number of obvious features; avoid random, meaningless differences.
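As a minimal sketch of how a stylesheet might take advantage of a single generic element type, the templates below handle every block with one shared rule and override it only for a type that really differs. The type value ad, the default value body, and the HTML class names are assumptions made for illustration, not names fixed by this chapter.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Shared rule for all blocks. The HTML class is derived from @type;
       'body' is an assumed default for blocks that have no type. -->
  <xsl:template match="block">
    <div class="block-{(@type, 'body')[1]}">
      <xsl:copy-of select="@id"/>
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <!-- Hypothetical override: ad blocks keep only their paragraphs. -->
  <xsl:template match="block[@type = 'ad']">
    <div class="ad">
      <xsl:apply-templates select="p"/>
    </div>
  </xsl:template>

  <xsl:template match="block/head">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>

  <xsl:template match="block/p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```

Because the override is expressed as a predicate on the same element type, adding a new block type later means adding one template rather than touching every rule that mentions block.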
Brief highlighted text fragments that preface or summarize longer pieces of text are very common on web pages. A heading may apply to the entire page, a section within the page, or even a single sentence or link - but it must apply to something, for a heading only exists as a member of a "head and body" pair.
Look up the number. HTML has long used the h1 to h6 element types for six levels of headings. You can borrow these names, or you can make them less cryptic by using head1, head2, and so on. In any case, this approach only works if you really need several levels of headings and if these levels are free of any additional semantics - that is, if you can more or less freely move a branch of your headings tree upward or downward in the hierarchy.
If this is not true - for example, if your third-level headings are reserved specifically for sidebars that cannot be promoted to second-level sections - then the number-based naming scheme is not a good idea at all. Imagine that one day you need to add sections inside a sidebar - this will look ugly if your sidebar headings are marked up, say, as h4, while sections are h2.
Ask my parent who I am.
It is vastly more convenient to use
<section>
  <head>This is a section heading</head>
  ...
  <subsection>
    <head>And this is a subsection heading</head>
    ...
  </subsection>
</section>
This scheme is intuitive, easy to remember, and therefore easy to use. Even though there is only one heading element type, XSLT or Schematron will have no problem determining the role of each particular heading by checking its parent element. At the same time, implementing processing that is common to all headings is very straightforward with this approach.
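The following sketch shows one way the parent check might look in XSLT. The mapping of section headings to h2 and subsection headings to h3 is an assumption, as is the named template heading-content, which stands in for whatever processing all headings share.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The role of a head is determined by its parent element. -->
  <xsl:template match="section/head">
    <h2><xsl:call-template name="heading-content"/></h2>
  </xsl:template>

  <xsl:template match="subsection/head">
    <h3><xsl:call-template name="heading-content"/></h3>
  </xsl:template>

  <!-- Processing common to all headings lives in one place.
       Here it copies the parent's id (if present) onto the HTML heading
       and then processes the heading's own content. -->
  <xsl:template name="heading-content">
    <xsl:copy-of select="../@id"/>
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```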
XHTML 2.0 implements a similar scheme except that its element type for a heading, h, is always a child of a section (although sections can nest). This is understandable - XHTML cannot realistically cover all possible kinds of structural units that might require headings. On the other hand, this brings us back to an "anonymous" naming scheme that is only slightly better than the old h1...h6: Now you can easily move sections around with their headings, but still no useful semantics is attached to each heading. You can, however, use the class attribute to designate exactly what kind of a heading or section this is.
The next question is, what is the auxiliary information to be stored with your headings? In most cases, the plain text of the heading itself is sufficient, but there are exceptions. For example, a heading usually has a unique (either within the page or, more usefully, within the entire site) id attribute used in cross-references or hyperlinks to this section from elsewhere.
In fact, a typical reference is supposed to refer to the section (or other structural unit) to which the heading belongs, not to the heading itself. Still, most authors prefer to use headings for linking, partially due to the HTML inertia (there are no sections in today's HTML) and partially because this allows them to more easily reuse the text of the heading in the textual part of the link.
For example, if your heading is marked up as
<head id="attrib">Attributes</head>
and you have a reference to it from somewhere, written as
...see <link to="attrib"/> for more on this.
this can be easily transformed into
...see 2.1, "Attributes" for more on this.
in plain text, or to
...see <a href="#attrib">2.1, Attributes</a> for more on this.
in HTML (here, "2.1" comes from an automatic count of preceding and ancestor sections). On the other hand, given that XSLT can easily traverse from a heading element to its parent, there's no real reason to use headings for linking in XSLT-based projects. The same link rendering could just as well be obtained from
<section id="attrib">
  <head>Attributes</head>
  ...
</section>
which looks less tautological and better reflects the fact that both the head and the id are properties of the section.
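A sketch of how such a reference could be resolved when the id sits on the section: the stylesheet looks up the section by its id, numbers it by counting preceding and ancestor sections, and reuses the text of its head. The key name section-by-id is hypothetical, and the sketch assumes that nested sections all use the section element type and that each has a single head.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Index all sections by their id attribute. -->
  <xsl:key name="section-by-id" match="section" use="@id"/>

  <xsl:template match="link[@to]">
    <xsl:variable name="id" select="@to"/>
    <!-- Switch context to the target section; if the id is unknown,
         the loop body is skipped and no link is produced. -->
    <xsl:for-each select="key('section-by-id', $id)[1]">
      <a href="#{$id}">
        <!-- Multilevel number built from ancestor and preceding sections -->
        <xsl:number count="section" level="multiple" format="1.1"/>
        <xsl:text>, </xsl:text>
        <xsl:value-of select="head"/>
      </a>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```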
If necessary for your site's design, you may need to store a
reference to a graphic file for each heading (see
The question of what children to allow within headings boils down to the question of how far beyond plain text you are willing to go. Would you need textual emphasis within headings? What about links? The laziest approach is to allow everything that is allowed within a paragraph of text - and it will work fine in most cases. A different content model for headings is necessary only if you expect problems with complex markup in headings and want to guard against them.
Depending on your requirements, other children may be
necessary for heading elements. For example, you may want to store
the same heading in two or more languages, with the stylesheet
selecting one of the languages for presentation depending on a
global language parameter (see also
You may also want to keep both full and abridged versions of a heading. For example, newspapers often use a specific abbreviated English syntax for their headlines (as in U.S. Patriot Act attacked as threat to freedom), but for the purposes of automatic indexing and natural language processing, the fully grammatical version of the same heading might be required (The U.S. Patriot Act was attacked as a threat to freedom).
A special kind of a heading specific to HTML documents is the title of a page, normally displayed in the title bar of a web browser window as well as in bookmarks or search results listing this page. Even though, as a general rule, your target vocabulary must not influence your semantic source vocabulary, you should plan ahead as to what source element(s) will be transformed into the web page title.
If each of your pages has a visible on-page heading that applies
to the entire page, it is natural to duplicate it as a web page
title. Otherwise, it is always a good idea to provide a heading for
any sufficiently large information unit. Even if in your
target rendition this heading will only be used for a peripheral
element such as page title or not used at all, the very act of
christening a piece of data disciplines your thinking and serves as
an additional checkpoint to ensure the consistency
A web page title is often used for orientation within the site. A
sequence of parent sections' headings, culminating in the name of
the entire site, may be appended or prepended to the current page's
A paragraph is a sequence of sentences that traditionally represents
a complete, single thought. Today, however, paragraphs are often
used for structuring the text flow visually, rather than for organizing the
flow of ideas within it. Online, paragraphs tend to be smaller than in
print, and other means of text organization (such as blocks,
Still, whenever you have a container for more than just a small bit of text, your schema should permit inserting one or more intermediate paragraph elements between this container and its text. In most cases, this intermediate level may be optional; for example, your block elements could be allowed to contain either direct text content (for short fragments less than a paragraph) or a sequence of paragraph elements (for longer pieces of text). This approach adds a degree of laxity to your schema but is very convenient in daily markup practice.
As for the element type name, there is no reason not to use HTML's p, although para would be more appropriate for users who might find p too cryptic.
Lists are a special construct that is closely related to paragraphs. Two common types of lists offered by HTML are unordered (bulleted) and ordered (numbered), differing in how the items in the list are adorned. For our XML markup, we could borrow HTML's model, with a parent element (e.g., ordered-list) enveloping the entire list and children elements (e.g., item) marking up individual items.
The only possible ambiguity with regard to list markup is how to correlate list items with paragraphs. Often, each list item is a paragraph, so you may be tempted to consider paragraph elements redundant and disallow them from list markup completely. However, as soon as you run into an item of two or more paragraphs, you may regret this decision. I recommend using the convention discussed in the last section: Allow both paragraphs and direct text content within list item elements.
In fact, this is what is implemented by HTML 4; its li element can contain both inline and block content (i.e., both text children and paragraph elements, among others).
Most links refer to entire web pages, but sometimes you need to pinpoint a particular location within a page. In HTML, you can make a link target from as small a piece of text as you like, down to a single sentence or word (by enclosing it in an a element with the name attribute). In most graphic browsers, however, the only visible result of jumping to an in-page link is the page being scrolled down so that the linked point is at the top edge of the window.
This means that - unless your linked sentence happens to start at the beginning of a screen line - the visible portion of the newly loaded page starts in mid-sentence. This result is quite confusing and makes it nearly impossible to guess what exactly the link referred to. For this reason alone, it is advisable to only allow anchoring links to block-level elements, including paragraphs.
With XML, it is easy to enforce that rule because you most likely won't have any target enveloping element (like a in HTML). What you need instead is an attribute, only applicable to block-level elements, that turns its element into a link target. It often makes sense to reuse the almost-standard id attribute for this purpose (in addition to its numerous other uses); it does no harm if some elements with ids create HTML link targets that are never linked to.
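As a minimal sketch, assuming paragraphs are marked up as p and ids are unique within a page, a template only needs to carry the source id over to the HTML element for a browser to treat it as a link target:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- A paragraph with an id becomes an in-page link target;
       a paragraph without one is output as a plain p. -->
  <xsl:template match="p">
    <p>
      <xsl:copy-of select="@id"/>
      <xsl:apply-templates/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```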
Sometimes, you'll need to present an object that breaks the paragraph flow, but doesn't necessarily start a new paragraph. Often, this is a mathematical formula or a programming code fragment that must start on a new line.
Such a piece of displayed material is a block-level element from an HTML perspective; semantically, however, it is often an inseparable part of an adjacent paragraph containing the introductory or explanatory text for this displayed item. Therefore, it makes sense to allow the displayed material elements to be used only as children of paragraph elements.
The "HT" in HTML stands for HyperText, and the early historical Web was very much textual. Despite all the graphic and multimedia advances of recent years, this textual foundation has not eroded. The advance of XML has, if anything, only strengthened it.
Any text markup language must provide a sufficient inventory of markup constructs for in-flow text fragments that for some reason must be differentiated from their context. Examples of such fragments include emphasized words or phrases, names or identifiers, quotes, and foreign language citations.
Block and inline elements. HTML (as well as other presentation-oriented vocabularies, for instance XSL-FO) differentiates between block-level and inline-level objects. This distinction has to do mostly with visual formatting, as block-level elements are supposed to be stacked vertically, while inline elements are part of the horizontal flow of text. Therefore, it is not really relevant for your semantic XML markup, which must reflect content structure, not formatting. Still, since HTML is your primary target format, the block/inline distinction may sometimes have repercussions for your source definition.
Thus, it may be difficult to handle situations where a
source element that normally transforms into an inline-level target
element has to apply to a larger fragment of a document (
Existing vocabularies. DocBook is an established standard dating from 1991 that is used mostly for technical books and documentation. It may well be the most widely used XML vocabulary after XHTML; when somebody tells you, "My documents are in XML," chances are it's actually DocBook. Software support for this vocabulary is also quite good.
DocBook is vast but not too deep, so it is simple to learn despite its large number of element types (epigraphs, bibliographies, programming code, glossaries, and so on). If you don't understand what a particular element type is supposed to do, you probably don't need it (yet). For those constructs you do need, however, DocBook may be a rich source of text markup and structuring wisdom.
TEI (Text Encoding Initiative)
is an older and bigger beast, developed for markup of all kinds of
scientific and humanities texts. Compared to DocBook, it is focused
more on low-level text markup than on high-level book
structures. The TEI DTD offers many modules
Your source XML must be semantic; that is, it must reflect the meaning of text-level constructs, not their presentation. The em and i element types, both present in HTML, provide a canonical illustration of this principle. While an i element dictates using an italic face in visual media, an em only designates an emphasis, which is a semantic concept rendered differently in different media. For example, a fragment of text inside em can be set in italic in a graphic browser, but it can also be highlighted in a text-mode browser or read aloud emphatically by a speech browser.
Modern HTML deprecates i and other
presentation-oriented element types; instead, you are supposed to use
appropriate semantic element types such as em, possibly in
combination with CSS. In your XML source, however, deprecating
anything is not an option - you have to make sure that with your
schema, no presentation-oriented markup is possible at
all. Formatting hints (
The same visible formatting may result from different source markup. For example, you may use the same italic font face for both emphasis and citations, but they must be marked up differently in your source. What only a human reader can distinguish in the formatted result should, ideally, be automatically distinguishable in the source.
In general, semantic markup in the source should be richer
and more detailed than the resulting HTML markup after
transformation. For example, it is often a good idea to use special
element types to mark up all dates, person names, or company names
in your source, even though in the resulting web pages they are not
formatted in any
Why mark up what you don't need right here and now? Because your XML source is more than just an undeveloped (as in "undeveloped film") version of the web site. Rather, it is the start of a project that will keep growing and changing, sprouting new connections and renditions over time. For example, you may want to reuse your web site material in PDF brochures, interactive CDs, archival and search applications, and more.
This means that your XML source must be able to serve as the semantic foundation not only for your current site but also for everything it can potentially become. You may not need any extra markup right now, but it may come in very handy when you extend your site or reuse the source documents for anything beyond the web site pages.
Imagine that one day you need to convert all dates on your site from one format to another (e.g., from MM/DD/YY to DD/MM/YY). Dealing with dates scattered in the text is so much easier if all of them are marked up consistently - for example,
... which happened on <date><month>09</month><day>04</day><year>2003</year></date>.
instead of simply
... which happened on 09/04/2003.
With rich markup, you can change dates' rendition (e.g., reorder date components or use a different separator character) without touching the source at all, simply by modifying the stylesheet.
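To illustrate, here is a sketch of a template for the date markup shown above. The two stylesheet parameters, date-order and date-separator, are hypothetical names introduced for this example; changing their values changes every date on the site without any edits to the source documents.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical global parameters controlling date rendition -->
  <xsl:param name="date-order" select="'mdy'"/>      <!-- 'mdy' or 'dmy' -->
  <xsl:param name="date-separator" select="'/'"/>

  <xsl:template match="date">
    <!-- The comma operator keeps the components in the order written here -->
    <xsl:variable name="parts"
        select="if ($date-order = 'dmy')
                then (day, month, year)
                else (month, day, year)"/>
    <xsl:value-of select="$parts" separator="{$date-separator}"/>
  </xsl:template>

</xsl:stylesheet>
```

With the defaults, the marked-up date above is rendered as 09/04/2003; setting date-order to dmy gives 04/09/2003.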
On another occasion, you may decide to paint all company names (or only your own company's name) green on your web pages. Or, you may find it a good idea to automatically compile an index of all persons' names mentioned on your site. All of these tasks are only possible if your source XML has these elements consistently and unambiguously marked up.
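Both tasks might be sketched as follows. The element type names company and person, the mode name person-index, and the CSS class are assumptions, since this chapter does not prescribe them; the actual color would be assigned to the class in the site's CSS.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Company names get a class of their own. -->
  <xsl:template match="company">
    <span class="company"><xsl:apply-templates/></span>
  </xsl:template>

  <!-- Alphabetical index of all persons mentioned on a page, with the
       number of mentions. Invoke it by applying templates to the page
       element in the person-index mode. -->
  <xsl:template match="page" mode="person-index">
    <ul>
      <xsl:for-each-group select=".//person" group-by="normalize-space(.)">
        <xsl:sort select="current-grouping-key()"/>
        <li>
          <xsl:value-of select="current-grouping-key()"/>
          <xsl:text> (</xsl:text>
          <xsl:value-of select="count(current-group())"/>
          <xsl:text>)</xsl:text>
        </li>
      </xsl:for-each-group>
    </ul>
  </xsl:template>

</xsl:stylesheet>
```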
The need for rich text markup obviously depends on the quality, value, and planned longevity of your material. You don't need rich markup for short-lived stuff, but if you want your material to remain useful in the long term, you should always try to think in terms of "what markup is perfect for this content" rather than "what markup is sufficient for the task at hand." Examples of long-lived or otherwise valuable content include standards, specifications, historical texts, etc.
Existing vocabularies. As an example (and a good source of ideas), consider NITF (News Industry Text Format), which is a standard vocabulary for rich markup of news stories. Only a necessary minimum of NITF markup may be used in a story that goes directly to press; however, for exchange, syndication, or archival use, a complete enriched NITF markup is required. A properly prepared NITF news story uses rich markup to answer questions such as who the story is about, when and where the described event occurred, and even why it is considered newsworthy by the story author.
The text elements we've discussed in this section would be termed inline in HTML, meaning they are only allowed within block elements such as paragraphs. However, this limitation does not always make sense. For example, a rich markup element such as emphasis may need to be applied to more than one complete paragraph.
Usually, this is an indication that these paragraphs constitute some logical entity, such as a quotation, which (rather than the emphasis itself) you need to mark up. However, there may be situations where no such element exists, but inline text markup still has to spread across one or more block elements. What are we to do in such cases?
Inserting a separate inline markup element within each paragraph is the least elegant solution:
<p><em>This is the first paragraph using emphasis throughout.</em></p>
<p><em>And this is the second emphasized paragraph.</em></p>
This leads to unnecessary duplication of markup, poor maintainability, and just plain ugliness. This is the only option, however, if your emphasis spans one paragraph and a half.
The simplest approach is to just do away with the inline/block distinction and allow any text markup to be applied at any level of the hierarchy, both below and above the paragraph level. This will allow you to enclose all affected paragraphs into a common parent element specifying emphasis:
<em>
  <p>This is the first paragraph using emphasis throughout.</p>
  <p>And this is the second emphasized paragraph. Note that we can use <em>nested emphasis</em>.</p>
</em>
This might make sense, especially in contexts where you want to
allow both paragraphs and short non-paragraph text fragments
It might be argued, on the other hand, that the emphasis spanning
one or more paragraphs is semantically different from the emphasis that
spans one or more words. Therefore, they could use different
<emphasis>
  <p>This is the first paragraph using emphasis throughout.</p>
  <p>And this is the second emphasized paragraph. Note that we can use <em>nested emphasis</em>, but this time it is a different element type for the inline level.</p>
</emphasis>
If the paragraph-level emphasis is semantically connected with the paragraph element, you can instead add an attribute to those paragraphs that fall within its scope:
<p type="emphasis">This is the first paragraph using emphasis throughout.</p>
<p type="emphasis">And this is the second emphasized paragraph. Again, <em>nested emphasis</em> is possible.</p>
Among these options, there is perhaps no single winner suitable for all situations. Your choice will depend on the semantics of the element in question, the frequency of its use at inline and block levels, and the possible connections between its semantics and that of the standard block-level element (paragraph).
Another issue with text markup is whether nesting elements of one type is to be allowed. Presentation-oriented markup never uses, for instance, i within i - but for semantic markup, a similar structure may be meaningful. Thus, emphasis within emphasis or a quote within a quote are both perfectly valid semantically, even though in an HTML rendition, nesting of the corresponding formatting elements may have no visible effect.
Therefore, to properly transform nested semantic markup, you must use different formatting depending on the nesting level of the semantic element. For example, if you use italic face for emphasis, nested emphasis can be rendered either as regular face ("toggle" approach, where you switch between regular and italic faces for each new nesting level) or as bold italic face ("additive" approach, in which the italic rendition of the parent is augmented by the bold formatting of the child).
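The toggle approach can be sketched with a single template that counts the em ancestors of the current element. The inline style used to switch back to the regular face is just one possible choice; a CSS class would serve equally well.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="em">
    <xsl:choose>
      <!-- Even nesting depth (0, 2, ...): ordinary emphasis,
           which browsers typically render in italic -->
      <xsl:when test="count(ancestor::em) mod 2 = 0">
        <em><xsl:apply-templates/></em>
      </xsl:when>
      <!-- Odd nesting depth: toggle back to the regular face -->
      <xsl:otherwise>
        <em style="font-style: normal"><xsl:apply-templates/></em>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

For the additive approach, the second branch could output a strong element instead, so that the child's bold face is added to the italic inherited from the parent.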
A hyperlink is a very rich concept, even though its implementation in HTML is rather primitive. Basically, an HTML link consists of two parts: the address that tells the browser where to go and the link element itself that (with its attributes and children) defines the link's presentation and behavior. However, in HTML, all possible address types are limited to a single syntax (URI), and all possible link types are served by one element type (a) with a limited set of attributes. Let's see how we can improve this scheme.
Note that this section only covers inline links that are
part of the body of a page and thus need to be specified in the
page's XML source. Navigational links, created by the
stylesheet based on the master document data, are discussed in
When deciding how to cast your linking semantics into XML constructs, it is natural to reuse the HTML approach with a link consisting of an element (signaling the link) and its attributes (providing the address and other link properties). For example, you might write
This was <link address="address">reviewed</link> elsewhere.
However, this only looks good when you're linking text fragments within a text flow. As soon as you have a separate element representing some object that may have a link property (among others), it is much more convenient to designate the link by an extra attribute of that element rather than a wrapper element. For example, it is easier to create a linked image like this:
<image src="button" link="address"/>
compared to the HTML-inspired approach:
<link address="address"><image src="button"/></link>
Other properties of a link besides the address (such as its title, behavior, or classification) can similarly be attached as attributes to an element that represents a nontextual link.
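On the stylesheet side, the attribute-based form is not hard to support. In the sketch below, an image with a link attribute is wrapped in an HTML a element and then handed over to the ordinary image template. The alt attribute is hypothetical, and a real stylesheet would resolve the src and link values into actual URLs rather than copying them as is.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Linked image: create the wrapper, then fall through to the
       lower-priority image template below. -->
  <xsl:template match="image[@link]">
    <a href="{@link}">
      <xsl:next-match/>
    </a>
  </xsl:template>

  <!-- Any image, linked or not -->
  <xsl:template match="image">
    <img src="{@src}" alt="{@alt}"/>
  </xsl:template>

</xsl:stylesheet>
```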
So, we see that it is natural to express linking semantics via a
set of attributes that may apply to many different elements (or even
to any element at all) instead of an element type with its own fixed
attributes. This is because a link is most often an attribute of
some object rather than an object in itself. This approach was
implemented in W3C's XLink
standard, and you may
consider incorporating XLink into your source definition for link
markup (however, please read the rest of
For in-flow textual links, you still need a generic linking element type (such as link in the example above) that only serves as a markup container for the same set of linking attributes. Most schema languages have no problem defining a separate set of attributes that can be used in different element types.
Along with the href attribute with the link's URI,
an a element in HTML may provide a target
attribute for specifying the target window or frame for the linked
program various aspects of the link's behavior (e.g., actions
performed when the link is activated), your
source XML may also need to provide link properties
Categorizing links. For example, analysis may reveal that your links fall into one of the following categories:
internal links (links to other pages within the site);
external links (links to other sites);
dictionary links (links to a script on an online dictionary site providing definitions for linked words); and
thumbnail links (thumbnails linked to pop-up windows with larger versions of images or pages).
Both thumbnail links and dictionary links may be either internal
or external. However, they need to be classified separately because
of their special role on the pages, resulting in different
formatting and behavior. On the other hand, you may not be planning
any formatting or behavior differences between internal and external
links, but separating them into different types is still a good idea
because it is a natural classification and because this lets you make
your address abbreviations (discussed in
Classifier attributes. To differentiate these link types, we could add a classifier attribute, e.g. linktype, specifying the type of the link:
...available on the <link linktype="external" link="www.kirsanov.com/te/">original site</link> and <link linktype="internal" link="mirror/te">mirrored here</link>.
This approach works both with standalone link elements and with any other elements that may need to use these linking attributes (e.g., image). Note that we used linktype rather than type and link rather than address for the attribute names so that the common prefix, link, will help you keep track of these attributes as a group without the risk of confusing them with their parent elements' native attributes. You can also separate all linking attributes into a namespace of their own, but this is not really necessary unless you plan to use them with different document vocabularies.
It's also advisable to make all linking attributes but the address (i.e., link) optional and provide sensible default values. For example, you can mandate that the missing linktype attribute in a linked element implies that the link is internal.
Classifier element types. For in-flow links, instead of (or, better, in addition to) the bulky classifier attribute, a separate element type for each link type is more convenient. As these element types will be used quite often, each should have a short but clear name:
...available on the <ext link="www.kirsanov.com/te/">original site</ext> and <int link="mirror/te">mirrored here</int>.
Separate element types have the additional advantage of being easier to validate with grammar-based schema languages like DTD or XML Schema.
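If you allow both notations, the stylesheet can map each pair to the same rule, including the default for a link element with no linktype. This is only a sketch: the site-root parameter and the way addresses are expanded into URLs are placeholders, since the real expansion depends on your address abbreviation scheme.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: prefix for all internal addresses -->
  <xsl:param name="site-root" select="'/'"/>

  <!-- External links, in either notation -->
  <xsl:template match="ext | link[@linktype = 'external']">
    <a class="external" href="http://{@link}">
      <xsl:apply-templates/>
    </a>
  </xsl:template>

  <!-- Internal links, in either notation; a link element without
       a linktype attribute is treated as internal. -->
  <xsl:template match="int | link[@linktype = 'internal'] | link[not(@linktype)]">
    <a class="internal" href="{$site-root}{@link}/">
      <xsl:apply-templates/>
    </a>
  </xsl:template>

</xsl:stylesheet>
```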
Advanced link types.
Other link types may have their own sets of required and optional
attributes and may perform other functions, besides creating a
link. For instance, dictionary links from the above classification
are likely to be used only within text flow, so we can introduce a
special element type for them and declare that whenever the address
attribute is missing, the element's content is taken as the address:
...was going to <def>disembogue</def> profusely. ...at which point it <def word="disembogue">disembogued</def> itself...
Here, two occurrences of the obscure word disembogue are
linked to a dictionary site, so that a pop-up window or floating
tooltip with the word's definition could be displayed when the link
is activated in some way (e.g., clicked or hovered over). You don't
need to specify the dictionary site to use, or the complete URL for
the pop-up; all this is taken care of by the stylesheet. The only
thing you may need in the source is the word attribute
that optionally provides the base form of the linked word or phrase;
if it is absent, the contents of the def element are used instead.
For generality, this special kind of link can also be given by a link element with linktype="dictionary" and the link attribute playing the role of word.
Similarly, a thumbnail link could be created by a
thumb element with a single attribute
(e.g., image). This attribute would provide the identifier of the
corresponding image, with the stylesheet doing all the rest: inserting
and formatting the thumbnail, creating a display page with the full-size
version of the image, and linking it to the thumbnail. The stylesheet
can even automatically create the thumbnail from a full-size image.
When creating a link, we usually want to specify a certain piece of content that the link will point to. What a URL allows us to specify, however, most often is a file that can be moved, renamed, or deleted even if the content we are interested in is still out there somewhere. Moreover, a URL includes a lot of technical information (protocol, file extension) that is not relevant for our purpose of establishing a content-level link.
All this invites the idea of using abbreviated addresses that would hide the underlying technical complexity of URLs and provide an abstraction layer protecting our semantic XML from URL changes. For each address, we will create an identifier to be used in the XML source; at transformation time, the stylesheet will resolve this identifier into the actual URL to be put into the corresponding HTML link element.
Example: RFC links. Suppose you often need to link to enumerated documents such as RFCs. Such links could use a special value of the link classifier attribute and/or an element type of their own. However, to make them even more convenient, it is natural to use only the RFC number as an abbreviation for the complete URI:
...as per <rfc num="1489"/>.
Or, the same could be spelled out in a generic fashion:
...as per <link linktype="rfc" link="1489"/>.
This latter variant uses generic linking attributes that can be applied to different elements to make links out of the corresponding objects, whereas the num attribute is only recognized in an rfc element.
The XSLT stylesheet will have to recognize this type of link, possibly apply some special formatting to it, and most importantly resolve (unabbreviate) the abbreviated address. In this example, unabbreviation would supply the complete URL of the referenced document for the HTML link:
...as per <a href="ftp://ftp.rfc-editor.org/in-notes/rfc1489.txt">RFC 1489</a>.
You could also allow an rfc element to enclose character content:
...which was <rfc num="1489">defined</rfc> in 1993.
which would give the following in HTML:
...which was <a href="ftp://ftp.rfc-editor.org/in-notes/rfc1489.txt">defined</a> in 1993.
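A template performing this unabbreviation might look roughly as follows. This is only a sketch: the URL pattern is copied from the HTML output shown above, and the rule that an empty element gets the generated label "RFC" plus its number is inferred from the two examples rather than prescribed anywhere.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Handles both spellings: <rfc num="1489"/> and
       <link linktype="rfc" link="1489"/>. -->
  <xsl:template match="rfc | link[@linktype = 'rfc']">
    <!-- The RFC number comes from num if present, otherwise from link. -->
    <xsl:variable name="num" select="(@num, @link)[1]"/>
    <a href="ftp://ftp.rfc-editor.org/in-notes/rfc{$num}.txt">
      <xsl:choose>
        <!-- Character content, if any, becomes the link text. -->
        <xsl:when test="node()">
          <xsl:apply-templates/>
        </xsl:when>
        <!-- An empty element gets a generated label. -->
        <xsl:otherwise>
          <xsl:value-of select="concat('RFC ', $num)"/>
        </xsl:otherwise>
      </xsl:choose>
    </a>
  </xsl:template>

</xsl:stylesheet>
```

Keeping the URL pattern in one template means that if the RFC archive ever moves, only this rule changes, and none of the page documents do.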
Mnemonic addressing. Abbreviated addresses in your source XML must be unique only within your site, as opposed to URLs that are globally unique. This means you can make them easier to remember and more meaningful (to you) than are URLs. The abbreviated addresses are also completely devoid of irrelevant technical details and can be arbitrarily long (i.e., detailed and readable) or arbitrarily short (i.e., quick to type and quick to read).
You can use as many independent abbreviation schemes as necessary. Each more or less complete and logical group of addresses can be served by its own abbreviation algorithm (and the corresponding resolver in the stylesheet). For example, links to an online dictionary or search engine might be abbreviated to just the word you want to look up; links to W3C standards can be represented by their unique identifiers as used by the W3C site (e.g., xslt20 for XSLT 2.0, which unabbreviates into http://www.w3.org/TR/xslt20/). Any address domain whose URLs can be "losslessly compressed" into a shorter or easier-to-remember form is ripe for abbreviation.
With multiple abbreviation schemes, the stylesheet must be able
to know which one to use for each link. This is where link types come into play.
Along with resolving the address, your stylesheet can perform other
processing tasks, such as retrieving the title of the referenced RFC
to be displayed in the link's floating tooltip. A Schematron schema
for your source definition, in addition to performing link syntax
validation, can also check for broken links.
To expand the abbreviated addresses, your stylesheet may use any
sources of information, such as local or remote database queries or
even web search. It's easiest, however, to create simple algorithmic
abbreviations that map to the corresponding URLs by some simple rule.
Thus, for external links, the most obvious and perhaps the only sensible abbreviation is dropping the protocol specification (usually http://) from the URLs. Even this simple provision can make address input somewhat easier by allowing you to type www.kirsanov.com instead of http://www.kirsanov.com.
Note, however, that in this case the stylesheet must be able to recognize the protocol part of an address and only add http:// if it is missing. Addresses that already contain a protocol specification (be it http://, https://, or ftp://) must not be modified in any way.
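One way to express this check in XSLT 2.0 is a small stylesheet function built on matches(). The function name my:external-url and its namespace URI are made up for this illustration, and the sketch recognizes only addresses of the form scheme://, which covers http://, https://, and ftp://.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my-functions"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper: returns the address unchanged if it already
       starts with a protocol, otherwise prepends http://. -->
  <xsl:function name="my:external-url" as="xs:string">
    <xsl:param name="address" as="xs:string"/>
    <xsl:sequence select="
      if (matches($address, '^[a-zA-Z][a-zA-Z0-9+.\-]*://'))
      then $address
      else concat('http://', $address)"/>
  </xsl:function>

  <!-- External in-flow links; the link attribute is assumed
       to be mandatory, as recommended earlier. -->
  <xsl:template match="ext">
    <a href="{my:external-url(@link)}">
      <xsl:apply-templates/>
    </a>
  </xsl:template>

</xsl:stylesheet>
```

With this in place, link="www.kirsanov.com/te/" comes out as http://www.kirsanov.com/te/, while an address that already starts with ftp:// or https:// passes through untouched.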
An address abbreviation may contain more than one component. This is often necessary to link to scripts (as opposed to static pages) that require a number of parameters in the request URI. Some of these parameters (e.g., the partner's ID or formatting options) are static and can therefore be filled in by the stylesheet, but the key information pointers (e.g., the date and the number of the article within that date) must be present in the source of the linking page. Here's an example of a link with a multicomponent abbreviated address:
As <foonews date="02-12-2003" num="6490">reported</foonews> by FooBarNews...
which could be expanded into the HTML link:
As <a rel="nofollow" href="http://foonews.com/news?date=02-12-2003&amp;num=6490">reported</a> by FooBarNews...
One highly recommended abbreviation scheme that makes sense for almost any site is using page identifiers, defined in the master document, instead of pathnames for internal links. This will make your site's structure much more flexible because you will be able to rename a page or move it around without changing all the other pages that link to it.
Linking a foobar. For example, suppose you have a page on your site describing a product called Foobar Plus. You don't want to spell out the complete pathname each time you link to that page, as it may be quite long (e.g., /products/personal/foobar_plus). Much more convenient would be using that page's unique (within your site) and easy-to-remember identifier. Since you don't, in all probability, have another Foobar Plus on your web site, it is natural to use an abbreviated name of the product as the identifier:
Check out our new <int link="fb+">Foobar Plus</int>!
The correspondence between web pages and their identifiers
is to be set in the master document.
To make life even easier for site maintainers, you can allow them
to use any of a number of aliases referring to the same
page. For example, the Foobar Plus page might just as well be linked
to as fb+, foobar+, or
foobar-plus. All you need to do is register all such aliases
in the master document.
Linking translations. In multilingual sites, a special kind of link that must be present on every page is the link(s) to the other language version(s) of the same page. The absolute minimum of information needed to construct such a link is, obviously, the identifier of the language we are linking to. Thus, if we write on the Foobar Plus page
<lang link="de">This page in German</lang>
then the stylesheet will use the current page's pathname to construct the proper HTML link - for example,
<a href="/products/personal/foobar_plus.de.html">This page in German</a>
or
<a href="/products/personal/foobar_plus.html?lang=de">This page in German</a>
or any other variant, depending on your web site setup. Once again, the correspondence between languages and link URIs is deduced from the master document's data.
The majority of static images, Java applets, and Flash animations on web pages are not independent objects. Most often, they are components of higher-level content constructs. An image may be a visual accompanying a section heading, a background of a table or the entire page, or a navigation button that is part of a larger navigation system.
In all these cases, your source XML will not contain any image references at all: It is the stylesheet's responsibility to know what images to use with what content structures, where to take these images, and how to format them. Much less frequently, usually within text flow, you might need to display an image for its own sake - such as a photo, a technical illustration, or a map. It's only these standalone objects that you'll have to specify explicitly in the semantic XML source of a page.
This section covers both static images and various embedded objects such as Java applets, ActiveX controls, and Flash animations. All of these are similar from the viewpoint of XML source markup; below we talk mostly about images, but you should keep in mind that the same applies to most non-HTML external objects used on web pages.
Element type names.
The name of the element type for including standalone images in your
documents may be either generic (e.g., image) or specific (e.g., photo).
Images as attributes. An image object may be quite complex, with additional components, such as a photo caption or credit, stored in attributes or child elements. However, quite often all you need to specify is a source location or an identifier for an image that is an attribute of some other object rather than a standalone object in its own right. For the image types that can be used this way, you can use an attribute of the same name as the standalone image's element type. For example, if your sections may feature a photo next to the section's heading, it is more convenient to write
<section image="location"> <head>Section heading</head> ... </section>
than to write
<section> <image src="location"/> <head>Section heading</head> ... </section>
even though your stylesheet may be programmed to create identical formatting for these two inputs.
Just as a link's main attribute is the destination address, an image element must, before all, specify the location of the image resource. And, just as we used abbreviated addresses in links, it is natural to use mnemonic identifiers instead of complete image locations. Referring to an image by a short identifier rather than by its complete location
makes your XML source more readable, easier to edit manually, and less prone to errors.
In the simplest case, an abbreviated image reference can be made
from its filename by removing the path and extension (which is
supposed to remain constant for all images). In more complex cases,
an abbreviation might be composed of several parts expressed as
attributes, such as a date or a classifier. Finally, your master
document could simply store a list of all image locations associated
with arbitrary identifiers and possibly aliases.
Abbreviating aggressively. Along with stripping directory and extension, filename-based abbreviations can be made even more convenient by programming the stylesheet to perform case folding (converting everything to lower- or uppercase) and to remove all whitespace and punctuation. With these provisions, to reference an image stored at img/maps/nymap.png, we could use any of nymap, ny map, N.Y. Map, and so on.
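The folding itself takes one line of XPath 2.0. In the sketch below, the function name my:image-location, its namespace URI, and the fixed img/maps/ directory and .png extension are assumptions made for the sake of the example; a real stylesheet would take the directory and extension from wherever your site keeps them.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my-functions"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper: turns a loosely typed image reference
       into a file location. -->
  <xsl:function name="my:image-location" as="xs:string">
    <xsl:param name="ref" as="xs:string"/>
    <!-- Case folding, then removal of everything that is not
         a lowercase letter or a digit. -->
    <xsl:variable name="key" as="xs:string"
        select="replace(lower-case($ref), '[^a-z0-9]', '')"/>
    <!-- Assumed: all such images live in img/maps/ as PNG files. -->
    <xsl:sequence select="concat('img/maps/', $key, '.png')"/>
  </xsl:function>

</xsl:stylesheet>
```

Here my:image-location('N.Y. Map') and my:image-location('ny map') both yield img/maps/nymap.png. Note that this folding also deletes hyphens and underscores, so the actual filenames must not rely on them.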
The goal of using abbreviations is to have your image references named intuitively and consistently and to provide just enough information in XML for the stylesheet to be able to reconstruct the complete pathname or URI.
Standalone images may be particularly difficult to separate into
independent aspects of content and formatting. The idea of
specifying an image identifier and possibly its role in the XML
source and then letting the stylesheet figure out all the formatting
parameters is attractive, but the reality may be not so
neat. Sometimes, you'll have no choice but to add ugly formatting
clues to the XML source to get the result you want.
An example is a layout where several images are placed on a page, interspersed with text, and aligned alternately against the left or right margin. It is natural to have the stylesheet do the alternating alignment so that only the image identifiers need to be supplied in the source. However, sometimes you may want to force a particular image to a particular margin in the middle of a page. Adding align="right" to your XML source is hardly semantic but may be unavoidable if, for example, a left-aligned image visually conflicts with a nearby left-aligned heading.
Think ahead. It is much easier to prevent a disease than to cure it. Thus, it is preferable to design your page layout in such a way that it can be created strictly automatically based on nothing but the semantic XML source. Avoid situations where only manual interaction can produce acceptable formatting.
For example, if you plan to use alternating alignment of images, you could either use centered headings (which will not conflict with either image alignment) or mandate that any image be at least one paragraph away from the nearest heading (this restriction is easy to enforce automatically using Schematron).
Separate namespaces. However, there are situations where adding manual formatting hints to your XML source cannot be avoided. This may happen not only with images, although they are a frequent source of problems. It is advisable to use a separate namespace for all hints that pertain to the same output format (e.g., HTML):
<page xmlns:forhtml="http://www.kirsanov.com/formatting-hints-html"> <p forhtml:column-break="true"> ... <image src="solid wood table" forhtml:align="right"/> ... </p> </page>
Here, a hint is added to the p element specifying that this paragraph must start a new column in a multicolumn layout (assuming the stylesheet cannot figure this out automatically). Another hint floats an image within that paragraph to the right margin.
Now, if you want to render the same XML source into a different format, such as PDF, the new stylesheet will have no problems ignoring anything from the "for HTML" namespace. It is also very easy to strip all HTML formatting hints to produce a purely semantic version of the source. You can store several sets of formatting hints in the same source documents, each in its own namespace, and have the stylesheet select the set corresponding to the current output format (such as "HTML with columns," "HTML without columns," "printable HTML," "PDF," etc.).
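Stripping the hints, for instance, needs nothing more than an identity transformation with one extra rule. The following is a minimal sketch that assumes hints are always attributes in the namespace used in the example above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:forhtml="http://www.kirsanov.com/formatting-hints-html">

  <!-- Identity rule: copy every node as is. copy-namespaces="no" keeps
       the now unused forhtml declaration out of the result. -->
  <xsl:template match="@* | node()">
    <xsl:copy copy-namespaces="no">
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Exception: drop any attribute from the hint namespace. -->
  <xsl:template match="@forhtml:*"/>

</xsl:stylesheet>
```

The empty template wins over the identity rule because a namespace-specific pattern has a higher default priority than @*, so no explicit priority attribute is needed.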
HTML documents often use the height and
width attributes in img elements as
spatial hints to speed up rendering of the page in a browser. You
don't need to supply these values in XML; a stylesheet can find
out the dimensions of all referenced images itself.
Besides the location (full or abbreviated) and possibly formatting hints, an image element may contain various other information.
Textual descriptions. The XHTML specification requires that each image be provided with a piece of text describing what the image is. Traditionally, the alt attribute of an img element has been used for short descriptions, but in HTML 4.01 and XHTML the longdesc ("long description") attribute was added to complement alt. Normally, an image description should contain:
nothing (empty string) for purely decorative images (such as components of frames, backgrounds, and separators);
the text visible on the image for images that display text (thus, the alt of a graphic button must contain exactly the button's label and nothing else);
a short description of the image's role or content for meaningful images (e.g., John's photo).
It's only in the last of the above cases that the image
description may need to be supplied in the XML source, preferably in
the content of an image element.
Captions. Often, a standalone image must be accompanied by a visible descriptive piece of text (as opposed to alt descriptions that are normally not shown by graphic browsers). This may be a caption, a photo credit, a copyright notice, or anything else that semantically belongs to this image.
Since this content may need further inline markup, it is
better to store it in children of your image element rather than in attributes:
<photo src="sight"> <caption>A rare sight.</caption> <credit>Dmitry Kirsanov</credit> </photo>
Upon encountering a photo element, the stylesheet
would expect to format its caption child element as a photo
caption and the credit child element, if present, as
credit (e.g., separately from the caption, in a smaller font size, and
with "Photo by" prepended to the name).
A simple linked image can be created by adding linking attributes to an image element.
The quick-and-dirty approach. It is natural to reuse the generic link element type for specifying multiple links inside an imagemap, by placing link elements in the image and adding coordinate attributes to define the linked area:
<image src="chart 3"> <link link="address1" shape="rect" x1="0" y1="0" x2="100" y2="20"/> <link link="address2" shape="circle" x="50" y="50" radius="5"/> </image>
In HTML, all coordinates for an imagemap area are cramped into one comma-separated attribute value string. You don't need to reproduce that in your XML - instead, you can specify one value per attribute and use descriptive attribute names. It's a good idea to use your schema to check that the set of coordinate attributes in each link element corresponds to the value of shape.
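Packing the separate attributes back into an HTML coords string is a job for string-join(). The templates below are a sketch for the two shapes used in the example; resolving the abbreviated link address, generating the required alt text, and emitting the enclosing map element are deliberately left out.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Rectangle: HTML expects "left,top,right,bottom". -->
  <xsl:template match="image/link[@shape = 'rect']">
    <area shape="rect" href="{@link}"
          coords="{string-join((@x1, @y1, @x2, @y2), ',')}"/>
  </xsl:template>

  <!-- Circle: HTML expects "center-x,center-y,radius". -->
  <xsl:template match="image/link[@shape = 'circle']">
    <area shape="circle" href="{@link}"
          coords="{string-join((@x, @y, @radius), ',')}"/>
  </xsl:template>

</xsl:stylesheet>
```

For the two links in the example, these rules produce coords="0,0,100,20" and coords="50,50,5".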
The thoroughly semantic approach. The syntax shown above may work for an occasional imagemap, but it is still not semantic enough and needs to be improved if you routinely use imagemaps (or other interactive objects). Namely, do the pixel values in the link attributes really belong in the source? Probably not, as they are closely bound to the image's "presentation" and tell us nothing about its "content." A better approach is to use each link element to associate the identifier of an image area with a link address - for example,
<image src="chart 3"> <link link="address1" area="block1"/> <link link="address2" area="central-blob"/> </image>
The correspondence between the area identifiers (block1 and central-blob in this example) and the actual pixel coordinates may be stored in the site's master document. If, however, you want an imagemap to be truly orthogonal to everything else on the site and easily portable to other sites, consider creating a separate XML document for each imagemap storing its active areas and their identifiers.
Accessibility. Interactive objects such as Java applets and Flash movies may also incorporate multiple links (one example is an animated Flash menu). Even though you don't have to specify these links in the HTML code embedding the object, it still makes sense to list them in the XML source of a page so that the stylesheet can construct an alternative access mechanism for those users who cannot (or don't want to) peruse this interactive object.
Tables are perhaps the most abused feature of HTML, with the vast
majority of tables on web pages being used for layout purposes, not
for presenting inherently tabular data. If (like
most web designers) you are going to use HTML tables for web page
layout, you cannot reflect that in the semantic XML source of a page
in any way. It's only the stylesheet that needs to be concerned with layout.
Sometimes, however, you may have some genuinely tabular data that you want to format into some sort of a table on a web page. Still, this does not mean that you have to think in terms of rows and columns when creating a semantic source for such a table.
If you have something you can name, do it. For example, consider a sales data table listing sales figures for several products across several years. The XML way of marking up this data would be to forget that you're working on a table and simply list all available data in an appropriately constructed element tree:
<sales-table>
  <product>
    <name>Foobar</name>
    <sold><year>1999</year><number>123</number></sold>
    <sold><year>2000</year><number>140</number></sold>
    <sold><year>2001</year><number>142</number></sold>
  </product>
  <product>
    <name>Barfoo</name>
    <sold><year>1998</year><number>89</number></sold>
    <sold><year>1999</year><number>14</number></sold>
  </product>
</sales-table>
This approach frees you from worrying about column alignment, sort order, or empty cells - just dump all your data and you're done. All the rest will be performed automatically by the stylesheet: It can filter out a subset of the provided data, group values in rows and columns, sort them, and fill in "N/A" for missing values. Thus, the above example might come out as follows:
         1998   1999   2000   2001
Barfoo   89     14     N/A    N/A
Foobar   N/A    123    140    142
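A stylesheet producing such a table from the sales-table source could be sketched as follows. The HTML layout (header cells for years and product names, no styling) is an assumption; the element names are those of the source example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="html"/>

  <xsl:template match="sales-table">
    <!-- Every year mentioned anywhere, taken once, in ascending order. -->
    <xsl:variable name="years" as="xs:integer*">
      <xsl:perform-sort
          select="distinct-values(product/sold/year/xs:integer(.))">
        <xsl:sort select="."/>
      </xsl:perform-sort>
    </xsl:variable>
    <table>
      <!-- Header row: an empty corner cell, then one cell per year. -->
      <tr>
        <th/>
        <xsl:for-each select="$years">
          <th><xsl:value-of select="."/></th>
        </xsl:for-each>
      </tr>
      <!-- One row per product, sorted by name. -->
      <xsl:for-each select="product">
        <xsl:sort select="name"/>
        <xsl:variable name="product" select="."/>
        <tr>
          <th><xsl:value-of select="name"/></th>
          <xsl:for-each select="$years">
            <xsl:variable name="year" select="."/>
            <xsl:variable name="cell"
                select="$product/sold[xs:integer(year) = $year]/number"/>
            <!-- Fill in N/A where the product has no figure for the year. -->
            <td><xsl:value-of select="if ($cell) then $cell else 'N/A'"/></td>
          </xsl:for-each>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

Because the list of columns is computed from the data, adding a 2002 figure to any product is enough to make a new column appear in every row.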
Tables from triplets. In some cases, such a data-centric approach may also make your source significantly more compact than the table rendition. Thus, a sparse table with mostly empty cells can be represented in the source by triplets consisting of a row name, a column name, and the corresponding value at their intersection. Since such a source does not contain separate lists of all columns and rows, the stylesheet will compile them from the triplets.
Is it worth it? Granted, for an occasional table or two, this may be too much work: You'll have to program your stylesheet to recognize various element types and perform various operations (such as normalizing dates) that may be necessary for your tabular data. For simple isolated tables, you may be better off more or less directly reproducing in XML the structure of the target HTML table. However, if you have a lot of simple tables (or a few complex ones) with similar data, or if your tables are updated often, the benefits of the semantic data-centric approach may easily outweigh the simplicity of the straightforward HTML imitation.
Also, the tabular data on your web site is likely to be coming from some external source, such as a database or a spreadsheet. When you write the code to update your tables automatically, it is usually much easier to first transform the external data into a semantic XML tree and then let the stylesheet do table layout.
Interactive elements in HTML are grouped into forms. Simple forms
such as site search or email newsletter subscription are often used
on many pages of the site, and your XML does not need to detail the
structure of these forms. Instead, in your source you can treat such
a form as an indivisible entity - for example, as a special type of
orthogonal block.
Sometimes, even this is not required. For example, if all pages on your site contain a search field in the page footer, you don't need to mention it in the XML source at all. Your stylesheet will simply add this form to every page it produces, just as it adds all other page components that remain the same from page to page.
What if you need to build something more complex, such as a shipping address input form or a survey form? In these cases you'll need to create an appropriate element type for each variety of the form's input controls (such as text fields, radio buttons, and drop-down lists) as well as for any higher level semantic constructs within the form. This work can be made much easier by reusing some of the existing form vocabularies.
Existing vocabularies. An obvious choice for the existing vocabulary from which you could borrow form-related markup is XHTML, especially if it is your target vocabulary. The Forms module, available starting from XHTML 1.1, may be a good first approximation. It covers many widget types and allows for proper logical structuring of your form.
However, in many cases the XHTML form markup may be too presentation-oriented to be useful for your semantic XML - or simply too awkward. This is mostly due to the historic baggage of older HTML versions. Modern HTML and XHTML had to pile their logical markup provisions on top of the old - limited and inflexible - form components.
For instance, in XHTML you have to write
<label for="firstname">First name:</label> <input type="text" id="firstname"/>
instead of the more natural
<textfield id="firstname"> <label>First name:</label> </textfield>
HTML 4.0 had to define a separate
label element that is linked to its input
by a for attribute simply because it had to stay
compatible with older HTML versions that did not allow any children
in input elements.
In your source definition, you are free from these concerns and
can therefore mark up your forms in a more logical and
readable way. It is also important that your own markup may be better
integrated with other parts of the system; for example, you could
use abbreviated addresses in it.
Another existing vocabulary worth looking at is
XForms, recently developed by the W3C.
Formatting hints. Form presentation is a difficult task. Even with full manual control, it's not always easy to lay out a form so that it looks perfect and remains usable for any data that may be filled into it. Even more difficult is to automate form layout, enabling the stylesheet to consistently build good-looking form pages from the semantic description of the forms' structure. To add insult to injury, different browsers on different platforms often render form controls in wildly different ways.
The key is keeping the layout simple and flexible. Don't strive for precise placement or alignment of controls, as this is impossible to achieve given the vastly different font and screen sizes in browsers. (Also, do not tie the position of other parts of the page to the size or placement of a form - this often results in a broken page layout.) Take advantage of the form structure described in the source by separating groups of form controls into independent layout blocks.
All that said, adding formatting hints to your form markup may sometimes be unavoidable.
In the previous chapter (
You may find the sample master document described here (see
A master document represents a new document type, with
its root element type different from that of a page document, and
most other element types usable only in a master document. However,
if you don't use DTDs (
The role of the master document is that of a hub that all other documents refer to when they need to figure out a wider context of the web site or establish mutual links. Whenever the stylesheet needs some information that is not supplied by the currently processed document, it will consult the master document to find either that information or a link to it.
Therefore, the most important part of a master document is the
site directory - a collection of information about all pages of
the site and their organization. This directory is used for building
the site's navigation as well as for resolving abbreviated internal links.
Besides pages, other components of the site may also be mentioned
in the master document, such as all Flash animations you have or all
images of a specific kind used on the site. Units of orthogonal
content must be listed in the master document as well.
A flat list of all pages is not sufficient for building a usable site. We also need to represent the structure of the site's menu and the correspondence between menu items and pages.
A simple site's menu may be little more than a linear list of links to each of its pages. However, most sites require more complex menu structures. Common are hierarchical menus where some of the top-level items encompass multiple subpages and/or nested submenus. Such a structure is straightforward to express in XML.
Some sites may have more than one menu. For example, there may be a menu of topics (content sections) and another independent menu of tools (pages that help navigate the site, such as search and site map). Such orthogonal menu hierarchies can be stored in independent XML subtrees within the master document.
What do we need to store in the master document for each menu
item? To build a clickable menu element, we must know at least its
label (the visible text displayed in the menu) and the page that
it is linked to. A label may contain inline markup and should
therefore be stored in a child element. As for the link, it is
natural to use the general linking attributes with abbreviated
addresses that we've developed for in-flow links on site pages.
Items vs. pages. A menu item is not the same as a page of the site. Some pages may not be available through the menu, while others may be linked from more than one menu item. Therefore, the page itself must be represented by a separate element that the menu item element will link to.
However, that does not mean that these page elements must be
stored in a different part of the master document. You can still
categorize all your pages under the branches of the menu tree: Even
if a page is not linked from the menu, usually you can find a branch
where it logically belongs (unless it is orthogonal content).
Here's a possible representation of a menu item:
<item link="products">
  <label>Products</label>
  <page id="products" title="Our products" src="products/"/>
  <page id="software" title="Our software" src="products/software/"/>
  <page id="hardware" title="Our hardware" src="products/hardware"/>
</item>
In addition to a label and one or more
pages, an item may also contain other
item children. A complete menu description would thus
consist of a hierarchy of items
under one parent, e.g. menu. Note that in each
page element, the id attribute provides a
unique identifier of not only that element, but of the page
itself. It is these identifiers that are used as abbreviated
addresses in internal links.
How unabbreviation works. When resolving a link, the stylesheet translates the page identifier into the location of that page taken from the src attribute. However, that attribute's value is also somewhat "abbreviated" in that it omits irrelevant technical information such as the filename extension and the default filename (usually index.html) in a directory. These omitted parts are easy to restore by applying simple rules, so the three page elements in the above example would yield these page locations:
/products/index.html
/products/software/index.html
/products/hardware.html
Note that a location ending with a "/" is considered
a directory and has index.html appended to it, while any other location simply gets the .html extension.
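The two steps of resolving an internal link, looking up the page by its identifier and then expanding its src value, might be sketched like this. The master document's filename (master.xml, next to the stylesheet), the function name my:page-url, and its namespace URI are assumptions; aliases are not handled, and an unknown identifier simply makes the transformation fail with a type error.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my-functions"
    exclude-result-prefixes="xs my">

  <!-- Assumed location of the master document. -->
  <xsl:variable name="master" select="doc('master.xml')"/>

  <!-- Index of all page elements by their identifiers. -->
  <xsl:key name="page-by-id" match="page" use="@id"/>

  <!-- Hypothetical helper: expands an abbreviated src value
       into a complete page location. -->
  <xsl:function name="my:page-url" as="xs:string">
    <xsl:param name="src" as="xs:string"/>
    <xsl:sequence select="
      if (ends-with($src, '/'))
      then concat('/', $src, 'index.html')
      else concat('/', $src, '.html')"/>
  </xsl:function>

  <!-- Internal in-flow links: identifier, then src, then location. -->
  <xsl:template match="int">
    <xsl:variable name="page"
        select="key('page-by-id', @link, $master)"/>
    <a href="{my:page-url($page/@src)}">
      <xsl:apply-templates/>
    </a>
  </xsl:template>

</xsl:stylesheet>
```

For the three page elements above, my:page-url returns /products/index.html, /products/software/index.html, and /products/hardware.html.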
Accessing the source.
There is one more reason to store page pathnames without
extensions. When locations are resolved for the purpose of accessing
the source XML documents rather than creating an HTML link, the
same src values are transformed into *.xml
file locations (assuming the directory structure of the site source
is similar to that of the transformed site).
Storing page metadata. Sometimes, a more complex layout for the page elements may be necessary. For example, if your bilingual site provides two language versions of each page, a page element could hold both metadata that is common to all language versions of the page (e.g., the page's identifier and source location) and language-specific metadata (e.g., title):
<page id="software" src="products/software/"> <translation lang="en">Our software</translation> <translation lang="fr">Nos logiciels</translation> </page>
Some of the metadata (
Along with all pages, a master document should also list all the
units of orthogonal content that your site will use:
<blocks> <block id="news" src="news/latest"/> <block id="subscribe" src="scripts/subscribe"/> <block id="donate" src="scripts/donate"/> </blocks>
Now if the stylesheet processing a page document encounters a block that has no content of its own but references some orthogonal content unit - for example, by specifying idref="news" - the document at news/latest.xml will be retrieved and inserted into the current document, formatted as appropriate for an orthogonal content block.
It is important that the id and
src attributes of a master document's block
element have the same names and semantics as the attributes of
page elements (
Extracting orthogonal content.
In the last example, each orthogonal block was stored in its own file - but
this is not always the best approach. You may want to reuse parts of
regular pages as orthogonal
For instance, the news page of a site is often a list of news items in reverse chronological order. You may want to automatically extract the most recent news item and display it in an orthogonal content block on other pages of the site. Another example is a "featured product" blurb extracted from that product's own page and reused on the front page of the site.
For these situations, what we need is a way to specify what part of the original page document is to be reused as orthogonal content on other pages. Since this part will most likely also be a block, we only need to indicate the id of the block we are interested in. Thus, if the most recent news block on the news page always has id="last", we could write in the master document:
<block id="last-news" src="news/" select="last"/>
Now any page can place a copy of the latest news item by referencing the corresponding orthogonal block by its identifier, last-news. For example, your page document might contain
Likewise, the featured product blurb could be extracted from the block with id="blurb" on that product's page:
<block id="feature" src="products/foobar" select="blurb"/>
Here, the featured product is identified by the path to the
corresponding document (products/foobar.xml). When you
want to feature a different product, all you need to do is change
this value so it points to another product's page (assuming each
product page has exactly one block with
id="blurb"; see also
will (after you rerun the transformation) display the blurb for the
Logically, without the select attribute, a master
document's block will reference the entire content of
the document pointed to by the src attribute. Your
Schematron schema could also check that the referenced elements
actually exist in the referenced documents (see
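Both cases, with and without select, can be covered by one function that returns the nodes to be reused. This is a sketch: the function name my:block-content is hypothetical, and the code assumes that a src ending with a slash corresponds to index.xml in that directory and that source paths are relative to the master document.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper: takes a block definition from the master
       document and returns the content it refers to. -->
  <xsl:function name="my:block-content" as="node()*">
    <xsl:param name="def" as="element(block)"/>
    <!-- Assumption: a directory src maps to index.xml. -->
    <xsl:variable name="uri" select="
      if (ends-with($def/@src, '/'))
        then concat($def/@src, 'index.xml')
        else concat($def/@src, '.xml')"/>
    <!-- Assumption: the URI is relative to the master document. -->
    <xsl:variable name="doc" select="document($uri, $def)"/>
    <xsl:sequence select="
      if ($def/@select)
        (: only the block with the requested id :)
        then $doc//block[@id = $def/@select]
        (: no selector: the entire document content :)
        else $doc/*"/>
  </xsl:function>

</xsl:stylesheet>
```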
No perfection in this world. It would be even more natural to use XPath expressions for extracting orthogonal blocks. Then we could use not only the id attribute value but any XPath test for identifying the block we need. For instance, for the first block on the page, we would write
<block id="news" src="news/" xpath="//block"/>
Selecting the last block that has a section inside would be as simple as
<block id="lastsection" src="dir/page" xpath="//block[section][last()]"/>
There's only one problem with this kind of selector: In XSLT, you can't take a string and treat it as an XPath expression - and from the XSLT processor's viewpoint, whatever the master document (or any other document) stores in its attributes is always just a string.
Saxon offers the saxon:evaluate() extension function
Recall our discussion of dynamic sites in
One way of many.
There exist different ways to aggregate dynamic content and static
templates. Some of them come before XSLT transformation, which is
usually the last stage in a dynamic XML web site workflow; in these
cases, you don't need any special source markup because your
stylesheet will get complete seamless page source with both static
and dynamic content. However, in some situations (notably offline
Reusing blocks. An orthogonal content block that the stylesheet extracts from another document may be considered a special case of a composite dynamic value. Therefore, it makes sense to extend our blocks' markup constructs so that they cover the "truly dynamic" content as well - content that is calculated or compiled by some external process and not just stored in a static document.
We can define a number of block conventions that will allow us to use blocks not only for enveloping independent bits of content but also as links to external sources of information. Once again, our guiding principle is: Let the page author use short mnemonic identifiers and hide all the gory details of accessing data in the master document and/or stylesheet.
Calling a process. Suppose we want to build a site map page that automatically compiles a hierarchical list of all pages of the site. The first thing we need is the static part of that page - a document that stores all the static bits unique to the page, such as an introductory paragraph and heading(s). This is a normal page that is listed in the menu hierarchy in the master, just like any other page.
Wherever we want to insert our dynamic content into that static frame, we place a block reference, e.g.:
In the master, however, we cannot associate the sitemap identifier with any source file, since no such file exists - the list of pages is generated dynamically.
Instead, we must associate our dynamic block identifier
(sitemap) with an identifier of some abstract
process that generates its data. You can think of a process
as a kind of script or application; it may accept some parameters
that affect its output. Thus, if we write in the
<block-process id="sitemap" process="sitemap" mode="text" depth="2"/>
then the stylesheet will know that a sitemap block
needs to be filled in with data generated by the
sitemap process with parameters
mode="text" and depth="2". This process
can be, for example, a callable template within the stylesheet
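As a rough sketch of such dispatching, the template below looks up the block-process definition and calls a named template with the stored parameters. The master document name (master.xml), the interpretation of depth as the maximum nesting level of menu items, and the pages wrapper element are assumptions made for this illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumption: the master document is called master.xml. -->
  <xsl:variable name="master" select="document('master.xml')"/>

  <xsl:template match="block[@idref]">
    <xsl:variable name="proc"
        select="$master//block-process[@id = current()/@idref]"/>
    <xsl:choose>
      <xsl:when test="$proc/@process = 'sitemap'">
        <xsl:call-template name="sitemap">
          <xsl:with-param name="mode" select="string($proc/@mode)"/>
          <!-- Fall back to 1 if no depth is specified. -->
          <xsl:with-param name="depth"
              select="xs:integer(($proc/@depth, 1)[1])"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- Static blocks and other processes would be handled here. -->
        <xsl:message>Unhandled block: <xsl:value-of
            select="@idref"/></xsl:message>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Hypothetical sitemap process: copies the page elements nested
       in no more than $depth levels of menu items. The mode value
       is merely passed along in an attribute. -->
  <xsl:template name="sitemap">
    <xsl:param name="mode" as="xs:string" select="'text'"/>
    <xsl:param name="depth" as="xs:integer" select="1"/>
    <pages mode="{$mode}">
      <xsl:copy-of
          select="$master//menu//page[count(ancestor::item) le $depth]"/>
    </pages>
  </xsl:template>

</xsl:stylesheet>
```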
Watching a directory.
A stylesheet can access external files even if the list of these
files is changing dynamically. For example, an external process
(which may or may not be another stylesheet) might be dropping its
output XML documents into a directory. Your stylesheet would then
read the list of files in that directory (
If, for example, your stylesheet implements a list-titles process that takes a directory as a parameter and returns the list of title elements from all XML documents in the directory, then you could define a block to perform this operation on all (dynamically updated) documents in the news directory by writing in the master document
<block-process id="news-list" process="list-titles" dir="news/"/>
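One possible way to implement a list-titles process in XSLT 2.0 is the collection() function. Note that the form of its argument is implementation-defined; the ?select=*.xml suffix used in this sketch is a Saxon convention for reading a directory, so treat it as an assumption about your processor. The titles wrapper element is hypothetical as well.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical list-titles process: returns the title elements
       of all XML documents found in the given directory. -->
  <xsl:template name="list-titles">
    <xsl:param name="dir" as="xs:string"/>
    <titles>
      <!-- Saxon-specific assumption: a directory URI followed by
           ?select=*.xml yields all matching files as a collection. -->
      <xsl:copy-of
          select="collection(concat($dir, '?select=*.xml'))/*/title"/>
    </titles>
  </xsl:template>

</xsl:stylesheet>
```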
In a page document that wants to use this list, you would then
XML, not HTML.
Note that processes similar to sitemap or
list-titles should only aggregate content, not format
it. This means that the corresponding templates or functions in your
stylesheet must produce valid XML data (nodesets), not HTML
renditions. You would then feed these nodesets to the regular
formatting templates in the same stylesheet (see
On a typical web site, all pages contain bits of information that either remain the same or change predictably from page to page. Some of this repeating data, such as the company logo or tag line, actually belongs to the domain of presentation rather than content and therefore needs to be filled in by the stylesheet rather than stored in the source. Other components, such as webmaster email links, "designed by" signatures, copyright or legal notices, etc., are natural to store in the master document.
It is recommended that you envelop all such bits of content in one or more umbrella elements, each containing data with similar roles or positions on the pages. Here's a master document fragment defining the footer to be placed at the bottom of each page:
<page-footer>
  <designed-by>Site design: <ext link="www.kirsanov.com">Dmitry Kirsanov Studio</ext></designed-by>
  <legal linktype="internal" link="legal">Legal notices</legal>
  <contact linktype="internal" link="contact">Contact us</contact>
</page-footer>
Note that the elements inside page-footer
may have mixed content with any of the text markup, linking, or
other elements that were developed for page documents. In
particular, we see internal and external links used in this example,
each with its own address abbreviation scheme (
The page-footer parent element makes the stylesheet simpler and more bullet-proof: Instead of providing templates for each of the individual footer elements, you can program the stylesheet to process all items within a page-footer in turn, and only provide separate templates for those that differ from others in formatting. With this approach, you'll be able to add a new element type for a new footer object even without changing the stylesheet.
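Here is a sketch of this arrangement: one generic template for any child of page-footer, plus an override for a single element type. The HTML structure and class names are hypothetical. Note the explicit priority on the override: both match patterns have the same default priority, so without it the processor would see a conflict.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Process all footer items in turn. -->
  <xsl:template match="page-footer">
    <div class="footer">
      <xsl:apply-templates select="*"/>
    </div>
  </xsl:template>

  <!-- Generic rule: any footer item, including element types
       added later, becomes a paragraph. -->
  <xsl:template match="page-footer/*">
    <p class="{local-name()}">
      <xsl:apply-templates/>
    </p>
  </xsl:template>

  <!-- Separate rule only for an item whose formatting differs. -->
  <xsl:template match="page-footer/designed-by" priority="1">
    <p class="credit">
      <small><xsl:apply-templates/></small>
    </p>
  </xsl:template>

</xsl:stylesheet>
```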
Similarly, we can create an envelope for storing metadata that
applies to the entire site. Examples of such metadata include
site-wide keyword lists (which could be merged with page-specific
keywords supplied by the page documents,
Your stylesheet will need to know some parameters of the
environment in which it is run as well as the environment where its
HTML output will be placed. The most frequently required processing
parameter is the base URI that the stylesheet will prepend to all the image and
link pathnames. By changing this parameter, you can turn all
internal link URIs from relative to absolute with an arbitrary base,
which is useful for testing the site in different environments.
Other parameters may provide the path to the source tree
and the operating system under which the stylesheet is run (which,
in turn, may affect the syntax
Grouping parameters into environments. The same set of source files often needs to be processed on different computers - for example, on a developer's personal system, then in a temporary (staging) location on the server, and finally in the publicly accessible area on the target server. Each of these environments requires its own set of processing parameters. It is therefore convenient to define several groups of parameter values, one for each environment, and to select one group by its identifier when running the transformation.
Where to store the environment groups?
Obviously, the need to group parameters and assign a unique
identifier to each group makes using XML very convenient - as
opposed to, say, storing the values within scripts used to run the
site build process (
Your stylesheet is more likely to be shared (in whole or in part) among different projects, so it is not wise to use it for storing information that is too project-specific. Also, even though you can use XSLT variables for storing processing parameters, it is more convenient to use custom element hierarchies for structuring and accessing this data. For these reasons, the master document emerges as the most natural storage for processing parameters.
This does not mean that your master document will differ among environments. Instead, all identical copies of it will have information on all environments, and each environment will extract the relevant set of data by passing a parameter to the stylesheet.
Here's an example group of parameters that define the processing
<environment id="staging">
  <os>Linux</os>
  <src-path>/var/website/src/</src-path>
  <out-path>/var/website/out/</out-path>
  <target-path>/test/</target-path>
  <img-path>img</img-path>
</environment>
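Selecting one of these groups in the stylesheet could look like the following sketch. The parameter name env, the master document name (master.xml), and the way the image URI is composed from target-path and img-path are assumptions made for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumption: the environment identifier is passed in
       when the transformation is run. -->
  <xsl:param name="env" as="xs:string" select="'staging'"/>

  <!-- Assumption: the master document is called master.xml. -->
  <xsl:variable name="master" select="document('master.xml')"/>

  <!-- The selected group of processing parameters. -->
  <xsl:variable name="environment"
      select="$master//environment[@id = $env]"/>

  <!-- Base URI to prepend to link pathnames. -->
  <xsl:variable name="target-path"
      select="string($environment/target-path)"/>

  <!-- Assumption: image URIs consist of the target path
       followed by the image directory. -->
  <xsl:variable name="img-uri"
      select="concat($target-path, $environment/img-path, '/')"/>

  <xsl:template match="/">
    <!-- Stop early if the requested environment is not defined. -->
    <xsl:if test="empty($environment)">
      <xsl:message terminate="yes">Unknown environment: <xsl:value-of
          select="$env"/></xsl:message>
    </xsl:if>
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

For the staging group above, $img-uri evaluates to /test/img/.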
Normally, formatting of web pages is created by the stylesheet. Sometimes, however, formatting is dependent on certain parameters that, being more content than style, belong in the site's source and not in the stylesheet. Also, sometimes the stylesheet may need to create objects that are used on many pages but do not belong to any one page in particular. In both these situations, the master document is a convenient place to store data.
An example of such an object is a pair of graphic buttons -
"next" and "prev" - used on sequential pages (such as chapters in an
online book). If your stylesheet generates other graphic buttons on
the site (
These buttons are not specific to any particular page; moreover, pages that use them don't even need to mention the buttons in the source because the stylesheet can automatically create the page sequence, including appropriate navigation. All we need is to store the button labels somewhere so the stylesheet can generate the buttons. It makes sense to use the master document for this.
You can store the button labels in a separate element in the master and program the stylesheet to regenerate the buttons when run with the corresponding parameter. For example,
<buttons>
  <button>prev</button>
  <button>next</button>
</buttons>
This section presents examples of complete documents that bring together everything we've discussed in the last two chapters (and more). The content is fictitious, but the structure and markup are from real web site projects (somewhat abridged for readability).
Compared to the master document example (
The main rule of thumb is: If you can move a bit of information away from a page document to the master or to the stylesheet, do that.
Example 3.1. en/team/index.xml
<?xml version="1.0" encoding="us-ascii"?>
<page keywords="team, people, staff, competences, skills">
  <title>Our team</title>

  <!-- Main content block: -->
  <block type="body">
    <p>With backgrounds in technology and communications, FooBar's
    experienced management team has - you guessed it - <em>the right
    combination of skills for success</em>.</p>

    <section image="mike">
      <head>Mike M. Anager</head>
      <subhead>CEO</subhead>
      <p>CEO and Co-Founder, Mike leads FooBar towards bringing the
      vision of "personal foobar" to reality. He previously served as
      Chief Architect at <ext link="www.barfoo.com">BarFoo
      Corporation</ext>.</p>
    </section>

    <section image="ed">
      <head>Ed N. Gineer</head>
      <subhead>VP, Engineering</subhead>
      <p>Ed has over 30 years of foobar design experience under his
      belt. He has personally contributed to the most acclaimed of our
      <int link="solutions">products</int>, including the famous
      <int link="fbplus">Foobar Plus</int>.</p>
    </section>

    <section image="jack">
      <head>Jack J. Anitor</head>
      <subhead>Senior Janitor</subhead>
      <p>Jack's expert janitorial skills and experience have been
      critical in the success of FooBar.</p>
    </section>
  </block>

  <!-- Orthogonal content blocks: -->
  <block idref="subscribe"/>
  <block idref="feature"/>
</page>
Our example site is bilingual (English and
German), so all titles and labels are provided in two
languages, and the languages themselves are listed in a
languages element. We add an internal
For every installation where the site can be built, an
environment element with a unique id
supplies the following
src-path is the base directory of the XML source documents tree.
out-path is the directory where the output files
will be placed (used in batch mode,
img-path is where all the images (both static and generated) are stored. This path is relative to out-path.
target-path is the common part of all URIs used
in the resulting HTML files to refer to images or other pages of the
site. Thus, if you transform and view your pages locally at
out-path, then target-path may be the same
as out-path. If, however, you
On Windows, all absolute paths must be given in the file:/ URL format.17 This is the only standard and reliable way to represent an absolute pathname that includes a drive letter. In HTML, URLs with file:/ work for both links and image references in all browsers we tested. Other platforms may use absolute pathnames without the file:/.
Menu. The menu lists all the pages of the site. For each page, the src attribute contains the page's pathname (add .xml for source files or .html for output files) relative to the site's root directory.
Each page has an id attribute used to link to it. To make life easier, you can also provide a space-separated list of aliases in the alias attribute. In internal links to this page, you can use either its id or any of the aliases.
Each menu item has a label child storing the item's visible label. In the menu on a web page, each item is supposed to be linked to its first page child, so there's no need to specify a link in an item.
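The id-or-alias lookup for internal links can be sketched as one XSLT 2.0 function. The function name my:find-page and the master document name (master.xml) are hypothetical.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <!-- Assumption: the master document is called master.xml. -->
  <xsl:variable name="master" select="document('master.xml')"/>

  <!-- Hypothetical helper: finds the page element whose id or one
       of whose space-separated aliases equals the link value. -->
  <xsl:function name="my:find-page" as="element(page)?">
    <xsl:param name="link" as="xs:string"/>
    <xsl:sequence select="
      ($master//menu//page[@id = $link
         or tokenize(@alias, '\s+') = $link])[1]"/>
  </xsl:function>

</xsl:stylesheet>
```

With a page defined as id="home" alias="index front fp frontpage", both my:find-page('home') and my:find-page('fp') return the same page element.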
It is assumed that the English and German versions of the source files are named the same but stored in different directory trees under the root directory. The corresponding directories are named after the language designations defined in languages. So, for instance, the complete path to the German fbplus source page in the staging environment would be constructed as follows:
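A minimal sketch of that construction as an XSLT 2.0 function is given below. It assumes a master document named master.xml with environment and menu elements as described in this section; the function name my:source-path and the env and lang parameters are hypothetical.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <!-- Assumptions: environment and language are passed in as
       parameters; the master document is called master.xml. -->
  <xsl:param name="env" as="xs:string" select="'staging'"/>
  <xsl:param name="lang" as="xs:string" select="'de'"/>
  <xsl:variable name="master" select="document('master.xml')"/>

  <!-- Hypothetical helper: source base directory, then the language
       directory, then the page's src, then the .xml extension. -->
  <xsl:function name="my:source-path" as="xs:string">
    <xsl:param name="page-id" as="xs:string"/>
    <xsl:sequence select="
      concat($master//environment[@id = $env]/src-path,
             $lang, '/',
             $master//menu//page[@id = $page-id]/@src,
             '.xml')"/>
  </xsl:function>

</xsl:stylesheet>
```

For a staging environment whose src-path is /home/d/web/ and a page fbplus whose src is solutions/foobar_plus, my:source-path('fbplus') returns /home/d/web/de/solutions/foobar_plus.xml.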
The blocks element holds a list of orthogonal
content blocks with their identifiers (id), source
document locations (src), and block selectors
Finally, the master document lists the common part to be prepended to
page titles (
Note that the mailto links used in page-footer
represent a special link type (
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE site [
  <!ENTITY auml "ä">
  <!ENTITY ouml "ö">
  <!ENTITY uuml "ü">
]>
<site>

  <!-- Environments: -->
  <environment id="local">
    <os>Windows</os>
    <src-path>file:/C:/Work/Website/XML/</src-path>
    <out-path>file:/C:/Work/Website/Out/</out-path>
    <target-path>file:/C:/Work/Website/Out/</target-path>
    <img-path>Images</img-path>
  </environment>
  <environment id="staging">
    <os>Linux</os>
    <src-path>/home/d/web/</src-path>
    <out-path>/home/d/web/out/</out-path>
    <target-path>/</target-path>
    <img-path>img</img-path>
  </environment>
  <environment id="final">
    <os>BSD</os>
    <src-path>/var/tomcat/webapps/cocoon/foobar/</src-path>
    <out-path>/var/tomcat/webapps/cocoon/foobar/</out-path>
    <target-path>/cocoon/foobar/</target-path>
    <img-path>img</img-path>
  </environment>

  <!-- Languages: -->
  <languages>
    <lang>en</lang>
    <lang>de</lang>
  </languages>

  <!-- Menu: -->
  <menu>
    <item>
      <label>
        <translation lang="en">Home</translation>
        <translation lang="de">Home</translation>
      </label>
      <page id="home" alias="index front fp frontpage" src="index"/>
    </item>
    <item>
      <label>
        <translation lang="en">Solutions</translation>
        <translation lang="de">Lösungen</translation>
      </label>
      <page id="solutions" src="solutions/intro_solutions"/>
      <item>
        <label>
          <translation lang="en">Life</translation>
          <translation lang="de">Das Leben</translation>
        </label>
        <page id="life" src="solutions/life"/>
        <page id="fbplus" alias="foobar_plus fb+ foobar+"
              src="solutions/foobar_plus"/>
        <page id="fbminus" src="solutions/foobar_minus"/>
      </item>
      <page id="universe" src="solutions/universe"/>
      <page id="everything" src="solutions/everything"/>
    </item>
    <item>
      <label>
        <translation lang="en">Our team</translation>
        <translation lang="de">Unser Team</translation>
      </label>
      <page id="team" src="team/index"/>
      <page id="history" src="team/history"/>
      <page id="hire" src="team/hire"/>
    </item>
    <item>
      <label>
        <translation lang="en">Contact</translation>
        <translation lang="de">Kontakt</translation>
      </label>
      <page id="contact" src="contact/contact"/>
      <page id="map" src="contact/map"/>
    </item>
  </menu>

  <!-- Orthogonal and dynamic blocks: -->
  <blocks>
    <!-- Extract the 'summary' block from the product page: -->
    <block id="feature" src="solutions/foobar_plus" select="summary"/>
    <!-- Extract the 'last' block from the front page: -->
    <block id="news" src="index" select="last"/>
    <!-- Take the entire subscribe.xml: -->
    <block id="subscribe" src="subscribe"/>
    <!-- Run site map generation: -->
    <block-process id="sitemap" process="sitemap" mode="text" depth="2"/>
    <!-- Run list-titles on all files in news/: -->
    <block-process id="news-list" process="list-titles" dir="news/"/>
  </blocks>

  <!-- The common part of the page titles: -->
  <html-title>
    <translation lang="en">Foobar Corporation AG</translation>
    <translation lang="de">Foobar Corporation AG</translation>
  </html-title>

  <!-- Page footer content: -->
  <page-footer>
    <copyright>
      <translation lang="en">© 2003 by Foobar Corporation AG. All rights reserved.</translation>
      <translation lang="de">© 2003 by Foobar Corporation AG. All rights reserved.</translation>
    </copyright>
    <language-switch>
      <translation lang="en">
        <lang link="de">Diese Seite in deutsch</lang>
      </translation>
      <translation lang="de">
        <lang link="en">This page in English</lang>
      </translation>
    </language-switch>
    <contact-webmaster>
      <translation lang="en">
        Problems using this site? Contact the
        <mailto link="firstname.lastname@example.org">Webmaster</mailto>.
      </translation>
      <translation lang="de">
        Probleme mit dieser Web-Site? Kontaktieren Sie bitte unseren
        <mailto link="email@example.com">Webmaster</mailto>.
      </translation>
    </contact-webmaster>
  </page-footer>

  <!-- Sequence navigation buttons: -->
  <buttons>
    <button id="prev">
      <translation lang="en">prev</translation>
      <translation lang="de">zurück</translation>
    </button>
    <button id="next">
      <translation lang="en">next</translation>
      <translation lang="de">vorwärts</translation>
    </button>
  </buttons>

</site>
The schema in
Languages. The lang-check abstract rule verifies that its context element contains exactly as many translation children as there are languages defined in the languages element. This rule can then be reused for any element that provides information in two languages. A separate rule with context="translation" additionally checks that the lang attributes correspond to the defined languages and that each language version is provided only once.
In this schema, many element-presence checks are lumped together
for simplicity (e.g., all children of an environment are
checked in one assert). It does not have to be this
way; if you want your schema to be really helpful, you can write a
separate check with its own diagnostic message for each element type, explaining its
role and the possible consequences of its being missing from
Context-sensitive checks. Note that there are two different page element types: One is used in the master document, and the other is the root element type in a page document. The same applies to blocks. The schema, however, has no problems differentiating between these element types based on the context.
One function of a schema is to check for unknown element type
names (most often resulting from typos). In Schematron, this can be
implemented by providing a dummy rule with no tests,
listing all defined element types as possible contexts. Following
that, a rule with context="*" signals an error whenever
the rule is activated. This technique is possible because each
context will only match one rule per pattern; if an element was not
matched by the dummy rule, it is caught by the next rule and
It's only a beginning. This example schema demonstrates only the basic, most critical
checks. Your own schema may be significantly larger and more
detailed than this, although it will likely use mostly the same
techniques. Consider this schema a phrasebook with common
expressions for typical situations. Several advanced tricks for
validating complex constraints are discussed in
Example 3.3. schema.sch
<schema xmlns="http://www.ascc.net/xml/schematron">

  <!-- Checks for the master document: -->
  <pattern name="master">

    <rule context="site">
      <report test="count(//environment) = 1">
        Only one 'environment' found; you will need to create more
        if you want to build the site in a different environment.
      </report>
      <report test="count(//environment) = 0">
        No 'environment' elements found; the stylesheet will be
        unable to figure out pathnames.
      </report>
      <assert test="languages and menu and html-title and page-footer and blocks">
        One of the required elements not found inside 'site'.
      </assert>
    </rule>

    <rule context="page-footer">
      <assert test="copyright and language-switch and contact-webmaster">
        One of the required elements not found inside 'page-footer'.
      </assert>
    </rule>

    <rule context="environment">
      <assert test="src-path and out-path and target-path and img-path and os">
        One of the required elements not found inside 'environment'.
      </assert>
      <assert test="@id">
        An 'environment' must have an 'id' attribute.
      </assert>
      <assert test="count(//environment/@id[. = current()/@id]) = 1">
        The 'id' attribute value of an 'environment' must be unique.
      </assert>
    </rule>

    <rule context="src-path | img-path | out-path | target-path">
      <report test="*">
        The '<name/>' element cannot have children.
      </report>
      <report test="(normalize-space(.) = '') and not(name() = 'target-path')">
        The '<name/>' element cannot be empty.
      </report>
    </rule>

    <rule context="languages">
      <assert test="count(lang) = count(*)">
        The 'languages' element can only have 'lang' children.
      </assert>
      <assert test="count(lang) > 0">
        The 'languages' element must have at least one 'lang' child.
      </assert>
    </rule>

    <rule context="languages/lang">
      <assert test="count(//languages/lang[. = current()]) = 1">
        Each language must be specified only once.
      </assert>
    </rule>

    <rule context="menu">
      <assert test="count(item) = count(*)">
        The 'menu' element cannot contain elements other than 'item'.
      </assert>
    </rule>

    <rule context="item">
      <assert test="label" diagnostics="label-element">
        A 'label' element is missing.
      </assert>
      <report test="count(label) > 1" diagnostics="label-element">
        There is an extra 'label' element.
      </report>
      <assert test="page">
        At least one 'page' element should be specified within an 'item'.
      </assert>
    </rule>

    <rule context="menu//page">
      <assert test="@src">
        Each 'page' must have an 'src' attribute.
      </assert>
      <assert test="@id">
        Each 'page' must have a unique 'id' attribute.
      </assert>
      <assert test="count(//page/@id[. = current()/@id]) = 1">
        The 'id' attribute value of a 'page' must be unique.
      </assert>
    </rule>

    <!-- Abstract rule to check 'translation' children: -->
    <rule abstract="true" id="lang-check">
      <assert test="count(translation) = count(//languages/lang)">
        The number of 'translation' children in '<name/>' must
        correspond to the number of defined languages. If this element
        does not exist in one of the languages, use an empty
        'translation' element.
      </assert>
      <assert test="count(translation) = count(*)">
        There must be no child elements here other than 'translation'.
      </assert>
    </rule>

    <!-- Applying the abstract rule to all bilingual elements: -->
    <rule context="label | html-title | copyright | language-switch | contact-webmaster | button">
      <extends rule="lang-check"/>
    </rule>

    <rule context="translation">
      <assert test="@lang">
        Each 'translation' must have a 'lang' attribute.
      </assert>
      <assert test="@lang = //languages/lang/text()">
        The value of the 'lang' attribute must correspond to one of
        the defined languages.
      </assert>
      <report test="@lang = preceding-sibling::translation/@lang">
        There is another 'translation' element under this parent with
        the same value of the 'lang' attribute.
      </report>
    </rule>

    <rule context="blocks">
      <report test="*[not(self::block or self::block-process)]">
        A 'blocks' element must only contain one or more 'block' or
        'block-process' elements.
      </report>
    </rule>

    <rule context="blocks/block">
      <assert test="@id and @src">
        A 'block' defined in the master document must have both 'id'
        and 'src' attributes.
      </assert>
      <assert test="count(//blocks/block/@id[. = current()/@id]) = 1">
        The 'id' attribute value of a 'block' must be unique.
      </assert>
    </rule>

  </pattern>

  <!-- Checks for page documents: -->
  <pattern name="page">

    <rule context="/page">
      <assert test="@keywords">
        Please consider adding a list of keywords to the page.
        Use a 'keywords' attribute for that.
      </assert>
      <assert test="title">
        Each 'page' must have a 'title'.
      </assert>
      <assert test="count(title) &lt; 2">
        A 'page' may have only one 'title'.
      </assert>
      <assert test="block">
        Each 'page' must have at least one 'block'.
      </assert>
    </rule>

    <rule context="page//block">
      <assert test="@idref or *">
        A block must have either an 'idref' attribute (referring to an
        orthogonal block) or children.
      </assert>
      <report test="@idref and *">
        A block cannot have both an 'idref' attribute and children.
      </report>
      <report test="count(p | section) &lt; count(*)">
        A block can only have 'p' or 'section' children.
      </report>
    </rule>

    <rule context="section">
      <assert test="head">
        A section must have a 'head'.
      </assert>
      <assert test="p">
        A section must have at least one 'p' (paragraph).
      </assert>
      <assert test="normalize-space(text()) = ''">
        A section cannot contain text. Use a 'p' element to include a
        paragraph of text.
      </assert>
    </rule>

  </pattern>

  <!-- Rules common for master and page documents: -->
  <pattern name="common">

    <rule context="int | link[@linktype='internal']">
      <assert test="@link">
        An internal link must use a 'link' attribute to specify the
        page being linked.
      </assert>
    </rule>

    <rule context="p">
      <report test="(normalize-space(text()) = '') and not(*)">
        A paragraph cannot be empty. If you want to increase vertical
        spacing here, modify the stylesheet.
      </report>
    </rule>

    <!-- Dummy rule listing all defined element types: -->
    <rule context="
      block | block-process | blocks | button | buttons |
      contact-webmaster | copyright | environment | em | ext | head |
      html-title | img-path | int | item | label | lang |
      language-switch | languages | link | mailto | menu | os |
      out-path | p | page | page-footer | site | section | src-path |
      subhead | target-path | title | translation"/>

    <!-- Report error if an element was not matched by the above: -->
    <rule context="*">
      <report test="true()">
        Unrecognized element: '<name/>'.
      </report>
    </rule>

  </pattern>

  <diagnostics>
    <diagnostic id="label-element">
      Every 'item' element must contain exactly one 'label' element
      specifying the corresponding top menu label.
    </diagnostic>
  </diagnostics>

</schema>
2. Formatting attributes such as font, color, or numbering style are out of the question - the whole point of semantic XML is that these must be abstracted away.
3. A linked element is often called an anchor, and HTML uses this term for both the source of the link (source anchor) and its destination (destination anchor); hence the use of the a element for both ends of a link.
4. In Western writing systems, of course.
10. An RFC (Request for Comments) is one of the series of standards created by the Internet Engineering Task Force (IETF) and governing most of the underlying technical structure of the Internet.
11. Strictly speaking, HTML links to URIs, not pathnames, but links within a site almost always use relative or absolute pathnames (without a server part) that are also valid URIs.
12. Unfortunately, this only works for your own site. Visitors coming from another site linking to yours will still get a 404 for a moved page.
15. It might be argued that the size of an input field is one of its essential semantic aspects and not a superficial formatting property.
17. The single slash character in this URL means that the file is available locally and not on a network host.