Friday 1 August 2003

XSLT, CDATA and disable-output-escaping

This document elaborates on my ideas about XSLT in response to Kevin Davis’s experiments with Movable Type at Alazanto.

Be warned that following this document properly requires a good knowledge of XML and XSLT, although I have tried to make the explanation and examples as clear as possible.

The Problem

Kevin has been experimenting with using XSLT to format simple XML output from Movable Type into a complete web page. Originally, he included the data for each weblog entry in a CDATA section containing literal XHTML, in much the same way as many RSS feeds do. The following example is reformatted for clarity:

<entry>
 <title>entry with images</title>
 <date>August 09, 2003</date>
 <author>Kevin</author>
 <idnum>000033</idnum>
 <permalink>http://alazanto.org/xml/archives/000033.xml</permalink>
 <body xmlns:html="http://www.w3.org/1999/xhtml"><![CDATA[<p><img
  class="archive" align="right" src="http://alazanto.org/images/sample.jpg"
  alt="photograph of a flower, just for show"/>Mauris felis elit, varius
  quis, pulvinar vel, sodales vehicula, mi. Nunc elementum pharetra elit.
  </p>]]>
 </body>
 <more xmlns:html="http://www.w3.org/1999/xhtml"><![CDATA[]]></more>
 <comment-link>http://alazanto.org/xml/archives/000033_comments.xml</comment-link>
 <comment-count>6</comment-count>
</entry>

The XML CDATA markup indicates that the data between <![CDATA[ and ]]> should not be parsed as markup: no elements are recognised and no entity references are resolved. Instead, the data is included as a literal string, exactly as if each <, > and & had been encoded as &lt;, &gt; and &amp; respectively. The result is a DOM tree like the following:

  • <entry>
    • <title>
      • 'entry with images'
    • <date>
      • 'August 09, 2003'
    • <author>
      • 'Kevin'
    • <idnum>
      • '000033'
    • <permalink>
      • 'http://alazanto.org/xml/archives/000033.xml'
    • <body>
      • '<p><img class="archive" align="right" src="http://alazanto.org/images/sample.jpg" alt="photograph of a flower, just for show"/>Mauris felis elit, varius quis, pulvinar vel, sodales vehicula, mi. Nunc elementum pharetra elit.</p>'
    • <more>
    • <comment-link>
      • 'http://alazanto.org/xml/archives/000033_comments.xml'
    • <comment-count>
      • '6'

Note that in this DOM, the child text node of the <body> element is just a string, with no special meaning to an XML parser or an XSLT processor, even if it looks to you like a paragraph from an XHTML document.
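To make this concrete, here is a minimal sketch using Python's standard xml.dom.minidom parser rather than an XSLT processor. The cut-down <body> elements and the helper names string_value and element_children are my own, invented for illustration. It parses the same paragraph once inside a CDATA section and once escaped by hand, and checks that both give the same string and no child elements:

```python
from xml.dom import minidom

# Hypothetical, cut-down <body> elements: one wraps the markup in a CDATA
# section, the other escapes the same markup by hand.
with_cdata = '<body><![CDATA[<p>Mauris <em>felis</em> elit.</p>]]></body>'
escaped = '<body>&lt;p&gt;Mauris &lt;em&gt;felis&lt;/em&gt; elit.&lt;/p&gt;</body>'


def string_value(xml_source):
    """Join the character data of the document element's children."""
    root = minidom.parseString(xml_source).documentElement
    return ''.join(child.data for child in root.childNodes)


def element_children(xml_source):
    """List the element nodes directly under the document element."""
    root = minidom.parseString(xml_source).documentElement
    return [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]


# Both spellings produce the same literal string...
print(string_value(with_cdata) == string_value(escaped))
# ...and neither produces a <p> element in the DOM.
print(element_children(with_cdata))
```

As far as the parser is concerned, the <p> and <em> "tags" are just characters in a string.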

We can write fairly simple XSLT templates to turn this XML into XHTML for the browser. To include the literal XHTML in the result, we can try the XSLT disable-output-escaping attribute, with a template something like this:

<xsl:template match="entry">
  <div class="entry">
    <h2><xsl:value-of select="title"/></h2>
    <xsl:value-of select="body" disable-output-escaping="yes"/>
  </div>
</xsl:template>

Without the disable-output-escaping attribute, the string value of the <body> element would be escaped as it is written to the output, so that another XML parser reading the output would recover exactly the same string. In other words, each < would be written as &lt;, each & as &amp;, and each > as &gt;.
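The escaping step can be sketched in a few lines of Python. This uses the standard xml.sax.saxutils helpers as a stand-in for an XSLT processor's serialiser, which is an assumption made for illustration; the sample string is also made up:

```python
from xml.sax.saxutils import escape, unescape

# A made-up string value, as it would sit in a text node of the source tree.
body_text = '<p>Nunc elementum & pharetra elit.</p>'

# Escaping on output: markup characters become entity references.
escaped_output = escape(body_text)
print(escaped_output)

# A parser reading the escaped output back recovers the original string.
print(unescape(escaped_output) == body_text)
```

Disabling output escaping amounts to skipping the escape() call and writing body_text to the output as it stands.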

When processed in Internet Explorer or a stand-alone XSLT processor, the disable-output-escaping attribute disables this escaping step, so that the text child of the <body> node is written literally to the output, as shown below. This is what Kevin expected:

<div class="entry">
 <h2>entry with images</h2>
 <p><img class="archive" align="right"
  src="http://alazanto.org/images/sample.jpg"
  alt="photograph of a flower, just for show"/>Mauris felis elit,
 varius quis, pulvinar vel, sodales vehicula, mi. Nunc elementum
 pharetra elit... </p>
</div>

The problem occurs when trying to use the same templates in Mozilla. The Mozilla XSLT processor doesn’t support disable-output-escaping, since it transforms directly from the source DOM to a destination DOM tree, without an output step in which to disable escaping. The DOM that Mozilla constructs is quite predictable, but not what Kevin wanted:

  • <div>
    • @class='entry'
    • <h2>
      • 'entry with images'
    • '<p><img class="archive" align="right" src="http://alazanto.org/images/sample.jpg" alt="photograph of a flower, just for show"/>Mauris felis elit, varius quis, pulvinar vel, sodales vehicula, mi. Nunc elementum pharetra elit... </p>'

This means that Mozilla displays the markup to the user, complete with <p> and <img> tags, instead of the paragraph text with a floating image. Mozilla bug 98168 is about this behaviour, and comment 11 states quite clearly that it is expected and will not be changed.
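A rough way to picture a DOM-to-DOM transformation is to build the result tree by hand. The sketch below does this with Python's xml.dom.minidom. It is not Mozilla's implementation, and the function name transform_entry is hypothetical. It shows that when the result is built as a tree, the string value of <body> can only become a text node:

```python
from xml.dom import minidom


def transform_entry(title, body_string):
    """Build <div class="entry"> directly as a DOM tree, with no output step."""
    doc = minidom.Document()
    div = doc.createElement('div')
    div.setAttribute('class', 'entry')
    doc.appendChild(div)

    h2 = doc.createElement('h2')
    h2.appendChild(doc.createTextNode(title))
    div.appendChild(h2)

    # The body is only a string, so it becomes a text node, never elements.
    div.appendChild(doc.createTextNode(body_string))
    return div


div = transform_entry('entry with images', '<p>Mauris felis elit.</p>')
print([child.nodeName for child in div.childNodes])
print(len(div.getElementsByTagName('p')))
```

The second child of the <div> is a text node, and there is no <p> element anywhere in the tree, so a browser rendering this DOM has nothing to show but the raw markup.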

The Solution

The solution for Kevin is to create the original XML file without enclosing the paragraph in a CDATA section, making the image and paragraph tags real elements in the source XML DOM, so that they can be copied directly to the destination XHTML DOM. This small change to the source XML gives us a very different source DOM tree:

  • <entry>
    • <title>
      • 'entry with images'
    • <date>
      • 'August 09, 2003'
    • <author>
      • 'Kevin'
    • <idnum>
      • '000033'
    • <permalink>
      • 'http://alazanto.org/xml/archives/000033.xml'
    • <body>
      • <p>
        • <img>
          • @class='archive'
          • @align='right'
          • @src='http://alazanto.org/images/sample.jpg'
          • @alt='photograph of a flower, just for show'
        • 'Mauris felis elit, varius quis, pulvinar vel, sodales vehicula, mi. Nunc elementum pharetra elit.'
    • <more>
    • <comment-link>
      • 'http://alazanto.org/xml/archives/000033_comments.xml'
    • <comment-count>
      • '6'

With this input, the XSLT to copy the nodes can be just as simple, using xsl:copy-of to copy all the elements under the source <body> element, but not the element itself:

<xsl:template match="entry">
 <div class="entry">
  <h2><xsl:value-of select="title"/></h2>
  <xsl:copy-of select="body/*"/>
 </div>
</xsl:template>

Because the stylesheet now copies elements instead of literal XHTML source code, Mozilla builds the correct DOM tree and displays the page as intended, and the same stylesheet works just as well with Internet Explorer and external XSLT processors. The result looks very similar to the input DOM:

  • <div>
    • @class='entry'
    • <h2>
      • 'entry with images'
    • <p>
      • <img>
        • @class='archive'
        • @align='right'
        • @src='http://alazanto.org/images/sample.jpg'
        • @alt='photograph of a flower, just for show'
      • 'Mauris felis elit, varius quis, pulvinar vel, sodales vehicula, mi. Nunc elementum pharetra elit.'
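The same copy can be imitated at the DOM level. The following is a sketch in Python with xml.dom.minidom, not an XSLT processor; the shortened SOURCE document and the function name copy_body_elements are invented for illustration. It deep-copies each element child of <body> into the result tree, which is roughly what xsl:copy-of with select="body/*" does:

```python
from xml.dom import minidom

# A shortened, hypothetical version of the entry, with real elements in <body>.
SOURCE = (
    '<entry><title>entry with images</title>'
    '<body><p><img class="archive" alt="a flower"/>Mauris felis elit.</p></body>'
    '</entry>'
)


def copy_body_elements(entry_xml):
    """Build <div class="entry"> and deep-copy body's element children into it."""
    entry = minidom.parseString(entry_xml).documentElement
    result = minidom.Document()
    div = result.createElement('div')
    div.setAttribute('class', 'entry')
    result.appendChild(div)

    title = entry.getElementsByTagName('title')[0]
    h2 = result.createElement('h2')
    h2.appendChild(result.createTextNode(title.firstChild.data))
    div.appendChild(h2)

    body = entry.getElementsByTagName('body')[0]
    for child in body.childNodes:
        # Like select="body/*": element children only, not <body> itself.
        if child.nodeType == child.ELEMENT_NODE:
            div.appendChild(result.importNode(child, True))
    return div


div = copy_body_elements(SOURCE)
print([child.nodeName for child in div.childNodes])
```

Because the copy is made node by node, the <p> and <img> arrive in the result as elements with their attributes intact, and no escaping or unescaping is involved at any point.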

You can see the result of this in Kevin’s example XML weblog, in any web browser that supports XSLT.

(Finally, thanks to Kevin for using the <xsl:copy-of> element, which I’d managed to miss in four years of reading the XSLT spec.)
