Fun(ctional programming) with fold-left and transform

Over on Stack Overflow, someone asked how to apply an XSLT 1.0 stylesheet that merges two XML documents to all XML files in a directory. Both the question and the answer there use a shell script to run the stylesheet with Saxon on two files, then merge the result with a third file, and so on.

However, XPath 3.0 introduced the uri-collection function to process a sequence of files, and XPath 3.1 even offers a transform function to run a transformation directly from XSLT/XPath, so I thought it should be possible to solve the problem completely in XSLT 3.0. Additionally, the algorithm (process a sequence of input files, apply the merge transformation to each file in turn, accumulate the result) looked like an opportunity to use the fold-left function, which is also available in XSLT 3.0 because it is part of the XPath 3 functions and operators specification.

The original XSLT 1.0 stylesheet takes a primary input document and a parameter named with that provides the URL of the second file to be merged. The transform function in XSLT/XPath 3.1 lets us call such a stylesheet by passing a single map argument with three entries: stylesheet-node, source-node and stylesheet-params. The value of stylesheet-params is a further map containing the single with parameter as an xs:QName -> xs:string key -> value pair. To encapsulate that, I have come up with the following function mf:merge, which takes the primary input document as a node and the URL of the secondary input document as a string, and assumes the already loaded stylesheet is available globally as $merge-sheet:

<xsl:function name="mf:merge" as="node()*">
  <xsl:param name="doc1" as="document-node()"/>
  <xsl:param name="doc2-uri" as="xs:string"/>
  <xsl:sequence select="transform(map {
      'stylesheet-node' : $merge-sheet,
      'source-node' : $doc1,
      'stylesheet-params' : map { xs:QName('with') : $doc2-uri }
    })?output"/>
</xsl:function>

The transform function takes a map argument and also returns a map. Inside the mf:merge function we simply access the main transformation result in that map under the key output, using the lookup operator ?output, and return it.
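To see the map-in, map-out shape of transform in isolation, here is a minimal sketch that does not depend on the merge stylesheet. The inner stylesheet, its greeting parameter and the name element are made up for illustration. It uses the stylesheet-text option with a string instead of stylesheet-node, purely to keep the example in a single file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="3.0">

  <!-- Hypothetical inner stylesheet, kept as a string so it can be passed as 'stylesheet-text' -->
  <xsl:variable name="inner-sheet-text" as="xs:string"><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:param name="greeting"/>
  <xsl:template match="/">
    <message><xsl:value-of select="$greeting, name"/></message>
  </xsl:template>
</xsl:stylesheet>]]></xsl:variable>

  <!-- Hypothetical source document for the inner transformation -->
  <xsl:variable name="source" as="document-node()">
    <xsl:document>
      <name>world</name>
    </xsl:document>
  </xsl:variable>

  <xsl:template name="xsl:initial-template">
    <!-- transform() returns a map; the principal result is stored under the key 'output' -->
    <xsl:variable name="result" as="map(*)" select="transform(map {
        'stylesheet-text' : $inner-sheet-text,
        'source-node' : $source,
        'stylesheet-params' : map { xs:QName('greeting') : 'Hello' }
      })"/>
    <xsl:sequence select="$result?output"/>
  </xsl:template>

</xsl:stylesheet>
```

Started with -it, this should output a document containing <message>Hello world</message>. With the default delivery format, the value under output is a document node, which is why the result of one merge can be fed back in as the source-node of the next.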

With that setup, what's left is to process a sequence of input files, passing in the first file as a node to the above function, the second as a URL and then to merge the result with the third file and so on. That is where fold-left comes in handy as follows:

<xsl:function name="mf:merge" as="node()*">
  <xsl:param name="input-uris" as="xs:anyURI*"/>
  <xsl:sequence select="fold-left(tail($input-uris), doc(head($input-uris)), mf:merge#2)"/>
</xsl:function>

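As you can see, that function simply takes a sequence of input file URIs and calls fold-left with three arguments: the tail of the sequence as the items to process, the loaded first file doc(head($input-uris)) as the initial value, and a named function reference to the two-argument function mf:merge#2 shown earlier. fold-left then calls mf:merge with the result accumulated so far and the next URI, until all URIs are consumed.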
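To make the order of the calls visible, here is a small sketch that replaces the real merge with a string-building stand-in. The function ex:combine and the ex namespace are hypothetical and only there to trace how fold-left nests the calls. Like the real code, it needs a processor with higher-order function support.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.com/ex"
    exclude-result-prefixes="xs ex"
    version="3.0">

  <xsl:output method="text"/>

  <!-- Hypothetical stand-in for mf:merge#2: records the call instead of running a transformation -->
  <xsl:function name="ex:combine" as="xs:string">
    <xsl:param name="accumulated" as="xs:string"/>
    <xsl:param name="next" as="xs:string"/>
    <xsl:sequence select="'merge(' || $accumulated || ', ' || $next || ')'"/>
  </xsl:function>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="files" as="xs:string*" select="'file1.xml', 'file2.xml', 'file3.xml'"/>
    <!-- Same argument pattern as mf:merge#1: tail as the sequence, head as the initial value -->
    <xsl:sequence select="fold-left(tail($files), head($files), ex:combine#2)"/>
  </xsl:template>

</xsl:stylesheet>
```

The output is merge(merge(file1.xml, file2.xml), file3.xml): the first two files are merged first, and that result becomes the first argument of the next call.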

The whole XSLT 3.0 stylesheet then looks as follows:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="xs math mf"
    version="3.0">

  <xsl:param name="input-dir" as="xs:string?" select="'.'"/>
  <xsl:param name="file-selection-pattern" as="xs:string" select="'?select=*.xml'"/>

  <!-- saved merge.xslt from http://web.archive.org/web/20160809092524/http://www2.informatik.hu-berlin.de/~obecker/XSLT/#merge as original-merge.xslt -->
  <xsl:param name="merge-code-uri" as="xs:string" select="'original-merge.xslt'"/>
  <xsl:param name="merge-sheet" as="document-node()" select="doc($merge-code-uri)"/>

  <!--
    Call Saxon 9.8 with option -it to start with below template that allows merging a collection of files
    as specified by the parameters $input-dir and $file-selection-pattern.
  -->
  <xsl:template name="xsl:initial-template">
    <xsl:variable name="input-uris" as="xs:anyURI*" select="uri-collection($input-dir || $file-selection-pattern)"/>
    <xsl:sequence select="mf:merge($input-uris)"/>
  </xsl:template>

  <xsl:function name="mf:merge" as="node()*">
    <xsl:param name="input-uris" as="xs:anyURI*"/>
    <xsl:sequence select="fold-left(tail($input-uris), doc(head($input-uris)), mf:merge#2)"/>
  </xsl:function>

  <xsl:function name="mf:merge" as="node()*">
    <xsl:param name="doc1" as="document-node()"/>
    <xsl:param name="doc2-uri" as="xs:string"/>
    <xsl:sequence select="transform(map {
        'stylesheet-node' : $merge-sheet,
        'source-node' : $doc1,
        'stylesheet-params' : map { xs:QName('with') : $doc2-uri }
      })?output"/>
  </xsl:function>

</xsl:stylesheet>

As you can see, it uses the uri-collection function with a Saxon-specific query parameter on the collection URI, declared as <xsl:param name="file-selection-pattern" as="xs:string" select="'?select=*.xml'"/>, to read in the URIs of all .xml files in a folder, and then simply calls mf:merge($input-uris).
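Before running the merge, it can help to check which files the pattern actually picks up. The following is a minimal sketch that only lists the URIs returned by uri-collection, one per line. It assumes Saxon, since the ?select=*.xml part is the Saxon-specific extension mentioned above, and it reuses the parameter names of the stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="3.0">

  <xsl:output method="text"/>

  <xsl:param name="input-dir" as="xs:string" select="'.'"/>
  <xsl:param name="file-selection-pattern" as="xs:string" select="'?select=*.xml'"/>

  <xsl:template name="xsl:initial-template">
    <!-- Print each URI matched by the collection URI on its own line -->
    <xsl:value-of select="uri-collection($input-dir || $file-selection-pattern)" separator="&#10;"/>
  </xsl:template>

</xsl:stylesheet>
```

A relative $input-dir such as '.' is resolved against the base URI of the stylesheet, so the listing shows the .xml files next to the stylesheet. If an earlier merge result was saved in the same folder with an .xml extension, it would show up here as well and be merged again.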

The stylesheet can be run with Saxon 9.8 PE or EE. Unfortunately it does not work with 9.8 HE, for two reasons: first, higher-order functions like fold-left and named function references are not supported in HE; second, the original stylesheet is an XSLT 1.0 stylesheet, which 9.8 HE does not support. So to use the original stylesheet with 9.8 HE, we would need to edit the original merge code to use version 3.0, and we would need to rewrite the functions to implement the recursion that fold-left provides ourselves, without using a named function reference.
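The first of those two changes is small. As a sketch, assuming the saved copy original-merge.xslt is the file being edited, only the version attribute on the root element changes:

```xml
<?xml version="1.0"?>
<!-- Sketch: root element of the saved copy original-merge.xslt after the edit.
     Only the version attribute changes from "1.0" to "3.0". -->
<xslt:transform version="3.0"
    xmlns:xslt="http://www.w3.org/1999/XSL/Transform"
    xmlns:m="http://informatik.hu-berlin.de/merge"
    exclude-result-prefixes="m">

  <!-- All parameters and templates of the original stylesheet stay as they are. -->

</xslt:transform>
```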

For the time being, here are three sample input documents to be merged, the merged result that Saxon 9.8 produces, and the original XSLT 1.0 merge stylesheet it executes:

<?xml version="1.0" encoding="UTF-8"?>
<!-- file1.xml -->
<themes>
  <theme id="appl">
    <title xml:lang="nl">Toepassingen</title>
  </theme>
</themes>

<?xml version="1.0" encoding="UTF-8"?>
<!-- file2.xml -->
<themes>
  <theme id="doc" />
  <theme id="appl">
    <title xml:lang="en">Applications</title>
  </theme>
</themes>

<?xml version="1.0" encoding="UTF-8"?>
<!-- file3.xml -->
<themes>
  <theme id="doc" />
  <theme id="appl">
    <title xml:lang="es">aplicación</title>
  </theme>
</themes>

<?xml version="1.0" encoding="UTF-8"?><!-- file1.xml --><!-- file2.xml --><!-- file3.xml --><themes>
<theme id="doc"/>
<theme id="appl">
<title xml:lang="nl">Toepassingen</title>
<title xml:lang="en">Applications</title>
<title xml:lang="es">aplicación</title>
</theme>
</themes>
<?xml version="1.0"?>
<!--
Merging two XML files
Version 1.6
LGPL (c) Oliver Becker, 2002-07-05
obecker@informatik.hu-berlin.de
-->
<xslt:transform version="1.0"
xmlns:xslt="http://www.w3.org/1999/XSL/Transform"
xmlns:m="http://informatik.hu-berlin.de/merge"
exclude-result-prefixes="m">
<!-- Normalize the contents of text, comment, and processing-instruction
nodes before comparing?
Default: yes -->
<xslt:param name="normalize" select="'yes'" />
<!-- Don't merge elements with this (qualified) name -->
<xslt:param name="dontmerge" />
<!-- If set to true, text nodes in file1 will be replaced -->
<xslt:param name="replace" select="false()" />
<!-- Variant 1: Source document looks like
<?xml version="1.0"?>
<merge xmlns="http://informatik.hu-berlin.de/merge">
<file1>file1.xml</file1>
<file2>file2.xml</file2>
</merge>
The transformation sheet merges file1.xml and file2.xml.
-->
<xslt:template match="m:merge" >
<xslt:variable name="file1" select="string(m:file1)" />
<xslt:variable name="file2" select="string(m:file2)" />
<xslt:message>
<xslt:text />Merging '<xslt:value-of select="$file1" />
<xslt:text />' and '<xslt:value-of select="$file2"/>'<xslt:text />
</xslt:message>
<xslt:if test="$file1='' or $file2=''">
<xslt:message terminate="yes">
<xslt:text>No files to merge specified</xslt:text>
</xslt:message>
</xslt:if>
<xslt:call-template name="m:merge">
<xslt:with-param name="nodes1" select="document($file1,/*)/node()" />
<xslt:with-param name="nodes2" select="document($file2,/*)/node()" />
</xslt:call-template>
</xslt:template>
<!-- Variant 2:
The transformation sheet merges the source document with the
document provided by the parameter "with".
-->
<xslt:param name="with" />
<xslt:template match="*">
<xslt:message>
<xslt:text />Merging input with '<xslt:value-of select="$with"/>
<xslt:text>'</xslt:text>
</xslt:message>
<xslt:if test="string($with)=''">
<xslt:message terminate="yes">
<xslt:text>No input file specified (parameter 'with')</xslt:text>
</xslt:message>
</xslt:if>
<xslt:call-template name="m:merge">
<xslt:with-param name="nodes1" select="/node()" />
<xslt:with-param name="nodes2" select="document($with,/*)/node()" />
</xslt:call-template>
</xslt:template>
<!-- ============================================================== -->
<!-- The "merge" template -->
<xslt:template name="m:merge">
<xslt:param name="nodes1" />
<xslt:param name="nodes2" />
<xslt:choose>
<!-- Is $nodes1 resp. $nodes2 empty? -->
<xslt:when test="count($nodes1)=0">
<xslt:copy-of select="$nodes2" />
</xslt:when>
<xslt:when test="count($nodes2)=0">
<xslt:copy-of select="$nodes1" />
</xslt:when>
<xslt:otherwise>
<!-- Split $nodes1 and $nodes2 -->
<xslt:variable name="first1" select="$nodes1[1]" />
<xslt:variable name="rest1" select="$nodes1[position()!=1]" />
<xslt:variable name="first2" select="$nodes2[1]" />
<xslt:variable name="rest2" select="$nodes2[position()!=1]" />
<!-- Determine type of node $first1 -->
<xslt:variable name="type1">
<xslt:apply-templates mode="m:detect-type" select="$first1" />
</xslt:variable>
<!-- Compare $first1 and $first2 -->
<xslt:variable name="diff-first">
<xslt:call-template name="m:compare-nodes">
<xslt:with-param name="node1" select="$first1" />
<xslt:with-param name="node2" select="$first2" />
</xslt:call-template>
</xslt:variable>
<xslt:choose>
<!-- $first1 != $first2 -->
<xslt:when test="$diff-first='!'">
<!-- Compare $first1 and $rest2 -->
<xslt:variable name="diff-rest">
<xslt:for-each select="$rest2">
<xslt:call-template name="m:compare-nodes">
<xslt:with-param name="node1" select="$first1" />
<xslt:with-param name="node2" select="." />
</xslt:call-template>
</xslt:for-each>
</xslt:variable>
<xslt:choose>
<!-- $first1 is in $rest2 and
$first1 is *not* an empty text node -->
<xslt:when test="contains($diff-rest,'=') and
not($type1='text' and
normalize-space($first1)='')">
<!-- determine position of $first1 in $nodes2
and copy all preceding nodes of $nodes2 -->
<xslt:variable name="pos"
select="string-length(substring-before(
$diff-rest,'=')) + 2" />
<xslt:copy-of
select="$nodes2[position() &lt; $pos]" />
<!-- merge $first1 with its equivalent node -->
<xslt:choose>
<!-- Elements: merge -->
<xslt:when test="$type1='element'">
<xslt:element name="{name($first1)}"
namespace="{namespace-uri($first1)}">
<xslt:copy-of select="$first1/namespace::*" />
<xslt:copy-of select="$first2/namespace::*" />
<xslt:copy-of select="$first1/@*" />
<xslt:call-template name="m:merge">
<xslt:with-param name="nodes1"
select="$first1/node()" />
<xslt:with-param name="nodes2"
select="$nodes2[position()=$pos]/node()" />
</xslt:call-template>
</xslt:element>
</xslt:when>
<!-- Other: copy -->
<xslt:otherwise>
<xslt:copy-of select="$first1" />
</xslt:otherwise>
</xslt:choose>
<!-- Merge $rest1 and rest of $nodes2 -->
<xslt:call-template name="m:merge">
<xslt:with-param name="nodes1" select="$rest1" />
<xslt:with-param name="nodes2"
select="$nodes2[position() &gt; $pos]" />
</xslt:call-template>
</xslt:when>
<!-- $first1 is a text node and replace mode was
activated -->
<xslt:when test="$type1='text' and $replace">
<xslt:call-template name="m:merge">
<xslt:with-param name="nodes1" select="$rest1" />
<xslt:with-param name="nodes2" select="$nodes2" />
</xslt:call-template>
</xslt:when>
<!-- else: $first1 is not in $rest2 or
$first1 is an empty text node -->
<xslt:otherwise>
<xslt:copy-of select="$first1" />
<xslt:call-template name="m:merge">
<xslt:with-param name="nodes1" select="$rest1" />
<xslt:with-param name="nodes2" select="$nodes2" />
</xslt:call-template>
</xslt:otherwise>
</xslt:choose>
</xslt:when>
<!-- else: $first1 = $first2 -->
<xslt:otherwise>
<xslt:choose>
<!-- Elements: merge -->
<xslt:when test="$type1='element'">
<xslt:element name="{name($first1)}"
namespace="{namespace-uri($first1)}">
<xslt:copy-of select="$first1/namespace::*" />
<xslt:copy-of select="$first2/namespace::*" />
<xslt:copy-of select="$first1/@*" />
<xslt:call-template name="m:merge">
<xslt:with-param name="nodes1"
select="$first1/node()" />
<xslt:with-param name="nodes2"
select="$first2/node()" />
</xslt:call-template>
</xslt:element>
</xslt:when>
<!-- Other: copy -->
<xslt:otherwise>
<xslt:copy-of select="$first1" />
</xslt:otherwise>
</xslt:choose>
<!-- Merge $rest1 and $rest2 -->
<xslt:call-template name="m:merge">
<xslt:with-param name="nodes1" select="$rest1" />
<xslt:with-param name="nodes2" select="$rest2" />
</xslt:call-template>
</xslt:otherwise>
</xslt:choose>
</xslt:otherwise>
</xslt:choose>
</xslt:template>
<!-- Comparing single nodes:
if $node1 and $node2 are equivalent then the template creates a
text node "=" otherwise a text node "!" -->
<xslt:template name="m:compare-nodes">
<xslt:param name="node1" />
<xslt:param name="node2" />
<xslt:variable name="type1">
<xslt:apply-templates mode="m:detect-type" select="$node1" />
</xslt:variable>
<xslt:variable name="type2">
<xslt:apply-templates mode="m:detect-type" select="$node2" />
</xslt:variable>
<xslt:choose>
<!-- Are $node1 and $node2 element nodes with the same name? -->
<xslt:when test="$type1='element' and $type2='element' and
local-name($node1)=local-name($node2) and
namespace-uri($node1)=namespace-uri($node2) and
name($node1)!=$dontmerge and name($node2)!=$dontmerge">
<!-- Comparing the attributes -->
<xslt:variable name="diff-att">
<!-- same number ... -->
<xslt:if test="count($node1/@*)!=count($node2/@*)">.</xslt:if>
<!-- ... and same name/content -->
<xslt:for-each select="$node1/@*">
<xslt:if test="not($node2/@*
[local-name()=local-name(current()) and
namespace-uri()=namespace-uri(current()) and
.=current()])">.</xslt:if>
</xslt:for-each>
</xslt:variable>
<xslt:choose>
<xslt:when test="string-length($diff-att)!=0">!</xslt:when>
<xslt:otherwise>=</xslt:otherwise>
</xslt:choose>
</xslt:when>
<!-- Other nodes: test for the same type and content -->
<xslt:when test="$type1!='element' and $type1=$type2 and
name($node1)=name($node2) and
($node1=$node2 or
($normalize='yes' and
normalize-space($node1)=
normalize-space($node2)))">=</xslt:when>
<!-- Otherwise: different node types or different name/content -->
<xslt:otherwise>!</xslt:otherwise>
</xslt:choose>
</xslt:template>
<!-- Type detection, thanks to M. H. Kay -->
<xslt:template match="*" mode="m:detect-type">element</xslt:template>
<xslt:template match="text()" mode="m:detect-type">text</xslt:template>
<xslt:template match="comment()" mode="m:detect-type">comment</xslt:template>
<xslt:template match="processing-instruction()" mode="m:detect-type">pi</xslt:template>
</xslt:transform>

As promised in the first edit of this post, I also show below how to implement the same transform-based approach for Saxon 9.8 HE, avoiding fold-left and the named function reference and instead implementing the recursion in a user-defined function:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="xs math mf"
    version="3.0">

  <xsl:param name="input-dir" as="xs:string?" select="'.'"/>
  <xsl:param name="file-selection-pattern" as="xs:string" select="'?select=*.xml'"/>

  <!-- saved merge.xslt from http://web.archive.org/web/20160809092524/http://www2.informatik.hu-berlin.de/~obecker/XSLT/#merge as original-merge.xslt -->
  <xsl:param name="merge-code-uri" as="xs:string" select="'original-merge.xslt'"/>
  <xsl:param name="merge-sheet" as="document-node()" select="doc($merge-code-uri)"/>

  <!--
    Call Saxon 9.8 with option -it to start with below template that allows merging a collection of files
    as specified by the parameters $input-dir and $file-selection-pattern.
  -->
  <xsl:template name="xsl:initial-template">
    <xsl:variable name="input-uris" as="xs:anyURI*" select="uri-collection($input-dir || $file-selection-pattern)"/>
    <xsl:sequence select="mf:merge($input-uris)"/>
  </xsl:template>

  <xsl:function name="mf:merge" as="node()*">
    <xsl:param name="input-uris" as="xs:anyURI*"/>
    <xsl:sequence select="mf:chain-merge(tail($input-uris), doc(head($input-uris)))"/>
  </xsl:function>

  <xsl:function name="mf:chain-merge" as="node()*">
    <xsl:param name="input-uris" as="xs:anyURI*"/>
    <xsl:param name="result" as="node()"/>
    <xsl:sequence select="if (empty($input-uris))
        then $result
        else mf:chain-merge(tail($input-uris), mf:merge($result, head($input-uris)))"/>
  </xsl:function>

  <xsl:function name="mf:merge" as="node()*">
    <xsl:param name="doc1" as="document-node()"/>
    <xsl:param name="doc2-uri" as="xs:string"/>
    <xsl:sequence select="transform(map {
        'stylesheet-node' : $merge-sheet,
        'source-node' : $doc1,
        'stylesheet-params' : map { xs:QName('with') : $doc2-uri }
      })?output"/>
  </xsl:function>

</xsl:stylesheet>


Make sure you edit the original merge XSLT 1.0 stylesheet to use version="3.0" to allow running it with Saxon 9.8 HE. This concludes this blog post.

