Directory listings in XProc

UPDATE: With some input from Vojtech Toman of EMC, the code has been improved so that it now runs in both Calumet and Calabash.

What is the request

  1. get a list of directories
  2. inside each of these directories, get a list of files
  3. perform some simple markup processing on each of these files

What is on offer in XProc

directory-listing

One of the standard steps that XProc offers is p:directory-list.
This step produces a list of the contents of a specified directory.
It has the following signature:
<p:declare-step type="p:directory-list">
  <p:output port="result"/>
  <p:option name="path" required="true"/> <!-- anyURI -->
  <p:option name="include-filter"/> <!-- RegularExpression -->
  <p:option name="exclude-filter"/> <!-- RegularExpression -->
</p:declare-step>

It outputs an XML document describing the structure and contents of the directory specified by the path option.

Example.
Assume that our path is file:///Users/paul/test/ and that it has the following structure:

[Figure: directory structure]

The generated XML is then as follows:

<c:directory xmlns:c="http://www.w3.org/ns/xproc-step" name="test" xml:base="file:/Users/paul/test/">
  <c:file name=".DS_Store"/>
  <c:directory name="A"/>
  <c:directory name="B"/>
  <c:directory name="C"/>
</c:directory>

The root element carries the name of the directory and its base URI (xml:base).
Its child elements describe the subdirectories and files; the only file here is .DS_Store, a hidden file created by Mac OS X.

The XProc code used to generate this directory listing is:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:cx="http://xmlcalabash.com/ns/extensions" name="myPipeline">
  <p:output port="result" sequence="true"/>
  <p:variable name="path" select="'file:///Users/paul/test/'">
    <p:empty/>
  </p:variable>
  <p:directory-list>
    <p:with-option name="path" select="$path">
      <p:empty/>
    </p:with-option>
  </p:directory-list>
</p:declare-step>

repeat-over

Now that we have the directory list, we can use it to loop over the child directories.
We will use the p:for-each step, which processes a sequence of documents by applying a subpipeline to each document in turn. The sequence to loop over is defined by the p:iteration-source element, where we select the child directories of the root directory (/c:directory/c:directory).

So when our XProc becomes
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:cx="http://xmlcalabash.com/ns/extensions" name="myPipeline">
  <p:output port="result" sequence="true"/>
  <p:variable name="path" select="'file:///Users/paul/test/'">
    <p:empty/>
  </p:variable>
  <p:directory-list>
    <p:with-option name="path" select="$path">
      <p:empty/>
    </p:with-option>
  </p:directory-list>
  <p:for-each name="directoryloop">
    <p:output port="result" sequence="true"/>
    <p:iteration-source select="/c:directory/c:directory"/>
    <p:identity/>
  </p:for-each>
</p:declare-step>

our result is a sequence of three documents:

<c:directory xmlns:c="http://www.w3.org/ns/xproc-step" name="A"/>
<c:directory xmlns:c="http://www.w3.org/ns/xproc-step" name="B"/>
<c:directory xmlns:c="http://www.w3.org/ns/xproc-step" name="C"/>

Because the pipeline produces more than one document, the result port has been declared with sequence="true".
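
If a single result document is more convenient than a sequence, the documents can be wrapped in one root element. A minimal sketch, assuming the standard p:wrap-sequence step is placed directly after the loop; the wrapper name directories is a hypothetical choice and not part of the original pipeline:

```xml
<!-- Sketch: goes right after </p:for-each> in the pipeline above. -->
<!-- The wrapper element name "directories" is made up for illustration. -->
<p:wrap-sequence wrapper="directories"/>
```

With this step in place the pipeline emits one document, so sequence="true" on the outer result port would no longer be strictly needed.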

Adding a loop within the loop

Now we want to build a list of files for each of these subdirectories.
Within the loop, the current document during the first iteration is, for example:
<c:directory xmlns:c="http://www.w3.org/ns/xproc-step" name="A"/>

From it we build a new directory listing, for which we need to append the name of the child directory (found in the XML above) to the path. In XProc:

<p:variable name="dirpath" select="p:resolve-uri(concat(c:directory/@name, '/'),$path)"/>

<p:directory-list>
  <p:with-option name="path" select="$dirpath"/>
</p:directory-list>

Our XProc code becomes:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:cx="http://xmlcalabash.com/ns/extensions" name="myPipeline">
  <p:output port="result" sequence="true"/>

  <p:variable name="path" select="'file:///Users/paul/test/'">
    <p:empty/>
  </p:variable>
  <p:directory-list>
    <p:with-option name="path" select="$path">
      <p:empty/>
    </p:with-option>
  </p:directory-list>
  <p:for-each name="directoryloop">
    <p:output port="result" sequence="true"/>
    <p:iteration-source select="/c:directory/c:directory"/>
    <p:variable name="dirpath" select="p:resolve-uri(concat(c:directory/@name, '/'),$path)"/>
    <p:directory-list>
      <p:with-option name="path" select="$dirpath"/>
    </p:directory-list>
    <p:identity/>
  </p:for-each>
</p:declare-step>

For the first iteration this gives us:

<c:directory xmlns:c="http://www.w3.org/ns/xproc-step" name="A" xml:base="file:/Users/paul/test/A/">
  <c:file name="A1.xml"/>
  <c:file name="A2.xml"/>
</c:directory>
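
Subdirectories may also contain files that are not XML, such as the hidden .DS_Store file seen earlier, and these would get in the way once the files are loaded. A hedged sketch that restricts the inner listing with the include-filter option from the signature above; the pattern is an assumption about how the files are named, and it is anchored with $ so that it behaves the same whether or not the processor matches against the whole name:

```xml
<!-- Sketch: list only entries whose name ends in ".xml". -->
<p:directory-list include-filter=".*\.xml$">
  <p:with-option name="path" select="$dirpath"/>
</p:directory-list>
```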

make-absolute-uris

It would also be good to have absolute URIs for the files.
Once again an XProc step comes to the rescue: p:make-absolute-uris, which turns the value of an element or attribute in the source document into an absolute IRI in the result document.

Hence we add the following lines to our XProc:
<p:make-absolute-uris match="c:file/@name">
  <p:with-option name="base-uri" select="$dirpath"/>
</p:make-absolute-uris>

This leads to:
<c:directory xmlns:c="http://www.w3.org/ns/xproc-step" name="A" xml:base="file:/Users/paul/test/A/">
  <c:file name="file:/Users/paul/test/A/A1.xml"/>
  <c:file name="file:/Users/paul/test/A/A2.xml"/>
</c:directory>

This gives us enough information to iterate over the files using the XPath expression /c:directory/c:file, to load them, and to process them further.

The full XProc

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:cx="http://xmlcalabash.com/ns/extensions" name="myPipeline">
  <p:output port="result" sequence="true"/>
  <!-- <p:import href="http://xmlcalabash.com/extension/steps/library-1.0.xpl"/>-->
  <!-- Root directory to scan -->
  <p:variable name="path" select="'file:///Users/paul/test/'">
    <p:empty/>
  </p:variable>
  <!-- List the contents of the root directory -->
  <p:directory-list>
    <p:with-option name="path" select="$path">
      <p:empty/>
    </p:with-option>
  </p:directory-list>
  <!-- Outer loop: one iteration per child directory -->
  <p:for-each name="directoryloop">
    <p:output port="result" sequence="true"/>
    <p:iteration-source select="/c:directory/c:directory"/>
    <!-- Absolute URI of the current child directory, with trailing slash -->
    <p:variable name="dirpath" select="p:resolve-uri(concat(c:directory/@name, '/'),$path)"/>

    <p:directory-list>
      <p:with-option name="path" select="$dirpath"/>
    </p:directory-list>
    <!-- Turn each file name into an absolute URI -->
    <p:make-absolute-uris match="c:file/@name">
      <p:with-option name="base-uri" select="$dirpath"/>
    </p:make-absolute-uris>
    <!-- Inner loop: load every file in the current child directory -->
    <p:for-each name="fileloop">
      <p:iteration-source select="/c:directory/c:file"/>
      <p:variable name="file" select="/c:file/@name"/>
      <p:load name="file">
        <p:with-option name="href" select="$file"/>
      </p:load>
      <p:identity/>
    </p:for-each>
    <!-- <p:identity/>-->
  </p:for-each>
</p:declare-step>
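
The pipeline above stops after loading each file; the third point of the request, some simple markup processing, is left open. As a minimal sketch, the inner loop could be extended with a standard step such as p:add-attribute. The attribute name source is a hypothetical choice used only for illustration:

```xml
<p:for-each name="fileloop">
  <p:iteration-source select="/c:directory/c:file"/>
  <p:variable name="file" select="/c:file/@name"/>
  <p:load>
    <p:with-option name="href" select="$file"/>
  </p:load>
  <!-- Illustrative processing: stamp the root element of each loaded
       document with the absolute URI it was loaded from. -->
  <p:add-attribute match="/*" attribute-name="source">
    <p:with-option name="attribute-value" select="$file"/>
  </p:add-attribute>
</p:for-each>
```

Any other step or subpipeline, for example p:xslt with a stylesheet, could take the place of p:add-attribute.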

If you see ways to optimize this approach, please comment.





Comments

Manfred Staudinger (unauthenticated)
Aug 21, 2009

Hi Paul,

First let me thank you for this blog entry! I'm running calumet on Win2k with JDK version 1.6.0_14 from the
command line. When I run your very first example ("directory listing"), it works fine (Exit code: 0).
Differences to your result (significant or not) are:
1. additional xml-declaration plus CRLF
2. on the doc-element, the xml:base attribute is missing
I can run also the 2nd example ("repeat-over"), the third ("Adding a loop within the loop") and
the fourth ("make-absolute-uris"). Still the xml-declaration gets added to each document of the
resulting sequence, but no xml-base appears.

But when I replace (4th example) the
<p:identity/>
with
<p:for-each name="fileloop">
<p:iteration-source select="/c:directory/c:file"/>
<p:variable name="file" select="/c:file/@name"/>
<p:load name="file">
<p:with-option name="href" select="$file"/>
</p:load>
<p:identity/>
</p:for-each>
then I get
Exception in thread "main" com.emc.documentum.xml.xproc.pipeline.model.DynamicError: {http://www.w3.org/ns/xproc-error}XC0026: It is a dynamic error if the document does not exist or is not well-formed.
Original message: {http://www.w3.org/ns/xproc-error}XC0026: It is a dynamic error if the document does not exist or is not well-formed.
Original message: XPROC_ERROR:
Original message: Content is not allowed in prolog.
Original message: Content is not allowed in prolog.
Original message: Content is not allowed in prolog.
Original message: Content is not allowed in prolog.
Original message: XPROC_ERROR:
Original message: Content is not allowed in prolog.
Original message: Content is not allowed in prolog.
Original message: Content is not allowed in prolog.
Original message: Content is not allowed in prolog.
at com.emc.documentum.xml.xproc.util.impl.StepUtil.dynamicError(StepUtil.java:527)
at com.emc.documentum.xml.xproc.pipeline.model.step.impl.AbstractStepImpl.run(AbstractStepImpl.java:324)
... many more ...

Regards, Manfred
manfred.staudinger@gmail.com