
Saxon diaries: Michael Kay's blog
Feed: https://dev.saxonica.com/blog/atom/mike.xml

What should we do about Arrays?
Michael Kay, 2021-06-27T15:34:00Z
https://dev.saxonica.com/blog/mike/2021/06/arrays.html

Arrays were added to the data model for XPath 3.1 (and XQuery 3.1): the main motivation was the need for faithful representation of JSON data structures, while a secondary consideration was the long-standing requirement for "sequences of sequences".

Processing support for arrays in the current languages is rather limited. There's a basic set of functions available, but not much else. Support in XSLT 3.0 is particularly weak, because XSLT 3.0 was primarily designed to work with XPath 3.0 (which didn't have arrays), with 3.1 support added as something of an afterthought.

This note surveys where the gaps are, and how they should be filled.

Many of the complications in processing arrays arise because the members of an array can be arbitrary sequences, not just single items. There were two reasons for this design. One is simply orthogonality: the principle of no unnecessary restrictions. The other was support for the JSON null value, which maps naturally to an empty sequence in XDM, but only if an array is allowed to have an empty sequence as one of its members.

Array Construction

XPath 3.1 offers two constructs for creating arrays: the "square" and "curly" constructors. Neither is completely general. The "square" constructor (for example [$X, $Y, $Z]) can construct an array with arbitrary values as its members, but the size of the array needs to be known statically. The "curly" constructor (for example array{$X, $Y, $Z}) can construct an array whose size is decided dynamically, but the members of the array must be singleton items (not arbitrary sequences). The WG failed to come up with a construct for creating an array where both the size of the array and the size of each member are determined dynamically. The only way to achieve this is with a fairly convoluted use of functions such as array:join().
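To make the gap concrete, here is a small Python model (not XPath) in which an XDM array is a list of member sequences, themselves lists. The helper names square_constructor, curly_constructor and array_join are illustrative stand-ins for the square constructor, the curly constructor, and array:join():

```python
# Python model of XDM arrays: an array is a list of "members", each
# member itself a sequence (a Python list). The helper names below
# are illustrative, not part of any spec.

def square_constructor(*members):
    # [$X, $Y, $Z]: member count fixed statically; each member may
    # be an arbitrary sequence
    return [list(m) for m in members]

def curly_constructor(items):
    # array{$X, $Y, $Z}: size decided dynamically, but each member
    # is a single item
    return [[item] for item in items]

def array_join(arrays):
    # array:join(): concatenate a sequence of arrays into one array
    result = []
    for a in arrays:
        result.extend(a)
    return result

# An array of two members, each an arbitrary sequence:
a = square_constructor([1, 2], [3, 4, 5])
# A dynamically sized array of singleton members:
b = curly_constructor(range(3))
# Size and member sizes both dynamic: only reachable via join
c = array_join(square_constructor(seq) for seq in ([1, 2], [], [3]))
```

The last line mirrors the convoluted XPath idiom: wrap each sequence as a one-member array, then join the lot.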

XSLT 3.0 has no mechanism for array construction. An xsl:array instruction has been proposed, and is prototyped as saxon:array in current Saxon releases; but the difficulty is in defining the detail of how it should work. It makes sense for it to enclose a sequence constructor, so instructions like xsl:for-each and xsl:choose can be used when building the content. But sequence constructors deliver sequences of items, not sequences of sequences. So the current proposal for XSLT 4.0 envisages an xsl:array-member instruction that wraps a sequence as a zero-arity function. The problem with this is that the mechanism is transparent yet arbitrary; it looks like (and is) a kludge.

Array Processing

Similarly, there are limited options for processing of arrays. There's no equivalent of the "for" clause in FLWOR expressions that binds a variable to each member of an array in turn. The closest things on offer are the array:filter() and array:for-each() higher order functions – which are more useful in XQuery than in XSLT, because of the difficulty in XSLT of writing an anonymous function that constructs new XML element nodes. XSLT in particular relies heavily (in constructs such as xsl:apply-templates, xsl:for-each, xsl:iterate, and xsl:for-each-group) on binding values implicitly to the context item. But the context item is an item, not an arbitrary value, so binding members of arrays to the context item isn't an option.

Generalizing "." to represent an arbitrary value rather than a single item seems an attractive idea, but it's very hard to do without breaking a lot of existing code.

Iterating over an array and binding each member to a variable works well in XQuery, where adding a "for member" clause to FLWOR expressions works cleanly enough. But there's lots of other functionality for processing sequences that can't be translated easily into equivalent mechanisms for arrays, especially in XSLT.


It seems that a solution for both array construction and array processing is to find a way to pack an arbitrary sequence into a single item. We'll refer to a "sequence packed into an item" as a parcel. We can then construct an array from a sequence of parcels, and we can decompose an array into a sequence of parcels, allowing both operations to be implemented using all the existing machinery for handling sequences.

It seems that four operations are sufficient to fill the processing gap:

  • Wrap a sequence as a parcel
  • Unwrap a parcel as a sequence
  • Construct an array from a sequence of parcels
  • Decompose an array into a sequence of parcels

So four functions should do the job: parcel(item()*) => P, unparcel(P) => item()*, array:of(P*) => array(*), and array:members(array(*)) => P*, where P is the item type of a parcel. Of course, we can also add XSLT or XQuery syntactic sugar on top of these building blocks.
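As a sketch of how the four operations compose, here is a Python model in which a parcel is a distinct wrapper type and an array is a list of member sequences. All names here (Parcel, parcel, unparcel, array_of, members) are illustrative stand-ins for the proposed functions:

```python
# A Python sketch of the four proposed parcel operations. Parcel is
# a distinct wrapper type; an array is modelled as a list of member
# sequences. Not spec text: an illustration only.

class Parcel:
    """A sequence packed into a single opaque item."""
    def __init__(self, items):
        self._items = tuple(items)

def parcel(items):        # parcel(item()*) => P
    return Parcel(items)

def unparcel(p):          # unparcel(P) => item()*
    return list(p._items)

def array_of(parcels):    # array:of(P*) => array(*)
    return [unparcel(p) for p in parcels]

def members(array):       # array:members(array(*)) => P*
    return [parcel(m) for m in array]

# An array whose size and member sizes are both dynamic, built with
# ordinary sequence machinery (here, a generator expression):
arr = array_of(parcel(range(n)) for n in (2, 0, 3))
# Round trip: decompose into parcels, then reassemble
assert array_of(members(arr)) == arr
```

Because a parcel is a single item, the sequence of parcels can flow through any existing sequence-processing construct before being turned back into an array.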

We now have to address the question: what kind of item is a parcel? Is it represented using something we already know and love (like an array, or a zero-arity function) or is it something new? How should the type of a parcel be represented in type signatures, and what operations (apart from the above four) should be available on them?

I'm beginning to come to the conclusion that the type safety that comes from treating a parcel as a new kind of item justifies the extra complexity in the type system. If we reuse an existing kind of item (for example, zero-arity functions), then there's always going to be confusion about whether items of that type are to be treated as parcels or as their "ordinary selves".

However, I'm reluctant to add yet another fundamental type. We can't keep adding fundamental types, and new syntax, every time we need something new (cf my Balisage 2020 paper on adding promises). Can't we make the type system more extensible?

Pro tem, I suggest we build on the concept of "extension objects" defined in §25.1.3 of the XSLT specification. These are intended as opaque objects that can be returned by one extension function and supplied to another. This concept should really be defined in XDM rather than in XSLT. We should add that an "extension object" may be an instance of an "extension type", and that extension types are denoted in the ItemType syntax by a QName (that is, the same syntax as atomic types), with the QName being made known to the processor in some implementation-defined way. Then we reserve a namespace URI sys for "built in extension types", and define sys:parcel as such a type.

Saxon-CS says Hello World
2021-03-22T10:34:00Z
https://dev.saxonica.com/blog/mike/2021/03/saxon_cs_hello_world.html

The Saxon product on .NET has been living on borrowed time for a while. It's built by converting the Java bytecode of the Java product to the equivalent .NET intermediate language, using the open-source IKVM converter produced by Jeroen Frijters. Jeroen after many years of devoted service decided to give up further development and maintenance of IKVM a few years ago, which didn't immediately matter because the product worked perfectly well. But then Microsoft in 2019 announced that future .NET developments would be based on .NET Core, and IKVM has never supported .NET Core, so we clearly had a problem.

There's a team attempting to produce a fork of IKVM that supports the new .NET, but we've never felt we could put all our eggs in that basket. In any case, we also have performance problems with IKVM that we've never managed to resolve: some applications run 5 times slower than Java, and despite a lot of investigation, we've never worked out why.

So we decided to try a new approach, namely Java-to-C# source code conversion. After a lot of work, we've now achieved successful compilation and execution of a subset of the code, and for the first time this morning, Saxon-CS successfully ran the minimal "Hello World" query.

We're a long way from having a product we can release, but we can now have confidence that this approach is going to be viable.

How does the conversion work? We looked at some available tools, notably the product from Tangible Solutions, and this gave us many insights into what could be readily converted, and where the remaining difficulties lay; it also convinced us that we'd be better off writing our own converter.

The basic workflow is:

  1. Using the open source JavaParser library, parse the Java code, generate an XML abstract syntax tree for each module, and annotate the syntax tree with type information where needed.
  2. Using XSLT code, do a cross-module analysis to determine which methods override each other, which have covariant return types, etc: information needed when generating the C# code.
  3. Perform an XSLT transformation on each module to generate C# code.

We can't convert everything automatically, so there's a range of strategies we use to deal with the remaining issues:

  • Some constructs can simply be avoided. We have trouble, for example, converting Java method references like Item::toString, because it needs a fair bit of context information to distinguish the various possible translations. But it's no great hardship to write the Java code a different way, for example as a lambda expression item -> item.toString(). Another example is naming conflicts: C# doesn't allow you, for example, to have a variable with the same name as a method in the containing class. It's no hardship to rename the variables so the problem doesn't arise.
  • We can use Java annotations to steer the conversion. For example, sometimes we want to generate C# code that's completely unrelated to the Java code. We can move this code into a method of its own, and then add an annotation @CSharpReplaceMethodBody which substitutes different code for the existing method body. The annotation is copied into the XML syntax tree by the JavaParser, and our converter can pick it up from there.
  • We already have a preprocessor mechanism to mark chunks of code as being excluded from particular variants of the product (such as Saxon-HE or Saxon-PE). We can make further use of this mechanism. However, it's limited by the fact that the code, prior to preprocessing, must be valid Java so that it works in the IDE.

The areas that have caused most trouble in conversion are:

  • Inner classes. C# has no anonymous inner classes, and its named inner classes correspond only to Java's static inner classes. Guided by the way the Tangible converter handles these, we've found a way of translating them that handles most cases, and we've added Java annotations that provide the converter with extra information where additional complexities arise.
  • Enumeration types. C#'s enumeration types are much more limited than the equivalent in Java, because enumeration constants can't have custom methods associated with them. We distinguish three kinds of enumeration classes: singleton enumerations (used to implement classes that will only have a single instance); simple enumerations with no custom behaviour, which can be translated to C# enumerations very directly; and more complex enumerations, which result in the generation of two separate C# classes, one to hold the enumeration constants, the other to accommodate the custom methods.
  • Generics. C# is much stricter about generic types than Java, because the type information is carried through to run-time, whereas in Java it is used only for compile-time type checking, which can be subverted by use of casting. So the rule in C# is, either use generics properly, or don't use them at all. We anticipated some of these issues a year or two ago when we first started thinking about this project: see Java Generics Revisited. The result is that the classes representing XDM sequences and sequence iterators no longer use generics, which has saved a lot of hassle in this conversion. But there are still many problems, notably (a) the type inference needed to support Java's diamond operator (as in new ArrayList<>(), where an explicit type parameter is needed in C#), and (b) the handling of covariant and contravariant wildcards (? extends T, ? super T.)
  • Iterators and enumerators. A for-each loop in Java (for (X x : collection)) relies on the collection operand implementing the java.lang.Iterable interface. To translate this into a C# for-each loop (foreach (X x in collection)) the collection needs to implement IEnumerable. So we convert all Iterables to IEnumerables, and that means we have to convert Iterators to Enumerators. Unfortunately Java's Iterator interface doesn't lend itself to static translation to a C# IEnumerator: in Java, the hasNext() method is stateless (so you can call it repeatedly), whereas C#'s MoveNext changes the current position (so you can't). We're fortunate that we only make modest use of Java iterators; in most of the code, we use Saxon's SequenceIterator interface in preference, and this converts without trouble. We examined all the cases where Saxon explicitly uses hasNext() and next(), and made sure these followed the discipline of calling hasNext() exactly once before each call on next(); with this discipline, converting the calls to MoveNext() and Current works without problems.
  • Lambda expressions and delegates. In Java, lambda expressions can be used where the expected type is a functional interface; a functional interface is otherwise just an ordinary interface, and you can have concrete classes that implement it. So for example the second argument of NodeInfo.iterateAxis(axis, nodeTest) is a NodeTest, for which we can supply either a lambda expression (such as it -> it instanceof XSLExpose), or one of a whole range of implementation classes such as a SchemaElementTest, which tests whether an element belongs to an XSD-defined substitution group. In C#, lambda expressions can only be used when the expected type is a delegate, and if the expected type is a delegate, then (in effect) a lambda expression is the only thing you can supply. The way we've handled this is generally to make the main method (like iterateAxis()) expect a non-delegate interface, and then to supply a proxy implementation of this interface that accepts a delegate. It's not a very satisfactory solution, but it works.
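The iterator mismatch above can be sketched in Python: Java's hasNext()/next() is check-then-advance, while C#'s MoveNext()/Current is advance-then-read. With the discipline of calling hasNext() exactly once before each next(), the mapping becomes mechanical. Class and method names below are illustrative, not Saxon's:

```python
# Python sketch of a C#-style enumerator over an ordinary iterator,
# illustrating the protocol conversion described above. Names are
# illustrative only.

class Enumerator:
    """C#-style MoveNext()/Current protocol over a Python iterable."""
    _DONE = object()

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.current = None

    def move_next(self):
        # Advances the position (unlike Java's hasNext(), which is
        # stateless and can be called repeatedly with no effect)
        nxt = next(self._it, self._DONE)
        if nxt is self._DONE:
            return False
        self.current = nxt
        return True

# Java:  while (it.hasNext()) { x = it.next();  ... }
# C#:    while (e.MoveNext())  { x = e.Current; ... }
e = Enumerator([10, 20, 30])
out = []
while e.move_next():
    out.append(e.current)
```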

One area where we could have had trouble, but avoided it, is in the use of the Java CharSequence class. I wrote about this issue last year at String, CharSequence, IKVM, and .NET. As described in that article, we decided to eliminate our dependence on the CharSequence interface. For a great many internal uses of strings in Saxon, we now use a new interface UnicodeString which, as the name implies, is much more Unicode-friendly than Java's String and CharSequence. It also reduces memory usage, especially in the TinyTree. But there is a small overhead in the places where we have to convert strings to or from UnicodeStrings, which we can't hide entirely: it represents about 5% on the bottom line. But it does make all this code much easier to port between Java and C#.

What about dependencies? So far we've just been tackling the Saxon-HE code base, and that has very few dependencies that have caused any difficulty. Most of the uses of standard Java library classes (maps, lists, input and output streams, and the like) are handled by the converter, simply translating calls into the nearest C# equivalent. In some cases such as java.util.Properties we've written an emulation of the Java interface (or the parts of it that we actually use). In other cases we've redirected calls to helper methods. For example we don't always have enough type information to know whether Java's List.remove() should be translated to List.Remove() or List.RemoveAt(); so instead we generate a call on a static helper method, which makes the decision at runtime based on the type of the supplied argument.
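The runtime-dispatch helper can be sketched in Python: when the converter cannot tell whether Java's List.remove(int) (remove by index) or List.remove(Object) (remove by value) was meant, it emits a call on a helper that decides from the argument's runtime type. The helper below is illustrative, not Saxon's actual code:

```python
# Illustrative stand-in for the static helper described above: pick
# remove-by-index or remove-by-value based on the argument's type.

def list_remove(lst, arg):
    if isinstance(arg, int) and not isinstance(arg, bool):
        # Java List.remove(int): remove and return the item at the index
        return lst.pop(arg)
    # Java List.remove(Object): remove the first occurrence, report success
    try:
        lst.remove(arg)
        return True
    except ValueError:
        return False

xs = ["a", "b", "c"]
list_remove(xs, 1)      # by index: xs is now ["a", "c"]
list_remove(xs, "c")    # by value: xs is now ["a"]
```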

The only external dependency we've picked up so far is for handling big decimal numbers. We're currently evaluating the BigDecimal library from Singulink, which appears to offer all the required functionality, though its philosophy is sufficiently different from the Java BigDecimal to make conversion non-trivial.

One thing I should stress is that we haven't written a general purpose Java to C# converter. Our converter is designed to handle the Saxon codebase, and nothing else. Some of the conversion rules are specific to particular Saxon classes, and as a general principle, we only convert the subset of the language and of the class library that we actually need. Some of the conversion rules assume that the code is written to the coding conventions that we use in Saxon, but which might not be followed in other projects.

So, Hello World to Saxon-CS. There's still a lot of work to do, but we've reached a significant milestone.

The Zeno Chain: a new data structure for XDM sequences
2021-03-18T15:34:00Z
https://dev.saxonica.com/blog/mike/2021/03/zeno_chains.html

This article presents the Zeno Chain, a new data structure used to underpin the implementation of XDM sequences in Saxon. It is also designed to be capable of supporting XDM arrays, and might also have potential for holding long character strings.

The current implementation of the Zeno Chain is a mutable list, but the design lends itself easily to creating an immutable variant. It also makes it easy to construct an immutable list from a mutable one, making it efficient to construct a sequence with in-situ modification, and then "freeze" it once construction is complete.

Saxon currently uses a variety of structures for holding sequences and arrays. This variety is a problem in itself. Choosing the right structure for a particular scenario involves somewhat hit-or-miss decision making; it would be better to have a single "all-rounder" structure that performs well in a variety of situations.

There are of course vast numbers of data structures for sequences available in the computer science literature. One promising one, for example, is the "finger tree" which supports a wide range of access patterns efficiently. But it also has drawbacks: any tree structure that requires a node for each item in a list is going to have a large memory overhead when storing a long sequence, and the use of a fine-grained structure like this tends to mean that there is little locality of reference for memory addressing, leading to poor CPU caching performance.

The Zeno chain stores a sequence as a list of lists of items: that is, it is a tree with a constant depth of 2. In the Java implementation, both levels of list are instances of java.util.ArrayList. The key to the performance of the structure is managing the number and size of the second-level lists, which I call segments.

In a list that is constructed by appending individual items on the end (a common scenario), the length of a segment increases the closer it is to the start. For a list of 20,000 items, there are ten segments whose sizes are (8192, 4096, 4096, 2048, 1024, 256, 128, 64, 64, 32). (Now you know why I called it a Zeno chain.) The exact numbers don't matter here: what is important is that the total number of segments increases only logarithmically with the length of the sequence, and that the segments on the right are short, which makes further append operations efficient.

In a list constructed by prepending individual items, the distribution of lengths will be the other way round: shortest segments near the front. In the rare case where both append and prepend operations occur, both ends will have short segments, while longer segments will cluster around the middle.

Here's a summary of the major operations performed on the sequence:

  • Append an item: if the list is empty, construct a single segment of length 1. Otherwise, if the last segment has length < 32, append to it. If the last segment is already full, coalesce the last segment with the previous segment if the previous segment has sufficient room; if not, work up the list to the start to find adjacent segments that can be merged. A segment is considered to have sufficient room for such expansion if its resulting size would not exceed 2^(N+5) where N is the distance of the segment from the right-hand end of the sequence; it's this formula that ensures that longer segments accumulate at the start of the sequence. If all segments in the sequence are full — that is, if the segment sizes are decreasing powers of two — then add a new segment. Append operations essentially take constant time; 97% of them only affect the final segment.
  • Prepend an item: simply append in reverse.
  • Get the Nth item: search the master list of segments examining the sizes of the segments until the right segment is found, then get the item by addressing into the Java ArrayList. This takes logarithmic time. The average access time will be slightly higher in a list built by prepending items, because the chance of finding the required item in the first couple of segments is much lower.
  • Subsequence: make a new Zeno chain containing whole or part copies of the segments from the original chain that are in the required range.
  • Iteration: Keep two index positions, the index position in the master list, and the index position in the current segment, and use these indexes to retrieve the next item by calling ArrayList.get() twice.
  • Sequence concatenation: This is quite a common operation in XSLT and XPath, as it's the basis of the "flattening" operations such as xsl:for-each, xsl:apply-templates, and FLWOR expressions. The most direct approach is simply to concatenate the two master lists, leaving the segments unchanged. This however can lead to fragmentation of the sequence, so we perform a reorganization to reduce the number of short segments. Specifically, working from the right hand end, if any segment is found to be shorter than both its immediate neighbours, we combine it with the left-hand neighbour and reduce the number of segments by one. This has the effect of reducing the incidence of short segments in the middle of the chain.
  • Insertion, removal, and replacement: these operations are comparatively rare. With the immutable version of the structure, an alteration affecting one of the larger segments will require copying of everything else in that segment. This isn't ideal: but it's better than copying the entire sequence, which is what often happens today. And the use of the Java ArrayList at least means that the copying is very fast.
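The append rules above can be sketched in a much simplified Python model: fill the last segment up to 32 items; when it is full, merge segments leftwards wherever the merged size stays within the 2^(N+5) allowance (N being the distance from the right-hand end), then start a fresh segment. This is an illustration of the idea, not Saxon's implementation:

```python
# Simplified Python sketch of the Zeno chain's append logic.
# Illustrative only; details differ from Saxon's code.

class ZenoChain:
    SEGMENT_LIMIT = 32

    def __init__(self):
        self.segments = []                 # list of lists of items

    def append(self, item):
        if not self.segments or len(self.segments[-1]) >= self.SEGMENT_LIMIT:
            self._consolidate()
            self.segments.append([])
        self.segments[-1].append(item)

    def _consolidate(self):
        # Work from the right-hand end towards the start, merging a
        # segment into its left-hand neighbour when there is room
        i = len(self.segments) - 1
        while i > 0:
            n = len(self.segments) - i     # distance from the right
            if len(self.segments[i - 1]) + len(self.segments[i]) <= 2 ** (n + 5):
                self.segments[i - 1].extend(self.segments.pop(i))
            i -= 1

    def get(self, index):
        # Scan the (logarithmically short) list of segment sizes,
        # then index directly into the right segment
        for seg in self.segments:
            if index < len(seg):
                return seg[index]
            index -= len(seg)
        raise IndexError(index)

    def __len__(self):
        return sum(len(seg) for seg in self.segments)

z = ZenoChain()
for i in range(20_000):
    z.append(i)
sizes = [len(seg) for seg in z.segments]
# A handful of segments, longest near the start, shortest at the end
```

The allowance formula is what pushes the long segments towards the start, so that append cost stays near-constant while the segment count grows only logarithmically.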

It's important to note that most operations on sequences don't actually result in a new sequence being constructed. Calling tail(), for example, doesn't copy any data: it delivers an iterator over a portion of the original sequence. The sequence only gets materialized if, for example, the result is stored in a variable (and even then, not always).

Saxon's default implementation for a sequence is simply a Java List. Appending an item to a list generally copies the whole list. Where Saxon can detect that this is going to be inefficient, it instead uses a structure called a Chain: this is effectively a tree of segments. But there's little serious attempt to manage the depth of the tree or the size of the segments, and the results in some cases can be rather poor. The Zeno chain offers a significant improvement; it also looks as if it can be used for arrays as well as sequences.

For managing long strings, I invented a similar structure, which I then discovered already existed in the literature and is known as a Rope: a Rope represents a string as a tree of substrings. The literature on Ropes describes how to keep the tree balanced, but it has nothing to say about how to decide how many substrings to hold, and how long to make them. The Zeno chain might turn out to provide an answer to that question.

Arrow Expressions
2020-11-19T10:20:00Z
https://dev.saxonica.com/blog/mike/2020/11/19-arrow-expressions.html


When I proposed the arrow operator to the XQuery/XSLT working groups, I thought of it as minor syntactic sugar. It's just a convenience: instead of substring-before(substring-after(X, '['), ']') you can write X => substring-after('[') => substring-before(']') which helps you to avoid going cross-eyed. If you're the kind of person who can play the piano with your hands crossed over, you probably don't need it, but for the rest of us, it makes life just a tiny bit easier.

So I was a bit surprised at XML Prague 2020 that Juri Leino managed to construct an entire presentation around the arrow operator (Shooting Arrows Fast and Accurately). Not only that, he also developed a whole library of functions, called XBow, to increase their power.

Now, XBow actually reveals a bit of a weakness in the construct: you can construct a pipeline of functions, but you can't include arbitrary expressions in the pipeline unless each of the expressions is made available via a function. Moreover, the value output by one step in the pipeline can only be used as the first argument in the next function: you can do X => concat('$') to add a "$" at the end of a string, but there's no simple way of adding a "$" at the front, except by defining a new prepend function that does this for you (or hoping that XBow will have anticipated your requirement).

Now, of course you can do X ! concat('$', .). But that doesn't always fit the bill. Firstly, it only works when you're processing single items (or mapping a sequence to multiple items). Secondly, (to use the current jargon) the optics are wrong: it breaks the pipeline visually.

So my first suggestion is that we allow inline expressions to appear in a pipeline. Something like this: X => {~ + 1}, or X => {concat('$', ~)}. I'm using '~' here as a variable to refer to the implicit argument, that is, the value passed down the pipeline. I would have used '_', as Scala does, but unfortunately '_' is a legal element name so it already has a meaning. And '~' seems to work quite nicely.

The next thing that's been requested is implicit mapping, so you can use something like arrow notation to do X ! substring-after(., '$') ! number(.) => sum(). (Actually, the main obstacle in getting the arrow operator accepted by the XQuery Working Group was that some people wanted it to have this meaning.)

For that I propose we use a "thin arrow": X -> substring-after('$') -> number() => sum(). The effect of the thin arrow is that instead of passing the value of the LHS to the function on the RHS en bloc, we pass it one item at a time. Of course, if the value on the LHS is a single item, then it doesn't matter which kind of arrow we use, both have the same effect.

If you're a fan of map-reduce terminology, then you'll recognize this instantly as a map-reduce pipeline. The -> operations are doing a mapping, and the final => does a reduce. If you're more into functional thinking, you probably think of it more in terms of function composition.

Of course thin arrows can also be used with arbitrary expressions, just like thick arrows: (0 to 3) -> {~ + 1} -> format-integer('a') => string-join('.') returns "a.b.c.d".
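The two operators can be modelled in Python (not XPath): the thick arrow applies the right-hand function to the whole sequence, the thin arrow maps it over the items. The names thick, thin and format_integer_a are illustrative; format_integer_a is a crude stand-in for format-integer(., 'a'):

```python
# Python model of the thick (=>) and thin (->) arrow operators.
# Illustrative names only; not XPath semantics in full.

def thick(seq, fn):                 # X => f(): pass the sequence en bloc
    return fn(seq)

def thin(seq, fn):                  # X -> f(): apply f one item at a time
    return [fn(item) for item in seq]

def format_integer_a(n):
    # Stand-in for format-integer(., 'a'): 1 -> 'a', 2 -> 'b', ...
    return chr(ord('a') + n - 1)

# (0 to 3) -> {~ + 1} -> format-integer('a') => string-join('.')
result = thick(
    thin(thin(range(0, 4), lambda x: x + 1), format_integer_a),
    '.'.join)
```

The map-reduce reading drops out directly: the two thin calls are maps, and the final thick call is the reduce.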

And now I'd like to pull one more rabbit out of the hat. What if I want a function that applies the above pipeline to any input sequence. I could write function($x){$x -> {~ + 1} -> format-integer('a') => string-join('.')} but that seems clunky. I'm looking for a nice way to supply functions as arguments to higher-order functions like sort, where other languages have shown that a concise notation for anonymous functions (like a -> a+1 in Javascript) can make code a lot simpler, less verbose, more readable.

So my proposal is this: just remove the left-hand expression, so you have something starting with -> or =>, and use this as an anonymous arity-1 function.

So you can now do: //employee => sort((), ->{~/@salary}) to sort employees by salary, or //employee => sort((), ->{~/@salary}->substring-after('$')->number()) if you need to do a bit more processing.

As another little refinement, in the case of ->, the implicit argument is always a single item, so we can bind it to the context item. So ->{~/@salary} can be simplified to ->{@salary}. Basically, within curly braces on the RHS of ->, . and ~ mean the same thing.

I believe that all these constructs can be added to the grammar without introducing ambiguity or backwards incompatibility, but I haven't proved it conclusively yet.


The ~ construct seems to be the missing ingredient to enabling pipelines in XSLT. Consider:

  <xsl:pipeline>
    <xsl:apply-templates select="/" mode="m1"/>
    <xsl:apply-templates select="~" mode="m2"/>
    <xsl:for-each select="~">
      <e><xsl:copy-of select="."/></e>
    </xsl:for-each>
  </xsl:pipeline>

Here "~" is acting as an implicit variable to pass the result of one instruction to be the input for the next: basically eliminating the clunky xsl:variable declarations needed to do this today. The instructions that form the children of the xsl:pipeline element are effectively connected to each other with an implicit => operator.

Draft Proposals for XSLT/XPath/XQuery 4.0
2020-11-14T19:19:00Z
https://dev.saxonica.com/blog/mike/2020/11/14-qt40-proposal-comments.html


I've been working on translating the ideas in my XML Prague 2020 paper, entitled a Proposal for XSLT 4.0 into concrete specifications, and my first attempt at this can be found here:

I'm hoping to gather together a community group of some kind to take this forward; meanwhile I've published a very preliminary set of drafts:

I put these ideas up yesterday on the XML community Slack channel and got some great feedback. Unfortunately Slack isn't really a good vehicle for managing the response to this feedback. I'm going to organise some GitHub space for a more structured discussion, but meanwhile, here are my reactions to the initial comments:

Phil Fearon:

There’s a lot to digest here so hope to provide feedback after I’ve read this more thoroughly. One suggestion (inspired by ReactJS/JSX) is to provide some syntactical sugar for xsl:call-template. So a call to a named template appears more like a literal result element (perhaps with a special namespace), with attributes that correspond to template params. This could also allow the child items of the special-LRE to be passed to the named template, accessed via a special $children param.

Yes, I've been wanting to do something like this for years, and it's really not difficult, so I've added it. If EX is listed in extension-element-prefixes, and if there's a named template name="ex:action", then <ex:action a="expr" b="expr"/> is interpreted as an xsl:call-template with xsl:with-param children for parameters a and b.

Liam Quin

first quick note, best NOT to have them say W3C Recommendation on them as this may cause confusion.

Yes, sorry about that, still working my way around the stylesheets that generate the boilerplate text...

2d, can xsl:text have a select attribute? i don't think value-of can be deprecated :disappointed: but xsl:text select= would be consistent & may help.

It's one of these things that one would like to simplify, but we can only add things not remove them, so that's not easy.

" the tunnel parameters that are implicitly passed in a template call may have names that duplicate the names of non-tunnel parameters that are explicitly passed on the same call." is a major source of difficult debugging if you forget tunnel=yes. Maybe the answer is just a warning from impl'ns.

Yes. How to solve this without breaking compatibility? Perhaps a dynamic error if you declare a non-tunnel parameter, and at run-time there's a tunnel parameter with that name, but no non-tunnel parameter, or vice versa? Or, as you say, just rely on warnings. I agree it's a very common mistake that's hard to debug.

The "at $pos" of XQuery is super useful. position() is tricksy. Maybe for-each at="name" ?

In 3.0 we experimented with replacing some of the context functions with explicit variable bindings and it got a bit messy, but I think it's a shame we didn't persevere. The toughest one is last(), it would be awfully nice if we knew statically whether last() was going to be needed or not, but again, hard to fix without breaking code.

prefix binding didn't make the cut for XPath?

I did a design for this and didn't like it enough to put it in. I'll try again.

item-at() seems not much easier than $xxx ! let $p := position() return $yyy[$p]

I'm toying now with an alternative to item-at() that's much more powerful: slice(sequence, positions) so you can do slice($s, 5) or slice($s, 5 to 10) or slice($s, -1) or slice($s, 1 by 3 to count($s)) or slice($s, -2 by -1 to -count($s)). Here A by B to C is an extension of the current range expression where A to B means A by 1 to B.
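To make the proposed semantics concrete, here is a small Java sketch (illustrative only: slice() is a proposal, not part of any published spec, and the choice to silently skip out-of-range positions is my assumption). Positions are 1-based, and a negative position counts back from the end, so -1 selects the last item.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not Saxon code) of the proposed slice(sequence, positions) semantics.
public class SliceSketch {
    static <T> List<T> slice(List<T> seq, int... positions) {
        List<T> result = new ArrayList<>();
        int count = seq.size();
        for (int pos : positions) {
            int index = pos > 0 ? pos : count + 1 + pos;   // negative counts from the end
            if (index >= 1 && index <= count) {            // assumption: ignore out-of-range
                result.add(seq.get(index - 1));            // convert 1-based to 0-based
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> s = List.of("a", "b", "c", "d", "e");
        System.out.println(slice(s, 5));          // [e]
        System.out.println(slice(s, -1));         // [e]
        System.out.println(slice(s, 1, 3, 5));    // like slice($s, 1 by 2 to 5)
        System.out.println(slice(s, -2, -1));     // [d, e]
    }
}
```

The interesting design point is that the second argument is just a sequence of integers, so the A by B to C range extension composes with it rather than being special-cased.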

replace-with() seems like Perl's e flag (JS has one too), but alas there's no polymorphism, so you can't write replace(., $expr, myfunc#1, 'e').

However, what about adding a map or an array of matching subgroups? ".{$2 || $1 * 2 || $2}"

Yes, I think it's a really useful capability, but I think it's cleaner to make it a separate function. Have to think about how subgroups might work.

Reece H. Dunn

I like the enum(...) syntax in addition to the union(...) syntax.

I like the extension of element and attribute type tests to be full name tests. The ability to define types for path expressions like (ol|ul) is missing, though.

Yes, I'm in two minds whether union(X, Y, Z) should be restricted to a union of atomic types, or whether it should allow a union of any types including node types. Orthogonality suggests the latter, but I was too timid to propose that.

For named item types, is it possible to make them available as part of the in-scope schema types (renamed to in-scope types that would include the schema and named types?), so you could say person-name instead of item-type(person-name). -- Having to qualify the named item types everywhere could get too verbose, especially if the name is short. _NOTE:_ This is done for MarkLogic types where you can refer to map:map, cts:query, etc.

Interesting idea. There's obviously a need to resolve conflicts but that's not a stopper. I think I was more concerned with the idea that if it's a QName then it must be atomic, and the messy fact that the sets of schema types and item types overlap, and the overlap contains all atomic types and some but not all union types.

Liam Quin

hmm, xsl:sequence could do with an "as" attribute.

Not convinced. You start wanting to put it anywhere e.g. on xsl:if or xsl:apply-templates.

Reece H. Dunn

In https://www.saxonica.com/qt4specs/XP/xpath-40-diff.html#id-itemtype-subtype, rule 2(d) is missing the reference to the EnumerationType symbol ("A is an ," instead of "A is an EnumerationType,").

Stylesheet trouble. The XSLT and XPath spec stylesheets have diverged, the XSLT spec allows <termref def="some-term"/> and picks up the term from the definition, but the XPath spec requires <termref def="some-term">term</termref>. I need to bring them back into line. Applies to your subsequent comments also.

Martin Honnen

I like the separator attribute on xsl:apply-templates and xsl:for-each. I wonder whether it would make sense to add it to xsl:for-each-group as well.

Yes. Also xsl:for-each-member. I'm not sure whether it should be an AVT or a general expression: with a general expression you could insert br or hr separators, especially if we have element constructor functions in XPath (separator="build:element('hr')")

Liam Quin

I've been wondering about the possibility of an xsl:uri-resolver for some time.

Not sure what it would do?

Also about xsl:mode elements being able to contain xsl:template elements.

Yes, I've wanted that for a long time. To be honest, it's not in 3.0 because I couldn't convince Sharon.

Reece H. Dunn

The changes for XPath/XQuery look good. I see you changed the syntax for the context item and lambda syntaxes to a unified syntax. I like that the concise and full syntaxes are now consistent.

Given that . is allowed in a ParamList, does that mean I can now define a function that works on the path context item? For example: declare function local:f(.) { xs:integer(.) + 2 }; //values/local:f()

Actually, allowing "." here was an oversight caused by my changing the way the grammar rules worked. But there might be some benefit in keeping it.

... does that mean that the context-dependent functions in F&O should be defined using that syntax? For example: fn:data(.) as xs:anyAtomicType*

I hadn't thought of baking the "implicit . as parameter" convention into the language, but it might make sense if it can be done.

Liam Quin

In XQuery I don't understand "The for evaluation of the function body is absent, unless the signature uses the "." notation, in which case it is evaluated with a singleton focus bound to the supplied argument value."

Markup trouble again.

Martin Honnen

If xsl:for-each has a separator attribute, wouldn't xsl:for-each-member benefit from it as well?

Yes, see above.

Reece H. Dunn

For the schema import in XQuery, would it make sense to have: [22] SchemaPrefix ::= ("namespace" NCName "=") | ("default" ("element" | "type") "namespace") now that the element and type namespaces are separate, similar to how DefaultNamespaceDecl has changed.

I thought about this and decided not. If you want finer control, use multiple declarations.

For parameter lists and context items, would it be more useful to have the context item as an optional first parameter? That would mirror the proposed variadic argument syntax (defined for arguments at the end of the parameter list), and would allow arguments to be passed to the function, such as: declare function local:add(., $n) { xs:integer(.) + $n }; //values/local:add(2)

I quite like that in principle. Needs more thought.

Reece H. Dunn

Is there any description of the arity of context-item-based functions? -- There should be a note or something similar to say that the context item for a function definition or inline function expression does not count towards its arity, so function () and function (.) both have an arity of 0.

I was thinking of them simply as arity-1 functions, suitable for callbacks in things like fn:filter and fn:sort. You're opening up new possibilities which I need to ponder.


Great to see the spec coming to light!

I proposed adding two new signatures to for-each:

for-each(item()*, function (item(), xs:positiveInteger) as item()*) as item()*

for-each(item()*, function (item(), xs:positiveInteger, item()*) as item()*) as item()*

I've proposed that the function coercion rules should allow you to supply an arity-1 function where an arity-2 function appears in the signature; so we can extend fn:for-each to take a function(item, integer) as the predicate callback, and you can still supply function(item) if you don't care about the position.

which would bring for-each on a par with FLWOR expressions (for window and for … in … at).

It would be great to see windowing done with higher-order functions, but it's a significant piece of design and not my top priority - even though it would bring XSLT up to the level of XQuery for this kind of functionality.
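The proposed arity coercion can be modelled outside XPath. In the following Java sketch (illustrative names, not Saxon code), overloading stands in for the coercion rule: a positional for-each accepts an (item, position) callback, and a plain one-argument callback is adapted to fit by ignoring the position.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Sketch of the proposed coercion: an arity-1 function supplied where an
// arity-2 function is expected is adapted by dropping the extra argument.
public class ForEachSketch {
    static <T, R> List<R> forEach(List<T> seq, BiFunction<T, Integer, R> f) {
        List<R> out = new java.util.ArrayList<>();
        for (int i = 0; i < seq.size(); i++) {
            out.add(f.apply(seq.get(i), i + 1));   // XPath positions are 1-based
        }
        return out;
    }

    static <T, R> List<R> forEach(List<T> seq, Function<T, R> f) {
        return forEach(seq, (item, pos) -> f.apply(item));  // coerce arity 1 to arity 2
    }

    public static void main(String[] args) {
        List<String> s = List.of("a", "b", "c");
        System.out.println(forEach(s, (item, pos) -> pos + ":" + item)); // [1:a, 2:b, 3:c]
        System.out.println(forEach(s, item -> item.toUpperCase()));      // [A, B, C]
    }
}
```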

Martin Honnen

For fn:transform, Saxon has already added the option source-location as that is needed to use fn:transform with streaming; I think it makes sense to integrate that option into the fn:transform specification of the FO 4 draft.

Good point.

XSLT Update – Some Ideashttps://dev.saxonica.com/blog/mike/2020/10/29-xslt_update_some_ideas.html2020-10-29T16:19:00Z

XSLT Update – Some Ideas

I can't help feeling that many simple transformations on XML documents could be expressed more simply (an idea I explored with the interactive Gizmo tool available in Saxon 10).

I also think that if we had simpler syntax for simple operations, it might become easier to optimise. In particular, I'd like to be able to do simple operations like adding an attribute to the outermost element of a tree without doing a physical copy of the entire tree. (I wrote about that in an XML Prague paper, but I had to abandon the idea because the code became too complicated. As often happens, the bugbear was namespaces. For example, if you add a namespaced attribute to the outermost element, then the new namespace declaration has to propagate all the way down the tree.)

I'd also like to see transformations on JSON structures (maps and arrays) become much easier.

I've prototyped these ideas in the saxon:deep-update and saxon:update extensions, but I don't think these are the last word on the subject. (Please try them and give feedback.)

A simpler update syntax might also be very useful for updating the HTML page in Saxon-JS.

I think we can pick up ideas from XQuery Update, but without the complications of pending update lists and in-situ modification.

Let's start with:

   <xsl:update>
      <xsl:delete match="note"/>
   </xsl:update>

The idea is that xsl:update is an instruction that returns a deep copy of the context item (or other selected item if there's a select attribute), applying changes defined by the contained rules. In this case there is one rule, to delete elements that match the pattern note.

So it's rather like the copy-modify instruction in XQuery; it makes a copy of a supplied tree, with defined changes.

Other rules that might appear within <xsl:update> (for updating XML) include:

<xsl:rename match="note" name="comment"/>
<xsl:rename match="a:*" name="{local-name()}"/>
<xsl:replace-value match="@status" value="accepted"/>
<xsl:add-attribute match="proposal(not(@status))" name="status" value="accepted"/>
<xsl:replace-content match="cite[@ref]" select="//bib[@id=current()/@ref]"/>
<xsl:insert match="section(not(head))" position="first">...</xsl:insert>

Hopefully the intent is reasonably intuitive. The idea is to base the primitives on those available in XQuery Update. However, I'm not proposing to allow flow-of-control structures such as conditionals and function calls: each invocation of xsl:update will simply process the selected tree recursively, applying matching rules to nodes as they are found, based on pattern matching.

Defining the semantics

We can define the semantics of <xsl:update> as being equivalent to <xsl:apply-templates> using a mode that contains a number of implicit template rules, with a default action of shallow-copy (but extended to handle maps and arrays, see below).

For example, the implicit template rule for the <xsl:rename> rule might be (roughly):

<xsl:template match="note">
  <xsl:element name="comment">
    <xsl:apply-templates select="@*, node()"/>

Now, what if there's a rule to rename an element and another rule to add an attribute to the same element?

The way XQuery Update handles that is to process the rules in a number of phases: for example rename operations are handled in phase 1, delete operations in phase 5.

It's a bit hard to replicate that behaviour using template rules (in fact, this is something users often ask for). We could run a multiphase transformation using multiple modes, but it's not quite the same thing, because the match patterns would apply to the output of the previous phase, not to the original node in the input. And xsl:next-match doesn't do the job either, because we want the effect of the rules to be cumulative.

We could try another approach, which is to have the template rules return functions, so the <xsl:rename> rule becomes:

<xsl:template match="note" priority="1">
  <xsl:sequence select="function($x) {upd:rename($x, 'comment')}"/>

so the effect of apply-templates is to return a sequence of functions (in the order determined by the priority attributes) which are then applied to the node in turn.

This still doesn't exactly mirror what XQuery Update does, because after processing a node, it's then going to apply the rules to the new content of the node, not to the old content. But perhaps that actually makes more sense?
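A minimal model of that evaluation strategy, with strings standing in for nodes and hand-written functions standing in for the upd:* primitives (all names here are hypothetical): each matching rule yields an update function, and the functions are applied cumulatively, in priority order, each seeing the result of the previous one.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model of "templates return functions": the rules' effects compose.
public class CumulativeRules {
    static <T> T applyRules(T node, List<UnaryOperator<T>> rules) {
        for (UnaryOperator<T> rule : rules) {
            node = rule.apply(node);   // each rule sees the previous rule's output
        }
        return node;
    }

    public static void main(String[] args) {
        UnaryOperator<String> rename = n -> n.replace("note", "comment");
        UnaryOperator<String> addStatus = n -> n.replace("/>", " status=\"accepted\"/>");
        System.out.println(applyRules("<note/>", List.of(rename, addStatus)));
        // <comment status="accepted"/>
    }
}
```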


Part of the aim is not just to have simpler syntax for the user, but also to make the implementation more efficient than the standard transformation approach which always involves physical copying of a tree, no matter how small the changes.

What I want to achieve is to have a data structure, rather like the HashTrie that we use for representing XDM maps, in which changing one entry doesn't involve copying the whole tree, but at the same time leaves the original value intact. The first essential for such a structure is that it doesn't contain parent pointers: instead upwards navigation is achieved by remembering, when we get to a node, how we got there: this means the same node can be reached by multiple routes, allowing subtrees to be shared between different trees.

Suppose we are changing the value of a single attribute. It ought to be possible to achieve this by the following steps:

  • Find the element we are modifying, remembering the ancestor path of that element.
  • Create a "virtual copy" of this element (we already have this capability in Saxon)
  • Modify the virtual copy to add the attribute. Only one element is affected; the descendant tree of the virtual copy is shared with the original tree.
  • Work back through the ancestors; for each one, create a copy in which the affected child is replaced with the modified child, and all other children are virtual copies of the original.
  • Return the copied root node.
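The steps above can be sketched as a path-copying update over an immutable tree. This is a toy model, not the TinyTree: nodes are immutable and hold no parent pointers, so replacing one node rebuilds only the ancestor path, and every untouched subtree is shared with the original.

```java
import java.util.ArrayList;
import java.util.List;

// Path-copying update: only the ancestors of the changed node are copied.
public class PathCopy {
    static final class Node {
        final String name;
        final String value;
        final List<Node> children;
        Node(String name, String value, List<Node> children) {
            this.name = name; this.value = value; this.children = children;
        }
    }

    // Return a new root in which the node reached by 'path' (child indexes)
    // has a new value; siblings along the way are shared, not copied.
    static Node withValue(Node node, int[] path, int depth, String newValue) {
        if (depth == path.length) {
            return new Node(node.name, newValue, node.children);
        }
        List<Node> children = new ArrayList<>(node.children);  // shallow copy only
        int i = path[depth];
        children.set(i, withValue(children.get(i), path, depth + 1, newValue));
        return new Node(node.name, node.value, children);
    }

    public static void main(String[] args) {
        Node status = new Node("status", "draft", List.of());
        Node body = new Node("body", "...", List.of());        // large untouched subtree
        Node root = new Node("doc", null, List.of(status, body));

        Node updated = withValue(root, new int[]{0}, 0, "accepted");
        System.out.println(updated.children.get(0).value);     // accepted
        System.out.println(root.children.get(0).value);        // draft (original intact)
        System.out.println(updated.children.get(1) == body);   // true (subtree shared)
    }
}
```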

I'm hoping that it will be a lot easier to achieve this with the new syntax than it is with the current processing model, where we have to deal with all kinds of messiness like namespace inheritance. For example, we can define the new syntax so that it's equivalent to inherit-namespaces="no".

What about JSON?

I would like this mechanism to work just as well with JSON trees (that is, structures of maps and arrays) as with XML trees.

We're starting with some advantages: these structures don't have so much baggage. There's no node identity to worry about, no parent navigation, no namespaces. Also, the implementation data structures that we use for maps and arrays already allow efficient constant-time update.

I've experimented with mechanisms for deep update of a JSON structure with extension functions such as saxon:pedigree() and saxon:with-pedigree() (documented at https://www.saxonica.com/documentation/index.html#!functions/saxon/with-pedigree). That's not exactly usable. But it might be the right primitive to implement something more usable.

I've also proposed better pattern syntax for maps and arrays. For example, match="tuple(first, last, *)" matches any map that has entries with keys "first" and "last".

One problem with using the XSLT recursive-descent approach for maps and arrays is that map entries (and indeed array members) aren't actually items. You can match a map as a whole, but it's hard to match one of its entries on its own. Again, I've experimented with various approaches to this. I think the introduction of tuples may help with this: we can define the recursive-descent operation on maps to process (match) each entry in the map in turn, where the entry is presented and matched as a tuple containing key and value. And then we allow syntax such as match="tuple(key: keyPattern, value: valuePattern)" to match these entries.

But perhaps we don't need to expose this. Perhaps we can define a good enough set of primitive actions that match at the level of the map itself, for example:

<xsl:remove-entry match="tuple(first, last, *)" key="'salary'"/>
<xsl:replace-entry match="tuple(product-code, *)" key="'price'" value="?price * 1.05"/>
<xsl:add-entry match="tuple(x, y, *)" key="'area'" value="?x * ?y"/>

I think this could fly: but there's a lot of detail to be worked out. Shame we don't have a WG any more to bounce ideas off (and get the bugs out).

String, CharSequence, IKVM, and .NEThttps://dev.saxonica.com/blog/mike/2020/07/string-charsequence-ikvm-and-net.html2020-07-20T09:06:54Z

A couple of years ago Jeroen Frijters announced that he would no longer be working on new IKVM developments (IKVM is the technology we use to make Saxon, which is written in Java, run on .NET). At one level that's not a problem: the tool works brilliantly and we can continue to use it. However, it doesn't support .NET Core, and Microsoft have announced that .NET 5 will be based on .NET Core, so that creates the risk that Saxon on .NET will hit a brick wall.

Various smart people are working on trying to pick up IKVM where Jeroen left off, but I don't particularly want to bet the business on them being successful. Jeroen produced brilliant software but he left very little in the way of documentation or test material, so it's a hard act to follow.

Meanwhile Microsoft seem to be back-pedalling on their original promise that .NET 5 would support Java interoperability. They've never given any indication of how it would do so, despite much speculation.

So we've been looking at alternative ways of taking Saxon on .NET forward into the future, and one of those is source code conversion. I've been looking at tools such as Tangible, which does a good job to a degree: they don't tackle the difficult parts of the problem where Java and C# are most different, but they give a very good insight into understanding what the difficult parts of the problem are going to be.

And one of those difficult parts, which I'm focussing on at the moment, is the CharSequence problem. CharSequence is a Java interface that we use very extensively, and there's no equivalent on .NET. Unlike other dependencies on Java classes and interfaces, this one is impossible to emulate directly, because java.lang.String implements CharSequence, and there's no way we can make System.String on .NET do the same.

The reason we use CharSequence, as with any interface, is so that we can have multiple implementations with different performance characteristics. To take a simple example, one of our implementations is CompressedWhitespace. A great deal of the text in an XML document is made up of whitespace, which sadly cannot be killed at birth: using a customised representation for strings that contain only whitespace gives a significant space saving. (And space savings also turn into speed improvements, given that execution time these days is dominated by how long it takes to get data in and out of the CPU's internal cache).

Given that CharSequence has no equivalent on .NET, it occurred to me to ask how IKVM deals with it. Although Jeroen never wrote much documentation, he did write a lot of blog posts about interesting design problems, and sure enough it seems that he gave this a lot of attention back in 2003 (how time flies when you're having fun). I thought that he might use an implementation of CharSequence that wraps a System.String, but it seems he rejected that approach in favour of a mechanism of what he calls "ghost interfaces". There's a lot of detail, but the bottom line seems to be that the code:

CharSequence seq = "foo";

is compiled to .NET as:

System.Object seq = "foo";
if (seq instanceof System.String) {
    // OK: a string passes the ghost-interface check
} else if (seq instanceof CharSequence) {
    // OK: a genuine implementation of the interface
} else {
    throw new IncompatibleClassChangeError();
}

That looks pretty horrifying, and I've belatedly realised that it could account for a lot of our observations on .NET performance over the years.

When we first built Saxon on .NET, the performance overhead compared with Java was around 30%, which was quite acceptable. In recent years we've seen it getting worse, with some workloads showing a 300% slow-down, and despite considerable effort we've been at a loss to explain why. Synthetic benchmarks on IKVM continued to show a 30% overhead, but for Saxon the figure was far worse. We looked hard without success to find a hot-spot, something we were doing that IKVM handled particularly badly, but the slow-down seemed to be right across the board. I'm now prepared to conjecture that it's all down to our use of CharSequence - because CharSequence.charAt() is something we do very extensively, throughout the product.

When data arrives in Saxon from a SAX parser, the content of text nodes arrives in char[] arrays, while the content of attributes arrives in String objects. And we keep it that way: in the TinyTree, text nodes are effectively slices of a char[] array, and attributes are Strings. All the operations that we perform on text, including performance-critical operations such as equality matching, sorting, and string-to-number conversion, therefore need to work on either representation, and that's essentially why we use CharSequence. In general, we don't want to spend time converting data between different representations so we can perform different operations on it. 

(In recent releases, though, we've started using a different representation for operations where we need to count Unicode codepoints rather than UTF-16 chars. For regular expressions, and some other operations such as translate(), we first convert the string to a UnicodeString, which is our own interface that supports direct codepoint addressing, with internal implementations using 8, 16, or 32 bits per character depending on the widest character present in the string).

So if CharSequence is a problem, what should we do instead? Is there any other way we can implement operations such as collation comparison and string-to-number conversion efficiently without first converting the data to a common internal format?

I think part of the solution might be for these operations to be written to use codepoint iterators. Iterating over a string using an IntIterator that delivers codepoints is probably just as efficient as using a for-loop with charAt(), and it's possible to create an IntIterator over any string representation efficiently (meaning, without copying the actual characters).
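Such an iterator is easy to write over any CharSequence without copying characters (this is an illustrative sketch, not Saxon's actual UnicodeString machinery): it walks the chars with charAt-style access, combining surrogate pairs into single codepoints as it goes.

```java
import java.util.PrimitiveIterator;

// A codepoint iterator over any CharSequence implementation: no characters
// are copied, and surrogate pairs are delivered as single codepoints.
public class CodepointIteration {
    static PrimitiveIterator.OfInt codepoints(CharSequence cs) {
        return new PrimitiveIterator.OfInt() {
            int pos = 0;
            public boolean hasNext() { return pos < cs.length(); }
            public int nextInt() {
                int cp = Character.codePointAt(cs, pos);
                pos += Character.charCount(cp);   // 2 for a surrogate pair, else 1
                return cp;
            }
        };
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";   // 4 UTF-16 chars, 3 codepoints
        int count = 0;
        for (PrimitiveIterator.OfInt it = codepoints(s); it.hasNext(); ) {
            it.nextInt();
            count++;
        }
        System.out.println(s.length() + " chars, " + count + " codepoints");
        // 4 chars, 3 codepoints
    }
}
```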

This suggests the following broad approach:

(a) For attributes, continue to use Strings

(b) For text nodes on the Receiver pipeline and in the TinyTree, use an interface similar to CharSequence - let's call it UniString - that allows multiple implementations, but that doesn't have the magic property that String can be used directly as an implementation. (Instead, there will be an implementation of UniString that wraps a String).

(c) For operations on strings and string-like values, use a codepoint iterator wherever possible.

This is a gross simplification: we're dealing with half a million lines of code that's all concerned with string handling, so the detail is horrendous. But having a simplified description of the problem and the solution helps greatly when you're hacking through the jungle.

The Java class hierarchy for XPath type objectshttps://dev.saxonica.com/blog/mike/2020/02/the-java-class-hierarchy-for-xpath-type-objects.html2020-02-10T11:26:47Z

The set of interfaces and classes used in the Java code to represent XSD and XDM types has become something of a nightmare. This article is an attempt to explain it. When you don't understand something well, you can often improve your understanding by trying to explain it to others, so that's what I shall attempt to do.

The first complication is that we have to model schema types and item types, and these are overlapping categories.

Schema types - types as the term is used in XSD - are either simple types or complex types; simple types are either atomic types, union types, or list types. We can forget about complex types for the time being as they are relatively unproblematic. 

With simple types, one problem worth mentioning in passing is that while processing a schema, we don't always immediately know what the variety of a simple type is: if it's derived from a base type that we haven't yet analysed, we park it as a "SimpleTypeDefinition", to be turned into an AtomicType, UnionType, or ListType later, which means that all references to the type need to be updated.

As well as their use in schema processing, schema types are used as type annotations on nodes in XDM, and they also appear in XPath expressions as the target of a "cast" or "castable" expression.

Item types are purely an XDM concept, and they include atomic types, node types, function types, map types, array types. Item types when combined with an occurrence indicator form a Sequence type. Sequence types are used in XPath in declaring the types of variables, parameters, and function results; they are also used in "instance of" and "treat as" expressions.

Atomic types are both schema types (more specifically, simple types) and item types. Not every schema type is an item type (complex types aren't, list types aren't), and not every item type is a schema type (node types and function types aren't). The categories overlap, so it's not surprising that the Java class hierarchy is complicated.

Union types add another complication. A simple union of atomic types (for example the union of xs:date and xs:dateTime) is useful as an item type, for example to define the type of a function argument or variable. But XSD union types aren't always simple unions of atomic types: they can also include list types, and they can define restrictions beyond those present in the member types. So XDM defines the concept of a "pure union type", which is a simple union of atomic types; pure union types are the only kind that can be used as item types. For convenience it's useful to have a term that embraces atomic types and pure union types: the XDM specifications call these "generalized atomic types", and in Saxon they are referred to as "plain types". Again, these overlapping categories make it very hard to get the Java class hierarchy right.

Simple types form a lattice; at the top of this lattice is the most general type "xs:anySimpleType", and at the bottom is the "void" type "xs:error" (void because it has no instances). These "edge case" types are simple types, but they don't fit cleanly into the classification of union types, list types, and atomic types.

Item types also overlap with XSLT patterns, and with the node tests used in axis steps. Constructs such as element(*) and text() are both node tests (suitable for use in patterns and axis steps) and item types. Not every item type is a node test (for example, array(*) isn't), and not every node test is an item type (for example, *:local isn't). Again, we have two intersecting categories. If we draw the Venn diagram of simple types, item types, and node tests, we find that simple types don't overlap with node tests, but all other combinations have an intersection.

There's another dimension that we try to capture in the Java class hierarchy: we try to distinguish built-in types from user-defined types. There are built-in atomic types (xs:integer), built-in list types (xs:NMTOKENS), and built-in union types (xs:numeric); and there are also user-defined types in each of the three varieties. Capturing two dimensions of classification in a class hierarchy typically introduces multiple inheritance and complicates the hierarchy.

There's also a lot of complexity concerned with the relationship of schema types to other kinds of schema component. Again at this level we try to distinguish user-defined schema components (those derived from declarations in an XSD source document) from built-in schema components (which include not only simple types, but also complex types such as xs:anyType and xs:untyped). We distinguish "schema components" as defined in the XSD specification (which include not only schema types, but also element declarations, attribute declarations, identity constraints etc) and "schema structures" which are essentially constructs in a source XSD document; but looking at the code, nearly everything you find in a schema seems to be both a "schema component" and a "schema structure" and I'm having trouble seeing exactly what the difference between the two categories is.

The straw that broke the camel's back and made me examine whether refactoring is needed was the introduction of locally-declared union types with the syntax "union(xs:date, xs:time)". These are clearly union types, but they aren't built-in, and they don't correspond to declarations in any source schema, so they don't fit neatly into the existing classification of built-in versus user-defined.

We've got an awful lot of multiple inheritance in this hierarchy, and the accepted wisdom is that if you've got a lot of multiple inheritance, then you need to do some refactoring, and replace some of it with delegation.

We've got a model for that in the way we handle XSLT match patterns. Although node-tests are a subset of patterns, we don't treat node-tests as a subclass of patterns in the Java class hierarchy; rather, the class hierarchy for patterns includes a NodeTestPattern which contains a reference to a NodeTest. Similarly, atomic types are a subset of schema types, but that doesn't mean they need to implement SchemaType in the Java class hierarchy; rather the class hierarchy for SchemaTypes could include an AtomicSchemaType which contains a reference to an AtomicType.
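That delegation pattern can be shown in miniature (class names here are illustrative, not Saxon's real hierarchy): instead of AtomicType implementing SchemaType directly, the SchemaType hierarchy holds a wrapper that delegates to an AtomicType, just as NodeTestPattern wraps a NodeTest.

```java
// Delegation instead of multiple inheritance: the two hierarchies stay
// separate, connected by a wrapper class.
public class DelegationSketch {
    interface AtomicType { String name(); }
    interface SchemaType { String describe(); }

    static final class AtomicSchemaType implements SchemaType {
        private final AtomicType itemType;          // delegation, not inheritance
        AtomicSchemaType(AtomicType itemType) { this.itemType = itemType; }
        public String describe() { return "schema type wrapping " + itemType.name(); }
        AtomicType getItemType() { return itemType; }
    }

    public static void main(String[] args) {
        AtomicType xsInteger = () -> "xs:integer";
        SchemaType st = new AtomicSchemaType(xsInteger);
        System.out.println(st.describe());   // schema type wrapping xs:integer
    }
}
```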

Let's see what we can do.

UPDATE 2020-02-18

Well: I had a good go at refactoring this; but the new scheme was getting just as complex as the old, so I decided to revert all the work.

I tried to split the classes representing simple types into two: the "compile time" information used during XSD schema compilation, and the "executable" types used actively for validation. But I ended up with just as many classes (or more), and just as much multiple inheritance. I did manage to eliminate the messy process whereby a SimpleTypeDefinition is converted to an AtomicType, ListType, or UnionType as soon as we know its variety (i.e. when the reference to its base type is resolved -- it can be a forwards reference), but I found that doesn't open the door to any wider simplification.

Java Generics revisitedhttps://dev.saxonica.com/blog/mike/2020/01/java-generics-revisited.html2020-01-21T21:14:20Z

In Saxon 9.9 we took considerable pains to adopt Java Generics for processing sequences: in particular the Sequence and SequenceIterator classes, and all their subclasses, became Sequence<? extends Item> and SequenceIterator<? extends Item>.

I'm now coming to the conclusion that this was a mistake; or at any rate, that we went too far.

What exactly are the benefits of using Generics? It's supposed to improve type safety and reduce the need for casts which, if applied incorrectly, can trigger run-time exceptions. So it's all about detecting more of your errors at compile time.

Well, I don't think we've been seeing those benefits. And the main reason for that is that in most cases, when we're processing sequences, we don't have any static knowledge of the kind of items we are dealing with.

Sure, when we process a particular XPath path expression, we know whether it's going to deliver nodes or atomic values. But when we write the Java code in Saxon to handle path expressions, all we know is that the result will always be a sequence of items.

There are some cases where particular kinds of expression only handle nodes, or only handle atomic values. For example, the input sequences for a union operator will always be sequences of nodes. It would be nice if we didn't have to handle a completely general sequence and cast every item to class NodeInfo. But it's an illusion to think we can get extra type safety that way. The operands of a union are arbitrary expressions, and the iterators returned by the subexpressions are going to be arbitrary iterators; there's no way we can translate the type-safety we are implementing at the XPath level into type-safe evaluators at the Java level.

It's particularly obvious that generics give us no type-safety at the API level. In s9api, XPathSelector.evaluate() returns an XdmValue. That's a lot better than the JAXP equivalent which just returns Object, but the programmer still has to do casting to convert the items in the returned XdmValue to nodes, strings, integers, or whatever. And there's no way we can change that; the XPath expression is supplied as a string at run-time, so it's only at run-time that we know what type of items it returns. If that's true at the API level, it's equally true internally. Any kind of expression can invoke any other kind of expression (that's what orthogonality in language design is about), which means that the interfaces between an expression and its subexpressions are always going to be general-purpose sequences whose item type is known only at execution time.

There are a couple of aspects of Java generics that cause us real pain.

  • The first is the XDM rule that every item is itself a Sequence. So if Sequence is a generic type, parameterized by Item type, and Item is a subclass of Sequence, then Item has to be itself a generic type parameterized by its own type. Rather than Item, it has to be Item<? extends Item>; or perhaps it should be Item<? extends Item<? extends Item>>, and so ad infinitum. And then StringValue extends Item<StringValue> and so on. We found ways around that conundrum, but the complexity is horrendous; it certainly doesn't achieve the goal of making it easier to write correct code.
  • The second is arrays. Arrays don't play at all well with generics; you can't create an array of a generic type, for example. And yet there are lots of places where it's useful to use arrays, and some where arrays are the only option. Varargs functions, for example, present their arguments as an array. In some cases we wanted to carry on using arrays (rather than lists) for compatibility; in other cases we wanted to use them for convenience or for performance. The natural signature for a function call, for example, is public Sequence call(Context context, Sequence[] args). There's no way we can refine this in a way that passes static information about the argument types from the caller to the callee, because we're using the same Java signature for all XPath functions.
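The self-referential bound described in the first bullet is easier to see in a stripped-down sketch. These are not the real Saxon classes, just an illustration of the shape of the problem:

```java
// A stripped-down illustration (not Saxon's real classes) of the
// self-referential bound: every Item is itself a Sequence of its own kind.
interface Sequence<T extends Item<?>> { }

interface Item<T extends Item<T>> extends Sequence<T> { }

// A concrete atomic type must then name itself in its own declaration:
final class StringValue implements Item<StringValue> {
    private final String value;
    StringValue(String value) { this.value = value; }
    String getStringValue() { return value; }
}
```

The declarations compile, but every new item class has to repeat this pattern, and wildcard types like Item<?> proliferate at every call site.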
But having got Generics working, at great effort, in 9.9, should we retain them or drop them?

One reason I'm motivated to drop them is .NET. We have a significant user base on .NET, but we have something of a potential crisis looming in terms of ongoing support for this platform. Microsoft appear to be basing their future strategy around .NET Core, allowing .NET Framework to fade away into the sunset. But the technology we use for bridging to .NET, namely IKVM, only supports .NET Framework and not .NET Core; and Jeroen Frijters who single-handedly developed IKVM and supported it for umpteen years (with no revenue stream to support it) has thrown in the towel and is no longer taking it forward. So we're looking at a number of options for a way forward on .NET. One of these is source code conversion; and to make source code conversion viable without forking the code, we need to minimise our dependencies on Java features that don't translate easily to C#. Notable among those features is generics.

In the short term, I think I'm going to roll back the use of generics in selected areas where they are clearly more trouble than they are worth. That's particularly true of Sequence and its subclasses, including Item. For SequenceIterator it's probably worth keeping generics for the time being, but we'll keep that under review.

Alphacodes for Sequence Types
https://dev.saxonica.com/blog/mike/2019/10/alphacodes-for-sequence-types.html
2019-10-15T14:06:31Z

In the next releases of Saxon and Saxon-JS we have devised a compact notation for representation of SequenceType syntax in the exported SEF file. This note is to document this syntax.

The main aims in devising the syntax were compactness, together with fast generation and fast parsing. In addition it has the benefit that some operations are possible on the raw lexical form without doing a full parse.

The syntax actually handles ItemTypes as well as SequenceTypes; and in addition, it can handle the two examples of NodeTests that are not item types, namely *:local and uri:*. It can therefore be used in the SEF wherever a SequenceType, ItemType, or NodeTest is required.

The first character of an alphacode is the occurrence indicator. This is one of: * (zero or more), + (one or more), ? (zero or one), 0 (exactly zero), 1 (exactly one). If the first character is not one of these, then "1" is assumed; but the occurrence indicator is generally omitted only when representing an item type as distinct from a sequence type.

The occurrence indicator is immediately followed by the "primary alphacode" for the item type. These are chosen so that alphacode(T) is a prefix of alphacode(U) if and only if T is a supertype of U. For example, the primary alphacode for xs:integer is "ADI", and the primary alphacode for xs:decimal is "AD", reflecting the fact that xs:integer is a subtype of xs:decimal. The primary alphacodes are as follows:

"" (zero-length string): item()

A: xs:anyAtomicType

AB: xs:boolean

AS: xs:string

ASN: xs:normalizedString

ASNT: xs:token

ASNTL: xs:language

ASNTN: xs:Name

AQ: xs:QName

AU: xs:anyURI

AA: xs:date

AM: xs:dateTime

AMP: xs:dateTimeStamp

AT: xs:time

AR: xs:duration

ARD: xs:dayTimeDuration

ARY: xs:yearMonthDuration

AG: xs:gYear

AH: xs:gYearMonth

AI: xs:gMonth

AJ: xs:gMonthDay

AK: xs:gDay

AD: xs:decimal

ADI: xs:integer

ADIN: xs:nonPositiveInteger

ADINN: xs:negativeInteger

ADIP: xs:nonNegativeInteger

ADIPP: xs:positiveInteger

ADIPL: xs:unsignedLong

ADIPLI: xs:unsignedInt

ADIPLIS: xs:unsignedShort

ADIPLISB: xs:unsignedByte

ADIL: xs:long

ADILI: xs:int

ADILIS: xs:short

ADILISB: xs:byte

AO: xs:double

AF: xs:float

A2: xs:base64Binary

AX: xs:hexBinary

AZ: xs:untypedAtomic

N: node()

NE: element(*)

NA: attribute(*)

NT: text()

NC: comment()

NP: processing-instruction()

ND: document-node()

NN: namespace-node()

F: function(*)

FM: map(*)

FA: array(*)

E: xs:error

X: external (wrapped) object

XJ: external Java object

XN: external .NET object

XS: external Javascript object

Every item belongs to one or more of these types, and there is always a "most specific" type, which is the one that we choose.
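Because of that prefix property, a supertype test on primary alphacodes reduces to a string prefix test. A minimal sketch (a hypothetical helper, not Saxon code):

```java
// Hypothetical helper (not Saxon code): given the prefix property above,
// T is a supertype of U if and only if alphacode(T) is a prefix of
// alphacode(U), so an item-type subtype test is just String.startsWith.
class AlphacodeTypes {
    static boolean isSubtypeOf(String code, String supertypeCode) {
        return code.startsWith(supertypeCode);
    }
}
```

So isSubtypeOf("ADI", "AD") is true (xs:integer is a subtype of xs:decimal), isSubtypeOf("AD", "ADI") is false, and every code is a subtype of "" (item()).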

Following the occurrence indicator and primary alphacode are zero or more supplementary codes. Each is preceded by a single space, is identified by a single letter, and is followed by a parameter value. For example the sequence type "element(BOOK)" is coded as "1NE nQ{}BOOK" - here 1 is the occurrence indicator, NE indicates an element node, and nQ{}BOOK is the required element name. The identifying letter here is "n". The supplementary codes (which may appear in any order) are as follows:

n - Name, as a URI-qualified name. Used for node names when the primary alphacode is one of (NE, NA, NP). Also used for the XSD type name when the type is a user-defined atomic or union type: the basic alphacode then represents the lowest common supertype that is a built-in type.  (Note: we assume that type names are globally unique. This cannot be guaranteed when deploying a SEF file: the schema at the receiving end might vary from that of the sender.) Also used for the class name in the case of external object types (in this case the namespace part will always be "Q{}"). Note that strictly speaking, the forms *:name and name:* can appear in a NameTest, but never in a SequenceType. However, they can be represented in alphacodes using the syntax "n*:name" and "nQ{uri}*" respectively. The syntax "~localname" is used for a name in the XSD namespace. 

c - Node content type (XSD type annotation), as a URI-qualified name optionally followed by "?" to indicate nillable. The syntax "~localname" is used for a name in the XSD namespace. Optionally present when the basic code is (NE, NA); omitted for NE when the content is xs:untyped, and for NA when the content is xs:untypedAtomic. Only relevant for schema-aware code.

k - Key type, present when the basic code is FM (i.e. for maps), omitted if the key type is xs:anyAtomicType. The value is the alphacode of the key type, enclosed in square brackets: it will always start with "1A".

v - Value type, present when the basic code is (FM, FA) (i.e. for maps and arrays), omitted if the value type is item()*. The value is the alphacode of the value type, enclosed in square brackets. For example, the alphacode for array(xs:string+)* is "*FA v[+AS]".

r - Return type, always present for functions. The value is the alphacode of the return type, enclosed in square brackets.

a - Argument types, always present for functions. The value is an array of alphacodes, enclosed in square brackets and separated by commas. For example, the alphacode for the function fn:dateTime#2 (with signature ($arg1 as xs:date?, $arg2 as xs:time?) as xs:dateTime?) is "1F r[?AM] a[?AA,?AT]".

m - Member types of an anonymous union type. The value is an array of alphacodes for the member types (these will always be atomic types), enclosed in square brackets and comma-separated. The basic code in this case will be "A", indicating xs:anyAtomicType. This is not used for the built-in union type xs:numeric, nor for user-defined atomic types defined in a schema; it is used only for anonymous union types defined using the Saxon extension syntax "union(a, b, c)".

e - Element type of a document-node() type, present optionally when the basic code is ND. The value is an alphacode, which will always start with "1NE".

t - Components of a tuple type (Saxon extension). The value is an array of tokens, enclosed in square brackets, where each token comprises the name of the component (an NCName), a colon, and the alphacode of the component type.

i, u, d - Venn type. The item type is the intersection, union, or difference of two item types. The letter "i", "u", or "d" indicates intersection, union, or difference respectively, followed by a list of (currently always two) item types enclosed in square brackets and separated by a comma. The principal type will typically be "N" or "NE". Saxon uses venn types internally to give a more precise inferred type for expressions; it is probably largely unused at run-time, and can therefore be safely ignored when reading a SEF file.

Named union types have a basic alphacode of "A", followed by the name of the union type in the form "A nQ{uri}local". The syntax "~localname" is used for a name in the XSD namespace, so the built-in union types xs:numeric and xs:error are represented as "A n~numeric" and "A n~error" respectively.

TODO: the documentation for union types is not aligned with the current implementation


Some examples of complete alphacodes:

0 - empty-sequence()

1AS - xs:string

1N - node()

1 - item()

* - item()*

1NE nQ{}item - element(item)

1ND e[1NE nQ{}item] - document-node(element(item))

*FM k[1AS] v[?AS] - map(xs:string, xs:string?)*

1F a[?AS,*AO] r[1AB] - function(xs:string?, xs:double*) as xs:boolean
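The occurrence-indicator rule lends itself to a one-line split when reading an alphacode. A sketch (hypothetical, not the actual Saxon parser):

```java
// Hypothetical sketch of the first step in parsing an alphacode:
// peel off the occurrence indicator, defaulting to "1" when absent.
class AlphacodeReader {
    static String[] split(String alphacode) {
        if (!alphacode.isEmpty() && "*+?01".indexOf(alphacode.charAt(0)) >= 0) {
            return new String[]{alphacode.substring(0, 1), alphacode.substring(1)};
        }
        return new String[]{"1", alphacode};  // indicator omitted: exactly one
    }
}
```

For example, split("*FM k[1AS] v[?AS]") yields {"*", "FM k[1AS] v[?AS]"}, and split("NE") yields {"1", "NE"}.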

Version: 2019-10-30

A new push event API
https://dev.saxonica.com/blog/mike/2019/05/a-new-push-event-api.html
2019-05-01T16:34:00Z

For various internal and performance reasons, we're making some changes to Saxon's internal Receiver interface for the next release. This interface is a SAX-like interface for sending an XML document (or in general, any XDM instance) from one processing component to another, as a sequence of events such as startElement(), attributes(), characters(), and so on.

The interface is very widely used within Saxon: it handles communication from the XML parser to the document builder, document validation, serialization, and much else. It also allows instructions to be executed in "push mode", so for example when XSLT constructs a result tree, the tree is never actually constructed in memory, but instead events representing the tree are sent straight from the transformer to the serializer.

I know that although this interface is labelled as internal, some user applications attempt either to implement the interface or to act as a client, sending events to one of Saxon's many implementations of the interface. So in making changes, it seems a good time to recognize that there is a need for an interface at this level, and that existing candidates are really rather clumsy to use.

Among those candidates are the venerable SAX ContentHandler interface, and the newer StAX XMLStreamWriter interface.

There are a number of structural reasons that make the ContentHandler hard to use:

  • It offers a number of different configuration options for XML parsers, which cause namespace information to be provided in different ways. But the ContentHandler has no way of discovering which of these options the XML parser (or other originator of events) is actually using.
  • It's not actually one interface but several: some events are sent not to the ContentHandler, but to a LexicalHandler or DTDHandler.
  • The information available to the ContentHandler doesn't align well with the information defined in the XDM data model; for example, comments are available only to the LexicalHandler, not to the ContentHandler.
In addition, the way QNames and namespaces are handled makes life unnecessarily difficult for both sides of the interface.

In some ways the XMLStreamWriter is an improvement, and I've certainly used it in preference when writing an application that has to construct XML documents in this way. But a major problem of the XMLStreamWriter is that it's underspecified, to the extent that there is a separate guidance document from a third party suggesting how implementations should interpret the spec. Again, the main culprit is namespaces.

One of the practical problems with all these event-based interfaces is that debugging can be very difficult. In particular, if you forget to issue an endElement() call, you don't find out until the endDocument() event finds there's a missing end tag somewhere, and tracking down where the unmatched startElement() is in a complex program can be a nightmare. I decided that addressing this problem should be one of the main design aims of a new interface -- and it turns out that it isn't difficult.

Let's show off the new design with an example. Here is some code from Saxon's InvalidityReportGenerator, which generates an XML report of errors found during a schema validation episode, using the XMLStreamWriter interface:

writer.writeStartElement(REPORT_NS, "meta-data");
writer.writeStartElement(REPORT_NS, "validator");
writer.writeAttribute("name", Version.getProductName() + "-" + getConfiguration().getEditionCode());
writer.writeAttribute("version", Version.getProductVersion());
writer.writeEndElement(); //</validator>
writer.writeStartElement(REPORT_NS, "results");
writer.writeAttribute("errors", "" + errorCount);
writer.writeAttribute("warnings", "" + warningCount);
writer.writeEndElement(); //</results>
if (schemaName != null) {
    writer.writeStartElement(REPORT_NS, "schema");
    writer.writeAttribute("file", schemaName);
    writer.writeAttribute("xsd-version", xsdversion);
    writer.writeEndElement(); //</schema>
}
writer.writeStartElement(REPORT_NS, "run");
writer.writeAttribute("at", DateTimeValue.getCurrentDateTime(null).getStringValue());
writer.writeEndElement(); //</run>
writer.writeEndElement(); //</meta-data>

And here is the equivalent using the new push API:

Push.Element metadata = report.element("meta-data");
metadata.element("validator")
        .attribute("name", Version.getProductName() + "-" + getConfiguration().getEditionCode())
        .attribute("version", Version.getProductVersion());
metadata.element("results")
        .attribute("errors", "" + errorCount)
        .attribute("warnings", "" + warningCount);
metadata.element("schema")
        .attribute("file", schemaName)
        .attribute("xsd-version", xsdversion);
metadata.element("run")
        .attribute("at", DateTimeValue.getCurrentDateTime(null).getStringValue());
metadata.close();

What's different? The most obvious difference is that the method for creating a new element returns an object (a Push.Element) which is used for constructing the attributes and children of the element. This gives it an appearance rather like a tree-building API, but this is an illusion: the objects created are transient. Methods such as attribute() use the "chaining" design - they return the object to which they are applied - making it easy to apply further methods to the same object, without the need to bind variables. The endElement() calls have disappeared - an element is closed automatically when the next child is written to the parent element, which we can do because we know which element the child is being attached to.

There are a few other features of the design worthy of attention:

  • Names of elements and attributes can be supplied either as a plain local name, or as a QName object. A plain local name is interpreted as being in the default namespace in the case of elements (the default namespace can be set at any level), or as being in no namespace in the case of attributes. For the vast majority of documents, there is never any need to use QNames; very often the only namespace handling is a single call on setDefaultNamespace().
  • The close() method on elements (which generates the end tag) is optional. If you write another child element, the previous child is closed automatically. If you close a parent element, any unclosed child element is closed automatically. The specimen code above shows one call on close(), which is useful in this case for readability: the reader can see that no further children are going to be added.
  • The argument of methods such as attribute() and text() that supplies the content may always be null. If the content is null, no attribute or text node is written. This makes it easier to handle optional content without disrupting the method chaining.
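The last two conventions are cheap to implement. A much-simplified sketch of the pattern (hypothetical code, not the real Push.Element):

```java
// Simplified sketch (not the real Push.Element) of the chaining and
// null-skipping conventions described above.
class SketchElement {
    private final StringBuilder out = new StringBuilder();

    SketchElement attribute(String name, String value) {
        if (value != null) {                 // null content: write nothing
            out.append(' ').append(name).append("=\"").append(value).append('"');
        }
        return this;                         // chaining: return the receiver
    }

    String attributes() { return out.toString(); }
}
```

With this shape, new SketchElement().attribute("file", null).attribute("at", "noon") simply skips the null attribute, so optional content never breaks the chain.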

I have rewritten several classes that construct content using push APIs to use this interface, and the resulting readability is very encouraging.

Representing namespaces in XDM tree models
https://dev.saxonica.com/blog/mike/2019/02/representing-namespaces-in-xdm-tree-models.html
2019-02-01T11:59:58Z

Most tree representations of XML, including the Saxon TinyTree and LinkedTree implementation, as well as DOM, represent namespace information by holding a set of namespace declarations and undeclarations on each element node.

I'm considering a change to this representation (for the Saxon implementations) to do something that more closely reflects the way namespaces are actually defined in XDM: each element node has a set of in-scope namespaces (held in a NamespaceMap object) containing all the information about the namespaces that apply to that element.

The obvious objection to this, and the reason I've never done it before, is that it looks at first sight to be very inefficient. But consider:

(a) in the vast majority of documents, there are very few namespace declarations on any element other than the root

(b) if there are no namespace declarations on an element, it can point to the same NamespaceMap object that its parent element points to; in most cases, all elements in the document will point to the same shared NamespaceMap.

(c) having a NamespaceMap object immediately available on every element node means we never need to search up the ancestor axis to resolve namespace prefixes

(d) there are still opportunities for implementations of NamespaceMap that use "deltas" if space-saving in pathological cases is considered necessary.

Note that the NamespaceMap holds prefix=uri pairs, not namespace nodes. Namespace nodes have node identity and parentage, which is what makes them so expensive. Prefix-uri pairs are just pairs of strings without such baggage, and they can be freely shared across element nodes.

The current implementation I'm using for NamespaceMap is an immutable map implemented as a pair of String[] arrays, one for prefixes and one for uris. The prefix array is maintained in sorted order so we can use binary search to find a prefix. Insertion of a new prefix/uri mapping is O(n), but this doesn't matter because the number of bindings is usually less than ten, and it's a rare operation anyway that only happens during tree construction.
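A minimal sketch of that representation (illustrative only, not Saxon's actual class):

```java
import java.util.Arrays;

// Illustrative sketch (not Saxon's actual class): an immutable prefix->uri
// map held as two parallel arrays, with the prefixes kept sorted so that
// lookup can use binary search.
final class SimpleNamespaceMap {
    private final String[] prefixes;  // sorted
    private final String[] uris;      // parallel to prefixes

    SimpleNamespaceMap(String[] prefixes, String[] uris) {
        this.prefixes = prefixes;
        this.uris = uris;
    }

    // O(log n) lookup of a prefix
    String getURI(String prefix) {
        int i = Arrays.binarySearch(prefixes, prefix);
        return i >= 0 ? uris[i] : null;
    }

    // O(n) insertion returning a new map; acceptable because the number of
    // bindings is small and this only happens during tree construction.
    SimpleNamespaceMap put(String prefix, String uri) {
        int i = Arrays.binarySearch(prefixes, prefix);
        if (i >= 0) {
            String[] u = uris.clone();
            u[i] = uri;
            return new SimpleNamespaceMap(prefixes, u);
        }
        int ins = -i - 1;
        String[] p = new String[prefixes.length + 1];
        String[] u = new String[uris.length + 1];
        System.arraycopy(prefixes, 0, p, 0, ins);
        System.arraycopy(uris, 0, u, 0, ins);
        p[ins] = prefix;
        u[ins] = uri;
        System.arraycopy(prefixes, ins, p, ins + 1, prefixes.length - ins);
        System.arraycopy(uris, ins, u, ins + 1, uris.length - ins);
        return new SimpleNamespaceMap(p, u);
    }
}
```

Because put() returns a new map and never mutates the old one, a child element can safely share its parent's map until a declaration actually changes something.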

Because the NamespaceMap is immutable, the system is quite easy to implement in a tree builder that gets notified of namespaces incrementally (for example by a SAX parser). The tree builder maintains a stack of NamespaceMap objects. On a startElement event it allocates to the element the same NamespaceMap object that the parent element is using; when a namespace declaration or undeclaration is encountered, this is replaced with a new NamespaceMap with the required modifications.

The real motivation for the change is in implementing copy operations. In complex multi-phase transformations both deep and shallow element copy operations are very frequent, and copying of the namespace information is a significant cost. The XSLT and XQuery language semantics require that when an element is copied, all its in-scope namespaces are copied, and this requires searching the ancestor axis to find them (we try quite hard to optimize this away, but we're not always successful). If the in-scope namespaces are readily to hand in a simple immutable object, we save this effort and just pass the complete object down the pipeline.

The builder for the tree to which the element is being copied now has to merge this set of namespaces with the existing namespaces inherited from ancestor elements on the receiving tree. It should now be clear why I chose the particular data structure for the NamespaceMap: merging two sets of namespace bindings reduces to merging two sorted arrays, which is quite an efficient operation. It's also easy to optimize for the common case where the in-scope namespaces of the element being copied are exactly the same as the in-scope namespaces of its parent element (typically we'll find that the same NamespaceMap object is in use), in which case the merge becomes a null operation.

Of course, there are many details to work through (not least, how we fit this in with third-party tree models that continue to use declarations and undeclarations). But initial experiments are looking encouraging.

The Receiver Pipeline
https://dev.saxonica.com/blog/mike/2018/06/the-receiver-pipeline.html
2018-06-20T11:07:12Z

A significant feature of the internal architecture of Saxon is the Receiver pipeline. A receiver is an object that (rather like a SAX ContentHandler) is called with a sequence of events such as startElement(), characters(), and endElement(); it typically does some processing on these events and then calls similar events on the next Receiver in the pipeline. The mechanism is efficient, because it avoids building a tree in memory, and because it allows much of the conditional logic of the processing (for example, whether or not to validate the document) to be executed at the time the pipeline is constructed, rather than with conditional code executed for every event that occurs.

Receiver pipelines are used throughout Saxon: from doing whitespace stripping on the source document, to serialization of the result tree. The schema validator is implemented as a receiver pipeline, as are operations such as namespace fixup.

But despite the elegance of the design, there have been some perennial problems with the implementation. For example, there are variations on exactly what input different implementations of Receiver will accept: some for example require an open() event while others don't; some accept entire element or document nodes in an append() event while others don't. This limits the ability to construct a pipeline using arbitrary combinations of Receivers, and worse, it's very hard to establish exactly what the permitted combinations are.

This has come to a head recently in trying to get some of the new features in XSLT 3.0 and XQuery 3.1 working reliably and robustly. The straw that broke the camel's back was the innocent-seeming item-separator serialization property. The item-separator is used while doing "sequence normalization" as the first stage of serialization; and the problem was that we didn't really do sequence normalization as a separate step in the processing. The obvious symptom that there are design problems here has been that whenever we get all the XQuery tests working, we find we've broken XSLT; and then when we get all the XSLT tests working, we find XQuery is now failing.

The model according to the specs is that the transformation or query engine produces "raw" results (which can be any sequence of items), and this is then input to the serialization process (or possibly just to sequence normalization, which wraps the results in a document node and then delivers the document). But although Saxon could deliver raw results from an XQuery running in "pull" mode (the XQueryEvaluator.iterate() method) we never really had the capability to produce raw output in push mode: the push code did sequence normalization within the query/transformation logic, rather than leaving it to the serializer. That's for historic reasons, of course: with XSLT 2.0, that's the way it was defined (the result of the transformation was always a document, with optional serialization).

So the first principle to establish in sorting this out is: the interface between the query or transformation engine and the Destination (which may or may not be a Serializer) is a raw sequence, delivered over the Receiver interface.

This requires a definition of exactly how a raw sequence is delivered over this interface: that is, what's the contract between the provider of the Receiver interface and the client (the sender of events). I've created that definition, and I've also written a Receiver implementation which validates that the sequence of events conforms to this definition; we can put this validation step into the pipeline when we feel it useful (for example, when running with assertions enabled). This exercise has revealed quite a few anomalies that should be fixed, for example cases where endDocument() is not being called before calling close().

There are three ways of delivering output from a query or transformation: raw output, document output (the result of sequence normalization), and serialized output. The next question that arises is, who decides which form is delivered. The simplest solution is: this is decided entirely at the API level, and does not depend on anything in the stylesheet or query. (This means that the XSLT build-tree attribute is ignored entirely.) In s9api terms, your choice of Destination object determines which kind of output you get. And at the implementation level, the Destination object always receives raw output; we don't want the transformation engine doing different things depending what kind of Destination has been supplied.

The other related area that needed sorting out was the API interaction with xsl:result-document. We've always had the OutputURIResolver as a callback for determining what should happen to secondary result documents, but this is no longer fit for purpose. It was already a struggle to extend it to handle thread safety when the xsl:result-document instruction became asynchronous; further extending it to work with the s9api Destination framework has never been attempted because it just seemed too difficult. Having made the decision to introduce a dependency on Java 8 for the next major Saxon release, I think we can solve this at the API level with two enhancements:

  1. on XsltTransformer and Xslt30Transformer, a new method setResultDocumentResolver() which takes as argument an implementation of Function<URI, Destination> - that is a function that accepts an absolute URI as input, and returns a Destination;
  2. on Destination, a new method onClose() which takes as argument a Consumer<Destination>.
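A self-contained sketch of how those two methods might fit together. All the types here are hypothetical stand-ins, not the real s9api classes:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-in for a Destination that supports onClose() handlers,
// in the shape proposed above (not the real s9api Destination).
class SketchDestination {
    final URI uri;
    private final List<Consumer<SketchDestination>> handlers = new ArrayList<>();
    SketchDestination(URI uri) { this.uri = uri; }
    void onClose(Consumer<SketchDestination> handler) { handlers.add(handler); }
    void close() { handlers.forEach(h -> h.accept(this)); }
}

class ResultDocumentWiring {
    // A resolver in the proposed style: absolute URI in, Destination out,
    // with an onClose handler to process the finished result document.
    static Function<URI, SketchDestination> resolver(List<URI> written) {
        return uri -> {
            SketchDestination d = new SketchDestination(uri);
            d.onClose(dest -> written.add(dest.uri));  // e.g. write to a database
            return d;
        };
    }
}
```

The point of the design is that the resolver is just a Function, so a lambda suffices, and the onClose callback gives the application a hook at exactly the moment the result document is complete.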

So when xsl:result-document is called, we construct the absolute URI and pass it to the registered result document resolver, and then use the returned Destination to write the result tree. On completion we call any onClose() handler registered with the Destination, which gives the application the opportunity to process the result document (for example, by writing it to a database).

Of course, we have to work out how to implement this while retaining a level of backwards compatibility for applications using the existing OutputURIResolver.

A tricky case with xsl:result-document has been where the href attribute is omitted or empty. I think the cleanest design here is to call the registered result document resolver passing the base output URI as argument, and use the returned Destination in the normal way. The application then has to sort out the fact that the original primary Destination for the transformation is not actually used.

Yet another complication in the design is the rule in XSLT that when xsl:result-document requests schema validation of the output, schema validation is done after sequence normalization and before serialization. This is pretty ugly from a specification point of view: the serialization spec defines serialization as a 6-step process of which sequence normalization is the first; the XSLT spec really has no business inserting an additional step in the middle of this process. When the specification is ugly, the implementation usually ends up being ugly too, and we have to find some way for the transformation engine to inject a validation step into the middle of the pipeline implemented by the Destination, which ought by rights to be completely encapsulated.

Standing back from all this, unlike some refactoring exercises, in this case the basic design of the code proved to be sound, but it needed reinforcement to make the implementation more robust. It needed a clear definition and enforcement of the contract implied by the Receiver interface; it needed a clear separation of concerns between the transformation/query engine and the Destination processing; and it needed a clean API to control it all.

Navigating XML trees using Java Streams
https://dev.saxonica.com/blog/mike/2018/04/navigating-xml-trees-using-java-streams.html
2018-04-13T08:31:44Z

For the next major Saxon release I am planning an extension to the s9api interface to exploit the facilities of Java 8 streams to allow powerful navigation of XDM trees: the idea is that navigation should be as easy as using XPath, but without the need to drop out of Java into a different programming language. To give a flavour, here is how you might select the elements within a document that have @class='hidden':

// illustrative only -- the method names here are a sketch of the planned API
doc.select(descendant(attributeEq("class", "hidden")))
   .forEach(System.out::println);
We'll see how that works in due course.

Why do we need it?

The combination of Java and XML is as powerful and ubiquitous today as it has been for nearly twenty years. Java has moved on considerably (notably, as far as this article is concerned, with the Java 8 Streams API), and the world of XML processing has also made great strides (we now have XSLT 3.0, XPath 3.1, and XQuery 3.1), but for some reason the two have not moved together. The bulk of Java programmers manipulating XML, if we can judge from the questions they ask on forums such as StackOverflow, are still using DOM interfaces, perhaps with a bit of XPath 1.0 thrown in.

DOM shows its age. It was originally designed for HTML, with XML added as an afterthought, and XML namespaces thrown in as a subsequent bolt-on. Its data model predates the XML Infoset and the (XPath-2.0-defined) XDM model. It was designed as a cross-language API and so the designers deliberately eschewed the usual Java conventions and interfaces in areas such as the handling of collections and iterators, not to mention exceptions. It does everything its own way. As a navigational API it carries a lot of baggage because the underlying tree is assumed to be mutable. Many programmers only discover far too late that it's not even thread-safe (even when you confine yourself to retrieval-only operations).

There are better APIs than DOM available (for example JDOM2 and XOM) but they're all ten years old and haven't caught up with the times. There's nothing in the Java world that compares with Linq for C# users, or ElementTree in Python.

The alternative of calling out from Java to execute XPath or XQuery expressions has its own disadvantages. Any crossing of boundaries from one programming language to another involves data conversions and a loss of type safety. Embedding a sublanguage in the form of character strings within a host language (as with SQL and regular expressions) means that the host language compiler can't do any static syntax checking or type checking of the expressions in the sublanguage. Unless users go to some effort to avoid it, it's easy to find that the cost of compiling XPath expressions is incurred on each execution, rather than being incurred once and amortized. And the API for passing context from the host language to the sublanguage can be very messy. It doesn't have to be quite as messy as the JAXP interface used for invoking XPath from Java, but it still has to involve a fair bit of complexity.
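The compile-once point, at least, is easy to get right even with the JAXP interface: compile the expression string to an XPathExpression and reuse it. A sketch:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class CompiledXPathDemo {
    // Compile the XPath string once and reuse the XPathExpression,
    // instead of re-parsing it on every evaluation.
    static final XPathExpression HIDDEN;
    static {
        try {
            HIDDEN = XPathFactory.newInstance().newXPath()
                    .compile("//*[@class='hidden']");
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static NodeList findHidden(Document doc) throws Exception {
        return (NodeList) HIDDEN.evaluate(doc, XPathConstants.NODESET);
    }
}
```

Even so, the result comes back as an untyped NodeList, so the loss of type safety at the language boundary remains.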

Of course, there's the alternative of not using Java (or other general-purpose programming languages) at all: you can write the whole application in XSLT or XQuery. Given the capability that XSLT 3.0 and XQuery 3.1 have acquired, that's a real possibility far more often than most users realise. But it remains true that if only 10% of your application is concerned with processing XML input, and the rest is doing something more interesting, then writing the whole application in XQuery would probably be a poor choice.

Other programming languages have developed better APIs. Javascript has JQuery, C# programmers have Linq, Scala programmers have something very similar, and PHP users have SimpleXML. These APIs all have the characteristic that they are much more deeply integrated into the host language, and in particular they exploit the host language primitives for manipulation of sequences through functional programming constructs, with a reasonable level of type safety given that the actual structure of the XML document is not statically known.

That leads to the question of data binding interfaces: broadly, APIs that exploit static knowledge of the schema of the source document. Such APIs have their place, but I'm not going to consider them any further in this article. In my experience they can work well if the XML schema is very simple and very stable. If the schema is complex or changing, data binding can be a disaster.

The Java 8 Streams API

This is not the place for an extended tutorial on the new Streams API introduced in Java 8. If you haven't come across it, I suggest you find a good tutorial on the web and read it before you go any further.

Java Streams are quite unrelated to XSLT 3.0 streaming. Well, almost unrelated: they share the same high-level objectives of processing large collections of data in a declarative way, making maximum use of lazy evaluation to reduce memory use, and permitting parallel execution. But that's where the similarity ends. Perhaps the biggest difference is that Java 8 streams are designed to process linear data structures (sequences), whereas XSLT 3.0 streaming is designed to process trees.

But just to summarise:

  • Java 8 introduces a new interface, Stream<X>, representing a linear sequence of items of type X
  • Like iterators, streams are designed to be used once. Unlike iterators, they are manipulated using functional operations, most notably maps and filters, rather than being processed one item at a time. This makes for less error-prone programming, and allows parallel execution.
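As a minimal refresher, here is the shape of a typical stream pipeline (plain JDK code, nothing Saxon-specific):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamsRecap {
    public static void main(String[] args) {
        // A stream is consumed once, by a pipeline of functional operations
        List<String> result = Stream.of("alpha", "beta", "gamma", "delta")
                .filter(s -> s.length() == 5)   // keep five-letter words
                .map(String::toUpperCase)       // transform each surviving item
                .collect(Collectors.toList());  // terminal operation
        System.out.println(result);             // [ALPHA, GAMMA, DELTA]
    }
}
```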

The functional nature of the Java 8 Streams API means it has much in common with the processing model of XPath. The basic thrust of the API design presented in this article is therefore to reproduce the primitives of the XPath processing model, re-expressing them in terms of the constructs provided by the Java 8 Streams API.

If the design appears to borrow concepts from other APIs such as LINQ and Scala and SimpleXML, that's not actually because I have a deep familiarity with those APIs: in fact, I have never used them in anger, and I haven't attempted to copy anything across literally. Rather, any similarity is because the functional concepts of XPath processing map so cleanly to this approach.

The Basics of the Saxon s9api API

The Saxon product primarily exists to enable XSLT, XQuery, XPath, and XML Schema processing. Some years ago I decided that the standard APIs (JAXP and XQJ) for invoking such functionality were becoming unfit for purpose. They had grown haphazardly over the years, the various APIs didn't work well together, and they weren't being updated to exploit the newer versions of the W3C specifications. Some appalling design mistakes had been unleashed on the world, and the strict backwards compatibility policy of the JDK meant these could never be corrected.

To take one horrid example: the NamespaceContext interface is used to pass a set of namespace bindings from a Java application to an XPath processor. To implement this interface, you need to implement three methods, of which the XPath processor will only ever use one (getNamespaceURI(prefix)). Yet at the same time, there is no way the XPath processor can extract the full set of bindings defined in the NamespaceContext and copy them into its own data structures.

So I decided some years ago to introduce a proprietary alternative called s9api into the Saxon product (retaining JAXP support alongside), and it has been a considerable success, in that it has withstood the test of time rather well. The changes to XSLT transformation in 3.0 were sufficiently radical that I forked the XsltTransformer interface to create a 3.0 version, but apart from that it has been largely possible to add new features incrementally. That's partly because of a slightly less obsessive attitude to backwards compatibility: if I decide that something was a bad mistake, I'm prepared to change it.

Although s9api is primarily about invoking XSLT, XQuery, and XPath processing, it does include classes that represent objects in the XDM data model, and I will introduce these briefly because the new navigation API relies on these objects as its foundation. The table below lists the main classes.

Class Description
XdmValue Every value in the XDM model is a sequence of items. The XdmValue class is therefore the top of the class hierarchy. Because it's a sequence, it implements Iterable<XdmItem>, so you can use a Java foreach loop to process the items sequentially. In the latest version I have used Java generics to add a type parameter, so XdmValue<XdmNode> is a sequence of nodes, and XdmValue<XdmAtomicValue> is a sequence of atomic values. As well as an iterator() method, it has an itemAt() method to get the Nth item, and a size() method to count the items. Internally an XdmValue might exist as an actual sequence in memory, or as a "promise": sufficient data to enable the items to be materialized when they are needed.
XdmItem This class represents an Item in the XDM model. As such it is both a component of an XdmValue, and also an XdmValue (of length one) in its own right. It's an abstract class, because every item is actually something more specific (a node, an atomic value, a function). Some of the methods inherited from XdmValue become trivial (for example size() always returns 1).
XdmNode This is a subclass of XdmItem used to represent nodes. Unlike many models of XML, we don't subclass this for different kinds of node: that's mainly because XDM has deliberately aimed at uniformity, with the same accessors available for all node kinds. Many of the methods on XdmNode, such as getNodeName(), getStringValue(), getTypedValue(), and getNodeKind(), are directly equivalent to accessors defined in the W3C XDM specification. But in addition, XdmNode has a method axisIterator to navigate the tree using any of the XPath axes, the result being returned as an iterator over the selected nodes.
XdmAtomicValue Another subclass of XdmItem, this is used to represent atomic values in the XDM model. As with XdmNode, we don't define further subclasses for different atomic types. There are convenience methods to convert XdmAtomicValue instances to and from equivalent (or near-equivalent) Java classes such as String, Double, BigInteger, and Date.
XdmFunctionItem From XPath 3.0, functions are first-class values alongside nodes and atomic values. These are represented in s9api as instances of XdmFunctionItem. Two specific subclasses of function, with their own behaviours, are represented using the subclasses XdmMap and XdmArray. I won't be saying much about these in this article, because I'm primarily concerned with navigating XML trees.

The new API: Steps and Predicates

The basic concept behind the new extensions to the s9api API is navigation using steps and predicates. I'll introduce these concepts briefly in this section, and then go on to give a more detailed exposition.

The class XdmValue<T> acquires a new method:

XdmStream select(Step step)

The Step here is a function that takes an item of class T as its input, and returns a stream of items. If we consider a very simple Step, namely child(), this takes a node as input and returns a stream of nodes as its result. We can apply this step to an XdmValue consisting entirely of nodes, and it returns the concatenation of the streams of nodes obtained by applying the step to each node in the input value. This operation is equivalent to the "!" operator in XPath 3.0, or to the flatMap() method in many functional programming languages. It's not quite the same as the familiar "/" operator in XPath, because it doesn't eliminate duplicates or sort the result into document order. But for most purposes it does the same job.

There's a class net.sf.saxon.s9api.streams.Steps containing static methods which provide commonly-used steps such as child(). In my examples, I'll assume that the Java application has import static net.sf.saxon.s9api.streams.Steps.*; in its header, so it can use these fields and methods without further qualification.

One of the steps defined by this class is net.sf.saxon.s9api.streams.Steps.child(): this step is a function which, given a node, returns its children. There are other similar steps for the other XPath axes. So you can find the children of a node N by writing N.select(child()).

Any two steps S and T can be combined into a single composite step by writing S.then(T): for example Step grandchildren = child().then(child()) gives you a step which can be used in the expression N.select(grandchildren) to select all the grandchildren.

The Step class implements the standard Java interface Function, so a step can be used more generally in any Java context where a Function is required.
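Since a step is just a function from an item to a stream, the S.then(T) composition described above reduces to flatMap(). Here is a sketch in plain JDK types, with a Map standing in for the child() axis (all names here are illustrative, not the s9api signatures):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StepComposition {
    // A parent -> children relation standing in for the child() axis
    static final Map<String, List<String>> TREE = Map.of(
            "root", List.of("a", "b"),
            "a", List.of("a1", "a2"),
            "b", List.of("b1"));

    // The "step": one node in, a stream of child nodes out
    static Stream<String> child(String node) {
        return TREE.getOrDefault(node, List.of()).stream();
    }

    // S.then(T): apply T to every result of S, flattening into one stream
    static <A, B, C> Function<A, Stream<C>> then(Function<A, Stream<B>> s,
                                                 Function<B, Stream<C>> t) {
        return a -> s.apply(a).flatMap(t);
    }

    public static void main(String[] args) {
        Function<String, Stream<String>> grandchildren =
                then(StepComposition::child, StepComposition::child);
        List<String> result = grandchildren.apply("root").collect(Collectors.toList());
        System.out.println(result); // [a1, a2, b1]
    }
}
```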

Predicate<T> is a standard Java 8 interface: it defines a function that can be applied to an object of type T to return true or false. The class net.sf.saxon.s9api.streams.Predicates defines some standard predicates that are useful when processing XML. For example isElement() gives you a predicate that can be applied to any XdmItem to determine if it is an element node.

Given a Step A and a Predicate P, the expression A.where(P) returns a new Step that filters the results of A to include only those items that satisfy the predicate P. So, for example, child().where(isElement()) is a step that selects the element children of a node, so that N.select(child().where(isElement())) selects the element children of N. This is sufficiently common that we provide a shorthand: it can also be written N.select(child(isElement())).

The predicate hasLocalName("foo") matches nodes having a local name of "foo": so N.select(child().where(hasLocalName("foo"))) selects the relevant children. Again this is so common that we provide a shorthand: N.select(child("foo")). There is also a two-argument version child(ns, "foo") which selects children with a given namespace URI and local name.

Another useful predicate is exists(step) which tests whether the result of applying a given step returns at least one item. So, for example N.select(child().where(exists(attribute("id")))) returns those children of N that have an attribute named "id".

The result of the select() method is always a stream of items, so you can use methods from the Java Stream class such as filter() and flatMap() to process the result. Here are some of the standard things you can do with a stream of items in Java:

  • You can get the results as an array: N.select(child()).toArray()
  • Or as a list: N.select(child()).collect(Collectors.toList())
  • You can apply a function to each item in the stream: N.select(child()).forEach(System.err::println)
  • You can get the first item in the stream: N.select(child()).findFirst().get()

However, Saxon methods such as select() always return a subclass of Stream called XdmStream, and this offers additional methods. For example:

  • You can get the results as an XdmValue: N.select(child()).asXdmValue()
  • A more convenient way to get the results as a Java List: N.select(child()).asList()
  • If you know that the stream contains a single node (or nothing), you can get this using the methods asNode() or asOptionalNode()
  • Similarly, if you know that the stream contains a single atomic value (or nothing), you can get this using the methods asAtomic() or asOptionalAtomic()
  • You can get the last item in the stream: N.select(child("para")).last()

More about Steps

The actual definition of the Step class is:

public abstract class Step<T extends XdmItem> implements Function<XdmItem, Stream<? extends T>>

What that means is that it's a function that takes any XdmItem as input, and delivers a stream of items of type T as its result (where T is XdmItem or some possibly more specific subclass). (I experimented with also parameterizing the class on the type of items accepted, but that didn't work out well.)

Because the types are defined, Java can make type inferences: for example it knows that N.select(child()) will return nodes (because child() is a step that returns nodes).

As a user of this API, you can define your own kinds of Step if you want to: but most of the time you will be able to do everything you need with the standard Steps available from the class net.sf.saxon.s9api.streams.Steps. The standard steps include:

  • The axis steps ancestor(), ancestorOrSelf(), attribute(), child(), descendant(), descendantOrSelf(), following(), followingSibling(), namespace(), parent(), preceding(), precedingSibling(), self().
  • For each axis, three filtered versions: for example child("foo") filters the axis to select elements by local name (ignoring the namespace if any); child(ns, local) filters the axis to select elements by namespace URI and local name, and child(predicate) filters the axis using an arbitrary predicate: this is a shorthand for child().where(predicate).
  • A composite step can be constructed using the method step1.then(step2). This applies step2 to every item in the result of step1, retaining the order of results and flattening them into a single stream.
  • A filtered step can be constructed using the method step1.where(predicate1). This selects those items in the result of step1 for which predicate1 returns true.
  • A path with several steps can be constructed using a call such as path(child(isElement()), attribute("id")). This returns a step whose effect is to return the id attributes of all the element children of the target node.
  • If the steps are sufficiently simple, a path can also be written by means of a simple micro-syntax similar to XPath abbreviated steps. The previous example could also be written path("*", "@id"). Again, this returns a step that can be used like any other step. (In my own applications, I have found myself using this approach very extensively.)
  • The step atomize() extracts the typed values of nodes in the input, following the rules in the XPath specification. The result is a stream of atomic values.
  • The step toString() likewise extracts the string values, while toNumber() has the same effect as the XPath number() function.

Last but not least, xpath(path) returns a Step that evaluates an XPath expression. For example, doc.select(xpath("//foo")) has the same effect as doc.select(descendant("foo")). A second argument to the xpath() method may be used to supply a static context for the evaluation. Note that compilation of the XPath expression occurs while the step is being created, not while it is being evaluated; so if you bind the result of xpath("//foo") to a variable, then the expression can be evaluated repeatedly without recompilation.

More about Predicates

The Predicate class is a standard Java 8 interface: it is a function that takes any object as input, and returns a boolean. You can use any predicates you like with this API, but the class net.sf.saxon.s9api.streams.Predicates provides some implementations of Predicate that are particularly useful when navigating XML documents. These include the following:

  • isElement(), isAttribute(), isText(), isComment(), isDocument(), isProcessingInstruction(), isNamespace() test that the item is a node of a particular kind
  • hasName("ns", "local"), hasLocalName("n"), and hasNamespaceUri("ns") make tests against the name of the node
  • hasType(t) tests the type of the item: for example hasType(ItemType.DATE) tests for atomic values of type xs:date
  • exists(step) tests whether the result of applying the given step is a sequence containing at least one item; conversely empty(step) tests whether the result of the step is empty. For example, exists(child()) is true for a node that has children.
  • some(step, predicate) tests whether at least one item selected by the step satisfies the given predicate. For example, some(child(), isElement()) tests whether the item is a node with at least one element child. Similarly every(step, predicate) tests whether the predicate is true for every item selected by the step.
  • eq(string) tests whether the string value of the item is equal to the given string; while eq(double) does a numeric comparison. A two-argument version eq(step, string) is shorthand for some(step, eq(string)). For example, descendant(eq(attribute("id"), "ABC")) finds all descendant elements having an "id" attribute equal to "ABC".
  • Java provides standard methods for combining predicates using and, or, and not. For example isElement().and(eq("foo")) is a predicate that tests whether an item is an element with string-value "foo".
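These combinators come from java.util.function.Predicate itself, so they work on any predicate, as this JDK-only sketch shows (the predicates here are illustrative):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PredicateCombinators {
    public static void main(String[] args) {
        Predicate<String> nonEmpty = s -> !s.isEmpty();
        Predicate<String> startsWithX = s -> s.startsWith("X");

        // and(), or(), and negate() are default methods on Predicate
        Predicate<String> candidate = nonEmpty.and(startsWithX.negate());

        List<String> kept = List.of("XTSE0010", "FOER0000", "", "FODC0002").stream()
                .filter(candidate)
                .collect(Collectors.toList());
        System.out.println(kept); // [FOER0000, FODC0002]
    }
}
```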

The XdmStream class

The fact that all this machinery is built on Java 8 streams and functions is something that many users can safely ignore; they are essential foundations, but they are hidden below the surface. At the same time, a user who understands that steps and predicates are Java Functions, and that the result of the select() method is a Java Stream, can take advantage of this knowledge.

One of the key ideas that made this possible was the idea of subclassing Stream with XdmStream. This idea was shamelessly stolen from the open-source StreamEx library by Tagir Valeev (though no StreamEx code is actually used). Subclassing Stream enables additional methods to be provided to handle the results of the stream, avoiding the need for clumsy calls on the generic collect() method. Another motivating factor here is to allow for early exit (short-circuit evaluation) when a result can be delivered without reading the whole stream. Saxon handles this by registering onClose() handlers with the stream pipeline, so that when the consumer of the stream calls the XdmStream.close() method, the underlying supplier of data to the stream is notified that no more data is needed.
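The short-circuiting behaviour described here can be observed with plain JDK streams: a short-circuiting terminal operation pulls only as many items as it needs, and closing the stream notifies the supplier via onClose(). (A JDK-only sketch, not Saxon code.)

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class EarlyExitDemo {
    public static void main(String[] args) {
        AtomicInteger supplied = new AtomicInteger();
        AtomicBoolean closed = new AtomicBoolean();

        // A large lazy stream; peek() counts how many items are actually pulled
        try (Stream<Integer> s = Stream.iterate(0, n -> n + 1)
                .limit(1_000_000)
                .peek(n -> supplied.incrementAndGet())
                .onClose(() -> closed.set(true))) {
            // findFirst() is a short-circuiting terminal operation:
            // the source supplies items only until the predicate first matches
            int first = s.filter(n -> n > 10).findFirst().orElseThrow();
            System.out.println(first + " found after " + supplied.get() + " items");
        } // try-with-resources closes the stream, firing the onClose handler

        System.out.println("closed = " + closed.get());
    }
}
```

Far fewer than a million items are drawn from the source, and the supplier learns via the close handler that no more data is needed.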

Examples

This section provides some examples extracted from an actual program that uses s9api interfaces and does a mixture of Java navigation and XPath and XQuery processing to extract data from an input document.

First, some very simple examples. Constructs like this are not uncommon:

XdmNode testInput = (XdmNode) xpath.evaluateSingle("test", testCase);

This can be replaced with the much simpler and more efficient:

XdmNode testInput = testCase.selectFirst(child("test"));

Similarly, the slightly more complex expression:

XdmNode principalPackage = (XdmNode) xpath.evaluateSingle("package[@role='principal']", testInput);


becomes:

XdmNode principalPackage = testInput.selectFirst(child("package").where(eq(attribute("role"), "principal")));

A more complex example from the same application is this one:

boolean definesError = xpath.evaluate("result//error[starts-with(@code, 'XTSE')]", testCase).size() > 0;

Note here how the processing is split between XPath code and Java code. This is also using an XPath function for which we haven't provided a built-in predicate in s9api. But that's no problem, because we can invoke Java methods as predicates. So this becomes:

boolean definesError = testCase.selectFirst(child("result"), descendant("error").where(
                     some(attribute("code"), (XdmNode n) -> n.getStringValue().startsWith("XTSE")))) != null;

Capturing Accumulatorshttps://dev.saxonica.com/blog/mike/2018/03/capturing-accumulators.html2018-03-28T08:09:03Z

A recent post on StackOverflow made me realise that streaming accumulators in XSLT 3.0 are much harder to use than they need to be.

A reminder about what accumulators do. The idea is that as you stream your way through a large document, you can have a number of tasks running in the background (called accumulators) which observe the document as it goes past, and accumulate information which is then available to the "main" line of processing in the foreground. For example, you might have an accumulator that simply keeps a note of the most recent section heading in a document; that's useful because the foreground processing can't simply navigate around the document to find the current section heading when it finds that it's needed.

Accumulator rules can fire either on start tags or end tags or both, or they can be associated with text nodes or attributes. But there's a severe limitation: a streaming accumulator must be motionless: that's XSLT 3.0 streaming jargon to say that it can only see what's on the parser's stack at the time the accumulator triggers. This affects both the pattern that controls when the accumulator is triggered, and the action that it can take when the rule fires.

For example, you can't fire a rule with the pattern match="section[title='introduction']" because navigation to child elements (title) is not allowed in a motionless pattern. Similarly, if the rule fires on  match="section", then you can't access the title in the rule action (select="title") because the action too must be motionless. In some cases a workaround is to have an accumulator that matches the text nodes (match="section/title/text()[.='introduction']") but that doesn't work if section titles can have mixed content.

It turns out there's a simple fix, which I call a capturing accumulator rule. A capturing accumulator rule is indicated by the extension attribute <xsl:accumulator-rule saxon:capture="yes" phase="end">, which will always be a rule that fires on an end-element tag. For a capturing rule, the background process listens to all the parser events that occur between the start tag and the end tag, and uses these to build a snapshot copy of the node. A snapshot copy is like the result of the fn:snapshot function - it's a deep copy of the matched node, with ancestor elements and their attributes tagged on for good measure. This snapshot copy is then available to the action part of the rule processing the end tag. The match patterns that trigger the accumulator rule still need to be motionless, but the action part now has access to a complete copy of the element (plus its ancestor elements and their attributes).

Here's an example. Suppose you've got a large document like the XSLT specification, and you want to produce a sorted glossary at the end, and you want to do it all in streamed mode. Scattered throughout the document are term definitions like this:

<termdef id="dt-stylesheet" term="stylesheet">A  <term>stylesheet</term> consists of one or more packages: specifically, one
   <termref def="dt-top-level-package">top-level package</termref> and zero or
   more <termref def="dt-library-package">library packages</termref>.</termdef>

Now we can write an accumulator which simply accumulates these term definitions as they are encountered:

<xsl:accumulator name="terms" initial-value="()" streamable="yes">
    <xsl:accumulator-rule match="termdef" phase="end" select="($value, .)" saxon:capture="yes"/>
</xsl:accumulator>

(the select expression here takes the existing value of the accumulator, $value, and appends the snapshot of the current termdef element, which is available as the context item ".")

And now, at the end of the processing, we can output the glossary like this:

<xsl:template match="/" mode="streamable-mode">
        <!-- main foreground processing goes here -->
        <xsl:apply-templates mode="#current"/>
        <!-- now output the glossary -->
        <div id="glossary" class="glossary">
            <xsl:apply-templates select="accumulator-after('terms')" mode="glossary">
                <xsl:sort select="@term" lang="en"/>
            </xsl:apply-templates>
        </div>
</xsl:template>

The value of the accumulator is a list of snapshots of termdef elements, and because these are snapshots, the processing at this point does not need to be streamable (snapshots are ordinary trees held in memory).

The amount of memory needed to accomplish this is whatever is needed to hold the glossary entries. This follows the design principle behind XSLT 3.0 streaming, which was not to do just those things that required zero working memory, but to enable the programmer to do things that weren't purely streamable, while having control over the amount of memory needed.

I think it's hard to find an easy way to tackle this particular problem without the new feature of capturing accumulator rules, so I hope it will prove a useful extension.

I've implemented this for Saxon 9.9. Interestingly, it only took about 25 lines of code: half a dozen to enable the new extension attribute, half a dozen to allow it to be exported to SEF files and re-imported, two or three to change the streamability analysis, and a few more to invoke the existing streaming implementation of the snapshot function from the accumulator watch code. Testing and documenting the feature was a lot more work than implementing it.

Here's a complete stylesheet that fleshes out the creation of a (skeletal) glossary:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:saxon="http://saxon.sf.net/"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:f="http://accum001/"
  exclude-result-prefixes="#all" version="3.0">

  <!-- Stylesheet to produce a glossary using capturing accumulators -->
  <!-- The source document is a W3C specification in xmlspec format, containing
    term definitions in the form <termdef term="banana">A soft <termref def="fruit"/></termdef> -->
  <!-- This test case shows the essential principles of how to render such a document
    in streaming mode, with an alphabetical glossary of defined terms at the end -->

  <xsl:param name="streamable" static="yes" select="'yes'"/>

  <xsl:accumulator name="glossary" as="element(termdef)*" initial-value="()" streamable="yes">
    <xsl:accumulator-rule match="termdef" phase="end" saxon:capture="yes" select="($value, .)"/>
  </xsl:accumulator>

  <xsl:mode streamable="yes" on-no-match="shallow-skip" use-accumulators="glossary"/>

  <xsl:template name="main">
    <xsl:source-document href="xslt.xml" streamable="yes" use-accumulators="glossary">
      <xsl:apply-templates select="."/>
    </xsl:source-document>
  </xsl:template>

  <xsl:template match="/">
    <!-- First render the body of the document -->
    <xsl:apply-templates/>
    <!-- Now generate the glossary -->
    <xsl:apply-templates select="accumulator-after('glossary')" mode="glossary">
      <xsl:sort select="@term" lang="en"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="div1|inform-div1">
    <div id="{@id}">
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <!-- Main document processing: just output the headings -->
  <xsl:template match="div1/head | inform-div1/head">
    <xsl:attribute name="title" select="."/>
  </xsl:template>

  <!-- Glossary processing -->
  <xsl:mode name="glossary" streamable="no"/>

  <xsl:template match="termdef" mode="glossary">
    <xsl:value-of select="@term"/>
    <xsl:value-of select="."/>
  </xsl:template>

</xsl:stylesheet>
Diagnostics on Type Errorshttps://dev.saxonica.com/blog/mike/2018/03/diagnostics-on-type-errors.html2018-03-16T15:50:27Z

Providing good diagnostics for programming errors has always been a high priority in Saxon, second only to conformance with the W3C specifications. One important area of diagnostics is reporting on type errors: that is, cases where a particular context requires a value of a given type, and the supplied value is the wrong type. A classic example would be providing a string as the first argument to format-date(), which requires an xs:date to be supplied.

Of course, the more programmers follow the discipline of declaring the expected types of function parameters and variables, the more helpful the compiler can be in diagnosing programming errors caused by supplying the wrong type of value.

Type errors can be detected statically or dynamically. Saxon uses "optimistic type checking". 

At compile time, if a value of type R is required in a particular context, and the expression appearing in that context is E, then the compiler attempts to infer the static type of expression E: call this S. Sometimes this is straightforward: for example, if E is a call on the node-name() function, then it knows that S is xs:QName. In other cases the compiler has to be smarter: for example, it knows that the static type of a call on remove() is the same as the static type of the first argument, with an adjustment to the occurrence indicator.

Optimistic type checking reports an error at compile time only if there is nothing in common between the required type R and the inferred static type of E: that is, if there is no overlap between the set of instances of the two types. That would mean that a run-time failure is inevitable (assuming the code actually gets executed), and the W3C specifications allow early reporting of such an error.

There's another interesting case where the types overlap only to the extent that both allow an empty sequence: for example if the required type is (xs:string*) and the supplied type is (xs:integer*). That's almost certainly an error, but W3C doesn't allow an error to be reported here because there is a faint chance that execution could succeed. So Saxon reports this as a warning. With maps and arrays, incidentally, there are analogous situations where the only overlap is an empty map or array, but Saxon isn't yet handling that case specially.

If the types aren't completely disjoint, there are two other possibilities: the required type R might subsume the supplied type S, meaning that no run-time type checking is needed because the call will always succeed. The other possibility is that the types overlap: evaluating the supplied expression E might or might not produce a value that matches the required type R. In this case Saxon generates code to perform run-time type checking. (This is one reason why declaring the types of parameters and variables is such good practice: the code runs faster because there is no unnecessary run-time checking.)
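The three-way outcome can be sketched by modelling each type as the set of instances it allows. This is a conceptual illustration only, bearing no resemblance to Saxon's actual type hierarchy:

```java
import java.util.HashSet;
import java.util.Set;

public class OptimisticCheck {
    enum Decision { STATIC_ERROR, NO_CHECK_NEEDED, RUNTIME_CHECK }

    // required: the set of instances of R; supplied: the set of instances of S
    static Decision decide(Set<String> required, Set<String> supplied) {
        Set<String> overlap = new HashSet<>(supplied);
        overlap.retainAll(required);
        if (overlap.isEmpty()) {
            return Decision.STATIC_ERROR;    // disjoint: failure is inevitable
        } else if (required.containsAll(supplied)) {
            return Decision.NO_CHECK_NEEDED; // R subsumes S: always succeeds
        } else {
            return Decision.RUNTIME_CHECK;   // partial overlap: check at run time
        }
    }

    public static void main(String[] args) {
        Set<String> date = Set.of("xs:date");
        Set<String> string = Set.of("xs:string");
        Set<String> anyAtomic = Set.of("xs:date", "xs:string", "xs:integer");
        System.out.println(decide(date, string));     // STATIC_ERROR
        System.out.println(decide(anyAtomic, date));  // NO_CHECK_NEEDED
        System.out.println(decide(date, anyAtomic));  // RUNTIME_CHECK
    }
}
```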

Until recently, the error message for a type error takes the form:

Required item type of CCC is RRR; supplied value has item type SSS

For example:

Required item type of first argument to format-date() is xs:date; supplied value has item type xs:string

which works pretty well in most cases. However, I'm finding that as I write more complex code involving maps and arrays, it's no longer good enough. The problem is that as the types become more complex, simply giving the required and actual types isn't enough to make it clear why they are incompatible. You end up with messages like this one:

Required item type of first argument of local:x() is map(xs:integer, xs:date); supplied value has item type map(xs:anyAtomicType, xs:date).

where an expert user can probably work out that the problem is that the supplied map contains an entry whose key is not an integer; but it doesn't exactly point clearly to the source of the problem.

The problem comes to a head particularly when tuple types are used (see here). If the required type is a tuple type, reporting the supplied type as a map type is particularly unhelpful.

I'm therefore changing the approach: instead of reporting on the supplied type of the value (or the inferred type of the expression, in the case of static errors), I'm reporting an explanation of why it doesn't match. Here's the new version of the message:

The required item type of the first argument of local:x() is map(xs:integer, xs:date); the supplied value map{xs:date("2018-03-16Z"):5, "x":3} does not match. The map contains a key (xs:date("2018-03-16Z")) of type xs:date that is not an instance of the required type xs:integer.

So firstly, I'm outputting the actual value, or an abbreviated form of it, rather than just its type (that only works, of course, for run-time errors). And secondly, I'm highlighting how the type-checker worked out that the value doesn't match the required type: it's saying explicitly which rule was broken.

(Another minor change you can see here is that I'm making more effort to write complete English sentences.)

This doesn't just benefit the new map and array types; you can also see the effect with node types. For example, if the required type is document-node(element(foo)), you might see the message:

The required item type of the first argument of local:x() is document-node(element(Q{}foo)); the supplied value doc() does not match. The supplied document node has an element child (<bar>) that does not satisfy the element test. The node has the wrong name.

Another change I'm making is to distribute type-checking into a sequence constructor. At present, if a function is defined to return (say) a list of element nodes, and the function body contains a sequence of a dozen instructions, one of which returns a text node, you get a message saying that the type of the function result is wrong, but it doesn't pinpoint exactly why. By distributing the type checking (applying the principle that if the function must return element nodes, then each of the instructions must return element nodes) we can (a) identify the instruction in error much more precisely, and (b) avoid the run-time cost of checking the results of those instructions that we know statically are OK.
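To illustrate (a hedged sketch, not taken from the post; f: is an assumed user namespace): given a function declared to return element nodes, distributing the check means each instruction in the body is checked independently, so the offending instruction below can be reported directly, rather than the function result as a whole.

<xsl:function name="f:parts" as="element()*">
   <part-one/>
   <part-two/>
   <!-- creates a text node: with distributed checking, this single instruction
        can be statically reported as violating the required type element()* -->
   <xsl:value-of select="'oops'"/>
</xsl:function>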

Interestingly, all these changes were stimulated by my own recent experience in writing a complex stylesheet. I described the plans for this here and the coding has now been completed (I'll report on the outcome later). It's a classic case of dogfood: if you use your own products in anger, you find ways of improving them that you wouldn't have thought of otherwise, and that users wouldn't have suggested because they don't know what's possible.

Could we write an XSD Schema Processor in XSLT?https://dev.saxonica.com/blog/mike/2018/02/could-we-write-an-xsd-schema-processor-in-xslt.html2018-02-10T18:58:46Z

Many computing platforms are not well-served by up-to-date XML technology, and in consequence Saxonica has been slowly increasing its coverage of the major platforms: extending from Java to .NET, C++, PHP, and JavaScript, using a variety of technical approaches. This makes it desirable to implement as much as possible using portable languages, and if we want to minimize our dependence on third-party technologies (IKVMC, for example, is now effectively unsupported) we should be writing in our own languages, notably XSLT.

This note therefore asks the question, could one write an XSD Schema 1.1 processor in XSLT?

In fact a schema processor has two parts: compile time (compiling schema documents into the schema component model, or SCM) and run-time (validating an instance document using the SCM).

The first part, compiling, seems to pose no intrinsic difficulty. Some of the rules and constraints that need to be enforced are fairly convoluted, but the only really tricky part is compiling grammars into finite-state-machines, and checking grammars (or the resulting finite-state-machine) for conformance with rules such as the Unique Particle Attribution constraint. But since we already have a tool (written in Java) for compiling schemas into an XML-based SCM file, and since it wouldn't really inconvenience users too much for this tool to be invoked via an HTTP interface, the priority for a portable implementation is really the run-time part of the processor rather than the compile-time part. (Note that this means ignoring xsi:schemaLocation, since that effectively causes the run-time validator to invoke the schema compiler.)

There are two ways one could envisage implementing the run-time part in XSLT: either with a universal stylesheet that takes the SCM and the instance document as inputs, or by generating a custom XSLT stylesheet from the SCM, rather as is done with Schematron. For the moment I'll keep an open mind which of these two approaches is preferable.

Ideally, the XSLT stylesheet would use streaming so the instance document being validated does not need to fit in memory. We'll bear this requirement in mind as we look at the detail.

The XSLT code, of course, cannot rely on any services from a schema processor, so it cannot be schema-aware.

Let's look at the main jobs the validator has to do.

Validating strings against simple types

Validating against a primitive type can be done simply using the XPath castable operator.
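For instance, checking a string against xs:date reduces to a single XPath test (a sketch; $value is assumed to hold the string being validated):

$value castable as xs:date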

Validating against a simple type derived by restriction involves checking the various facets. For the most part, the logic of each facet is easily expressed in XPath. There are a few exceptions:

  • Patterns (regular expressions). The XPath regular expression syntax is a superset of the XSD syntax. To evaluate XSD regular expressions, we either need some kind of extension to the XPath matches() function, or we need to translate XSD regular expressions into XPath regular expressions. This translation is probably not too difficult. It mainly involves rejecting some disallowed constructs (such as back-references, non-capturing groups, and reluctant quantifiers), and escaping "^" and "$" with a backslash.

  • Length facets for hexBinary and base64Binary. A base64Binary value can be cast to hexBinary, and the length of the value in octets can be computed by converting to string and dividing the string length by 2.

Validating against a list type can be achieved by tokenizing, and testing each token against the item type.

Validating against a union type can be achieved by validating against each member type (and also validating against any constraining facets defined at the level of the union itself).
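Sketches of these checks in XPath (illustrative only; $value holds the string being validated, and the particular item/member types are examples):

(: list type with item type xs:decimal :)
every $t in tokenize(normalize-space($value), ' ') satisfies $t castable as xs:decimal

(: union type with member types xs:integer and xs:date :)
$value castable as xs:integer or $value castable as xs:date

(: length in octets of a base64Binary value :)
string-length(string(xs:hexBinary(xs:base64Binary($value)))) idiv 2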

Validating elements against complex types

The only difficult case here is complex content. It should be possible to achieve this by iterating over the child nodes using xsl:iterate, keeping the current state (in the FSM) as the value of the iteration parameter. On completion the element is valid if the state is a final state. As each element is processed, it needs to be checked against the state of its parent element's FSM, and in addition a new validator is established for validating its children. This is all streamable.
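A sketch of that iteration (fsm:transition and fsm:is-final are hypothetical functions over the finite-state machine held in the SCM; fsm:transition returns the next state, or an empty sequence if there is no transition):

<xsl:iterate select="*">
   <xsl:param name="state" select="$start-state"/>
   <xsl:on-completion>
      <xsl:if test="not(fsm:is-final($state))">
         <xsl:sequence select="error((), 'Content is incomplete')"/>
      </xsl:if>
   </xsl:on-completion>
   <xsl:variable name="next" select="fsm:transition($state, node-name(.))"/>
   <xsl:if test="empty($next)">
      <xsl:sequence select="error((), 'Element ' || name() || ' is not allowed here')"/>
   </xsl:if>
   <xsl:next-iteration>
      <xsl:with-param name="state" select="$next"/>
   </xsl:next-iteration>
</xsl:iterate>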

Assertions and Conditional Type Assignment

Evaluating XPath expressions can be achieved using xsl:evaluate. The main difficulty is setting up the node-tree to which xsl:evaluate is applied. This needs to be a copy of the original source subtree, to ensure that the assertion cannot stray outside the relevant subtree. Making this copy consumes the source subtree, which makes streaming tricky: however, the ordinary complex type validation can also happen on the copy, so I think streaming is possible.
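In outline (a sketch, not Saxon's actual code; $assertion is assumed to hold the test expression from the xs:assert component):

<xsl:variable name="copy" as="element()">
   <!-- copy the subtree so the assertion cannot stray outside it -->
   <xsl:copy-of select="."/>
</xsl:variable>
<xsl:variable name="ok" as="xs:boolean">
   <xsl:evaluate xpath="$assertion" context-item="$copy" as="xs:boolean"/>
</xsl:variable>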

Identity constraints (unique, key, keyref)

This is where streaming really gets quite tricky - especially given the complexity of the specification for those rare keyref cases where the key is defined on a different element from the corresponding keyref.

The obvious XSLT mechanism here is accumulators. But accumulator rules are triggered by patterns, and defining the patterns that correspond to the elements involved in a key definition is tricky. For example if sections nest recursively, a uniqueness constraint might say that for every section, its child section elements must have unique @section-number attributes. A corresponding accumulator would have to maintain a stack of sections, with a map of section numbers at each level of the stack, and the accumulator rule for a section would need to check the section number of that section at the current level, and start a new level.
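Such an accumulator might be sketched like this (hypothetical, and ignoring the multiple-declarations problem discussed below): the value is a stack of maps, the head map recording the section numbers already seen at the current level; a real implementation would report a duplicate @section-number rather than silently overwriting the entry.

<xsl:accumulator name="section-numbers" as="map(*)*" initial-value="()">
   <!-- on entry: record this section's number at the current level, then push a new level -->
   <xsl:accumulator-rule match="section" phase="start"
       select="map{},
               if (exists($value))
               then (map:put(head($value), string(@section-number), true()), tail($value))
               else ()"/>
   <!-- on exit: pop the level for this section's children -->
   <xsl:accumulator-rule match="section" phase="end"
       select="tail($value)"/>
</xsl:accumulator>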

A further complication is that there may be multiple (global and/or local) element declarations with the same name, with different unique / key / keyref constraints. Deciding which of these apply by means of XSLT pattern matching is certainly difficult and may be impossible.

The multiple xs:field elements within a constraint do not have to match components of the key in document order, but a streamed implementation would still be possible using the map constructor, which allows multiple downward selections - provided that the xs:field selector expressions are themselves streamable, which I think is probably always the case.

The problem of streamability could possibly be solved with some kind of dynamic pipelining. The "main" validation process, when it encounters a start tag, is able to establish which element declaration it belongs to, and could in principle spawn another transformation (processing the same input stream) for each key / unique constraint defined in that element declaration: a kind of dynamic xsl:fork.

I think as a first cut it would probably be wise not to attempt streaming in the case of a schema that uses unique / key / keyref constraints. More specifically, if any element has such constraints, it can be deep-copied, and validation can then switch to the in-memory subtree rather than the original stream. After all, we have no immediate plans to implement streaming other than in the Java product, and that will inevitably make an XSLT-based schema processor on other platforms unstreamed anyway.

Outcome of validation

There are two main scenarios we should support: validity checking, and type annotation. With validity checking we want to report many invalidities in a single validation episode, and the main output is the validation report. With type annotation, the main output is a validated version of the instance document, and a single invalidity can cause the process to terminate with a dynamic error.

It is not possible for a non-schema-aware stylesheet to add type annotations to the result tree without some kind of extensions. The XSLT language only allows type annotations to be created as the result of schema validation. So we will need an extension for this purpose: perhaps a saxon:type-annotation="QName" attribute on instructions such as xsl:element, xsl:copy, xsl:attribute.

For reporting validation errors, it's important to report the location of the invalidity. This also requires extensions, such as saxon:line-number().


I don't think there are any serious obstacles to writing a validation engine in XSLT. Making it streamable is harder, especially for integrity constraints. A couple of extensions are needed: the ability to add type annotations to the result tree, and the ability to get line numbers of nodes in the source.

I still have an open mind about whether a universal stylesheet should be used, or a generated stylesheet for a particular schema.

Transforming JSONhttps://dev.saxonica.com/blog/mike/2017/11/transforming-json.html2017-11-13T13:02:13Z

In my conference paper at XML Prague in 2016 (https://www.saxonica.com/papers/xmlprague-2016mhk.pdf) I examined a couple of use cases for transforming JSON structures using XSLT 3.0. The overall conclusion was not particularly encouraging: the easiest way to achieve the desired results was to convert the JSON to XML, transform the XML, and then convert it back to JSON.

Unfortunately this study came too late to get any new features into XSLT 3.0. However, I've been taking another look at the use cases to see whether we could design language extensions to handle them, and this is looking quite encouraging.

Use case 1: bulk update

We start with the JSON document

[ {
  "id": 3, "name": "A blue mouse", "price": 25.50,
  "dimensions": {"length": 3.1, "width": 1.0, "height": 1.0},
  "warehouseLocation": {"latitude": 54.4, "longitude": -32.7 }},
  {
  "id": 2, "name": "An ice sculpture", "price": 12.50,
  "tags": ["cold", "ice"],
  "dimensions": {"length": 7.0, "width": 12.0, "height": 9.5 },
  "warehouseLocation": {"latitude": -78.75, "longitude": 20.4 }
} ]

and the requirement: for all products having the tag "ice", increase the price by 10%, leaving all other data unchanged. I've prototyped a new XSLT instruction that allows this to be done as follows:

<saxon:deep-update
    root="json-doc('input.json')"
    select="?*[?tags?* = 'ice']"
    action="map:put(., 'price', ?price * 1.1)"/>

How does this work?

First the instruction evaluates the root expression, which in this case returns the map/array representation of the input JSON document. With this root item as context item, it then evaluates the select expression to obtain a sequence of contained maps or arrays to be updated: these can appear at any depth under the root item. With each of these selected maps or arrays as the context item, it then evaluates the action expression, and uses the returned value as a replacement for the selected map or array. This update then percolates back up to the root item, and the result of the instruction is a map or array that is the same as the original except for the replacement of the selected items.

The magic here is in the way that the update is percolated back up to the root. Because maps and arrays are immutable and have no persistent identity, the only way to do this is to keep track of the maps and arrays selected en-route from the root item to the items selected for modification as we do the downward selection, and then modify these maps and arrays in reverse order on the way back up. Moreover we need to keep track of the cases where multiple updates are made to the same containing map or array. All this magic, however, is largely hidden from the user. The only thing the user needs to be aware of is that the select expression is constrained to use a limited set of constructs when making downward selections.

The select expression select="?*[?tags?* = 'ice']" perhaps needs a little bit of explanation. The root of the JSON tree is an array of maps, and the initial ?* turns this into a sequence of maps. We then want to filter this sequence of maps to include only those where the value of the "tags" field is an array containing the string "ice" as one of its members. The easiest way to test this predicate is to convert the value from an array of strings to a sequence of strings (so ?tags?*) and then use the XPath existential "=" operator to compare with the string "ice".

The action expression map:put(., 'price', ?price * 1.1) takes as input the selected map, and replaces it with a map in which the price entry is replaced with a new entry having the key "price" and the associated value computed as the old price multiplied by 1.1.

Use case 2: Hierarchic Inversion

The second use case in the XML Prague 2016 paper was a hierarchic inversion (aka grouping) problem. Specifically: we'll look at a structural transformation changing a JSON structure with information about the students enrolled for each course to its inverse, a structure with information about the courses for which each student is enrolled.

Here is the input dataset:

[
    {
        "faculty": "humanities",
        "courses": [
            {
                "course": "English",
                "students": [
                    { "first": "Mary", "last": "Smith", "email": "mary_smith@gmail.com" },
                    { "first": "Ann", "last": "Jones", "email": "ann_jones@gmail.com" }
                ]
            },
            {
                "course": "History",
                "students": [
                    { "first": "Ann", "last": "Jones", "email": "ann_jones@gmail.com" },
                    { "first": "John", "last": "Taylor", "email": "john_taylor@gmail.com" }
                ]
            }
        ]
    },
    {
        "faculty": "science",
        "courses": [
            {
                "course": "Physics",
                "students": [
                    { "first": "Anil", "last": "Singh", "email": "anil_singh@gmail.com" },
                    { "first": "Amisha", "last": "Patel", "email": "amisha_patel@gmail.com" }
                ]
            },
            {
                "course": "Chemistry",
                "students": [
                    { "first": "John", "last": "Taylor", "email": "john_taylor@gmail.com" },
                    { "first": "Anil", "last": "Singh", "email": "anil_singh@gmail.com" }
                ]
            }
        ]
    }
]
The goal is to produce a list of students, sorted by last name then first name, each containing a list of courses taken by that student, like this:

[ ...
  { "email": "anil_singh@gmail.com",
    "courses": ["Physics", "Chemistry" ]},
  { "email": "john_taylor@gmail.com",
    "courses": ["History", "Chemistry" ]},
  ...
]

The classic way of handling this is in two phases: first reduce the hierarchic input to a flat sequence in which all the required information is contained at one level, and then apply grouping to this flat sequence.

To achieve the flattening we introduce another new XSLT instruction:

<saxon:tabulate-maps
    root="json-doc('input.json')"
    select="?* ! map:find(., 'students')?*"/>

Again the root expression delivers a representation of the JSON document as an array of maps. The select expression first selects these maps ("?*"), then for each one it calls map:find() to get an array of maps each representing a student. The result of the instruction is a sequence of maps corresponding to these student maps in the input, where each output map contains not only the fields present in the input (first, last, email), but also fields inherited from parents and ancestors (faculty, course). For good measure it also contains a field _keys containing an array of keys representing the path from root to leaf, but we don't actually use that in this example.

Once we have this flat structure, we can construct a new hierarchy using XSLT grouping:

<xsl:for-each-group select="$students" group-by="?email">
    <xsl:map>
        <xsl:map-entry key="'email'" select="?email"/>
        <xsl:map-entry key="'first'" select="?first"/>
        <xsl:map-entry key="'last'" select="?last"/>
        <xsl:map-entry key="'courses'">
            <saxon:array>
                <xsl:for-each select="current-group()">
                    <saxon:array-member select="?course"/>
                </xsl:for-each>
            </saxon:array>
        </xsl:map-entry>
    </xsl:map>
</xsl:for-each-group>

This can then be serialized using the JSON output method to produce the required output.
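The JSON serialization is requested in the standard XSLT 3.0 way:

<xsl:output method="json" indent="yes"/>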

Note: the saxon:array and saxon:array-member instructions already exist in Saxon 9.8. They fill an obvious gap in the XSLT 3.0 facilities for handling arrays - a gap that exists largely because the XSL WG was unwilling to create a dependency on XPath 3.1.

Use Case 3: conversion to HTML

This use case isn't in the XML Prague paper, but is included here for completeness.

The aim here is to construct an HTML page containing the information from a JSON document, without significant structural alteration. This is a classic use case for the recursive application of template rules, so the aim is to make it easy to traverse the JSON structure using templates with appropriate match patterns.

Unfortunately, although the XSLT 3.0 facilities allow patterns that match maps and arrays, they are cumbersome to use. Firstly, the syntax is awkward:

match=".[. instance of map(...)]"

We can solve this with a Saxon extension allowing the syntax

match="map(...)"
Secondly, the type of a map isn't enough to distinguish one map from another. To identify a map representing a student, for example, we aren't really interested in knowing that it is a map(xs:string, item()*). What we need to know is that it has fields (email, first, last). Fortunately another Saxon extension comes to our aid: tuple types, described here: http://dev.saxonica.com/blog/mike/2016/09/tuple-types-and-type-aliases.html With tuple types we can change the match pattern to

match="tuple(email, first, last)"

Even better, we can use type aliases:

<saxon:type-alias name="student" type="tuple(email, first, last)"/>
<xsl:template match="~student">...</xsl:template>

With this extension we can now render this input JSON into HTML using the stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    expand-text="yes">

<saxon:type-alias name="faculty" type="tuple(faculty, courses)"/>
<saxon:type-alias name="course" type="tuple(course, students)"/>
<saxon:type-alias name="student" type="tuple(first, last, email)"/>

<xsl:template match="~faculty">
    <h1>{?faculty} Faculty</h1>
    <xsl:apply-templates select="?courses?*"/>
</xsl:template>

<xsl:template match="~course">
    <h2>{?course} Course</h2>
    <p>List of students:</p>
    <xsl:apply-templates select="?students?*">
        <xsl:sort select="?last"/>
        <xsl:sort select="?first"/>
    </xsl:apply-templates>
</xsl:template>

<xsl:template match="~student">
    <td>{?first} {?last}</td>
</xsl:template>

<xsl:template name="xsl:initial-template">
    <xsl:apply-templates select="json-doc('courses.json')"/>
</xsl:template>

</xsl:stylesheet>



With only the facilities of the published XSLT 3.0 recommendation, the easiest way to transform JSON is often to convert it first to XML node trees, and then use the traditional XSLT techniques to transform the XML, before converting it back to JSON.

With a few judiciously chosen extensions to the language, however, a wide range of JSON transformations can be achieved natively.

Bugs: How well are we doing?https://dev.saxonica.com/blog/mike/2017/02/bugs-how-well-are-we-doing.html2017-02-05T17:36:01Z

We're about to ship another Saxon 9.7 maintenance release, with another 50 or so bug clearances. The total number of patches we've issued since 9.7 was released in November 2015 has now reached almost 450. The number seems frightening and the pace is relentless. But are we getting it right, or are we getting it badly wrong?

There are frequently-quoted but poorly-sourced numbers you can find on the internet suggesting a norm of 10-25 bugs per thousand lines of code. Saxon is 300,000 lines of (non-comment) code, so that would suggest we can expect a release to have 3000 to 7500 bugs in it. On one measure, that suggests we're doing a lot better than the norm. Or it could also mean that most of the bugs haven't been found yet.

I'm very sceptical of such numbers. I remember a mature product in ICL that was maintained by a sole part-time worker, handling half a dozen bugs a month. When she went on maternity leave, the flow of bugs magically stopped. No-one else could answer the questions, so users stopped sending them in. The same happens with Oracle and Microsoft. I submitted a Java bug once, and got a response 6 years later saying it was being closed with no action. When that happens, you stop sending in bug reports. So in many ways, a high number of bug reports doesn't mean you have a buggy product, it means you have a responsive process for responding to them. I would hate the number of bug reports we get to drop because people don't think there's any point in submitting them.

And of course the definition of what is a bug is completely slippery. Very few of the bug reports we get are completely without merit, in the sense that the product is doing exactly what it says on the tin; at the same time, rather few are incontrovertible bugs either. If diagnostics are unhelpful, is that a bug?

The only important test really is whether our users are satisfied with the reliability of the product. We don't really get enough feedback on that at a high level. Perhaps we should make more effort to find out; but I so intensely hate completing customer satisfaction questionnaires myself that I'm very reluctant to inflict them on our users. Given that open source users outnumber commercial users by probably ten-to-one, and that the satisfaction of our open source users is just as important to us as the satisfaction of our commercial customers (because it's satisfied open source users who do all the sales work for us); and given that we don't actually have any way of "reaching out" to our open source users (how I hate the marketing jargon); and given that we really wouldn't know what to do differently if we discovered that 60% of our users were "satisfied or very satisfied": I don't really see very much value in the exercise. But I guess putting a survey form on the web site wouldn't be difficult, and some people might interpret it as a signal that we actually care.

With 9.7 there was a bit of a shift in policy towards fixing bugs pro-actively (more marketing speak). In particular, we've been in a phase where the XSLT and XQuery specs were becoming very stable but more test cases were becoming available all the time (many of them, I might add, contributed by Saxonica - often in reaction to queries from our users). So we've continuously been applying new tests to the existing release, which is probably a first. Where a test showed that we were handling edge cases incorrectly, and indeed when the spec was changed in little ways under our feet, we've raised bugs and fixes to keep the conformance level as high as possible (while also maintaining compatibility). So we've shifted the boundary a little between feature changes (which traditionally only come in the next release), and bug fixes, which come in a maintenance release. That shift also helps to explain why the gap between releases is becoming longer - though the biggest factor holding us back, I think, is the ever-increasing amount of testing that we do before a release.

Fixing bugs pro-actively (that is before any user has hit the bug) has the potential to improve user satisfaction if it means that they never do hit the bug. I think it's always as well to remember also that for every user who reports a bug there may be a dozen users who hit it and don't report it. One reason we monitor StackOverflow is that a lot of users feel more confident about reporting a problem there, rather than reporting it directly to us. Users know that their knowledge is limited and they don't want to make fools of themselves, and you need a high level of confidence to tell your software vendor that you think the product is wrong. 

On the other hand, destabilisation is a risk. A fix in one place will often expose a bug somewhere else, or re-awaken an old bug that had been laid to rest. As a release becomes more mature, we try to balance the benefits of fixing problems with the risk of de-stabilisation.

So, what about testing? Can we say that because we've fixed 450 bugs, we didn't run enough tests in the first place?

Yes, in a sense that's true, but how many more tests would we have had to write in order to catch them? We probably run about a million test cases (say, 100K tests in an average of ten product configurations each) and these days the last couple of months before a major release are devoted exclusively to testing. (I know that means we don't do enough continuous testing. But sorry, it doesn't work for me. If we're doing something radical to the internals of the product then things are going to break in the process, and my style is to get the new design working while it's still fresh in my head, then pick up the broken pieces later. If everything had to work in every nightly build, we would never get the radical things done. That's a personal take, and of course what works with a 3-4 person team doesn't necessarily work with a larger project. We're probably pretty unusual in developing a 300Kloc software package with 3-4 people, so lots of our experience might not extrapolate.)

We've had a significant number of bug reports this time on performance regression. (This is of course another area where it's arguable whether it's a bug or not. Sometimes we will change the design in a way that we know benefits some workloads at the expense of others.) Probably most of these are extreme scenarios, for example compilation time for stylesheets where a single template declares 500 local variables. Should we have run tests to prevent that? Well, perhaps we should have more extreme cases in our test suite: the vast majority of our test cases are trivially small. But the problem is, there will always be users who do things that we would never have imagined. Like the user running an XSD 1.1 schema validation in which tens of thousands of assertions are expected to "fail", because they've written it in such a way that assertion failures aren't really errors, they are just a source of statistics for reporting on the data.

The bugs we hate most (and therefore should do most to prevent) are bugs in bytecode generation, streaming, and multi-threading. The reason we hate them is that they can be a pig to debug, especially when the user-written application is large and complex.

  • For bytecode generation I think we've actually got pretty good test coverage, because we not only run every test in the QT3 and XSLT3 test suites with bytecode generation enabled, we also artificially complicate the tests to stop queries like 2+5 being evaluated by the compiler before bytecode generation kicks in. We've also got an internal recovery mechanism so if we detect that we've generated bad code, we fall back to interpreted mode and the user never notices (problem with that is of course that we never find out).
  • Streaming is tricky because the code is so convoluted (writing everything as inverted event-based code can be mind-blowing) and because the effects of getting it wrong often give very little clue as to the cause. But at least the failure is "in your face" for the user, who will therefore report the problem, and it's likely to be reproducible. Another difficulty with streaming is that because not all code is streamable, tests for streaming needed to be written from scratch.
  • Multi-threading bugs are horrible because they occur unpredictably. If there's a low probability of the problem happening then it can require a great deal of detective work to isolate the circumstances, and this often falls on the user rather than on ourselves. Fortunately we only get a couple of these a year, but they are a nightmare when they come. In 9.7 we changed our Java baseline to Java 6 and were therefore able to replace much of the hand-built multithreading code in Saxon with standard Java libraries, which I think has helped reliability a lot. But there are essentially no tools or techniques to protect you from making simple thread-safety blunders, like setting a property in a shared object without synchronization. Could we do more testing to prevent these bugs? I'm not optimistic, because the bugs we get are so few, and so particular to a specific workload, that searching the haystack just in case it contains a needle is unlikely to be effective.
Summary: Having the product perceived as reliable by our users is more important to us than the actual bug count. Fixing bugs quickly before they affect more users is probably the best way of achieving that. If the bug count is high because we're raising bugs ourselves as a result of our own testing, then that's no bad thing. It hasn't yet got to the level where we can't cope with the volumes, or where we have to filter things through staff who are only employed to do support. If we can do things better, let us know.

Guaranteed Streamabilityhttps://dev.saxonica.com/blog/mike/2016/12/guaranteed-streamability.html2016-12-09T21:56:27Z

The XSLT 3.0 specification in its current form provides a set of rules (that can be evaluated statically, purely by inspecting the stylesheet) for determining whether the code is (or is not) guaranteed streamable.

If the code is guaranteed streamable then every processor (if it claims to support streaming at all) must use streaming to evaluate the stylesheet; if it is not guaranteed streamable then the processor can choose whether to use streaming or not.

The tricky bit is that there's a requirement in the spec that if the code isn't guaranteed streamable, then a streaming processor (on request) has to detect this and report it. The status section of the spec says that this requirement is "at risk", meaning it might be removed if it proves too difficult to implement. There are people on the working group who believe passionately that this requirement is really important for interoperability; there are others (including me) who fully understand why users would like to have this, but have been arguing that it is extremely difficult to deliver.

In this article I'm going to try to explain why it's so difficult to achieve this requirement, and to explore possibilities for overcoming these difficulties.

Streamability analysis can't be performed until various other stages of static analysis are complete. It generally requires that names have been resolved (for example, names of modes and names of streamable functions). It also relies on rudimentary type analysis (determining the static type of constructs). For Saxon, this means that streamability analysis is done after parsing, name fixup, type analysis, and rewrite optimization.

When Saxon performs these various stages of analysis, it modifies the expression tree as it goes: not just to record the information obtained from the analysis, but to make use of the information at execution time. It goes without saying that in modifying the expression tree, it's not permitted to replace a streamable construct with a non-streamable one, and that isn't too hard to achieve (though these things are relative...). But the requirement to report departures from guaranteed streamability imposes a second requirement, which is proving much harder. If we are to report any deviations from guaranteed streamability, then up to the point where we do the streamability analysis, we must never replace a non-streamable construct with a streamable one.

There are various points at which we currently replace a non-streamable construct with a streamable one.

  • Very early in the process, the expression tree output by the parsing phase uses the same data structure to represent equivalent constructs in the source. For example, the expression tree produced by <xsl:if test="$a=2"><xsl:sequence select="3"/></xsl:if> will be identical to the expression tree produced by <xsl:sequence select="if ($a=2) then 3 else ()"/>. But streamability analysis makes a distinction between these two constructs. It's not a big distinction (in fact, the only thing it affects is exactly where you are allowed to call the accumulator-after() function) but it's big enough to count.
  • At any stage in the process, if we spot a constant expression then we're likely to replace it with its value. For example if we see the expression $v+3, and $v is a global variable whose value is 5, we will replace the expression with the literal 8. This won't usually affect streamability one way or the other. However, there are a few cases where it does. The most obvious is where we work out that an expression is void (meaning it always returns an empty sequence). For example, according to the spec, the expression (author[0], author[1]) is not streamable because it makes two downward selections. But Saxon spots that author[0] is void and rewrites the expression as (author[1]), which is streamable. Void expressions often imply some kind of user error, so we often output a warning when this happens, but just because we think the user has written nonsense doesn't absolve us from the conformance requirement to report on guaranteed streamability. Void expressions are particularly likely to be found with schema-aware analysis.
  • Inlining of calls to user-defined functions will often make a non-streamable expression streamable.
  • Many other rewrites performed by the optimizer have a similar effect, for example replacing (X|Y) by *[self::X|self::Y].
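To make the void-expression case above concrete, the rewrite in question looks like this (XPath; the comments are mine):

```xpath
(: Per the spec's rules, this expression makes two downward selections,
   so it is not guaranteed streamable: :)
(author[0], author[1])

(: Saxon spots that author[0] is void - XPath positions start at 1, so a
   [0] predicate can never select anything - and rewrites the expression as: :)
(author[1])

(: ...which makes only one downward selection, and is streamable. :)
```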
My first attempt to meet the requirement is therefore (a) to add information to the expression tree where it's needed to maintain a distinction that affects streamability, and (b) to try to avoid those rewrites that turn non-streamable expressions into streamable ones. As a first cut, skipping the optimization phase completely seems an easy way to achieve (b). But it turns out that's not sufficient: firstly because some rewrites are done during the type-checking phase, and secondly because without an optimization pass we end up finding that some expressions that should be streamable are not. The most common case for this is sorting into document order. Given the expression A/B, Saxon actually builds an expression in the form sort(A!B), relying on the sort operation to sort nodes into document order and eliminate duplicates. This relies on the subsequent optimization phase to eliminate the sort() operation when it can. If we skip the optimization phase, we are left with an unstreamable expression.
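In pseudo-notation (sort() here is the internal operation described above, not a standard function; the `!` map operator is real XPath 3.0):

```xpath
A/B           (: as written by the user :)

sort(A ! B)   (: as initially built by Saxon: evaluate B for each A, then
                 sort into document order and eliminate duplicates; the
                 optimizer later removes sort() when it can prove the
                 results are already in document order :)
```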

The other issue is that the streamability rules rely on type inferencing rules that are much simpler than the rules Saxon uses. It's only in rare cases that this will make a difference, of course: in fact, it requires considerable ingenuity to come up with such cases. The most obvious case where types make a difference to streamability is with a construct like <xsl:value-of select="$v"/>: this is motionless if $v is a text or attribute node, but consuming if it is a document or element node. If a global variable with private visibility is initialized with select="@price", but has no "as" attribute, Saxon will infer a type of attribute(price) for the variable, but the rules in the spec will infer a type of item()*. So to get the same streamability answer as the spec gives, we need to downgrade the static type inferencing in Saxon.
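A sketch of the kind of case in question (the variable name is illustrative):

```xslt
<!-- No "as" attribute: the spec's inferencing rules give $price the static
     type item()*, so the xsl:value-of below must be classed as consuming.
     Saxon's own analysis would infer attribute(price), which would make
     it motionless. Same stylesheet, different streamability verdict. -->
<xsl:variable name="price" select="@price" visibility="private"/>

<xsl:value-of select="$price"/>
```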

So I think the changes needed to replicate exactly the streamability rules of the XSLT 3.0 spec are fairly disruptive; moreover, implementing the changes by searching for all the cases that need to change is going to be very difficult to get right (and is very difficult to test unless there is another trustworthy implementation of the rules to test against).

This brings us to Plan B. Plan B is to meet the requirement by writing a free-standing tool for streamability analysis, completely separate from the current static analysis code. One way to do this would be to build on the tool written by John Lumley and demonstrated at Balisage a couple of years ago. Unfortunately that's incomplete and out of date, so finishing it would be a significant effort. Meeting the requirement in the spec is different here from doing something useful for users: what the spec demands is a yes/no answer as to whether the code is streamable; what users want to know is why, and what they need to change to make the code streamable. The challenge is to do this without users having to understand the difficult abstractions in the spec (posture, sweep, and the rest). John's tool produces an annotated expression tree revealing all the properties: that's great for a user who understands the methodology, but probably rather bewildering to the typical end user. Doing the minimum for conformance - a tool that just says yes or no without saying why - involves a lot of work to get a "tick in the box" with a piece of software that no-one will ever use, but it would be a lot easier to produce. Conformance has always been a very high priority for Saxonica, but I can't see anyone being happy with this particular solution.

So, assuming the WG maintains its insistence on having this feature (and it seems to me likely that it will), what should we do about it?

One option is simply to declare a non-conformance. Once upon a time, standards conformance was very important to Saxon's reputation in the market, but I doubt that this particular non-conformance would affect our sales.

Another option is to declare conformance, do our best to achieve it using the current analysis technology, and simply log bugs if anyone reports use cases where we get the answer wrong. That seems sloppy and dishonest, and could leave us with a continuing stream of bugs to be fixed or ignored.

Another option is the "minimal Plan B" analyser - a separate tool for streamability analysis that simply reports a yes/no answer (without explanation). It would be a significant piece of work to create and test this, and it's unclear that anyone would use it, but it's probably the cheapest way of getting the conformance tick-in-the-box.

A final option is to go for a "fully featured" but free-standing streamability analysis tool, one which aims to not only answer the conformance question about guaranteed streamability, but also to provide genuinely useful feedback and advice helping users to create streamable stylesheets. Of course ideally such a tool would be integrated into an IDE rather than being free-standing. I've always argued that there's only a need for one such tool: it's not something that every XSLT 3.0 processor needs to provide. Doing this well would be a large project and involves different skills from those we currently have available.

In the short term, I think the only honest and affordable approach would be the first option: declare a non-conformance. Unfortunately that could threaten the viability of the spec, because we can only get a spec to Recommendation status if all features have been shown to be implementable.

No easy answers.


I've been thinking about a Plan C which might fly...

The idea here is to try to do the streamability analysis using the current expression tree structure and the current streamability logic, but applying the streamability rules to an expression tree that faithfully represents the stylesheet as parsed, with no modifications from type checking or optimization.

To do this, we need to:

  • Define a configuration flag --strictStreamability which invokes the following logic.
  • Fix places where the initial expression tree loses information that's needed for streamability analysis. The two that come to mind are (a) losing the information that something is an instruction rather than an expression (e.g. we lose the distinction between xsl:map-entry and a singleton map expression) - this distinction is needed to assess calls on accumulator-after(); (b) turning path expressions A/B into docSort(A!B). There may be other cases that we will discover along the road (or fail to discover, since we may not have a complete set of test cases...)
  • Write a new type checker that attaches type information to this tree according to the rules in the XSLT 3.0 spec. This will be much simpler than the existing type checker, partly because the rules are much simpler, but more particularly because the only thing it will do is to assign static types: it will never report any type errors, and it will never inject any code to do run-time type checking or conversion.
  • Immediately after this type-checking phase, run the existing streamability rules against the expression tree. As far as I'm aware, the streamability rules in Saxon are equivalent to the W3C rules (at any rate, most of the original differences have now been eliminated).

There are then two options. We could stop here: if the user sets the --strictStreamability flag, they get the report on streamability, but they don't get an executable that can actually be run. The alternative would be, if the streamability analysis succeeds, to attempt to convert the expression tree into a form that we can actually use, by running the existing simplify / typecheck / optimize phases. The distinctions introduced to the expression tree by the changes described above would be eliminated by the simplify() phase, and we would then proceed along the current lines, probably including a rerun of the streamability analysis against the optimised expression tree (because the posture+sweep annotations are occasionally needed at run-time).

I will do some further exploration to see whether this all looks feasible. It will be very hard to prove that we've got it 100% right. But in a sense that doesn't matter: so long as the design is sound and we're passing known tests, we can report honestly that to the best of our knowledge the requirement is satisfied, which is not the case with the current approach.

Tuple types, and type aliaseshttps://dev.saxonica.com/blog/mike/2016/09/tuple-types-and-type-aliases.html2016-09-08T11:44:15Z

I've been experimenting with some promising Saxon extensions.

Maps and arrays greatly increase the flexibility and power of the XPath / XSLT / XQuery type system. But one drawback is that the type declarations can be very cumbersome, and very uninformative.

Suppose you want to write a library to handle arithmetic on complex numbers. How are you going to represent a complex number? There are several possibilities: as a sequence of two doubles (xs:double*); as an array of two doubles (array(xs:double)); or as a map, for example map{"r": 0.0e0, "i": 0.0e0} (which has type map(xs:string, xs:double)).
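For concreteness, here is the value 2+3i in each of the three candidate representations, sketched in XPath 3.1 (the variable names are invented):

```xpath
let $as-seq := (2.0e0, 3.0e0),                 (: xs:double*                :)
    $as-arr := [2.0e0, 3.0e0],                 (: array(xs:double)          :)
    $as-map := map{"r": 2.0e0, "i": 3.0e0}     (: map(xs:string, xs:double) :)
return $as-map?r                               (: the real part             :)
```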

Note that whichever of these choices you make, (a) your choice is exposed to the user of your library by the way you declare the type in your function signatures, (b) the type allows many values that aren't legitimate representations of complex numbers, and (c) there's nothing in the type declaration that tells the reader of your code that this has anything to do with complex numbers.

I think we can tackle these problems with two fairly simple extensions to the language.

First, we can define type aliases. For XSLT, I have implemented an extension that allows you to declare (as a top-level element anywhere in the stylesheet):

<saxon:type-alias name="complex"
                  type="map(xs:string, xs:double)"/>

and then you can use this type alias (prefixed by a tilde) anywhere an item type is allowed, for example

<xsl:variable name="i" as="~complex" 
              select="cx:complex(0.0, 1.0)"/>

Secondly, we can define tuple types. So we can instead define our complex numbers as:

<saxon:type-alias name="complex" 
                  type="tuple(r: xs:double, i: xs:double)"/>

We're not actually introducing tuples here as a fundamental new type with their own set of functions and operators. Rather, a tuple declaration defines constraints on a map. It lists the keys that must be present in the map, and the type of the value to be associated with each key. The keys here are the strings "r" and "i", and in both cases the value must be an xs:double. The keys are always NCNames, which plays well with the map lookup notation M?K; if $c is a complex number, then the real and imaginary parts can be referenced as $c?r and $c?i respectively.
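For example, a multiplication function over this representation might look like this (a sketch: the function name is invented, and it assumes a namespace binding for the cx prefix and the saxon:type-alias declaration above in scope):

```xslt
<xsl:function name="cx:multiply" as="~complex">
  <xsl:param name="a" as="~complex"/>
  <xsl:param name="b" as="~complex"/>
  <!-- (ar + ai*i)(br + bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i -->
  <xsl:sequence select="map{
      'r': $a?r * $b?r - $a?i * $b?i,
      'i': $a?r * $b?i + $a?i * $b?r }"/>
</xsl:function>
```

Any call that passes a map lacking an "r" or "i" entry, or with values that aren't xs:double, can be rejected by the type checker rather than failing obscurely at lookup time.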

For this kind of data structure, tuple types provide a much more precise constraint over the contents of the map than the current map type does. It also provides much better static type checking: an expression such as $c?i can be statically checked (a) to ensure that "i" is actually a defined field in the tuple declaration, and (b) that the expression is used in a context where an xs:double value is expected.

I've been a little wary in the past of putting syntax extensions into Saxon; conformance to standards has always been a primary goal. But the standards process seems to be running out of steam, and I'm beginning to feel that it's time to push a few innovative ideas out in product to keep things moving forward. For those who would prefer to stick entirely to stuff defined by W3C, rest assured that these features will only be available if you explicitly enable extensions.

Improving Compile-Time Performancehttps://dev.saxonica.com/blog/mike/2016/06/improving-compile-time-performance.html2016-06-22T10:04:28Z

For years we've been putting more and more effort into optimizing queries and stylesheets so that they would execute as fast as possible. For many workloads, in particular high throughput server-side transformations, that's a good strategy. But over the last year or two we've become aware that for some other workloads, it's the wrong thing to do.

For example, if you're running a DocBook or DITA transformation from the command line, and the source document is only a couple of KB in size, then the time taken to compile the stylesheet greatly exceeds the actual transformation time. It might take 5 seconds to compile the stylesheet, and 50 milliseconds to execute it. (Both DocBook and DITA stylesheets are vast.) For many users, that's not an untypical scenario.

If we look at the XMark benchmarks, specifically a query such as Q9, which is a fairly complex three-way join, the query executes against a 10Mb source document in just 9ms. But to achieve that, we spend 185ms compiling and optimizing the query. We also spend 380ms parsing the source document. So in an ad-hoc processing workflow, where you're compiling the query, loading a source document, and then running a query, the actual query execution cost is about 2% of the total. But it's that 2% that we've been measuring, and trying to reduce.

We haven't entirely neglected the other parts of the process. For example, one of the most under-used features of the product is document projection, which enables you, during parsing, to filter out the parts of the document that the query isn't interested in. For query Q9 that cuts down the size of the source document by 65%, and reduces the execution time of the query to below 8ms. Unfortunately, although the memory saving is very useful, it actually increases the parsing time to 540ms. Some cases are even more dramatic: with Q2, the size of the source document is reduced by 97%; but parsing is still slowed down by the extra work of deciding which parts of the document to retain, and since the query only takes 2ms to execute anyway, there's no benefit other than the memory saving.

For the DocBook and DITA scenarios (unlike XMark) it's the stylesheet compilation time that hurts, rather than the source document parsing time. For a typical DocBook transformation of a small document, I'm seeing a stylesheet compile time of around 3 seconds, source document parsing time of around 0.9ms, and transformation time also around 0.9ms. Clearly, compile time here is far more important than anything else.

The traditional answer to this has always been to compile the stylesheet once and then use it repeatedly. That works if you're running hundreds of transformations using the same stylesheet, but there are many workflows where this is impractical.

Saxon 9.7 makes a big step forward by allowing the compiled form of a stylesheet to be saved to disk. This work was done as part of the implementation of XSLT 3.0 packages, but it doesn't depend on packages in any way and works just as well with 1.0 and 2.0 stylesheets. If we export the DocBook stylesheets as a compiled package, and then run from this version rather than from source, the time taken for loading the compiled stylesheet is around 550ms rather than the original 3 seconds. That's a very useful saving especially if you're processing lots of source documents using a pipeline written say using a shell script or Ant build where the tools constrain you to run one transformation at a time. (To ensure that exported stylesheet packages work with tools such as Ant, we've implemented it so that in any API where a source XSLT stylesheet is accepted, we also accept an exported stylesheet package).

But the best performance improvements are those where you don't have to do anything different to get the benefits (cynically, only about 2% of users will ever read the release notes.) So we've got a couple of further projects in the pipeline.

The first is simply raw performance tuning of the optimizer. There's vast potential for this once we turn our minds to it. What we have today has grown organically, and the focus has always been on getting the last ounce of run-time performance regardless of how long it takes to achieve it. One approach is to optimize a bit less thoroughly: we've done a bit of that recently in response to a user bug report showing pathological compilation times on an extremely large (20Mb) automatically generated stylesheet. But a better approach is to think harder about the data structures and algorithms we are using.

Over the last few days I've been looking at how we do loop-lifting: that is, identifying subexpressions that can be moved out of a loop because each evaluation will deliver the same result. The current approach is that the optimizer does a recursive walk of the expression tree, and at each node in the tree, the implementation of that particular kind of expression looks around to see what opportunities there are for local optimization. Many of the looping constructs (xsl:for-each, xsl:iterate, for expressions, filter expressions, path expressions) at this point initiate a search of the subtree for expressions that can be lifted out of the loop. This means that with nested loops (a) we're examining the same subtrees once for each level of loop nesting, and (b) we're hoisting the relevant expressions up the tree one loop at a time, rather than moving them straight to where they belong. This is not only a performance problem; the code is incredibly complex, it's hard to debug, and it's hard to be sure that it's doing as effective a job as it should (for example, I only found during this exercise that we aren't loop-lifting subexpressions out of xsl:for-each-group.)

In 9.7, as reported in previous blog posts, we made some improvements to the data structures used for the expression tree, but so far we've been making rather little use of this. One improvement was to add parent pointers, which enables optimizations to work bottom-up rather than top-down. Another improvement was a generic structure for holding the links from a parent node to its children, using an Operand object that (a) holds properties of the relationship (e.g. it tells you when the child expression is evaluated with a different focus from the parent), and (b) is updatable, so a child expression can replace itself by some different expression without needing the parent expression to get involved.

These two improvements have enabled a complete overhaul of the way we do loop-lifting. Without knowing anything about the semantics of different kinds of expressions, we can now do a two-phase process: first we do a scan over the expression tree for a function or template to identify, for each node in the tree, what its "innermost scoping node" is: for example an expression such as "$i + @x" is scoped both by the declaration of $i and by the instruction (e.g. xsl:for-each) that sets the focus, and the innermost scoping expression is the inner one of these two. Then, in a second pass, we hoist every expression that's not at the same looping level as its innermost scoping expression to be evaluated (lazily) outside that loop. The whole process is dramatically simpler and faster than what we were doing before, and at least as effective - possibly in some cases more so.
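A simple example of the kind of expression this analysis handles (the element and variable names are invented for illustration):

```xslt
<xsl:variable name="key" select="'Widget'"/>

<xsl:for-each select="//item">
  <!-- lower-case($key) depends only on $key, not on the focus set by
       xsl:for-each, so its innermost scoping node is the declaration of
       $key: it can be lifted out of the loop and evaluated once.
       lower-case(name) depends on the focus, so it must stay inside. -->
  <xsl:if test="lower-case(name) = lower-case($key)">
    <xsl:copy-of select="."/>
  </xsl:if>
</xsl:for-each>
```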

The other project we're just starting on is to look at just-in-time compilation. The thing about stylesheets like DocBook is that they contain zillions of template rules for processing elements which typically don't appear in your average source document. So why waste time compiling template rules that are never used? All we really need to do is make a note of the match patterns, build the data structures we use to identify which rule is the best match for a node, and then do the work of compiling that rule the first time it is used. Indeed, the optimization and byte-code generation work can be deferred until we know that the rule is going to be used often enough to make it worthwhile. We're starting this project (as one should start all performance projects) by collecting instrumentation, so we can work out exactly how much time we are spending in each phase of compilation; that will tell us how much we should be doing eagerly and how much we should defer. There's a trade-off with usability here: do users want to be told about errors found while type-checking parts of the stylesheet that aren't actually exercised by a particular run?

Plenty of ideas to keep us busy for a while to come.

Introducing Saxon-JShttps://dev.saxonica.com/blog/mike/2016/02/introducing-saxon-js.html2016-02-13T14:15:04Z

At XML Prague yesterday we got a spontaneous round of applause when we showed the animated Knight's tour application, reimplemented to use XSLT 3.0 maps and arrays, running in the browser using a new product called Saxon-JS.

So, people will be asking, what exactly is Saxon-JS?

Saxon-EE 9.7 introduces a new option -export which allows you to export a compiled stylesheet, in XML format, to a file: rather like producing a .so file from a C compiler, or a JAR file from a Java compiler. The compiled stylesheet isn't executable code, it's a decorated abstract syntax tree containing, in effect, the optimized stylesheet execution plan. There are two immediate benefits: loading a compiled stylesheet is much faster than loading the original source code, so if you are executing the same stylesheet repeatedly the cost of compilation is amortized; and in addition, it enables you to distribute XSLT code to your users with a degree of intellectual property protection analogous to that obtained from compiled code in other languages. (As with Java, it's not strong encryption - it wouldn't be too hard to write a fairly decent decompiler - but it's strong enough that most people won't attempt it.)

Saxon-JS is an interpreter, written in pure Javascript, that takes these compiled stylesheet files and executes them in a Javascript environment - typically in the browser, or on Node.js. Most of our development and testing is actually being done using Nashorn, a Javascript engine bundled with Java 8, but that's not a serious target environment for Saxon-JS because if you've got Nashorn then you've got Java, and if you've got Java then you don't need Saxon-JS.

Saxon-JS can also be seen as a rewrite of Saxon-CE. Saxon-CE was our first attempt at doing XSLT 2.0 in the browser. It was developed by producing a cut-down version of the Java product, and then cross-compiling this to Javascript using Google's GWT cross-compiler. The main drawbacks of Saxon-CE, at a technical level, were the size of the download (800Kb or so), and the dependency on GWT which made testing and debugging extremely difficult - for example, there was no way of testing our code outside a browser environment, which made running of automated test scripts very time-consuming and labour-intensive. There were also commercial factors: Saxon-CE was based on a fork of the Saxon 9.3 Java code base and re-basing to a later Saxon version would have involved a great deal of work; and there was no revenue stream to fund this work, since we found a strong expectation in the market that this kind of product should be free. As a result we effectively allowed the product to become dormant.

We'll have to see whether Saxon-JS can overcome these difficulties, but we think it has a better chance. Because it depends on Saxon-EE for the front-end (that is, there's a cost to developers but the run-time will be free) we're hoping that there'll be a revenue stream to finance support and ongoing development; and although the JS code is not just a fork but a complete rewrite of the run-time code, the fact that it shares the same compiler front end means that it should be easier to keep in sync.

Development has been incredibly rapid - we only started coding at the beginning of January, and we already have about 80% of the XSLT 2.0 tests running - partly because Javascript is a powerful language, but mainly because there's little new design involved. We know how an XSLT engine works, we only have to decide which refinements to leave out. We've also done client-side XSLT before, so we can take the language extensions of Saxon-CE (how to invoke templates in response to mouse events, for example), the design of its Javascript APIs, and also some of its internal design (like the way event bubbling works), and reimplement these for Saxon-JS.
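For readers unfamiliar with Saxon-CE: its event handling binds a template rule to a DOM event using a mode in the ixsl namespace. A rough sketch (the match pattern and the named template are invented, and the exact Saxon-JS syntax may differ in detail):

```xslt
<!-- Invoked when the user clicks the matched element; assumes the ixsl
     prefix is bound to the Saxon interactive-XSLT extension namespace -->
<xsl:template match="button[@id = 'next-move']" mode="ixsl:onclick">
  <xsl:call-template name="show-next-move"/>
</xsl:template>
```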

One of the areas where we have to make design trade-offs is deciding how much standards conformance, performance, and error diagnostics to sacrifice in the interests of keeping the code small. There are some areas where achieving 100% conformance with the W3C specs will be extremely difficult, at least until JS6 is available everywhere: an example is support for Unicode in regular expressions. For performance, memory usage (and therefore expression pipelining) is important, but getting the last ounce of processor efficiency less so. An important factor (which we never got quite right for Saxon-CE) is asynchronous access to the server for the doc() and document() functions - I have ideas on how to do this, but it ain't easy.

It will be a few weeks before the code is robust enough for an alpha release, but we hope to get this out as soon as possible. There will probably then be a fairly extended period of testing and polishing - experience suggests that when the code is 90% working, you're less than half way there.

I haven't yet decided on the licensing model. Javascript by its nature has no technical protection, but that doesn't mean we have to give it an open source license (which would allow anyone to make changes, or to take parts of the code for reuse in other projects).

All feedback is welcome: especially on opportunities for exploiting the technology in ways that we might not have thought of.

Parent pointers in the Saxon expression treehttps://dev.saxonica.com/blog/mike/2015/09/parent-pointers-in-the-saxon-expression-tree.html2015-09-11T19:38:54Z

A while ago (http://dev.saxonica.com/blog/mike/2014/11/redesigning-the-saxon-expression-tree.html) I wrote about my plans for the Saxon expression tree. This note is an update.

We've made a number of changes to the expression tree for 9.7.

  • Every node in the tree (every expression) now references a Location object, providing location information for diagnostics (line number, column number, etc). Previously the expression node implemented the SourceLocator interface, which meant it provided this information directly. The benefit is that we can now have different kinds of Location object. In XQuery we will typically hold the line and column and module URI. In XSLT, for a subexpression within an XPath expression, we can now hold both the offset within the XPath expression, and the path to the containing node within the XSLT stylesheet. Hopefully debuggers and editing tools such as oXygen and Stylus Studio will be able to take advantage of the improved location information to lead users straight to the error location in the editor. Where an expression has the same location information as its parent or sibling expressions, the Location object is shared.

    Another reason for changing the way we hold location information is connected with the move to separately-compiled packages in XSLT 3.0. This means that the system we previously used, of globally-unique integer "location identifiers" which are translated into real location information by reference to a central "location provider" service, is no longer viable.

  • Every node in the tree now points to a RetainedStaticContext object which holds that part of the static context which can vary from one expression to another, and which can be required at run-time. Previously we only attempted to retain the parts of the static context that each kind of expression actually used. The parts of the static context that this covers include the static base URI, in-scope namespaces, the default collation, and the XPath 1.0 compatibility flag. Retaining the whole static context might seem extravagant. But in fact, it very rarely changes, so a child expression will nearly always point to the same RetainedStaticContext object as its parent and sibling expressions.

  • Every node in the tree now points to its parent node. This choice has proved tricky. It gives many advantages: it means that the code for every expression can easily find details of the containing package, the configuration options, and a host of details about the query or stylesheet as a whole. The fact that we have a parent node eliminates the need for the "container" object (typically the containing function or template) which we held in previous releases. It also reduces the need to pass additional information to methods on the Expression class, for example methods to determine the item type and cardinality of the expression. There is a significant downside to holding this information, which is the need to keep it consistent. Some of the tree rewrite operations performed by the optimizer are complex enough without having to worry about keeping all the parent pointers correct. And it turns out to be quite difficult to enforce consistency through the normal "private data, public methods" encapsulation techniques: those work when you have to keep the data in a single object consistent, but they aren't much use for maintaining mutual consistency between two different objects. In any case it seems to be unavoidable that to achieve the kind of tree rewrites we want to perform, the tree has to be temporarily inconsistent at various stages.

    Using parent pointers means that you can't share subtrees. It means that when you perform operations like inlining a function, you can't just reference the subtree that formed the body of the function, you have to copy it. This might seem a great nuisance. But actually, this is not a new constraint. It never was safe to share subtrees, because the optimiser would happily make changes to a subtree without knowing that there were other interested parties. The bugs this caused have been an irritation for years. The introduction of parent pointers makes the constraint more explicit, and makes it possible to perform integrity checking on the tree to discover when we have inadvertently violated the constraints.

    During development we've had diagnostic code switched on that checks the integrity of the tree and outputs warnings if problems are found. We've gradually been examining these and eliminating them. The problems can be very hard to diagnose, because the detection of a problem in the data may indicate an error that occurred in a much earlier phase of processing. We've developed some diagnostic tools for tracing the changes made to a particular part of the tree and correlating these with the problems detected later. Most of the problems, as one might expect, are connected with optimization rewrites. A particular class of problem occurs with rewrites that are started but then not completed (because problems are found), or with "temporary" rewrites that are designed to create an equivalent expression suitable for analysis (say for streamability analysis or for schema-aware static type-checking) but which are not actually intended to affect the run-time interpreted tree. The discipline in all such cases is to copy the part of the tree you want to work on, rather than making changes in-situ.

    For some non-local rewrites, such as loop-lifting optimizations, the best strategy seems to be to ignore the parent pointers until the rewrite is finished, and then restore them during a top-down tree-walk.

    The fact that we now have parent pointers makes context-dependent optimizations much easier. Checking, for example, whether a variable reference occurs within a loop (a "higher-order expression" as the XSLT 3.0 spec calls it) is now much easier: it can be done by searching upwards from the variable reference rather than retaining context information in an expression visitor as you walk downwards. Similarly, if there is a need to replace one expression by another (a variable reference by a literal constant, say), the fact that the variable reference knows its own parent makes the substitution much easier.
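The upward search described here can be sketched in a few lines. This is an illustrative model, not Saxon's actual classes: the names Expr, ForEachExpr, and VarRefExpr are invented for the example.

```java
// Minimal sketch of the parent-pointer idea: each expression node knows its
// parent, so a context-dependent check walks upwards from the node itself
// instead of threading context down through an expression visitor.
abstract class Expr {
    Expr parent;

    // Establish the parent link when a child is attached to the tree.
    void adopt(Expr child) {
        child.parent = this;
    }

    // Walk up the ancestors looking for an enclosing loop, as for the
    // "higher-order expression" check mentioned above.
    boolean isWithinLoop() {
        for (Expr e = parent; e != null; e = e.parent) {
            if (e instanceof ForEachExpr) {
                return true;
            }
        }
        return false;
    }
}

class ForEachExpr extends Expr { }
class VarRefExpr extends Expr { }
```

The cost of this convenience, as the post notes, is that every rewrite must keep the parent links consistent.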

    So although the journey has had a few bumps, I'm reasonably confident that we will see long-term benefits.

Lazy Evaluationhttps://dev.saxonica.com/blog/mike/2015/06/lazy-evaluation.html2015-06-28T23:15:13Z

We've seen some VM dumps recently that showed evidence of contention problems when multiple threads (created, for example, using <xsl:for-each> with the saxon:threads attribute) were attempting lazy evaluation of the same local variable. So I've been looking at the lazy evaluation code in Saxon to try and understand all the permutations of how it works. A blog posting is a good way to try and capture that understanding before I forget it all again. But I won't go into the extra complexities of parallel execution just yet: I'll come back to that at the end.

Lazy evaluation applies when a variable binding, for example "let $v := //x[@y=3]" isn't evaluated immediately when the variable declaration is encountered, but only when the variable is actually referenced. This is possible in functional languages because evaluating an expression has no side-effects, so it doesn't matter when (or how often) it is done. In some functional languages such as Scheme, lazy evaluation happens only if you explicitly request it. In others, such as Haskell, lazy evaluation is mandated by the language specification (which means that a variable can hold an infinite sequence, so long as you don't try to process its entire value). In XSLT and XQuery, lazy evaluation is entirely at the discretion of the compiler, and in this post I shall try to summarize how Saxon makes use of this freedom.

Internally, when a local variable is evaluated lazily, Saxon instead of putting the variable's value in the relevant slot on the stack, will instead put a data structure that contains all the information needed to evaluate the variable: that is, the expression itself, and any part of the evaluation context on which it depends. In Saxon this data structure is called a Closure. The terminology isn't quite right, because it's not quite the same thing as the closure of an inline function, but the concepts are closely related: in some languages, lazy evaluation is implemented by storing, as the value of the variable, not the variable's actual value, but a function which delivers that value when invoked, and the data needed by this function to achieve that task is correctly called a closure. (If higher-order functions had been available in Saxon a few years earlier, we might well have implemented lazy evaluation this way.)
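The function-plus-closure approach mentioned at the end of that paragraph can be sketched with a memoizing wrapper around a Supplier. This is a hypothetical illustration of the general technique, not Saxon's Closure class: the expression and its captured context are represented here simply by the lambda's captured variables.

```java
import java.util.function.Supplier;

// Sketch of lazy evaluation via a higher-order function: the variable's
// value is represented by a function that computes it on first use and
// memoizes the result. The captured state of the Supplier plays the role
// of the saved evaluation context.
final class Lazy<T> {
    private Supplier<T> supplier;     // the "expression" plus captured context
    private T value;                  // memoized result, once evaluated
    private boolean evaluated = false;

    Lazy(Supplier<T> supplier) {
        this.supplier = supplier;
    }

    T get() {
        if (!evaluated) {
            value = supplier.get();   // evaluate on first reference only
            evaluated = true;
            supplier = null;          // let the captured context be collected
        }
        return value;
    }
}
```

A binding such as `let $v := //x[@y=3]` would then store a `Lazy` whose supplier evaluates the path expression; if `$v` is never referenced, the expression is never evaluated.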

We can distinguish two levels of lazy evaluation. We might use the term "deferred evaluation" to indicate that a variable is not evaluated until it is first referenced, and "incremental evaluation" to indicate that when it is referenced, it is only evaluated to the extent necessary. For example, if the first reference is the function call head($v), only the first item in the sequence $v will be evaluated; remaining items will only be evaluated if a subsequent reference to the variable requires them.

Lazy evaluation can apply to global variables, local variables, parameters of templates and functions, and return values from templates and functions. Saxon handles each case slightly differently.

We should mention some static optimizations which are not directly related to lazy evaluation, but are often confused with it. First, a variable that is never referenced is eliminated at compile-time, so its initializing expression is never evaluated at all. Secondly, a variable that is only referenced once, and where the reference is not in any kind of loop, is inlined: that is, the variable reference is replaced by the expression used to initialize the variable, and the variable itself is then eliminated. So when someone writes "let $x := /a/b/c return $x[d=3]", Saxon turns this into the expression "(/a/b/c)[d=3]". (Achieving this of course requires careful attention to the static and dynamic context, but we won't go into the details here.)

Another static optimization that interacts with variable evaluation is loop-lifting. If an expression within a looping construct (for example the content of xsl:for-each, or of a predicate, or the right-hand-side of the "/" operator) will have the same value for every iteration of the loop, then a new local variable bound to this expression is created outside the loop, and the original expression is replaced by a reference to the variable. In this situation we need to take care that the expression is not evaluated unless the loop is executed at least once (both to avoid wasted evaluation cost, and to give the right behaviour in the event that evaluating the expression fails with a dynamic error.) So lazy evaluation of such a variable becomes mandatory.

The combined effect of these static optimizations, together with lazy evaluation, is that the order of evaluation of expressions can be quite unintuitive. To enable users to understand what is going on when debugging, it is therefore normal for some of these rewrites to be suppressed if debugging or tracing are enabled.

For global variables, Saxon uses deferred evaluation but not incremental evaluation. A global variable is not evaluated until it is first referenced, but at that point it is completely evaluated, and the sequence representing its value is held in memory in its entirety.

For local variables, evaluation is generally both deferred and incremental. However, the rules are quite complex.

  • If the static type shows that the value will be a singleton, then it will be evaluated eagerly. [It's not at all clear that this rule makes sense. Certainly, incremental evaluation makes no sense for singletons. But deferred evaluation could still be very useful, for example if the evaluation is expensive and the variable is only referenced within a branch of a conditional, so the value is not always needed.]

  • Eager evaluation is used when the binding expression is very simple: in particular when it is a literal or a reference to another variable.

  • Eager evaluation is used for binding expressions that depend on position() or last(), to avoid the complexities of saving these values in the Closure.

  • There are some optimizations which take precedence over lazy evaluation. For example if there are variable references using predicates, such as $v[@x=3], then the variable will not only be evaluated eagerly, but will also be indexed on the value of the attribute @x. Another example: if a variable is initialized to an expression such as ($v, x) - that is, a sequence that appends an item to another variable - then we use a "shared append expression" which is a data structure that allows a sequence to be constructed by appending to an existing sequence without copying the entire sequence, which is a common pattern in algorithms using head-tail recursion.

  • Lazy evaluation (and inlining) need special care if the variable is declared outside a try/catch block, but is referenced within it. In such a case a dynamic error that occurs while evaluating the initialization expression must not be caught by the try/catch; it is logically outside its scope. (Writing this has made me realise that this is not yet implemented in Saxon; I have written a test case and it currently fails.)

If none of these special circumstances apply, lazy evaluation is chosen. There is one more choice to be made: between a Closure and a MemoClosure. The common case is a MemoClosure, and in this case, as the variable is incrementally evaluated, the value is saved for use when evaluating subsequent variable references. A (non-memo) closure is used when it is known that the value will only be needed once. Because most such cases have been handled by variable inlining, the main case where a non-memo closure is used is for the return value of a function. Functions, like variables, are lazily evaluated, so that the value returned to the caller is not actually a sequence in memory, but a closure containing all the information needed to materialize the sequence. (Like most rules in this story, there is an important exception: tail-call optimization, where the last thing a function does is to call itself, takes precedence over lazy evaluation).

So let's look more closely at the MemoClosure. A MemoClosure is a data structure that holds the following information:

  • The Expression itself (a pointer to a node in the expression tree). The Expression object also holds any information from the static context that is needed during evaluation, for example namespace bindings.

  • A copy of the dynamic context at the point where the variable is bound. This includes the context item, and values of any local variables referenced by the expression.

  • The current evaluation state: one of UNREAD (no access to the variable has yet been made), MAYBE_MORE (some items in the value of the variable are available, but there may be more to come), ALL_READ (the value of the variable is fully available), BUSY (the variable is being evaluated), or EMPTY (special case of ALL_READ in which the value is known to be an empty sequence).

  • An InputIterator: an iterator over the results of the expression, relevant when evaluation has started but has not finished.

  • A reservoir: a list containing the items delivered by the InputIterator so far.

Many variable references, for example count($v), or index-of($v, 'z') result in the variable being evaluated in full. If this is the first reference to the variable, that is if the state is UNREAD, the logic is essentially

inputIterator = expression.iterate(savedContext);
for (item in inputIterator) {
    reservoir.add(item);
}
state = ALL_READ;
return new SequenceExtent(reservoir);

(However, Saxon doesn't optimize this case, and it occurs to me on writing this that it could.)

Other variable references, such as head($v), or $v[1], or subsequence($v, 1, 5), require only partial evaluation of the expression. In such cases Saxon creates and returns a ProgressiveIterator, and the requesting expression reads as many items from the ProgressiveIterator as it needs. Requests to get items from the ProgressiveIterator fetch items from the reservoir to the extent they are available; on exhaustion of the reservoir, they then attempt to fetch items from the InputIterator until either enough items are available, or the InputIterator is exhausted. Items delivered from the InputIterator are copied to the reservoir as they are found.
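The reservoir-plus-iterator mechanism can be modelled compactly. The sketch below is illustrative, not Saxon's MemoClosure or ProgressiveIterator (and it omits the state codes and thread-safety concerns discussed later): items are pulled from the underlying input iterator only on demand, and cached in the reservoir for subsequent references.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of incremental evaluation: a reservoir caches items pulled from
// the underlying InputIterator, and each reader serves items from the
// reservoir first, extending it on demand.
final class MemoSequence<T> {
    private final Iterator<T> input;                    // the InputIterator
    private final List<T> reservoir = new ArrayList<>();
    private boolean exhausted = false;

    MemoSequence(Iterator<T> input) {
        this.input = input;
    }

    // Fetch the item at position n (0-based), evaluating the underlying
    // expression only as far as needed; null when the sequence is shorter.
    T itemAt(int n) {
        while (reservoir.size() <= n && !exhausted) {
            if (input.hasNext()) {
                reservoir.add(input.next());
            } else {
                exhausted = true;
            }
        }
        return n < reservoir.size() ? reservoir.get(n) : null;
    }

    // A ProgressiveIterator-style view over the (partially evaluated) value.
    Iterator<T> iterate() {
        return new Iterator<T>() {
            private int pos = 0;
            public boolean hasNext() { return itemAt(pos) != null; }
            public T next() { return itemAt(pos++); }
        };
    }
}
```

A reference like `head($v)` would call `itemAt(0)` and evaluate only the first item; a later `count($v)` would drain the input iterator, filling the reservoir completely.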

So far so good. This has all been in place for years, and works well. We have no evidence that it is in any way optimal, but it has been carefully tweaked over the years to deal with particular cases where it was performing badly. What has changed recently is that local variables can be referenced from multiple threads. There are two particular cases where this happens today: when xsl:result-document is used in Saxon-EE, it executes by default asynchronously in a new thread; and when the extension attribute saxon:threads is used on xsl:for-each, the items selected by the xsl:for-each are processed in parallel rather than sequentially.

The effect of this is that the MemoClosure object needs to be thread-safe: multiple requests to access the variable can come simultaneously from different threads. To achieve this a number of methods are synchronized. One of these is the next() method of the ProgressiveIterator: if two threads reference the variable at the same time, each gets its own ProgressiveIterator, and the next() method on one of these iterators is forced to wait until the other has finished.

This works, but it is risky. Brian Goetz in his excellent book Java Concurrency in Practice recommends that a method should not be synchronized unless (a) its execution time is short, and (b) as the author of the method, you know exactly what code will execute while it is active. In this case neither condition is satisfied. The next() method of ProgressiveIterator calls the next() method of the InputIterator, and this may perform expensive computation, for example retrieving and parsing a document using the doc() function. Further, we have no way of analyzing exactly what code is executed: in the worst case, it may include user-written code (for example, an extension function or a URIResolver). The mechanism can't deadlock with itself (because there cannot be a cycle of variable references) but it is practically impossible to prove that it can't deadlock with other subsystems that use synchronization, and in the face of maliciously-written user code, it's probably safe to assume that deadlock can occur. We haven't seen deadlock happen in practice, but it's unsatisfactory that we can't prove its impossibility.

So what should we do about it?

I think the answer is, add yet another exception to the list of cases where lazy evaluation is used: specifically, don't use it for a variable that can be referenced from a different thread. I'm pretty sure it's possible to detect such cases statically, and they won't be very common. In such cases, use eager evaluation instead.

We must be careful not to do this in the case of a loop-lifted variable, where the correct error semantics depend on lazy evaluation. So another tweak to the rules is, don't loop-lift code out of a multithreaded execution block.

This investigation also suggests a few other refinements we might make.

  • It seems worth optimizing for the case where the entire value of a variable is needed, since this case is so common. The problem is, it's not easy to detect this case: a calling expression such as count($v) will ask for an iterator over the variable value, without giving any indication that it intends to read the iterator to completion.

  • We need to reassess the rule that singleton local variables are evaluated eagerly.

  • We currently avoid using lazy evaluation for expressions with certain dependencies on the dynamic context (for example, position() and last()). But in the course of implementing higher-order functions, we have acquired the capability to hold such values in a saved copy of the dynamic context.

  • We could look at a complete redesign that takes advantage of higher-order functions and their closures. This might be much simpler than the current design; but it would discard the benefits of years of fine-tuning of the current design.

  • I'm not convinced that it makes sense for a MemoClosure to defer creation of the InputIterator until the first request for the variable value. It would be a lot simpler to call inputIterator = Expression.iterate(context) at the point of variable declaration; in most cases the implementation will defer evaluation to the extent that this makes sense, and this approach saves the cost of the elaborate code to save the necessary parts of the dynamic context. It's worth trying the other approach and making some performance measurements.

A redesign of the NamePoolhttps://dev.saxonica.com/blog/mike/2015/06/a-redesign-of-the-namepool.html2015-06-24T14:39:33Z

As explained in my previous post, the NamePool in Saxon is a potential problem for scalability, both because access can cause contention, and also because it has serious limits on the number of names it can hold: there's a maximum of one million QNames, and performance starts getting seriously bad long before this limit is reached.

Essentially, the old NamePool is a home-grown hash table. It uses a fixed number of buckets (1024), and when hash collisions occur, the chains of hash duplicates are searched serially. The fact that the number of buckets is fixed, and entries are only added to the end of a chain, is what makes it (reasonably) safe for read access to the pool to occur without locking.

One thing I have been doing over a period of time is to reduce the amount of unnecessary use of the NamePool. Most recently I've changed the implementation of the schema component model so that references from one schema component to another are no longer implemented using NamePool fingerprints. But this is peripheral: the core usage of the NamePool for comparing names in a query against names in a source document will always remain the dominant usage, and we need to make this scaleable as parallelism increases.

Today I've been exploring an alternative design for the NamePool (and some variations on the implementation of the design). The new design has at its core two Java ConcurrentHashMaps, one from QNames to fingerprints, and one from fingerprints to QNames. The ConcurrentHashMap, which was introduced in Java 5, doesn't just offer safe multi-threaded access, it also offers very low contention: it uses fine-grained locking to ensure that multiple writers, and any number of readers, can access the data structure simultaneously.

Using two maps, one of which is the inverse of the other, at first seemed a problem. How can we ensure that the two maps are consistent with each other, without updating both under an exclusive lock, which would negate all the benefits? The answer is that we can't completely, but we can get close enough.

The logic is like this:

private final ConcurrentHashMap<StructuredQName, Integer> qNameToInteger = new ConcurrentHashMap<StructuredQName, Integer>(1000);
private final ConcurrentHashMap<Integer, StructuredQName> integerToQName = new ConcurrentHashMap<Integer, StructuredQName>(1000);
private AtomicInteger unique = new AtomicInteger();

// Allocate fingerprint to QName

Integer existing = qNameToInteger.get(qName);
if (existing != null) {
    return existing;
}
Integer next = unique.getAndIncrement();
existing = qNameToInteger.putIfAbsent(qName, next);
if (existing == null) {
    integerToQName.put(next, qName);
    return next;
} else {
    return existing;
}

Now, there are several things slightly unsafe about this. We might find that the QName doesn't exist in the map on our first look, but by the time we get to the "putIfAbsent" call, someone else has added it. The worst that happens here is that we've used up an integer from the "unique" sequence unnecessarily. Also, someone else doing concurrent read access might see the NamePool in a state where one map has been updated and the other hasn't. But I believe this doesn't matter: clients aren't going to look for a fingerprint in the map unless they have good reason to believe that fingerprint exists, and it's highly implausible that this knowledge comes from a different thread that has only just added the fingerprint to the map.

There's another ConcurrentHashMap involved as well, which is a map from URIs to lists of prefixes used in conjunction with that URI. I won't go into that detail.

The external interface to the NamePool doesn't change at all by this redesign. We still use 20-bit fingerprints plus 10-bit prefix codes, so we still have the limit of a million distinct names. But performance no longer degrades when we get close to that limit; and the limit is no longer quite so hard-coded.

My first attempt at measuring the performance of this found the expected benefits in scalability as the concurrency increases and as the size of the vocabulary increases, but the performance under more normal conditions was worse than the existing design: execution time of 5s versus 3s for executing 100,000 cycles each of which performed an addition (from a pool of 10,000 distinct names so 90% of the additions were already present) followed by 20 retrievals.

I suspected that the performance degradation was caused by the need to update two maps, whereas the existing design only uses one (it's cleverly done so that the fingerprint generated for a QName is closely related to its hash key, which enables us to use the fingerprint to navigate back into the hash table to reconstruct the original QName).

But it turned out that the cause was somewhere else. The old NamePool design was hashing QNames by considering only the local part of the name and ignoring the namespace URI, whereas the new design was computing a hash based on both the local name and the URI. Because URIs are often rather long, computing the hash code is expensive, and in this case it adds very little value: it's unusual for the same local name to be associated with more than one URI, and when it happens, the hash table is perfectly able to cope with the collision. By changing the hashing on QName objects to consider only the local name, the costs for the new design came down slightly below the current implementation (about 10% better, not enough to be noticeable).
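The hashing choice described here can be illustrated with a minimal QName class. This is a sketch, not Saxon's StructuredQName: the point is that hashCode() ignores the (often long) URI, while equals() still compares both parts, so the rare local-name collisions across namespaces are handled by the hash table's normal collision machinery.

```java
// Sketch: hash a QName on its local part only. Computing a hash over a long
// namespace URI is expensive and adds little discriminating value, since the
// same local name rarely appears under more than one URI.
final class QName {
    final String uri;
    final String local;

    QName(String uri, String local) {
        this.uri = uri;
        this.local = local;
    }

    @Override
    public int hashCode() {
        return local.hashCode();   // deliberately ignores the URI
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof QName)) {
            return false;
        }
        QName q = (QName) o;
        return q.local.equals(local) && q.uri.equals(uri);
    }
}
```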

So I feel comfortable putting this into production. There are a dozen test cases failing (out of 20,000) which I need to sort out first, but it all looks very promising.

Another look in the NamePoolhttps://dev.saxonica.com/blog/mike/2015/06/another-look-in-the-namepool.html2015-06-22T08:59:45Z

I've been looking again at the implementation of some of the parallel processing features in Saxon (see my XMLPrague 2015 paper) and at how best to make use of the facilities in the Java platform to support them. In the course of this I've been studying Brian Goetz's excellent book Java Concurrency in Practice, which although dated, is still probably the best text available on the subject; and the fact that it only takes you up to Java 6 is an advantage in our case, because we still want Saxon to run on Java 6.

Reading the book has made me think again about the venerable design of Saxon's NamePool, which is the oldest thing in the product where multithreading is relevant. The NamePool is basically a shared data structure holding QNames.

The design of the NamePool hasn't changed much over the years. On the whole it works well, but there are a number of limitations:

  • Updating the NamePool requires an exclusive lock, and on occasions there has been heavy contention on that lock, which reduces the effectiveness of running more threads.

  • We read the NamePool without acquiring a lock. All the textbooks say that's bad practice because of the risk of subtle bugs. We've been using this design for over a dozen years without a single one of these subtle bugs coming to the surface, but that doesn't mean they aren't there. It's very hard to prove that the design is thread-safe, and it might only take an unusual JVM, or an unusual hardware architecture, or a massively parallel application, or pathological data (such as a local name that appears in hundreds of namespaces) for a bug to suddenly appear: which would be a serious embarrassment.

  • The fact that we read the NamePool with no lock means that the data structure itself is very conservative. We use a fixed number of hash buckets (1024), and a chain of names within each bucket. We only ever append to the end of such a chain. If the vocabulary is large, the chains can become long, and searching then takes time proportional to the length of the chains. Any attempt to change the number of buckets on the fly is out of the question so long as we have non-locking readers. So performance degrades with large vocabularies.

  • We've got a problem coming up with XSLT 3.0 packages. We want packages to be independently compiled, and we want them to be distributed. That means we can't bind names to fingerprints during package construction, because the package will have to run with different namepools at run-time. We can probably solve this problem by doing the binding of names at package "load" or "link" time; but it's a change of approach and that invites a rethink about how the NamePool works.

Although the NamePool design hasn't changed much over the years, we've changed the way it is used: which essentially means, we use it less. Some while ago we stopped using the NamePool for things such as variable names and function names: it is now used only for element names, attribute names, and type names. Around Saxon 9.4 we changed the Receiver interface (Saxon's ubiquitous interface for passing push-mode events down a pipeline) so that element and attribute names were represented by a NodeName object instead of an integer fingerprint. The NodeName can hold a name either in string form or as an integer code, or both, so this change meant that we didn't have to allocate a NamePool fingerprint just to pass events down this pipeline, which in turn meant that we didn't have to allocate fingerprints to constructed elements and attributes that were going to be immediately serialized. We also stopped using the NamePool to allocate codes to (prefix, uri) pairs representing namespace bindings.

These changes have been pretty effective and it's a while since we have seen a workload suffer from NamePool contention. However, we want to increase the level of parallelism that Saxon can support, and the NamePool remains a potential pinch point.

There are a number of things we can do that would make a difference. We could for example use a Java ReadWriteLock to allow either a single writer or multiple readers; this would allow us to introduce operations such as reconfiguring the hash table as the size of the vocabulary increases, without increasing contention because of the high frequency of read access.
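The ReadWriteLock idea can be sketched as follows. This is a hypothetical illustration, not Saxon code: readers share the lock, a writer takes it exclusively, so operations like resizing the table could safely happen under the write lock without penalizing the far more frequent reads.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of a name pool guarded by a ReadWriteLock: many concurrent readers,
// one exclusive writer. Names and structure are illustrative only.
final class LockedNamePool {
    private final Map<String, Integer> names = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private int next = 0;

    int allocate(String qName) {
        // Fast path: most allocations find an existing entry under the
        // shared read lock.
        lock.readLock().lock();
        try {
            Integer n = names.get(qName);
            if (n != null) {
                return n;
            }
        } finally {
            lock.readLock().unlock();
        }
        // Slow path: take the exclusive lock and re-check, since another
        // writer may have added the name in the meantime.
        lock.writeLock().lock();
        try {
            return names.computeIfAbsent(qName, k -> next++);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```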

But let's first try and remind ourselves what the NamePool is actually for. It is there, first and foremost, to allow fast run-time testing of whether a particular node satisfies a NameTest. Because we use the same NamePool when constructing a source tree and when compiling a stylesheet or query, the node on the tree and the NameTest in the compiled code both know the integer fingerprint of the name, and testing the node against the NameTest therefore reduces to a fast integer comparison. This is undoubtedly far faster than doing a string comparison, especially one involving long URIs.

If that was the only thing we used the NamePool for, then we would only need a single method, allocateFingerprint(namespace, localName). What are all the other methods there for?

Well, firstly, the NamePool knows the mapping of fingerprints to names, so we have read-only get methods to get the fingerprint corresponding to a name, or the name corresponding to a fingerprint. These are a convenience, but it seems that they are not essential. The client code that calls the NamePool to allocate a fingerprint to a name could retain the mapping somewhere else, so that there is no need to go back to a shared NamePool to rediscover what is already known.

The most obvious case where this happens is with the TinyTree. The TinyTree holds the names of elements and attributes as fingerprints, not as strings, so operations like the XPath local-name() and namespace-uri() functions get the fingerprint from the TinyTree and then call on the NamePool to translate this back to a string. We could avoid this by keeping a map from integers to strings within the TinyTree itself. This could potentially have other benefits: we could make fewer calls on the NamePool to allocate fingerprints during tree construction; and retargeting a TinyTree to work with a different NamePool would be easier.

Secondly, there's a lot of code in the NamePool to manage prefixes. This isn't needed for the core function of matching a node against a NameTest, since that operation ignores namespace prefixes. The detail here is that when we call NamePool.allocate(), we actually supply prefix, uri, and local-name, and we get back a 32-bit nameCode which uniquely represents this triple; the bottom 20 bits uniquely represent the local-name/uri pair, and it is these 20 bits (called the fingerprint) that are used in QName comparisons. The purpose of this exercise has nothing to do with making name comparisons faster; rather it is mainly concerned with saving space in the TinyTree. By packing the prefix information into the same integer as the local-name and URI, we save a few useful bits. But there are other ways of doing this without involving the NamePool; we could use the same few bits to index into a table of prefixes that is local to the TinyTree itself. There are of course a few complications; one of the benefits of the NamePool knowing about prefixes is that it can provide a service of suggesting a prefix to use with a given URI when the system is required to invent one: users like it when the prefix that emerges is one that has previously been associated with that URI by a human being. But there are probably less expensive ways of achieving this.

Let's suppose that we reduced the functionality of the NamePool to a single method, allocate(QName) → int. How would we then implement it to minimize contention? A simple and safe implementation might be

HashMap<QName, Integer> map = new HashMap<>();
int next = 0;

public synchronized int allocate(QName q) {
       Integer n = map.get(q);
       if (n == null) {
              int m = ++next;
              map.put(q, m);
              return m;
       } else {
              return n;
       }
}

This still serializes all allocate operations, whether or not a new fingerprint is allocated. We can almost certainly do better by taking advantage of Java's concurrent collection classes, though it's not immediately obvious what the best way of doing it is. But in any case, if we can achieve this then we've reduced the NamePool to something much simpler than it is today, so optimization becomes a lot easier. It's worth noting that the above implementation still gives us the possibility to discover the fingerprint for a known QName, but not to (efficiently) get the QName for a known fingerprint.

To get here, we need to start doing two things:

(a) get prefixes out of the NamePool, and handle them some other way.

(b) stop using the NamePool to discover the name associated with a known fingerprint.

After that, redesign becomes relatively straightforward.

How long is a (piece of) string?https://dev.saxonica.com/blog/mike/2015/02/how-long-is-a-piece-of-string.html2015-02-09T14:06:09Z

As I explained in my previous post, I've been re-examining the way functions work in Saxon. In particular, over the last week or two, I've been changing the way system functions (such as fn:string-length) work. There's a terrific amount of detail and complexity here, but I thought it might be interesting to take one simple function (fn:string-length) as an example, to see where the complexity comes from and how it can be reduced.

At first sight, fn:string-length looks pretty simple. How long is a (piece of) string? Just ask Java to find out: surely it should just map to a simple call on java.lang.String.length(). Well, no actually.

If we look at the specification, there are two complications we have to deal with. Firstly, we are counting the number of Unicode characters, not (as Java does) the number of 16-bit UTF-16 code units. In the case of surrogate pairs, one character occupies two code units, and that means that a naïve implementation of string-length() takes time proportional to the length of the string.
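In Java terms the distinction is between String.length() and String.codePointCount():

```java
class CodepointLength {
    // XPath string-length() counts Unicode characters (codepoints);
    // java.lang.String.length() counts 16-bit UTF-16 code units.
    static int xpathStringLength(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For the string "\uD834\uDD1E" (U+1D11E, a single character encoded as a surrogate pair), length() returns 2 while xpathStringLength returns 1.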

Secondly, there are two forms of the string-length() function. With zero arguments, it's defined to mean string-length(string(.)). That's different from nearly all other functions that have 0-argument and 1-argument forms, where (for example) name() means name(.). Saxon handles functions like name() by converting them statically to name(.), and that conversion doesn't work in this case. To illustrate the difference, consider an attribute code="003", defined in the schema as an xs:integer. The function call string-length(@code) returns 1 (it atomizes the attribute to produce an integer, converts the integer to the string "3", and then returns the length of this string). But @code!string-length() returns 3 - the length of the string value of the attribute node.

A further complexity applies specifically to string-length#0 (that is, the zero-argument form). Dynamic calls to context-dependent functions bind the context at the point where the function is created, not where it is called. Consider:

<xsl:for-each select="0 to 9">
  <xsl:variable name="f" select="string-length#0"/>
  <xsl:for-each select="21 to 50">
    <xsl:value-of select="$f()"/>
  </xsl:for-each>
</xsl:for-each>

This will print the value "1" three hundred times. In each case the context item at the point where $f is bound is a one-digit integer, so $f() returns the length of that integer, which is always one. The context item at the point where $f() is evaluated is irrelevant.
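The binding rule can be mimicked in plain Java with a lambda that captures its context when the function value is created. This is an analogy only, not Saxon's mechanism, and the method name is invented for illustration:

```java
import java.util.function.IntSupplier;

class ContextCapture {
    // Returns a zero-argument function that has captured the "context item"
    // at the point of creation; where the function is later called makes
    // no difference to its result.
    static IntSupplier stringLengthHash0(Object contextItem) {
        String captured = String.valueOf(contextItem);
        return () -> captured.codePointCount(0, captured.length());
    }
}
```

Here stringLengthHash0(7).getAsInt() is 1 wherever the call happens, just as $f() above always returns 1.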

Now let's take a look at the Saxon implementation. There's a Java class StringLength which in Saxon 9.6 is about 200 lines of code (including blank lines, comments, etc), and this does most of the work. But not all: in the end all it does is to call StringValue.getStringLength(), which is what really does the work. Atomic values of type xs:string are represented in Saxon by an instance of the class StringValue, which encapsulates a Java CharSequence: often, but not always, a String. The reason for the encapsulating class is to provide type safety on methods like Function.call() which returns a Sequence; StringValue implements AtomicValue which implements Item which implements Sequence, so the XDM data model is faithfully represented in the Java implementation classes.

In addition there's a class StringLengthCompiler which generates a bytecode implementation of the string-length function. This is another 60 or so lines.

Some functions also have a separate streaming implementation to accept streamed input, and one or two (string-join() and concat(), for example), have an implementation designed to produce streamed output. That's designed to ensure that an instruction like <xsl:value-of select="//emp/name" separator=","/>, which compiles down to a call on string-join() internally, doesn't actually assemble the whole output in memory, but rather writes each part of the result string to the output stream as it becomes available.

Since the introduction of dynamic function calls, many system functions have two separate implementations, one for static calls and one for dynamic calls. That's the case for string-length: the evaluateItem() method used for static calls is almost identical to the call() method used for dynamic calls. One reason this happened was because of a fear of performance regression that might occur if the existing code for static calls was generalized, rather than introducing a parallel path.

In 9.6, the implementation of dynamic calls to context-dependent functions like string-length#0 is rather fudged. In fact, the expression string-length#0 compiles into a call on function-lookup("fn:string-length", 0). The implementation of function-lookup() keeps a copy of both the static and dynamic context at the point where it is called, and this is then used when evaluating the resulting function. This is vastly more expensive than it needs to be: for functions like string-length#0 where there are no arguments other than the context, the function can actually be pre-evaluated at the point of creation. In the new 9.7 implementation, the result of the expression string-length#0 is a function implemented by the class ConstantFunction, which encapsulates its result and returns this result when it is called. (It's not quite as simple as this, because the constant function also has to remember its name and arity, just in case the user asks.)
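The idea behind ConstantFunction can be sketched like this. It's a simplified model; the real class lives inside Saxon's function machinery:

```java
// Simplified model: the result is computed once, when the function value
// is created, and every call simply returns it. Name and arity are
// retained in case the user asks for them.
class ConstantFunction<T> {
    private final String name;
    private final int arity;
    private final T result;

    ConstantFunction(String name, int arity, T result) {
        this.name = name;
        this.arity = arity;
        this.result = result;
    }

    T call() { return result; }        // ignores the dynamic context entirely
    String getName() { return name; }
    int getArity() { return arity; }
}
```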

The method StringValue.getStringLength() attempts to recognize cases where walking through the codepoints of the string to look for surrogate pairs is not actually necessary. In previous releases there was an extra bit kept in StringValue, set when the string was known to contain no surrogate pairs: so having walked the string once, it would never be done again. In 9.6 this mechanism is replaced with a different approach: Saxon includes several implementations of CharSequence that maintain the value as an array of fixed-size integers (8-bit, 16-bit, or 32-bit, as necessary). If the CharSequence within a StringValue is one of these classes (known collectively as UnicodeString), then the length of the string is the length of the array. And when getStringLength() is called on a string the first time, the string is left in this form, in the hope that future operations on the string will benefit. Of course, this will in some cases be counter-productive (and there's a further refinement in the implementation, which I won't go into, that's designed to overcome this).
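The payoff of the fixed-width representation can be seen in a reduced sketch. The real UnicodeString classes choose 8-, 16-, or 32-bit arrays as appropriate; this sketch just uses int[]:

```java
class CodepointString {
    private final int[] codepoints;

    CodepointString(String s) {
        // Walk the string once, expanding surrogate pairs to whole codepoints.
        this.codepoints = s.codePoints().toArray();
    }

    int stringLength() {
        return codepoints.length;   // O(1) from now on
    }
}
```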

There are a few other optimizations in the implementation of string-length() that are worth mentioning. Firstly, it's quite common for users to write

<xsl:if test="string-length($x) != 0">

Here we don't need to count surrogate pairs in the string: the string is zero-length if and only if the underlying CharSequence is zero-length. Saxon therefore does a static rewrite of such an expression to boolean(string($x)). (If $x is statically known to be a string, the call string($x) will then be further rewritten as $x.)

If string-length#1 is applied to a value that can be computed statically, then the string-length function is itself computed statically. (This optimization, for odd historical reasons, is often called "constant folding". It's possible only when there are no context dependencies.)

During type-checking, the implementation of string-length#0 keeps a note of whether the context item is known to exist. This is used during bytecode generation: if it's known that the context item won't be absent, then there is no need to generate code to check for this error condition. It's through tiny optimizations like this that generated bytecode ends up being faster than interpreted code.

In my current exercise refactoring the implementation of system functions such as string-length, I've been looking at how much of the logic is duplicated either across the different implementations of a single function (streamed and unstreamed, static and dynamic, bytecode and interpreted) or across the implementations of functions that have a lot in common (such as string(), string-length(), and normalize-space()). I've found that with the exception of the core code in StringValue.getStringLength, and the optimization of string-length()=0, everything else can be vastly reduced. In place of the original StringLength class, there are now two (inner) classes StringLength_0 and StringLength_1, each of which consists of a single one-line method. The code for generating bytecode can also be considerably simplified by achieving more reuse across different functions.

The essence of the reorganization is that the class StringLength (or rather, its two variants) is no longer an Expression; it is now a Function. Previously a call on string-length($x) compiled to an expression, held as a node on the expression tree. Now it compiles into two objects: a StringLength object, which is a pure function, and a SystemFunctionCall object, which is an expression that calls the function. The SystemFunctionCall object is generic across all functions, while the implementations of SystemFunction contain all the code that is specific to one function. This change was motivated primarily by the need to handle dynamic function calls (and hence first-class function objects) properly, but it has provided a stimulus for a refactoring that achieves much more than this.

So, how long is a piece of string? At least we now know how to work it out more efficiently. Sorry this little yarn wasn't shorter.

Functions, Function Calls, Function Itemshttps://dev.saxonica.com/blog/mike/2015/02/functions-function-calls-function-items.html2015-02-01T14:05:36Z

XSLT and XQuery, in their 3.0 incarnations, have become fully functional programming languages, where functions are first class values that can be manipulated in the same way as other values, for example they can be bound to variables and passed as arguments to other functions. So you would expect that functions play a pretty central role in the Saxon implementation. In this article I shall review how functions work within Saxon, and how I think this needs to change.

One of the things that happens when features are added to a complex piece of software is that they tend to be added around the edges, rather than in the core. You can tell when something was added by how central it is to the class hierarchy. It shouldn't be that way, but there are good reasons why it happens. And when we look at how functions work in Saxon, that's what we see. For example, if we had always had dynamic function calls, then we would probably find that static function calls (where the function to be called is known at compile time) were treated simply as a special case of dynamic calls, much as ($x+1) is treated as a special case of ($x+$y). But because they came along later, dynamic calls are actually handled very differently.

I've been doing work recently on the design of Saxon's expression tree (the data structure produced by the compilation phase for use by the execution phase of stylesheet and query evaluation). One aspect of that has been working out how to write the expression tree to persistent storage, and then load it back again for execution. In doing this, I've been struck by the sheer number of different classes that are somehow related to functions, and by the lack of coherence between them.

The first thing we notice, which is in itself a bit smelly, is that there is no class called Function. In fact, for many functions, there is no object in Saxon that represents the function itself. For example, with system functions (recall that in XSLT 1.0 and XPath 1.0 that's the only kind there were), we have two things: a data table containing general information about functions, such as their type signatures and context dependencies, and a class SystemFunctionCall which actually represents a call to the function, not the function itself. The implementation of functions such as name() or substring() is in a class called Name or Substring which is a subclass of SystemFunctionCall. This in turn is a subclass of Expression, and as such it forms a node in the expression tree.

This works fine for static function calls, but what happens to an expression such as substring#3 or true#0 or name#0? Answer: different things in each of these three cases.

For substring#3, this is a pure function: there are no context dependencies. We create something called a SystemFunctionItem, which is an Item (and therefore a Sequence), we wrap this inside a Literal, and we put the Literal on the expression tree. This is much the same as the way we handle a constant such as "London" or 39. Internally, the SystemFunctionItem contains a reference to an instance of the Substring class, which is where things get a bit peculiar, because Substring is designed to be a function call, not a function. In fact, Substring has a dual personality, it tries to act in both roles depending which methods you call. That's a hack, and like all hacks, it comes back to bite you.

For true#0 there's a further bit of hackery, because the function call true() doesn't actually generate a SystemFunctionCall, it generates a BooleanValue: it's treated as a constant, just like "London" or 39. But we have to support dynamic calls on true(), so we introduced a SystemFunctionCall called True to handle this case, even though it doesn't have the dual personality: it acts only as a function, never as a function call.

The name#0 function is different again, because it encapsulates context. It's defined to return the name of the node that was the context node at the point where the name#0 expression was evaluated. So this should remind us that name#0 is not a value, it is an expression: the function it represents has no static existence, it can only be created dynamically by evaluating the expression with a particular dynamic context. We solve this with another hack: for context-dependent function "literals" like name#0 or position#0, the compiler actually generates a call on function-lookup("name", 0), which is an expression rather than a value, and which has to be evaluated at run-time.

The SystemFunctionItem class implements an internal interface called FunctionItem, and as one might expect, a FunctionItem is an Item and therefore a Sequence: that is, it's a run-time value. Other subclasses of FunctionItem are used for calls to user defined functions, Java extension functions, or constructor functions. But they are only ever used for dynamic calls, never for static calls.

Although there is no Function class, some functions are in fact represented as objects in their own right. The most important is UserFunction, which represents a user-written function in XSLT or XQuery. Another is InlineFunction (representing an anonymous inline function), and another is ExtensionFunctionDefinition, which represents a function whose implementation is provided in Java code: this is used both for user-written extension functions, and for many (but not all) vendor supplied extension functions in the Saxon namespace. But these classes have nothing in common with each other, there is no common superclass. This has the consequence that not only is there one set of machinery for static function calls and another quite different set for dynamic calls, but that in each case, there are many different flavours depending on what kind of function you happen to be calling.

Getting this right involves a cleaner data model. As always, a clean data model leads to cleanly structured code, and the data model should be designed to accurately reflect the specification, not for the convenience of the implementation. The specification says we have two objects of interest, a function, and a function call. Quite rightly, the spec no longer uses the term "function item" as something distinct from the function: a function is an item. There's also something called the "function implementation" which we should recognize, because two functions may share an implementation. For example, the expression name#0 returns a different function each time it is evaluated: these functions differ in the data they hold (a snapshot of the dynamic context), but they share the same implementation.

We should be trying to move towards a structure where we have a hierarchy of subclasses of Function (for system functions, constructor functions, user functions, Java extension functions, etc); where the Function is an Item; where a Function may reference a FunctionImplementation to provide for sharing; and where a FunctionCall is as far as possible independent of what kind of function it is calling, but simply has two flavours for static function calls and dynamic function calls.
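In outline, the target structure might look like this. Names and signatures are illustrative, not Saxon's API:

```java
interface Item {}

// A function is itself an item, as the 3.0 specifications require.
interface Function extends Item {
    Object call(Object[] args);
}

// The call machinery is independent of what kind of function is called;
// static and dynamic calls differ only in how the Function is obtained.
class FunctionCall {
    static Object invoke(Function f, Object... args) {
        return f.call(args);
    }
}

// One concrete kind of function, standing in here for system functions,
// user functions, extension functions, and so on.
class Concat implements Function {
    public Object call(Object[] args) {
        StringBuilder sb = new StringBuilder();
        for (Object a : args) {
            sb.append(a);
        }
        return sb.toString();
    }
}
```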

It will be interesting to see how close we can get to that situation.

Redesigning the Saxon expression treehttps://dev.saxonica.com/blog/mike/2014/11/redesigning-the-saxon-expression-tree.html2014-11-11T23:45:00Z

I've been embarking on an exercise to redesign the Saxon expression tree. It's got a number of problems that it would be nice to fix; and there's major work ahead in being able to save and restore the tree as part of a compiled XSLT 3.0 package, so it would be nice to get the structure into better shape first.

One of the problems is that there are 150-odd different implementation classes for nodes on the expression tree, and that's before you start counting individual function calls; and there's far too much duplication between these classes. Implementing a new kind of expression involves far too much code. In addition, there's far too much scope to get this code wrong, leading to obscure bugs when a new kind of expression gets involved in some particular optimization rewrite, such as function or variable inlining. There's a steady flow of bugs, perhaps a couple a month, caused by tree corruptions of one kind or another, and it would be nice to make the whole structure more robust.

One thing I want to do is to be more systematic about the way in which static context information is held on the tree. At present the principle is that each expression saves that part of the static context that it thinks it might need. For example, some expressions need run-time access to the static base URI, some to the static namespace context, and some to the register of collation names. (This last case has been simplified in 9.6 because all collation names are now global to a Configuration, though the default collation can still vary from one expression to another). A recent shock discovery is that in XPath 3.0, general comparisons (that is, the humble "=" operator) can depend on the namespace context - if one operand turns out to be untyped atomic and another is a QName, then the untyped atomic value needs to be converted to a QName using the in-scope namespaces. This means the namespace context needs to be saved with every general comparison expression, which is heavy stuff. Given that the static context is almost always the same for all expressions in a module, it would be much better, rather than saving what each expression thinks it might need, to save the changes to the static context on the tree, so that each expression can discover the whole static context by processing a set of deltas held with its ancestors in the tree.
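The deltas idea amounts to a chain of partial contexts, resolved by walking up the tree. The following is a sketch under assumed names, not Saxon's classes:

```java
import java.util.HashMap;
import java.util.Map;

// Each node stores only the static-context properties it changes;
// a lookup walks towards the root until some ancestor supplies a value.
class StaticContextChain {
    private final StaticContextChain parent;
    private final Map<String, String> overrides = new HashMap<>();

    StaticContextChain(StaticContextChain parent) {
        this.parent = parent;
    }

    void set(String property, String value) {
        overrides.put(property, value);
    }

    String get(String property) {
        for (StaticContextChain c = this; c != null; c = c.parent) {
            String v = c.overrides.get(property);
            if (v != null) {
                return v;
            }
        }
        return null;   // property never set on this path
    }
}
```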

This leads to another point, which is that it would be nice to have parent pointers in the expression tree. At present you can navigate from a node to its children (that is, its subexpressions), but not in the other direction. Saxon gets around this by keeping a dynamic stack of expressions visited in some of its operations on the tree (such as type-checking and optimization) so the ancestor expressions are maintained dynamically during the recursive tree-walk rather than being maintained statically on the tree.

In 9.6, mainly to support XSLT 3.0 streamability analysis, we introduced a generalized mechanism for obtaining the subexpressions of any node: the operands() method. This returns a set of Operand objects, each of which contains a pointer to the child expression itself, and also properties in effect of the parent-child expression relationship, such as the XSLT 3.0 posture and sweep properties, and the operand usage, used in the 3.0 general streamability rules. This mechanism has proved very successful (and not just for streaming) in enabling more generalized operations on the tree. But a limitation is that modifications to the tree, such as substituting one child expression for another (which is very common during type-checking and optimization), are still entirely ad hoc, and have to be managed independently by each class of expression node.

As a first step in redesigning the tree for 9.7, I have extended the way in which we use Operand objects. As well as being used to navigate to subexpressions, they are also now used to modify subexpressions: the Operand object has a method setChildExpression() which can be used to replace the existing child expression with another. All structural changes to the tree are required to go via this method, which is enforced by encapsulating the reference to the child expression within the Operand object. The Operand also holds a reference to its "owner" expression, so when a child expression is changed, the single setChildExpression() method can take responsibility for housekeeping such as making sure the child expression has location information for use in error reporting, and making sure that expression properties cached on the parent node (such as the inferred type) are invalidated and recomputed when the children of the expression change.
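A stripped-down sketch of the Operand discipline follows; the field names are illustrative, and a single boolean stands in for the cached properties (such as the inferred type) that a rewrite must invalidate:

```java
// Stand-in for the real expression node.
class Expression {
    boolean cachedPropertiesValid = true;
}

class Operand {
    private final Expression owner;
    private Expression child;

    Operand(Expression owner, Expression child) {
        this.owner = owner;
        this.child = child;
    }

    Expression getChildExpression() {
        return child;
    }

    // The only route for structural change: housekeeping such as
    // invalidating the owner's cached properties happens in one place.
    void setChildExpression(Expression e) {
        if (e != child) {
            child = e;
            owner.cachedPropertiesValid = false;
        }
    }
}
```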

This process is complicated by the fact that the nodes on the expression tree are highly diverse, in fact they don't even all represent expressions. The tree also has to cater, for example, for XSLT patterns and XQuery FLWOR expression clauses.

Making updates go through the Operand object enables many expressions to inherit a generic implementation of common rewrite methods such as typeCheck and optimize. For example, the default action of optimize is to call optimize() on each subexpression, and if any changes have occurred, replace the subexpression with its rewritten self. The redesign means that this "replace" operation can now be done in a generic way, meaning that the default optimize() method does not need to be tailor-made for each class of expression. The same is true of other methods such as promote().

I was hoping that channelling all updates through Operand.setChildExpression() would also make it easy to maintain parent pointers in the tree. Unfortunately this is not the case. It's easy enough, when B is set as a child of A, for B's parent pointer to be updated to point to A. The problem arises when B's parent pointer was previously set to C: what happens to C's children? Can B be a child of A and C simultaneously (making it not a tree at all)? It turns out that some rewrites on the tree involve creating a new structure over existing leaf nodes in the tree, which might then be discarded if not all the conditions for optimization are met. So we may have updated the parent pointers in the leaf nodes to point to this new superstructure, which is then discarded, reverting to the original. It's difficult then to make sure that the parent pointers are reset properly when the rewrite is abandoned. It can be done in an ad-hoc way, of course, but we're looking for something more robust: an API for tree rewriting that doesn't allow the tree to become inconsistent. This is proving hard to achieve. We may have to resort to a different approach, doing a bulk sweep of the tree to set all parent pointers before the typecheck and optimize operations on each template or function. This is intellectually unsatisfactory because it means accepting that the tree could be temporarily inconsistent in the middle of such an operation, but it may be the best option available.

This is work in progress, but it's looking promising. I appreciate that for most users it's about as interesting as the repair works to the sewers beneath your street. Hopefully, though, the bottom line will be fewer product bugs, and more ability to take the product forward into new areas.


