More thoughts on XML differencing

Last year, I gave a talk on XML differencing at the Extreme Markup Languages 2005 conference in Montréal. Since then I didn’t have much time to work on that project.

The prototype app is opensource, and available at ssddiff.alioth.debian.org. The slides are there, too, and so is a link to the article.

XML differencing is actually a two phase process; I’ve been mostly working on the first phase. Traditional approaches intertwine the two phases.

In the first phase, you build a mapping between the two documents. I called this a “nodeset correspondence” in the slides. Basically you decide which node from the first document corresponds to which node in the second document.

When you have found this mapping (the “marked” output writer of ssddiff will output this information in form of two XML documents marked with numbers in a certain attribute), the actual differencing is done. But much of the work remains.

When doing a diff, the output (format) is just as important as the actual result.

While the “marked” output explained above is very precise, it doesn’t give the user an overview of what has changed in the document. That’s why I added two more output writers, the “xupdate” writer and the “merged” writer.

The “xupdate” writer does obviously produce an XUpdate script. It goes a long way to optimizing the output (read: minimizing the number of operations), and employs a longest-common-subsequence algorithm for that (LCS is what regular diff uses on lines!). Fortunately we already have a 1:1 mapping, so this boils down to a longest-increasing-subsequence which can be computed very efficiently.

Then the writer employs some XUpdate tricks to do real moves (by storing parts of the tree in variables to be inserted later), and avoids collapsing of text nodes (it will eventually insert dummy nodes for that). But all this gives just a glimpse of the complexity I’ve encountered in what I considered to be a “remote” part of my project. It isn’t even described in the paper, because I just hacked this code together to work somewhat (attribute handling is incomplete IIRC).

The “merged” writer turned out to be much more powerful than I had imagined and actually rather easy. After “merging” two version of an HTML document, I could add some CSS rules and have a visual presentation of the changes in my browser. Very cool. I’d love to add this html merge functionality to the prototype, but I didn’t have enough time yet (you’ll find some refactoring of the merged writer in SVN though to get there someday).

But let me get back to the difficulties in the “XUpdate” writer. Not all of these difficulties are actually of technical nature. It is (like with the semantics of differencing) a matter of the intended data model and use.

When you are doing a diff to be applied by a computer (e.g. via XUpdate), you may have the requirement, that the resulting document is identical to the original document. In that case, the diff is likely to have to restore whitespace. If you are interested in analyzing the differences, you will want to skip any irrelevant information.

Let’s assume we have an address book:

<abook><entry name="tom" /><entry name="jerry" /></abook>

Ordering of the entry nodes is likely to be irrelevant. You address book applications will sort the entries on demand. So a “semantic diff” shouldn’t include any changes to the ordering of “entry” nodes, should it?

The XUpdate writer doesn’t have any functionality in place to skip such changes from being written to the output file. This will likely be not possible without some DTD magic and extra user information.

However, to have the benefits of the “-u” context option of good old “diffutils” - fuzzy matches, and somewhat reliable merging of changes without having all revisions available - such semantics are needed, too. But this will even further increase complexity of the output writer, which for example would need to detect somehow that the “name” attribute in my address book is a unique key, and therefore xpath expressions like /abook/entry[@name='tom'] are preferrable to expressions such as /abook/entry[1], which will easily break when merging.

The specification of an “XUpdate-replacement with context information” is probably a whole thesis in itself, I guess.

P.S. I’m occasionally asked about a Java version of ssddiff. Sorry that I have to disappoint you there. First of all, I’m very short of time. Then I’m not too comfortable in Java, and I estimated that the Java version would need at least twice as much memory, because of certain optimizations I can’t easily access in Java that I can just hide in a template in C++. And memory is the key limiting factor at least in exact mode.