Monday, 21 March 2011

XPath is to XML as regex is to text

Anyone who has been a developer for a while gets familiar with regular expressions. They eat text for breakfast, and spit out desired answers. For all their cryptic terseness, they are at least in part reusable, and are based on only a handful (or two...) of rules. 'regexes' are a domain-specific micro-language for searching and filtering text.

But once we get outside of text, what is there?

With XML, we have XPath. I had one of those light-bulb realisations recently that what regexes are to text, XPath is to XML. And it made me think:

Why would I want to use a data substrate which doesn't have such a tool?

What I mean is this; text has regex. XML has XPath. RDBMS have SQL. Markup language of the day has... oh, it doesn't. Not really, in the sense of a standardised domain-specific micro-language. Regular expressions, XPath and SQL have history and mindshare. They 'work' specifically because they are DSLs, rather than high-level code. (OK, SQL is pushing it further than I'd like here, but it's still a DSL. Just not a micro-sized one.) To me, this is a problem which many 'NoSQL' tools have. I want the features of them, but CouchDB wants me to write map-reduce functions in JavaScript. MongoDB wants me to use a JSON-based query language. There is no commonality; no reuse; no lingua franca which will let me abstract the processing concepts away from the tools. Perhaps that will come in time for more data-representations (this seems to be an attempt for JSON, for example), but there is a significant barrier before such a tool gains widespread acceptance as a common abstraction across an entire data layer.

Pipelines and Data Processing

The 'UNIX philosophy' of connecting together a number of single-purpose programs to accomplish larger tasks is one of the keys to its power. These tools can be plugged together in ways which the original creators may never have thought of. Tools such as sed and awk are often employed as regex-based filters to command pipelines. I wish more tools had XML output options, because the tools we use in our pipelines often output structured data in textual format, often in tabular form. Tables are great for human consumption (provided they are modest in size), but when we start getting empty columns, cells flowing onto multiple lines, and other inconsistencies, it becomes a pain to parse. How great things could be if every tool following subversion's lead and had an --xml option:

svn diff -r $(svn log --stop-on-copy --xml | xpath -q -e '//log/logentry[last()]/@revision' | cut -d '"' -f 2):HEAD
(This command does a diff from a branch base to the most recent revision.  It still does some basic text processing, because the end result of XPath expressions are still text nodes).

Just imagine if POSIX defined an XML schema for each relevant command, and mandated an --xml option. Life would be so much easier. In many environments, data is structured but we still represent it as text. The pipeline philosophy might be nice, but it isn't exploited to the full when we need to write convoluted awk scripts and inscrutable regular expressions (or worse, Perl ;) ) to try and untangle the structure from the text. Consider something straightforward like the output of 'mount' on a *nix box. On my Mac it looks like this:

ben$ mount
/dev/disk0s2 on / (hfs, local, journaled)
devfs on /dev (devfs, local, nobrowse)
map -hosts on /net (autofs, nosuid, automounted, nobrowse)
map auto_home on /home (autofs, automounted, nobrowse)
/dev/disk1s1 on /Volumes/My Book (msdos, local, nodev, nosuid, noowners)

This is structured data, but getting the information out of that text blob would not be trivial, and would probably take many minutes of trial and error with regexes to get something reasonable. And the crucial thing is that you couldn't be sure it would always work. Plug a new device in which gets mounted in some new and interesting way, and who is to say that the new output of mount won't suddenly break your hand-crafted regex? That's where XML shines. Adding new information doesn't change anything in the old information. The way to access it doesn't change. Nothing breaks in the face of extension. Compare this to something like CSV, where the insertion of an extra column means all the indices from that column onwards need to change in every producer and consumer of the data.

XML and the Web

I'm somewhat saddened that XHTML didn't win outright in the last decade, and that XML on the web never really took off. I spent months at a previous job trying to convince everyone that 'XML-over-HTTP' was the best thing since sliced bread. A single source of data, which could be consumed by man (via XSLT & CSS in the browser) and machine alike. Just think how much energy the likes of Google could save if our web content didn't focus almost entirely on human consumption and discriminate against machines ;-)

One interesting thing which has happened as XML-on-the-web has declined is the increase in use of CSS selectors, first via frameworks such as Sizzle (used in jQuery), and later in the standard querySelectorAll DOM method. There is clearly a need for these DSL micro-languages, and as the 'CSS selector' DSL shows, they can quickly establish themselves if there is a clear need and sufficient backing from the community. Also apparent is that existing solutions can be usurped - users could do virtually everything CSS selectors could do (and far more besides) with XPath, but that didn't happen. Simplicity won here. But just because XPath was (arguably) wrong for Web development, doesn't mean it is wrong everywhere, and I contend that there are places where we have over-simplified, forcing regular expressions and text manipulation to (and beyond) breaking point, when XML processing would make things simpler everywhere.


In terms of practicalities, if I had ever spent too long in the world of Java, I would probably see XML as an unwelcome and persistent pest. But living in the happier climes of Python-ville, I have access to the wonderful ElementTree API, via both ElementTree itself (included in the standard library) and lxml.

Both of these support XPath as well as high-level abstractions of XML documents to and from lists and dictionaries. With ElementTree, XML access from Python is (almost) as easy as JSON access from JavaScript. And with technologies like XPath and XSLT available, I think it's worth it.

As a final thought, I've just had a quick glance through Greg Wilson's excellent Data Crunching, which contains chapters on Text, Regular Expressions, Binary data (rather a short ad-hoc chapter), XML, and Relational Databases. Perhaps the 'binary data' chapter is short because there simply aren't many patterns available. There is no language to describe unabstracted data. And perhaps when we consider the data layer we should be using, we should think not only of features and performance, but also the power, expressiveness, and concision of the languages available to reason about the information. Perhaps too often we settle for a lowest common denominator solution (text) when a higher level one might be more powerful, especially if we don't have to give up on the concepts of fine-grained interoperability which micro-DSLs such as XPath give us.

To be continued...


  1. I don't see why you discard MongoDB's queries as a full-featured query language... JSON makes it easy to create and parse in a program.

    Also it would be *hell* if Unixes would use XML instead of text :). It's much simpler to try a few "cuts and greps" and get what you want, preserving nice output for a human eye. In case of XML you would also have to *guess* the schema and the meaning of elements/attributes.

    I think you overestimate XML, it's a syntax only, neither nice for humans nor ideal for parsing (slow performance).

  2. @aradomir - I don't disagree with any of your three points.

    My argument is not that XML is great (I agree it has issues) but that having common meta-languages and patterns to reason about data (e.g. XPath) which are not tied to a specific tool but to the entire 'data substrate' is a powerful concept, and perhaps make it worth putting up with an otherwise sub-optimal representation.

    On your particular points:
    - do MongoDB's queries work in any other tool? they certainly aren't a standard way of dealing with all JSON documents).

    - 'preserving nice output for a human eye' shouldn't mean that we don't have options which make things easier for programs, though obviously without standard schemas just having XML output everywhere would seem a little pointless...

    - as for XML being a syntax only, form often dictates function. I'd far sooner add meta-data (in the form of attributes) if I knew the target representation was XML than if it were JSON. And vice-versa - sometimes we want to deter gratuitous meta-data.

  3. Hmm, I'm skeptical about "common meta-languages" as you describe them. Are your examples - XPath and regexps - really such common? Almost all programs use strings, in most cases they don't use regexps. Also using programmatic functions like "substring" is preferred over a regexp that defines a substring. The same with XPath - I've seen more examples of DOM traversal than using XPath expressions.

    Things like "common meta-languages" are cool from academic point of view (no offence here, I myself have such background), but in practice everything is a TOOL - XPath is a tool for extracting nodes, sometimes better than DOM traversal or a regexp... I don't see why it is THE language.