Anyone who has been a developer for a while gets familiar with regular expressions. They eat text for breakfast, and spit out desired answers. For all their cryptic terseness, they are at least in part reusable, and are based on only a handful (or two...) of rules. 'regexes' are a domain-specific micro-language for searching and filtering text.
But once we get outside of text, what is there?
With XML, we have XPath. I had one of those light-bulb realisations recently that what regexes are to text, XPath is to XML. And it made me think:
Why would I want to use a data substrate which doesn't have such a tool?
Pipelines and Data Processing
The 'UNIX philosophy' of connecting together a number of single-purpose programs to accomplish larger tasks is one of the keys to its power. These tools can be plugged together in ways which the original creators may never have thought of. Tools such as sed and awk are often employed as regex-based filters to command pipelines. I wish more tools had XML output options, because the tools we use in our pipelines often output structured data in textual format, often in tabular form. Tables are great for human consumption (provided they are modest in size), but when we start getting empty columns, cells flowing onto multiple lines, and other inconsistencies, it becomes a pain to parse. How great things could be if every tool following subversion's lead and had an --xml option:
svn diff -r $(svn log --stop-on-copy --xml | xpath -q -e '//log/logentry[last()]/@revision' | cut -d '"' -f 2):HEAD(This command does a diff from a branch base to the most recent revision. It still does some basic text processing, because the end result of XPath expressions are still text nodes).
Just imagine if POSIX defined an XML schema for each relevant command, and mandated an --xml option. Life would be so much easier. In many environments, data is structured but we still represent it as text. The pipeline philosophy might be nice, but it isn't exploited to the full when we need to write convoluted awk scripts and inscrutable regular expressions (or worse, Perl ;) ) to try and untangle the structure from the text. Consider something straightforward like the output of 'mount' on a *nix box. On my Mac it looks like this:
ben$ mount /dev/disk0s2 on / (hfs, local, journaled) devfs on /dev (devfs, local, nobrowse) map -hosts on /net (autofs, nosuid, automounted, nobrowse) map auto_home on /home (autofs, automounted, nobrowse) /dev/disk1s1 on /Volumes/My Book (msdos, local, nodev, nosuid, noowners)
This is structured data, but getting the information out of that text blob would not be trivial, and would probably take many minutes of trial and error with regexes to get something reasonable. And the crucial thing is that you couldn't be sure it would always work. Plug a new device in which gets mounted in some new and interesting way, and who is to say that the new output of mount won't suddenly break your hand-crafted regex? That's where XML shines. Adding new information doesn't change anything in the old information. The way to access it doesn't change. Nothing breaks in the face of extension. Compare this to something like CSV, where the insertion of an extra column means all the indices from that column onwards need to change in every producer and consumer of the data.
XML and the Web
I'm somewhat saddened that XHTML didn't win outright in the last decade, and that XML on the web never really took off. I spent months at a previous job trying to convince everyone that 'XML-over-HTTP' was the best thing since sliced bread. A single source of data, which could be consumed by man (via XSLT & CSS in the browser) and machine alike. Just think how much energy the likes of Google could save if our web content didn't focus almost entirely on human consumption and discriminate against machines ;-)
One interesting thing which has happened as XML-on-the-web has declined is the increase in use of CSS selectors, first via frameworks such as Sizzle (used in jQuery), and later in the standard
querySelectorAll DOM method. There is clearly a need for these DSL micro-languages, and as the 'CSS selector' DSL shows, they can quickly establish themselves if there is a clear need and sufficient backing from the community. Also apparent is that existing solutions can be usurped - users could do virtually everything CSS selectors could do (and far more besides) with XPath, but that didn't happen. Simplicity won here. But just because XPath was (arguably) wrong for Web development, doesn't mean it is wrong everywhere, and I contend that there are places where we have over-simplified, forcing regular expressions and text manipulation to (and beyond) breaking point, when XML processing would make things simpler everywhere.
As a final thought, I've just had a quick glance through Greg Wilson's excellent Data Crunching, which contains chapters on Text, Regular Expressions, Binary data (rather a short ad-hoc chapter), XML, and Relational Databases. Perhaps the 'binary data' chapter is short because there simply aren't many patterns available. There is no language to describe unabstracted data. And perhaps when we consider the data layer we should be using, we should think not only of features and performance, but also the power, expressiveness, and concision of the languages available to reason about the information. Perhaps too often we settle for a lowest common denominator solution (text) when a higher level one might be more powerful, especially if we don't have to give up on the concepts of fine-grained interoperability which micro-DSLs such as XPath give us.
To be continued...