Monday 18 July 2011

HTTP or it doesn't exist.

This is my 'or it doesn't exist' blog post[1][2][3]. I think everyone should have one ;-)

A big chunk of my life is processing electronic information. Since I would like it to be a (slightly) smaller chunk of my life, I want to automate as much as possible. Now ideally, I don't want a massive disconnect between what I have to do as a human processor of information and what I need to tell a computer to do to do that job done without my help. Because it's easier that way.

So when I hear that the information I need to process is in some spreadsheet or other on a Windows share, it makes me a little sad. When I hear that it is available via a sensible REST interface in a sensible format, my heart leaps for joy just a little.

With something like Python's standard library (and third-party package) support for HTTP (requests), XML (ElementTree) and JSON, I should be able to get my computer to do most of the manual data processing tasks which involve 'documents' of some form or other.

In a previous job I worked at convincing anyone who would listen that 'XML over HTTP' was the best thing since sliced bread. With appropriate XSLT and CSS links, the same data source (i.e. URI) could be happily consumed by both man and machine. Admittedly most of the information was highly structured data - wire protocols and the like, but it still needed to be understandable by real people and real programs.

I'm not an XML expert, but I think I 'get' it. I never understood why it needed so much baggage though, and can't say I'm sad that the whole web services thing seems to be quietly drifting into the background - though maybe it was always trying to.

A lot changes in web technology in a short time, and XML is no longer 'cool', so I won't be quite as passionate about 'XML over HTTP' as I once was. For short fragments it is far more verbose than JSON, though I'd argue that for longer documents, XML's added expressiveness makes the verbosity worth it. Maybe it was ever thus, but whenever two technologies have even the slightest overlap, there seems to be a territorial defensiveness which makes the thought of using both in one project seem somewhat radical. So while I've used JSON much more than XML in the last couple of years, I've not turned against it. If done right (Apple, what were you thinking with plist files!?) - it is great. Compared to JSON-like representations, the ability to have attributes for every node in the tree is a whole new dimension in making a data source more awesome or usable (or terribly broken and rubbish). I've seen too many XML documents where either everything is an attribute or nothing is, but it's not exactly rocket science.

Things I liked about XML:

I like to think I could write a parser for XML 1.0 without too much effort. If it's not well formed, stop. Except for trivial whitespace normalisation etc, there is a one-to-one mapping of structure to serialisation. Compare that with the mess of HTML parsers. While HTML5 might now specify how errored documents should be parsed (i.e. what the resulting DOM should be), I suspect that a HTML5 -> DOM parser is a far more complex beast.
Names! Sensible Names!
Because HTML is limited in its domain, it has a fixed (though growing thanks to the living standard[4] which is HTML) set of tags. When another domain is imposed on top of that, the class attribute tends to get pressed into service in a ugly and overloaded way. By allowing top-level tags to be domain-specific, we can make the document abstraction more 'square'[5].
Attributes allow metadata to be attached to document nodes. Just as a lower-level language is fully capable of creating a solution to any given problem, having 'zero mental cost' abstractions (such as the data structures provided by high-level languages) enables new ways of thinking about problems. In the same way, having attributes on data nodes doesn't give us anything we couldn't implement without them, but it provides another abstraction which I've found invaluable and missed when using or creating JSON data sources.

What does make me slightly(!) sad though is the practical demise of XHTML and any priority that browsers might give to processing XML. There is now a many-to-one mapping of markup to DOM, and pre HTML5 (and still in practice for the foreseeable future considering browser idiosyncrasies and bugs) - a many-to-many mapping. It wouldn't surprise me if XSLT transform support eventually disappeared from browsers.

Maybe there's a bit of elitism here - if you can't code well-formed markup and some decent XSLT (preferably with lots of convoluted functional programming thrown in) - then frankly 'get of my lawn!'. I love the new features in HTML(5), but part of me wishes that there was an implied background 'X' unquestionably preceding that, for all things. The success of the web is that it broke out of that mould. But in doing that it has compromised the formalisms which machines demand and require. Is the dream of the machine-readable semantic web getting further away - even as cool and accessible (and standards compliant - at last) web content finally looks like it might possibly start to achieve its goal? Is it too much (and too late) to dream of 'data' (rather than ramblings like this one) being available in the same form for both the human viewer and the computer automaton?

I'm prepared to be realistic and accept where we've come to. It's not all bad, and the speed with which technology is changing has never been faster. It's an exciting time to wield electronic information, and we've got the tools to move forward from inaccessible files stored on closed, disconnected systems. So where I used to say 'XML over HTTP', my new mantra shall now be 'HTTP or it doesn't exist'. At least for a while.