Monday, 18 July 2011

HTTP or it doesn't exist.

This is my 'or it doesn't exist' blog post[1][2][3]. I think everyone should have one ;-)

A big chunk of my life is processing electronic information. Since I would like it to be a (slightly) smaller chunk of my life, I want to automate as much as possible. Now ideally, I don't want a massive disconnect between what I have to do as a human processor of information and what I need to tell a computer to do to do that job done without my help. Because it's easier that way.

So when I hear that the information I need to process is in some spreadsheet or other on a Windows share, it makes me a little sad. When I hear that it is available via a sensible REST interface in a sensible format, my heart leaps for joy just a little.

With something like Python's standard library (and third-party package) support for HTTP (requests), XML (ElementTree) and JSON, I should be able to get my computer to do most of the manual data processing tasks which involve 'documents' of some form or other.

In a previous job I worked at convincing anyone who would listen that 'XML over HTTP' was the best thing since sliced bread. With appropriate XSLT and CSS links, the same data source (i.e. URI) could be happily consumed by both man and machine. Admittedly most of the information was highly structured data - wire protocols and the like, but it still needed to be understandable by real people and real programs.

I'm not an XML expert, but I think I 'get' it. I never understood why it needed so much baggage though, and can't say I'm sad that the whole web services thing seems to be quietly drifting into the background - though maybe it was always trying to.

A lot changes in web technology in a short time, and XML is no longer 'cool', so I won't be quite as passionate about 'XML over HTTP' as I once was. For short fragments it is far more verbose than JSON, though I'd argue that for longer documents, XML's added expressiveness makes the verbosity worth it. Maybe it was ever thus, but whenever two technologies have even the slightest overlap, there seems to be a territorial defensiveness which makes the thought of using both in one project seem somewhat radical. So while I've used JSON much more than XML in the last couple of years, I've not turned against it. If done right (Apple, what were you thinking with plist files!?) - it is great. Compared to JSON-like representations, the ability to have attributes for every node in the tree is a whole new dimension in making a data source more awesome or usable (or terribly broken and rubbish). I've seen too many XML documents where either everything is an attribute or nothing is, but it's not exactly rocket science.

Things I liked about XML:

Simplicity
I like to think I could write a parser for XML 1.0 without too much effort. If it's not well formed, stop. Except for trivial whitespace normalisation etc, there is a one-to-one mapping of structure to serialisation. Compare that with the mess of HTML parsers. While HTML5 might now specify how errored documents should be parsed (i.e. what the resulting DOM should be), I suspect that a HTML5 -> DOM parser is a far more complex beast.
Names! Sensible Names!
Because HTML is limited in its domain, it has a fixed (though growing thanks to the living standard[4] which is HTML) set of tags. When another domain is imposed on top of that, the class attribute tends to get pressed into service in a ugly and overloaded way. By allowing top-level tags to be domain-specific, we can make the document abstraction more 'square'[5].
Attributes
Attributes allow metadata to be attached to document nodes. Just as a lower-level language is fully capable of creating a solution to any given problem, having 'zero mental cost' abstractions (such as the data structures provided by high-level languages) enables new ways of thinking about problems. In the same way, having attributes on data nodes doesn't give us anything we couldn't implement without them, but it provides another abstraction which I've found invaluable and missed when using or creating JSON data sources.

What does make me slightly(!) sad though is the practical demise of XHTML and any priority that browsers might give to processing XML. There is now a many-to-one mapping of markup to DOM, and pre HTML5 (and still in practice for the foreseeable future considering browser idiosyncrasies and bugs) - a many-to-many mapping. It wouldn't surprise me if XSLT transform support eventually disappeared from browsers.

Maybe there's a bit of elitism here - if you can't code well-formed markup and some decent XSLT (preferably with lots of convoluted functional programming thrown in) - then frankly 'get of my lawn!'. I love the new features in HTML(5), but part of me wishes that there was an implied background 'X' unquestionably preceding that, for all things. The success of the web is that it broke out of that mould. But in doing that it has compromised the formalisms which machines demand and require. Is the dream of the machine-readable semantic web getting further away - even as cool and accessible (and standards compliant - at last) web content finally looks like it might possibly start to achieve its goal? Is it too much (and too late) to dream of 'data' (rather than ramblings like this one) being available in the same form for both the human viewer and the computer automaton?

I'm prepared to be realistic and accept where we've come to. It's not all bad, and the speed with which technology is changing has never been faster. It's an exciting time to wield electronic information, and we've got the tools to move forward from inaccessible files stored on closed, disconnected systems. So where I used to say 'XML over HTTP', my new mantra shall now be 'HTTP or it doesn't exist'. At least for a while.

Sunday, 5 June 2011

Bugs: they hide where you're not looking

I bought a new Pogoplug over the weekend (only £50 new at PC World), and after being genuinely impressed by the Pogoplug software, decided it was far too easy and put PlugApps Linux on it. These 'plug' type devices are fairly amazing - cheap, very low power (measured mine at just under 4 watts, with only a single USB flash stick), but with a decent 1.2GHz ARM processor. I'm already thinking my next computer might be another 'plug.

After hacking for I while (why won't my printer work?), I decided to check whether my pylibftdi package worked on it. To my shock, a quick 'pacman -S libftdi; pip install pylibftdi', installed fine, and I could open a device connection to a FTDI device! But then things got worse. Trying to run examples/lcd.py failed with an exception in BitBangDevice, and I quickly realised that the API changes I'd done in 0.8 to make device access more 'file-like' had broken things in BitBangDriver. I was slightly sad that I'd released something where the examples didn't even work, and part of the whole reason the package might be useful to people (the abstraction over bit-bang device operation) was broken.

pylibftdi is fairly simple, and basically consists of Driver, Device, and BitBangDevice classes. Most of the interesting is in the Device class - so this is where I started when I finally got round to adding some tests for the 0.8 release. Having achieved reasonable coverage (though shamefully less than the 100% Uncle Bob demands), I considered my initial testing 'done'. I knew there was more to add later, and had (and still have) full intentions to 'get around to it'.

What I failed to anticipate has the unintended side-effect of writing tests. In the same way teachers might teach how to pass an exam rather than the depth and breadth of a subject, once tests exist, the purpose can simply be to pass them. Old manual acceptance tests get ignored as the 'old way' of doing things. Ideally of course this isn't a problem, because full test cases exist for every feature and code-path in the system, but that was very far from the case here. So somehow, my standard acceptance test (do the example programs still work) got omitted, in preference for 'there are tests now, so it must be better! And the tests pass!'

So beware - a little testing can be a dangerous thing. The bugs hide where you're not looking for them. This is great motivation for achieving full test coverage, for automating acceptance testing (as well as unit / component level testing) so far as possible, and for being humble when it comes to writing tests. My motivations for writing them in the first place were two-fold: the feeling it was 'what I should do', and the idea that at some future point when I added or refactored things later I could be confident I hadn't broken things. I had no thought that the software was already broken; that I needed tests.

Anyway, pylibftdi 0.8.1 is now out, with trivial but important fixes and lots more tests.

Saturday, 7 May 2011

pylibftdi 0.8 released; testing, coverage, and mocking

pylibftdi is a file-like wrapper to FTDI USB devices such as the UB232R (a USB<->serial converter) and the UM245R (8 bit parallel I/O).

No big changes for the 0.8 release, but a couple of new things:

  • ability to specify the device identifier in the Device([device_id]) constructor as either a serial number (as previously), or a device description. So can now specify Device('UB232R'), and the first attached UB232R device will be opened. The code initially tries to open by serial number, and if that fails will try to open by description, which I'm fairly confident will be useful rather than annoying :-)
  • more file-like API functions (flush, readline()/readlines()/writelines(), iterating over lines). These probably only make sense for text over serial lines, but that's a use-case worth supporting, considering pylibftdi already has unicode support.

As well as that, I finally got round to adding some tests, and discovered something wonderful: checking test coverage isn't just practical icing on the cake to make sure things are tested well, but is a powerful and effective motivation for writing tests. I'm using coverage, and have to say it's one of those things I wish I had got round to sooner.

Speaking of which, at some point I'll probably end up saying the same about Mock, which I've read around and know I should probably start using, but it's just so easy in Python to knock up something like this:

fn_log = []
class SimpleMock(object):
    """
    This is a simple mock plugin for fdll which logs any calls
    made through it to fn_log, which is currently rather ugly
    global state.
    """
    def __init__(self, name="<base>"):
        self.__name = name

    def __getattr__(self, key):
        # This makes me smile :)
        return self.__dict__.get(key, SimpleMock(key))

    def __call__(self, *o, **k):
        fn_log.append(self.__name)
        # most fdll calls return 0 for success
        return 0

def get_calls(fn):
    "return the called function names which the fdll mock object made"
    del fn_log[:]
    fn()
    return fn_log

Sometimes I think Python makes 'clever' things like that too easy, and is perhaps the reason that although in the Python language there is only-one-way-to-do-it, in the Python ecosystem there is perhaps a tendency to reinvent the wheel over and over again. Because it's easy - and it's fun.

As always code is at bitbucket. For the next release (0.9) I'm planning to add more tests and docs (which are rather scarce), as well as one or two of the other things I've got planned (possible D2XX support, or at least some notes on libftdi on Windows, more examples & protocol adapters, maybe even a web interface for 8 bit IO...)

Monday, 18 April 2011

piefull - a simple javascript & canvas visualisation tool

OK, so calling piefull a visualisation tool is going a bit over the top, but it is a tool, and it does help with visualisation. It does one thing and one thing only - plot pie-charts indicating a single value. And it's even more restricted than that - the value it plots needs to be a percentage value.

The main use cases for this are things like task completion status (project and outliner applications, test coverage, etc) or resource allocation (disk space, CPU usage).

By choosing contrasting colours (these are configurable in piefull), the the overall outlook can be ascertained by a glance from a distance. By default the charts it generates are fairly small - 24px - which allows them to be used in-line in text, or as entries in tables. The general approach is derived from the sparklines example given in David Flanagan's Canvas Pocket Reference - replacing some textual data with a pictorial equivalent in the hope(!) it will be more quickly understandable. Of course this approach also lends itself well to graceful degradation, as the data is already there in the document itself.

There are plenty of other pie-chart generators - after all it's a fairly trivial thing to write with HTML5 canvas elements. But most of these tend to be fairly complex, with lots of options. I needed something where a fixed size display could represent a single value clearly, and piefull is the result.

Basically piefull turns this:

10%
20%
33%
20%
10%
18%
55%
33%
23%
12%
14%
35%
40%
21%
11%
12%
29%
14%
11%
5%
12%
17%
10%
9%
5%

(which looks like this in code:)

<table class="piefull">
    <tr>
       <td><div>10%</div></td>
       ...
       <td><div>5%</div></td>
    </tr>
</table>                 

into this:

10%
20%
33%
20%
10%
18%
55%
33%
23%
12%
14%
35%
40%
21%
11%
12%
29%
14%
11%
5%
12%
17%
10%
9%
5%

by doing this:

<script>
  window.onload = (function() { piefull.main("table.piefull td div"); });
</script>

where 'table.piefull td div' is a selector passed to querySelectorAll() to locate elements which will be replaced by little canvas piecharts. The contents of the selected elements are matched against a regular expression looking for a percentage value to extract (generally speaking, the first number), and the element is replaced with a canvas element displaying that value. The classes and id of the original element are preserved in the new element, allowing sensible CSS styling, and the canvas title takes on the text which it has replaced. As well as the selector, there are a (small) number of other parameters - size, 'yes' and 'no' colours. A value of e.g. 10 will display a 10% pie-segment in the 'yes' colour - the remainder will be the 'no' colour. (Like this: 10%.) These are all optional - even the selector defaults to '.piefull', which works great for a small number of spans or divs in some prose:

In other news, at least 66% of statistics are made up. No, wait - it should be a little higher now.

Note: If you're viewing this on IE8 or below, this probably makes no sense, as I've not included excanvas here. It is supported for IE8 with excanvas (but not less than IE8). One gotcha with IE is that block-level elements such as canvas don't work inside <p> elements. But you know you want to get a better browser... And why not get one with webgl support while you're at it?

The code for piefull can be found here

Wednesday, 6 April 2011

xmlcmd: adding an --xml option, one command at a time

In my last post, I wrote some thoughts on how using the text based meta-language[1] of regular expressions to filter and manipulate structured data from UNIX commands was not fully exploiting the jigsaw-puzzle approach of 'the unix philosophy'[2], and that XPath (and by implication XML) provided an alternative where structured data on the command line was concerned.[3]

I also mentioned how great things could be if, like subversion, every POSIX command line tool had an --xml switch which could output XML. (There are many programs with XML output, but the main POSIX ones[4] don't have this as an option)

Here's one I made earlier

I was always aware of the danger of overstating the case, but sometimes that can be helpful. Or at least fun. And I'd already started prototyping something which looked fun, dangerous, and potentially useful. This is intended to be illustrative rather than a serious suggestion, but there might be many other cases where the concepts can be used more seriously.

1. Add a path

There isn't any magic to what we're doing in adding --xml options, and we're not touching the original programs. We're just using the fact that the PATH in POSIX operating systems contains an ordered list of entries, and we're simply inserting a 'hook' early on in the path which can catch and redirect certain formats of command, while transparently forwarding others.

I tend to have a ~/bin directory on my path anyway (keeping good care that it is only writable by myself) - so I'm already set, but if not, you'll need a directory which appears first on the PATH.

ben$ mkdir -p ~/bin
add that path to the start of your login path (e.g. in .bashrc or .bash_profile):
export PATH=$HOME/bin:$PATH
Once that is done, anything in that directory will be run in preference to anything else. Put an 'ls' file in there something like the following:
#!/usr/bin/env python
print("These are not the files you're looking for")
make it executable (chmod +x ~/bin/ls) and you won't be able to run 'ls' anymore. Except you are running it, it's just a different ls, and not doing anything particularly helpful. You can always run the original ls with a fully specified path (or try using $(whereis ls)).

Two more things make this potentially useful:

  • Finding the next program on the PATH, which would have been run if something else hadn't sneaked in first
  • Selectively running either this 'original' program or some different code based on relevant criteria (e.g. existence of --xml in the command line options)
and the following makes things practical:
  • Making the two things above easily reusable for any command.

2. The magic of args[0]

Most of the time most programs ignore args[0] - the program's own name. But what if args[0] could be treated as a command line option, just like all the others? What makes this possible is having multiple symbolic links to a single program. args[0] is then the name of the original symlink by which the process was called, so although the same program is ultimately running, it can determine in what way it was called. It can therefore change its own operation. This technique is used in the busybox project to implement a generous number of commands in a single executable.

3. Introducing xmlcmd

xmlcmd is a Python package which supports this terrible corruption of POSIX as it should always be. The main xmlcmd module code is fairly straightforward, and is shown below. This finds the original program (which would have been run if we weren't first on the path), and then either execs that (if no --xml option is provided), or runs some Python code in an dynamically imported Python module (_xyz from the xmlcmd package, where xyz is the 'original' command name) if --xml is present.

#!/usr/bin/python
"""
xmlcmd.py - support for adding an --xml option to various commands

Ben Bass 2011. (Public Domain)
"""

import sys
import os
import which  # from PyPI 'which' package

def process_cmd(cmd_name, args, orig_cmd_path):
    """
    import and call the main() function from the module
    xmlcmd._{cmd}
    """
    module = __import__('xmlcmd._%s' % cmd_name,
                        fromlist=['_%s' % cmd_name])
    raise SystemExit(module.main(args, orig_cmd_path))

def main(args=None):
    """
    run system command from sys.argv[:], where sys.argv[0]
    implies the real command to run (e.g. via symlinks to us)
    """
    if args is None:
        args = sys.argv

    # args[0] will be a full path - we only want the command name
    cmd_name = os.path.basename(args[0])
    if cmd_name.startswith('xmlcmd'):
        raise SystemExit('xmlcmd should not be called directly')

    # get the command which would have run if we hadn't sneaked
    # ahead of it in the $PATH
    cmd_path_gen = which.whichgen(cmd_name)
    cmd_path_gen.next()   # skip first match (us)
    orig_cmd_path = cmd_path_gen.next()

    if '--xml' in args:
        args.remove('--xml')
        # forward to our xmlized version...
        process_cmd(cmd_name, args, orig_cmd_path)
    else:
        # execv *replaces* this process, so it has no idea it
        # wasn't called directly. Total transparency.
        os.execv(orig_cmd_path, args)

if __name__ == '__main__':
    main()

4. The implementations

The real work is all handled in the _{cmd} modules of course, so admittedly we've really only moved the problem around a bit. But the point of this exercise is about the ease with which we can add these new entry points into existing systems. Nothing slows down in any noticeable way, and it would be easy to extend an entire class of commands, one at a time, by nothing more than adding a Python module and creating a symlink.

For reference, the main() function from _ls.py looks something like this:

def main(args=None, orig_cmd_path=None):
    """very basic xml directory listing"""
    if len(args) > 1:
        target_dir = args[-1]
        if not os.path.isdir(target_dir):
            raise SystemExit('%s is not a directory' % (target_dir,))
    else:
        target_dir = os.getcwd()

    root = ET.Element('directory', name=target_dir)
    for fn in os.listdir(target_dir):
        stat = os.stat(os.path.join(target_dir, fn))
        f_el = ET.SubElement(root, 'file', mtime=str(stat.st_mtime))
        ET.SubElement(f_el, 'name').text = fn
        ET.SubElement(f_el, 'size').text = str(stat.st_size)
    ET.ElementTree(root).write(sys.stdout, 'utf-8')
    sys.stdout.write('\n')

5. Example

ben$ sudo pip install which xmlcmd
(yup, it's on PyPI) will install the xmlcmd Python package (and the 'which' dependency), and an xmlcmd wrapper script which should end up on the path. With that done, you can now create the magic symlinks:
ben$ ln -sf $(which xmlcmd) ~/bin/ls
ben$ ln -sf $(which xmlcmd) ~/bin/ps
And now, assuming things are working properly (a quick hash -r/rehash can't hurt), you should now be able to do wonderful things like this:
ben$ ps --xml aux | xpath '//process/command/text()[../../cpu > 2.5]'
which in this case displays the command name of all processes currently taking more than 2.5% of the CPU. Sure the XPath isn't exactly elegant. But the point is that patterns of this micro-language would be shared between tasks, and manipulating structured data on the UNIX command line would become as easy as text manipulation is now.

Here's some they made earlier...

Having said and done all that, a few searches later (for 'posix commands' in this case) brought up xmlsh.org, which seems to do some very similar things.

I also found (via [2]) xmltk, which at first glance seems to have beaten me to these ideas by about 9 years... :-)

Notes

[1]
'Regular expressions are notations for describing patterns of text and, in effect, make up a special-purpose language for pattern matching.' Brian Kerninghan, Beautiful Code (ed. Andy Oram & Greg Wilson, O'Reilly Media Inc).
[2]
The Art of Unix Programming, Eric S. Raymond. Especially the 'Rule of Composition'; see Chapter 1. (Note this book also praises text of course...)
[3]
What a pointlessly long sentence.
[4]
POSIX 2 (Commands and Utilities) covers these, e.g. see reference here

Monday, 21 March 2011

XPath is to XML as regex is to text

Anyone who has been a developer for a while gets familiar with regular expressions. They eat text for breakfast, and spit out desired answers. For all their cryptic terseness, they are at least in part reusable, and are based on only a handful (or two...) of rules. 'regexes' are a domain-specific micro-language for searching and filtering text.

But once we get outside of text, what is there?

With XML, we have XPath. I had one of those light-bulb realisations recently that what regexes are to text, XPath is to XML. And it made me think:

Why would I want to use a data substrate which doesn't have such a tool?

What I mean is this; text has regex. XML has XPath. RDBMS have SQL. Markup language of the day has... oh, it doesn't. Not really, in the sense of a standardised domain-specific micro-language. Regular expressions, XPath and SQL have history and mindshare. They 'work' specifically because they are DSLs, rather than high-level code. (OK, SQL is pushing it further than I'd like here, but it's still a DSL. Just not a micro-sized one.) To me, this is a problem which many 'NoSQL' tools have. I want the features of them, but CouchDB wants me to write map-reduce functions in JavaScript. MongoDB wants me to use a JSON-based query language. There is no commonality; no reuse; no lingua franca which will let me abstract the processing concepts away from the tools. Perhaps that will come in time for more data-representations (this seems to be an attempt for JSON, for example), but there is a significant barrier before such a tool gains widespread acceptance as a common abstraction across an entire data layer.

Pipelines and Data Processing

The 'UNIX philosophy' of connecting together a number of single-purpose programs to accomplish larger tasks is one of the keys to its power. These tools can be plugged together in ways which the original creators may never have thought of. Tools such as sed and awk are often employed as regex-based filters to command pipelines. I wish more tools had XML output options, because the tools we use in our pipelines often output structured data in textual format, often in tabular form. Tables are great for human consumption (provided they are modest in size), but when we start getting empty columns, cells flowing onto multiple lines, and other inconsistencies, it becomes a pain to parse. How great things could be if every tool following subversion's lead and had an --xml option:

svn diff -r $(svn log --stop-on-copy --xml | xpath -q -e '//log/logentry[last()]/@revision' | cut -d '"' -f 2):HEAD
(This command does a diff from a branch base to the most recent revision.  It still does some basic text processing, because the end result of XPath expressions are still text nodes).

Just imagine if POSIX defined an XML schema for each relevant command, and mandated an --xml option. Life would be so much easier. In many environments, data is structured but we still represent it as text. The pipeline philosophy might be nice, but it isn't exploited to the full when we need to write convoluted awk scripts and inscrutable regular expressions (or worse, Perl ;) ) to try and untangle the structure from the text. Consider something straightforward like the output of 'mount' on a *nix box. On my Mac it looks like this:

ben$ mount
/dev/disk0s2 on / (hfs, local, journaled)
devfs on /dev (devfs, local, nobrowse)
map -hosts on /net (autofs, nosuid, automounted, nobrowse)
map auto_home on /home (autofs, automounted, nobrowse)
/dev/disk1s1 on /Volumes/My Book (msdos, local, nodev, nosuid, noowners)

This is structured data, but getting the information out of that text blob would not be trivial, and would probably take many minutes of trial and error with regexes to get something reasonable. And the crucial thing is that you couldn't be sure it would always work. Plug a new device in which gets mounted in some new and interesting way, and who is to say that the new output of mount won't suddenly break your hand-crafted regex? That's where XML shines. Adding new information doesn't change anything in the old information. The way to access it doesn't change. Nothing breaks in the face of extension. Compare this to something like CSV, where the insertion of an extra column means all the indices from that column onwards need to change in every producer and consumer of the data.

XML and the Web

I'm somewhat saddened that XHTML didn't win outright in the last decade, and that XML on the web never really took off. I spent months at a previous job trying to convince everyone that 'XML-over-HTTP' was the best thing since sliced bread. A single source of data, which could be consumed by man (via XSLT & CSS in the browser) and machine alike. Just think how much energy the likes of Google could save if our web content didn't focus almost entirely on human consumption and discriminate against machines ;-)

One interesting thing which has happened as XML-on-the-web has declined is the increase in use of CSS selectors, first via frameworks such as Sizzle (used in jQuery), and later in the standard querySelectorAll DOM method. There is clearly a need for these DSL micro-languages, and as the 'CSS selector' DSL shows, they can quickly establish themselves if there is a clear need and sufficient backing from the community. Also apparent is that existing solutions can be usurped - users could do virtually everything CSS selectors could do (and far more besides) with XPath, but that didn't happen. Simplicity won here. But just because XPath was (arguably) wrong for Web development, doesn't mean it is wrong everywhere, and I contend that there are places where we have over-simplified, forcing regular expressions and text manipulation to (and beyond) breaking point, when XML processing would make things simpler everywhere.

Conclusion

In terms of practicalities, if I had ever spent too long in the world of Java, I would probably see XML as an unwelcome and persistent pest. But living in the happier climes of Python-ville, I have access to the wonderful ElementTree API, via both ElementTree itself (included in the standard library) and lxml.

Both of these support XPath as well as high-level abstractions of XML documents to and from lists and dictionaries. With ElementTree, XML access from Python is (almost) as easy as JSON access from JavaScript. And with technologies like XPath and XSLT available, I think it's worth it.

As a final thought, I've just had a quick glance through Greg Wilson's excellent Data Crunching, which contains chapters on Text, Regular Expressions, Binary data (rather a short ad-hoc chapter), XML, and Relational Databases. Perhaps the 'binary data' chapter is short because there simply aren't many patterns available. There is no language to describe unabstracted data. And perhaps when we consider the data layer we should be using, we should think not only of features and performance, but also the power, expressiveness, and concision of the languages available to reason about the information. Perhaps too often we settle for a lowest common denominator solution (text) when a higher level one might be more powerful, especially if we don't have to give up on the concepts of fine-grained interoperability which micro-DSLs such as XPath give us.

To be continued...

Sunday, 20 February 2011

Concurrent Queue.get() with timeouts eats CPU

...or how adding a timeout can make your program suffer

Call me lazy, but I like threads. Or at least I like the programming model they provide. I very rarely use explicit locks, and find the combination of threads and queues a great mental abstraction of parallel processing. Often though, the abstraction is so leaky that it gets me annoyed. Here is a case in point...

I noticed this problem in a Python process which sits in a loop doing readline() on a file object and dispatching the incoming lines to different worker threads to do some various asynchronous actions. With no input on the source file, the process was still taking 5% CPU. I would have expected next-to-nothing, since everything should have been blocking.

strace -fc -p $PID showed that the process was anything but idle though, and after further investigation, I found the culprit.

Concurrent Queue.get() with timeouts eats CPU.

A test case for this is the following Python (2 & 3) code. It intentionally doesn't do anything, simply starting WORKER threads, each of which performs a blocking Queue.get. The main thread simply waits for a newline on stdin. I wouldn't expect this to take any significant CPU time - in theory all the threads are blocked - either waiting on stdin input, or on something to be available in the various worker queues (which nothing ever gets sent to).

import threading, sys, time
try: import queue
except ImportError: import Queue as queue

WORKERS = 100

class Worker(threading.Thread):
    def __init__(self):
        self.queue = queue.Queue()
        threading.Thread.__init__(self)
        self.daemon = True

    def run(self):
        while True:
            next_item = self.queue.get()
            print(next_item)

def test():
    w_set = set()
    for i in range(WORKERS):
        new_w = Worker()
        new_w.start()
        w_set.add(new_w)

    print('Running: Press Enter to finish')
    sys.stdin.readline()

if __name__ == '__main__':
    test()

Sure enough, running and monitoring this shows 0% CPU usage, but WORKER+1 threads in use (I'm using OS X's Activity Monitor at the moment).

But let's suppose we want to change the worker threads to wake up occasionally to do some background activity. No problem: provide a timeout on the Queue.get():

class TimeoutWorker(Worker):
    def run(self):
        while True:
            try:
                next_item = self.queue.get(timeout=1)
            except queue.Empty:
                # do whatever background check needs doing
                pass
            else:
                print(next_item)

OK, so now the threads can wake up occasionally and perform whatever activity they want.

Except...

CPU usage just went up from ~0% to 10%. Increasing WORKERS shows that the CPU load of this program which still does nothing (the queues never get anything put in them) is proportional to the number of threads (95% at 1000 worker threads). I'm not inclined to look further than assuming this is some artifact of the GIL (pthread activity seems to be the culprit).

This is fairly independent of the length of the timeout. For very short timeouts, I'd expect CPU usage to go up, as the worker thread is spending more time doing work rather than being blocked. But there is no noticeable difference between timeout=10 and timeout=sys.maxint. In the latter case, the get() is never plausibly going to timeout, but the same high-CPU behaviour still occurs.

Fixing the code

I'm not inclined to delve deep into CPython to look at what Queue.get() is doing under the hood. It's clearly something very different depending on whether it has a timeout or not. For now I'm content to fix the code to eliminate the situations where these problems can occur. Hopefully the fact that I've written this will keep me aware of this potential issue and I'll manage to avoid it in future :)

The code where I found this issue was using a 1 second timeout to continually check the while condition and exit if required. This was easily fixed with sending a poison-pill of None into the queue rather than setting a flag on the thread instance, and checking for this once we've got a next_item. This is cleaner anyway, allowing immediate thread termination and the use of timeout-less get(). For more complex cases where some background activity is required in the worker threads, it might make more sense to keep all threads using timeout-less Queue.get()s and have a separate thread sending sentinel values into each queue according to some schedule, which cause the background activity to be run.

Conclusion

It seems fairly unintuitive that simply adding a timeout to a Queue.get() can totally change the CPU characteristics of a multi-threaded program. Perhaps this could be documented and explained. But then in CPython it seems many threading issues are entirely unintuitive. The scientific part of my brain won't stop thinking threads are wonderful, but the engineering part is becoming increasingly sceptical about threads and enamoured with coroutines, especially with PEP 380 on the horizon.

Wednesday, 9 February 2011

pylibftdi 0.7 - multiple device support

pylibftdi has always been about minimalism, which means that if you wanted to do something it didn't support, things got tricky. One of it's glaring deficiencies until now was that it only supported a single FTDI device - if you had multiple devices plugged in, it would pick one - seemingly - at random.

With pylibftdi 0.7, that has finally changed, and devices can now be opened by name. Or at least by serial number, which is nearly as good. A new example script (which I've just remembered is hideously raw and lacks any tidying up at all) examples/list_devices.py in the source distribution will enumerate the attached devices, displaying the manufacturer (which should be FTDI in all cases), description, and serial number.

The API has changed slightly to cope with this; whereas previously there was just a single Driver class, now the primary interface is the Device class. Driver still exists, and holds the CDLL reference, as well as supporting device enumeration and providing backwards compatibility.

(As an aside, using ftdi_usb_find_all was (not) fun - it sets a pointer to pointer which is then used to traverse a linked list. Trivial in C, an hour of frustration in ctypes. Anyway, I got there in the end).

>>> from pylibftdi import Device
>>> import time
>>> 
>>> # make some noise
>>> with Device('FTE4FFVQ', mode='b') as midi_dev:
...     midi_dev.baudrate = 31250
...     for count in range(3):
...         midi_dev.write(b'\x90\x3f\x7f')
...         time.sleep(0.5)
...         midi_dev.write(b'\x90\x3f\x00')
...         time.sleep(0.25)
... 

Both Device() and BitBangDevice take device_id as the (optional) first parameter to select the target device. If porting from an earlier version, one of the first changes is probably to use named parameters for options when instantiating these classes. My intention is that device_id will always be the first parameter, but the order and number of subsequent parameters could change.

Another change is that Devices are now opened implicitly on instantiation unless told not to (see the docstrings). Previously the Driver class only opened automatically when used as a context manager. There is no harm in opening devices multiple times though - subsequent open()s have no effect.

I've also finally figured out that I need to set long_description in setup.py to get documentation to appear on the PyPI front page. After all, without docs, it doesn't exist.

It's only been a few days since 0.6, but I wanted to get this release out - I think it is a big improvement since 0.5, and It'll probably be a while till the next release. In the mean time, I'll try and get a vaguely useful example going - which will probably involve MIDI and an LCD...

Sunday, 6 February 2011

pylibftdi 0.6 released: now with Python 3 goodness

pylibftdi 0.6 has been out the door and onto PyPI for the last few days, but I'm only just getting round to blogging about it. It's basically some minor work for Python 3 compatibility - the same code now works on both Python 2 (2.6/2.7) and Python 3. This means support for Python 2.5 has been dropped (due to use of bytearray/bytes types). I can always add it back in if people shout.

Other than trivially fixing a print statement to be a function call, the main change required was the expected bytes/string issue. The driver also gains a couple of parameters; mode = 't' ('t':text, 'b':binary) and encoding = 'latin1'.

In binary mode (the default - so no user code changes are required for this release), read() and write() take and return instances of type bytes. For text mode, write() will take either bytes/bytearray, or a string which it will encode with the given driver encoding, and read() will return a string. I've set the default to be latin1 rather than using utf-8 as it is an equivalence mapping over the first 256 code points.

Coming soon...

I've started work on 0.7 - the main feature of which is support for multiple devices. I had a few problems getting the right ctypes incantations to follow the linked-list which ftdi_usb_find_all sets, but that's sorted now. The bigger issue is that it really needs a split between driver and device, which could cause the API to change. I'm thinking of various ways to keep existing code working, and will probably go for something like:

  • 0.7 - set pylibftdi.SUPPORT_MULTIPLE to True to use new API / support multiple devices
  • 0.8 - set pylibftdi.SUPPORT_MULTIPLE to False to use old API / only support a single device / get a deprecation warning
  • 0.9 - SUPPORT_MULTIPLE no longer used; old API disappears.

So 0.7 is all about multiple device support, 0.8 will probably be support for Windows (supporting D2XX, for example), and 0.9 (or maybe just 1.0) will be a tidy-up / bug-fix / improve docs release. In parallel with all of this I'm writing some test code which will gradually bring this side of things up-to-standard. I'm not allowing myself to do a 1.0 release without decent testing & docs. All that will probably take a two months; I only get a couple of hours a week looking at this. But it could be sooner - or later.

pylibftdi 0.7 should be out in within a week or so, and I'll elaborate more then, hence the lack of any examples here. I'm on the case!