Sunday 20 February 2011

Concurrent Queue.get() with timeouts eats CPU

...or how adding a timeout can make your program suffer

Call me lazy, but I like threads. Or at least I like the programming model they provide. I very rarely use explicit locks, and find the combination of threads and queues a great mental abstraction of parallel processing. Often though, the abstraction is so leaky that it gets me annoyed. Here is a case in point...

I noticed this problem in a Python process which sits in a loop doing readline() on a file object and dispatching the incoming lines to different worker threads to do some various asynchronous actions. With no input on the source file, the process was still taking 5% CPU. I would have expected next-to-nothing, since everything should have been blocking.

strace -fc -p $PID showed that the process was anything but idle though, and after further investigation, I found the culprit.

Concurrent Queue.get() with timeouts eats CPU.

A test case for this is the following Python (2 & 3) code. It intentionally doesn't do anything, simply starting WORKER threads, each of which performs a blocking Queue.get. The main thread simply waits for a newline on stdin. I wouldn't expect this to take any significant CPU time - in theory all the threads are blocked - either waiting on stdin input, or on something to be available in the various worker queues (which nothing ever gets sent to).

import threading, sys, time
try: import queue
except ImportError: import Queue as queue


class Worker(threading.Thread):
    def __init__(self):
        self.queue = queue.Queue()
        self.daemon = True

    def run(self):
        while True:
            next_item = self.queue.get()

def test():
    w_set = set()
    for i in range(WORKERS):
        new_w = Worker()

    print('Running: Press Enter to finish')

if __name__ == '__main__':

Sure enough, running and monitoring this shows 0% CPU usage, but WORKER+1 threads in use (I'm using OS X's Activity Monitor at the moment).

But let's suppose we want to change the worker threads to wake up occasionally to do some background activity. No problem: provide a timeout on the Queue.get():

class TimeoutWorker(Worker):
    def run(self):
        while True:
                next_item = self.queue.get(timeout=1)
            except queue.Empty:
                # do whatever background check needs doing

OK, so now the threads can wake up occasionally and perform whatever activity they want.


CPU usage just went up from ~0% to 10%. Increasing WORKERS shows that the CPU load of this program which still does nothing (the queues never get anything put in them) is proportional to the number of threads (95% at 1000 worker threads). I'm not inclined to look further than assuming this is some artifact of the GIL (pthread activity seems to be the culprit).

This is fairly independent of the length of the timeout. For very short timeouts, I'd expect CPU usage to go up, as the worker thread is spending more time doing work rather than being blocked. But there is no noticeable difference between timeout=10 and timeout=sys.maxint. In the latter case, the get() is never plausibly going to timeout, but the same high-CPU behaviour still occurs.

Fixing the code

I'm not inclined to delve deep into CPython to look at what Queue.get() is doing under the hood. It's clearly something very different depending on whether it has a timeout or not. For now I'm content to fix the code to eliminate the situations where these problems can occur. Hopefully the fact that I've written this will keep me aware of this potential issue and I'll manage to avoid it in future :)

The code where I found this issue was using a 1 second timeout to continually check the while condition and exit if required. This was easily fixed with sending a poison-pill of None into the queue rather than setting a flag on the thread instance, and checking for this once we've got a next_item. This is cleaner anyway, allowing immediate thread termination and the use of timeout-less get(). For more complex cases where some background activity is required in the worker threads, it might make more sense to keep all threads using timeout-less Queue.get()s and have a separate thread sending sentinel values into each queue according to some schedule, which cause the background activity to be run.


It seems fairly unintuitive that simply adding a timeout to a Queue.get() can totally change the CPU characteristics of a multi-threaded program. Perhaps this could be documented and explained. But then in CPython it seems many threading issues are entirely unintuitive. The scientific part of my brain won't stop thinking threads are wonderful, but the engineering part is becoming increasingly sceptical about threads and enamoured with coroutines, especially with PEP 380 on the horizon.

Wednesday 9 February 2011

pylibftdi 0.7 - multiple device support

pylibftdi has always been about minimalism, which means that if you wanted to do something it didn't support, things got tricky. One of it's glaring deficiencies until now was that it only supported a single FTDI device - if you had multiple devices plugged in, it would pick one - seemingly - at random.

With pylibftdi 0.7, that has finally changed, and devices can now be opened by name. Or at least by serial number, which is nearly as good. A new example script (which I've just remembered is hideously raw and lacks any tidying up at all) examples/ in the source distribution will enumerate the attached devices, displaying the manufacturer (which should be FTDI in all cases), description, and serial number.

The API has changed slightly to cope with this; whereas previously there was just a single Driver class, now the primary interface is the Device class. Driver still exists, and holds the CDLL reference, as well as supporting device enumeration and providing backwards compatibility.

(As an aside, using ftdi_usb_find_all was (not) fun - it sets a pointer to pointer which is then used to traverse a linked list. Trivial in C, an hour of frustration in ctypes. Anyway, I got there in the end).

>>> from pylibftdi import Device
>>> import time
>>> # make some noise
>>> with Device('FTE4FFVQ', mode='b') as midi_dev:
...     midi_dev.baudrate = 31250
...     for count in range(3):
...         midi_dev.write(b'\x90\x3f\x7f')
...         time.sleep(0.5)
...         midi_dev.write(b'\x90\x3f\x00')
...         time.sleep(0.25)

Both Device() and BitBangDevice take device_id as the (optional) first parameter to select the target device. If porting from an earlier version, one of the first changes is probably to use named parameters for options when instantiating these classes. My intention is that device_id will always be the first parameter, but the order and number of subsequent parameters could change.

Another change is that Devices are now opened implicitly on instantiation unless told not to (see the docstrings). Previously the Driver class only opened automatically when used as a context manager. There is no harm in opening devices multiple times though - subsequent open()s have no effect.

I've also finally figured out that I need to set long_description in to get documentation to appear on the PyPI front page. After all, without docs, it doesn't exist.

It's only been a few days since 0.6, but I wanted to get this release out - I think it is a big improvement since 0.5, and It'll probably be a while till the next release. In the mean time, I'll try and get a vaguely useful example going - which will probably involve MIDI and an LCD...

Sunday 6 February 2011

pylibftdi 0.6 released: now with Python 3 goodness

pylibftdi 0.6 has been out the door and onto PyPI for the last few days, but I'm only just getting round to blogging about it. It's basically some minor work for Python 3 compatibility - the same code now works on both Python 2 (2.6/2.7) and Python 3. This means support for Python 2.5 has been dropped (due to use of bytearray/bytes types). I can always add it back in if people shout.

Other than trivially fixing a print statement to be a function call, the main change required was the expected bytes/string issue. The driver also gains a couple of parameters; mode = 't' ('t':text, 'b':binary) and encoding = 'latin1'.

In binary mode (the default - so no user code changes are required for this release), read() and write() take and return instances of type bytes. For text mode, write() will take either bytes/bytearray, or a string which it will encode with the given driver encoding, and read() will return a string. I've set the default to be latin1 rather than using utf-8 as it is an equivalence mapping over the first 256 code points.

Coming soon...

I've started work on 0.7 - the main feature of which is support for multiple devices. I had a few problems getting the right ctypes incantations to follow the linked-list which ftdi_usb_find_all sets, but that's sorted now. The bigger issue is that it really needs a split between driver and device, which could cause the API to change. I'm thinking of various ways to keep existing code working, and will probably go for something like:

  • 0.7 - set pylibftdi.SUPPORT_MULTIPLE to True to use new API / support multiple devices
  • 0.8 - set pylibftdi.SUPPORT_MULTIPLE to False to use old API / only support a single device / get a deprecation warning
  • 0.9 - SUPPORT_MULTIPLE no longer used; old API disappears.

So 0.7 is all about multiple device support, 0.8 will probably be support for Windows (supporting D2XX, for example), and 0.9 (or maybe just 1.0) will be a tidy-up / bug-fix / improve docs release. In parallel with all of this I'm writing some test code which will gradually bring this side of things up-to-standard. I'm not allowing myself to do a 1.0 release without decent testing & docs. All that will probably take a two months; I only get a couple of hours a week looking at this. But it could be sooner - or later.

pylibftdi 0.7 should be out in within a week or so, and I'll elaborate more then, hence the lack of any examples here. I'm on the case!