Sunday 20 February 2011

Concurrent Queue.get() with timeouts eats CPU

...or how adding a timeout can make your program suffer

Call me lazy, but I like threads. Or at least I like the programming model they provide. I very rarely use explicit locks, and find the combination of threads and queues a great mental abstraction of parallel processing. Often though, the abstraction is so leaky that it gets me annoyed. Here is a case in point...

I noticed this problem in a Python process which sits in a loop doing readline() on a file object and dispatching the incoming lines to different worker threads for various asynchronous actions. With no input on the source file, the process was still taking 5% CPU. I would have expected next to nothing, since everything should have been blocking.
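For context, the shape of that process was roughly the following - a simplified sketch with hypothetical names (dispatch, worker_queues), not the actual code:

def dispatch(source, worker_queues):
    # Read lines as they arrive and hand each one off to a worker's queue;
    # the worker threads then process them asynchronously.
    n = 0
    while True:
        line = source.readline()    # blocks until a line is available
        if not line:                # EOF: the source was closed
            break
        # Hand the line to a worker (here simply round-robin).
        worker_queues[n % len(worker_queues)].put(line)
        n += 1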

strace -fc -p $PID showed that the process was anything but idle though, and after further investigation, I found the culprit.

Concurrent Queue.get() with timeouts eats CPU.

A test case for this is the following Python (2 & 3) code. It intentionally doesn't do anything useful: it starts WORKERS threads, each of which performs a blocking Queue.get(), while the main thread waits for a newline on stdin. I wouldn't expect this to take any significant CPU time - in theory all the threads are blocked, either waiting on stdin input or on something being available in the various worker queues (which nothing ever gets sent to).

import threading, sys
try: import queue
except ImportError: import Queue as queue

WORKERS = 100

class Worker(threading.Thread):
    def __init__(self):
        self.queue = queue.Queue()
        threading.Thread.__init__(self)
        self.daemon = True

    def run(self):
        while True:
            next_item = self.queue.get()
            print(next_item)

def test():
    w_set = set()
    for i in range(WORKERS):
        new_w = Worker()
        new_w.start()
        w_set.add(new_w)

    print('Running: Press Enter to finish')
    sys.stdin.readline()

if __name__ == '__main__':
    test()

Sure enough, running and monitoring this shows 0% CPU usage, but WORKERS+1 threads in use (I'm using OS X's Activity Monitor at the moment).

But let's suppose we want to change the worker threads to wake up occasionally to do some background activity. No problem: provide a timeout on the Queue.get():

class TimeoutWorker(Worker):
    def run(self):
        while True:
            try:
                next_item = self.queue.get(timeout=1)
            except queue.Empty:
                # do whatever background check needs doing
                pass
            else:
                print(next_item)

OK, so now the threads can wake up occasionally and perform whatever activity they want.

Except...

CPU usage just went up from ~0% to 10%. Increasing WORKERS shows that the CPU load of this program - which still does nothing, since the queues never have anything put in them - is proportional to the number of threads (95% at 1000 worker threads). I'm not inclined to look further than assuming this is some artifact of the GIL (pthread activity seems to be the culprit).

This is fairly independent of the length of the timeout. For very short timeouts I'd expect CPU usage to go up, as the worker threads spend more time doing work rather than being blocked. But there is no noticeable difference between timeout=10 and timeout=sys.maxint. In the latter case the get() is never plausibly going to time out, yet the same high-CPU behaviour still occurs.

Fixing the code

I'm not inclined to delve deep into CPython to look at what Queue.get() is doing under the hood. It's clearly something very different depending on whether it has a timeout or not. For now I'm content to fix the code to eliminate the situations where these problems can occur. Hopefully the fact that I've written this will keep me aware of this potential issue and I'll manage to avoid it in future :)

The code where I found this issue was using a 1 second timeout to continually check the while condition and exit if required. This was easily fixed by sending a poison pill of None into the queue, rather than setting a flag on the thread instance, and checking for it once we've got a next_item. This is cleaner anyway, allowing immediate thread termination and the use of a timeout-less get(); a minimal sketch follows. For more complex cases where some background activity is required in the worker threads, it might make more sense to keep all threads using timeout-less Queue.get()s and have a separate thread send sentinel values into each queue on some schedule, triggering the background activity when a worker receives one.
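As a minimal sketch of that fix (the class name is mine, not from the original code), the worker blocks on a plain get() and exits as soon as it sees the sentinel:

class SentinelWorker(Worker):
    def run(self):
        while True:
            next_item = self.queue.get()   # no timeout, so the thread genuinely blocks
            if next_item is None:          # poison pill: terminate immediately
                break
            print(next_item)

# Shutting a worker down is then just: worker.queue.put(None)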

Conclusion

It seems fairly unintuitive that simply adding a timeout to a Queue.get() can totally change the CPU characteristics of a multi-threaded program. Perhaps this could be documented and explained. But then in CPython it seems many threading issues are entirely unintuitive. The scientific part of my brain won't stop thinking threads are wonderful, but the engineering part is becoming increasingly sceptical about threads and enamoured with coroutines, especially with PEP 380 on the horizon.

5 comments:

  1. CPython just uses whatever timed lock acquisition primitives the platform has available, so exactly how the behaviour changes will vary by platform.

    I know you said you didn't want to dig into the source code, but the three most common examples can be seen at:
    Posix threads: http://svn.python.org/view/python/branches/py3k/Python/thread_pthread.h?view=markup
    Windows: http://svn.python.org/view/python/branches/py3k/Python/thread_nt.h?view=markup

    (There are two variants in the pthreads file - one that uses semaphores and another that uses condition variables when a working semaphore implementation isn't provided by the platform)

  2. The code is fairly new and was added during the development phase of Python 3.2. All Python 2.x versions and 3.1 use a busy loop with an increasing timeout (see the sketch after these comments). AFAIR the delay starts at 1ms and goes up to 50ms, which means every lock acquired with a timeout (conditions, events, queues ...) wakes up the CPU 20 times a second.

    The threading2 library may offer a solution for you. I haven't tested or reviewed it myself yet.

  3. Thanks both, some useful info. I was aware the GIL code was changing in 3.2, but if this timeout behaviour is improved too then hopefully this issue will disappear in time.

    I'm content being aware of the issue and working round it till then...

  4. Unfortunately my memory has served me wrong. threading2 has no optimized timeout for locks. I guess I have to roll my own backport of Python 3.2's lock ...

  5. Yeah, sorry, I've been meaning to add native lock timeouts to threading2 to address this very problem, but just haven't got around to it...

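For reference, the busy loop mentioned in the second comment is the timed wait in Python 2's Lib/threading.py (Condition.wait), which Queue.get(timeout=...) ends up calling. Paraphrased from memory - treat it as a sketch rather than the exact source - it looks roughly like this:

# Poll the waiter lock non-blockingly, sleeping between attempts with an
# exponentially growing delay capped at 50ms (hence ~20 wake-ups per second).
endtime = _time() + timeout
delay = 0.0005                               # doubles to ~1ms before the first sleep
while True:
    gotit = waiter.acquire(0)                # non-blocking acquire attempt
    if gotit:
        break
    remaining = endtime - _time()
    if remaining <= 0:
        break
    delay = min(delay * 2, remaining, 0.05)  # never sleep longer than 50ms
    _sleep(delay)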