Friday 31 December 2010

HTTP 'streaming' from Python generators

One of the more annoying things about HTTP is that it wants to send things in complete chunks: you ask for an object with a particular URL, and at some point later you get that object. There isn't a lot you can do (at least from JavaScript) until the complete resource has loaded, which makes anything requiring more fine-grained control a bit awkward.

Of course web sockets will solve all of this (maybe) once the spec has gone through the mill a few more times. But in the meantime there is often an impedance mismatch between how we'd like to be able to do things on the server side and how we are forced to do them because of the way HTTP works.

The following is an example of one way to manage splitting things up. It allows Python generators to be used on the server side and sends an update to the client on every yield, with the client doing long-polling to get the data. This shouldn't be confused with CherryPy's support for using yield to stream a single response back to the client (which is discouraged anyway) - the yield functionality is hijacked for other purposes once the decorator below is applied. Also note that this only helps clients which can use AJAX to repeatedly poll the generator.

Example

Let's suppose that we want to generate a large (or infinite) volume of data and send it to a web client. It could be a long text document served line-by-line. But let's use the sequence of prime numbers (because that's good enough for aliens). We want to send it to the client, and have it processed as it arrives. The principle is to use a generator on the server side rather than a basic request function, but wrap that in something which translates the generator into a sequence of responses, each serving one chunk of the response.

Server implementation using CherryPy - note the json_yield decorator.

import cherrypy

class PrimeGen(object):
    @cherrypy.expose
    def index(self):
        return INDEX_HTML # see below

    @cherrypy.expose
    @json_yield   # see below
    def prime(self):
        # this isn't supposed to be efficient - plain trial division.
        probe = 2
        while True:
            for i in range(2, probe):
                if probe % i == 0:
                    break
            else:
                # no divisor found, so probe is prime
                yield probe
            probe += 1

cherrypy.quickstart(PrimeGen)

The thing which turns this generator into something usable with long-polling is the following 'json_yield' decorator.

Because we might want more than one such generator on the server (not to mention generator instances from multiple clients), we need a key - passed in from the client - which associates a particular client with the generator instance. This isn't really handled in this example; see the source file download at the end of the post for that.

The major win is that the client doesn't have to store a 'next-index' or anything else: state is stored implicitly in the Python generator on the server side, so both client and server code should be simpler. Of course this goes against REST principles, where one of the fundamental tenets is that state should be kept on the client rather than the server. But there is a place for everything.

import functools
import json

def json_yield(fn):
    # each application of this decorator has its own id
    json_yield._fn_id += 1

    # put it into the local scope so our internal function
    # can use it properly
    fn_id = json_yield._fn_id

    @functools.wraps(fn)
    def _(self, key, *o, **k):
        """
        key should be unique to a session.
        Multiple overlapping calls with the same 
        key should not happen (will result in
        ValueError: generator already executing)
        """

        # create generator if it hasn't already been
        if (fn_id,key) not in json_yield._gen_dict:
            new_gen = fn(self, *o, **k)
            json_yield._gen_dict[(fn_id,key)] = new_gen

        # get next result from generator
        try:
            # get the next chunk, assuming there is more.
            gen = json_yield._gen_dict[(fn_id, key)]
            content = gen.next()
            # send it
            return json.dumps({'state': 'ready',
                               'content': content})
        except StopIteration:
            # remove the exhausted generator object
            del json_yield._gen_dict[(fn_id, key)]
            # signal we are finished.
            return json.dumps({'state': 'done',
                               'content': None})
    return _
# some function data...
json_yield._gen_dict = {}
json_yield._fn_id = 0

The HTML to go with this does basic long-polling, separating out the state from the content. Here I'm using jQuery:

INDEX_HTML = """
<html>
<head>
<script src="http://code.jquery.com/jquery-1.4.4.min.js">
</script>
<script>
$(function() {

  function update() {
    $.getJSON('/prime?key=1', {}, function(data) {
      if (data.state != 'done') {
        $('#status').text(data.content);
        //// alternative for appending:
        // $('#status').append($('<p>'+data.content+'</p>'));
        // poll again straight away for the next chunk
        setTimeout(update, 0);
      }
    });
  }
  update();
});
</script>
</head>
<body>
<div id="status"></div>
</body>
</html>
"""

Uses

The example above is contrived, but there are plenty of possibilities if the json_yield decorator is extended in various ways. Long-running server-side processes can send status information back to the client with minimal hassle. Client-side processing of large text documents can begin before they have finished downloading. One caveat is that each chunk of the data should be semantically understandable on its own: using this on binary files or on XML (which is only valid once the root element is closed) won't give sensible results.
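
For instance, a long-running job on the server might report progress with something like this sketch (JobRunner, the fake work loop and the one-second sleep are all made up for illustration; it assumes cherrypy and the json_yield decorator above are in scope):

import time

class JobRunner(object):
    @cherrypy.expose
    @json_yield
    def status(self):
        # stand-in for some long-running server-side task; each yield
        # becomes one long-poll response delivered to the client.
        for step in range(100):
            time.sleep(1)  # pretend to do one unit of work
            yield '%d%% complete' % (step + 1)

The client-side JavaScript stays exactly as above, just polling '/status?key=...' instead of '/prime?key=1'.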

There is plenty of scope for extending this: the decorator could accumulate the content (rather than leaving that to the client) and send everything yielded so far back on each poll, or (given finite memory) some recent portion of it using a length-limited deque. Additional metadata (e.g. a count of messages so far, or the session key) could be added to the JSON sent to the client on each poll.
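
As a rough sketch of the deque idea (the name json_yield_recent, the maxlen of 50 and the extra JSON fields are arbitrary choices made here, not part of the code above):

import collections
import functools
import json

def json_yield_recent(fn):
    # variant of json_yield which also returns the most recent chunks
    # and a running message count on every poll.
    json_yield_recent._fn_id += 1
    fn_id = json_yield_recent._fn_id

    @functools.wraps(fn)
    def _(self, key, *o, **k):
        state_key = (fn_id, key)
        if state_key not in json_yield_recent._gen_dict:
            json_yield_recent._gen_dict[state_key] = fn(self, *o, **k)
            # bounded history, so memory use stays finite
            json_yield_recent._recent[state_key] = collections.deque(maxlen=50)
            json_yield_recent._count[state_key] = 0

        gen = json_yield_recent._gen_dict[state_key]
        try:
            content = gen.next()
            json_yield_recent._recent[state_key].append(content)
            json_yield_recent._count[state_key] += 1
            return json.dumps({'state': 'ready',
                               'content': content,
                               'recent': list(json_yield_recent._recent[state_key]),
                               'count': json_yield_recent._count[state_key],
                               'key': key})
        except StopIteration:
            del json_yield_recent._gen_dict[state_key]
            del json_yield_recent._recent[state_key]
            del json_yield_recent._count[state_key]
            return json.dumps({'state': 'done', 'content': None})
    return _

json_yield_recent._gen_dict = {}
json_yield_recent._recent = {}
json_yield_recent._count = {}
json_yield_recent._fn_id = 0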

Disclaimer: I've not done any research into how others do this, because coding is more fun than research. There are undoubtedly better ways of accomplishing similar goals. In particular there are issues with memory usage and timeouts which aren't handled with json_yield. Also note that the example is obviously silly - it would be much faster to compute the primes on the client.
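
For what it's worth, the timeout problem could probably be approached along these lines (purge_stale_generators and the five-minute limit are inventions of this sketch; json_yield itself would also need one extra line recording time.time() in _last_polled on every call):

import time

# hypothetical extra bookkeeping: json_yield would set
# json_yield._last_polled[(fn_id, key)] = time.time() on each poll.
json_yield._last_polled = {}

def purge_stale_generators(max_idle=300):
    # drop any generator which hasn't been polled for max_idle seconds,
    # e.g. because the browser tab was closed mid-stream.
    now = time.time()
    for state_key, last in list(json_yield._last_polled.items()):
        if now - last > max_idle:
            json_yield._gen_dict.pop(state_key, None)
            del json_yield._last_polled[state_key]

Something would then need to call this periodically - a background thread, or CherryPy's Monitor plugin.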

Download Files
