Monday, 16 November 2009

Python lists aren't

Names are important, and doubly so in programming. So when a programmer sees 'list', they think they know what that means, and similarly when they see 'array'.
The fundamental difference would be something like this:

array performance:
random access - O(1)
insertion / deletion of known element - O(n)

list performance:
random access - O(n)
insertion / deletion of known element - O(1)
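A quick timing sketch shows which row Python's list actually matches (the numbers are machine-dependent, and `front_insert_seconds` is just an illustrative helper, not anything from the original code): insertion at the front shifts every existing element, so building a list that way is quadratic overall.

```python
import time

def front_insert_seconds(n):
    """Build a list of n elements by inserting each at the front."""
    xs = []
    start = time.perf_counter()
    for i in range(n):
        xs.insert(0, i)  # shifts every existing element: O(len(xs))
    return time.perf_counter() - start

t1 = front_insert_seconds(20000)
t2 = front_insert_seconds(40000)
# Doubling n roughly quadruples the time - the signature of O(n**2).
print(f"2x the data took {t2 / t1:.1f}x the time")
```

That is exactly the array column above, not the list column.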

The performance guarantees of a programming language's data structures form part of each type's functional specification, not just some incidental extra information.

I bumped into this when using streams of bytes (represented as lists of integers each < 256) with the following codec-like code:

def process(packet):
    itemlen = work_out_length(packet)
    item, packet = packet[:itemlen], packet[itemlen:]
    # do something with item
    return packet

packet = some_data  # the incoming byte stream, as a list of ints
while packet:
    packet = process(packet)

which is equivalent to this...
a = some_data  # the incoming byte stream, as a list of ints
while a:
    head, a = a[0], a[1:]
    process(head)
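The slicing idiom hides a copy: each `a[1:]` allocates a new list holding everything that's left. A small counting sketch (`peel` is an illustrative name, not from the original code) makes the quadratic cost visible:

```python
def peel(data):
    """Consume a list head-first, counting how many elements get copied."""
    copied = 0
    a = list(data)
    while a:
        head, a = a[0], a[1:]  # a[1:] copies the remaining len(a) - 1 elements
        copied += len(a)       # len(a) is now the size of that fresh copy
    return copied

# For n = 1000 items the copies sum to 999 + 998 + ... + 0 = 499500,
# i.e. about n**2 / 2 work just to walk the list once.
print(peel(range(1000)))
```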


(The actual problem wasn't as easy to solve as the above case, as this assumes that the 'head' item is always a single byte; in reality it could be any number of bytes, and the packet would have to be streamed using multiple recursive loops like the above to process it.  But the fundamentals are the same.)

Anyway, it all works fine until a large packet arrives.  And then the interactive program suddenly stops; what took on the order of a millisecond suddenly takes half an hour, which to any user looks like the program has crashed.

This is a functional programming idiom, but it just doesn't work with Python lists in the general case of large lists.  It didn't just slow the program down; it completely broke it.

Solutions in this specific case are to use a deque (collections.deque) or iterators. But that's for another time...
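For the packet example, a deque keeps the take-from-the-front idiom but makes each removal O(1). This is only a sketch: the length-prefixed framing and the rewritten `process` are illustrative stand-ins, not the original codec's logic.

```python
from collections import deque

def process(packet):
    """Remove and return the next item from the front of the packet.

    Assumes (for illustration) each item is prefixed by a length byte.
    """
    itemlen = packet.popleft()                         # O(1)
    return [packet.popleft() for _ in range(itemlen)]  # O(itemlen), not O(n)

packet = deque([2, 10, 20, 1, 30])  # two length-prefixed items
items = []
while packet:
    items.append(process(packet))
print(items)  # [[10, 20], [30]]
```

The iterator alternative works the same way in spirit: wrap the data in `iter(...)` and pull bytes off with `next()` or `itertools.islice`, so nothing is ever copied or shifted.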


In the C++ STL, part of the specification is a performance guarantee for each algorithm on each container type (http://www.sgi.com/tech/stl/complexity.html).  In anything other than toy programs this information is critical: it gives the C++ developer an additional criterion for selecting the appropriate collection type, changing 'worse/better' into 'wrong/right'.  'If [these specifications] are ignored [in an STL implementation], the performance of the resulting program will often render it useless.' - from the previous link.

The very separation of algorithms and data structures which the C++ STL enables (see the Elements of Programming book for an up-to-date discussion of the STL's underlying paradigm, without STL-specific code) is what makes it possible to state performance guarantees in one place, rather than specifying them for every function in every type's API.  So while the Python documentation for lists warns that inserts and deletes at the beginning are O(n), this information isn't part of a coherent bigger picture which guides me to the right methods and data structures.

2 comments:

  1. In summary, Python lists have roughly the complexity guarantees of a C++ STL vector.

    I'm learning Python for the first time, via "Dive Into Python". The author teaches that lists have random access and arbitrary inserts, but makes no mention of complexity. There are three possibilities, all of which are useful for solving different problems:

    (1) Inserts in constant time, random access in linear time. This is a linked list or STL list.

    (2) Inserts in linear time, random access in constant time. This is an array, STL vector, STL deque, or Python list.

    (3) Inserts and random access both in (possibly amortized) logarithmic time. These are various binary tree implementations, e.g. a red-black tree.

  2. You may appreciate:

    http://wiki.python.org/moin/TimeComplexity

    which actually does include time complexity guarantees by data type
