Tuesday, 24 November 2009

killing Python processes...

I'm playing with some midi based Python stuff at the moment, doing raw file access to /dev/midi1 (midisport uno), and most of the time it's just sitting on a file.read(1). In Windows, the user can break into this with Ctrl-Break (Ctrl-C doesn't work on Windows or Linux), but there doesn't seem to be an equivalent to Ctrl-Break in the Linux world. So it's Ctrl-Z to put it in the background, ps -a, find the program, then terminate it hastily with sudo kill -9. All of which is a bit excessive. Rather than put something proper in there to terminate it (is select() the way to do these things on Linux?), I decided to get the wrong way a little less wrong.
I now have a half-way house to getting it a bit simpler:
alias killpy="sudo kill -9 \`ps -a | grep python | head -c 6\`"
And it's then just Ctrl-Z, killpy. I'm sure this isn't the best way of doing things, but I like the fact that it does things the same way I do, but using pipes to communicate, instead of the eye-brain-hand-keyboard loop. And it's the first time I've used the -c option on head (number of bytes to copy to stdout), so I've learnt something!

Of course it doesn't work if there is more than one Python process, or whatever, and yes I do need the sudo and -9.

Monday, 16 November 2009

Python lists aren't

Names are important, and doubly so in programming. So when a programmer sees 'list', they think they know what that means, and similarly when they see 'array'.
The fundamental difference would be something like this:

array performance:
random access - O(1)
insertion / deletion of known element - O(n)

list performance:
random access - O(n)
insertion / deletion of known element - O(1)

The performance guarantees of a programming languages' data structures form part of the functional specification of that type, not just some incidental extra information.

I bumped into this when using streams of bytes (represented as lists of integers each < 256) with the following codec-like code:

def process(packet):
    itemlen = work_out_length(packet)
    item, packet = packet[:itemlen], packet[itemlen:]
    # do something with item
    return packet

packet = some data
while packet:
    packet = process(packet)

which is equivalent to this...
a = some data
while a:
    head, a = a[0], a[1:]

(The actual problem wasn't as easy to solve as the above case, as this assumes that the 'head' item is always a single byte; in reality it could be any number of bytes, and the packet would have to be streamed using multiple recursive loops like the above to process it.  But the fundamentals are the same.)

Anyway, it all works fine until a large packet arrives.  And then the interactive program suddenly stops; what took on the order of a millisecond suddenly takes half-an-hour, which to any user looks like the program has crashed.

This is a functional programming idiom, but it just doesn't work with Python lists in the general case of large lists.  It didn't just slow it down, it completely broke it.

Solutions in this specific case are deques (collections.deque) or using iterators. But that's for another time...

In the C++ STL, part of the specification is a performance guarantee for each algorithm on each container type (http://www.sgi.com/tech/stl/complexity.html).  In anything other than toy programs this information is critical, and they give the C++ developer an additional criteria in selecting the appropriate collection types.  It changes 'worse/better' into 'wrong/right'.  'If [these specifications] are ignored [in an STL implementation], the performance of the resulting program will often render it useless.' - from previous link.  The very separation of algorithms and data structures which the C++ STL enables (see the Elements of Programming book for a up-to-date discussion of the underlying paradigm of the STL - without STL-specific code), makes possible the enumeration of performance guarantees (other than specifying it for every function in every types' API). So while the Python documentation for lists warns that inserts and deletes at the beginning are O(n), this information isn't part of a coherent bigger picture which guides me to the right methods and data structures.

Thursday, 12 November 2009

James Bond, Parenting, Refactoring

So it is 1:00 in the morning, and our youngest son wakes up screaming. At 18 months it isn't always clear what the problem is, but with a bit of attention he soon settles. Except the same thing happened an hour ago, though my wife got up then. And, almost exactly an hour later, he wakes again, and does his 'muuuuu-meeeeee' type noises in-between crying. Except I've woken up first, which sort of means it's my job to go to him, again... And at this point I remember that vital software principle: don't repeat youself (DRY). The tempting thing is to give him back his dummy, give him a cuddle, and within 2 minutes he could be back asleep - and in 3 so could I. This is the bet I am making: there's a small chance he'll sleep through the rest of the night. It's always possible. But far more likely is that he'll wake again, and I won't get much sleep at all tonight. Because if he woke 3 hours on the trot when he normally sleeps through without problem, there's probably a reason - maybe even a reason I could fix (bets on a dirty nappy?) But I'm tired, and tiredness makes me even more lazy than usual, and... I hope he settles and I go back to bed.

As in many areas of life, software developers continually have to make the choice between short-term ease against the risk of long-term disaster. If the disaster was certain, the choice would be clear, if not easy. But there is always the chance that it will never happen, and if the cost of averting that potential disaster is significant (e.g. lost business due to competition in time-to-market), it is no longer clear-cut. But each time the risk is seen and ignored, the likelihood of getting it done right decreases. If I get up at each hour from 1am till 5am to settle my son, am I really going to bother doing anything different at 6am?

So what are we to do? Recognise the need early, when the cost is least and the confidence of knowing that the potential disaster has already been averted can have the longest effect. Make the commitment early, not counting the short-term effort as a cost, but as a decision well-made.

I leave the quantitative analysis to Ian Fleming:
'Once is happenstance, Twice is coincidence, The third time is enemy action'
- Ian Fleming, Goldfinger

Enemy action must be countered with force of will, or we shall be defeated.

Square Abstractions

Managing complexity is at the heart of Software Engineering, and abstraction is the tool by which we accomplish this.  But what do our abstractions look like, and how should we judge them?

Abstractions should be square.

Or cubic. Possibly n-dimensional hypercubes.  But not rectangles.  And lines are right out.  G.A. Miller wrote a classic psychology paper in 1956 with the far-reaching conclusion that in uni-dimensional data-sets, humans have a typical classification capacity of between 2 and 3 bits - between 4 and 8 items.  His paper is titled 'The Magical Number Seven, Plus or Minus Two: Some Limits on our Capacity for Processing Information'.  How does this apply to software abstraction? It gives us a quantitive key to determining whether an abstraction (which implies a reduction in complexity) is of sufficient quality.  It also gives us a clue to resolving the issue of abstractions still retaining too much complexity: add another dimension.

By square abstractions, I mean that a good set of abstractions in the software domain, from an arbitrarily complex starting point to the most understandable abstraction of that idea, should have approximately equal complexity in each dimension.  If the result is that each (and all, since we have decreed equality) dimension of abstraction is still too complex, we must re-dimension, refactor, and re-abstract.

Soap bubbles form perfect spheres not just because they find it aesthetically pleasing, but because they are most comfortable like that.  It takes the least effort.  In software we should similarly strive to find the solution which satisfies the constraints with the least energy.  Spheres might be nature's solution, but in software we tend to seek orthogonal abstractions - leading to squares, cubes, hypercubes, and so-on.

Getting practical for a moment, remember that every program, library, and API is an abstraction.  An application containing a single 100,000 file (yes, really...) might be perfectly good internally, but is missing out on a key abstraction in terms of translation units, modules, whatever else maps to files.  So split it into one-hundred 1000 line files - we've added a dimension and reduced the maximum unidimensional complexity.  But we should continue - 100 is more than an order of magnitude greater than our magic 7 plus or minus 2.  Directories, packages, folders: another level of abstraction.  And because we are being square, we aim to have approximately 10 directories with 10 files in each.   This stretches 7 +/- 2, but not sufficiently that any more abstraction would necessarily be helpful - adding a dimension has a cost too.
Why 100 files of 1000 lines, and not 316 files of 316 lines?  Because not all abstractions have the same cost, and we can apply additional abstractions within those files.  Like, um, classes, methods and functions.

So next time you (or I) think about adding that 100th method to our widget API, think about adding a new dimension instead.  And if it isn't obvious what that new dimension might be, then get creative and invent something new.