Thursday, January 13

Geek

Meanwhile, Down At The Old Message-Packing Plant...

At my day job, we need to encode about 3000 incoming messages per second and send them whizzing around our network, to be plucked from the stream and decoded by any or all of a couple of dozen subsystems. No serialisers and deserialisers are something of a hobby of mine.

Mostly, we use JSON. (And thank you JSON for all but killing off XML. XML sucks and deserved to die.)

Every so often I run a little bake-off to see if there's anything significantly better than a good implementation of JSON. Until now there hasn't been anything with equivalent ease of use and significantly better performance.

Until now:

Codec
Dump
Load
Bytes
JSON
2.37
2.19
1675
Pickle
3.03
2.37
1742
MessagePack
1.37
0.29
1498
MessagePack/List

1.60

MessagePack/Schema

0.38

YAML
95.27
78.04
1631
BSON
2.85
1.05
1735

(Times in seconds for 100,000 records.)

No, it's not YAML. YAML is nice for easy-to-edit config files, but it's lousy for encoding and shipping lots of data. It's too flexible, and hence too slow.

MessagePack, on the other hand, is a compact binary JSON-style encoding. It's about twice as fast as JSON (I use SimpleJSON, a fast C implementation) for dumping data, and up to ten times as fast for loading it. In situations like a database where reads can be 100x as common as writes, that's a huge win.

One thing I noticed is that MessagePack for Python returns arrays as tuples (which are immutable) rather than lists. There are two ways to fix this in my use case. First, MessagePack itself allows you to provide a callback function used when creating arrays, so you can tell it to create lists instead. Second, I could use an advisory schema to look at the data after it's been unpacked and make corrections as needed.

As you'd expect, one of these methods takes a lot longer than the other and leaks memory like mad. Guess which one.

Oh, and the other thing - in my testing, I tried this out in both native Python and my usual environment of Python 2.6 with the Psyco JIT. In the latter environment, I was doing callbacks from the Cython implementation of the MessagePack wrapper - that is, statically compiled Python - to the Pysco list-creation wrapper function - dynamically compiled Python. And it worked. I thought at first that that was causing my memory leak, but the same leak showed up in native mode.

Update: Added BSON, using the C implementation in PyMongo (which provides a Python wrapper for the C library and a pure-Python version).  BSON is comparable with JSON and Pickle on encoding, and falls between JSON and MessagPack on decoding.

The advantage of BSON (to me) is that it natively supports dates and times, which JSON and MessagePack do not.  Pickle supports almost any Python object, but that's a problem in itself, and it's Python-specific.  YAML supports dates, but it's far too slow for intensive use.

Slight disadvantage of BSON is that everything has to be a document - a Python dict - at the top level.  You can't encode an array (list, tuple, etc) without wrapping it in a dict.

Posted by: Pixy Misa at 11:30 AM | No Comments | Add Comment | Trackbacks (Suck)
Post contains 533 words, total size 5 kb.

Comments are disabled. Post is locked.
49kb generated in CPU 0.0107, elapsed 0.0968 seconds.
54 queries taking 0.09 seconds, 341 records returned.
Powered by Minx 1.1.6c-pink.