Sunday, August 07
Couch Tomato
So I did a bunch of reading on CouchDB, and it seems pretty interesting. It's flawed, but not flawed in the direction of "oops, I just lost all your data" (I'm looking at you, MongoDB), and it's not like I've found a database without flaws, so it's worth a shot.
Only none of the available Python libraries supports all the functionality of CouchDB, and one of the main ones is just abysmally slow. But on the other hand, one of the main weaknesses of CouchDB - it only has a REST API - means that it's easy to write a library - after all, it only has a REST API.
So I'm doing exactly that. The goal is to have a simple but complete library with the best performance I can manage. It will cover all the REST API calls but not try to do anything beyond that. I've isolated out the actual call mechanism so that it can be trivially swapped out if someone comes up with something more efficient. (Or if Couchbase creates a binary API.)
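A minimal sketch of what isolating the call mechanism might look like. This is not Settee's actual code: the names `UrllibTransport`, `FakeTransport`, and `Client` are hypothetical. The point is only that the client talks to any object with a `request()` method, so the HTTP backend can be swapped without touching the rest of the library.

```python
import json
import urllib.request


class UrllibTransport:
    """Hypothetical backend using only the standard library."""

    def request(self, method, url, body=None):
        data = json.dumps(body).encode("utf-8") if body is not None else None
        req = urllib.request.Request(
            url,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))


class FakeTransport:
    """Hypothetical in-memory backend for tests: it just records calls."""

    def __init__(self):
        self.calls = []

    def request(self, method, url, body=None):
        self.calls.append((method, url, body))
        return {"ok": True}


class Client:
    """Hypothetical client: builds URLs and delegates to the transport."""

    def __init__(self, base_url, transport):
        self.base_url = base_url.rstrip("/")
        self.transport = transport

    def create_database(self, name):
        # CouchDB creates a database with PUT /{db}
        return self.transport.request("PUT", f"{self.base_url}/{name}")

    def delete_database(self, name):
        # and deletes it with DELETE /{db}
        return self.transport.request("DELETE", f"{self.base_url}/{name}")
```

Swapping `FakeTransport()` for `UrllibTransport()` (or a wrapper around some other HTTP library) changes how the bytes move and nothing else.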
Soon as I have it mostly working I'll open source it and put it up on Bitbucket or something.
As it stands, you can connect to CouchDB, authenticate, create and delete databases, and read and write records, as well as various housekeeping thingies that were easy to implement. It supports convenient dict-style access (db[key]) as well as db.get() and db.set().
It's called Settee, unless I think of something better. It's a sort of spartan couch.
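Dict-style access over a REST API can be sketched roughly like this. The `Database` class and the `request` callable are hypothetical, not Settee's real interface, and a fake in-memory request function stands in for the HTTP call so the example runs without a server.

```python
class Database:
    """Hypothetical dict-style wrapper around a CouchDB database URL."""

    def __init__(self, url, request):
        self.url = url.rstrip("/")
        self.request = request  # callable(method, url, body) -> dict

    def get(self, key):
        # GET /{db}/{docid} fetches a document
        return self.request("GET", f"{self.url}/{key}", None)

    def set(self, key, doc):
        # PUT /{db}/{docid} creates or updates a document; updating an
        # existing document requires its current _rev in the body
        return self.request("PUT", f"{self.url}/{key}", doc)

    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, doc):
        self.set(key, doc)


def make_fake_request():
    """Stand-in for a real HTTP call so the sketch runs without a server."""
    store = {}

    def request(method, url, body):
        if method == "PUT":
            store[url] = dict(body)
            return {"ok": True}
        return store[url]

    return request


db = Database("http://localhost:5984/test", make_fake_request())
db["pixy"] = {"name": "Pixy Misa"}
print(db["pixy"])       # same document via __getitem__
print(db.get("pixy"))   # and via get()
```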
And in a simple test of 1000 separate record inserts:
Settee: 3.668s
CouchDB-Python: 41.53s
Either one would be much faster using the batch update functions, but there's no reason to grind to a complete halt when you need to do many small queries.
(Scrabbling sounds ensue for a couple more hours.)
Okay.
CouchDB-Python is slow specifically for record creates and updates; it seems to be fine otherwise. There's a fork called CouchDB-Python-Curl that is much faster on writes - slightly faster than my code too - and uses significantly less CPU time, apparently because it uses libcurl, which is written in C. The only problem there is that libcurl is horrible.
I've tweaked my code now so that it supports both the stylish and elegant Requests module and the simple and functional Httplib2, both of which are miles ahead of Python's core library in functionality and PyCurl in friendliness. PyCurl is still the fastest, though:
Library | Write/s | Read/s | User (s) | Sys (s) | Real (s) |
CouchDB-Python | 24 | 603 | 1.248 | 1.108 | 85.7 |
CouchDB-Psyco | 24 | 654 | 1.112 | 0.848 | 85.3 |
CouchDB-Curl | 422 | 665 | 0.536 | 0.368 | 7.93 |
CouchDB-Curl-Psyco | 421 | 711 | 0.576 | 0.452 | 7.93 |
Settee-Requests | 272 | 418 | 2.704 | 1.564 | 12.4 |
Settee-Requests-Psyco | 305 | 453 | 1.96 | 1.47 | 11.2 |
Settee-Httplib2 | 376 | 636 | 0.98 | 0.784 | 8.48 |
Settee-H2-Psyco | 392 | 687 | 0.6 | 0.77 | 8.48 |
CouchDB-Curl runs only slightly faster than Settee-H2, but uses about half the CPU time, which is a nice advantage. Running with the Psyco JIT compiler makes up much of the difference - speeding up Settee while increasing the system time on CouchDB-Curl. Can't do anything for poor CouchDB-Python, though. I think it's been Nagled, though I don't know why it only affects that part of the test.
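If the 24 writes/s really is Nagle's algorithm at work, each write is stalling for about 40 ms, which is in the range commonly reported when Nagle meets delayed ACKs. The usual remedy is to set TCP_NODELAY on the socket. A minimal sketch with the standard socket module follows; that this is what ails CouchDB-Python is a guess, not a confirmed diagnosis.

```python
import socket

# Create a TCP socket and disable Nagle's algorithm on it, so small
# writes are sent immediately instead of waiting to be coalesced.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect (non-zero means enabled).
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", bool(nodelay))

# Per-request time implied by 24 writes per second.
ms_per_write = 1000 / 24
print(f"{ms_per_write:.1f} ms per write")

sock.close()
```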
Update: Bulk insert of records, 100 at a time, single-threaded:
Settee-Rq: ~7000 records/s
Settee-H2: ~7700 records/s
That's healthy enough. Still trying to get the bulk read working the way I want it, though.
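For reference, batching records 100 at a time for CouchDB's `_bulk_docs` endpoint can be sketched as below. The helper names `batches` and `bulk_payloads` and the sample records are made up for illustration; each payload would be POSTed to /{db}/_bulk_docs by whatever HTTP layer is in use.

```python
import json


def batches(docs, size=100):
    """Yield successive lists of at most `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_payloads(docs, size=100):
    """Yield one JSON body per batch, shaped for POST /{db}/_bulk_docs."""
    for batch in batches(docs, size):
        yield json.dumps({"docs": batch})


# Hypothetical records, generated lazily so nothing big sits in memory.
records = ({"_id": f"rec{i:06d}", "n": i} for i in range(250))
payloads = list(bulk_payloads(records))
print(len(payloads))  # 250 records in batches of 100 -> 3 payloads
```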
Update: Okay, so there's a fair bit of overhead in the API - write queries seem to take a minimum of 2.5ms and read queries at least 1.5ms. (MySQL queries get down to about 0.5ms, and Redis is even faster.)
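Those minimum latencies put a ceiling on what a single thread issuing one request at a time can do. A quick back-of-the-envelope calculation, using only the figures quoted above:

```python
def max_rate(latency_ms):
    """Upper bound on sequential requests per second for a given latency."""
    return 1000.0 / latency_ms


print(round(max_rate(2.5)))  # CouchDB writes
print(round(max_rate(1.5)))  # CouchDB reads
print(round(max_rate(0.5)))  # MySQL queries
```

That works out to roughly 400 writes/s and 667 reads/s for CouchDB against about 2000 queries/s for MySQL, which is in the same ballpark as the best single-request figures in the table.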
But the underlying data engine suffers from no such limitations: I'm firing 20 million small (~250 byte) JSON objects at it, single-threaded, batches of 100, and it's happily loading them at about 6,000 records per second, 5 million records in. That's with CouchDB at about 60% of a CPU, Python at about 6%, and disk (6-disk software RAID-5) at about 10% busy.
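To put those numbers in perspective, here is the arithmetic, assuming the rate stays at about 6,000 records per second for the whole run (which is an assumption, not a measurement):

```python
records_total = 20_000_000
rate = 6_000            # records per second
record_bytes = 250      # approximate size of each JSON object
batch_size = 100

total_minutes = records_total / rate / 60
payload_mb_per_s = rate * record_bytes / 1_000_000
ms_per_batch = 1000 * batch_size / rate

print(f"{total_minutes:.1f} min for the full load")
print(f"{payload_mb_per_s:.1f} MB/s of JSON payload")
print(f"{ms_per_batch:.1f} ms per batch of {batch_size}")
```

So the full load takes a bit under an hour, moves only about 1.5 MB/s of raw JSON, and spends around 17 ms on each batch of 100.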
That's a pretty good showing, though I'm not stressing it at all yet - the data is being loaded in key order. We'll have to see how it copes with multiple compound indexes next.
Posted by: Pixy Misa at 05:52 PM
1
Will you be going deeper into the flaws you found in CouchDB? Or are they taken care of by complementing it with Redis?
Posted by: dgo.a at Saturday, August 13 2011 10:04 AM (jnFfr)
2
I'm talking about limitations (either by design or implementation) rather than bugs there.
One is the API latency, which is significantly higher than even something like MySQL (let alone Redis).
Another is the inability to derive a view from an existing view. I can see how that would complicate the view model (and changing the upstream view would of course invalidate any downstream views), but it would be extremely useful.
And another is JSON and its very limited set of data types.
Oh, and the choice of Javascript as the embedded language might be pragmatic, but I don't have to like it. Javascript is not a good language. Javascript will never be a good language. Python and Ruby are good languages, but not ideal for embedding. Lua is a good language, and is ideal for embedding. Javascript needs to die.
Posted by: Pixy Misa at Saturday, August 13 2011 11:08 AM (PiXy!)
3
I know it's been a while, but... any chance we could have a peek at the code anytime soon?
Posted by: Anonymous Hero at Wednesday, May 30 2012 06:09 PM (fnWFn)