CAN I BE OF ASSISTANCE?
Shut it!
Monday, August 08
Is CouchDB The Anti-Redis?
By which I don't mean that since Redis is cool, CouchDB is uncool. More like: is CouchDB the Yuri to Redis's Kei? Uh, do they complement each other nicely?
Because it sure looks that way to me.
Inspired by this very handy comparison of some of the top NoSQL databases, I've compiled a simpler item-by-item comparison of CouchDB and Redis, and it appears that CouchDB is strong precisely where Redis is weak (storing large amounts of rarely-changing but heavily indexed data), and Redis is strong precisely where CouchDB is weak (storing moderate amounts of fast-changing data).
That is, CouchDB seems to make a great document store (blog posts and comments, templates, attachments), where Redis makes a great live/structured data store (recent comment lists, site stats, spam filter data, sessions, page element cache).
Redis keeps all data in memory so that you can quickly update complex data structures like sorted sets or individual hash elements, and logs updates to disk sequentially for robust, low-overhead persistence (as long as you don't need to restart often).
CouchDB uses an append-only single-file (per database) model - including both B-tree and R-tree indexes - so again, it offers very robust persistence, but will grow rapidly if you update your documents frequently.
With Redis, since the data is all in memory, you can run a snapshot at regular intervals and drop the old log files. With CouchDB you need to run a compaction process, which reads data back from disk and rewrites it, a slower process.
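The trade-off between append-only writes and later compaction can be seen in a toy model. This is a minimal sketch, not how either database is actually implemented: `AppendOnlyStore`, its file format, and `demo` are all made up for illustration. Because the toy keeps every live value in memory, its `compact()` is closer in spirit to a Redis snapshot; a CouchDB-style compaction would have to read the live records back from disk instead.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    """Toy key-value store that only ever appends to its data file (hypothetical)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> latest value, rebuilt from the file on open
        if os.path.exists(path):
            self._replay()

    def _replay(self):
        # Later records for the same key override earlier ones.
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self.index[record["key"]] = record["value"]

    def set(self, key, value):
        # An update never overwrites old bytes; it appends a new record.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
        self.index[key] = value

    def get(self, key):
        return self.index[key]

    def compact(self):
        # Write only the live records to a new file, then swap it in.
        tmp_path = self.path + ".compact"
        with open(tmp_path, "w", encoding="utf-8") as f:
            for key, value in self.index.items():
                f.write(json.dumps({"key": key, "value": value}) + "\n")
        os.replace(tmp_path, self.path)


def demo():
    # Update one key 100 times, then compare file size before and after compaction.
    with tempfile.TemporaryDirectory() as d:
        store = AppendOnlyStore(os.path.join(d, "toy.db"))
        for i in range(100):
            store.set("counter", i)
        size_before = os.path.getsize(store.path)
        store.compact()
        size_after = os.path.getsize(store.path)
        reopened = AppendOnlyStore(store.path)
        return size_before, size_after, reopened.get("counter")
```

Updating one key a hundred times leaves a hundred records on disk until `compact()` runs, which is the "will grow rapidly if you update your documents frequently" behaviour in miniature.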
Redis provides simple indexes and complex structures; CouchDB provides complex indexes and simple structures. Redis is all about live data, while CouchDB is all about storing and retrieving large numbers of documents.
Now, MongoDB offers both a document store and high-performance update-in-place, but its persistence model is fling it at the wall and hope that it sticks, with a recovery log tacked on since 1.7. It's not intrinsically robust, you can't perform backups easily, and its write patterns aren't consumer-SSD-friendly. I do not trust MongoDB with my data.
One of the most unhappy elements of Minx is its interface with MySQL - writing the complex documents Minx generates back to SQL tables is painful. I've tried a couple of different ORMs, and they've proven so slow that they're completely impractical for production use (for me, anyway).
MongoDB offered me most of the features I needed with the API I was looking for, but it crashed unrecoverably early in testing and permanently soured me on its persistence model.
CouchDB is proving to be great for the document side of things, but less great for the non-document side. But I was looking at deploying Redis as a structured data cache, and it makes an even better partner with CouchDB than it does with MySQL.
It's really looking like I've got a winning team here.
Anyway, here's the feature matrix I mentioned:
Feature | CouchDB | Redis |
Written in | Erlang | C |
License | Apache | BSD |
Release | 1.1.0, 2.0 preview | 2.2.12, 2.4.0RC5 |
API | REST | Telnet-style |
API Speed | Slow | Fast |
Data | JSON documents, binary attachments | Text, binary, hash, list, set, sorted set |
Indexes | B-tree, R-tree, Full-text (with Lucene), any combination of data types via map/reduce | Hash only |
Queries | Predefined view/list/show model, ad-hoc queries require table scans | Individual keys |
Storage | Append-only on disk | In-memory, append-only log |
Updates | MVCC | In-place |
Transactions | Yes, all-or-nothing batches | Yes, with conditional commands |
Compaction | File rewrite | Snapshot |
Threading | Many threads | Single-threaded, forks for snapshots |
Multi-Core | Yes | No |
Memory | Tiny | Large (all data) |
SSD-Friendly | Yes | Yes |
Robust | Yes | Yes |
Backup | Just copy the files | Just copy the files |
Replication | Master-master, automatic | Master-slave, automatic |
Scaling | Clustering (BigCouch) | Clustering (Redis cluster*) |
Scripting | JavaScript, Erlang, others via plugin | Lua* |
Files | One per database | One per database |
Virtual Files | Attachments | No |
Other | Changes feed, Standalone applications | Pub/Sub, Key expiry |
* Coming in the near future.
Posted by: Pixy Misa at
11:27 AM
Sunday, August 07
Couch Tomato
So I did a bunch of reading on CouchDB, and it seems pretty interesting, and while flawed, it's not flawed in the direction of oops I just lost all your data (I'm looking at you, MongoDB) and it's not like I've found a database without flaws, so it's worth a shot.
Only none of the available Python libraries supports all the functionality of CouchDB, and one of the main ones is just abysmally slow. But on the other hand, one of the main weaknesses of CouchDB - it only has a REST API - means that it's easy to write a library - after all, it only has a REST API.
So I'm doing exactly that. The goal is to have a simple but complete library with the best performance I can manage. It will cover all the REST API calls but not try to do anything beyond that. I've isolated out the actual call mechanism so that it can be trivially swapped out if someone comes up with something more efficient. (Or if Couchbase creates a binary API.)
Soon as I have it mostly working I'll open source it and put it up on Bitbucket or something.
As it stands, you can connect to CouchDB, authenticate, create and delete databases, and read and write records, as well as various housekeeping thingies that were easy to implement. It supports convenient dict-style access (db[key]) as well as db.get() and db.set().
It's called Settee, unless I think of something better. It's a sort of spartan couch.
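For a feel of what dict-style access over a swappable call mechanism might look like, here is a minimal sketch. It is not Settee's actual code: `Database`, `FakeTransport`, and the `request()` signature are hypothetical names invented for this example. The fake transport stands in for a real HTTP client so the example runs without a server. It also ignores the `_rev` field that a real CouchDB requires when updating an existing document.

```python
import json
from urllib.parse import quote

_MISSING = object()  # sentinel so that None can be a legitimate default


class FakeTransport:
    """Stand-in for an HTTP client; keeps request bodies in a dict (hypothetical)."""

    def __init__(self):
        self.docs = {}

    def request(self, method, path, body=None):
        if method == "PUT":
            self.docs[path] = body
            return 201, json.dumps({"ok": True})
        if method == "GET":
            if path in self.docs:
                return 200, self.docs[path]
            return 404, json.dumps({"error": "not_found"})
        raise ValueError("unsupported method: " + method)


class Database:
    """Dict-style wrapper; every call goes through the injected transport."""

    def __init__(self, name, transport):
        self.name = name
        self.transport = transport

    def _path(self, key):
        # CouchDB addresses a document as /<database>/<document id>.
        return "/%s/%s" % (quote(self.name, safe=""), quote(key, safe=""))

    def get(self, key, default=None):
        status, body = self.transport.request("GET", self._path(key))
        if status == 404:
            return default
        return json.loads(body)

    def set(self, key, doc):
        status, body = self.transport.request("PUT", self._path(key), json.dumps(doc))
        if status not in (201, 202):
            raise RuntimeError("write failed with status %d" % status)
        return json.loads(body)

    def __getitem__(self, key):
        doc = self.get(key, _MISSING)
        if doc is _MISSING:
            raise KeyError(key)
        return doc

    def __setitem__(self, key, doc):
        self.set(key, doc)
```

Swapping Requests for Httplib2 or PyCurl then only means passing in a different object with the same `request()` method; the `Database` class itself does not change.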
And in a simple test of 1000 separate record inserts:
Settee: 3.668s
CouchDB-Python: 41.53s
Either one would be much faster using the batch update functions, but there's no reason to grind to a complete halt when you need to do many small queries.
(Scrabbling sounds ensue for a couple more hours.)
Okay.
CouchDB-Python is slow specifically for record creates and updates; it seems to be fine otherwise. There's a fork called CouchDB-Python-Curl that is much faster on writes - slightly faster than my code too - and uses significantly less CPU time, apparently because it uses libcurl, which is written in C. Only problem there is that libcurl is horrible.
I've tweaked my code now so that it supports both the stylish and elegant Requests module and the simple and functional Httplib2, both of which are miles ahead of Python's core library in functionality and PyCurl in friendliness. PyCurl is still the fastest, though:
Library | Write/s | Read/s | User | Sys | Real |
CouchDB-Python | 24 | 603 | 1.248 | 1.108 | 85.7 |
CouchDB-Psyco | 24 | 654 | 1.112 | 0.848 | 85.3 |
CouchDB-Curl | 422 | 665 | 0.536 | 0.368 | 7.93 |
CouchDB-Curl-Psyco | 421 | 711 | 0.576 | 0.452 | 7.93 |
Settee-Requests | 272 | 418 | 2.704 | 1.564 | 12.4 |
Settee-Requests-Psyco | 305 | 453 | 1.96 | 1.47 | 11.2 |
Settee-Httplib2 | 376 | 636 | 0.98 | 0.784 | 8.48 |
Settee-H2-Psyco | 392 | 687 | 0.6 | 0.77 | 8.48 |
CouchDB-Curl runs only slightly faster than Settee-H2, but uses about half the CPU time, which is a nice advantage. Running with the Psyco JIT compiler makes up much of the difference - speeding up Settee while increasing the system time on CouchDB-Curl. Can't do anything for poor CouchDB-Python, though. I think it's been Nagled, though I don't know why it only affects that part of the test.
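"Nagled" refers to Nagle's algorithm, which holds back small TCP writes so they can be coalesced into fewer packets. The post only guesses that this is the cause, so the following is a sketch of the usual workaround rather than a confirmed fix. It sets `TCP_NODELAY` on a plain socket; the helper name is made up, and how (or whether) any particular HTTP library exposes this option is not covered here.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 asks the kernel to send small writes immediately
    # instead of waiting to coalesce them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```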
Update: Bulk insert of records, 100 at a time, single-threaded:
Settee-Rq: ~7000
Settee-H2: ~7700
That's healthy enough. Still trying to get the bulk read working the way I want it, though.
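Batching records 100 at a time can be sketched as below. CouchDB's bulk endpoint (`POST /<db>/_bulk_docs`) takes a JSON body of the form `{"docs": [...]}`; the helper names and the sample records are assumptions for illustration, and the code only builds the request bodies rather than sending them.

```python
import json


def batches(records, size=100):
    """Yield lists of at most `size` records from any iterable."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Final partial batch, if the total is not a multiple of `size`.
        yield batch


def bulk_payloads(records, size=100):
    """Yield one JSON request body per batch, in the _bulk_docs shape."""
    for batch in batches(records, size):
        yield json.dumps({"docs": batch})
```

Because `batches` consumes a plain iterable, the records can come from a generator, so a multi-million-record load never has to sit in memory all at once.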
Update: Okay, so there's a fair bit of overhead in the API - write queries seem to take a minimum of 2.5ms and read queries at least 1.5ms. (MySQL queries get down to about 0.5ms, and Redis is even faster.)
But the underlying data engine suffers from no such limitations: I'm firing 20 million small (~250 byte) JSON objects at it, single-threaded, batches of 100, and it's happily loading them at about 6,000 records per second, 5 million records in. That's with CouchDB at about 60% of a CPU, Python at about 6%, and disk (6-disk software RAID-5) at about 10% busy.
That's a pretty good showing, though I'm not stressing it at all yet - the data is being loaded in key order. We'll have to see how it copes with multiple compound indexes next.
Posted by: Pixy Misa at
05:52 PM
Thursday, August 04
RabbitMQ
RabbitMQ gets the exceedingly rare Doesn't Suck award for not only doing what it says it does, but allowing you to look at it and see that yes, it is indeed doing what it says it does.
After a couple of years of ActiveMQ, this is very refreshing.
If you have a couple or twelve Python applications and you need to ship messages between them cleanly and reliably, be it a few thousand a day or (like me) half a billion, RabbitMQ and the Kombu library are the way to go.
Posted by: Pixy Misa at
08:26 PM
CouchDB Enlightenment
Doh.
When I last looked at CouchDB, I somehow completely missed the point of views. They're not a query mechanism, they're an index definition mechanism.*
If they were a query mechanism, they'd suck. As an index definition mechanism, though, they do exactly - exactly - what I've been looking for: They look into the record structure and generate appropriate index entries based on any criteria I can think of. Thus, my need for a compound index on an array and a timestamp is trivially easy to implement.
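The "view as index definition" idea can be simulated in plain Python. This is a sketch under assumptions: the field names `tags` and `timestamp` are invented, the generator-style map function is modelled on how Python view functions are commonly written (treat that as an assumption, not a quote from the couchdb-python docs), and `build_index` and `query` are toy stand-ins for what CouchDB does internally with its B-tree. Python's list ordering is also not identical to CouchDB's key collation.

```python
import bisect


def map_by_tag_and_time(doc):
    # One index entry per tag, keyed on [tag, timestamp] (hypothetical fields).
    for tag in doc.get("tags", []):
        yield [tag, doc["timestamp"]], None


def build_index(docs, map_fun):
    """Run the map function over every document and sort the emitted rows by key."""
    rows = []
    for doc in docs:
        for key, value in map_fun(doc):
            rows.append((key, doc["_id"], value))
    rows.sort(key=lambda row: (row[0], row[1]))
    return rows


def query(index, startkey, endkey):
    """Return the ids of documents whose key falls within [startkey, endkey]."""
    keys = [row[0] for row in index]
    lo = bisect.bisect_left(keys, startkey)
    hi = bisect.bisect_right(keys, endkey)
    return [row[1] for row in index[lo:hi]]
```

The map function never answers a question itself; it only decides which keys exist in the index. A query is then a cheap range scan over sorted keys, for example everything tagged "couchdb" between two timestamps.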
Just how good CouchDB is beyond that feature I don't know yet, but I'm sure going to find out.
Couple of other points: The last time I was looking I was dubious about using Erlang-based software; my experience with RabbitMQ since then has improved my opinion of Erlang considerably. And CouchDB uses a consumer-SSD-friendly append-only B-Tree structure for its indexes, making it cheap to deploy a very high-performance system (for small-to-middling databases, i.e. in the tens to hundreds of gigabytes). And, because it's append-only, you can safely back it up just by copying the files. I do like that.
Oh, and this: The couchdb-python package comes with a view server to allow you to write views in Python instead of JavaScript. That's kind of significant.
* Well, they're more than that, but that's the key distinction. Views don't query the data; they define a way to query the data - including the necessary index.
Posted by: Pixy Misa at
05:20 AM
68kb generated in CPU 0.0226, elapsed 0.4724 seconds.
52 queries taking 0.4622 seconds, 360 records returned.
Powered by Minx 1.1.6c-pink.