
Saturday, November 30


Toku! Toku! Toku!

My test app (at my day job) continues to burble along happily under TokuMX.

Some more observations:
  • Compression is fantastic. We have an application with fine-grained access control, and in my test database I'm creating millions of users so that we can find scaling issues before those millions of users actually start hitting our servers.

    Under MongoDB, the user permissions table is one of the largest in the database, because each user has a couple of hundred settings. Under TokuMX, we get a compression ratio of 30:1, turning it from a problem to be solved into just another table.

    Compression ratios on the other tables are lower, but still at least 7:1 on data and 3:1 on indexes on my test dataset.

  • The collection.stats command provides all sorts of useful information on data and index size and index utilisation. You can see not only how much space each index takes on disk, but how often it is being used by queries. In real time. That's a brilliant tool for devops.

  • Performance seems to suffer when running mongodump backups.

    Oh, just on that subject: TokuMX comes in two versions, the Enterprise edition with hot backup support, and the Community edition without. That doesn't mean you can't do online backups on the Community edition - the mongodump and mongoexport utilities work just fine. The Enterprise version comes with a high-performance snapshot backup utility - that is, you get a transactional, consistent point-in-time backup, which mongodump won't do.

    You could probably do that with a simple script if your database size is modest, using TokuMX's transaction support. Just start a transaction, iterate through the collections and their contents, and write them out in your preferred format.

    Anyway, my test app slows down significantly when I run a backup; I need to investigate further to determine whether this is due to resource starvation (I only have a small server to run these tests, and it doesn't have a lot of I/O or CPU), or whether it's contention within TokuMX.

  • TokuMX only stores indexes. Each index gets written to a separate file in your data directory. This may lead you to ask, where the hell is my data? The answer is that the _id index for each collection is a clustered index - that is, the data payload of each record is stored alongside the key in the fractal tree.

    You can create additional clustered indexes if you need them; this could be a significant win in data warehousing applications, particularly if you are writing to disk rather than SSD. If your application reads records in index order, performance from disk could approach performance from SSD at a much lower cost.

  • Transactions are a little fiddly due to choices made by driver developers. Many of the MongoDB drivers (in this case, PyMongo) use transparent connection pools to support multi-threaded applications without requiring huge numbers of database connections.

    That doesn't work at all for transactions, because the transaction is bound to the connection, and if you have a connection pool, there is no guarantee (in fact, pretty much the opposite) that you will get the same connection for all the operations within your transaction.

    The approach we've taken for our web application is a multi-process / single thread model, with one connection per process. We're using Green Unicorn at the moment, but the same would work with other servers as long as they are configured appropriately.  For non-web apps, just make sure you create a connection pool per thread, and limit the pool to one connection.

    Update: Actually, PyMongo has a mechanism that helps with this.  Though not specifically designed for transaction support, requests bind your thread to a specific connection.  So you can leave the connection pool alone and just write a little wrapper to handle transactions.
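
A minimal sketch of such a wrapper, under some loudly stated assumptions: it targets PyMongo 2.x (where the client has start_request() and end_request(); both were removed in PyMongo 3.0) and relies on TokuMX's beginTransaction, commitTransaction and rollbackTransaction commands. The name transaction and the default isolation level are illustrative choices, not anything from the post.

```python
from contextlib import contextmanager


@contextmanager
def transaction(client, db, isolation="mvcc"):
    """Run the enclosed block inside one TokuMX transaction.

    Assumes a PyMongo 2.x client and a TokuMX server. `client` and `db`
    are the usual MongoClient and Database objects.
    """
    client.start_request()  # pin this thread to a single pooled connection
    try:
        db.command("beginTransaction", isolation=isolation)
        try:
            yield db
        except Exception:
            db.command("rollbackTransaction")  # undo everything on any error
            raise
        else:
            db.command("commitTransaction")
    finally:
        client.end_request()  # hand the connection back to the pool


# Hypothetical usage against a live server, along the lines of the simple
# backup script idea above (write_out is a made-up helper):
#
#   with transaction(client, db) as snapshot:
#       for name in snapshot.collection_names():
#           for doc in snapshot[name].find():
#               write_out(name, doc)
```

The request keeps every command in the block on one connection, which is what the transaction needs. Whether a long-running read like the backup loop is sensible on a busy server is a separate question that this sketch does not answer.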
I haven't yet tested replication or sharding. Tokutek say that their replication model generates an even lower I/O load on the replica than on the primary. That's a nice thing if you're running many MongoDB databases (or shards) because you could potentially combine multiple replicas onto a single server. (Assuming that not all your primary servers die at the same time.)

There are a few things that TokuMX doesn't do yet that I'd like to see:
  • As I noted, TokuMX stores all your indexes as individual files in your data directory. I strongly prefer this to MongoDB's indivisible database blobs, but if you have a lot of databases, each with a lot of collections, each with several indexes, you end up with a whole bunch of files piled into that one directory. Having it organised with a subdirectory for each database would be nice. (I might even suggest a subdirectory for each collection.)

  • Since we can see the individual indexes, a thought arises. Sooner or later, something horrible is going to happen to your database. If you have replicas or good backups, you can recover with little pain. But if something horrible happens to your replica and your backup at the same time, the worst thing in the world is to be left with a database that is right there on your disk but won't start up because one record somewhere is corrupted.

    I'd like to see a use-at-own-risk utility that can dump out the contents of a TokuMX clustered index file as a BSON object file (in the same format as mongodump), and just skip over damaged blocks with a warning.

  • mongoexport is oddly slow. I'm pretty sure I could write a Python script that outruns it. This is not Tokutek's fault, though; they've inherited that from MongoDB.  I'd love a faster way to dump to JSON.

  • TokuMX doesn't currently support MongoDB's geospatial or full-text indexing. I don't see the lack of full-text indexing as a big deal; ElasticSearch offers a much more powerful search engine than MongoDB and is very easy to install and manage.

    I would like to see geospatial support - it's not critical for my applications, but having it available would allow us to develop new functionality. Full-text search is something of an afterthought in MongoDB; you're better off with ElasticSearch. But geospatial support is something of an afterthought in ElasticSearch, so having it in TokuMX would potentially save deploying a third database to provide that one requirement.

    (Actually, reading up what MongoDB does, it also looks like quadtrees hacked into 2d materialised paths on top of B-trees, but with some intelligence on top of that to handle distances in spherical geometry.  So adding that sort of geospatial indexing to TokuMX shouldn't be very difficult.)

  • Counting records is kind of slow. I'd love to see TokuMX's indexes implement counted trees so that indexed count operations would be lightning fast. (I don't know if that's feasible with fractal tree indexes, but I don't know any reasons why it wouldn't be.)

  • A default compression setting.  If you're using an ODM like MongoEngine, it's not necessarily easy to set the compression on each table and index.

My conclusion: If you are using MongoDB, and don't depend on full-text search or geospatial indexing, you should definitely look into moving to TokuMX. (If you use MongoDB's full-text search, you should look at moving to TokuMX and ElasticSearch. ElasticSearch has data compression too, so the two combined are still going to use less disk space than MongoDB by itself.)

I first looked at MongoDB early in 2010. Half an hour after I started testing it, my database was a smoking wreck, due to the OOM behaviour of OpenVZ and Mongo's storage engine, which at that time was frankly not ready for use.

OpenVZ and MongoDB have since fixed that, so that MongoDB runs under OpenVZ without crashing, and MongoDB doesn't destroy your data if it does crash, but my reservations over the fundamental architecture of the MongoDB storage engine remain.

TokuMX isn't perfect (yet), but it delivers a serious, production-quality storage engine with performance at least as good as vanilla MongoDB while requiring a small fraction of the disk space, and fine-grained locking that provides far greater potential scalability. (My test server is too small to really test that.) And transactions. It's what I was looking for when I first tested MongoDB.

TokuMX gets my coveted Doesn't Suck award.

Oh, and here are the slides from a talk given by John Schulz of AOL on their testing of TokuMX vs. MongoDB. His conclusions:
• Space per document for MongoDB databases will be reduced by at least 66%. Likely as much as 75%

• Host memory while important is no longer a serious resource constraint. Now CPUs and to a lesser extent disk I/O bandwidth are the principal constrained resources.

• We should be able to make full use of the available persistent storage on each host.

• It is reasonable to assume that we can put 3X to 4X the amount of data and associated workload on a host compared to MongoDB.

• TokuMX provides more consistent operation times than MongoDB does, improving the customer experience.

• TokuMX has the potential to save significant hardware cost

Posted by: Pixy Misa at 11:38 AM | Comments (7)

Wednesday, November 27



Another interesting datapoint: Informix (now owned by IBM) has announced support for MongoDB data structures and API access.

In fact, the new version allows you to access BSON (MongoDB format) data from SQL, and access SQL tables transparently using the MongoDB API.

It has the same limitations as TokuMX (no full-text or geospatial indexes on BSON data, even though Informix itself supports full-text and geospatial indexes), but it does support indexing on arrays and nested fields.

I don't know if I'll ever use it, but it's great to have another option if you're deploying applications on MongoDB.

Edit: And DB2 as well.  Very interesting.  Wonder what the pricing is for a low-end DB2 deployment these days.

Edit: Well, DB2 Express-C is free.  It used to be limited to 4GB of memory, though that's not a major issue since the operating system can still use free memory to cache the filesystem.  That's been increased to 16GB.  Still only supports two cores, but two 3.5GHz Ivy Bridge or Haswell cores can get a lot of work done.  It supports databases up to 15TB, which is pretty big by mee.nu standards.  (And not small by the standards of my day job, either, for a single instance.  We have 1.5PB of data, but that's spread across many servers.)

DB2 Express is $2210 per server per year, and supports 8 cores and 64GB of memory per instance.  The previous version was limited to 8GB, so that's a huge increase.  Again, 15TB per database, but I don't see that as a problem; managing a 15TB production database is going to cost you a lot more than $2210 a year.

Posted by: Pixy Misa at 12:22 PM | Comments (7)

Tuesday, November 26



I started writing a short note on TokuMX and it turned into a history of databases and the relative value of relational vs. non-relational systems vis-a-vis traditional and non-traditional use cases.  Which would be great except that I really don't have time to write a book today.

So, quickly: Tokutek, who brought us the nifty TokuDB engine for MySQL, have done the same thing for the new default database, MongoDB.

TokuMX takes MongoDB and swaps out the rather questionable storage engine for something substantially more scalable and robust, based on fractal tree indexes.  This means three things:
  1. The database doesn't lock itself up when load gets high.  This was a big problem with MySQL in the old days (you can still see it when I do background processing on Minx, because I haven't had a chance to convert the tables to a newer format).  It's still a problem with MongoDB; less so than what MySQL was, but more so than MySQL is now.

  2. It has transactions.  This is basic database technology.  If I want to take ten dollars from my account and pay it into your account, the last thing I want to see is for the server to go down after the first step but before the second one.  Transactions make sure that either both steps happen, or neither.  MongoDB doesn't have transactions.*  MySQL didn't either, years ago.

  3. It's compressed.  It is, in fact, extraordinarily compressed relative to MongoDB.  My test system at work (which uses real-world, albeit slightly odd, data) shrank from 19GB to 1.9GB.

TokuMX lacks two things that MongoDB does have: Full-text search, and geospatial indexing.

I don't see either as a major issue.  MongoDB's full-text search is neat if you really must have just one database for everything, but it's far less powerful than ElasticSearch.  Using a search engine as well as a database means duplicating all your data, but (a) you can set up ElasticSearch so that it automatically indexes your MongoDB / TokuMX data, and (b) since ElasticSearch also compresses data automatically, TokuMX and ElasticSearch combined require a fraction of the disk space of MongoDB alone.

(ElasticSearch also supports geospatial queries; it looks to me as if they've used materialised paths to fudge quadtrees into their inverted indexes.  Clever, and should suffice for most use cases.  I'd never considered multidimensional materialised paths before.)

The reduced disk space is more significant than it might seem at first glance.  Smaller databases mean that it's more feasible to keep everything on SSD, which means much better performance.  Also, if your database is one fifth the size (assuming you replace MongoDB with a combination of TokuMX and ElasticSearch) you can cache five times as much data in memory.  

Last week I was working with a 50GB MongoDB database and a 5GB ElasticSearch index, which was a little slow on my 32GB server.  Now I can work with twice as much data and have it all fit in memory, which is a huge win.

So, I'm waiting to see if this is going to blow up in my face in some weird way, as shiny new things usually do.  But so far it is all looking very promising.

* It has atomic operations, and you can futz around with those to construct your own transaction manager if you really want to, but it doesn't support transactions out of the box.

Posted by: Pixy Misa at 06:25 PM | Comments (5)

Sunday, November 24


Fangirls Spontaneously Combust

There is hope for the future.

Posted by: Pixy Misa at 11:00 PM | No Comments


Night And Day

Clara: I think there's three of them now.
Kate: There's a precedent for that.

Posted by: Pixy Misa at 09:32 PM | No Comments

Wednesday, November 20


Mission Accomplished-ish

One of my goals for the next version of Minx was to reduce build times by a factor of ten, from around 100ms to 10ms for simple pages, and for larger pages from 1s to 100ms.

Here's a post with 1400+ comments. Before (worst case, with neither the element cache nor the MySQL query cache in effect):
573kb generated in CPU 4.09, elapsed 3.662 seconds.
66 queries taking 1.0599 seconds, 1591 records returned.
And after:
573kb generated in CPU 0.02, elapsed 0.0534 seconds.
41 queries taking 0.0396 seconds, 55 records returned.
The new element cache eliminates the slow database queries, comment text filtering, and much of the template processing, and the performance improvement is dramatic.

Here's my main page, before:
97kb generated in CPU 0.88, elapsed 0.5679 seconds.
31 queries taking 0.2722 seconds, 121 records returned.
And after:
97kb generated in CPU 0.01, elapsed 0.0064 seconds.
14 queries taking 0.0029 seconds, 25 records returned.
What I need to do now is make the cache smarter, so that I can deliver more from the cache rather than rebuilding it. The cached performance is fantastic, but we're only hitting the cache about 40% of the time. I want to get that up above 90%.
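
To put the 40% and 90% figures in perspective, here is a back-of-envelope calculation. It assumes (this is an assumption, not a measurement) that a cache miss costs the full "before" time for the main page and a hit costs the "after" time; the function name is made up for illustration.

```python
def expected_build_time(hit_rate, hit_seconds, miss_seconds):
    """Average page build time for a given cache hit rate (0.0 to 1.0)."""
    return hit_rate * hit_seconds + (1.0 - hit_rate) * miss_seconds


# Main page timings quoted above: elapsed seconds with and without the cache.
HIT_SECONDS = 0.0064
MISS_SECONDS = 0.5679

for percent in (40, 90):
    average = expected_build_time(percent / 100.0, HIT_SECONDS, MISS_SECONDS)
    print("hit rate %d%%: about %.0f ms per page" % (percent, average * 1000))
```

Under those assumptions the average is dominated by the misses, so raising the hit rate matters far more than shaving further time off the cached path.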

Posted by: Pixy Misa at 09:00 PM | Comments (2)

Tuesday, November 19


F1 Hornsby

No, not the race, the tornado.

(Well, technically the Bureau of Meteorology estimates it as an EF1, but hasn't yet issued a final assessment.)

Posted by: Pixy Misa at 01:00 PM | Comments (4)

Monday, November 18


I Missed All The Fun!

I was in the office in the city today, and missed all the fun at home.

Fortunately, the tornado* missed my house completely and clobbered the shopping district.

I didn't know about it until I got back to Hornsby station this evening and tried to go to the shops to pick up some groceries, only to find the whole shopping centre taped off with police and emergency services in attendance.

No-one killed, though six people were inside a portable building at the railway station that flipped over, and they must have had a hell of a fright.

My train home was just a few minutes late.  Awesome work by NSWGR and the SES, given that there was a tree on the tracks this afternoon.

Update: Found another picture of that portable building from a different angle, which allowed me to identify it.  Two observations: First, it didn't just tip over, it travelled a good twenty feet and landed completely upside down.  Second, I was standing right next to it three hours earlier.

It wasn't just my suburb that caught it today, either; this view from the Manly ferry looks more like a fishing trawler in a storm in the North Atlantic.

On the other hand, at least the fires are out.

Update Two: From the sound of things, the mini-tornado/storm cell took a path right through the centre of town.  It hit the big Westfield shopping centre, blowing out the roof of the cinema multiplex (and trashing the cinemas pretty badly), took part of the roof off the hotel across the mall, crossed over the public library (no reports of damage there), hit the railway station where it flipped that portable building and at least one car, then hit the local technical college, the police station, and the council offices.

The small number of injuries can probably be attributed to two things: First, it was a miserable day here and few people were standing around outside to get hit by debris, and second, where the glass roof blew off in two places in the shopping centre, it sounds like the wind came in through the doors and blew the panes of glass up and out rather than inwards.  In some pictures some of the misplaced panes are visible, resting on the intact ones.

Update Three: News report.  Apparently the library was damaged, possibly badly.

* Possibly;** witnesses have described a funnel and a debris vortex.  Whatever it was, it was highly localised and strong enough to flip cars over.  Possibly an earth elemental.  Or a really cranky stick insect.
** Now confirmed.  Not a stick insect.

Posted by: Pixy Misa at 07:33 PM | Comments (5)


Ceci N'est Pas Une Post

Just testing, nothing to see...

Posted by: Pixy Misa at 09:33 AM | No Comments

Sunday, November 17



Test version:
Hello Pixy Misa, you are logged in to Minx.
98kb generated in CPU 0.0, elapsed 0.0054 seconds.
14 queries taking 0.003 seconds, 25 records returned.
Powered by Minx 1.1.6c-pink.
The CPU timer has a ~10ms resolution, so it's not seeing anything. 3ms for database access, 5ms total to generate my home page.

Only issue is that posts and comments take up to a minute to percolate up to the home page and the sidebar.

How this will work:

When including a template [include ...] or invoking an applet [applet ...] you will be able to set cache directives:

cache - cache with the default system TTL (time-to-live, currently 60 seconds)
nocache - do not cache
ttl=N - set a custom TTL

Example: [include Posts ttl=30]

Applets are cached by default; regular template includes are not. You have to take care; if a template is included within a loop, and you cache it, it will evaluate once, and then repeat the content over and over. Probably not what you want.
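
The directives above could be modelled roughly like this. This is a sketch in Python, not Minx's actual implementation: parse_cache_directive, ElementCache and the key scheme are all hypothetical names invented for illustration, and only the three directives and the 60-second default come from the description.

```python
import time

DEFAULT_TTL = 60  # seconds; the default system TTL mentioned above


def parse_cache_directive(words, cached_by_default):
    """Return the TTL in seconds, or None when the output must not be cached.

    `words` are the extra tokens of a tag, e.g. ["ttl=30"] for
    [include Posts ttl=30]. Applets would pass cached_by_default=True.
    """
    ttl = DEFAULT_TTL if cached_by_default else None
    for word in words:
        if word == "cache":
            ttl = DEFAULT_TTL
        elif word == "nocache":
            ttl = None
        elif word.startswith("ttl="):
            ttl = int(word[len("ttl="):])
    return ttl


class ElementCache(object):
    """Tiny in-memory TTL cache keyed by template or applet name."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # key -> (expiry timestamp, rendered text)

    def render(self, key, ttl, build):
        if ttl is None:
            return build()  # nocache: always rebuild
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: serve the cached text
        text = build()
        self._entries[key] = (now + ttl, text)
        return text
```

Because the key here is just the template name, a cached include inside a loop would return the first iteration's text every time, which is the pitfall described above.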

Now I need to work on smarter cache eviction.

Another example, from one of Ace's crazy comment threads:
Hello Pixy Misa, you are logged in to Minx.
572kb generated in CPU 0.02, elapsed 0.0207 seconds.
41 queries taking 0.0086 seconds, 55 records returned.
Powered by Minx 1.1.6c-pink.
20 milliseconds to display 1408 comments. Not too shabby. Before caching, it took 1-2 seconds.

For Ace I wrote a custom template; the template is kind of clunky, but it works great. What I do is check the number of comments against 100-comment chunks, caching the full chunks, and leaving the final chunk uncached.

That way, the bulk of a long thread is cached, but the last few comments aren't, so new comments show up instantly.
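
The chunking idea can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the actual template; split_comment_chunks is a hypothetical name, and the handling of a thread whose length is an exact multiple of 100 is a guess.

```python
CHUNK_SIZE = 100  # comments per cached chunk, as described above


def split_comment_chunks(comment_count, chunk_size=CHUNK_SIZE):
    """Split a thread into cacheable full chunks and an uncached tail.

    Returns (cached_ranges, uncached_range), where each range is a
    half-open (start, end) pair of comment indexes.
    """
    full_chunks = comment_count // chunk_size
    cached = [(i * chunk_size, (i + 1) * chunk_size) for i in range(full_chunks)]
    uncached = (full_chunks * chunk_size, comment_count)
    return cached, uncached


cached, tail = split_comment_chunks(1408)
print("%d cached chunks, %d comments rendered fresh" % (len(cached), tail[1] - tail[0]))
```

For the 1408-comment thread this gives 14 cached chunks and 8 comments rendered on every request. A full chunk never changes once the thread has moved past it (assuming comments are not edited or deleted), so it could be cached with a long TTL.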

Posted by: Pixy Misa at 03:59 PM | Comments (6)