Say Weeeeeee!
Ahhhhhh!

Tuesday, February 09

Geek

TreeDict

Well, that's annoying.  I just wrote one of these to speed up the Minx template parser, but this one is probably faster (it's a Cython module) and certainly better documented.

The functionality is identical as far as I can see, but I called mine a DictTree.

Actually, no, mine has one extra feature: it lets you use a mapping-style lookup (tags['a.b']) to reference a value that would have to be a subtree in an attribute-style lookup.  If tags.a.b.c is set, tags.a.b is necessarily the subtree containing c; try anything else and you blow yourself out of the water.  But tags['a.b'] can be anything you like.

I need to do that because I designed the Minx template language that way.  (Oops.)  It lets you reference, for example, post.date as a date value, and also post.date.month to find just the month of the date of the post.  You can't do that with dicts; you could probably do it with a smarter class, but bang would go my generality.

Since that trick is used on both dates and strings, I'd need to make all my dates and strings into custom classes to make the attribute syntax work directly, and that's just too messy to contemplate.
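
Roughly, the trick looks like this - a toy sketch of the concept only, not the actual Minx DictTree code:

    class DictTree(object):
        def __init__(self):
            self._children = {}   # attribute-style tree: name -> subtree or value
            self._flat = {}       # mapping-style store: dotted key -> value

        def __getattr__(self, name):
            try:
                return self._children[name]
            except KeyError:
                raise AttributeError(name)

        def __getitem__(self, dotted_key):
            return self._flat[dotted_key]

        def set(self, dotted_key, value):
            # Mapping-style lookup returns whatever was stored under the key...
            self._flat[dotted_key] = value
            # ...while the attribute-style tree has to treat every intermediate
            # name as a subtree, displacing any plain value that was there.
            parts = dotted_key.split('.')
            node = self
            for part in parts[:-1]:
                child = node._children.get(part)
                if not isinstance(child, DictTree):
                    child = node._children[part] = DictTree()
                node = child
            leaf = parts[-1]
            if not isinstance(node._children.get(leaf), DictTree):
                node._children[leaf] = value

    tags = DictTree()
    tags.set('post.date', '2010-02-09')       # a value...
    tags.set('post.date.month', 'February')   # ...that also has "children"
    print(tags['post.date'])      # '2010-02-09' - the value, via mapping lookup
    print(tags.post.date.month)   # 'February'  - via the attribute-style tree

The flat mapping keeps whatever you stored under 'post.date', while the attribute tree quietly turns that node into a subtree the moment 'post.date.month' arrives.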

Posted by: Pixy Misa at 02:09 AM | No Comments | Add Comment | Trackbacks (Suck)
Post contains 200 words, total size 1 kb.

Monday, February 08

Geek

Pita

Or, The Screw It, I'm Writing A Database Post

Okay, I've had it up to here with databases that suck.*

So I'm writing my own.**  It is called, for obvious reasons, Pita.

The plan is to steal as much of the low ground as possible.  Anything hard, it just won't do.

First, it won't be written in C.  It will be written in Python.  Yes, performance will suffer, but Python can actually deliver surprisingly well, and it has some very well-optimised and well-tested libraries available.  My Python benchmark script for MongoDB*** was moving as much as 90MB of data per second, which would just about saturate a gigabit ethernet link or a SATA disk.

The design goals are as follows:
  • Doesn't lose your data.
  • Doesn't lose your data, ever.
  • Doesn't have to go offline for schema changes, including adding or removing indexes.
  • Doesn't have a query language.  This may end up making complex queries more complex.  That's okay, because it makes simple CRUD operations dead easy.
  • Doesn't have any avoidable hard-coded limits that aren't insanely large.  Whatever bit-size seems reasonable for a counter, I'll double it (if it's not already an arbitrary precision value).
Notice how I'm mostly concerned with what it doesn't do?  Yeah.  I'm trying to keep this feasible.
  • Doesn't do random I/Os except for indexes and one (of seven) table types.  Well, two, I guess.
  • Doesn't do row-level locking.
  • Doesn't do multi-table or multi-record transactions, let alone two-phase commit.  Look, if you're a bank, you already have the money, go buy DB2 and stop bothering me.
  • Doesn't do joins...  Well, it sort of does.  We'll get to that.
  • Doesn't guarantee high performance, full consistency, versioning, and multi-master replication on any single table type.  It will have all of those, but you can only choose one from column A and one from column B.
  • Doesn't scale to ridiculous sizes.  If at any point, though, it looks like it can't at least scale to pretty darn big, I'll drop it.
  • Doesn't ever do random access of random-length data to disk.  Um, except for indexes, where the random-length data will be divvied up into fixed-length blocks anyway.  And for the initial development versions, indexes are likely to be memory-based unless I find some nice (simple, fast, efficient) on-disk indexing libraries.
  • Doesn't lose your data.  I mean it.  Even when it loses your data, it doesn't lose your data.
Okay, that's what it won't do.  What will it do?
  • Supports multiple table types, each optimised for a different task.  So even though there's no one table type that does everything, there should be something good enough for most cases.

    Specifically:

    • Store - a log-structured, versioned, indexed document store.  Documents are never deleted or changed.  All changes are added as a new version of the document at the end of the table.  Everything is stored in JSON or YAML.  (There's a rough sketch of this format after the feature list.)

      Advantages - no random I/Os for data, roll back to any point in time, back up simply by copying the data segments.  The entire table is in text, so even if everything goes splatooie, it's easy to write a program to parse it and salvage your data.

      Disadvantages - if your records change frequently, it will eat disk space like candy.  You can do an online file pack to purge old versions of documents, but that's a chore.  Restore from backup would require an index rebuild.
    • Fixed - a fixed length, indexed, random-access table with change log.  Pretty much the diametric opposite of Store.

      Advantages - fast access to data, because the position on disk is a function of the record number, and it's all binary so there's no parsing required.  Every record is versioned, so by copying the data file plus the change log you can roll forward to a consistent state.  You can't necessarily roll back, though.

      Disadvantages - fixed length.
    • Table - a combination of a Store for the documents and a Fixed for numeric fields, dates, booleans and any other fixed-length data.

      Advantages - all of the advantages of Store and Fixed together.  The default on-disk table type, hence the name.  Supports full roll-forward and roll-back recovery.

      Disadvantages - the fixed fields are written twice on a new major document version, once to the Store file and once to the Fixed file.  On the other hand, changes to only the fixed fields don't require an update of the Store, which significantly reduces storage when you have documents with a few numeric fields that change frequently.  Also, reading a document requires reading two files, and you can't readily get the fixed values for minor versions (where the Store wasn't updated).
    • Space - a non-versioned, indexed, in-memory document store, with snapshot plus log persistence.

      Advantages - should be very very fast, since it's entirely held in memory.  Also very flexible for the same reason.  Roll-forward recovery, though no roll-back.  Backups are still as simple as copying the files on disk.

      Disadvantages - if your server crashes, or even if you need to do a normal reboot, you can't use the table until it's reloaded itself from disk, resynced, and rebuilt the indexes.  Thus primarily suited for frequently-accessed but relatively small tables.
    • Cache - a non-versioned, indexed, in-memory document store with a fixed size and LRU policy.

      Advantages - it's a memory-based cache with the exact same semantics as your database.

      Disadvantages - I'd be surprised if it gets within a factor of 3 of the speed of dedicated caches like memcached.
    • Queue - a disk- or memory-based document queue, i.e. first-in first-out.  Disk-based queues use segmented files and a transaction log for recovery and efficient space reclamation.

      Advantages - it's a queue with the same semantics as your database.  Well, kind of.  I don't know that I'll actually support all the fancy stuff.  Does only sequential disk I/O.

      Disadvantages - won't have some of the fancy features of something like ActiveMQ.  However, probably won't arbitrarily run out of memory and wedge itself.  At least if it crashes outright you can restart it.
    • Stack - a disk- or memory-based document stack, i.e. last-in first-out.  Reads and writes alike happen at the end of the file.

      Advantages - it's a stack.

      Disadvantages - has to lock the file for every operation to prevent screwups, so won't be super-efficient.
    • Others under consideration
      • Array - a shared fixed-structure matrix with atomic matrix operations.
      • Cloud - a master-master eventually-consistent Space.
      • Merge - a (probably read-only) table granting easy access to multiple other tables.
      • Graph - an in-memory graph-structured table (for example, for social network relationship data), with snapshot/log persistence.
      • Deque - a double-ended queue, combining the functions of Stacks and Queues.
      • Batch - a batch-oriented in-memory table with automatic snapshot persistence.
      • Share - a cached Store, ideal for data that can afford to be a little out of date but can't be inconsistent.
  • Support multiple data types that (mostly) map closely to Python's own:
    • Int
    • Char
    • Date
    • Time
    • Float
    • String
    • Money
    • Number
    • Geometry
      • Point
      • Line
      • Square
      • Rectangle
      • Circle
      • Ellipse
    • Text
    • Auto
    • Logical
    • Encrypted
    • Binary
  • Support multiple data structures within documents that closely map to Python's own:
    • Map
    • Set
    • Bag
    • List
    • Array
    • Variant
    • Reference
  • Support multiple index types and modes:
    • B-tree/B+-tree (of some sort) for primary keys, unique indexes, and general purpose indexy stuff.
    • R-tree and/or Quadtree for GIS stuff.
    • Full-text index, which will probably start out as a hacky B-tree of some sort.
    • Indexing of structures (lists, maps etc) within documents.
    • Partial indexes.
  • Triggers and stored procedures.  The embedded Lua interpreter I'm putting into the next version of Minx will do nicely.
  • Embedded database.  Don't need a full-fledged dataserver?  Just run the whole thing within your app.  The code will be split into a database library and a dataserver that runs on top.
  • Replication -  for Store, Queue and Stack, a choice of multi-master replication with eventual consistency or master-slave.  For Fixed, Table, and Space, master-slave replication.  For Cache, no replication.  (It's a cache!)
  • Sharding - for Store (probably) and Cache, easy sharding across servers.  For other table types (probably), no sharding.
  • Uses JSON or YAML everywhere, for data storage, data logs, config files, schema files, APIs and anywhere else a standard format is required.

    Advantages - no XML.

    Disadvantages - none.
  • Pure-ish Python.  The plan is to write it all in Python, with some optimisations in place for Cython.

    You can do that - it's kind of neat.  The exact same code can run interpreted with regular Python, JIT-compiled with Psyco, binary-compiled with Cython, on the JVM with Jython, or on .NET with IronPython.  And that's the plan: to make it run everywhere, but include optimisations for regular Python on Linux on x86 or x86_64.  And avoid those horrifying string concatenations if I can.

    One catch I know already - for the Space snapshots, I'm planning to use Unix fork semantics, which are copy-on-write, i.e. you get a static snapshot of all your data structures at an amortised cost, so you can easily write it back to disk while staying online.  (There's a sketch of the trick after this list.)  Windows' fork semantics are different and don't let you do that, so snapshots would stall the database, or at least the table.  Still, with commodity hard drives achieving peaks of over 100MB/second and modest RAID arrays reaching a few hundred MB/second, even writing a few GB of data to disk once a day shouldn't take too long.
  • Designed to take advantage of SSDs and HDDs.  Put the random I/O load on your SSDs and the sequential-update bulk data on your HDDs.  Or put everything on SSDs, that works too.  I'm not going to bother to try to make random I/O work super-efficiently on HDDs; that's simply a losing game.  For small databases just use Array tables and load everything into memory; for larger databases buy yourself a nice Intel X25-E.
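
To make the Store idea above concrete, here's a rough sketch of the format - an illustration only, not Pita itself; a real Store would use segmented files and an offset index rather than rescanning the log:

    import json, time

    # Append-only log of JSON documents, one version per line, nothing ever
    # overwritten.  Sketch of the idea only.

    class Store(object):
        def __init__(self, path):
            self.path = path

        def put(self, doc_id, doc):
            """Append a new version of the document to the end of the log."""
            version = len(list(self.versions(doc_id))) + 1
            rec = {'id': doc_id, 'version': version, 'ts': time.time(), 'doc': doc}
            with open(self.path, 'a') as f:
                f.write(json.dumps(rec) + '\n')
            return version

        def versions(self, doc_id):
            """Yield every stored version of a document, oldest first."""
            try:
                with open(self.path) as f:
                    for line in f:
                        rec = json.loads(line)
                        if rec['id'] == doc_id:
                            yield rec
            except IOError:
                return

        def get(self, doc_id, version=None):
            """Return the latest version, or roll back to any earlier one."""
            chosen = None
            for rec in self.versions(doc_id):
                if version is None or rec['version'] <= version:
                    chosen = rec
            return chosen['doc'] if chosen else None

    store = Store('posts.store')
    store.put(1, {'title': 'Pita', 'status': 'draft'})
    store.put(1, {'title': 'Pita', 'status': 'published'})  # old version kept
    print(store.get(1))             # latest version
    print(store.get(1, version=1))  # any point in time

Since nothing is ever overwritten, a backup is just a copy of the file, and rolling back is just reading an earlier version.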
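
And here's the fork trick for Space snapshots, in miniature - Unix only, with a plain dict standing in for a Space table, and names that are illustrative rather than Pita's API.  The forked child gets a copy-on-write view of the parent's memory, so it can write a consistent snapshot to disk while the parent keeps taking updates:

    import json, os

    def snapshot(table, path):
        """Write a consistent snapshot of an in-memory table without blocking it.

        On Unix, fork() gives the child a copy-on-write view of the parent's
        memory: the child sees the table exactly as it was at the moment of
        the fork, while the parent carries on mutating its own copy.
        """
        pid = os.fork()
        if pid != 0:
            return pid                    # parent: keep serving requests
        # child: dump the frozen view and exit without running cleanup handlers
        with open(path, 'w') as f:
            json.dump(table, f)
        os._exit(0)

    space = {'post:1': {'title': 'Pita', 'comments': 7}}
    child = snapshot(space, 'space.snapshot')
    space['post:1']['comments'] = 8       # parent keeps changing the live table
    os.waitpid(child, 0)
    print(json.load(open('space.snapshot')))  # the snapshot still says 7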

Okay, that's a quick rundown on the planned features.  Tomorrow, I'll start writing up my thoughts on the programming and data APIs. 

Update: Found Python implementations of B+Trees, R*-Trees, and a B-Tree based sorted list and dict module.  That'll save some time!

* That is, do not meet my current requirements.  Or in some cases, actually spread pneumonic plague.  YMMV.
** Maybe.
*** The MongoDB server is C, but the benchmark program is Python, and it ships a whole lot of data back and forth.  Which tells me that a Python program can ship a whole lot of data back and forth.  The program can create 32,000 1K or 22,000 2K records per second, and read 50,000 1K or 32,000 2K records per second.  The 90MB per second was achieved with 10K records.

Posted by: Pixy Misa at 09:13 PM | Comments (7) | Add Comment | Trackbacks (Suck)
Post contains 1703 words, total size 12 kb.

Sunday, February 07

Geek

Seeking A Database That Doesn't Suck

Anyone?

Quick recap of databases that suck - or at least, suck for my purposes - and some that I'm still investigating.

SQL
  • MySQL - Lacks INTERSECT and EXCEPT, lacks array support, has only so-so full-text indexing, and offers either concurrency or full-text indexes and GIS, but not both.
  • PostgreSQL - Provides arrays, concurrency, full-text indexes and GIS, but a lot of this is lost in a twisty maze of plugins and thoroughly non-standard operators. And the full-text indexing sucks.
  • Ingres - Free these days (if you don't need a support contract). It's a good, solid database, but doesn't actually offer anything I can't get from MySQL with InnoDB.
  • Firebird - Doesn't seem to offer anything more than MySQL or PostgreSQL or Ingres. Which doesn't mean that it's bad, but it doesn't really help me either.
  • SQL Server - Needs Windows, which is worth an automatic 6 demerits, even though I can get enterprise-level Windows and SQL Server products for free. (My company is a Microsoft BizSpark member.) Full-text, GIS, INTERSECT and EXCEPT are all there, but still no arrays.
  • IBM DB2 - Costs too much.
  • Oracle - Costs way too much.
  • Progress / OpenEdge - Solid database, lovely 4GL, but the last time I looked at it (2006) it was still mired in 16-bitness (!), and the 4GL is too slow for anything complicated. Also expensive, with a screwed-up pricing model. Would use it if I could.

NoSQL
  • Redis - Nice feature set, and looks very useful for small systems, but the current version is strictly memory-based. (It's persistent through snapshots and logging, but the whole database must fit in memory.) The developers are working on this, though. The API could do with a tidy-up too; it has different calls for the same operation on different data structures.
  • MongoDB - Very nice feature set. It's a document database, but it stores the documents in a JSON-like structure (called BSON), can nest documents arbitrarily, and can inspect the fields within a document and build indexes on them. But its memory handling is lousy; while it's not explicitly memory-based, I wouldn't want to run it on anything but a dedicated physical server with more memory than my total database size. I could just throw money at it and put another 24GB of RAM in the server (far cheaper than a commercial RDBMS license), which would last us for a while, but I have serious doubts about its robustness as well.
  • CouchDB - Written in Erlang, which is always a warning sign. Erlang programmers seem to care about performance and reliability far more than they care about making a product that anyone would want to use. In this case, instead of MongoDB's elegant query-by-example (with extensions), I write map/reduce functions in JavaScript and send them to the server. In what universe is that an improvement on SQL? On the plus side, it apparently has replication. On the minus side, it's an Apache project, and I have yet to meet an Apache project that didn't suck in some major way.
  • HBase - Looks good if you have billions of very regular rows (which I do at my day job, but not here). Nothing wrong with it, but not a good fit.
  • Project Voldemort - Pure evil. No, wait. This one came out of LinkedIn. It's one of the recent flock of inherently scalable (automatic sharding and multi-master replication) key/value databases. In their own words, it "is basically just a big, distributed, persistent, fault-tolerant hash table". That's a very useful thing, but I need defined ordering (multiple defined orderings for the same dataset, in fact), which a hash table can't give me.
  • Cassandra - This is Facebook's distributed hash table thingy (it's like the old days, when every server company designed their own CPU). It may have some vague concept of ordering, so I'll take a closer look.
  • Jackrabbit - It's a Java/XML datastore from the Apache Foundation. Uh-uh. I've used ActiveMQ, guys. You can't fool me twice. I'd sooner chew rusty nails.
  • Riak - Bleah. Another key/value/map/reduce thing. In their own words:
    A "map phase" is essentially just a function ("F") and an argument ("A") that is defined as part of a series of phases making up a given map/reduce query. The phase will receive a stream of inputs ("I"), each of which consists of a key identifying a Riak object and an optional additional data element to accompany that object. As each input is received by the phase, a node that already contains the document ("D") corresponding to "I" will run F(D,A) and stream along the results to the next phase. The point here is that your function can be executed over many data items, but instead of collecting all of the data items in one place it will execute wherever the data is already placed.
    Clear? Right. Not interested at all.
  • LightCloud - A distributed key-value store from Plurk. On the plus side, it's written in Python, supports Tokyo Tyrant and/or Redis as a back end, and "plurk" is fun to say. On the downside, it seems to be just a key/value database and not all that fast; it doesn't seem to expose the more interesting features of Tokyo Cabinet or Redis. It does at least have some update-in-place operations.
  • GT.M - GT.M is a crocodile.  That's not an aspersion, exactly.  Crocodiles were contemporaries of the dinosaurs, but when the dinosaurs went extinct, the crocodiles survived, and they are still around and occasionally snacking on jumped-up bipeds today.  It's a hierarchical key-value store with a variety of access mechanisms.  It's unquestionably powerful, but it looks clunky; the MUMPS code reminds me of the systems I was employed to replace as a little boy programmer in the '80s, and the Python interface doesn't actually look like Python, but more like some odd offspring of Cobol and Pascal.
  • Neo4j - Neo4j is a graph database, which is not something I've worked with before.  Graphs are a very general data structure for mapping relationships; while relational databases model parent-child relationships well, graphs are the natural model for networks of friends (for example), where you can end up back where you started by any number of differing paths.  The shortcoming of graphs is that they do not have a defined order, which is something I need a lot of around here.
Libraries and Other
  • Berkeley DB - An oldie but a goodie. An embedded, transactional database. You can shove pretty much anything into it; it doesn't care. No query language, but does have indexes. One problem is this clause from the license:
    Redistributions in any form must be accompanied by information on how to obtain complete source code for the DB software and any accompanying software that uses the DB software. The source code must either be included in the distribution or be available for no more than the cost of distribution plus a nominal fee, and must be freely redistributable under reasonable conditions.
    Any code that uses the DB software? I assume they mean direct code embedding/linking, but that's pretty broad. And it's really just a library, albeit a good one; it could serve as the basis for a database server, but it isn't that by itself.
  • Metakit - Metakit is a column-oriented database library, with a very nice, clean Python interface. For example, to display all posts by user 'Pixy Misa', you could simply write:
    for post in posts.select(user = 'Pixy Misa'):
      print post.title, post.date
    The problem is, it doesn't scale. I tried using it for the first pass at Minx, about four years ago, and it broke long before it reached our current database size. Like MongoDB, nice semantics, not so great on the implementation.
  • Tokyo Cabinet / Tokyo Tyrant / Tokyo Dystopia, VertexDB - Tokyo Cabinet is a database library similar to Berkeley DB, but licensed under the no-worries LGPL. Tyrant is a "lightweight" database server built on Cabinet, Dystopia a full-text search engine built on Cabinet, and VertexDB a graph database built on Cabinet. I haven't explored these in depth yet because the standard Tokyo Cabinet distribution doesn't include Python libraries (Perl, Ruby, Java and Lua, but no Python?), but there are third-party libraries available.
  • Xapian and Omega - Xapian is a full-text search library, and Omega a search engine built on Xapian.  In fact, Xapian is more than that; it can do range searches on strings, numbers, and dates as well, and can store arbitrary documents.  It's quite good for searches, but not really suited to general database work.

Posted by: Pixy Misa at 01:17 AM | Comments (67) | Add Comment | Trackbacks (Suck)
Post contains 1381 words, total size 10 kb.

Saturday, February 06

Geek

Mongo Fail!

MongoDB ran out of memory and crashed during benchmarking.  I headed off to look for the appropriate parameters to tune its memory consumption, and discovered that there aren't any.

MongoDB uses memory-mapped files for storage - as far as I can tell, it maps the files in and then builds all its structures directly in that memory, relying on the operating system to handle paging.  On OpenVZ, that approach seems unlikely to work.  And without at least a synchronous recovery log, it seems destined to destroy your database sooner or later anyway.
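
For illustration, this is roughly what memory-mapped storage looks like at the systems level - a toy sketch of the general technique, not MongoDB's actual code:

    import mmap, os, struct

    # Create a fixed-size data file, map it into memory, and write records by
    # poking bytes directly into the mapping.  The OS decides when dirty pages
    # actually reach the disk - which is exactly the part that goes wrong when
    # the OS (or an OpenVZ container limit) can't give you the memory.

    SIZE = 16 * 1024 * 1024                      # 16MB data file

    fd = os.open('data.mmap', os.O_RDWR | os.O_CREAT)
    os.ftruncate(fd, SIZE)                       # preallocate the file
    buf = mmap.mmap(fd, SIZE)                    # map it into our address space

    # "store" record 7: just write into memory at a computed offset
    record = struct.pack('i60s', 7, b'Mongo smash!')
    offset = 7 * len(record)
    buf[offset:offset + len(record)] = record    # no write() call anywhere

    buf.flush()                                  # politely ask the OS to sync
    buf.close()
    os.close(fd)

Every "write" is just a store into mapped memory; both durability and the memory budget are whatever the OS (or the container) feels like giving you.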

So, nice features, shame about the functionality.

Sigh.

Posted by: Pixy Misa at 02:41 PM | Comments (2) | Add Comment | Trackbacks (Suck)
Post contains 102 words, total size 1 kb.

Geek

Mongo Angry! Mongo Smash!

One of the most intractable problems I have with Minx stems from its inherent many-to-many structure.

Minx supports many sites.
Each site can have many folders.
Each folder can contain many threads.
Each thread can appear in many folders (even on different sites).
Each thread can have many items (posts, comments, and various other less used thingies).

What this means is that to display the 20 most recent comments on your blog, I have to - at least in theory - perform a four-way join, sort, and select.  I actually play some tricks to reduce it to a three-way join on a subset of the data, but once you start to page through the comments the tricks begin to break down.  Not enough that it's noticeably slow at present, but enough that it won't scale to really large numbers of users or really large sites.

I call it the grandparent problem.  If you're looking for one record - an individual comment - no problem, it's O(log n).  If you're looking for the children of a record - comments on a post, comments by a particular user - no problem, it's O(n + log n).  But if you're looking for grandchildren of a record, it's O(n log n), and that n no longer bears any relation to the number of records you actually want; you have to do a huge join, then sort the results, then select the handful you actually want.

MongoDB has a set of features that, put together, look like they solve exactly this problem.

First, you can have arrays in your records.  So, where I currently create duplicate thread records to place a thread in multiple folders (categories, for example), I can just add the category IDs to the array.

By itself that wouldn't be so useful, were it not for feature two: You can index arrays.  So I can create an index on the category array and post time, and simply adding and removing category IDs from that array will make the post show correctly in your folders with no performance hit.  In fact, it's more efficient (both in space and time) than the current technique.

So far so good.  Now for feature three: Arrays in records can contain not just single data values (like a category ID), but other records.  So I can put the posts and comments inside the thread record, and when I fetch a thread, I can fetch the entire thread contents in one go.

Now that wouldn't be so useful either except for feature four: You can build an index on a field in a record in an array in a record in your table.

That is, you can shove all your comments straight into the thread record, and then pick them out 50 at a time for paged comments, or in an entirely different order - say, the last 10 comments posted on your blog, no matter what post they're on.

Magic!
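
In rough PyMongo terms, the four features combine like this - a sketch only, with made-up collection and field names rather than Minx's actual schema, and using the current driver's method names:

    import pymongo

    client = pymongo.MongoClient()
    threads = client.minx.threads

    # Feature 1: a thread document carries an array of folder/category ids;
    # Feature 3: its comments live inside it as embedded documents.
    threads.insert_one({
        '_id': 1001,
        'title': 'Mongo Angry! Mongo Smash!',
        'folders': [3, 17, 42],                       # appears in three folders
        'posted': '2010-02-06T02:50:00',
        'comments': [
            {'n': 1, 'user': 'Pixy Misa', 'posted': '2010-02-06T03:10:00',
             'text': 'First!'},
            {'n': 2, 'user': 'Anonymous', 'posted': '2010-02-06T04:22:00',
             'text': 'Mongo only pawn in game of life.'},
        ],
    })

    # Feature 2: index the array itself - adding or removing a folder id is
    # enough to make the thread show up (or not) in that folder's listing.
    threads.create_index([('folders', pymongo.ASCENDING),
                          ('posted', pymongo.DESCENDING)])

    # Feature 4: index a field of the documents inside the embedded array.
    threads.create_index([('comments.posted', pymongo.DESCENDING)])

    # "20 most recent threads in folder 42" - no join, no duplicate records.
    recent = threads.find({'folders': 42}).sort('posted', -1).limit(20)

    # Put the thread in another folder: one in-place update, index stays right.
    threads.update_one({'_id': 1001}, {'$push': {'folders': 99}})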

The one thing that seems slightly tricky is that MongoDB is document based.  You don't read fields from the database, you read documents.  You can store documents one inside another (comments inside threads, for example), and then you can get one or more of those comments without reading the whole thing.  But if you have information in the thread record itself, you can only get at it by reading the whole thread, comments and all.  For an Ace of Spades post with 1000+ comments, that would burn up all the performance I just gained.

There are some ways around that with a little bit of data duplication and other hackery, though it would be nice if MongoDB let you simply select a subset of fields to be returned.  It already has ways of updating individual fields inside a document, so something like that might already be on the way.

Anyway, that's where I'll be this weekend.

Update: MongoEngine provides a rather nice ORM - um, ODM - for Python and MongoDB.

Posted by: Pixy Misa at 02:50 AM | Comments (8) | Add Comment | Trackbacks (Suck)
Post contains 667 words, total size 4 kb.

Geek

Virtual, Shmirtual

Ten little virtual servers are we,
Freshly created with OpenVZ,
Ten little servers running FreeBSD CentOS 5.4,
Ten little virtual servers.

Everything is installed from source,
Automated by script of course,
Enough packages to choke a baby horse,
Ten little virtual servers.

Ten little virtual servers swiftly,
Updated to run MongoDB,
No more MySQL for this Pixy!
Ten little virtual servers,
Ten little virtual servers.
more...

Posted by: Pixy Misa at 01:15 AM | Comments (1) | Add Comment | Trackbacks (Suck)
Post contains 182 words, total size 4 kb.
