Monday, February 08

Geek

Pita

Or, The Screw It, I'm Writing A Database Post

Okay, I've had it up to here with databases that suck.*

So I'm writing my own.**  It is called, for obvious reasons, Pita.

The plan is to steal as much of the low ground as possible.  Anything hard, it just won't do.

First, it won't be written in C.  It will be written in Python.  Yes, performance will suffer, but Python can actually deliver surprisingly well, and it has some very well-optimised and well-tested libraries available.  My Python benchmark script for MongoDB*** was moving as much as 90MB of data per second, which would about saturate a gigabit ethernet link or a SATA disk.

The design goals are as follows:
  • Doesn't lose your data.
  • Doesn't lose your data, ever.
  • Doesn't have to go offline for schema changes, including adding or removing indexes.
  • Doesn't have a query language.  This may end up making complex queries more complex.  That's okay, because it makes simple CRUD operations dead easy.
  • Doesn't have any avoidable hard-coded limits that aren't insanely large.  Whatever bit-size seems reasonable for a counter, I'll double it (if it's not already an arbitrary precision value).
Notice how I'm mostly concerned with what it doesn't do?  Yeah.  I'm trying to keep this feasible.
  • Doesn't do random I/Os except for indexes and one (of seven) table types.  Well, two, I guess.
  • Doesn't do row-level locking.
  • Doesn't do multi-table or multi-record transactions, let alone two-phase commit.  Look, if you're a bank, you already have the money, go buy DB2 and stop bothering me.
  • Doesn't do joins...  Well, it sort of does.  We'll get to that.
  • Doesn't guarantee high performance, full consistency, versioning, and multi-master replication on any single table type.  It will have all of those, but you can only choose one from column A and one from column B.
  • Doesn't scale to ridiculous sizes.  If at any point, though, it looks like it can't at least scale to pretty darn big, I'll drop it.
  • Doesn't ever do random access of random-length data to disk.  Um, except for indexes, where the random-length data will be divvied up into fixed-length blocks anyway.  And for the initial development versions, indexes are likely to be memory-based unless I find some nice (simple, fast, efficient) on-disk indexing libraries.
  • Doesn't lose your data.  I mean it.  Even when it loses your data, it doesn't lose your data.
Okay, that's what it won't do.  What will it do?
  • Supports multiple table types, each optimised for a different task.  So even though there's no one table type that does everything, there should be something good enough for most cases.

    Specifically:

    • Store - a log-structured, versioned, indexed document store.  Documents are never deleted or changed.  All changes are added as a new version of the document at the end of the table.  Everything is stored in JSON or YAML.

      Advantages - no random I/Os for data, roll back to any point in time, back up simply by copying the data segments.  The entire table is in text, so even if everything goes splatooie, it's easy to write a program to parse it and salvage your data.

      Disadvantages - if your records change frequently, it will eat disk space like candy.  You can do an online file pack to purge old versions of documents, but that's a chore.  Restore from backup would require an index rebuild.
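
      To make that concrete, here's a minimal sketch of the append-only mechanics.  All the names are invented and error handling is omitted - this is the idea, not the implementation:

        import json, os, time

        class Store:
            """Append-only, versioned document store (sketch)."""
            def __init__(self, path):
                self.path = path
                self.index = {}  # doc_id -> list of (version, file offset)
                if os.path.exists(path):
                    self._rebuild_index()

            def _rebuild_index(self):
                # The log is self-describing text, so recovery is just a scan.
                offset = 0
                with open(self.path, 'rb') as f:
                    for line in f:
                        rec = json.loads(line.decode())
                        self.index.setdefault(rec['id'], []).append(
                            (rec['version'], offset))
                        offset += len(line)

            def put(self, doc_id, doc):
                # Never modify in place: append a new version at the end.
                versions = self.index.setdefault(doc_id, [])
                rec = {'id': doc_id, 'version': len(versions) + 1,
                       'time': time.time(), 'doc': doc}
                with open(self.path, 'ab') as f:
                    offset = f.tell()
                    f.write(json.dumps(rec).encode() + b'\n')
                versions.append((rec['version'], offset))

            def get(self, doc_id, version=None):
                # Versions are 1-based and sequential, so you can read any
                # point in the document's history.
                versions = self.index[doc_id]
                _, offset = versions[-1] if version is None else versions[version - 1]
                with open(self.path, 'rb') as f:
                    f.seek(offset)
                    return json.loads(f.readline().decode())['doc']
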
    • Fixed - a fixed length, indexed, random-access table with change log.  Pretty much the diametric opposite of Store.

      Advantages - fast access to data, because the position on disk is a function of the record number, and it's all binary so there's no parsing required.  Every record is versioned, so by copying the data file plus the change log you can roll forward to a consistent state.  You can't necessarily roll back, though.

      Disadvantages - fixed length.
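
      The core access pattern fits in a dozen lines.  The record layout here is made up purely for illustration:

        import struct

        # Hypothetical record layout: 8-byte id, 8-byte float, 1-byte flag.
        RECORD = struct.Struct('<qd?')

        def read_record(f, recno):
            # The position on disk is a pure function of the record number:
            # one seek plus one fixed-size read, no parsing.
            f.seek(recno * RECORD.size)
            return RECORD.unpack(f.read(RECORD.size))

        def write_record(f, recno, rec_id, value, flag):
            f.seek(recno * RECORD.size)
            f.write(RECORD.pack(rec_id, value, flag))
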
    • Table - a combination of a Store for the documents and a Fixed for numeric fields, dates, booleans and any other fixed-length data.

      Advantages - all of the advantages of Store and Fixed together.  The default on-disk table type, hence the name.  Supports full roll-forward and roll-back recovery.

      Disadvantages - the fixed fields are written twice on a new major document version, to both the Store and the Fixed files.  On the other hand, changes to the fixed fields alone don't require an update of the Store, significantly reducing storage requirements when you have documents with a few frequently-changing numeric fields.  Similarly, reading a document requires reading two files, and you can't readily get the fixed values for minor versions (where the Store wasn't updated).
    • Space - a non-versioned, indexed, in-memory document store, with snapshot plus log persistence.

      Advantages - should be very very fast, since it's entirely held in memory.  Also very flexible for the same reason.  Roll-forward recovery, though no roll-back.  Backups are still as simple as copying the files on disk.

      Disadvantages - if your server crashes, or even if you need to do a normal reboot, you can't use the table until it's reloaded itself from disk, resynced, and rebuilt the indexes.  Thus primarily suited for frequently-accessed but relatively small tables.
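
      Recovery is the interesting part: load the last snapshot, then replay the change log on top of it.  A sketch, with invented file formats:

        import json

        def recover(snapshot_path, log_path):
            # Roll-forward recovery: start from the last snapshot...
            with open(snapshot_path) as f:
                table = json.load(f)
            # ...then replay every change logged since it was taken.
            with open(log_path) as f:
                for line in f:
                    op = json.loads(line)
                    if op['op'] == 'put':
                        table[op['id']] = op['doc']
                    elif op['op'] == 'delete':
                        table.pop(op['id'], None)
            return table
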
    • Cache - a non-versioned, indexed, in-memory document store with a fixed size and LRU policy.

      Advantages - it's a memory-based cache with the exact same semantics as your database.

      Disadvantages - I'd be surprised if it gets within a factor of 3 of the speed of dedicated caches like memcached.
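
      The eviction policy is bog-standard LRU - something like this sketch:

        from collections import OrderedDict

        class Cache:
            """Bounded document cache with least-recently-used eviction."""
            def __init__(self, max_items):
                self.max_items = max_items
                self.docs = OrderedDict()

            def get(self, doc_id):
                doc = self.docs.pop(doc_id)  # raises KeyError on a miss
                self.docs[doc_id] = doc      # re-insert as most recently used
                return doc

            def put(self, doc_id, doc):
                self.docs.pop(doc_id, None)
                self.docs[doc_id] = doc
                if len(self.docs) > self.max_items:
                    self.docs.popitem(last=False)  # evict the oldest entry
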
    • Queue - a disk- or memory-based document queue, i.e. first-in first-out.  Disk-based queues use segmented files and a transaction log for recovery and efficient space reclamation.

      Advantages - it's a queue with the same semantics as your database.  Well, kind of.  I don't know that I'll actually support all the fancy stuff.  Does only sequential disk I/O.

      Disadvantages - won't have some of the fancy features of something like ActiveMQ.  However, probably won't arbitrarily run out of memory and wedge itself.  At least if it crashes outright you can restart it.
    • Stack - a disk- or memory-based document stack, i.e. last-in first-out.  Reads and writes alike happen at the end of the file.

      Advantages - it's a stack.

      Disadvantages - has to lock the file for every operation to prevent screwups, so won't be super-efficient.
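
      Here's roughly what the disk flavour looks like, locking included.  The framing format is invented for illustration, and fcntl makes it Unix-only:

        import fcntl, json, os, struct

        LEN = struct.Struct('<I')  # trailing 4-byte length word per record

        def push(path, doc):
            with open(path, 'ab') as f:
                fcntl.flock(f, fcntl.LOCK_EX)  # one operation at a time
                data = json.dumps(doc).encode()
                # The trailing length word lets pop() find the record start.
                f.write(data + LEN.pack(len(data)))

        def pop(path):
            with open(path, 'r+b') as f:
                fcntl.flock(f, fcntl.LOCK_EX)
                f.seek(0, os.SEEK_END)
                end = f.tell()
                f.seek(end - LEN.size)
                (n,) = LEN.unpack(f.read(LEN.size))
                f.seek(end - LEN.size - n)
                doc = json.loads(f.read(n).decode())
                f.truncate(end - LEN.size - n)  # everything happens at the end
                return doc
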
    • Others under consideration
      • Array - a shared fixed-structure matrix with atomic matrix operations.
      • Cloud - a master-master eventually-consistent Space.
      • Merge - a (probably read-only) table granting easy access to multiple other tables.
      • Graph - an in-memory graph-structured table, for example for social network relationship data, with snapshot/log persistence.
      • Deque - a double-ended queue, combining the functions of Stacks and Queues.
      • Batch - a batch-oriented in-memory table with automatic snapshot persistence.
      • Share - a cached Store, ideal for data that can afford to be a little out of date but can't be inconsistent.
  • Support multiple data types that (mostly) map closely to Python's own:
    • Int
    • Char
    • Date
    • Time
    • Float
    • String
    • Money
    • Number
    • Geometry
      • Point
      • Line
      • Square
      • Rectangle
      • Circle
      • Ellipse
    • Text
    • Auto
    • Logical
    • Encrypted
    • Binary
  • Support multiple data structures within documents that closely map to Python's own:
    • Map
    • Set
    • Bag
    • List
    • Array
    • Variant
    • Reference
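
    For instance, a single document might mix several of these, mapping straight onto Python objects (illustrative only):

      doc = {
          'name': 'Fred',                        # Map -> dict
          'tags': ['staff', 'sysadmin'],         # List -> list
          'groups': set(['wheel', 'games']),     # Set -> set (serialised as a list)
          'logins': [[1265600000, '10.0.0.1']],  # Array -> nested lists
      }
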
  • Support multiple index types and modes:
    • B-tree/B+-tree (of some sort) for primary keys, unique indexes, and general purpose indexy stuff.
    • R-tree and/or Quadtree for GIS stuff.
    • Full-text index, which will probably start out as a hacky B-tree of some sort.
    • Indexing of structures (lists, maps etc) within documents.
    • Partial indexes.
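
    Until a proper B+tree library turns up, a plain sorted list gives the same semantics for the memory-based case - exact lookups, range scans, and (with a predicate) partial indexes.  A sketch:

      import bisect

      class SortedIndex:
          """Stand-in for a B+tree: ordered keys, range scans, partial indexing."""
          def __init__(self, predicate=None):
              self.keys, self.ids = [], []
              self.predicate = predicate  # partial index: only matching docs

          def insert(self, key, doc_id, doc=None):
              if self.predicate and not self.predicate(doc):
                  return
              i = bisect.bisect_right(self.keys, key)
              self.keys.insert(i, key)
              self.ids.insert(i, doc_id)

          def lookup(self, key):
              i = bisect.bisect_left(self.keys, key)
              while i < len(self.keys) and self.keys[i] == key:
                  yield self.ids[i]
                  i += 1

          def scan(self, lo, hi):
              # Half-open range [lo, hi), the bread and butter of ordered indexes.
              return self.ids[bisect.bisect_left(self.keys, lo):
                              bisect.bisect_left(self.keys, hi)]
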
  • Triggers and stored procedures.  The embedded Lua interpreter I'm putting into the next version of Minx will do nicely.
  • Embedded database.  Don't need a full-fledged dataserver?  Just run the whole thing within your app.  The code will be split into a database library and a dataserver that runs on top.
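
    The API is a topic for tomorrow's post, but the embedded shape would be something like this - every name below is invented:

      import pita  # hypothetical module

      db = pita.open('/var/data/myapp')  # in-process, no server running
      people = db.table('people', type='Table')
      people.put('fred', {'name': 'Fred', 'age': 42})
      print(people.get('fred')['name'])
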
  • Replication -  for Store, Queue and Stack, a choice of multi-master replication with eventual consistency or master-slave.  For Fixed, Table, and Space, master-slave replication.  For Cache, no replication.  (It's a cache!)
  • Sharding - for Store (probably) and Cache, easy sharding across servers.  For other table types (probably), no sharding.
  • Uses JSON or YAML everywhere, for data storage, data logs, config files, schema files, APIs and anywhere else a standard format is required.

    Advantages - no XML.

    Disadvantages - none.
  • Pure-ish Python.  The plan is to write it all in Python, with some optimisations in place for Cython.

    You can do that - it's kind of neat.  The exact same code can run interpreted with regular Python, JIT-compiled with Psyco, binary-compiled with Cython, on the JVM with Jython, or on .NET with IronPython.  And that's the plan: to make it run everywhere, but include optimisations for regular Python on Linux on x86 or x86_64.  And avoid those horrifying string concatenations if I can.
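
    (For the curious: repeated += on a string copies everything written so far, so building big buffers that way is quadratic.  Collect the pieces and join once at the end:)

      # Slow: each += recopies the whole buffer (quadratic in total size).
      # payload = b''
      # for i in range(100000):
      #     payload += str(i).encode()

      # Fast: accumulate pieces, one final copy at the end (linear).
      payload = b''.join(str(i).encode() for i in range(100000))
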

    One catch I know already - for the Space snapshots, I'm planning to use Unix fork semantics, which are copy-on-write, i.e. you get a static snapshot of all your data structures at an amortised cost so that you can easily write it back to disk while online.  Windows has no equivalent of fork and won't let you do that, so snapshots would stall the database, or at least the table.  Still, with commodity hard drives achieving peaks of over 100MB/second and modest RAID arrays reaching a few hundred MB/second, even writing a few GB of data to disk once a day shouldn't take too long.
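
    The fork trick itself is only a few lines (a sketch; Unix only, naturally):

      import json, os

      def snapshot(table, path):
          pid = os.fork()
          if pid == 0:
              # Child: copy-on-write gives it a frozen view of the parent's
              # memory as of the fork, however long the dump takes.
              with open(path, 'w') as f:
                  json.dump(table, f)
              os._exit(0)  # don't run the parent's cleanup handlers
          return pid  # parent keeps serving writes immediately

      child = snapshot({'doc1': {'n': 1}}, '/tmp/space.snap')
      os.waitpid(child, 0)  # in practice, reap the child asynchronously
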
  • Designed to take advantage of SSDs and HDDs.  Put the random I/O load on your SSDs and the sequential-update bulk data on your HDDs.  Or put everything on SSDs, that works too.  I'm not going to bother to try to make random I/O work super-efficiently on HDDs; that's simply a losing game.  For small databases just use Array tables and load everything into memory; for larger databases buy yourself a nice Intel X25-E.

Okay, that's a quick rundown on the planned features.  Tomorrow, I'll start writing up my thoughts on the programming and data APIs. 

Update: Found Python implementations of B+Trees, R*-Trees, and a B-Tree based sorted list and dict module.  That'll save some time!

* That is, do not meet my current requirements.  Or in some cases, actually spread pneumonic plague.  YMMV.
** Maybe.
*** The MongoDB server is C++, but the benchmark program is Python, and it ships a whole lot of data back and forth.  Which tells me that a Python program can ship a whole lot of data back and forth.  The program can create 32,000 1K or 22,000 2K records per second, and read 50,000 1K or 32,000 2K records per second.  The 90MB per second was achieved with 10K records.

Posted by: Pixy Misa at 09:13 PM | Comments (7)

1 My first thought, after only reading your first few sentences, is "hope having a life was nice."

Posted by: RickC at Tuesday, February 09 2010 01:32 AM (US8a/)

2 Yeah.  I do not want to write a database.  But...

I've found nice Rtree and Quadtree libraries, so GIS support is a go.

It's actually harder to find a good, general, Python-supported B+tree library, even though they're far more widely used, but I've known how B+trees work for 25 years so ... no, wait, 23 years ... so I can certainly write one myself.

Posted by: Pixy Misa at Tuesday, February 09 2010 02:23 AM (PiXy!)

3 Oh, and I haven't had a life since about 2006. :(

Posted by: Pixy Misa at Tuesday, February 09 2010 02:23 AM (PiXy!)

4 Sob, Pixy gave his life so that I can post pictures of Hakufu's boobs...

Posted by: Steven Den Beste at Tuesday, February 09 2010 03:25 AM (+rSRq)

5 Oh just use Tokyo Cabinet already, you lover of square wheels, jeez.

Posted by: Pete Zaitcev at Tuesday, February 09 2010 06:06 AM (/ppBw)

6 :P

Sound advice, Pete.  Give me a week or so to recover from my momentary insanity and I may even take it.

Posted by: Pixy Misa at Tuesday, February 09 2010 08:57 AM (PiXy!)

7 Just putting my own wrapper around Tokyo Cabinet could certainly make sense.  I still need to work out the semantics of the API though.

Posted by: Pixy Misa at Tuesday, February 09 2010 09:00 AM (PiXy!)
