Wednesday, February 10


View From A Pita

Quick list of the planned view types for Pita.  Not all of these will be supported in initial releases, though only two (Braid and Chord) involve known unknowns.  (A view is what SQL would call a table; a derived view in Pita is still called a view.)

Type   Storage  Records          Indexing    Recovery         Distribution
Basic  Memory   Variable         Hash        Snapshot         Broker
Space  Memory   Variable         Atree       Snapshot, Log    Broker
Cloud  Memory   Variable         Atree       Replication      Replication, Sharding
Scene  Memory   Variable         Any memory  Snapshot, Log    Broker
Batch  Memory   Variable         Atree       Snapshot         Broker
Array  Memory   Fixed            Array       Snapshot
Graph  Memory   Variable         Graph       Snapshot, Log    Broker
Queue  Memory   Variable         None        Snapshot, Log    Broker
Stack  Memory   Variable         None        Snapshot, Log    Broker
Deque  Memory   Variable         None        Snapshot, Log    Broker
Cache  Memory   Variable         Hash        None             Local
Image  Memory   Variable         Any memory  Snapshot         Broker
Store  Disk     Variable         Btree       Versioning       Broker
Shelf  Disk     Variable         Btree       Versioning       Broker
Fixed  Disk     Fixed            Btree       Log              Broker
Table  Disk     Variable, Fixed  Btree       Versioning, Log  Broker
Share  Disk     Variable, Fixed  Btree       Versioning, Log  Caching
Shard  Disk     Variable, Fixed  Btree       Replication      Replication, Sharding
Index  Disk     Variable         Btree       None             Broker
Chord  Disk     Variable         Btree       Versioning       Caching
Merge  Virtual  Variable, Fixed  Virtual     Virtual          Broker
Braid  Virtual  Variable, Fixed  Any memory  Virtual          Broker
Slice  Virtual  Variable, Fixed  Any memory  Virtual          Broker
Remap  Virtual  Variable, Fixed  Virtual     Virtual          Broker
Magic  Virtual  Variable, Fixed  Virtual     Virtual          Broker

Storage - Memory, Disk, or Virtual

The immediate storage mechanism.  Memory views are accessed in memory (and must fit in memory), but are persisted to disk.  Disk views are read and written from disk.  Virtual views are representations of other views or of programmatic data.

Records - Fixed, Variable

Pita supports three main modes of storing record data: fixed-length random access on disk, variable-length sequential access on disk, or any format in memory.  This avoids the complications and overhead of dynamically managing free space on disk (though Pita does handle that for Btree indexes).

Indexing - Hash, Atree, Btree, Qtree, Rtree, Array, Graph, Virtual

Hashes are simple one-to-one mappings of key to data, unordered, and are available for in-memory or on-disk data.  Atrees (actually AVL trees), Qtrees (quadtrees), Rtrees, Arrays, and Graphs are available indexing methods for in-memory data.  Btrees are available for on-disk data.
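The practical difference between the unordered and ordered structures is what kinds of lookup they support.  A toy sketch, using a dict for the hash and a sorted list (via bisect) standing in for an ordered tree index — Pita's Atree/Btree internals are assumed, not shown:

```python
# Hash index: O(1) exact-match lookup, but no ordering.
# Ordered index: supports range scans, which a hash cannot do.
import bisect

hash_index = {"carol": 3, "alice": 1, "bob": 2}   # key -> row id
ordered_keys = sorted(hash_index)                  # ordered index stand-in

def exact(key):
    return hash_index.get(key)                     # hash: exact match only

def key_range(lo, hi):
    i = bisect.bisect_left(ordered_keys, lo)       # range scan needs order
    j = bisect.bisect_right(ordered_keys, hi)
    return ordered_keys[i:j]
```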

Recovery - Log, Snapshot, Versioning, Replication

Crash recovery is handled by a variety of mechanisms depending on the view type.  Log recovery involves replaying a log of recent operations to make sure the data is consistent and up-to-date.  Snapshot recovery simply reloads a snapshot of data from disk.  Versioning records every change to the data as a new version, and the only recovery needed is to ensure the indexes are up-to-date.  Replication recovery is done by reading the data back from a replica.

Distribution - Broker, Caching, Local, Replication, Sharding

Broker distribution means that in a multi-server environment, client requests are directed to the dataserver that owns the desired view.  Caching is broker distribution plus a local cache of recently accessed rows; caching is suitable for frequently-accessed data where consistency is not critical.  Local distribution means that the view is available only on the local dataserver, that is, the server to which the client is directly connected.  Replication means that the data is replicated, and can be accessed from any replica node in the mesh.  Sharding means that the data is split up across nodes in the mesh, and a request may need to be sent to multiple nodes to be completed.  Sharding is intended at this point to be simple rather than optimal.
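A rough sketch of the broker and caching schemes as described: a catalog maps each view to its owning dataserver, requests are forwarded there, and the caching variant puts a local row cache in front.  All the names here (catalog contents, fetch_remote) are illustrative, not part of Pita:

```python
# Broker: route each request to the dataserver that owns the view.
# Caching: same, plus a local cache of recently accessed rows, which
# may serve stale data -- hence "where consistency is not critical".
CATALOG = {"orders": "server-a", "users": "server-b"}   # view -> owner
row_cache = {}                                          # (view, key) -> row

def fetch_remote(server, view, key):
    # stand-in for a network call to the owning dataserver
    return {"server": server, "view": view, "key": key}

def broker_get(view, key):
    return fetch_remote(CATALOG[view], view, key)

def caching_get(view, key):
    if (view, key) not in row_cache:
        row_cache[(view, key)] = broker_get(view, key)
    return row_cache[(view, key)]          # possibly stale local copy
```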


I now have libraries for all the required index structures, so that part is relatively easy. 

One problem, as I mentioned, is that the snapshot semantics don't work on Windows.  So on Windows you won't get Image views, and backups (and probably index creation) of memory views will be blocking operations.

Image views can be read-only or read-write, and single- or multi-user as required.

Shelf views will initially be read-only (that's simplest, for obvious reasons), but can be made read-write by creating an append log for the shelf and adding a shelf value to the version field in the indexes.  Taking either disk or memory snapshots is effectively instantaneous and uses almost no resources (CPU, memory, or disk); however, maintaining an Image or a writable Shelf will gradually use an increasing amount of space as the original and the snapshot diverge.
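One way to read the "append log" idea: writes go to an overlay of appended records, reads consult the overlay first (newest version wins) and fall back to the read-only base.  This is my guess at the mechanism from the description, not Pita's code, and the dict-backed base is a stand-in:

```python
# Writable Shelf sketch: the base is never rewritten; updates accumulate
# in an append log, and lookups prefer the newest logged version.
class WritableShelf:
    def __init__(self, base):
        self.base = base          # read-only base shelf (dict stand-in)
        self.append_log = []      # ordered (key, value) updates

    def put(self, key, value):
        self.append_log.append((key, value))   # append-only, no rewrites

    def get(self, key):
        for k, v in reversed(self.append_log): # newest version wins
            if k == key:
                return v
        return self.base.get(key)
```

Note the space behaviour matches the post: the base stays fixed while the log grows as the two diverge.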

Snapshots are used internally by the system for backups, recovery, and online index building, but can also be created by the user for manual point-in-time recovery, for reporting and data extracts (where it is required that the data be consistent to a point in time), and for testing - you can easily snapshot your data, run any operations you like on the snapshot, and then just drop the snapshot when you are done.  You can have multiple snapshots open at once, and you can take a snapshot of a snapshot, though of course this can multiply the resource requirements.  A single snapshot operation can be used to simultaneously snapshot all the memory-based views on that dataserver, even across multiple databases.

Also, particularly with CPython (regular Python) deployments, the dataserver is likely to be a single process.  It will be multi-threaded, but as a single process it is limited by the Python global interpreter lock, so a single dataserver can only effectively use a single core.  However, a snapshot is automatically assigned to a new process, so backups can run on a second core, and so can your reports, data extracts, and testing.

At any time when working with an Image view, you have the option of dropping it, committing it as the live version of the original view (losing any changes that happened to the original in the meantime), or committing it as a new, persistent view.  You can also simply snapshot one memory view into a new persistent view; this is also an online operation that takes almost no time or resources, and your new view is immediately recoverable by using the existing snapshot and log of the base view plus an ongoing log on the new view.

A database can contain any assortment of table types, and you don't need to know a table's type when using it, except when it comes to specific features like the expected performance of a range or spatial lookup, or reverting a record to a previous version (which of course is only available on versioned tables).

The storage and recovery mechanisms are intended to allow consistent, recoverable backups of entire databases simply by copying the files on disk using standard operating system commands.  This would require an index rebuild, but it's simple, reliable, and requires no locking of any structures.

The two partial exceptions are Batches and Arrays.  Batches are designed to be programmatically checkpointed (using the snapshot mechanism), and are for cases where it is better for the database to recover to a known point than to the latest point.  So, when you run a batch process that stores information in its own memory, it can use a Batch view to store external data.  In the event of a crash, the server will restore the Batch to its last checkpoint and you can simply re-run the program.

Arrays are intended to be high performance, and support atomic matrix and sub-matrix manipulations that can modify thousands or millions of values at once.  For this reason, they are not logged, but as with Batches you can issue checkpoint operations as required.
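The checkpoint-instead-of-log model for Arrays (and Batches) can be sketched in a few lines.  A flat array of doubles stands in for the matrix, and the checkpoint/recover API is illustrative, not Pita's:

```python
# Unlogged Array view: bulk updates are never logged; recovery rolls
# back to the last explicit checkpoint rather than the latest point.
from array import array

class CheckpointedArray:
    def __init__(self, values):
        self.data = array("d", values)
        self.checkpoint_copy = array("d", self.data)   # initial checkpoint

    def checkpoint(self):                  # issued by the application
        self.checkpoint_copy = array("d", self.data)

    def bulk_update(self, start, values):  # atomic sub-matrix style write
        self.data[start:start + len(values)] = array("d", values)

    def recover(self):                     # after a crash
        self.data = array("d", self.checkpoint_copy)
```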

I'll post some more details on the planned special view types - like Index, Braid, and Chord - soon.

P.S.  Yes, I'm aware that all the view types have five-letter names.  And yes, I am mildly OCD.

Update: Come to think of it, since Pita doesn't support cross-table transactions, I could hand off each database or even each table to a separate process and scale to multiple cores that way.  The potential problem there is that Merges and Braids would require interprocess traffic rather than running in a single shared memory space.  Hmm, and I couldn't maintain a single Lua instance per connection via the embedded API.  Okay, on third thoughts, that won't work on a per-table basis.  Per database is probably fine though.

Posted by: Pixy Misa at 08:41 AM