Wednesday, February 10

Geek

View From A Pita

Quick list of the planned view types for Pita.  Not all of these will be supported in initial releases, though only two (Braid and Chord) involve known unknowns.  (A view is what SQL would call a table; a derived view in Pita is still called a view.)

Type   Storage  Records          Indexing    Recovery         Distribution
Basic  Memory   Variable         Hash        Snapshot         Broker
Space  Memory   Variable         Atree       Snapshot, Log    Broker
Cloud  Memory   Variable         Atree       Replication      Replication, Sharding
Scene  Memory   Variable         Any memory  Snapshot, Log    Broker
Batch  Memory   Variable         Atree       Snapshot         Broker
Array  Memory   Fixed            Array       Snapshot         Broker
Graph  Memory   Variable         Graph       Snapshot, Log    Broker
Queue  Memory   Variable         None        Snapshot, Log    Broker
Stack  Memory   Variable         None        Snapshot, Log    Broker
Deque  Memory   Variable         None        Snapshot, Log    Broker
Cache  Memory   Variable         Hash        None             Local
Image  Memory   Variable         Any memory  Snapshot         Broker
Store  Disk     Variable         Btree       Versioning       Broker
Shelf  Disk     Variable         Btree       Versioning       Broker
Fixed  Disk     Fixed            Btree       Log              Broker
Table  Disk     Variable, Fixed  Btree       Versioning, Log  Broker
Share  Disk     Variable, Fixed  Btree       Versioning, Log  Caching
Shard  Disk     Variable, Fixed  Btree       Replication      Replication, Sharding
Index  Disk     Variable         Btree       None             Broker
Chord  Disk     Variable         Btree       Versioning       Caching
Merge  Virtual  Variable, Fixed  Virtual     Virtual          Broker
Braid  Virtual  Variable, Fixed  Any memory  Virtual          Broker
Slice  Virtual  Variable, Fixed  Any memory  Virtual          Broker
Remap  Virtual  Variable, Fixed  Virtual     Virtual          Broker
Magic  Virtual  Variable, Fixed  Virtual     Virtual          Broker


Storage - Memory, Disk, or Virtual

The immediate storage mechanism.  Memory views are accessed in memory (and must fit in memory), but are persisted to disk.  Disk views are read and written from disk.  Virtual views are representations of other views or of programmatic data.

Records - Fixed, Variable

Pita supports three main modes of storing record data: fixed-length random access on disk, variable-length sequential access on disk, or any format in memory.  This avoids the complications and overhead of dynamically managing free space on disk (though Pita does handle that for Btree indexes).
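
If you want a feel for what that means in practice, here's a minimal sketch of the two on-disk layouts using Python's struct module - purely illustrative, not Pita's actual file format:

    import struct

    # Fixed-length records: record N lives at a known offset, so random access
    # is a single seek - no free-space management needed on disk.
    RECORD_FMT = "<16s i d"            # e.g. 16-byte key, an int, a float
    RECORD_SIZE = struct.calcsize(RECORD_FMT)

    def read_fixed(f, n):
        # Seek straight to the n-th record and decode it.
        f.seek(n * RECORD_SIZE)
        return struct.unpack(RECORD_FMT, f.read(RECORD_SIZE))

    # Variable-length records: append each record with a length prefix and
    # read them back sequentially - no in-place updates, so no space manager.
    def append_variable(f, payload):
        f.seek(0, 2)                   # always write at the end of the file
        f.write(struct.pack("<I", len(payload)) + payload)

    def scan_variable(f):
        f.seek(0)
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack("<I", header)
            yield f.read(length)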

Indexing - Hash, Atree, Btree, Qtree, Rtree, Array, Graph, Virtual

Hashes are simple one-to-one mappings of key to data, unordered, and are available for in-memory or on-disk data.  Atrees (actually AVL trees), Qtrees (quadtrees), Rtrees, Arrays, and Graphs are available indexing methods for in-memory data.  Btrees are available for on-disk data.
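
As a rough illustration of why the distinction matters: an unordered hash index only answers exact-key lookups, while an ordered index (an Atree or Btree in Pita; a plain sorted list via bisect stands in for it here) can also answer range scans.  This is just a sketch, not Pita's code:

    import bisect

    hash_index = {}        # key -> row; exact lookups only, no ordering
    ordered_keys = []      # sorted keys; stands in for an AVL tree / Btree
    rows = {}

    def insert(key, row):
        hash_index[key] = row
        bisect.insort(ordered_keys, key)
        rows[key] = row

    def lookup(key):
        return hash_index.get(key)     # hash: O(1) expected, exact match

    def range_scan(lo, hi):
        # Only the ordered index can answer "all keys between lo and hi".
        start = bisect.bisect_left(ordered_keys, lo)
        stop = bisect.bisect_right(ordered_keys, hi)
        return [rows[k] for k in ordered_keys[start:stop]]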

Recovery - Log, Snapshot, Versioning, Replication

Crash recovery is handled by a variety of mechanisms depending on the view type.  Log recovery involves replaying a log of recent operations to make sure the data is consistent and up-to-date.  Snapshot recovery simply reloads a snapshot of data from disk.  Versioning records every change to the data as a new version, and the only recovery needed is to ensure the indexes are up-to-date.  Replication recovery is done by reading the data back from a replica.
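
A minimal sketch of the log-replay idea - load the last snapshot, then re-apply every logged operation.  The file formats here are made up for illustration; they're not Pita's:

    import json

    def recover(snapshot_path, log_path):
        # Load the last known-good snapshot...
        with open(snapshot_path) as f:
            data = json.load(f)
        # ...then replay every operation logged since it was taken.
        with open(log_path) as f:
            for line in f:
                op = json.loads(line)
                if op["action"] == "put":
                    data[op["key"]] = op["value"]
                elif op["action"] == "delete":
                    data.pop(op["key"], None)
        return data                    # consistent and up to date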

Distribution - Broker, Caching, Local, Replication, Sharding

Broker distribution means that in a multi-server environment, client requests are directed to the dataserver that owns the desired view.  Caching is broker distribution plus a local cache of recently accessed rows; caching is suitable for frequently-accessed data where consistency is not critical.  Local distribution means that the view is available only on the local dataserver, that is, the server to which the client is directly connected.  Replication means that the data is replicated, and can be accessed from any replica node in the mesh.  Sharding means that the data is split up across nodes in the mesh, and a request may need to be sent to multiple nodes to be completed.  Sharding is intended at this point to be simple rather than optimal.
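
Here's a toy sketch of the routing side - a broker lookup that sends each view's requests to its owning dataserver, and a simple hash placement function for sharding.  The names and placement scheme are illustrative assumptions, not Pita's actual implementation:

    import hashlib

    view_owner = {"orders": "ds1", "users": "ds2"}   # view -> owning server
    shard_nodes = ["ds1", "ds2", "ds3"]              # nodes in the mesh

    def route(view):
        # Broker: every request for a view goes to its single owning server.
        return view_owner[view]

    def shard_for(key):
        # Sharding: simple (not optimal) hash placement across the mesh.
        digest = hashlib.md5(key.encode()).digest()
        return shard_nodes[digest[0] % len(shard_nodes)]

    def shards_for(keys):
        # A multi-key request may have to be sent to several nodes.
        return {shard_for(k) for k in keys}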

Notes

I now have libraries for all the required index structures, so that part is relatively easy. 

One problem, as I mentioned, is that the snapshot semantics don't work on Windows.  So on Windows you won't get Image views, and backups (and probably index creation) of memory views will be blocking operations.

Image views can be read-only or read-write, and single- or multi-user as required.

Shelf views will initially be read-only (that's simplest, for obvious reasons), but can be made read-write by creating an append log for the shelf and adding a shelf value to the version field in the indexes.  Taking either disk or memory snapshots is effectively instantaneous and uses almost no resources (CPU, memory, or disk); however, maintaining an Image or a writeable Shelf will gradually use more space as the original and the snapshot diverge.
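
The append-log idea looks roughly like this - the base shelf stays read-only, writes land in an overlay, and reads check the overlay before the base.  Again, this is an illustration of the approach, not the real Shelf format:

    class ShelfOverlay:
        def __init__(self, base):
            self.base = base       # the original, read-only shelf data
            self.log = {}          # appended / updated rows
            self.deleted = set()

        def put(self, key, row):
            self.log[key] = row    # new version recorded in the append log
            self.deleted.discard(key)

        def delete(self, key):
            self.deleted.add(key)

        def get(self, key):
            if key in self.deleted:
                return None
            if key in self.log:    # the newest version wins
                return self.log[key]
            return self.base.get(key)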

Snapshots are used internally by the system for backups, recovery, and online index building, but can also be created by the user for manual point-in-time recovery, for reporting and data extracts (where the data must be consistent to a point in time), and for testing - you can easily snapshot your data, run any operations you like on the snapshot, and then just drop the snapshot when you are done.  You can have multiple snapshots open at once, and you can take a snapshot of a snapshot, though of course this can multiply the resource requirements.  A single snapshot operation can simultaneously snapshot all the memory-based views on that dataserver, even across multiple databases.

Also, particularly with CPython (regular Python) deployments, the dataserver is likely to be a single process.  It's multi-threaded, but still a single process, so it is limited by the Python global interpreter lock and can only effectively use a single core.  However, a snapshot is automatically assigned to a new process, so backups can run on a second core, and so can your reports, data extracts, and testing.
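
Assuming the snapshots are fork-based (which would explain the Windows limitation), the pattern looks roughly like this: fork() hands the child a copy-on-write image of the parent's memory, so the backup or report runs against a frozen copy in its own process, on its own core.  A minimal sketch of that pattern, not Pita's implementation:

    import os

    data = {"rows": list(range(1_000_000))}   # the in-memory view

    def run_backup(snapshot):
        # Runs in the child against a frozen, copy-on-write view of the data.
        with open("/tmp/backup.txt", "w") as f:
            f.write(str(len(snapshot["rows"])))

    pid = os.fork()
    if pid == 0:                    # child: sees the data as of the fork
        run_backup(data)
        os._exit(0)
    else:                           # parent: keeps serving writes immediately
        data["rows"].append(-1)     # does not affect the child's snapshot
        os.waitpid(pid, 0)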

At any time when working with an Image view, you have the option of dropping it, committing it as the live version of the original view (losing any changes that happened to the original in the meantime), or committing it as a new, persistent view.  You can also simply snapshot one memory view into a new persistent view; this is also an online operation that takes almost no time or resources, and your new view is immediately recoverable by using the existing snapshot and log of the base view plus an ongoing log on the new view.

A database can contain any assortment of table types, and you don't need to know a table's type when using it, except when it comes to specific features like the expected performance of a range or spatial lookup, or reverting a record to a previous version (which of course is only available on versioned tables).

The storage and recovery mechanisms are intended to allow consistent, recoverable backups of entire databases simply by copying the files on disk using standard operating system commands.  This would require an index rebuild, but it's simple, reliable, and requires no locking of any structures.

The two partial exceptions are Batches and Arrays.  Batches are designed to be programmatically checkpointed (using the snapshot mechanism), and are for cases where it is better for the database to recover to a known point than to the latest point.  So, when you run a batch process that stores information in its own memory, it can use a Batch view to store external data.  In the event of a crash, the server will restore the Batch to its last checkpoint and you can simply re-run the program.
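
The checkpoint-and-rerun pattern looks roughly like this (illustrative only - Pita's checkpoints go through its snapshot mechanism rather than pickle):

    import pickle

    class BatchState:
        def __init__(self, path):
            self.path = path
            self.data = {}

        def checkpoint(self):
            # Persist a known-good point; recovery rolls back to exactly here.
            with open(self.path, "wb") as f:
                pickle.dump(self.data, f)

        def recover(self):
            # After a crash the view is restored to the last checkpoint,
            # and the batch job is simply re-run from there.
            with open(self.path, "rb") as f:
                self.data = pickle.load(f)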

Arrays are intended to be high performance, and support atomic matrix and sub-matrix manipulations that can modify thousands or millions of values at once.  For this reason, they are not logged, but as with Batches you can issue checkpoint operations as required.
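
For a sense of scale, here's the kind of bulk operation an Array view is aimed at, with numpy standing in purely as an illustration (it's not Pita's API) - one sub-matrix update touches a million cells, which is why per-change logging is off the table and checkpoints are used instead:

    import numpy as np

    grid = np.zeros((2000, 2000))          # 4 million cells
    grid[500:1500, 500:1500] += 1.0        # one bulk update touches 1M cells
    # Rather than logging each change, checkpoint at points of your choosing:
    np.save("grid_checkpoint.npy", grid)   # analogous to an Array checkpoint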

I'll post some more details on the planned special view types - like Index, Braid, and Chord - soon.

P.S.  Yes, I'm aware that all the view types have five-letter names.  And yes, I am mildly OCD.

Update: Come to think of it, since Pita doesn't support cross-table transactions, I could hand off each database or even each table to a separate process and scale to multiple cores that way.  The potential problem there is that Merges and Braids would require interprocess traffic rather than running in a single shared memory space.  Hmm, and I couldn't maintain a single Lua instance per connection via the embedded API.  Okay, on third thoughts, that won't work on a per-table basis.  Per database is probably fine though.

Posted by: Pixy Misa at 08:41 AM