Sunday, March 07
Asus almost make a really brilliant little server motherboard, the P7F-E.
I say almost, because it's full of zombies.
See that third ethernet port on the back? It's for remote management - KVM-over-IP. Except that it doesn't actually work unless you add a KVM-over-IP card.
Those eight extra SATA ports? Brilliant! Except that they don't actually work unless you add an eight-port RAID controller. Which, coincidentally, costs more than the motherboard.
Zombie I/O ports. I hate zombie I/O ports.
Update: The same goes for the Z8PE-12DX, which I was going to order for my day job. Now I'll need to budget for a RAID controller as well.
Posted by: Pixy Misa at
02:10 AM
| Comments (4)
| Add Comment
| Trackbacks (Suck)
Post contains 109 words, total size 1 kb.
Saturday, March 06
They've stopped making the PC-V600, the best small computer case I've ever used. All four of my desktop machines (Haruhi, Yurie, Nagi and Tanarotte) are in these cases, silver for the Windows boxes, black for Linux.
I ordered another two of them while they're still in stock, one of each colour. If I ever end up with more than six desktop machines running, I'll have worse problems to deal with than mismatching cases...
I discovered the discontinuation while shopping around for a decent midi-tower server case, something that seems to be almost extinct. Why the hell do manufacturers manufacture cases with 7 5.25" bays and 2 3.5"? Who actually uses 7 DVD drives at once? I'm looking at colocating a small server at a budget colo facility, and it's cheaper and easier to build a tower box than a rack mount one, and costs no more to host.
I did find one case that suits my needs - the Fractal Design Define R2. It has eight 3.5" drive bays behind two 120mm fans (which is far more than the V600's 3 bays, with another 3 via an optional converter that fits 3x3.5" drives into 2x5.25" bays) and room for another 5 120mm fans in various locations. And it's available in three colours. In case I want to run yet another operating system, I guess.
Posted by: Pixy Misa at
05:53 PM
| Comments (2)
Post contains 231 words, total size 2 kb.
Friday, February 26
Now with added mouseoverness.
Posted by: Pixy Misa at
02:00 AM
| Comments (7)
Post contains 7 words, total size 2 kb.
Thursday, February 25

The Grand Unified Minx Theory

The mee.nu User Domains

The Minx Components
Posted by: Pixy Misa at
05:40 PM
| Comments (2)
Post contains 13 words, total size 1 kb.
Wednesday, February 24
Naturally I had to try this...
100%

50%

Oops!
Now, that's a deliberately constructed corner case, but there is a problem there.
Posted by: Pixy Misa at
12:40 PM
| Comments (2)
Post contains 24 words, total size 1 kb.
Monday, February 22
Posted by: Pixy Misa at
05:44 PM
| Comments (3)
Post contains 2 words, total size 1 kb.
Sunday, February 21
I'm in here:

(Click for full screenshot. Thanks go to Steam and GOG's insane holiday sales.)
Actually, I'm not; I'm doing work for my day job, making some progress with Pita, reorganising Meta, and have finally come to a design decision on Miko (all parts of the Minx project for those who haven't been paying attention), redoing the documentation in Sphinx - which will itself be supported in an upcoming version of Meta - and planning for this year's server upgrade.* I did play a bit of Dragon Age over the holidays, but games are taking a back seat for a while.** Despite the fact that I have 224 of them currently installed.
* If things go right we'll be moving from a lowly 8-processor (16-thread) 2.26GHz server with 24GB of RAM to a spiffy new 12-processor (24-thread) 2.66GHz server with 48GB of RAM. That's at least partly to prepare for the move to Pita, which loves to store stuff in memory. Because I can just copy the OpenVZ virtual machines across, the move should be quick and painless.
** Apart from Billy vs. SNAKEMAN!

Posted by: Pixy Misa at
02:16 PM
| Comments (8)
Post contains 188 words, total size 1 kb.
Saturday, February 20
In SQL* you say `select sum(sales) from accounts where state="NY"`. In Pita, the way to do this is:**
results = accounts.aggregate(state='NY')
which will calculate for you the count, length, sum, minimum and maximum, as appropriate, for all the fields in the table at once, so the value you need is `results.sales.sum`. Since the table scan is typically slower than any calculations you're likely to be doing, this seems a reasonable approach.
In addition, I've added a
results = accounts.stats()
which provides all those, plus mean,*** median, mode, standard deviation, and geometric and harmonic means. Aaaaand standard error, coefficient of variation, sample and population variance, skewness and kurtosis. I even sort of know what kurtosis is.
I'm working on two more functions now, `group` and `break`, though I may need to come up with another name for the latter because `break` is a Python keyword. This:
for result in accounts.group('state', country='US'):
    ...
would give you the aggregate sales figures for each state in the US, sensibly enough. And this:
for result in accounts.break('state', country='US'):
    ...
would give you the individual sales figures, and then automatically provide totals after the last sales record for each state.
As long as I don't come down with kurtosis...
Update: Kang and jag. Or rather, agg and tab. For aggregate and tabulate.
for line in accounts.aggregate('state', country='US'):
    ...
will give you one summary line for each state, where
for line in accounts.tabulate('state', country='US'):
    ...
would give you both detail and summary lines. I need to put subtotal and total flags on the records for tabulate. Have to watch the keywords, there. And keep my closet doors closed.
* Boo, hiss!
** Or indeed
results = accounts(state='NY').aggregate()
Either way should perform the same and produce the same results. I think...
*** Which should come out the same as the average; just one I'm calculating myself and the other I'm pulling out of a stats module.
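The one-scan-for-everything idea can be sketched in a few lines of plain Python 3. This is only an illustration over a list of dicts, not Pita's code: the `aggregate` function, its `group_by` parameter, and the sample `accounts` data are all made up for the example, and only count, sum, min and max are tracked.

```python
from collections import defaultdict

def aggregate(rows, group_by=None, **filters):
    """Scan rows once; return {group_key: {field: {count, sum, min, max}}}.

    Hypothetical helper: with group_by=None everything lands under the key None.
    """
    groups = defaultdict(dict)
    for row in rows:
        # Skip rows that fail any equality filter, e.g. state='NY'.
        if any(row.get(k) != v for k, v in filters.items()):
            continue
        key = row.get(group_by) if group_by else None
        for field, value in row.items():
            # Only numeric fields get numeric statistics in this sketch.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            s = groups[key].setdefault(
                field, {"count": 0, "sum": 0, "min": value, "max": value})
            s["count"] += 1
            s["sum"] += value
            s["min"] = min(s["min"], value)
            s["max"] = max(s["max"], value)
    return dict(groups)

accounts = [
    {"state": "NY", "country": "US", "sales": 100},
    {"state": "NY", "country": "US", "sales": 250},
    {"state": "CA", "country": "US", "sales": 75},
    {"state": "NSW", "country": "AU", "sales": 40},
]

ny = aggregate(accounts, state="NY")[None]              # one summary for NY
by_state = aggregate(accounts, "state", country="US")   # one summary per state
print(ny["sales"]["sum"], sorted(by_state))
```

Because every numeric field is summarised during the same pass, asking for a second statistic afterwards costs nothing extra, which is the trade-off the post is betting on.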
Posted by: Pixy Misa at
03:14 AM
| Comments (6)
Post contains 316 words, total size 3 kb.
Tuesday, February 16

Okay, yeah, they needed that sharpening filter. That's Minx's built-in upscaling. Quality is not so hot, as it turns out. I'll check on what filter it's using; normally it's only used for downscaling, which works great:

Posted by: Pixy Misa at
11:35 AM
| Comments (1)
Post contains 39 words, total size 1 kb.
Sunday, February 14
I have a working base storage class for Pita. Unfortunately, most of my weekend was eaten up by my day job and other miscellanea, but it does work.
I'll post the full code later in the week once I have a derived class or two that does something more useful; in the meantime, here's the test code to give you an example of how it's used:
def oodle_test():
    # Create a base view
    pets = Oodle()
    # Create some pets
    log('Creating pets')
    # Create a dog, and save it
    pet = pets.new()
    pet.animal = 'dog'
    pet.sound = 'woof'
    pet.save()
    log('Dog saved, %s pets' % pets.count(),1)
    # Create a cat from a dict, and save it
    pet = pets.new({'animal': 'cat', 'sound': 'meow'})
    pet.save()
    log('Cat saved, %s pets' % pets.count(),1)
    # Append an aardvark
    pet = pets.append({'animal': 'aardvark', 'sound': 'snorf'})
    log('Aardvark appended, %s pets' % pets.count(),1)
    # Append a hippopotamus too
    pet = pets.append(animal = 'hippopotamus', sound = 'hrooonk')
    log('Hippopotamus appended, %s pets' % pets.count(),1)
    # What pets do I have?
    log('Selecting all pets')
    for pet in pets.select():
        log('My %s says %s' % (pet.animal, pet.sound),1)
    # Select and find on fields
    log('Selecting specific pets')
    # What does my dog say?
    for pet in pets.select(animal = 'dog'):
        log('Selected my %s; it says %s' % (pet.animal, pet.sound),1)
    # Can I find my cat?
    pet = pets.find(animal = 'cat')
    log('Found my %s; it says %s' % (pet.animal, pet.sound),1)
    return pets.count() == 4
The base view class, which has no indexes, no persistence, and no support for sorting, is called an Oodle.
The results of the test?
Creating pets
Dog saved, 1 pets
Cat saved, 2 pets
Aardvark appended, 3 pets
Hippopotamus appended, 4 pets
Selecting all pets
My dog says woof
My cat says meow
My aardvark says snorf
My hippopotamus says hrooonk
Selecting specific pets
Selected my dog; it says woof
Found my cat; it says meow
Oodle OK
Update: We've hit version 0.02 with a successful hash-table implementation. Next up is persistence... And deletes.
Update: 0.03! I deleted my pet hippopotamus!
Update: 0.04! The idiom `for pet in pets` now works. You can't slice it or select within it yet.
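Since the real Oodle source isn't shown here, the following is only a guess at the smallest thing that would satisfy the test above: a list of records and linear scans, written for Python 3. The `Record` class and every private attribute name are assumptions made for the sketch.

```python
class Record:
    """A bag of attributes that remembers which view it belongs to."""

    def __init__(self, view, fields):
        self._view = view
        self.__dict__.update(fields)

    def save(self):
        # Saving the same record twice must not store it twice.
        if self not in self._view._records:
            self._view._records.append(self)
        return self


class Oodle:
    """No indexes, no persistence, no sorting: just a list."""

    def __init__(self):
        self._records = []

    def new(self, fields=None, **kwargs):
        # Accept a dict, keyword arguments, or both.
        return Record(self, dict(fields or {}, **kwargs))

    def append(self, fields=None, **kwargs):
        return self.new(fields, **kwargs).save()

    def select(self, **criteria):
        # Lazily yield every record matching all field=value criteria.
        for record in self._records:
            if all(getattr(record, k, None) == v for k, v in criteria.items()):
                yield record

    def find(self, **criteria):
        # First match, or None when nothing matches.
        return next(self.select(**criteria), None)

    def count(self):
        return len(self._records)

    def __iter__(self):
        # Makes "for pet in pets" work.
        return self.select()


pets = Oodle()
pet = pets.new()
pet.animal = 'dog'
pet.sound = 'woof'
pet.save()
pets.new({'animal': 'cat', 'sound': 'meow'}).save()
pets.append({'animal': 'aardvark', 'sound': 'snorf'})
pets.append(animal='hippopotamus', sound='hrooonk')
sounds = [pet.sound for pet in pets]
```

Everything here is a linear scan, which is the point of a base class with no indexes; a hash-table version would only need to replace the body of `select` for the equality case.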
Posted by: Pixy Misa at
11:34 PM
| Comments (9)
Post contains 356 words, total size 3 kb.
Well, sort of. Kind of sort of.*
Okay, quick, tell me what language this is (without Googling the source code):
print "Eratosthenes' Sieve, in some funny language"
function print_sieve (limit):
    local sieve, j = { }, 2
    while j<limit:
        while sieve[j]:
            j=j+1
        print(j)
        for k = j*j, limit, j:
            sieve[k] = true
        j=j+1
print_sieve(100)
Hint: That's not it. And I don't understand the first line of print_sieve at all. Oh, right. Logically, (sieve, j) = ({}, 2), so the local variables sieve and j are initialised as an empty dict and 2, respectively.
Hint the second: It's the same language as this (believe it or not):
map = |f,x| x ? %{ hd=f(x.hd), tl=map(f,x.tl) }
filter = |p,x| x ? p(x.hd) ? %{ hd=x.hd, tl=filter(p, x.tl) }, filter(p, x.tl)
take = |n,s| n<=0 ? { }, { s.hd, unpack(take(n-1, s.tl)) }
ints = %{ hd=1; tl=map (|x| x+1, ints) }
f = |seq| %{ hd=seq.hd; tl=f(filter (|x| x%seq.hd~=0, seq.tl)) }
primes = f (ints.tl)
table.print(take (100, primes))
Which implements the exact same function.
Hint the third: You'll be able to script your mee.nu blog like this soon.
* Good language designers borrow. Great language designers swipe someone else's metaprogramming project.
Posted by: Pixy Misa at
01:48 AM
| Comments (2)
Post contains 199 words, total size 2 kb.
Saturday, February 13
I was looking at YAML as a serialisation option for Pita (it's already supported for exporting data in the development version of Minx).
So I ran some benchmarks.
It's sloooooooooooooooooooow.
Here we are, encoding and decoding a 2k record:
[andrew@eineus ~]$ python jsonbench.py
10000 iterations on 1973 bytes
Python
json: 10000 encodes+decodes in 4.9 seconds, 2055.8 per second
simplejson: 10000 encodes+decodes in 0.5 seconds, 20686.2 per second
pickle: 10000 encodes+decodes in 4.3 seconds, 2338.0 per second
cPickle: 10000 encodes+decodes in 0.6 seconds, 17297.6 per second
yaml: 10000 encodes+decodes in 213.4 seconds, 46.9 per second
json is Python's built-in JSON library, which is written in Python and thus somewhat sluggish. simplejson is the same JSON library with an optional C implementation. Essentially the same applies for pickle vs. cPickle.
yaml is PyYAML, which includes a C implementation if you have LibYAML installed. Which I do, but I can't seem to get the C implementation to run... Unless that's it, which would be pretty sad.
On the one hand, simplejson is the fastest of these options, which is good because it's also the most widely supported format and the easiest to parse.
On the other hand, 20,000 records per second is not all that much.
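For anyone wanting to repeat the comparison, here is a rough stand-in for that kind of round-trip benchmark, written for Python 3. It is not the jsonbench.py used above: the `bench` helper and the sample record are invented, and only the standard-library `json` and `pickle` modules are timed (in Python 3 both pick up their C accelerators automatically, so the pure-Python versus C split above no longer shows up as separate modules).

```python
import json
import pickle
import time

def bench(name, encode, decode, record, iterations=1000):
    """Time encode+decode round trips; return (name, round trips per second)."""
    start = time.perf_counter()
    for _ in range(iterations):
        decoded = decode(encode(record))
    elapsed = time.perf_counter() - start
    # Sanity check: the round trip must give back the same data.
    assert decoded == record
    return name, iterations / elapsed

# A made-up record of roughly 2 kB once encoded.
record = {
    "id": 1,
    "title": "x" * 200,
    "tags": ["databases", "python", "benchmarks"],
    "body": "lorem ipsum " * 150,
}

results = [
    bench("json", json.dumps, json.loads, record),
    bench("pickle", pickle.dumps, pickle.loads, record),
]
for name, rate in results:
    print("%s: %.1f encodes+decodes per second" % (name, rate))
```

The numbers will vary by machine and Python version, so none are quoted here; adding PyYAML is a matter of passing its dump and load functions to `bench` if it is installed.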
Posted by: Pixy Misa at
04:37 PM
| Comments (4)
Post contains 205 words, total size 1 kb.
Announcing Jsyn, the JSON syndication format for blogs and everything else.
Specification for version sqrt(-1):
1. Put a bunch of stuff in a data structure.
2. JSON-encode it.
3. Make it available somehow.
...
What, you need more of a spec? But I've registered a domain and everything!
Seriously, freedom from so-called "simple" syndication, coming soon! Part of the Pita* project.
Update: Spec version -1:
A Jsyn feed object will have precisely three first level sub-elements:
1. A `feed` element, an object containing the feed properties (required).
2. A `schema` element, an object containing advisory schema information (optional).
3. An `items` element, an array of objects representing the data items (required, but may be empty).
Example:
{"feed": {"source": "http://ai.mee.nu/feed.jsyn"},
"schema": {"source": "http://jsyn.net/schemas/blog.jsyn", "version": 1.0},
"items": []}
A client may use a local copy of the schema so long as the version matches that specified in the `schema` object. The server must increment the version when updating the schema. The server may revert to an older schema with a lower version number; the client must not continue to use the local copy of the schema in this case.
* Which is part of the Minx project**, which is part of the make-Pixy-rich-or-drive-him-insane-either-is-fine project.
** I've subdivided Minx into three parts, like Gaul, only with less garlic: Minx, the bliki; Meta, the template, formatting, and scripting engine; and Pita, the database engine/abstraction layer. In addition, there's Miko, the planned desktop client.
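The version -1 rules are small enough to check mechanically. The sketch below (Python 3) is one possible reading of them, not a reference client: `parse_feed` and `can_use_cached_schema` are hypothetical names, and it assumes that "precisely three" means no other top-level keys are allowed while `schema` may be absent.

```python
import json

ALLOWED = {"feed", "schema", "items"}
REQUIRED = {"feed", "items"}

def parse_feed(text):
    """Decode a Jsyn document and check its three first-level elements."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        raise ValueError("a Jsyn feed must be a JSON object")
    if set(doc) - ALLOWED or REQUIRED - set(doc):
        raise ValueError("expected feed, items and optionally schema")
    if not isinstance(doc["feed"], dict):
        raise ValueError("feed must be an object")
    if "schema" in doc and not isinstance(doc["schema"], dict):
        raise ValueError("schema must be an object")
    items = doc["items"]
    if not isinstance(items, list) or not all(isinstance(i, dict) for i in items):
        raise ValueError("items must be an array of objects")
    return doc

def can_use_cached_schema(doc, cached_version):
    """A local schema copy is usable only on an exact version match.

    That covers the rollback case too: a server that reverts to a lower
    version number no longer matches the cached copy.
    """
    schema = doc.get("schema")
    return schema is not None and schema.get("version") == cached_version

example = """
{"feed": {"source": "http://ai.mee.nu/feed.jsyn"},
 "schema": {"source": "http://jsyn.net/schemas/blog.jsyn", "version": 1.0},
 "items": []}
"""
doc = parse_feed(example)
print(can_use_cached_schema(doc, 1.0), can_use_cached_schema(doc, 2.0))
```

Comparing versions for equality rather than "greater than" is what keeps the client honest when the server rolls back.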
Posted by: Pixy Misa at
01:17 PM
| Comments (2)
Post contains 240 words, total size 2 kb.
Now where the heck was I, before being buried under an avalanche of poorly-considered Atom feeds and Chinese replica watch spam?
Ah, right.
We can't do a full Progress-style `where` clause in Python, unfortunately. Or not without more trickery than I intend to apply; someone did make a working goto - never mind that, a working comefrom - but I'm not inclined to go to that sort of length.
So. I want the first 20 posts in a given folder of a given blog, sorted by date order (descending, of course). Pythonically. No SQL. Let's see:
db = Pita.Connect(host, user, password, database)
posts = db.views.posts
I've connected to the Pita server and have a view open. Now:
for post in posts(folder=f,order='date-',limit=20):
    ...
That's not bad. With Python's named parameters, you can use any field in a flat record structure, so:
authors = db.views.authors
a = authors.find(name='Pixy Misa')
for post in posts(author=a.id, tag='databases',order='score-',limit=20):
    ...
If we have a nested structure, though, it doesn't work. Python doesn't let you say:
for post in posts(author.name = 'Pixy Misa')
even if we have the code to automatically resolve the relation. It's not valid syntax. So that's one place where it breaks down. Another place is ranges; we can say
for author in authors(country = 'Australia')
to get a list of authors who live in Australia, but we can't say
for author in authors('Andorra' < country < 'Azerbaijan')
even though that is a valid Python expression. It will get evaluated, and we'll just pass either True or False to authors() (or throw an exception), and it just won't work.
Now, the design of Pita is that it's primarily a document database with advisory schemas. It's not schemaless like many of the key-value stores, and it's not fixed-schema like most traditional relational databases. Each view has a schema, which specifies what fields should be there, and if they are, what type they should be. Fields can be missing, in which case the schema may specify a default value. And you can stick in whatever additional data you want, so long as the schema doesn't specifically conflict with that.
What this means is that we can know that `country` is a string, and if we do an equality comparison between a string and a list, we mean that we want to know if the string is in the list. So we can also do this:
for author in authors(country = ['Australia', 'New Zealand', 'Canada'])
to get authors from any of those countries.
By returning a generator or iterator, we can efficiently replace this:
for post in posts(blog=b,tag='databases',order='score-',limit=20)
with the more Pythonic
for post in posts(blog=b,tag='databases',order='score-')[:20]
Slicing (as it is called) is very general in Python and very useful, so adapting it to database selects will come naturally.
But what about range searches? There's no obvious Pythonic syntax for this, at least, not one that works. Here are a few possibilities:
for house in houses(price = '<100000'):
    ...
for house in houses(price = ['>50000','<100000']):
    ...
We know price is of type money, so we look at that string, and the leading < means it's a range match. Goody! Doesn't work so well - or at all, for that matter - for strings, because we could be perfectly well looking for those exact strings. We could have an explicit range function:
for house in houses.price.range(50000,100000):
    ...
That's not too bad either; it's pretty clear syntactically and semantically, and it requires no parsing. Doesn't let you differentiate between > and >= though - and you can't do a range match on more than one field. (You can't effectively use a binary tree for such a search anyway.) We can still slide in our other parameters like so:
for house in houses.price.range(50000,100000,suburb='Wondabyne'):
    ...
But (again due to the strictures of Python), they must come last.
Since we're building a generator, it should be possible to do this LINQish trick:
for house in houses(suburb='Wondabyne').price.range(50000,100000):
    ...
The first operation produces a view that knows to search on the suburb field for Wondabyne. This derived view has the exact same attributes of the original view, and price is one of those attributes, and we can use the range selector on price just like before.
We should be able to keep doing that sort of thing, until we get something like:
for house in houses.suburb('Wondabyne').bedrooms.range(3,5).bathrooms.min(2).price.max(150000).order('price+'):
    ...
But it's not terribly dynamic. So, next stop, dynamism.
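To make the derived-view idea concrete, here is a toy version in Python 3 over a plain list of dicts. It is a sketch under stated assumptions rather than Pita's design: the `View` class is invented, the range selector is spelled `view.range('price', low, high)` instead of the `houses.price.range(...)` attribute style discussed above, `order` strings must end in `+` or `-`, and only slices (no negative indices) are supported.

```python
from itertools import islice

class View:
    """A lazy, chainable query; each call returns a new derived View."""

    def __init__(self, source, tests=(), order=None):
        self._source = source
        self._tests = tuple(tests)
        self._order = order

    def __call__(self, order=None, **criteria):
        tests = list(self._tests)
        for field, wanted in criteria.items():
            if isinstance(wanted, (list, tuple, set)):
                # Equality against a list means "is the value in the list".
                tests.append(lambda r, f=field, w=wanted: r.get(f) in w)
            else:
                tests.append(lambda r, f=field, w=wanted: r.get(f) == w)
        return View(self._source, tests, order or self._order)

    def range(self, field, low, high):
        # Inclusive range match on one field.
        test = lambda r: field in r and low <= r[field] <= high
        return View(self._source, self._tests + (test,), self._order)

    def __iter__(self):
        # Nothing is scanned until somebody actually iterates.
        rows = (r for r in self._source if all(t(r) for t in self._tests))
        if self._order:
            field, descending = self._order[:-1], self._order.endswith('-')
            rows = sorted(rows, key=lambda r: r[field], reverse=descending)
        return iter(rows)

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("only slices are supported in this sketch")
        return list(islice(iter(self), index.start, index.stop, index.step))


houses = View([
    {"suburb": "Wondabyne", "price": 95000, "bedrooms": 3},
    {"suburb": "Wondabyne", "price": 140000, "bedrooms": 4},
    {"suburb": "Woy Woy", "price": 80000, "bedrooms": 2},
    {"suburb": "Wondabyne", "price": 60000, "bedrooms": 1},
])

cheap = houses(suburb="Wondabyne", order="price+").range("price", 50000, 100000)
prices = [h["price"] for h in cheap]
top = houses(suburb=["Wondabyne", "Woy Woy"], order="price-")[:2]
```

Each step only records a test and hands back another `View`, so the chain stays cheap until the `for` loop or slice forces a scan. Sorting still has to see every matching row, so `[:2]` saves nothing here; a real index would be needed for that.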
Posted by: Pixy Misa at
03:13 AM
| Comments (2)
Post contains 701 words, total size 5 kb.
Friday, February 12
5 letters, priority queue.
Posted by: Pixy Misa at
10:48 PM
| Comments (2)
Post contains 6 words, total size 1 kb.
I was planning to spend the weekend working with MongoDB, but those plans evaporated when it crashed and destroyed my test database. So instead I dug out my toy Python benchmark and ran it on Eineus. And just for fun, did the same in Psyco and Cython and Jython. Results are... Mixed. Yeah, that's a good word, particularly since the IronPython benchmark is still running.
| System | CPU | Clock | Python | Loop | String | Scan | Total |
|---|---|---|---|---|---|---|---|
| Eineus | Phenom II 945 | 3.0GHz | 2.6.4/32 | 0.950 | 1.483 | 0.437 | 2.870 |
| Eineus | Phenom II 945 | 3.0GHz | 2.6.4/Psyco | 0.013 | 0.180 | 0.477 | 0.670 |
| Eineus | Phenom II 945 | 3.0GHz | 2.6.4/Cython | 0.000 | 84.750 | 0.490 | 85.240 |
| Eineus | Phenom II 945 | 3.0GHz | 2.5.1/Jython | 0.682 | 499.936 | 0.758 | 501.376 |
| Nagi | Phenom 9750 | 2.4GHz* | 2.6/IronPython/32 | 0.544 | 3502.652 | 1.541 | 3504.739 |
| Nagi | Phenom 9750 | 2.4GHz* | 2.6/IronPython/64 | 0.916 | 5399.020 | 1.264 | 5401.202 |
| Potemayo | Core 2 Duo | 1.8GHz | 2.6/IronPython/32 | 0.559 | 1.632 | 1.567 | 3.759 |
| Miyabi | Phenom II 945 | 3.0GHz | 2.6.4/64 | 0.637 | 1.003 | 0.530 | 2.170 |
| Akane | Opteron | 2.0GHz | 2.5 | 1.887 | 2.733 | 0.880 | 5.500 |
* Normalised to 3.0GHz for ease of comparison.
I'll paste in the IronPython results if it ever finishes. (Update: Done now.)
Some notes:
64-bit Python is now a good bit faster than 32-bit for many cases. It's actually a bit slower in string scanning; I don't know why.
A 3GHz Phenom II running Python 2.6 is 2x faster than a 2GHz Opteron running Python 2.5 from 3 years ago. Someone's been doing some good work, either the Python people or AMD or the GNU compiler team.
CPython (the standard Python) has some really neat string optimisations that I depend on in Minx. These flow through nicely to Psyco, but are conspicuously absent from Cython, Jython, and IronPython, which are 60, 400, and a couple of thousand times slower for heavy string concatenation (as I said, that benchmark hasn't finished yet...) It's certainly possible to avoid that idiom, and instead, for example, create a list of substrings and then join them all in one operation.
Apart from that, Jython seems to perform fairly well; certainly, if you need to run heavily multi-threaded Python code and can avoid doing millions of concatenations of large strings, Jython could be a winner. The Python interpreter can only run one thread at a time, though other threads can be handling I/O or library functions. The Jython runtime is fully multithreaded, so if you have a multi-threaded application and more than two CPUs - which I do - then Jython can provide an overall performance boost even if the single thread performance declines somewhat. (And it's actually faster on one of the tests, so depending on your code you might win both ways.)
As for IronPython, well, the string concatenation results are just terrible. Looping is comparable with Python or Jython (but far behind Psyco or Cython), and string scanning is the slowest of the lot, though only by a factor of 2, not 2000. It should be fast enough for most tasks as long as you really avoid concatenating large strings. I wonder what list performance is like - I'll have to add a test for that.
Bumped and updated: Tested this just now on Potemayo with a StringBuilder implementation for the strcat part of the benchmark, which delivered about 2000x faster performance (!) on that part of the benchmark and 1000x on average, which would make a .Net deployment of Minx pretty feasible. It would be nice if all these people deploying immutable string libraries could also deploy the Python concatenation performance trick, because StringBuilder is not a general-purpose string library, just a faster way to concatenate and modify a stringlike object.
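The two idioms being compared look like this in Python 3. This is not the benchmark behind the table, just a small illustration with made-up function names: `concat` leans on the in-place trick CPython can apply when nothing else references the string, while `join` collects the pieces and builds the result once, which is the portable approach for the other runtimes.

```python
import time

def concat(pieces):
    """Repeated += : fast on CPython, potentially quadratic elsewhere."""
    out = ""
    for piece in pieces:
        out += piece
    return out

def join(pieces):
    """Collect substrings in a list, then join them in one operation."""
    parts = []
    for piece in pieces:
        parts.append(piece)
    return "".join(parts)

pieces = ["chunk %d " % i for i in range(100000)]

for fn in (concat, join):
    start = time.perf_counter()
    result = fn(pieces)
    elapsed = time.perf_counter() - start
    print("%s: %d chars in %.4f seconds" % (fn.__name__, len(result), elapsed))
```

Both functions return the same string; only the cost of getting there differs, and by how much depends entirely on the runtime.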
Posted by: Pixy Misa at
05:09 AM
| Comments (3)
Post contains 527 words, total size 5 kb.
Just pootling about benchmarking stuffs again, which is what I do when I'm too lazy to actually code something but want to appear productive. This time I'm seeing how many little objects Python can create per second. Here's the code:
import time
class iwb(int):
    def __setattr__(self, k, v):
        pass
    def __getattr__(self, k, default=None):
        return None
def perf(count):
    t0 = time.time()
    for i in range(count):
        j = iwb(i)
    print '%s per second' % (count / (time.time() - t0))
perf(10000000)
And here are the numbers:
Python 2.6.4/32: 3044540.34221 per second
Psyco 2.6.4/32: 5378755.4303 per second
Jython 2.5.1: 795102.161706 per second (significantly slower, but better than I'd expected)
Cython 2.6.4/32: 3830629.40698 per second (faster than Python, slower than Psyco)
Python 2.6.4/64: 3617701.37013 per second (again nearly 20% faster than the 32-bit version)
IronPython 2.6: 1931247.06236 per second (On my laptop, a 1.8GHz Core 2 Duo, compared to the 3GHz Phenom II for the preceding results, so not too bad. I need to see if it works under Mono.)
The reason I'm fiddling with this is that I'm looking at an optimisation/simplification for Minx, but it would involve creating about 10,000 objects to load the current main page of my blog. It's part of some syntactical trickery I'm trying to work out that creates virtual methods on objects in trees.... Without the objects being visibly objects, which involves some hackery when you put them in the tree in the first place.
Posted by: Pixy Misa at
03:16 AM
| Comments (2)
Post contains 236 words, total size 2 kb.
Pixy's First Rule of Computer Language Design: Syntax matters.
You can have all the semantic brilliance in the world, but if your syntax is counter-intuitive, no-one will use it. See for example the widespread success of Smalltalk and Lisp in the industry.
... crickets ...
Right.
So, how does this apply to databases, the topic to which I am dedicating an extended rant? Well, the problem with databases in the modern world is essentially twofold:
- SQL.
- Everything else.
Everything else's syntax, semantics and implementations generally leave something to be desired too; it's impossible to design the perfect language (though if someone created interoperable implementations of Python and Ruby atop the Go compiler, they'd be getting close) because there are syntactic and semantic conflicts that cannot be resolved; you have to choose on which side of the conflict you wish to fall, and live with that decision.*
But the real clash comes when you want to get your data and do something with it, like, for example, display it on the screen. You write a SQL query (which is a program in its own right), send it to the database server, get a bunch of text back, which you then interpret into structures suited to your programming language, process those structures, and finally, display it on the screen.
That is, frankly, a godawful mess, slow, insecure, painful to code, and nonetheless how most of the websites in the world actually work.
The frustrating thing is that this problem was solved completely in the early 80s.**
In Progress, to display a list of customers with balances over $100, you can say:
for each customer where balance>100:
    display customer.
end.
Okay, that's nothing special; in SQL it's even easier:
select * from customer where balance>100;
But in Progress, you can do this:
def var reminder as logical.
for each customer where balance>100:
    display customer.
    update reminder.
    if reminder then do:
        send_reminder(id).
        reminder_date = today.
    end.
end.
That is, you can loop through the customers, prompt the user whether to send a reminder letter to each, send out the letters as instructed, and mark it on the account. It even initiates the transactions for you and provides automatic commit and undo semantics. No need to write queries in one language and application code in another, your data is simply there for you to work with. Your user could be on a Wyse 60, your application on a Dell 386 running Xenix, and your database on an AS/400, and it would all just work.
It was a thing of beauty. Until your r-code (Progress's bytecode) exceeded 64k and you couldn't compile it any more.***
Nothing else compares. Microsoft's LINQ, two full decades later, is a poor and partial reinvention of Progress.
Exactly why Progress is languishing in near-irrelevance today is a topic of a whole month of rants, and beside the point, which is that somebody already got it right.
I can't use Progress for my applications because the language itself is abjectly inefficient and limited compared to a modern dynamic language like Python or Ruby.**** So the question is, what can I do with a modern dynamic language like Python or Ruby to recapture some of that functionality, some of that breezy syntactic and semantic awesomeness?
Well, neither Python nor Ruby has an interactive mode that remotely compares with Progress, so I'm going to ignore that part for the moment and concentrate on the data syntax and semantics. And I'm going to mostly use Python, because I'm more familiar with it and because they're similar enough that it doesn't matter that much.
Python has these things called generators. Generators are program structures that pretend to be data structures. In Python, if you have a list
lollipops = ['cherry','huckleberry','lime']
You can iterate through with:
for pop in lollipops:
    print pop
(As an aside, there's no end statement, which is one of Python's greatest strengths-and/or-weaknesses.) Which will print, as you might expect,
cherry
huckleberry
lime
So far, it's elegant but limited. The clever trick is that if I write this piece of code:
def lollipops():
    flavours = ['cherry','huckleberry','lime']
    for flavour in flavours:
        yield flavour
And then write:
for pop in lollipops():
    print pop
I will again get
cherry
huckleberry
lime
We have a program (a subprogram) here acting almost exactly like a data structure. So rather than being limited to the data structures that we have, or being forced to write code that manipulates the data structures and calls functions to get the bits and pieces that we want, we can just use the language's own for loop. The only distinction is that calling a subprogram requires dereferencing it with () (otherwise you get Python's version of a function pointer - very useful, but not what we want right now). But we can do better still by creating a class, like so:
class candy():
    def __init__(self):
        self.flavours = ['cherry', 'huckleberry', 'lime']
    def __iter__(self):
        for flavour in self.flavours:
            yield flavour
lollipops = candy()
for pop in lollipops:
    print pop
cherry
huckleberry
lime
And now we have code that acts exactly like a data structure. (Well, for the requirements of this little example.)
At which point I shall leave off for the moment, to resume at some time other than 3AM.
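As a small bridge back to the Progress loop earlier in this post, the same class-with-`__iter__` trick can carry a `where` of sorts. This Python 3 sketch is purely illustrative: the `Customers` class, its `where` method and the sample rows are invented, and a lambda stands in for the `balance>100` clause that Python's syntax won't allow directly.

```python
class Customers:
    """Code pretending to be data: iterable, with a generator-based filter."""

    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        # A real view would be reading from a database here.
        for row in self._rows:
            yield row

    def where(self, predicate):
        # Yield only the rows the predicate accepts, one at a time.
        for row in self:
            if predicate(row):
                yield row


customers = Customers([
    {"id": 1, "name": "Ann", "balance": 250},
    {"id": 2, "name": "Bob", "balance": 40},
    {"id": 3, "name": "Cy", "balance": 101},
])

# Roughly: for each customer where balance>100
overdue = [c["name"] for c in customers.where(lambda c: c["balance"] > 100)]
print(overdue)
```

It reads nothing like as cleanly as the Progress version, which is more or less the complaint of this whole rant.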
* Some of the syntactic conflicts arise from the restrictions of the ASCII character set, which simply doesn't have some of the symbols we need, so we substitute and overload the ones we do have. Of course, today we have Unicode and can display every symbol we could ask for - but you can't type in Unicode.
** And probably before, but certainly in the 80s.
*** Something which, as far as I know, they still haven't entirely resolved.
**** And because it's overpriced and the licensing model is insane.
Posted by: Pixy Misa at
02:31 AM
| Comments (3)
Post contains 967 words, total size 8 kb.
Thursday, February 11
There's a quote (which I am unable to find) to the effect that Algol 60 was better not only than all the languages that preceded it, but all the languages that followed it as well. Which rather steps on the punchline here, but at mee.nu we're all about the stepping on of punchlines.
P.S. Enough with the curly brackets!
P.P.S. Please don't remind me that it could have been worse.
Posted by: Pixy Misa at
02:17 AM
| Comments (3)
Post contains 76 words, total size 1 kb.
Wednesday, February 10
Quick list of the planned view types for Pita. Not all of these will be supported in initial releases, though only two (Braid and Chord) involve known unknowns. (A view is what SQL would call a table; a derived view in Pita is still called a view.)
| Type | Storage | Records | Indexing | Recovery | Distribution |
| Basic | Memory | Variable | Hash | Snapshot | Broker |
| Space | Memory | Variable | Atree | Snapshot, Log | Broker |
| Cloud | Memory | Variable | Atree | Replication | Replication, Sharding |
| Scene | Memory | Variable | Any memory | Snapshot, Log | Broker |
| Batch | Memory | Variable | Atree | Snapshot | Broker |
| Array | Memory | Fixed | Array | Snapshot | Broker |
| Graph | Memory | Variable | Graph | Snapshot, Log | Broker |
| Queue | Memory | Variable | None | Snapshot, Log | Broker |
| Stack | Memory | Variable | None | Snapshot, Log | Broker |
| Deque | Memory | Variable | None | Snapshot, Log | Broker |
| Cache | Memory | Variable | Hash | None | Local |
| Image | Memory | Variable | Any memory | Snapshot | Broker |
| Store | Disk | Variable | Btree | Versioning | Broker |
| Shelf | Disk | Variable | Btree | Versioning | Broker |
| Fixed | Disk | Fixed | Btree | Log | Broker |
| Table | Disk | Variable, Fixed | Btree | Versioning, Log | Broker |
| Share | Disk | Variable, Fixed | Btree | Versioning, Log | Caching |
| Shard | Disk | Variable, Fixed | Btree | Replication | Replication, Sharding |
| Index | Disk | Variable | Btree | None | Broker |
| Chord | Disk | Variable | Btree | Versioning | Caching |
| Merge | Virtual | Variable, Fixed | Virtual | Virtual | Broker |
| Braid | Virtual | Variable, Fixed | Any memory | Virtual | Broker |
| Slice | Virtual | Variable, Fixed | Any memory | Virtual | Broker |
| Remap | Virtual | Variable, Fixed | Virtual | Virtual | Broker |
| Magic | Virtual | Variable, Fixed | Virtual | Virtual | Broker |
Storage - Memory, Disk, or Virtual
The immediate storage mechanism. Memory views are accessed in memory (and must fit in memory), but have persistency to disk. Disk views are read and written from disk. Virtual views are representations of other views or of programmatic data.
Records - Fixed, Variable
Pita supports three main modes of storing record data: fixed-length random access on disk, variable-length sequential access on disk, or any format in memory. This is to avoid the complications and overhead of dynamically managing space on disk (it does handle that for Btree indexes, though).
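A minimal sketch of the fixed-length mode, assuming a made-up record layout (a 64-bit id, a double and a 16-byte name); the names `RECORD`, `write_record` and `read_record` are hypothetical and not part of Pita. Because every record is the same size, the byte offset is just the record number times the record size:

```python
import io
import struct

# Hypothetical record layout: 8-byte signed int, 8-byte float, 16-byte string.
RECORD = struct.Struct("<qd16s")


def write_record(f, recno, ident, value, name):
    """Write one record at the offset implied by its record number."""
    f.seek(recno * RECORD.size)
    f.write(RECORD.pack(ident, value, name.encode("utf-8")))


def read_record(f, recno):
    """Read one record back with a single seek and a single read."""
    f.seek(recno * RECORD.size)
    ident, value, raw = RECORD.unpack(f.read(RECORD.size))
    return ident, value, raw.rstrip(b"\x00").decode("utf-8")


# An in-memory file stands in for the on-disk data file.
data_file = io.BytesIO()
write_record(data_file, 0, 1, 9.5, "alpha")
write_record(data_file, 3, 4, 2.25, "delta")  # records 1 and 2 stay zero-filled
print(read_record(data_file, 3))
```

Variable-length data has no such formula, which is presumably why it is only ever appended sequentially on disk.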
Indexing - Hash, Atree, Btree, Qtree, Rtree, Array, Graph, Virtual
Hashes are simple one-to-one mappings of key to data, unordered, and are available for in-memory or on-disk data. Atrees (actually AVL trees), Qtrees (quadtrees), Rtrees, Arrays, and Graphs are available indexing methods for in-memory data. Btrees are available for on-disk data.
Recovery - Log, Snapshot, Versioning, Replication
Crash recovery is handled by a variety of mechanisms depending on the view type. Log recovery involves replaying a log of recent operations to make sure the data is consistent and up-to-date. Snapshot recovery simply reloads a snapshot of data from disk. Versioning records every change to the data as a new version, and the only recovery needed is to ensure the indexes are up-to-date. Replication recovery is done by reading the data back from a replica.
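As a rough illustration of how snapshot and log recovery combine, here is a toy in-memory view. The class `LoggedView` and its JSON log format are assumptions made for the example, not Pita's actual design; a string and a list stand in for the snapshot and log files on disk.

```python
import json


class LoggedView:
    """Toy in-memory view with snapshot-plus-log recovery (illustrative only)."""

    def __init__(self):
        self.rows = {}
        self.snapshot = "{}"   # stands in for the snapshot file on disk
        self.log = []          # stands in for the append-only log file

    def put(self, key, value):
        # Log the operation first, then apply it in memory.
        self.log.append(json.dumps({"op": "put", "key": key, "value": value}))
        self.rows[key] = value

    def delete(self, key):
        self.log.append(json.dumps({"op": "del", "key": key}))
        self.rows.pop(key, None)

    def take_snapshot(self):
        # Everything logged so far is now covered by the snapshot.
        self.snapshot = json.dumps(self.rows)
        self.log = []

    def recover(self):
        # Snapshot recovery: reload the last snapshot...
        self.rows = json.loads(self.snapshot)
        # ...then log recovery: replay the operations recorded since.
        for line in self.log:
            entry = json.loads(line)
            if entry["op"] == "put":
                self.rows[entry["key"]] = entry["value"]
            else:
                self.rows.pop(entry["key"], None)


view = LoggedView()
view.put("a", 1)
view.take_snapshot()
view.put("b", 2)
view.delete("a")
view.rows = {}      # simulate a crash that wipes memory
view.recover()
print(view.rows)
```

A view with snapshot-only recovery would stop after the first step of `recover` and lose anything written since the snapshot.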
Distribution - Broker, Caching, Local, Replication, Sharding
Broker distribution means that in a multi-server environment, client requests are directed to the dataserver that owns the desired view. Caching is broker distribution plus a local cache of recently accessed rows; caching is suitable for frequently-accessed data where consistency is not critical. Local distribution means that the view is available only on the local dataserver, that is, the server to which the client is directly connected. Replication means that the data is replicated, and can be accessed from any replica node in the mesh. Sharding means that the data is split up across nodes in the mesh, and a request may need to be sent to multiple nodes to be completed. Sharding is intended at this point to be simple rather than optimal.
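The "simple rather than optimal" sharding could look something like the sketch below. This is an assumption about one obvious approach (hash the key, take it modulo the node count), not a description of what Pita will do; `ShardedView` and its methods are hypothetical names, and plain dicts stand in for nodes in the mesh.

```python
import zlib


class ShardedView:
    """Toy sharded view: each key lives on exactly one node (illustrative)."""

    def __init__(self, node_count):
        # Each dict stands in for the data held by one node in the mesh.
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # A stable checksum, so every client maps a key to the same node.
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def put(self, key, row):
        self.nodes[self._node_for(key)][key] = row

    def get(self, key):
        # A key lookup goes to a single node.
        return self.nodes[self._node_for(key)].get(key)

    def select(self, predicate):
        # A non-key query must be sent to every node and the results merged.
        results = []
        for node in self.nodes:
            results.extend(row for row in node.values() if predicate(row))
        return results


view = ShardedView(node_count=3)
for i in range(10):
    view.put("user%d" % i, {"id": i, "active": i % 2 == 0})
print(view.get("user7"))
print(len(view.select(lambda row: row["active"])))
```

The `select` method is the "request may need to be sent to multiple nodes" case; a key lookup only ever touches one node.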
Notes
I now have libraries for all the required index structures, so that part is relatively easy.
One problem, as I mentioned, is that the snapshot semantics don't work on Windows. So on Windows you won't get Image views, and backups (and probably index creation) of memory views will be blocking operations.
Image views can be read-only or read-write, and single- or multi-user as required.
Shelf views will initially be read-only (that's simplest, for obvious reasons), but can be made read-write by creating an append log for the shelf and adding a shelf value to the version field in the indexes. Images can be read-only or read-write as you prefer. Taking either disk or memory snapshots is effectively instantaneous and uses almost no resources (CPU, memory, or disk). However, maintaining an Image or a writeable Shelf will gradually use an increasing amount of space as the original and the snapshot diverge.
Snapshots are used internally by the system for backups, recovery, and online index building, but can also be created by the user for manual point-in-time recovery, for reporting and data extracts (where it is required that the data be consistent to a point in time), and for testing - you can easily snapshot your data, run any operations you like on the snapshot, and then just drop the snapshot when you are done. You can have multiple snapshots open at once, and you can take a snapshot of a snapshot, though of course this can multiply the resource requirements. A single snapshot operation can be used to simultaneously snapshot all the memory-based views on that dataserver, even across multiple databases.
Also, particularly with CPython (regular Python) deployments, the dataserver is likely to be a single process. Multi-threaded, but a single process, so it will be limited by the Python global interpreter lock, so a single dataserver can only effectively use a single core. However, a snapshot is automatically assigned to a new process, so backups can run on a second core, and so can your reports, data extracts, and testing.
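A minimal sketch of a snapshot running in its own process, assuming it is taken with Unix `fork` (so this will not run on Windows). The function name `snapshot_to_disk` and the JSON file format are made up for the example. The child gets a copy-on-write view of memory as it was at the fork, so the parent can keep modifying the live data while the child writes it out:

```python
import json
import os
import tempfile


def snapshot_to_disk(data, path):
    """Write a point-in-time copy of `data` to `path` from a child process."""
    pid = os.fork()
    if pid == 0:
        # Child: sees the parent's memory as it was at the moment of the fork.
        try:
            with open(path, "w") as f:
                json.dump(data, f)
        finally:
            os._exit(0)  # leave without running the parent's cleanup handlers
    return pid


view = {"a": 1, "b": 2}
path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
child = snapshot_to_disk(view, path)
view["c"] = 3          # the parent keeps changing the live data
os.waitpid(child, 0)   # wait only so the example can read the file back

with open(path) as f:
    print(json.load(f))
```

Since the child is a separate process with its own interpreter lock, the operating system is free to schedule it on another core.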
At any time when working with an Image view, you have the option of dropping it, committing it as the live version of the original view (losing any changes that happened to the original in the meantime), or committing it as a new, persistent view. You can also simply snapshot one memory view into a new persistent view; this is also an online operation that takes almost no time or resources, and your new view is immediately recoverable by using the existing snapshot and log of the base view plus an ongoing log on the new view.
A database can contain any assortment of table types, and you don't need to know a table's type when using it, except when it comes to specific features like the expected performance of a range or spatial lookup, or reverting a record to a previous version (which of course is only available on versioned tables).
The storage and recovery mechanisms are intended to allow consistent, recoverable backups of entire databases simply by copying the files on disk using standard operating system commands. This would require an index rebuild, but it's simple, reliable, and requires no locking of any structures.
The two partial exceptions are Batches and Arrays. Batches are designed to be programmatically checkpointed (using the snapshot mechanism), and are for cases where it is better for the database to recover to a known point than to the latest point. So, when you run a batch process that stores information in its own memory, it can use a Batch view to store external data. In the event of a crash, the server will restore the Batch to its last checkpoint and you can simply re-run the program.
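The checkpoint-and-re-run pattern might look like this from the program's side. This is only a sketch: `BatchView`, `checkpoint`, `restore` and `run_batch` are hypothetical names, and a deep copy in memory stands in for the snapshot on disk.

```python
import copy


class BatchView:
    """Toy Batch view: recovery returns to the last explicit checkpoint."""

    def __init__(self):
        self.rows = {}
        self._checkpoint = {}   # stands in for the snapshot on disk

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.rows)

    def restore(self):
        self.rows = copy.deepcopy(self._checkpoint)


def run_batch(view, items, fail_at=None):
    """Hypothetical batch job: it can simply be re-run after a restore."""
    for n, item in enumerate(items):
        if n == fail_at:
            raise RuntimeError("simulated crash")
        view.rows[item] = len(item)


view = BatchView()
view.rows["existing"] = 8
view.checkpoint()
try:
    run_batch(view, ["x", "yy", "zzz"], fail_at=2)
except RuntimeError:
    view.restore()      # back to the known point, partial work discarded
run_batch(view, ["x", "yy", "zzz"])
print(sorted(view.rows))
```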
Arrays are intended to be high performance, and support atomic matrix and sub-matrix manipulations that can modify thousands or millions of values at once. For this reason, they are not logged, but as with Batches you can issue checkpoint operations as required.
I'll post some more details on the planned special view types - like Index, Braid, and Chord - soon.
P.S. Yes, I'm aware that all the view types have five-letter names. And yes, I am mildly OCD.
Update: Come to think of it, since Pita doesn't support cross-table transactions, I could hand off each database or even each table to a separate process and scale to multiple cores that way. The potential problem there is that Merges and Braids would require interprocess traffic rather than running in a single shared memory space. Hmm, and I couldn't maintain a single Lua instance per connection via the embedded API. Okay, on third thoughts, that won't work on a per-table basis. Per database is probably fine though.
Posted by: Pixy Misa at
08:41 AM
Post contains 1381 words, total size 14 kb.
Tuesday, February 09
Just priced a server for the office at my day job - all but one little server at the moment are hosted, and we need something local.
8 CPUs, 24GB RAM, 12TB disk, $5000.
I remember at my PPPPOE spending a good million dollars to get that kind of system. Throw a couple of Intel E-series SSDs in there and we'd get the IO throughput of that million dollar system as well.
Posted by: Pixy Misa at
06:04 PM
Post contains 79 words, total size 1 kb.
I'd forgotten about this one: mxBeeBase. Open source B+tree Python library.
Perfect!
So that's R-trees, Quadtrees and B+trees done. Within hours of deciding I was going to write my own database I already have all the indexing dealt with.
Oh, and Pita is going to be open source too. BSD license, probably, unless I need to use something GPL.
Posted by: Pixy Misa at
02:31 AM
Post contains 63 words, total size 1 kb.
Well, that's annoying. I just wrote one of these to speed up the Minx template parser, but this one is probably faster (it's a Cython module) and certainly better documented.
The functionality is identical as far as I can see, but I called mine a DictTree.
Actually, no, mine has one extra feature: It allows you to reference a value using mapping-style lookup (tags['a.b']) that would be a subtree in an attribute-style lookup. If tags.a.b.c is set, tags.a.b is necessarily the tree containing c; try anything else and you blow yourself out of the water. But tags['a.b'] can be anything you like.
I need to do that because I designed the Minx template language that way. (Oops.) It lets you reference, for example, post.date as a date value, and also post.date.month to find just the month of the date of the post. You can't do that with dicts; you could probably do it with a smarter class, but bang would go my generality.
Since that trick is used on both dates and strings, I'd need to make all my dates and strings into custom classes to make the attribute syntax work directly, and that's just too messy to contemplate.
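To make the distinction concrete, here is a toy tree with both lookup styles. It is a guess at the idea, not the actual Minx DictTree or the Cython module mentioned above: each node holds an optional value of its own plus named children, attribute access returns the subtree whenever there is one, and mapping access always returns the node's own value.

```python
import datetime

_MISSING = object()


class DictTree:
    """Toy tag tree with attribute-style and mapping-style lookup."""

    def __init__(self):
        self._value = _MISSING   # this node's own value, if any
        self._children = {}      # named subtrees

    def _walk(self, dotted, create=False):
        node = self
        for part in dotted.split("."):
            if create:
                node = node._children.setdefault(part, DictTree())
            else:
                node = node._children[part]  # KeyError if the path is unknown
        return node

    def __setitem__(self, dotted, value):
        self._walk(dotted, create=True)._value = value

    def __getitem__(self, dotted):
        node = self._walk(dotted)
        if node._value is _MISSING:
            raise KeyError(dotted)
        return node._value

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_") or name not in self._children:
            raise AttributeError(name)
        child = self._children[name]
        # A node with children has to come back as a tree so that deeper
        # attribute lookups keep working; a leaf comes back as its value.
        return child if child._children else child._value


tags = DictTree()
when = datetime.date(2010, 2, 9)
tags["post.date"] = when
tags["post.date.month"] = when.month

print(tags["post.date"])               # mapping style: the date itself
print(tags.post.date.month)            # attribute style reaches the leaf
print(type(tags.post.date).__name__)   # attribute style: the subtree
```

Here `tags['post.date']` and `tags.post.date` deliberately disagree, which is exactly the case that plain dicts or plain attribute objects can't express.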
Posted by: Pixy Misa at
02:09 AM
Post contains 200 words, total size 1 kb.
Monday, February 08
Or, The Screw It, I'm Writing A Database Post
Okay, I've had it up to here with databases that suck.*
So I'm writing my own.** It is called, for obvious reasons, Pita.
The plan is to steal as much of the low ground as possible. Anything hard, it just won't do.
First, it won't be written in C. It will be written in Python. Yes, performance will suffer, but Python can actually deliver surprisingly well, and it has some very well-optimised and well-tested libraries available. My Python benchmark script for MongoDB*** was moving as much as 90MB of data per second, which would about saturate a gigabit ethernet link or SATA disk.
The design goals are as follows:
- Doesn't lose your data.
- Doesn't lose your data, ever.
- Doesn't have to go offline for schema changes, including adding or removing indexes.
- Doesn't have a query language. This may end up making complex queries more complex. That's okay, because it makes simple CRUD operations dead easy.
- Doesn't have any avoidable hard-coded limits that aren't insanely large. Whatever bit-size seems reasonable for a counter, I'll double it (if it's not already an arbitrary precision value).
- Doesn't do random I/Os except for indexes and one (of seven) table types. Well, two, I guess.
- Doesn't do row-level locking.
- Doesn't do multi-table or multi-record transactions, let alone two-phase commit. Look, if you're a bank, you already have the money, go buy DB2 and stop bothering me.
- Doesn't do joins... Well, it sort of does. We'll get to that.
- Doesn't guarantee high performance, full consistency, versioning, and multi-master replication on any single table type. It will have all of those, but you can only choose one from column A and one from column B.
- Doesn't scale to ridiculous sizes. If at any point, though, it looks like it can't at least scale to pretty darn big, I'll drop it.
- Doesn't ever do random access of random-length data to disk. Um, except for indexes, where the random-length data will be divvied up into fixed-length blocks anyway. And for the initial development versions, indexes are likely to be memory-based unless I find some nice (simple, fast, efficient) on-disk indexing libraries.
- Doesn't lose your data. I mean it. Even when it loses your data, it doesn't lose your data.
- Supports multiple table types, each optimised for a different task. So even though there's no one table type that does everything, there should be something good enough for most cases.
Specifically:
- Store - a log-structured, versioned, indexed document store. Documents are never deleted or changed. All changes are added as a new version of the document at the end of the table. Everything is stored in JSON or YAML.
Advantages - no random I/Os for data, roll back to any point in time, back up simply by copying the data segments. The entire table is in text, so even if everything goes splatooie, it's easy to write a program to parse it and salvage your data.
Disadvantages - if your records change frequently, it will eat disk space like candy. You can do an online file pack to purge old versions of documents, but that's a chore. Restore from backup would require an index rebuild.
- Fixed - a fixed length, indexed, random-access table with change log. Pretty much the diametric opposite of Store.
Advantages - fast access to data because the position on disk is a function of the record number, and it's all binary so there's no parsing required. Every record is versioned, so by copying the data file plus the change log you can roll forward to a consistent state. You can't necessarily roll back, though.
Disadvantages - fixed length.
- Table - a combination of a Store for the documents and a Fixed for numeric fields, dates, booleans and any other fixed-length data.
Advantages - all of the advantages of Store and Fixed together. The default on-disk table type, hence the name. Supports full roll-forward and roll-back recovery.
Disadvantages - the fixed fields are written twice on a new major document version, to both the Store and the Fixed files. On the other hand, changes to the fixed fields only don't require an update of the Store, significantly reducing storage requirements when you have documents with a few numeric fields that change frequently. Similarly, reading a document requires reading two files, and you can't readily get the fixed values for minor versions (where the Store wasn't updated).
- Space - a non-versioned, indexed, in-memory document store, with snapshot plus log persistence.
Advantages - should be very very fast, since it's entirely held in memory. Also very flexible for the same reason. Roll-forward recovery, though no roll-back. Backups are still as simple as copying the files on disk.
Disadvantages - if your server crashes, or even if you need to do a normal reboot, you can't use the table until it's reloaded itself from disk, resynced, and rebuilt the indexes. Thus primarily suited for frequently-accessed but relatively small tables.
- Cache - a non-versioned, indexed, in-memory document store with a fixed size and LRU policy.
Advantages - it's a memory-based cache with the exact same semantics as your database.
Disadvantages - I'd be surprised if it gets within a factor of 3 of the speed of dedicated caches like memcached.
- Queue - a disk- or memory-based document queue, i.e. first-in first-out. Disk-based queues use segmented files and a transaction log for recovery and efficient space reclamation.
Advantages - it's a queue with the same semantics as your database. Well, kind of. I don't know that I'll actually support all the fancy stuff. Does only sequential disk I/O.
Disadvantages - won't have some of the fancy features of something like ActiveMQ. However, probably won't arbitrarily run out of memory and wedge itself. At least if it crashes outright you can restart it.
- Stack - a disk- or memory-based document stack, i.e. last-in first-out. All the reads and writes alike happen at the end of the file.
Advantages - it's a stack.
Disadvantages - has to lock the file for every operation to prevent screwups, so won't be super-efficient.
- Others under consideration
- Array - a shared fixed-structure matrix with atomic matrix operations.
- Cloud - a master-master eventually-consistent Space.
- Merge - a (probably read-only) table granting easy access to multiple other tables.
- Graph - an in-memory graph-structured table, for example for social network relationship data, with snapshot/log persistence.
- Deque - a double-ended queue, combining the functions of Stacks and Queues.
- Batch - a batch-oriented in-memory table with automatic snapshot persistence.
- Share - a cached Store, ideal for data that can afford to be a little out of date but can't be inconsistent.
- Support multiple data types that (mostly) map closely to Python's own:
- Int
- Char
- Date
- Time
- Float
- String
- Money
- Number
- Geometry
- Point
- Line
- Square
- Rectangle
- Circle
- Ellipse
- Text
- Auto
- Logical
- Encrypted
- Binary
- Support multiple data structures within documents that closely map to Python's own:
- Map
- Set
- Bag
- List
- Array
- Variant
- Reference
- Support multiple index types and modes:
- B-tree/B+-tree (of some sort) for primary keys, unique indexes, and general purpose indexy stuff.
- R-tree and/or Quadtree for GIS stuff.
- Full-text index, which will probably start out as a hacky B-tree of some sort.
- Indexing of structures (lists, maps etc) within documents.
- Partial indexes.
- Triggers and stored procedures. The embedded Lua interpreter I'm putting into the next version of Minx will do nicely.
- Embedded database. Don't need a full-fledged dataserver? Just run the whole thing within your app. The code will be split into a database library and a dataserver that runs on top.
- Replication - for Store, Queue and Stack, a choice of multi-master replication with eventual consistency or master-slave. For Fixed, Table, and Space, master-slave replication. For Cache, no replication. (It's a cache!)
- Sharding - for Store (probably) and Cache, easy sharding across servers. For other table types (probably), no sharding.
- Uses JSON or YAML everywhere, for data storage, data logs, config files, schema files, APIs and anywhere else a standard format is required.
Advantages - no XML.
Disadvantages - none.
- Pure-ish Python. The plan is to write it all in Python, with some optimisations in place for Cython.
You can do that - it's kind of neat. The exact same code can run interpreted with regular Python, JIT-compiled with Psyco, binary-compiled with Cython, on the JVM with Jython, or on .NET with IronPython. And that's the plan; to make it run everywhere, but include optimisations for regular Python on Linux on x86 or x86_64. And avoid those horrifying string concatenations if I can.
One catch I know already - for the Space snapshots, I'm planning to use the Unix fork semantics, which are copy-on-write, i.e. you get a static snapshot of all your data structures at an amortised cost so that you can easily write it back to disk while online. Windows' fork semantics are different and don't let you do that, so snapshots would stall the database, or at least the table. Still, with commodity hard drives achieving peaks of over 100MB/second and modest RAID arrays reaching a few hundred MB/second, even writing a few GB of data to disk once a day shouldn't take too long.
- Designed to take advantage of SSDs and HDDs. Put the random I/O load on your SSDs and the sequential-update bulk data on your HDDs. Or put everything on SSDs, that works too. I'm not going to bother to try to make random I/O work super-efficiently on HDDs; that's simply a losing game. For small databases just use Array tables and load everything into memory; for larger databases buy yourself a nice Intel X25-E.
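To make the Store idea from the list above a little more concrete, here is a toy log-structured, versioned store. It is a sketch under assumptions, not Pita code: one JSON document per line, an in-memory index of byte offsets, and the hypothetical names `Store`, `put`, `get` and `rebuild_index`. An in-memory file stands in for the data segment on disk.

```python
import io
import json


class Store:
    """Toy log-structured store: every change is appended as a new version."""

    def __init__(self, f):
        self.f = f            # stands in for the data segment on disk
        self.index = {}       # doc id -> list of byte offsets, oldest first

    def put(self, doc_id, doc):
        # Writes only ever happen at the end of the file.
        self.f.seek(0, io.SEEK_END)
        offset = self.f.tell()
        version = len(self.index.get(doc_id, [])) + 1
        record = {"id": doc_id, "version": version, "doc": doc}
        self.f.write((json.dumps(record) + "\n").encode("utf-8"))
        self.index.setdefault(doc_id, []).append(offset)
        return version

    def get(self, doc_id, version=None):
        # Latest version by default, or any earlier one on request.
        offsets = self.index[doc_id]
        offset = offsets[-1] if version is None else offsets[version - 1]
        self.f.seek(offset)
        return json.loads(self.f.readline())["doc"]

    def rebuild_index(self):
        # Salvage path: the index can always be rebuilt by scanning the text.
        self.index = {}
        self.f.seek(0)
        offset = 0
        for line in self.f:
            self.index.setdefault(json.loads(line)["id"], []).append(offset)
            offset += len(line)


store = Store(io.BytesIO())
store.put("post1", {"title": "Draft"})
store.put("post1", {"title": "Final"})
print(store.get("post1"))
print(store.get("post1", version=1))
```

Nothing is ever overwritten, so rolling back is just reading an older offset, and a restore from a copied data file only needs `rebuild_index`.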
Update: Found Python implementations of B+Trees, R*-Trees, and a B-Tree based sorted list and dict module. That'll save some time!
* That is, do not meet my current requirements. Or in some cases, actually spread pneumonic plague. YMMV.
** Maybe.
*** The MongoDB server is C, but the benchmark program is Python, and it ships a whole lot of data back and forth. Which tells me that a Python program can ship a whole lot of data back and forth. The program can create 32,000 1K or 22,000 2K records per second, and read 50,000 1K or 32,000 2K records per second. The 90MB per second was achieved with 10K records.
Posted by: Pixy Misa at
09:13 PM
Post contains 1703 words, total size 12 kb.
Sunday, February 07
Anyone?
Quick recap of databases that suck - or at least, suck for my purposes - and some that I'm still investigating.
SQL
- MySQL Lacks intersection and except, lacks array support, has only so-so full-text indexing, offers either concurrency or full-text indexes and GIS, but not both.
- PostgreSQL Provides arrays, concurrency and full-text indexes and GIS, but a lot of this is lost in a twisty maze of plugins and thoroughly non-standard operators. And the full-text indexing sucks.
- Ingres Ingres is free these days (if you don't need a support contract). It's a good, solid database, but doesn't actually offer anything I can't get from MySQL with InnoDB.
- Firebird Doesn't seem to offer anything more than MySQL or PostgreSQL or Ingres. Which doesn't mean that it's bad, but doesn't really help me either.
- SQL Server Needs Windows, which is worth an automatic 6 demerits, even though I can get enterprise-level Windows and SQL Server products for free. (My company is a Microsoft Bizspark member.) Full-text and GIS, intersect and except are all there, but still no arrays.
- IBM DB2 Costs too much.
- Oracle Costs way too much.
- Progress / OpenEdge Solid database, lovely 4GL, but still, the last time I looked at it (2006) mired in 16-bitness (!) and the 4GL is too slow for anything complicated. Also expensive and has a screwed-up pricing model. Would use it if I could.
NoSQL
- Redis Nice feature set, and looks very useful for small systems, but the current version is strictly memory-based. (It's persistent through snapshots and logging, but the whole database must fit in memory.) The developers are working on this, though. The API could do with a tidy-up too; it has different calls for the same operation on different data structures.
- MongoDB Very nice feature set. It's a document database, but it stores the documents in a JSON-like structure (called BSON) and can nest documents arbitrarily and inspect the fields within a document and build indexes on them. But its memory handling is lousy; while it's not explicitly memory-based, I wouldn't want to run it on anything but a dedicated physical server with more memory than my total database size. I could just throw money at it and put another 24GB of RAM in the server (far cheaper than a commercial RDBMS license) which would last us for a while, but I have serious doubts about its robustness as well.
- CouchDB Written in Erlang, which is always a warning sign. Erlang programmers seem to care about performance and reliability far more than they care about making a product that anyone would want to use. In this case, instead of MongoDB's elegant query-by-example (with extensions) I write map/reduce functions in JavaScript and send them to the server. In what universe is that an improvement on SQL? On the plus side, it apparently has replication. On the minus side, it's an Apache project, and I have yet to meet an Apache project that didn't suck in some major way.
- HBase Looks good if you have billions of very regular rows (which I do at my day job, but not here). Nothing wrong with it, but not a good fit.
- Project Voldemort Pure evil. No, wait. This one came out of LinkedIn. It's one of the recent flock of inherently scalable (automatic sharding and multi-master replication) key/value databases. In their own words, [it] is basically just a big, distributed, persistent, fault-tolerant hash table. That's a very useful thing, but I need defined ordering (multiple defined orderings for the same dataset, in fact) which a hash table can't give me.
- Cassandra This is Facebook's distributed hash table thingy (it's like the old days when every server company designed their own CPU). May have some vague concept of ordering, so I'll take a closer look.
- Jackrabbit It's a Java/XML datastore from the Apache Foundation. Uh-uh. I've used ActiveMQ, guys. You can't fool me twice. I'd sooner chew rusty nails.
- Riak Bleah. Another key/value/map/reduce thing. In their own words:
A "map phase" is essentially just a function ("F") and an argument ("A") that is defined as part of a series of phases making up a given map/reduce query. The phase will receive a stream of inputs ("I"), each of which consists of a key identifying a Riak object and an optional additional data element to accompany that object. As each input is received by the phase, a node that already contains the document ("D") corresponding to "I" will run F(D,A) and stream along the results to the next phase. The point here is that your function can be executed over many data items, but instead of collecting all of the data items in one place it will execute wherever the data is already placed.
Clear? Right. Not interested at all.
- LightCloud A distributed key-value store from Plurk. On the plus side, it's written in Python, supports Tokyo Tyrant and/or Redis for a back end, and "plurk" is fun to say. On the downside, it seems to be just a key/value database and not all that fast; it doesn't seem to expose the more interesting features of Tokyo Cabinet or Redis. It does at least have some update-in-place operations.
- GT.M GT.M is a crocodile. That's not an aspersion, exactly. Crocodiles were contemporaries of the dinosaurs, but when the dinosaurs went extinct, the crocodiles survived, and they are still around and occasionally snacking on jumped-up bipeds today. It's a hierarchical key-value store with a variety of access mechanisms. It's unquestionably powerful, but it looks clunky; the MUMPS code reminds me of the systems I was employed to replace as a little boy programmer in the 80's, and the Python interface doesn't actually look like Python, but more like some odd offspring of Cobol and Pascal.
- Neo4j Neo4j is a graph database, which is not something I've worked with before. Graphs are a very general data structure for mapping relationships; while relational databases model parent-child relationships well, graphs are the natural model for networks of friends (for example) where you can end up back where you started by any number of differing paths. The shortcoming of graphs is that they do not have a defined order, which is something I need a lot of around here.
- Berkeley DB An oldie but a goodie. An embedded, transactional database. You can shove pretty much anything into it; it doesn't care. No query language, but does have indexes. One problem is this clause from the license:
Redistributions in any form must be accompanied by information on how to obtain complete source code for the DB software and any accompanying software that uses the DB software. The source code must either be included in the distribution or be available for no more than the cost of distribution plus a nominal fee, and must be freely redistributable under reasonable conditions.
Any code that uses the DB software? I assume they mean direct code embedding/linking, but that's pretty broad. And it's really just a library, albeit a good one; it could serve as the basis for a database server, but it isn't that by itself.
- Metakit Metakit is a column-oriented database library, with a very nice, clean Python interface. For example, to display all posts by user 'Pixy Misa', you could simply write:
    for post in posts.select(user = 'Pixy Misa'):
        print post.title, post.date
The problem is, it doesn't scale. I tried using it for the first pass at Minx, about four years ago, and it broke long before it reached our current database size. Like MongoDB, nice semantics, not so great on the implementation.
- Tokyo Cabinet / Tokyo Tyrant / Tokyo Dystopia, VertexDB Tokyo Cabinet is a database library similar to Berkeley DB, but licensed under the no-worries LGPL. Tyrant is a "lightweight" database server built on Cabinet, Dystopia a full-text search engine built on Cabinet, and VertexDB a graph database built on Cabinet. I haven't explored these in depth yet because the standard Tokyo Cabinet distribution doesn't include Python libraries (Perl, Ruby, Java and Lua, but no Python?), but there are third-party libraries available.
- Xapian and Omega Xapian is a full-text search library, and Omega a search engine built on Xapian. In fact, Xapian is more than that; it can do range searches on strings, numbers, and dates as well, and can store arbitrary documents. It's quite good for searches, but not really suited to general database work.
Posted by: Pixy Misa at
01:17 AM
Post contains 1381 words, total size 10 kb.