Say Weeeeeee!
Ahhhhhh!

Tuesday, February 09

Geek

TreeDict

Well, that's annoying.  I just wrote one of these to speed up the Minx template parser, but this one is probably faster (it's a Cython module) and certainly better documented.

The functionality is identical as far as I can see, but I called mine a DictTree.

Actually, no, mine has one extra feature: it lets you use a mapping-style lookup (tags['a.b']) to reference a value that would have to be a subtree in an attribute-style lookup.  If tags.a.b.c is set, tags.a.b is necessarily the subtree containing c; try anything else and you blow yourself out of the water.  But tags['a.b'] can be anything you like.

I need to do that because I designed the Minx template language that way.  (Oops.)  It lets you reference, for example, post.date as a date value, and also post.date.month to find just the month of the date of the post.  You can't do that with dicts; you could probably do it with a smarter class, but bang would go my generality.

Since that trick is used on both dates and strings, I'd need to make all my dates and strings into custom classes to make the attribute syntax work directly, and that's just too messy to contemplate.
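
Roughly, the trick looks like this - a toy sketch of the concept only, not the actual Minx DictTree code:

    class DictTree(object):
        def __init__(self):
            self._children = {}   # attribute-style tree: name -> subtree or value
            self._flat = {}       # mapping-style store: dotted key -> value

        def __getattr__(self, name):
            try:
                return self._children[name]
            except KeyError:
                raise AttributeError(name)

        def __getitem__(self, dotted_key):
            return self._flat[dotted_key]

        def set(self, dotted_key, value):
            # Mapping-style lookup returns whatever was stored under the key...
            self._flat[dotted_key] = value
            # ...while the attribute-style tree has to treat every intermediate
            # name as a subtree, displacing any plain value that was there.
            parts = dotted_key.split('.')
            node = self
            for part in parts[:-1]:
                child = node._children.get(part)
                if not isinstance(child, DictTree):
                    child = node._children[part] = DictTree()
                node = child
            leaf = parts[-1]
            if not isinstance(node._children.get(leaf), DictTree):
                node._children[leaf] = value

    tags = DictTree()
    tags.set('post.date', '2010-02-09')       # a value...
    tags.set('post.date.month', 'February')   # ...that also has "children"
    print(tags['post.date'])      # '2010-02-09' - the value, via mapping lookup
    print(tags.post.date.month)   # 'February'  - via the attribute-style tree

The flat mapping keeps whatever you stored under 'post.date', while the attribute tree quietly turns that node into a subtree the moment 'post.date.month' arrives.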

Posted by: Pixy Misa at 02:09 AM | No Comments | Add Comment | Trackbacks (Suck)
Post contains 200 words, total size 1 kb.

Monday, February 08

Geek

Pita

Or, The Screw It, I'm Writing A Database Post

Okay, I've had it up to here with databases that suck.*

So I'm writing my own.**  It is called, for obvious reasons, Pita.

The plan is to steal as much of the low ground as possible.  Anything hard, it just won't do.

First, it won't be written in C.  It will be written in Python.  Yes, performance will suffer, but Python can actually deliver surprisingly well, and it has some very well-optimised and well-tested libraries available.  My Python benchmark script for MongoDB*** was moving as much as 90MB of data per second, which would just about saturate a gigabit ethernet link or a SATA disk.

The design goals are as follows:
  • Doesn't lose your data.
  • Doesn't lose your data, ever.
  • Doesn't have to go offline for schema changes, including adding or removing indexes.
  • Doesn't have a query language.  This may end up making complex queries more complex.  That's okay, because it makes simple CRUD operations dead easy.
  • Doesn't have any avoidable hard-coded limits that aren't insanely large.  Whatever bit-size seems reasonable for a counter, I'll double it (if it's not already an arbitrary precision value).
Notice how I'm mostly concerned with what it doesn't do?  Yeah.  I'm trying to keep this feasible.
  • Doesn't do random I/Os except for indexes and one (of seven) table types.  Well, two, I guess.
  • Doesn't do row-level locking.
  • Doesn't do multi-table or multi-record transactions, let alone two-phase commit.  Look, if you're a bank, you already have the money, go buy DB2 and stop bothering me.
  • Doesn't do joins...  Well, it sort of does.  We'll get to that.
  • Doesn't guarantee high performance, full consistency, versioning, and multi-master replication on any single table type.  It will have all of those, but you can only choose one from column A and one from column B.
  • Doesn't scale to ridiculous sizes.  If at any point, though, it looks like it can't at least scale to pretty darn big, I'll drop it.
  • Doesn't ever do random access of random-length data to disk.  Um, except for indexes, where the random-length data will be divvied up into fixed-length blocks anyway.  And for the initial development versions, indexes are likely to be memory-based unless I find some nice (simple, fast, efficient) on-disk indexing libraries.
  • Doesn't lose your data.  I mean it.  Even when it loses your data, it doesn't lose your data.
Okay, that's what it won't do.  What will it do?
  • Supports multiple table types, each optimised for a different task.  So even though there's no one table type that does everything, there should be something good enough for most cases.

    Specifically:

    • Store - a log-structured, versioned, indexed document store.  Documents are never deleted or changed.  All changes are added as a new version of the document at the end of the table.  Everything is stored in JSON or YAML.  (There's a rough sketch of this format after the feature list.)

      Advantages - no random I/Os for data, roll back to any point in time, back up simply by copying the data segments.  The entire table is in text, so even if everything goes splatooie, it's easy to write a program to parse it and salvage your data.

      Disadvantages - if your records change frequently, it will eat disk space like candy.  You can do an online file pack to purge old versions of documents, but that's a chore.  Restore from backup would require an index rebuild.
    • Fixed - a fixed length, indexed, random-access table with change log.  Pretty much the diametric opposite of Store.

      Advantages - fast access to data, because the position on disk is a function of the record number, and it's all binary so there's no parsing required.  Every record is versioned, so by copying the data file plus the change log you can roll forward to a consistent state.  You can't necessarily roll back, though.

      Disadvantages - fixed length.
    • Table - a combination of a Store for the documents and a Fixed for numeric fields, dates, booleans and any other fixed-length data.

      Advantages - all of the advantages of Store and Fixed together.  The default on-disk table type, hence the name.  Supports full roll-forward and roll-back recovery.

      Disadvantages - the fixed fields are written twice on a new major document version, once to the Store file and once to the Fixed file.  On the other hand, changes to only the fixed fields don't require an update of the Store, which significantly reduces storage when you have documents with a few numeric fields that change frequently.  Also, reading a document requires reading two files, and you can't readily get the fixed values for minor versions (where the Store wasn't updated).
    • Space - a non-versioned, indexed, in-memory document store, with snapshot plus log persistence.

      Advantages - should be very very fast, since it's entirely held in memory.  Also very flexible for the same reason.  Roll-forward recovery, though no roll-back.  Backups are still as simple as copying the files on disk.

      Disadvantages - if your server crashes, or even if you need to do a normal reboot, you can't use the table until it's reloaded itself from disk, resynced, and rebuilt the indexes.  Thus primarily suited for frequently-accessed but relatively small tables.
    • Cache - a non-versioned, indexed, in-memory document store with a fixed size and LRU policy.

      Advantages - it's a memory-based cache with the exact same semantics as your database.

      Disadvantages - I'd be surprised if it gets within a factor of 3 of the speed of dedicated caches like memcached.
    • Queue - a disk- or memory-based document queue, i.e. first-in first-out.  Disk-based queues use segmented files and a transaction log for recovery and efficient space reclamation.

      Advantages - it's a queue with the same semantics as your database.  Well, kind of.  I don't know that I'll actually support all the fancy stuff.  Does only sequential disk I/O.

      Disadvantages - won't have some of the fancy features of something like ActiveMQ.  However, probably won't arbitrarily run out of memory and wedge itself.  At least if it crashes outright you can restart it.
    • Stack - a disk- or memory-based document stack, i.e. last-in first-out.  Reads and writes alike happen at the end of the file.

      Advantages - it's a stack.

      Disadvantages - has to lock the file for every operation to prevent screwups, so won't be super-efficient.
    • Others under consideration
      • Array - a shared fixed-structure matrix with atomic matrix operations.
      • Cloud - a master-master eventually-consistent Space.
      • Merge - a (probably read-only) table granting easy access to multiple other tables.
      • Graph - an in-memory graph-structured table (for example, for social network relationship data), with snapshot/log persistence.
      • Deque - a double-ended queue, combining the functions of Stacks and Queues.
      • Batch - a batch-oriented in-memory table with automatic snapshot persistence.
      • Share - a cached Store, ideal for data that can afford to be a little out of date but can't be inconsistent.
  • Support multiple data types that (mostly) map closely to Python's own:
    • Int
    • Char
    • Date
    • Time
    • Float
    • String
    • Money
    • Number
    • Geometry
      • Point
      • Line
      • Square
      • Rectangle
      • Circle
      • Ellipse
    • Text
    • Auto
    • Logical
    • Encrypted
    • Binary
  • Support multiple data structures within documents that closely map to Python's own:
    • Map
    • Set
    • Bag
    • List
    • Array
    • Variant
    • Reference
  • Support multiple index types and modes:
    • B-tree/B+-tree (of some sort) for primary keys, unique indexes, and general purpose indexy stuff.
    • R-tree and/or Quadtree for GIS stuff.
    • Full-text index, which will probably start out as a hacky B-tree of some sort.
    • Indexing of structures (lists, maps etc) within documents.
    • Partial indexes.
  • Triggers and stored procedures.  The embedded Lua interpreter I'm putting into the next version of Minx will do nicely.
  • Embedded database.  Don't need a full-fledged dataserver?  Just run the whole thing within your app.  The code will be split into a database library and a dataserver that runs on top.
  • Replication -  for Store, Queue and Stack, a choice of multi-master replication with eventual consistency or master-slave.  For Fixed, Table, and Space, master-slave replication.  For Cache, no replication.  (It's a cache!)
  • Sharding - for Store (probably) and Cache, easy sharding across servers.  For other table types (probably), no sharding.
  • Uses JSON or YAML everywhere, for data storage, data logs, config files, schema files, APIs and anywhere else a standard format is required.

    Advantages - no XML.

    Disadvantages - none.
  • Pure-ish Python.  The plan is to write it all in Python, with some optimisations in place for Cython.

    You can do that - it's kind of neat.  The exact same code can run interpreted with regular Python, JIT-compiled with Psyco, binary-compiled with Cython, on the JVM with Jython, or on .NET with IronPython.  And that's the plan: to make it run everywhere, but include optimisations for regular Python on Linux on x86 or x86_64.  And avoid those horrifying string concatenations if I can.

    One catch I know already - for the Space snapshots, I'm planning to use Unix fork semantics, which are copy-on-write, i.e. you get a static snapshot of all your data structures at an amortised cost, so you can easily write it back to disk while staying online.  (There's a sketch of the trick after this list.)  Windows' fork semantics are different and don't let you do that, so snapshots would stall the database, or at least the table.  Still, with commodity hard drives achieving peaks of over 100MB/second and modest RAID arrays reaching a few hundred MB/second, even writing a few GB of data to disk once a day shouldn't take too long.
  • Designed to take advantage of SSDs and HDDs.  Put the random I/O load on your SSDs and the sequential-update bulk data on your HDDs.  Or put everything on SSDs, that works too.  I'm not going to bother to try to make random I/O work super-efficiently on HDDs; that's simply a losing game.  For small databases just use Array tables and load everything into memory; for larger databases buy yourself a nice Intel X25-E.
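
To make the Store idea above concrete, here's a rough sketch of the format - an illustration only, not Pita itself; a real Store would use segmented files and an offset index rather than rescanning the log:

    import json, time

    # Append-only log of JSON documents, one version per line, nothing ever
    # overwritten.  Sketch of the idea only.

    class Store(object):
        def __init__(self, path):
            self.path = path

        def put(self, doc_id, doc):
            """Append a new version of the document to the end of the log."""
            version = len(list(self.versions(doc_id))) + 1
            rec = {'id': doc_id, 'version': version, 'ts': time.time(), 'doc': doc}
            with open(self.path, 'a') as f:
                f.write(json.dumps(rec) + '\n')
            return version

        def versions(self, doc_id):
            """Yield every stored version of a document, oldest first."""
            try:
                with open(self.path) as f:
                    for line in f:
                        rec = json.loads(line)
                        if rec['id'] == doc_id:
                            yield rec
            except IOError:
                return

        def get(self, doc_id, version=None):
            """Return the latest version, or roll back to any earlier one."""
            chosen = None
            for rec in self.versions(doc_id):
                if version is None or rec['version'] <= version:
                    chosen = rec
            return chosen['doc'] if chosen else None

    store = Store('posts.store')
    store.put(1, {'title': 'Pita', 'status': 'draft'})
    store.put(1, {'title': 'Pita', 'status': 'published'})  # old version kept
    print(store.get(1))             # latest version
    print(store.get(1, version=1))  # any point in time

Since nothing is ever overwritten, a backup is just a copy of the file, and rolling back is just reading an earlier version.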
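
And here's the fork trick for Space snapshots, in miniature - Unix only, with a plain dict standing in for a Space table, and names that are illustrative rather than Pita's API.  The forked child gets a copy-on-write view of the parent's memory, so it can write a consistent snapshot to disk while the parent keeps taking updates:

    import json, os

    def snapshot(table, path):
        """Write a consistent snapshot of an in-memory table without blocking it.

        On Unix, fork() gives the child a copy-on-write view of the parent's
        memory: the child sees the table exactly as it was at the moment of
        the fork, while the parent carries on mutating its own copy.
        """
        pid = os.fork()
        if pid != 0:
            return pid                    # parent: keep serving requests
        # child: dump the frozen view and exit without running cleanup handlers
        with open(path, 'w') as f:
            json.dump(table, f)
        os._exit(0)

    space = {'post:1': {'title': 'Pita', 'comments': 7}}
    child = snapshot(space, 'space.snapshot')
    space['post:1']['comments'] = 8       # parent keeps changing the live table
    os.waitpid(child, 0)
    print(json.load(open('space.snapshot')))  # the snapshot still says 7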

Okay, that's a quick rundown on the planned features.  Tomorrow, I'll start writing up my thoughts on the programming and data APIs. 

Update: Found Python implementations of B+Trees, R*-Trees, and a B-Tree based sorted list and dict module.  That'll save some time!

* That is, do not meet my current requirements.  Or in some cases, actually spread pneumonic plague.  YMMV.
** Maybe.
*** The MongoDB server is C, but the benchmark program is Python, and it ships a whole lot of data back and forth.  Which tells me that a Python program can ship a whole lot of data back and forth.  The program can create 32,000 1K or 22,000 2K records per second, and read 50,000 1K or 32,000 2K records per second.  The 90MB per second was achieved with 10K records.

Posted by: Pixy Misa at 09:13 PM | Comments (7) | Add Comment | Trackbacks (Suck)
Post contains 1703 words, total size 12 kb.

Sunday, February 07

Geek

Seeking A Database That Doesn't Suck

Anyone?

Quick recap of databases that suck - or at least, suck for my purposes - and some that I'm still investigating.

SQL
  • MySQL - Lacks INTERSECT and EXCEPT, lacks array support, has only so-so full-text indexing, and offers either concurrency or full-text indexes and GIS, but not both.
  • PostgreSQL - Provides arrays, concurrency, full-text indexes and GIS, but a lot of this is lost in a twisty maze of plugins and thoroughly non-standard operators. And the full-text indexing sucks.
  • Ingres - Free these days (if you don't need a support contract). It's a good, solid database, but doesn't actually offer anything I can't get from MySQL with InnoDB.
  • Firebird - Doesn't seem to offer anything more than MySQL or PostgreSQL or Ingres. Which doesn't mean that it's bad, but it doesn't really help me either.
  • SQL Server - Needs Windows, which is worth an automatic 6 demerits, even though I can get enterprise-level Windows and SQL Server products for free. (My company is a Microsoft BizSpark member.) Full-text, GIS, INTERSECT and EXCEPT are all there, but still no arrays.
  • IBM DB2 - Costs too much.
  • Oracle - Costs way too much.
  • Progress / OpenEdge - Solid database, lovely 4GL, but the last time I looked at it (2006) it was still mired in 16-bitness (!), and the 4GL is too slow for anything complicated. Also expensive, with a screwed-up pricing model. Would use it if I could.

NoSQL
  • Redis - Nice feature set, and looks very useful for small systems, but the current version is strictly memory-based. (It's persistent through snapshots and logging, but the whole database must fit in memory.) The developers are working on this, though. The API could do with a tidy-up too; it has different calls for the same operation on different data structures.
  • MongoDB - Very nice feature set. It's a document database, but it stores the documents in a JSON-like structure (called BSON), can nest documents arbitrarily, and can inspect the fields within a document and build indexes on them. But its memory handling is lousy; while it's not explicitly memory-based, I wouldn't want to run it on anything but a dedicated physical server with more memory than my total database size. I could just throw money at it and put another 24GB of RAM in the server (far cheaper than a commercial RDBMS license), which would last us for a while, but I have serious doubts about its robustness as well.
  • CouchDB - Written in Erlang, which is always a warning sign. Erlang programmers seem to care about performance and reliability far more than they care about making a product that anyone would want to use. In this case, instead of MongoDB's elegant query-by-example (with extensions), I write map/reduce functions in JavaScript and send them to the server. In what universe is that an improvement on SQL? On the plus side, it apparently has replication. On the minus side, it's an Apache project, and I have yet to meet an Apache project that didn't suck in some major way.
  • HBase - Looks good if you have billions of very regular rows (which I do at my day job, but not here). Nothing wrong with it, but not a good fit.
  • Project Voldemort - Pure evil. No, wait. This one came out of LinkedIn. It's one of the recent flock of inherently scalable (automatic sharding and multi-master replication) key/value databases. In their own words, it "is basically just a big, distributed, persistent, fault-tolerant hash table". That's a very useful thing, but I need defined ordering (multiple defined orderings for the same dataset, in fact), which a hash table can't give me.
  • Cassandra - This is Facebook's distributed hash table thingy (it's like the old days, when every server company designed their own CPU). It may have some vague concept of ordering, so I'll take a closer look.
  • Jackrabbit - It's a Java/XML datastore from the Apache Foundation. Uh-uh. I've used ActiveMQ, guys. You can't fool me twice. I'd sooner chew rusty nails.
  • Riak - Bleah. Another key/value/map/reduce thing. In their own words:
    A "map phase" is essentially just a function ("F") and an argument ("A") that is defined as part of a series of phases making up a given map/reduce query. The phase will receive a stream of inputs ("I"), each of which consists of a key identifying a Riak object and an optional additional data element to accompany that object. As each input is received by the phase, a node that already contains the document ("D") corresponding to "I" will run F(D,A) and stream along the results to the next phase. The point here is that your function can be executed over many data items, but instead of collecting all of the data items in one place it will execute wherever the data is already placed.
    Clear? Right. Not interested at all.
  • LightCloud - A distributed key-value store from Plurk. On the plus side, it's written in Python, supports Tokyo Tyrant and/or Redis as a back end, and "plurk" is fun to say. On the downside, it seems to be just a key/value database and not all that fast; it doesn't seem to expose the more interesting features of Tokyo Cabinet or Redis. It does at least have some update-in-place operations.
  • GT.M - GT.M is a crocodile.  That's not an aspersion, exactly.  Crocodiles were contemporaries of the dinosaurs, but when the dinosaurs went extinct, the crocodiles survived, and they are still around and occasionally snacking on jumped-up bipeds today.  It's a hierarchical key-value store with a variety of access mechanisms.  It's unquestionably powerful, but it looks clunky; the MUMPS code reminds me of the systems I was employed to replace as a little boy programmer in the '80s, and the Python interface doesn't actually look like Python, but more like some odd offspring of Cobol and Pascal.
  • Neo4j - Neo4j is a graph database, which is not something I've worked with before.  Graphs are a very general data structure for mapping relationships; while relational databases model parent-child relationships well, graphs are the natural model for networks of friends (for example), where you can end up back where you started by any number of differing paths.  The shortcoming of graphs is that they do not have a defined order, which is something I need a lot of around here.
Libraries and Other
  • Berkeley DB - An oldie but a goodie. An embedded, transactional database. You can shove pretty much anything into it; it doesn't care. No query language, but does have indexes. One problem is this clause from the license:
    Redistributions in any form must be accompanied by information on how to obtain complete source code for the DB software and any accompanying software that uses the DB software. The source code must either be included in the distribution or be available for no more than the cost of distribution plus a nominal fee, and must be freely redistributable under reasonable conditions.
    Any code that uses the DB software? I assume they mean direct code embedding/linking, but that's pretty broad. And it's really just a library, albeit a good one; it could serve as the basis for a database server, but it isn't that by itself.
  • Metakit - Metakit is a column-oriented database library, with a very nice, clean Python interface. For example, to display all posts by user 'Pixy Misa', you could simply write:
    for post in posts.select(user = 'Pixy Misa'):
      print post.title, post.date
    The problem is, it doesn't scale. I tried using it for the first pass at Minx, about four years ago, and it broke long before it reached our current database size. Like MongoDB, nice semantics, not so great on the implementation.
  • Tokyo Cabinet / Tokyo Tyrant / Tokyo Dystopia, VertexDB - Tokyo Cabinet is a database library similar to Berkeley DB, but licensed under the no-worries LGPL. Tyrant is a "lightweight" database server built on Cabinet, Dystopia a full-text search engine built on Cabinet, and VertexDB a graph database built on Cabinet. I haven't explored these in depth yet because the standard Tokyo Cabinet distribution doesn't include Python libraries (Perl, Ruby, Java and Lua, but no Python?), but there are third-party libraries available.
  • Xapian and Omega - Xapian is a full-text search library, and Omega a search engine built on Xapian.  In fact, Xapian is more than that; it can do range searches on strings, numbers, and dates as well, and can store arbitrary documents.  It's quite good for searches, but not really suited to general database work.

Posted by: Pixy Misa at 01:17 AM | Comments (67) | Add Comment | Trackbacks (Suck)
Post contains 1381 words, total size 10 kb.

Saturday, February 06

Geek

Mongo Fail!

MongoDB ran out of memory and crashed during benchmarking.  I headed off to look for the appropriate parameters to tune its memory consumption, and discovered that there aren't any.

MongoDB uses memory-mapped files for storage - as far as I can tell, it maps the files in and then builds all its structures directly in that memory, relying on the operating system to handle paging.  On OpenVZ, that approach seems unlikely to work.  And without at least a synchronous recovery log, it seems destined to destroy your database sooner or later anyway.
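
For illustration, this is roughly what memory-mapped storage looks like at the systems level - a toy sketch of the general technique, not MongoDB's actual code:

    import mmap, os, struct

    # Create a fixed-size data file, map it into memory, and write records by
    # poking bytes directly into the mapping.  The OS decides when dirty pages
    # actually reach the disk - which is exactly the part that goes wrong when
    # the OS (or an OpenVZ container limit) can't give you the memory.

    SIZE = 16 * 1024 * 1024                      # 16MB data file

    fd = os.open('data.mmap', os.O_RDWR | os.O_CREAT)
    os.ftruncate(fd, SIZE)                       # preallocate the file
    buf = mmap.mmap(fd, SIZE)                    # map it into our address space

    # "store" record 7: just write into memory at a computed offset
    record = struct.pack('i60s', 7, b'Mongo smash!')
    offset = 7 * len(record)
    buf[offset:offset + len(record)] = record    # no write() call anywhere

    buf.flush()                                  # politely ask the OS to sync
    buf.close()
    os.close(fd)

Every "write" is just a store into mapped memory; both durability and the memory budget are whatever the OS (or the container) feels like giving you.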

So, nice features, shame about the functionality.

Sigh.

Posted by: Pixy Misa at 02:41 PM | Comments (2) | Add Comment | Trackbacks (Suck)
Post contains 102 words, total size 1 kb.

Geek

Mongo Angry! Mongo Smash!

One of the most intractable problems I have with Minx stems from its inherent many-to-many structure.

Minx supports many sites.
Each site can have many folders.
Each folder can contain many threads.
Each thread can appear in many folders (even on different sites).
Each thread can have many items (posts, comments, and various other less used thingies).

What this means is that to display the 20 most recent comments on your blog, I have to - at least in theory - perform a four-way join, sort, and select.  I actually play some tricks to reduce it to a three-way join on a subset of the data, but once you start to page through the comments the tricks begin to break down.  Not enough that it's noticeably slow at present, but enough that it won't scale to really large numbers of users or really large sites.

I call it the grandparent problem.  If you're looking for one record - an individual comment - no problem, it's O(log n).  If you're looking for the children of a record - comments on a post, comments by a particular user - no problem, it's O(n + log n).  But if you're looking for grandchildren of a record, it's O(n log n), and that n no longer bears any relation to the number of records you actually want; you have to do a huge join, then sort the results, then select the handful you actually want.

MongoDB has a set of features that, put together, look like they solve exactly this problem.

First, you can have arrays in your records.  So, where I currently create duplicate thread records to place a thread in multiple folders (categories, for example), I can just add the category IDs to the array.

By itself that wouldn't be so useful, were it not for feature two: You can index arrays.  So I can create an index on the category array and post time, and simply adding and removing category IDs from that array will make the post show correctly in your folders with no performance hit.  In fact, it's more efficient (both in space and time) than the current technique.

So far so good.  Now for feature three: Arrays in records can contain not just single data values (like a category ID), but other records.  So I can put the posts and comments inside the thread record, and when I fetch a thread, I can fetch the entire thread contents in one go.

Now that wouldn't be so useful either except for feature four: You can build an index on a field in a record in an array in a record in your table.

That is, you can shove all your comments straight into the thread record, and then pick them out 50 at a time for paged comments, or in an entirely different order - say, the last 10 comments posted on your blog, no matter what post they're on.

Magic!
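
In rough PyMongo terms, the four features combine like this - a sketch only, with made-up collection and field names rather than Minx's actual schema, and using the current driver's method names:

    import pymongo

    client = pymongo.MongoClient()
    threads = client.minx.threads

    # Feature 1: a thread document carries an array of folder/category ids;
    # Feature 3: its comments live inside it as embedded documents.
    threads.insert_one({
        '_id': 1001,
        'title': 'Mongo Angry! Mongo Smash!',
        'folders': [3, 17, 42],                       # appears in three folders
        'posted': '2010-02-06T02:50:00',
        'comments': [
            {'n': 1, 'user': 'Pixy Misa', 'posted': '2010-02-06T03:10:00',
             'text': 'First!'},
            {'n': 2, 'user': 'Anonymous', 'posted': '2010-02-06T04:22:00',
             'text': 'Mongo only pawn in game of life.'},
        ],
    })

    # Feature 2: index the array itself - adding or removing a folder id is
    # enough to make the thread show up (or not) in that folder's listing.
    threads.create_index([('folders', pymongo.ASCENDING),
                          ('posted', pymongo.DESCENDING)])

    # Feature 4: index a field of the documents inside the embedded array.
    threads.create_index([('comments.posted', pymongo.DESCENDING)])

    # "20 most recent threads in folder 42" - no join, no duplicate records.
    recent = threads.find({'folders': 42}).sort('posted', -1).limit(20)

    # Put the thread in another folder: one in-place update, index stays right.
    threads.update_one({'_id': 1001}, {'$push': {'folders': 99}})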

The one thing that seems slightly tricky is that MongoDB is document based.  You don't read fields from the database, you read documents.  You can store documents one inside another (comments inside threads, for example), and then you can get one or more of those comments without reading the whole thing.  But if you have information in the thread record itself, you can only get at it by reading the whole thread, comments and all.  For an Ace of Spades post with 1000+ comments, that would burn up all the performance I just gained.

There are some ways around that with a little bit of data duplication and other hackery, though it would be nice if MongoDB let you simply select a subset of fields to be returned.  It already has ways of updating individual fields inside a document, so something like that might already be on the way.

Anyway, that's where I'll be this weekend.

Update: MongoEngine provides a rather nice ORM - um, ODM - for Python and MongoDB.

Posted by: Pixy Misa at 02:50 AM | Comments (8) | Add Comment | Trackbacks (Suck)
Post contains 667 words, total size 4 kb.

Geek

Virtual, Shmirtual

Ten little virtual servers are we,
Freshly created with OpenVZ,
Ten little servers running FreeBSD CentOS 5.4,
Ten little virtual servers.

Everything is installed from source,
Automated by script of course,
Enough packages to choke a baby horse,
Ten little virtual servers.

Ten little virtual servers swiftly,
Updated to run MongoDB,
No more MySQL for this Pixy!
Ten little virtual servers,
Ten little virtual servers.
more...

Posted by: Pixy Misa at 01:15 AM | Comments (1) | Add Comment | Trackbacks (Suck)
Post contains 182 words, total size 4 kb.
