Saturday, February 06

Geek

Mongo Angry! Mongo Smash!

One of the most intractable problems I have with Minx stems from its inherent many-to-many structure.

Minx supports many sites.
Each site can have many folders.
Each folder can contain many threads.
Each thread can appear in many folders (even on different sites).
Each thread can have many items (posts, comments, and various other less used thingies).

What this means is that to display the 20 most recent comments on your blog, I have to - at least in theory - perform a four-way join, sort, and select.  I actually play some tricks to reduce it to a three-way join on a subset of the data, but once you start to page through the comments the tricks begin to break down.  Not enough that it's noticeably slow at present, but enough that it won't scale to really large numbers of users or really large sites.

I call it the grandparent problem.  If you're looking for one record - an individual comment - no problem, it's O(log n).  If you're looking for the children of a record - comments on a post, comments by a particular user - no problem, it's O(n + log n).  But if you're looking for grandchildren of a record, it's O(n log n), and that n no longer bears any relation to the number of records you actually want; you have to do a huge join, then sort the results, then select the handful you actually want.

MongoDB has a set of features that, put together, look like they solve exactly this problem.

First, you can have arrays in your records.  So, where I currently create duplicate thread records to place a thread in multiple folders (categories, for example), I can just add the category IDs to the array.

By itself that wouldn't be so useful, were it not for feature two: You can index arrays.  So I can create an index on the category array and post time, and simply adding and removing category IDs from that array will make the post show correctly in your folders with no performance hit.  In fact, it's more efficient (both in space and time) than the current technique.
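As a rough sketch of the idea (the field names here are mine, not Minx's actual schema), folder membership becomes a plain array in the thread record, and a multikey index on that array makes a membership query an index lookup:

```python
# Hypothetical thread document - one record, placed in many folders
# simply by listing their IDs:
thread = {
    "_id": 42,
    "title": "Mongo Angry! Mongo Smash!",
    "folders": [3, 7, 19],   # category/folder IDs
    "posted": "2010-02-06T02:50:00",
}

# A MongoDB multikey index on "folders" stores one index entry per array
# element, so a query like db.threads.find({"folders": 7}) hits the index.
# The matching rule, in plain Python, is just containment:
def in_folder(doc, folder_id):
    return folder_id in doc.get("folders", [])

print(in_folder(thread, 7))   # True
print(in_folder(thread, 99))  # False
```

Adding or removing a folder is then a one-field update on the array, with no duplicate thread records to keep in sync.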

So far so good.  Now for feature three: Arrays in records can contain not just single data values (like a category ID), but other records.  So I can put the posts and comments inside the thread record, and when I fetch a thread, I can fetch the entire thread contents in one go.

Now that wouldn't be so useful either except for feature four: You can build an index on a field in a record in an array in a record in your table.

That is, you can shove all your comments straight into the thread record, and then pick them out 50 at a time for paged comments, or in an entirely different order - say, the last 10 comments posted on your blog, no matter what post they're on.
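In plain Python, the "last N comments site-wide" query that such a nested index makes cheap looks like this (document shape and field names are my guesses, not the real schema; a real index avoids the full scan-and-sort shown here):

```python
from itertools import chain

# Hypothetical threads with their comments embedded inside them:
threads = [
    {"_id": 1, "comments": [
        {"n": 1, "posted": "2010-02-06T13:22", "text": "first"},
        {"n": 2, "posted": "2010-02-06T13:25", "text": "second"},
    ]},
    {"_id": 2, "comments": [
        {"n": 1, "posted": "2010-02-07T00:33", "text": "third"},
    ]},
]

# An index on "comments.posted" covers every embedded comment, so the
# newest N comments across all threads come from one index walk rather
# than a join.  This function returns what that walk would return:
def latest_comments(threads, n):
    all_comments = chain.from_iterable(t["comments"] for t in threads)
    return sorted(all_comments, key=lambda c: c["posted"], reverse=True)[:n]

print([c["text"] for c in latest_comments(threads, 2)])  # ['third', 'second']
```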

Magic!

The one thing that seems slightly tricky is that MongoDB is document based.  You don't read fields from the database, you read documents.  You can store documents one inside another (comments inside threads, for example), and then you can get one or more of those comments without reading the whole thing.  But if you have information in the thread record itself, you can only get at it by reading the whole thread, comments and all.  For an Ace of Spades post with 1000+ comments, that would burn up all the performance I just gained.

There are some ways around that with a little bit of data duplication and other hackery, though it would be nice if MongoDB let you simply select a subset of fields to be returned.  It already has ways of updating individual fields inside a document, so something like that might already be on the way.
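The data-duplication workaround amounts to keeping the thread's own fields cheap to fetch on their own.  Here's a plain-Python stand-in for the kind of field selection the paragraph wishes for (again, field names are hypothetical):

```python
# Duplicate the light thread-level fields so the heavy comment array
# never needs to come off disk just to render a thread header.
def project(doc, fields):
    """Stand-in for a field projection: return only the requested
    top-level fields of a document."""
    return {k: doc[k] for k in fields if k in doc}

thread = {
    "_id": 42,
    "title": "Mongo Angry! Mongo Smash!",
    "comment_count": 1000,
    "comments": [{"n": i} for i in range(1000)],  # the heavy part
}

slim = project(thread, ["_id", "title", "comment_count"])
# "comments" is gone; only the light fields remain
```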

Anyway, that's where I'll be this weekend.

Update: MongoEngine provides a rather nice ORM - um, ODM - for Python and MongoDB.

Posted by: Pixy Misa at 02:50 AM | Comments (8) | Add Comment | Trackbacks (Suck)
Post contains 667 words, total size 4 kb.

1 Big Boss Pixy, for you I have a question.  Is there a way to upload multiple images at once that I've not figured out?  For example, my soon-to-be-done review of Ga-Rei Zero ep05 has a ridiculous number of screenshots that, as far as I'm aware, need to be uploaded one by one.

I'd REALLY like to just upload the whole shebang at once to Images... can that be done as Minx stands, or am I SOL?

Posted by: Wonderduck at Saturday, February 06 2010 01:22 PM (w5qDx)

2 Not at the moment.  But if you upload a zip of all the images, I can unzip it for you.

Posted by: Pixy Misa at Saturday, February 06 2010 01:25 PM (PiXy!)

3 No, I've actually already uploaded the ridiculous number of images, and it's hardly like I want to bother you MORE.  If it can't be done without troubling you, then I'll continue doing it the normal way.

Unless you WANT a zip of the pics?  :-)

Posted by: Wonderduck at Saturday, February 06 2010 11:49 PM (w5qDx)

4 smile

I'm looking at two options - a feature which lets you unzip a zip file you've uploaded, and a Flash-based multiple file upload (I have a very nice one that works with the other JavaScript stuff I use, but haven't integrated it with the back-end yet).

It's definitely something I plan to do.

Posted by: Pixy Misa at Sunday, February 07 2010 12:33 AM (PiXy!)

5 For what it's worth, I know I'd rather see the Flash version, but that's simply because there appears to be one less step for me to screw up.  *rolling eyes*

Posted by: Wonderduck at Sunday, February 07 2010 05:59 AM (w5qDx)

6 Are you sure Flash can do that? That sounds like it violates the sandbox.

I thought that Flash could either access the internet or it could access local files, but it couldn't do both.

Posted by: Steven Den Beste at Sunday, February 07 2010 06:16 AM (+rSRq)

7 Yup.  Gmail uses it for selecting attachments, for example.

It's set up so it can't happen without user intervention, but it definitely works.

Posted by: Pixy Misa at Sunday, February 07 2010 07:25 AM (PiXy!)

8 They must have changed it. The either-or I mentioned was originally deliberately set up that way to prevent Flash from being used as an illicit data vacuum cleaner.

Posted by: Steven Den Beste at Sunday, February 07 2010 07:41 AM (+rSRq)


Powered by Minx 1.1.6c-pink.