Tuesday, June 06

Geek

Huh?

I was forced to kill the MT 2.6 search routine at Munu because it was (a) taking 180MB of memory, (b) taking a couple of minutes, and (c) because of (b) people were clicking on it multiple times until it, essentially, killed the server.

I just rewrote it.

I tested it on Ace's blog, by searching for "Bush". It's currently set to only search the last 500 entries, but for my test I set it to scan the last 5000.

It takes one second. And 19MB of memory. About 350ms for the SQL query and 650ms for the program itself.

It sure ain't optimised. It selects the last 5000 entries, sorts them (because God forbid there should be a useful compound index), yanks the entire result set into a list, scans them for each of the search terms, uses an SGML parser to remove HTML tags, and kicks the result out.
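A rough sketch of that brute-force pass, not the actual script. It assumes the entries have already been pulled into a list of (title, body) tuples, uses Python 3's html.parser in place of the SGML parser, and assumes every search term has to match; the names TagStripper, strip_tags and search_entries are made up for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects the text between tags and throws the tags away."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


def search_entries(entries, terms):
    """Scan every entry for every term, then strip tags from the hits.

    Assumption: all terms must match (AND), case-insensitively.
    """
    wanted = [t.lower() for t in terms]
    hits = []
    for title, body in entries:
        haystack = (title + " " + body).lower()
        if all(t in haystack for t in wanted):
            hits.append((title, strip_tags(body)))
    return hits


entries = [
    ("Huh?", "<p>Rewrote the <b>search</b> routine.</p>"),
    ("Other", "<p>Nothing to see here.</p>"),
]
results = search_entries(entries, ["search"])
```

Everything happens in memory after one big fetch, which is why both the time and the memory use grow with the number of entries scanned.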

For more reasonable searches, like searching for "test" on the last 500 entries at Munuviana, it takes 35ms for the query and 130ms for the program.

About 120ms of that is start-up: Launching the Python interpreter and loading the seven or eight libraries involved.

Now, I'm not using a template system for this. Still, one-tenth the memory, one hundredth(?) the processing time. I can't be sure about the processing difference, because I can't run the original script right now: I configured Apache with a 100MB memory limit for CGI scripts, and the old search wants 180MB.

For Minx, the 120ms start-up time will disappear because the application runs as a multi-threaded server itself, not as individual CGI (or PHP or ASP) scripts. I can only do so much about the query time, but I'll play around with it. And I'll pre-store the excerpts rather than create them on the fly. Well, probably. I might be able to live with an average search time of 45ms.

Hmm.

What if I get MySQL to do the matching? Let's see...

Okay, not good. Hmm.

Ah, there we go. Don't use regexps unless you need them. "LIKE" is nice and brisk. With a 500-result limit, the Ace/Bush search is 125ms for the SQL query and 300ms for the program. And doing it that way, I could actually page it, so 50 results at a time. Hmm. And when I'm taking the 50-word excerpts, rather than whitespace-split the entire entry, I'll just look at the first 400 bytes.
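Something like the following shows the idea: LIKE matching in the database, 50 results per page, and excerpts cut from the start of the entry only. It's a sketch under assumptions, not the real code. It uses Python's built-in sqlite3 instead of MySQL so it runs on its own, the entries table and its columns are invented, and it slices 400 characters rather than 400 bytes.

```python
import sqlite3

PAGE_SIZE = 50        # results per page
EXCERPT_CHARS = 400   # only look at the start of each entry
EXCERPT_WORDS = 50


def excerpt(body):
    """Up to 50 words, split from the first 400 characters only."""
    return " ".join(body[:EXCERPT_CHARS].split()[:EXCERPT_WORDS])


def search_page(conn, blog_id, term, page=0):
    # Escape LIKE wildcards so user input is matched literally.
    escaped = (
        term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT title, body FROM entries "
        "WHERE blog_id = ? AND body LIKE ? ESCAPE '\\' "
        "ORDER BY created DESC LIMIT ? OFFSET ?",
        (blog_id, "%" + escaped + "%", PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()
    return [(title, excerpt(body)) for title, body in rows]


# Hypothetical schema and data, just enough to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries "
    "(blog_id INTEGER, created INTEGER, title TEXT, body TEXT)"
)
for i in range(120):
    body = "test entry number %d" % i if i % 2 else "nothing here"
    conn.execute(
        "INSERT INTO entries VALUES (?, ?, ?, ?)",
        (1, i, "Post %d" % i, body),
    )

page0 = search_page(conn, 1, "test")
page1 = search_page(conn, 1, "test", page=1)
```

The database only hands back one page of rows at a time, so the program never holds the whole result set.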

And let's go back and add that compound index while we're at it...

Okay, now we're cooking. 7ms for the query, 100ms for the search script. Since the resolution of the timer seems to be 10ms, and the search script takes 100ms if you feed it an invalid blog id, that's less than 10ms or so for the actual work.
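For illustration, here's the compound index idea in miniature. This is an assumption-heavy sketch: it uses sqlite3 rather than MySQL, and the table, column and index names are invented. The point is only that an index on (blog_id, created) lets the database find one blog's entries already in date order instead of sorting them itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, blog_id INTEGER, "
    "created INTEGER, title TEXT, body TEXT)"
)
# Compound index: filter column first, sort column second.
conn.execute("CREATE INDEX idx_blog_created ON entries (blog_id, created)")

query = (
    "SELECT title, body FROM entries "
    "WHERE blog_id = ? ORDER BY created DESC LIMIT 500"
)

# Ask the planner how it would run the query; the last column of each
# row is the human-readable description.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (1,)).fetchall()
plan_text = " ".join(row[-1] for row in plan)

uses_index = "idx_blog_created" in plan_text
needs_sort = "TEMP B-TREE" in plan_text  # SQLite's marker for a sort step
```

In MySQL the equivalent check would be EXPLAIN on the same query, looking for the index in the key column and no "Using filesort".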

That'll do.

Posted by: Pixy Misa at 03:47 AM

1 I'm happy to report that there have been no problems with the new search function over at The Pond.  In the past, it would usually 'time out' and die on me (I did have ONE successful search, once).  Now, it just works, and well.  Brilliant.

Posted by: Wonderduck at Wednesday, June 07 2006 09:10 AM (7+BNY)
