Tuesday, November 29
After loading a fresh copy of the database, you must analyse the tables before MySQL will do anything remotely sensible with the indexes.
If you fail to do this... DOH!
(Visualise that "DOH!" in 40-foot-high flashing red neon, with searchlights and helicopters flying overhead and police cars and fire engines and so on and so forth.)
Why exactly have you provided a structure for monthly archives in your database when the system is entirely dynamic and is already indexed by date?
Because the sticky field overrides the date ordering.
But you could add a new index?
And it would add, what, 10% to the thread table size?
And it would allow monthly archives by category and stuff like that?
Monday, November 28
The way to consistently achieve acceptable performance on full-text searches using MySQL* is to avoid full-text indexes at all costs.
The problem is threefold. Full-text indexes generally treat your text as one big splodge of data. Minx** is structured into recursive directories of sites containing recursive directories of folders containing recursive directories of threads, which contain posts and comments and various other type thingies, all of which are crosslinked like the great polymer of doom. There's all the structure you could possibly ask for when it comes to narrowing down a search. But if you use a full-text index, it searches the entire database first, and then looks at your selection. This wouldn't be a problem, though, were it not for the other two points.
MySQL somehow scatters its full-text index data all over the disk in bite-size pieces. If all the index data is in memory (and you can force that situation using a special SQL query), it is very fast. If it's not, then MySQL will gather up all its little pieces before doing anything with your search. This can take (literally) a minute. Once it's in memory, your search might take a tenth of a second, but the first time, it's likely to suck.
Finally, what with the time taken to scatter all those little pieces about, building the index in the first place takes forever. Add that one index and your database takes ten times longer to load.
How to avoid all this ugliness? Simple. Use brute-force searches. There are still some tricks there, for instance, that only indexed critera seem to be used to narrow the search before MySQL does the text match, so a carelessly defined selection can end up scanning the entire database. Also, MySQL is pretty darn slow at doing brute-force searches.***
Beyond that, the solution is to build your own search engine. Or use Google. But since I can search all of Ace's posts in about a second and all his comments in four, and since I can easily get it to restrict the search to, say, the last six months or whatever (proportionally reducing the search time) or expand it to multiple blogs or narrow it to specific categories... I think it's good enough to be getting on with.
And since that was the only problem I really still had with Minx** there is now nothing in the way of rolling out a preview release... Except that I have to move house first.****
* In the context of a large-scale blogging system.
** Which I am not working on. Not at all.
*** As it turns out, not really any slower than selecting all the text in the first place. So the problem isn't in the text search itself, which is something of a relief.
**** Yes, again. I don't want to talk about it. In fact, I don't know why I brought it up in the first place. Bah.
Sunday, November 27
or, Winning A Battle In The War On Spam
If you're wondering where I've been these past few days, well, I've been busy snarking.
Snark! is the new MuNu trackback filter. It's based on the simple but elegant idea that if someone sends us lots of trackbacks, we don't want them. Unlike most people, I am in the position to collate trackback data from across two hundred blogs in real time. So if all at once someone sends three pings to Little Miss Attila and two to Ace of Spades and another four to, say, the the Llamas, I can simply say, "This is Spam, and I shall delete it forthwith", and do so.
We get a lot of spam. Tens of thousands of trackbacks a day. Thousands of comments. We are running MT Blacklist, and most of it gets summarily rejected. But. Movable Type is not the most sprightly of applications. It's a dynamically configured CGI app written in Perl. That's not a recipe for sparkling performance, and indeed, sparkle it does not. It chugs along like a diesel engine, a plough horse rather than a thoroughbred. It can take close to half a second, sometimes more, for Movable Type to decide to reject a trackback.
And when the spammers really get to work, we can receive a thousand trackbacks a minute.
Snark stops 99.8% of trackback spam before it even gets to Movable Type, and it does it very very efficiently. How efficiently? This efficiently:
Blacklist Entries: 23 (plus 61 manual entries)3000 trackbacks stopped, 360 web pages updated, 360 blacklists exported, in the same time it takes Movable Type to do one.
Session Uptime: 6 hours 0 minutes
Pings Received: 3045
Processing Time: 0.50 seconds
This is not a slam on Movable Type itself. The Perl script I use to simply log the incoming trackbacks takes 40 milliseconds, 0.04 seconds, to run. Snark can process the trackbacks a hundred times faster than the system can record them.
What I'm saying is that there are better ways to do things than CGI and Perl. PHP is a significant improvement in terms of performance, but not so much in terms of the language itself.
Python and persistent application servers are where the action is. I tried writing a blogging system exactly that way, but I was unfortunate in my choice of databases (I used Metakit, and it simply doesn't scale). Fortunately, Python SQL programming isn't as bad as all that - it's at least comparable with Perl or PHP.* CherryPy is a very neat way to organise such a system without needing any sort of CGI or PHP front end. And Psyco speeds up even text-processing applications by a good 50%.
Which is not to say that I am busy working on Minx again and hope to have something to show before the end of the year. Not at all.
Update: I've cleaned up the code a little - although it still makes multiple passes through the trackback list - and changed the order in which the filters are applied so that the volume filter comes first (that is, after the whitelist) and the blacklist comes last. That should make things even more efficient since the volume and age filters are O(1) and the blacklist is O(n). Now I just need another 10,000 trackbacks so I can do a comparison. Come in spammer!
Update: I finished the code cleanup and optimisation, and the spammers obliged:
Blacklist Entries: 24 (plus 60 manual entries)I think this one's a keeper.
Session Uptime: 13 hours 3 minutes
Pings Received: 11888
Processing Time: 0.50 seconds
*In other words, about twenty years behind commercial systems.
Friday, November 25
Pixy's First Law of Economics: Spam is whatever you have too much of.
If you are trying to identify what is and isn't spam, forget blacklists and bayesian filters. Go by volume. Of course, you have to be in a position to measure the volume, but if you are, that's it for the spammers.
According to Snark!™ duepunti.net currently has a spam ranking of 48. Even if they send me no trackbacks at all for the next hour, they will still be considered a spammer and anything that comes from them will be automatically deleted (and bump their rank up).
Now they're up to 62.8. Slow learners. Of course, I don't provide any feedback, I just null-route the bastards.
I was just thinking - I could post the Snark!™ stats as a public service. Make it (ugh) XML and people could import it directly. Real-time dynamic spammer detection.
First I have to stop Snark!™ going mad and dropping the ball. It did that last night and generated a gigabyte of error messages. I think a leetle bit more tweaking is in order.
Update: See the link above. It still falls over now and then, so you can expect the values to suddenly get reset to zero on occasion until I (a) get that fixed and/or (b) get it to store the spam rankings.
Update: I did (b), 'cause it's easier to add code than to fix what's already there. Not better, just easier.
Update: Okay, I think I've managed (a) as well. Turned out to be a couple of bugs that only occured when there were no trackbacks to be processed. This didn't show up in my testing, because that would mean going an entire minute without getting spammed.
Update: The spammers have gone quiet for now. This is probably the first time I've ever wanted to get spamflooded. The point is, the more spam we get, the better the filter works, and the better the data we can provide to others. We now have an IP address list as well, but because the spam died down just as I implemented that function, it presently contains exactly one address.
We receive well over a million trackbacks a month, so I'm sure we'll have a nice set of sample data coming down the wire soon enough.
Update: Change log sort of thingy. Though I really just added that link to test the whitelist.
Sunday, November 13
Update: No more cows. Well, no major cows. We are on the new servers, and things seem to be working well.
Thursday, November 03
The Good: My laserdisc player, which I haven't used for, oh, a real long time, still works perfectly.
The Bad: My new video capture widget doesn't work at all well with NTSC. In fact, it craps it up completely. Of course, all my laserdiscs are NTSC. (I checked the specs and it said it supported NTSC. I didn't notice there were two different models.)
The Ugly: We've been really spoilt by the recent trend towards end-to-end digital transfers. Laserdiscs used to be considered high-quality video. Eh. Well, I guess my choice of test disc had something to do with it. (Slayers Next, which just happened to be handy.)
Back in the day, 80GB of disk would look like a warehouse full of washing machines.
The good thing was that this made it much harder to put it down somewhere and lose it.
57 queries taking 0.124 seconds, 316 records returned.
Powered by Minx 1.1.6c-pink.