Wednesday, November 30
This blog now appears at ai.mu.nu. Of course, it still appears at ambientirony.mu.nu and even at ambientirony.com.
I changed the default domain because there are an increasing number of blogs filtering referrers on the word "ambien". Which sucks.
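For the curious, here is a minimal sketch of how that sort of filter trips over this domain. It assumes the filter is a plain substring match on the referrer. The function name and the blacklist are made up for illustration, and the real filters may well work differently.

```python
def is_blocked(referrer, banned_words):
    """Naive filter: block if any banned word appears anywhere in the referrer."""
    lowered = referrer.lower()
    return any(word in lowered for word in banned_words)


banned = ["ambien"]  # hypothetical blacklist entry

# "ambientirony" contains "ambien", so the old domain is a false positive.
print(is_blocked("http://ambientirony.mu.nu/", banned))
# The new default domain does not contain the banned word.
print(is_blocked("http://ai.mu.nu/", banned))
```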
It makes an artform out of awfulness. Every page you stumble across scales unexpected pinnacles of unfriendliness. Apparently they have an online store. I sure as hell can't find it.
Look at this, which is a subsite of the above train wreck. There is exactly one link on the page, and it does nothing.
What the hell? I mean, seriously, what the hell? If this had somehow languished on someone's server since 1996 and I'd just now stumbled across it, I could understand, but the copyright dates are this year (and often, next year).
International Illustrator brings you all the best tubes, tuts, images, fonts, filters, tags and more!
Now you're just making things up. You sound like my granddaughter, and I know she makes that stuff up.
Minus thirty trillion Pixy Points. Reformat your server, install Linux and, say, Joomla, and start again from the beginning, because what we have here is a failure to communicate. [You were going to say something with the word "fuck" in it, weren't you? You could tell? Well, yeah.]
So, I have this little program that converts Movable Type blogs, singly or en masse, into Minx blogs.* And I am trying out various queries to see how the database performs when it is actually using the indexes, and now and then adding a new index.
One of the indexes I added was on the number of comments on posts, so you can quickly see where the action is (or was). And the number one post, with 1633 comments, can be found here. I'm surprised the poor system survived.
MySQL takes 0.11 seconds to bring up those comments the first time; 0.03 seconds after they've been cached in memory. Whether this is a worthwhile achievement or not I will leave to Madfish's readers to decide.
* Not that I am working on Minx.
Tuesday, November 29
Voters: We've got Giuliani and Rice; Allen and Rice; Romney and Rice; Rice, McCain and Rice; Rice, Rice, Thompson and Rice; and Frist.
Hugh: I'll take the Frist.
Update: Blargh. Links to Hewitt fail due to broken referrer-spam filter. Click here for working-link version.
How long has that stupid picture been sitting there anyway? Two weeks now?
Note to self: WHM account transfer function trashes symbolic links.
After loading a fresh copy of the database, you must analyse the tables before MySQL will do anything remotely sensible with the indexes.
If you fail to do this... DOH!
(Visualise that "DOH!" in 40-foot-high flashing red neon, with searchlights and helicopters flying overhead and police cars and fire engines and so on and so forth.)
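A small sketch of the step in question: MySQL's ANALYZE TABLE statement refreshes the statistics the optimiser uses to pick indexes. The helper below only builds the statements. The table names are hypothetical, and running them against a live server is left to whatever client you use.

```python
def analyze_statements(table_names):
    """Build one MySQL ANALYZE TABLE statement per table name.

    Names are wrapped in backticks; this sketch assumes they contain
    no backticks themselves.
    """
    return ["ANALYZE TABLE `{}`;".format(name) for name in table_names]


tables = ["thread", "post", "comment"]  # hypothetical table names
for statement in analyze_statements(tables):
    print(statement)
```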
Why exactly have you provided a structure for monthly archives in your database when the system is entirely dynamic and is already indexed by date?
Because the sticky field overrides the date ordering.
But you could add a new index?
And it would add, what, 10% to the thread table size?
And it would allow monthly archives by category and stuff like that?
Monday, November 28
The way to consistently achieve acceptable performance on full-text searches using MySQL* is to avoid full-text indexes at all costs.
The problem is threefold. Full-text indexes generally treat your text as one big splodge of data. Minx** is structured into recursive directories of sites containing recursive directories of folders containing recursive directories of threads, which contain posts and comments and various other type thingies, all of which are crosslinked like the great polymer of doom. There's all the structure you could possibly ask for when it comes to narrowing down a search. But if you use a full-text index, it searches the entire database first, and then looks at your selection. This wouldn't be a problem, though, were it not for the other two points.
MySQL somehow scatters its full-text index data all over the disk in bite-size pieces. If all the index data is in memory (and you can force that situation using a special SQL query), it is very fast. If it's not, then MySQL will gather up all its little pieces before doing anything with your search. This can take (literally) a minute. Once it's in memory, your search might take a tenth of a second, but the first time, it's likely to suck.
Finally, what with the time taken to scatter all those little pieces about, building the index in the first place takes forever. Add that one index and your database takes ten times longer to load.
How to avoid all this ugliness? Simple. Use brute-force searches. There are still some tricks there: for instance, only indexed criteria seem to be used to narrow the search before MySQL does the text match, so a carelessly defined selection can end up scanning the entire database. Also, MySQL is pretty darn slow at doing brute-force searches.***
Beyond that, the solution is to build your own search engine. Or use Google. But since I can search all of Ace's posts in about a second and all his comments in four, and since I can easily get it to restrict the search to, say, the last six months or whatever (proportionally reducing the search time) or expand it to multiple blogs or narrow it to specific categories... I think it's good enough to be getting on with.
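Here is a rough sketch of the brute-force approach: narrow the candidates with indexed columns (blog and date), then do a plain substring match on the text. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the table and column names are invented for the example; this is not the Minx schema, and how MySQL's planner actually uses the index may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, blog_id INTEGER, "
    "posted TEXT, body TEXT)"
)
# Index the criteria used to narrow the search before the text match.
conn.execute("CREATE INDEX post_blog_posted ON post (blog_id, posted)")
conn.executemany(
    "INSERT INTO post (blog_id, posted, body) VALUES (?, ?, ?)",
    [
        (1, "2005-03-01", "Old post about spam"),
        (1, "2005-11-20", "New post about trackback spam"),
        (2, "2005-11-21", "Another blog on spam"),
    ],
)


def search(conn, blog_id, since, text):
    """Brute-force text match, restricted by indexed blog and date columns."""
    rows = conn.execute(
        "SELECT id, posted, body FROM post "
        "WHERE blog_id = ? AND posted >= ? AND body LIKE ? "
        "ORDER BY posted",
        (blog_id, since, "%" + text + "%"),
    )
    return rows.fetchall()


# Only blog 1, only posts since June 2005, only bodies containing "spam".
results = search(conn, 1, "2005-06-01", "spam")
print(results)
```

Shrinking the date range or the set of blogs shrinks the number of rows the text match has to touch, which is the proportional speed-up described above.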
And since that was the only problem I really still had with Minx** there is now nothing in the way of rolling out a preview release... Except that I have to move house first.****
* In the context of a large-scale blogging system.
** Which I am not working on. Not at all.
*** As it turns out, not really any slower than selecting all the text in the first place. So the problem isn't in the text search itself, which is something of a relief.
**** Yes, again. I don't want to talk about it. In fact, I don't know why I brought it up in the first place. Bah.
Sunday, November 27
or, Winning A Battle In The War On Spam
If you're wondering where I've been these past few days, well, I've been busy snarking.
Snark! is the new MuNu trackback filter. It's based on the simple but elegant idea that if someone sends us lots of trackbacks, we don't want them. Unlike most people, I am in the position to collate trackback data from across two hundred blogs in real time. So if all at once someone sends three pings to Little Miss Attila and two to Ace of Spades and another four to, say, the Llamas, I can simply say, "This is Spam, and I shall delete it forthwith", and do so.
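The idea can be sketched in a few lines. This is not Snark's code: the class name, the threshold of five pings and the sixty-second window are all assumptions picked for illustration. The point is only that pings are counted per sender across every blog, not per blog.

```python
from collections import defaultdict, deque


class VolumeFilter:
    """Flag a sender once it exceeds max_pings within window_seconds.

    Counts are kept per sender across all blogs. The threshold and
    window are illustrative guesses, not Snark's real settings.
    """

    def __init__(self, max_pings=5, window_seconds=60.0):
        self.max_pings = max_pings
        self.window = window_seconds
        self.recent = defaultdict(deque)  # sender -> timestamps of recent pings

    def is_spam(self, sender, now):
        times = self.recent[sender]
        # Forget pings that have fallen out of the time window.
        while times and now - times[0] > self.window:
            times.popleft()
        times.append(now)
        return len(times) > self.max_pings


volume = VolumeFilter(max_pings=5, window_seconds=60.0)
# Nine pings from one sender, spread over three blogs, one second apart.
targets = ["Little Miss Attila"] * 3 + ["Ace of Spades"] * 2 + ["the Llamas"] * 4
verdicts = [
    volume.is_spam("spammer.example", now=float(second))
    for second, _blog in enumerate(targets)
]
print(verdicts)
```

No single blog sees more than four pings here, so a per-blog count would never trip; the combined count does. A fuller version would presumably also go back and delete the pings that arrived before the threshold was crossed.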
We get a lot of spam. Tens of thousands of trackbacks a day. Thousands of comments. We are running MT Blacklist, and most of it gets summarily rejected. But. Movable Type is not the most sprightly of applications. It's a dynamically configured CGI app written in Perl. That's not a recipe for sparkling performance, and indeed, sparkle it does not. It chugs along like a diesel engine, a plough horse rather than a thoroughbred. It can take close to half a second, sometimes more, for Movable Type to decide to reject a trackback.
And when the spammers really get to work, we can receive a thousand trackbacks a minute.
Snark stops 99.8% of trackback spam before it even gets to Movable Type, and it does it very very efficiently. How efficiently? This efficiently:
Blacklist Entries: 23 (plus 61 manual entries)
Session Uptime: 6 hours 0 minutes
Pings Received: 3045
Processing Time: 0.50 seconds
3000 trackbacks stopped, 360 web pages updated, 360 blacklists exported, in the same time it takes Movable Type to do one.
This is not a slam on Movable Type itself. The Perl script I use to simply log the incoming trackbacks takes 40 milliseconds, 0.04 seconds, to run. Snark can process the trackbacks a hundred times faster than the system can record them.
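A quick back-of-the-envelope check, assuming the 0.50 seconds of processing time covers all 3045 pings in the session, suggests that "a hundred times" is on the conservative side:

```python
pings = 3045                 # Pings Received, from the session stats
processing_seconds = 0.50    # Processing Time for the whole session
log_script_seconds = 0.040   # Perl logging script, per trackback

per_ping_seconds = processing_seconds / pings
per_ping_ms = per_ping_seconds * 1000
ratio = log_script_seconds / per_ping_seconds

print("{:.3f} ms per ping, roughly {:.0f}x faster than logging".format(
    per_ping_ms, ratio))
```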
What I'm saying is that there are better ways to do things than CGI and Perl. PHP is a significant improvement in terms of performance, but not so much in terms of the language itself.
Python and persistent application servers are where the action is. I tried writing a blogging system exactly that way, but I was unfortunate in my choice of databases (I used Metakit, and it simply doesn't scale). Fortunately, Python SQL programming isn't as bad as all that - it's at least comparable with Perl or PHP.* CherryPy is a very neat way to organise such a system without needing any sort of CGI or PHP front end. And Psyco speeds up even text-processing applications by a good 50%.
Which is not to say that I am busy working on Minx again and hope to have something to show before the end of the year. Not at all.
Update: I've cleaned up the code a little - although it still makes multiple passes through the trackback list - and changed the order in which the filters are applied so that the volume filter comes first (that is, after the whitelist) and the blacklist comes last. That should make things even more efficient since the volume and age filters are O(1) and the blacklist is O(n). Now I just need another 10,000 trackbacks so I can do a comparison. Come in spammer!
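The reordering looks roughly like this. It is a sketch, not the actual Snark code: the field names, the volume and age checks, and the blacklist patterns are all hypothetical. What it shows is the cheap constant-time checks running first, so the linear scan through the blacklist only happens for pings that survive them.

```python
def classify(ping, whitelist, over_volume, too_old, blacklist):
    """Apply filters cheapest-first: whitelist, volume, age, then blacklist."""
    if ping["source"] in whitelist:      # set lookup, O(1)
        return "accept"
    if over_volume(ping):                # O(1) check (hypothetical)
        return "reject:volume"
    if too_old(ping):                    # O(1) check (hypothetical)
        return "reject:age"
    for pattern in blacklist:            # O(n) scan, done last
        if pattern in ping["url"]:
            return "reject:blacklist"
    return "accept"


# Hypothetical configuration for the example.
whitelist = {"friend.example"}
blacklist = ["casino", "pills"]


def over_volume(ping):
    return ping.get("recent_count", 0) > 5


def too_old(ping):
    return ping.get("post_age_days", 0) > 30


print(classify({"source": "friend.example",
                "url": "http://friend.example/casino-night"},
               whitelist, over_volume, too_old, blacklist))
print(classify({"source": "x.example", "url": "http://x.example/a",
                "recent_count": 9},
               whitelist, over_volume, too_old, blacklist))
print(classify({"source": "y.example", "url": "http://y.example/cheap-pills"},
               whitelist, over_volume, too_old, blacklist))
```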
Update: I finished the code cleanup and optimisation, and the spammers obliged:
Blacklist Entries: 24 (plus 60 manual entries)
Session Uptime: 13 hours 3 minutes
Pings Received: 11888
Processing Time: 0.50 seconds
I think this one's a keeper.
*In other words, about twenty years behind commercial systems.
Saturday, November 26
It's a where, not a what.
Spam - it's an education.