Sunday, November 27

Geek

Now That I Am Flu Nebraska

or, Winning A Battle In The War On Spam

If you're wondering where I've been these past few days, well, I've been busy snarking.

Snark! is the new MuNu trackback filter. It's based on the simple but elegant idea that if someone sends us lots of trackbacks, we don't want them. Unlike most people, I am in a position to collate trackback data from across two hundred blogs in real time. So if all at once someone sends three pings to Little Miss Attila and two to Ace of Spades and another four to, say, the Llamas, I can simply say, "This is Spam, and I shall delete it forthwith", and do so.
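A minimal sketch of the volume idea in Python, and not Snark's actual code: the class name VolumeFilter, the limit of five pings, the sixty second window, and identifying a sender by a source string are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class VolumeFilter:
    """Flag a source that pings too often, counted across every blog."""

    def __init__(self, max_pings=5, window_seconds=60.0):
        # Both limits are illustrative assumptions, not Snark's settings.
        self.max_pings = max_pings
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # source -> deque of (time, blog)

    def is_spam(self, source, blog, now):
        """Record one ping and report whether the source is over the limit."""
        pings = self._recent[source]
        pings.append((now, blog))
        # Forget pings that have fallen out of the time window.
        while pings and now - pings[0][0] > self.window_seconds:
            pings.popleft()
        return len(pings) > self.max_pings

    def blogs_hit(self, source):
        """Return the set of blogs this source has pinged within the window."""
        return {blog for _, blog in self._recent[source]}
```

With these made-up limits, the nine pings in the example above would trip the filter on the sixth ping, whichever blog it was aimed at. Each ping is appended once and dropped once, so the check stays cheap however many blogs are being watched.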

We get a lot of spam. Tens of thousands of trackbacks a day. Thousands of comments. We are running MT Blacklist, and most of it gets summarily rejected. But. Movable Type is not the most sprightly of applications. It's a dynamically configured CGI app written in Perl. That's not a recipe for sparkling performance, and indeed, sparkle it does not. It chugs along like a diesel engine, a plough horse rather than a thoroughbred. It can take close to half a second, sometimes more, for Movable Type to decide to reject a trackback.

And when the spammers really get to work, we can receive a thousand trackbacks a minute.

Snark stops 99.8% of trackback spam before it even gets to Movable Type, and it does it very very efficiently. How efficiently? This efficiently:

Blacklist Entries: 23 (plus 61 manual entries)
Session Uptime: 6 hours 0 minutes
Pings Received: 3045
Processing Time: 0.50 seconds

3000 trackbacks stopped, 360 web pages updated, 360 blacklists exported, in the same time it takes Movable Type to do one.

This is not a slam on Movable Type itself. The Perl script I use to simply log the incoming trackbacks takes 40 milliseconds, 0.04 seconds, to run. Snark can process the trackbacks a hundred times faster than the system can record them.
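For anyone who wants to check the arithmetic, here it is worked through in Python. It assumes the 0.50 seconds shown above is the total processing time for all 3045 pings in the session; the variable names are just for illustration.

```python
# Figures quoted in the post.
pings_received = 3045
snark_total_seconds = 0.50       # assumed to cover every ping in the session
perl_logger_seconds = 0.04       # one run of the Perl logging script

# Average time Snark spends on a single ping, in milliseconds.
snark_ms_per_ping = snark_total_seconds / pings_received * 1000

# How many pings Snark handles in the time the logger records one.
ratio = perl_logger_seconds / (snark_total_seconds / pings_received)

print(f"Snark: {snark_ms_per_ping:.3f} ms per ping")
print(f"Logger takes {ratio:.0f} times as long per ping")
```

On those numbers Snark spends roughly a sixth of a millisecond per ping, so "a hundred times faster" is, if anything, on the modest side.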

What I'm saying is that there are better ways to do things than CGI and Perl. PHP is a significant improvement in terms of performance, but not so much in terms of the language itself.

Python and persistent application servers are where the action is. I tried writing a blogging system exactly that way, but I was unfortunate in my choice of databases (I used Metakit, and it simply doesn't scale). Fortunately, Python SQL programming isn't as bad as all that - it's at least comparable with Perl or PHP.* CherryPy is a very neat way to organise such a system without needing any sort of CGI or PHP front end. And Psyco speeds up even text-processing applications by a good 50%.

Which is not to say that I am busy working on Minx again and hope to have something to show before the end of the year. Not at all.

Update: I've cleaned up the code a little - although it still makes multiple passes through the trackback list - and changed the order in which the filters are applied so that the volume filter comes first (that is, after the whitelist) and the blacklist comes last. That should make things even more efficient since the volume and age filters are O(1) and the blacklist is O(n). Now I just need another 10,000 trackbacks so I can do a comparison. Come in spammer!
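The filter ordering can be sketched in Python as follows. This is an illustration under assumptions, not the real Snark code: the Ping fields, the classify function, the thresholds, and the substring-style blacklist are all hypothetical, and since the post doesn't say what the age filter checks, the sketch guesses that it rejects pings aimed at old posts.

```python
from dataclasses import dataclass


@dataclass
class Ping:
    source: str          # who sent the trackback (hypothetical field)
    url: str             # the URL being advertised
    excerpt: str         # the trackback excerpt text
    post_age_days: int   # age of the post being pinged (assumed meaning of "age")
    recent_count: int    # pings from this source in the current window


def classify(ping, whitelist, blacklist, max_recent=5, max_age_days=30):
    """Apply the cheap O(1) checks first and the O(n) blacklist scan last."""
    if ping.source in whitelist:              # set lookup, O(1)
        return "accept: whitelist"
    if ping.recent_count > max_recent:        # volume filter, O(1)
        return "reject: volume"
    if ping.post_age_days > max_age_days:     # age filter, O(1)
        return "reject: age"
    text = (ping.url + " " + ping.excerpt).lower()
    for pattern in blacklist:                 # O(n) in the number of entries
        if pattern in text:
            return "reject: blacklist"
    return "accept"
```

The point of the ordering is that most spam gets bounced by a constant-time check, so the blacklist loop only runs for the few pings that survive that far.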

Update: I finished the code cleanup and optimisation, and the spammers obliged:

Blacklist Entries: 24 (plus 60 manual entries)
Session Uptime: 13 hours 3 minutes
Pings Received: 11888
Processing Time: 0.50 seconds

I think this one's a keeper.

*In other words, about twenty years behind commercial systems.

Posted by: Pixy Misa at 06:53 AM | Comments (8) | Trackbacks (Suck)

1 Duhhhhhhhhhhhhh... I like pie.

Posted by: Wonderduck at Sunday, November 27 2005 11:34 AM (mAAjO)

2 I like pie too. :)

Posted by: Pixy Misa at Sunday, November 27 2005 07:29 PM (QriEg)

3 Pixy, you should put subtitles on posts like this.

Posted by: Wonderduck at Monday, November 28 2005 12:26 AM (KnWO3)

4 English is Pixy's second language--his first is Geek... ;)

Posted by: Susie at Monday, November 28 2005 10:32 AM (a0oF7)

5 Nah... I understand Geek. Whatever that was, it wasn't Geek. Adminish, maybe.

Posted by: Wonderduck at Monday, November 28 2005 11:15 PM (mAAjO)

6 In non-Geek, what I am saying is this: 1. Instead of making a list of everything we don't want (the old way), now we just delete anything we get lots of and hope for the best. 2. So far this is working much much better than the old way for far less effort. 3. Yay!

Posted by: Pixy Misa at Tuesday, November 29 2005 07:38 AM (QriEg)

7 Yay!

Posted by: Susie at Tuesday, November 29 2005 08:18 AM (a0oF7)

8 Nice. Good application of common sense to technology. When the two meet, anything is achievable.

Posted by: TallDave at Wednesday, November 30 2005 03:21 PM (r7SJo)
