Sunday, November 27
or, Winning A Battle In The War On Spam
If you're wondering where I've been these past few days, well, I've been busy snarking.
Snark! is the new MuNu trackback filter. It's based on the simple but elegant idea that if someone sends us lots of trackbacks, we don't want them. Unlike most people, I am in the position to collate trackback data from across two hundred blogs in real time. So if all at once someone sends three pings to Little Miss Attila and two to Ace of Spades and another four to, say, the the Llamas, I can simply say, "This is Spam, and I shall delete it forthwith", and do so.
We get a lot of spam. Tens of thousands of trackbacks a day. Thousands of comments. We are running MT Blacklist, and most of it gets summarily rejected. But. Movable Type is not the most sprightly of applications. It's a dynamically configured CGI app written in Perl. That's not a recipe for sparkling performance, and indeed, sparkle it does not. It chugs along like a diesel engine, a plough horse rather than a thoroughbred. It can take close to half a second, sometimes more, for Movable Type to decide to reject a trackback.
And when the spammers really get to work, we can receive a thousand trackbacks a minute.
Snark stops 99.8% of trackback spam before it even gets to Movable Type, and it does it very very efficiently. How efficiently? This efficiently:
Blacklist Entries: 23 (plus 61 manual entries)3000 trackbacks stopped, 360 web pages updated, 360 blacklists exported, in the same time it takes Movable Type to do one.
Session Uptime: 6 hours 0 minutes
Pings Received: 3045
Processing Time: 0.50 seconds
This is not a slam on Movable Type itself. The Perl script I use to simply log the incoming trackbacks takes 40 milliseconds, 0.04 seconds, to run. Snark can process the trackbacks a hundred times faster than the system can record them.
What I'm saying is that there are better ways to do things than CGI and Perl. PHP is a significant improvement in terms of performance, but not so much in terms of the language itself.
Python and persistent application servers are where the action is. I tried writing a blogging system exactly that way, but I was unfortunate in my choice of databases (I used Metakit, and it simply doesn't scale). Fortunately, Python SQL programming isn't as bad as all that - it's at least comparable with Perl or PHP.* CherryPy is a very neat way to organise such a system without needing any sort of CGI or PHP front end. And Psyco speeds up even text-processing applications by a good 50%.
Which is not to say that I am busy working on Minx again and hope to have something to show before the end of the year. Not at all.
Update: I've cleaned up the code a little - although it still makes multiple passes through the trackback list - and changed the order in which the filters are applied so that the volume filter comes first (that is, after the whitelist) and the blacklist comes last. That should make things even more efficient since the volume and age filters are O(1) and the blacklist is O(n). Now I just need another 10,000 trackbacks so I can do a comparison. Come in spammer!
Update: I finished the code cleanup and optimisation, and the spammers obliged:
Blacklist Entries: 24 (plus 60 manual entries)I think this one's a keeper.
Session Uptime: 13 hours 3 minutes
Pings Received: 11888
Processing Time: 0.50 seconds
Posted by: Wonderduck at Sunday, November 27 2005 11:34 AM (mAAjO)
Posted by: Pixy Misa at Sunday, November 27 2005 07:29 PM (QriEg)
Posted by: Wonderduck at Monday, November 28 2005 12:26 AM (KnWO3)
Posted by: Susie at Monday, November 28 2005 10:32 AM (a0oF7)
Posted by: Wonderduck at Monday, November 28 2005 11:15 PM (mAAjO)
Posted by: Pixy Misa at Tuesday, November 29 2005 07:38 AM (QriEg)
Posted by: Susie at Tuesday, November 29 2005 08:18 AM (a0oF7)
Posted by: TallDave at Wednesday, November 30 2005 03:21 PM (r7SJo)
52 queries taking 0.2665 seconds, 289 records returned.
Powered by Minx 1.1.6c-pink.