Friday, November 25

Geek

Lawmaker

Pixy's First Law of Economics: Spam is whatever you have too much of.

If you are trying to identify what is and isn't spam, forget blacklists and Bayesian filters. Go by volume. Of course, you have to be in a position to measure the volume, but if you are, that's it for the spammers.

According to Snark!™, duepunti.net currently has a spam ranking of 48. Even if they send me no trackbacks at all for the next hour, they will still be considered a spammer, and anything that comes from them will be automatically deleted (which also bumps their rank up).

Now they're up to 62.8. Slow learners. Of course, I don't provide any feedback, I just null-route the bastards.
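
For the curious, here is a rough sketch of how a ranking like this could behave: each trackback adds to a per-source score, the score fades over time, and anything arriving while the score is over a cutoff gets dropped but still counts against the sender. This is an illustration, not the actual Snark!™ code; the class name, the one-hour half-life, and the threshold of 5 are all assumptions made up for the example.

```python
class SpamRank:
    """Hypothetical per-source score: rises with each ping, fades over time."""

    def __init__(self, half_life_s=3600.0, threshold=5.0):
        self.half_life_s = half_life_s  # assumed: score halves every hour
        self.threshold = threshold      # assumed cutoff, not Snark's real value
        self.scores = {}                # source -> (score, time of last update)

    def current(self, source, now):
        """Return the decayed score for a source at time `now` (seconds)."""
        score, last = self.scores.get(source, (0.0, now))
        return score * 0.5 ** ((now - last) / self.half_life_s)

    def record(self, source, now, weight=1.0):
        """Register one trackback; return True if it should be dropped.

        Dropped trackbacks still add to the score, so a source that keeps
        sending while blocked only pushes its rank higher.
        """
        score = self.current(source, now)
        blocked = score >= self.threshold
        self.scores[source] = (score + weight, now)
        return blocked


rank = SpamRank()
for _ in range(48):
    rank.record("duepunti.net", now=0.0)

# An hour of silence halves the score, which is still well over the cutoff.
still_blocked = rank.record("duepunti.net", now=3600.0)
```

With these made-up numbers, a source sitting at 48 would need more than three hours of complete silence to fade below the cutoff.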

I was just thinking - I could post the Snark!™ stats as a public service. Make it (ugh) XML and people could import it directly. Real-time dynamic spammer detection.
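
Something along these lines would do for the export. It is only a sketch: the element and attribute names (`spamrankings`, `source`, `name`, `rank`) are invented for the example and are not a published format.

```python
import xml.etree.ElementTree as ET


def rankings_to_xml(rankings):
    """Serialize a {source: rank} mapping to XML, highest rank first.

    The schema here is hypothetical, chosen only for illustration.
    """
    root = ET.Element("spamrankings")
    for source, rank in sorted(rankings.items(), key=lambda kv: -kv[1]):
        entry = ET.SubElement(root, "source", name=source)
        entry.set("rank", f"{rank:.1f}")
    return ET.tostring(root, encoding="unicode")


feed = rankings_to_xml({"duepunti.net": 62.8})
```

A consumer could parse the result with `ET.fromstring(feed)` and block any source whose rank is above a cutoff of its own choosing.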

First I have to stop Snark!™ going mad and dropping the ball. It did that last night and generated a gigabyte of error messages. I think a leetle bit more tweaking is in order.

Update: See the link above. It still falls over now and then, so you can expect the values to suddenly get reset to zero on occasion until I (a) get that fixed and/or (b) get it to store the spam rankings.

Update: I did (b), 'cause it's easier to add code than to fix what's already there. Not better, just easier.

Update: Okay, I think I've managed (a) as well. Turned out to be a couple of bugs that only occurred when there were no trackbacks to be processed. This didn't show up in my testing, because that would mean going an entire minute without getting spammed.

Update: The spammers have gone quiet for now. This is probably the first time I've ever wanted to get spamflooded. The point is, the more spam we get, the better the filter works, and the better the data we can provide to others. We now have an IP address list as well, but because the spam died down just as I implemented that function, it presently contains exactly one address.

We receive well over a million trackbacks a month, so I'm sure we'll have a nice set of sample data coming down the wire soon enough.

Update: Change log sort of thingy. Though I really just added that link to test the whitelist.

Posted by: Pixy Misa at 02:34 AM

1 Have I told you lately that you are brilliant as well as beautiful?

Posted by: Susie at Friday, November 25 2005 10:48 AM (a0oF7)

2 Not to mention the lovely accent, too.

Posted by: Wonderduck at Friday, November 25 2005 12:18 PM (mAAjO)

3 This is actually a nice heuristic. Trackback spammers who reduce the number of pings they send you defeat their own purpose, which is to boost their ranking in search engine results. But if they continue to flood you with pings, you ban them simply because of sheer volume.

Posted by: Steven Den Beste at Friday, November 25 2005 12:43 PM (CJBEv)

4 Yep. My initial thought was just to replace the Movable Type trackback script, which is nightmarishly inefficient. (It can use up to 200MB of memory!) Second thought was that nobody minds if trackbacks take a minute or so to appear, and if I sample trackbacks once a minute I can reject anything that we get more than, say, five copies of. And when I did that, suddenly 99.5% of the spam was eliminated before it even reached the blacklist. So I added some more tracking mechanisms, a "fading memory" effect so that people couldn't simply send us 4 spams per minute, a feedback system, and so it grew. It's now blocking 99.9% of spam with - as far as I can see - not one false positive so far. MT Blacklist is blocking about 50% of the remaining spam, so the amount that is actually getting through is tiny. I'm getting twice as much trackback spam on my blog (which is running MT 3.15) as on all 200 Snark-protected blogs put together. That's what I call economy of scale.
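
The once-a-minute sampling step can be sketched roughly like this. It is a minimal illustration, not the real code: the function name is made up, and it assumes "copies" means pings from the same source URL, which is only one way to define a duplicate.

```python
from collections import Counter


def filter_batch(batch, max_copies=5):
    """Filter one minute's worth of pings by volume.

    batch: list of (source_url, target_blog) tuples collected in the window.
    Returns (accepted_pings, rejected_source_urls). Keying on source_url is
    an assumption made for this sketch.
    """
    counts = Counter(source for source, _ in batch)
    accepted = [ping for ping in batch if counts[ping[0]] <= max_copies]
    rejected = {source for source, n in counts.items() if n > max_copies}
    return accepted, rejected


batch = [("http://spam.example/pills", f"blog{i}") for i in range(6)]
batch.append(("http://friend.example/post", "blog1"))
accepted, rejected = filter_batch(batch)
```

Here the six identical pings are thrown out as a group and the single legitimate one goes through. On its own this lets through anyone who stays just under the limit, which is the gap the "fading memory" effect is there to close.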

Posted by: Pixy Misa at Friday, November 25 2005 11:49 PM (AIaDY)

5 Minor update: MT's trackback script is only nightmarishly inefficient if your MySQL database happens to be screwed up. Which it was.

Posted by: Pixy Misa at Tuesday, November 29 2005 08:55 AM (QriEg)
