Wednesday, March 16
I'm busy working on the new (and much needed) spam filter for mu.nu and mee.nu.
The old filter was based on heuristics and blacklists and a couple of security-by-obscurity tricks (a honeypot, a secret question).
The new filter is purely Bayesian.
It's more than a simple text analyser, though. Some of the things I'm doing:
- Contextual analysis: A comment about designer shoes might be fine on a fashion blog, but on a politics blog it's almost certainly spam.
- Language analysis: A comment in Chinese may or may not be spam, but a comment in Chinese replying to a post in French almost certainly is.
- Geographic analysis: Are you in a spam hotspot? Are you in the same part of the world as the blogger?
- Content analysis: Is the comment full of crappy Microsoft markup?
- Metadata analysis: You can put a name, URL, and email address on your comments. The system treats those specifically as names, URLs, and email addresses, not just more comment text.
- Trend analysis: How many comments have you posted in the last ten minutes? How many total? How about under that name vs. that IP? What's the average spam score for comments from that IP?
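To make the metadata and language points concrete, here's a rough Python sketch of field-tagged tokens. It isn't the filter's actual code: the function names, the dict layout of a comment, and the `lang:xx>yy` token format are all made up for illustration.

```python
import re

WORD = re.compile(r"[a-z0-9']+")

def tagged_tokens(field, text):
    """Prefix every word with the field it came from, so 'shoes' in a URL
    and 'shoes' in the comment body get separate scores."""
    return [f"{field}:{word}" for word in WORD.findall(text.lower())]

def comment_tokens(comment, post_language):
    """Build the token pool for one comment (a plain dict in this sketch)."""
    tokens = []
    for field in ("name", "url", "email", "body"):
        tokens += tagged_tokens(field, comment.get(field, ""))
    # Context token (hypothetical format): the comment's language paired
    # with the post's language, so "zh replying to fr" is its own token.
    tokens.append(f"lang:{comment.get('language', '??')}>{post_language}")
    return tokens
```

A word in the URL field ends up as a different token from the same word in the body, which is roughly what treating those fields "specifically" would mean in practice.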
The key insight here is that Bayesian analysis makes the problem of weighting and combining all these signals go away. You don't feed the Bayesian score into a calculation along with a bunch of numbers generated by other heuristics. That just makes more work and reduces the reliability of the core mechanism.
Instead, you simplify the numbers in some way (rounding, logarithms, square roots), turn them into tokens, and throw them into the pool. You want to simplify the numbers so that there's a good chance of a match; for example, a five-digit ratio of content:markup isn't going to get many hits, but one or two digits will.
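A minimal sketch of that simplification step, in Python. The helper names and token formats are hypothetical, and the bucket sizes are just one reasonable choice, not the ones the filter uses.

```python
def ratio_token(name, value, digits=1):
    """Round a ratio to a digit or two so that similar comments
    produce the same token."""
    return f"{name}:{round(value, digits)}"

def count_token(name, count):
    """Bucket a count on a log2 scale: 0, 1, 2-3, 4-7, 8-15, ...
    int.bit_length() equals floor(log2(n)) + 1 for n >= 1."""
    bucket = max(count, 0).bit_length()
    return f"{name}:b{bucket}"
```

With one digit, ratios of 0.37284 and 0.41907 both become `markup_ratio:0.4`, so they reinforce each other in the database instead of each being a one-off.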
So we parse and compute all these different tokens for a given post, and then look up the most interesting ones in our database - the ones that, based on our training data, vary the most from the neutral point.
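Picking the interesting tokens could look something like this. It's a sketch under assumptions: the neutral point is taken to be 0.5, tokens the database has never seen are treated as neutral, and the limit of 15 is arbitrary.

```python
def most_interesting(tokens, spam_prob, limit=15, neutral=0.5):
    """Return up to `limit` (token, probability) pairs, ordered by how
    far each token's spam probability sits from the neutral point.

    `spam_prob` is a dict of token -> probability learned from training
    data; unseen tokens are assumed to be neutral in this sketch.
    """
    scored = {t: spam_prob.get(t, neutral) for t in set(tokens)}
    # Sort by distance from neutral (largest first), then by token name
    # so the ordering is deterministic when distances tie.
    ranked = sorted(scored.items(),
                    key=lambda kv: (-abs(kv[1] - neutral), kv[0]))
    return ranked[:limit]
```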
Then we just take the scores for each of those interesting elements, positive or negative, and throw them at Bayes' formula.
And out pops the probability that the comment is spam. (Not just an arbitrary score, but an actual, very realistic, probability.)
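For reference, here's one common way to do that combination in Python. It's the naive-Bayes form that assumes the tokens are independent and that spam and non-spam are equally likely beforehand; both are my assumptions for the sketch, as is the clamping range.

```python
import math

def combine(probs, floor=0.01, ceiling=0.99):
    """Combine per-token spam probabilities into one probability:

        P = prod(p) / (prod(p) + prod(1 - p))

    computed in log space so long token lists don't underflow.
    """
    # Clamp so a single token can never force the result to exactly 0 or 1.
    clamped = [min(max(p, floor), ceiling) for p in probs]
    log_spam = sum(math.log(p) for p in clamped)
    log_ham = sum(math.log(1.0 - p) for p in clamped)
    diff = log_ham - log_spam
    if diff > 700:  # math.exp would overflow; the answer is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))
```

Two tokens at 0.9 each combine to 0.81 / (0.81 + 0.01), or about 0.988, while a 0.9 and a 0.1 cancel out to roughly 0.5.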
And then, based on that, we go and update the scores in the database for every token we pulled from the comment. So if it works out that a comment is spam using one set of criteria, it can train itself to recognise spam using the other identifiable criteria in the comment - based on how distinct those criteria are from non-spam.
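The update step can be sketched as a pair of counters per token. This is an illustration, not the real database code: the class name, the in-memory storage, and the add-one smoothing are all assumptions.

```python
from collections import Counter

class TokenStore:
    """In-memory stand-in for the token score tables."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam comments containing it
        self.ham = Counter()    # token -> number of good comments containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, tokens, is_spam):
        """Record every distinct token from one classified comment."""
        (self.spam if is_spam else self.ham).update(set(tokens))
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def spam_prob(self, token):
        """Estimate P(spam | token) from add-one-smoothed appearance rates.
        A token never seen in either class comes out at exactly 0.5."""
        s = (self.spam[token] + 1) / (self.n_spam + 2)
        h = (self.ham[token] + 1) / (self.n_ham + 2)
        return s / (s + h)
```

Calling `train` with the filter's own verdict is what lets a comment caught by one set of tokens push up the scores of every other token it carried.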
Automatically. Which means I don't have to come back and tweak weights or add items to blacklists; it works it all out from context.
The framework is done; I need to write some database code now, load up some tables (like the GeoIP data), and then start training and testing it. If that goes well, I should have it in place early next week.
I have a ton (4 gigabytes) of known spam to train against, but I need to identify a similar amount of known good comments, and that alone is going to take me a day or two.
I looked at just using a service like Akismet. That, all by itself, would cost me more than all the other expenses for keeping the system running put together. Just filtering what's been filtered by the current edition of the spam filter would have cost upwards of $50,000.
A week or two of fiddly coding and training looks like it should pay for itself very quickly.
The filter doesn't knock them down quite enough to never show up.
Posted by: Will at Thursday, March 17 2011 02:12 AM (ZYwON)
I'm going to add some more behavioural analysis (because spam generators don't behave like humans in the way they connect to the server, and that's useful data) and possibly add Markovian analysis to the text analyser. But both of those should be relatively simple, because I'll just throw them into the Bayesian pool again.
Posted by: Pixy Misa at Thursday, March 17 2011 03:23 AM (PiXy!)
Posted by: Pixy Misa at Thursday, March 17 2011 03:47 AM (PiXy!)
Posted by: Steven Den Beste at Thursday, March 17 2011 05:11 AM (+rSRq)
I've also added improved link parsing, link count, link:text ratio, markup:text ratio, and language vs. location checks.
The only thing remaining is the Markovian analysis, which I'll leave for now because that could significantly impact performance.
So, time to build myself a test and training framework!
Posted by: Pixy Misa at Thursday, March 17 2011 11:07 AM (PiXy!)
Posted by: J Greely at Thursday, March 17 2011 11:52 AM (fpXGN)
I'll be re-testing with the current browser range shortly - probably next week - and fixing some of the CSS oddities like that.
Posted by: Pixy Misa at Thursday, March 17 2011 03:08 PM (PiXy!)
Posted by: Steven Den Beste at Thursday, March 17 2011 03:09 PM (+rSRq)
It's not supposed to do that, but it does. Not sure if it was a bug at the time it was deployed, but it happens across multiple browsers now, so it needs to get fixed.
Posted by: Pixy Misa at Thursday, March 17 2011 03:15 PM (PiXy!)
It's a race condition, and if you wait for all the images to load and then refresh, it will show up fine.
Posted by: Pixy Misa at Thursday, March 17 2011 03:16 PM (PiXy!)
Posted by: Pete Zaitcev at Thursday, March 17 2011 06:12 PM (9KseV)
Posted by: Wonderduck at Friday, March 18 2011 02:49 PM (W8Men)
Posted by: Pixy Misa at Friday, March 18 2011 03:48 PM (PiXy!)
Posted by: Pete Zaitcev at Friday, April 01 2011 05:18 AM (9KseV)
Posted by: Steven Den Beste at Friday, April 01 2011 10:26 AM (+rSRq)