Meet you back here in half an hour.
What are you going to do?
What I always do - stay out of trouble... Badly.

Friday, March 18

Life

How's The Weather In Sydney Now?

Rain.  Rain rain rainity rain.  Rain with cloudy periods and patchy rain.  With more rain, and occasional showers, sprinkles, and storms.

For at least the next week.

Good.

Posted by: Pixy Misa at 06:24 PM | Comments (7) | Add Comment | Trackbacks (Suck)
Post contains 34 words, total size 1 kb.

Thursday, March 17

Cool

Ready Please Mister Music!

Okay, looks like I'm going to have to either write some more music or wallow in guilt.  Sony Creative Software is going download only, so they're clearing out their stock of loops on physical media at 75% off.

I thought that the delivery charges to Australia would be prohibitive, but I guess CDs and DVDs in cardboard sleeves don't cost much to ship, because it's a flat $30 for FedEx Priority shipping.

So I went through the sale catalog, ticked just about everything I had ever wanted to buy but couldn't quite justify previously, and ordered the whole lot.  Whee!

Should land here late next week, which is perfect.

Posted by: Pixy Misa at 04:31 AM | No Comments | Add Comment | Trackbacks (Suck)
Post contains 113 words, total size 1 kb.

Wednesday, March 16

Geek

All Grist For The Bayesian Mill

I'm busy working on the new (and much needed) spam filter for mu.nu and mee.nu.

The old filter was based on heuristics and blacklists and a couple of security-by-obscurity tricks (a honeypot, a secret question).

The new filter is purely Bayesian.

It's more than a simple text analyser, though.  Some of the things I'm doing:
  • Contextual analysis: A comment about designer shoes might be fine on a fashion blog, but on a politics blog it's almost certainly spam.
  • Language analysis: A comment in Chinese may or may not be spam, but a comment in Chinese replying to a post in French almost certainly is.
  • Geographic analysis: Are you in a spam hotspot?  Are you in the same part of the world as the blogger?
  • Content analysis: Is the comment full of crappy Microsoft markup?
  • Metadata analysis: You can put a name, URL, and email address on your comments.  The system treats those specifically as names, URLs, and email addresses, not just more comment text.
  • Trend analysis: How many comments have you posted in the last ten minutes?  How many total?  How about under that name vs. that IP?  What's the average spam score for comments from that IP?
The problem is, some of these produce tokens that I can add to my big spam token table, while others produce numbers.  So I need to work out some heuristics and weights by which to modify the Bayesian score with

SMACK

The key understanding here is that Bayesian analysis makes that problem go away.  You don't feed the Bayesian score into a calculation along with a bunch of numbers generated by other heuristics.  That just makes more work and reduces the reliability of the core mechanism.

What you do is you simplify the numbers in some way (rounding, logarithms, square roots), turn them into tokens, and throw them into the pool.  You want to simplify the numbers so that there's a good chance of a match; for example, a five-digit ratio of content:markup isn't going to get many hits, but one or two digits will.
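To make the "simplify the numbers, then turn them into tokens" step concrete, here's a minimal sketch.  The function names, token prefixes, and bucket scheme are all made up for illustration; the real filter may round or bucket things quite differently.

```python
def ratio_token(name, value, digits=1):
    """Round a ratio to a digit or two so that similar comments share a token.

    Hypothetical helper: ratio_token("markup_ratio", 0.73914) gives
    "markup_ratio:0.7".
    """
    return f"{name}:{round(value, digits)}"


def count_token(name, count):
    """Bucket a count by powers of two (a cheap integer logarithm).

    Counts 8 through 15 all land in bucket 4, so "9 comments in ten
    minutes" and "12 comments in ten minutes" produce the same token.
    """
    bucket = max(int(count), 0).bit_length()
    return f"{name}:b{bucket}"


# Assumed example values, not real data.
tokens = [
    ratio_token("markup_ratio", 0.73914),
    count_token("recent_posts", 9),
]
```

The point of the coarse buckets is just hit rate: a token only helps if it has shown up often enough in training to have a meaningful score.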

So what we do is we parse, compute, and calculate all these different tokens for a given post, and then we look for the most interesting ones in our database - the ones that, based on our training data, vary the most from the neutral point.

Then we just take the scores for each of those interesting elements, positive or negative, and throw them at Bayes' formula.

And out pops the probability that the comment is spam.  (Not just an arbitrary score, but an actual, very realistic, probability.)
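As a rough sketch of those two steps (pick the most interesting tokens, then combine them), here is one common way to do it: the naive Bayes combination popularised by Paul Graham's spam filtering essays.  The post doesn't say exactly which variant is used here, so treat the formula choice, the limit of 15 tokens, the 0.5 neutral point, and the clamping range as assumptions.

```python
import math


def most_interesting(tokens, spam_prob, limit=15, neutral=0.5):
    """Return the per-token spam probabilities furthest from the neutral point.

    spam_prob is a plain dict of token -> probability; tokens we have
    never seen are treated as neutral.
    """
    probs = [spam_prob.get(t, neutral) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - neutral), reverse=True)
    return probs[:limit]


def combine(probs):
    """Combine independent per-token probabilities into one spam probability.

    Computes prod(p) / (prod(p) + prod(1 - p)) in log space, clamping each
    p to [0.01, 0.99] so a single token can never force a 0 or a 1.
    """
    clamped = [min(0.99, max(0.01, p)) for p in probs]
    log_spam = sum(math.log(p) for p in clamped)
    log_ham = sum(math.log(1.0 - p) for p in clamped)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Assumed example scores, not real training data.
scores = {"shoes": 0.9, "markup_ratio:0.7": 0.9, "hello": 0.5}
spam_probability = combine(most_interesting(["shoes", "markup_ratio:0.7", "hello"], scores))
```

Capping the number of tokens also keeps the exponent small, so the log-space arithmetic stays well within floating-point range.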

And then, based on that, we go and update the scores in the database for every token we pulled from the comment.  So if it works out that a comment is spam using one set of criteria, it can train itself to recognise spam using the other identifiable criteria in the comment - based on how distinct those criteria are from non-spam.
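The self-training loop might look something like this.  It's a toy in-memory table, not the actual database code (which isn't written yet); the class name, the per-token probability estimate, and the clamping are assumptions for the sake of the example.

```python
from collections import defaultdict


class TokenTable:
    """Toy in-memory stand-in for the spam token table."""

    def __init__(self):
        self.spam = defaultdict(int)   # token -> number of spam comments containing it
        self.ham = defaultdict(int)    # token -> number of good comments containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, tokens, is_spam):
        """Update every token pulled from a comment with the final verdict."""
        counts = self.spam if is_spam else self.ham
        for token in set(tokens):
            counts[token] += 1
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def spam_prob(self, token):
        """Estimate P(spam | token) from relative frequencies, clamped to [0.01, 0.99]."""
        s = self.spam.get(token, 0) / max(self.n_spam, 1)
        h = self.ham.get(token, 0) / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # never seen: neutral
        return min(0.99, max(0.01, s / (s + h)))


table = TokenTable()
# A comment judged spam because of its text also teaches the table
# about its (hypothetical) geography and trend tokens.
table.train(["shoes", "geo:xx", "recent_posts:b4"], is_spam=True)
table.train(["hello", "geo:au"], is_spam=False)
```

That's the whole trick: the verdict reached on one set of tokens gets written back against all of them, so the other signals pick up weight without anyone setting it by hand.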

Automatically.  Which means I don't have to come back and tweak weights or add items to blacklists; it works it all out from context.

The framework is done; I need to write some database code now, load up some tables (like the GeoIP data), and then start training and testing it.  If that goes well, I should have it in place early next week.

I have a ton (4 gigabytes) of known spam to train against, but I need to identify a similar amount of known good comments, and that alone is going to take me a day or two.

I looked at just using a service like Akismet.  That, all by itself, would cost me more than all the other expenses for keeping the system running put together.  Just filtering what's been filtered by the current edition of the spam filter would have cost upwards of $50,000.

A week or two of fiddly coding and training looks like it should pay for itself very quickly.

Posted by: Pixy Misa at 04:16 PM | Comments (15) | Add Comment | Trackbacks (Suck)
Post contains 659 words, total size 4 kb.

Friday, March 11

Geek

Huffwin's Law

As an online discussion grows longer, the probability of someone citing the Huffington Post approaches one.

Depending on local statute, you may be allowed to shoot the offender.  In Texas, this is actually mandatory.

Posted by: Pixy Misa at 05:44 PM | Comments (1) | Add Comment | Trackbacks (Suck)
Post contains 36 words, total size 1 kb.

Thursday, March 10

Rant

Patria Non Grata

Is that even correct?

Anyway, if you're trying to read this from Turkey, that's probably not working out too well.

Posted by: Pixy Misa at 06:08 PM | No Comments | Add Comment | Trackbacks (Suck)
Post contains 23 words, total size 1 kb.

Tuesday, March 08

Geek

Hiccups

Sorry about the hiccups earlier - incoming DDoS from Turkey (again) and I accidentally screwed up the networking while blocking it.

Posted by: Pixy Misa at 07:32 PM | Comments (10) | Add Comment | Trackbacks (Suck)
Post contains 22 words, total size 1 kb.

Monday, March 07

Cool

Anteater!


Posted by: Pixy Misa at 06:48 PM | Comments (2) | Add Comment | Trackbacks (Suck)
Post contains 1 words, total size 1 kb.

Life

My Entertainment Unit Just Exploded

No, really.  I was sitting here, reading my email, when there was a horrible crash from the other end of the living room.  One of the glass doors of the entertainment unit had spontaneously disintegrated.  The rubble is all over the floor, still ticking and popping.

The glass is - was - curved, so I suspect it's been under internal stress the entire time and just suddenly gave way.  I have my air conditioner on, and it's in the line of the air stream, so perhaps the temperature differential added to that.

Somewhat unsettling, having furniture explode without warning like that.

Posted by: Pixy Misa at 05:50 PM | Comments (4) | Add Comment | Trackbacks (Suck)
Post contains 106 words, total size 1 kb.

Saturday, March 05

Geek

Looking At Lupa

I'm doing some testing on Lupa:

Calling a LuaJIT function from Python: 363ns
Calling a Python function from LuaJIT: 447ns
Calling a LuaJIT function from Psyco: 253ns
Calling a Psyco function from LuaJIT: 730ns
Calling a Python function from Python: 177ns
Calling a Psyco function from Psyco: 3ns (!)

I also tested some sample code that calls a Lua function from Python and passes it a Python function as a parameter; that takes about 1.8µs in Python and 2.1µs in Psyco (jumping into and out of the JIT clearly has some overhead).

The worst case, unfortunately, is likely to be the most common one - calling back to Python/Psyco (specifically the Minx API) to get data for the Lua script.  Lupa has some nice wrappers for using data structures rather than functions, so I'm going to see how they go.

That said, the worst case is 730 nanoseconds.

The one hiccup is that creating a Lupa LuaRuntime instance leaks about 30kB, and crashes Python after 13,000 to 15,000 instances - even if I force garbage collection.  I've posted that to the Lupa mailing list, and will follow up and see if I can help find the problem and fix it.

That can be solved using a worker pool on the web server, with worker processes being retired after (say) 100 requests.  The overhead on the server would be quite small, it would make for much better scalability, and would keep potentially buggy libraries or library use under control.  (A careless PIL call can use a huge amount of memory.)
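For what it's worth, Python's standard library can already do the retirement part: multiprocessing.Pool takes a maxtasksperchild argument that replaces each worker process after it has handled that many tasks.  The sketch below assumes that approach; handle_request is a hypothetical stand-in for real page rendering, and the actual web server setup may use something else entirely.

```python
import os
from multiprocessing import Pool


def handle_request(request_id):
    """Hypothetical stand-in for rendering one page; reports which process ran it."""
    return request_id, os.getpid()


def serve(request_ids, workers=4, max_requests=100):
    """Run requests through a pool whose workers are retired after max_requests tasks."""
    with Pool(processes=workers, maxtasksperchild=max_requests) as pool:
        # chunksize=1 so that each request counts as one task toward the limit.
        return pool.map(handle_request, request_ids, chunksize=1)


if __name__ == "__main__":
    results = serve(range(1000))
    pids = {pid for _, pid in results}
    print(len(pids), "distinct worker processes used")
```

Anything a worker leaks is handed back to the operating system when that worker exits, so a leaky library costs at most 100 requests' worth of memory per process.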

Update: The author has fixed the problem and released a new version of Lupa (0.19) - on the weekend.  It now works flawlessly.

Posted by: Pixy Misa at 10:04 PM | No Comments | Add Comment | Trackbacks (Suck)
Post contains 285 words, total size 2 kb.

Friday, March 04

Geek

Extra Crunchy

I just realised that with Lupa and the new internal Minx API, I can compile templates down to machine code.

Posted by: Pixy Misa at 06:16 PM | No Comments | Add Comment | Trackbacks (Suck)
Post contains 22 words, total size 1 kb.
