Ambient Irony

CAN I BE OF ASSISTANCE?
Shut it!

Friday, March 25

Going, Going, Gone!

It took nearly 24 hours all told - not including the backups, which took 48 hours themselves - but I'm on Windows 7 now, and it's working fine.

One critical point: If you have a Realtek network controller (either a card or built into your motherboard), download the Windows 7 driver for it from the manufacturer's site before upgrading, because your network will be seriously dysfunctional afterwards. The driver that ships with Windows 7 delivers only slightly better average speeds than dial-up - even on your local network - and frequently stops working entirely for several seconds at a time.

Posted by: Pixy Misa at 12:30 PM | No Comments

It Keeps Going, And Going, And Going

Thirteen hours into my Windows 7 upgrade now.

Still going.

The progress indicator has, thankfully, moved from where it was six hours ago, and is now at 2,099,020 of 2,777,119.

Posted by: Pixy Misa at 02:07 AM | No Comments

Thursday, March 24

Please Wait...

Transferring files, settings, and programs (608,859 of 2,777,119 transferred)

This is the first time I've ever upgraded a Windows system. Usually I'll hang onto them until they're old enough to need replacing or the operating system gets corrupted and dies.*

Nagi is a quad-core machine with 8GB of RAM, and until AMD's new Bulldozer chips arrive later this year there's no upgrade that's worth bothering with. Not that I can reasonably afford, anyway.

So after carefully backing up 2.2TB of miscellaneous stuffs, I kicked off the upgrade at about 2 o'clock this afternoon. It's just gone 9 o'clock now, and the status is exactly as I gave above.

It's not a quick process, not when you start with a 2.5-year-old Vista system with 748 applications installed.

And it's telling me that The Sims 2 may not work afterwards. :(

Also my IDE controller, but I don't think that's even in use.

* Which has happened to me twice, both times due to memory problems of one sort or another.

Posted by: Pixy Misa at 08:02 PM | No Comments

Monday, March 21

Numbers

Running my little Python benchmark again:

        AMD 3.0GHz  Intel 2.93GHz  AMD 2.6GHz  Intel 3.3GHz  Psyco
Loop    0.613       0.690          0.707       0.613         0.013
String  1.103       0.987          1.273       0.876         0.180
Scan    0.540       0.453          0.623       0.402         0.547
Call    1.383       1.140          1.596       1.012         0.100
Mean    3.639       3.270          4.199       2.903         0.840
Score   275         306            238         344           1190
Mark    1000        1113           867         1253          4332
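The derived rows can be reproduced from the timings. What follows is a sketch inferred from the numbers in the table, not the benchmark's own code: the row labelled Mean matches the sum of the four test times, Score matches 1000 divided by that figure, and Mark matches the score rescaled so that the 3.0GHz AMD column reads 1000. The function name `derive` and the dictionary layout are made up for illustration, and the results agree with the table only to within rounding of the published three-decimal timings.

```python
# Timings in seconds, copied from the table above.
TIMINGS = {
    "AMD 3.0GHz":    [0.613, 1.103, 0.540, 1.383],
    "Intel 2.93GHz": [0.690, 0.987, 0.453, 1.140],
    "AMD 2.6GHz":    [0.707, 1.273, 0.623, 1.596],
    "Intel 3.3GHz":  [0.613, 0.876, 0.402, 1.012],
    "Psyco":         [0.013, 0.180, 0.547, 0.100],
}
BASELINE = "AMD 3.0GHz"  # the column whose Mark is 1000 in the table


def derive(timings, baseline=BASELINE):
    """Return {column: (total, score, mark)}. Formulas are inferred from the table."""
    base_total = sum(timings[baseline])
    results = {}
    for name, times in timings.items():
        total = sum(times)                # the row the table labels "Mean"
        score = 1000 / total              # higher is better
        mark = 1000 * base_total / total  # score relative to the baseline column
        results[name] = (total, score, mark)
    return results


if __name__ == "__main__":
    for name, (total, score, mark) in derive(TIMINGS).items():
        print(f"{name:14s} total={total:.3f} score={score:.0f} mark={mark:.0f}")
```

Dividing the two projected marks, 1253 by 867, gives roughly 1.45, which lines up with the 45% figure quoted further down.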

After a little work to eliminate as many of the variables as possible, this is what I get. These scores are from my little Python benchmark, run on Fedora 13 under OpenVZ on my development machine, a 3GHz AMD Phenom II, and the main production server, a 2.93GHz dual Xeon 5670.

One tricky factor is that the Xeon 5670 can actually run at up to 3.33GHz when lightly loaded. I can't see directly what clock speed each core is running at, but by comparing results between busy and quiet times, and taking the best of ten scores for each test when the CPU was lightly loaded, I'm pretty sure I got a snapshot of it running at top speed, and the difference is about 7%. Intel's newer Xeons also have turbo boost, so I've left the numbers unchanged as averages measured on a moderately busy system.

When it comes to new server hardware, I'm projecting these scores to the Opteron 4180, a 2.6GHz $200 chip, and the Xeon E3-1245, a $280 3.3GHz chip. The Opteron clock speed is slower and the Xeon E3 somewhat faster than my test systems, making the difference much more significant. On the other hand, the Opteron has six cores vs. the Xeon E3's four. On the third hand, the Xeon has hyperthreading, which gives a small but measurable boost as well. All that means that the throughput is likely to be pretty much the same between the two chips.

And the Xeon E3 has a downside in that you can't put more than 16GB of RAM on it: It only supports unbuffered memory, and only four modules. The Opteron 4180 supports both unbuffered and registered memory, and up to six modules of the latter, so it can easily take 48GB. (More is possible, but requires more expensive high-density DIMMs.)

Also, the Xeon E3 got side-swiped by the Great Sandy Bridge Chipset Disaster, and isn't actually available.

So the new low-end Intel chips will be measurably faster than the current low-end AMD server chips, about 45%, in response times if not overall throughput.

On the other hand, there's that 16GB limit. Memory is dirt cheap and you want to put as much of it in a server as you can, and being able to put three times as much in the AMD system is pretty significant. (Oh, and the Opteron is a dual-socket CPU, so you can easily scale to 96GB and a dozen cores if you want.)

The Psyco numbers are from my dev environment, and point out once again what a nifty bit of work Psyco is, and that it should have been rolled into the Python core years ago.

Posted by: Pixy Misa at 10:20 PM | No Comments

Wednesday, March 16

All Grist For The Bayesian Mill

I'm busy working on the new (and much needed) spam filter for mu.nu and mee.nu.

The old filter was based on heuristics and blacklists and a couple of security-by-obscurity tricks (a honeypot, a secret question).

The new filter is purely Bayesian.

It's more than a simple text analyser, though. Some of the things I'm doing:

Contextual analysis: A comment about designer shoes might be fine on a fashion blog, but on a politics blog it's almost certainly spam.
Language analysis: A comment in Chinese may or may not be spam, but a comment in Chinese replying to a post in French almost certainly is.
Geographic analysis: Are you in a spam hotspot? Are you in the same part of the world as the blogger?
Content analysis: Is the comment full of crappy Microsoft markup?
Metadata analysis: You can put a name, URL, and email address on your comments. The system treats those specifically as names, URLs, and email addresses, not just more comment text.
Trend analysis: How many comments have you posted in the last ten minutes? How many total? How about under that name vs. that IP? What's the average spam score for comments from that IP?

The problem is, some of these produce tokens that I can add to my big spam token table, while others produce numbers. So I need to work out some heuristics and weights by which to modify the Bayesian score with

SMACK

The key understanding here is that Bayesian analysis makes that problem go away. You don't feed the Bayesian score into a calculation along with a bunch of numbers generated by other heuristics. That just makes more work and reduces the reliability of the core mechanism.

What you do is you simplify the numbers in some way (rounding, logarithms, square roots), turn them into tokens, and throw them into the pool. You want to simplify the numbers so that there's a good chance of a match; for example, a five-digit ratio of content:markup isn't going to get many hits, but one or two digits will.
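A minimal sketch of that simplification step, assuming tokens are plain strings of the form `name:value`. The helper names, the power-of-two bucketing, and the one-decimal rounding are illustrative choices, not the filter's actual code.

```python
import math


def bucket_log2(value):
    """Collapse a non-negative count to its power-of-two bucket (0, 1, 2, 4, 8, ...)."""
    if value <= 0:
        return 0
    return 2 ** int(math.log2(value))


def ratio_token(name, numerator, denominator, digits=1):
    """Round a ratio to `digits` decimal places and render it as a token."""
    ratio = numerator / denominator if denominator else 0.0
    return f"{name}:{round(ratio, digits)}"


def count_token(name, count):
    """Render a count as a coarse, log-bucketed token."""
    return f"{name}:{bucket_log2(count)}"
```

With these helpers, a comment containing 13 links and one containing 9 links both yield the same hypothetical `links:8` token, so the two comments reinforce each other in the token table instead of producing two rarely seen tokens.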

So what we do is we parse, compute, and calculate all these different tokens for a given post, and then we look for the most interesting ones in our database - the ones that, based on our training data, vary the most from the neutral point.

Then we just take the scores for each of those interesting elements, positive or negative, and throw them at Bayes' formula.

And out pops the probability that the comment is spam. (Not just an arbitrary score, but an actual, very realistic, probability.)
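The selection and combination steps might look something like the sketch below. It assumes each token already has a stored spam probability between 0 and 1, and it uses the naive-Bayes combination popularised by Paul Graham's "A Plan for Spam"; the post does not say which exact formula the filter uses. The limit of 15 tokens, the neutral point of 0.5, and the clamping floor of 0.01 are assumed values.

```python
import math


def most_interesting(token_probs, limit=15, neutral=0.5):
    """Return the `limit` probabilities that lie furthest from the neutral point."""
    return sorted(token_probs, key=lambda p: abs(p - neutral), reverse=True)[:limit]


def combine(probs, floor=0.01):
    """Combine per-token spam probabilities, treating the tokens as independent."""
    # Clamp so that a single 0.0 or 1.0 cannot decide the result on its own.
    clamped = [min(max(p, floor), 1.0 - floor) for p in probs]
    # Work in log space to avoid underflow when many tokens are combined.
    log_spam = sum(math.log(p) for p in clamped)
    log_ham = sum(math.log(1.0 - p) for p in clamped)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the result is effectively zero
        return 0.0
    # Equivalent to prod(p) / (prod(p) + prod(1 - p)).
    return 1.0 / (1.0 + math.exp(diff))
```

A typical call would be `combine(most_interesting(probs))`, where `probs` holds the stored probability for every token extracted from one comment.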

And then, based on that, we go and update the scores in the database for every token we pulled from the comment. So if it works out that a comment is spam using one set of criteria, it can train itself to recognise spam using the other identifiable criteria in the comment - based on how distinct those criteria are from non-spam.
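The update step can be sketched with an in-memory table standing in for the database. The class name, the count layout, and the add-one smoothing are assumptions made for illustration; the real filter's schema is not described in the post.

```python
from collections import defaultdict


class TokenTable:
    """In-memory stand-in for the spam token table (the real one lives in a database)."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # token -> [spam_count, ham_count]
        self.totals = [0, 0]                       # comments seen: [spam, ham]

    def train(self, tokens, is_spam):
        """Record one classified comment against every distinct token it produced."""
        index = 0 if is_spam else 1
        self.totals[index] += 1
        for token in set(tokens):
            self.counts[token][index] += 1

    def probability(self, token):
        """Estimate P(spam | token); smoothing keeps rare tokens away from 0 and 1."""
        spam, ham = self.counts.get(token, (0, 0))
        spam_rate = (spam + 1) / (self.totals[0] + 2)
        ham_rate = (ham + 1) / (self.totals[1] + 2)
        return spam_rate / (spam_rate + ham_rate)
```

Because every token from a comment is recorded under the same verdict, a comment caught by, say, its URL also raises the stored probability of its language and posting-rate tokens, which is the self-training effect described above.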

Automatically. Which means I don't have to come back and tweak weights or add items to blacklists; it works it all out from context.

The framework is done; I need to write some database code now, load up some tables (like the GeoIP data), and then start training and testing it. If that goes well, I should have it in place early next week.

I have a ton (4 gigabytes) of known spam to train against, but I need to identify a similar amount of known good comments, and that alone is going to take me a day or two.

I looked at just using a service like Akismet. That, all by itself, would cost me more than all the other expenses for keeping the system running put together. Just filtering what's been filtered by the current edition of the spam filter would have cost upwards of $50,000.

A week or two of fiddly coding and training looks like it should pay for itself very quickly.

Posted by: Pixy Misa at 04:16 PM | Comments (15)

1 The annoying thing I've noticed recently is that the spambots (or heaven forbid, some idiot manually doing this) have taken to copy-pasting stuff randomly from within the blog/post to use as text before dumping their garbage URL under the Web field.

The filter doesn't knock them down quite enough to never show up.

Posted by: Will at Thursday, March 17 2011 02:12 AM (ZYwON)

2 Yeah. The advantage of a Bayesian solution is that (with training) it learns what rules work and what don't, and narrows in on just the ones that work. So any time I have a potentially neat idea for filtering spam, I don't have to spend long hours testing it and calculating optimal cutoff levels, I just chuck it into the mix and let the system train itself. Makes it much easier to keep up with their new tricks.

I'm going to add some more behavioural analysis (because spam generators don't behave like humans in the way they connect to the server, and that's useful data) and possibly add Markovian analysis to the text analyser. But both of those should be relatively simple, because I'll just throw them into the Bayesian pool again. :)

Posted by: Pixy Misa at Thursday, March 17 2011 03:23 AM (PiXy!)

3 Okay, just threw in a context-free behavioural analysis module. Only problem is I don't currently have any training data for that, so I'll have to patch the live server to collect the data.

Posted by: Pixy Misa at Thursday, March 17 2011 03:47 AM (PiXy!)

4 One of the biggest ways a spambot is different is that it's going to be a lot faster than a human. Transaction timestamps should be a huge clue.

Posted by: Steven Den Beste at Thursday, March 17 2011 05:11 AM (+rSRq)

5 Yes, I have a token based on a function of the posting rate.

I've also added improved link parsing, link count, link:text ratio, markup:text ratio, and language vs. location checks.

The only thing remaining is the Markovian analysis, which I'll leave for now because that could significantly impact performance.

So, time to build myself a test and training framework!

Posted by: Pixy Misa at Thursday, March 17 2011 11:07 AM (PiXy!)

6 Unrelated, what problem does SetPageHeight() in util.js solve? I ask because it drives me crazy on Wonderduck's site. With Safari and Chrome, it seems to run before all of the pictures are loaded, calculating a maximum height for the page that is well before the end of each post (presumably because he's not putting height and width attributes on the IMG tags). I have to use Firefox to see all those Rio pictures...

-j

Posted by: J Greely at Thursday, March 17 2011 11:52 AM (fpXGN)

7 I'm going to either fix or remove SetPageHeight(). The system is set up to support a three-column layout with banner and footer, without forcing a fixed content ordering. As of 2008, the only way to make that work cross-browser was by manually recalculating the page height. Ghastly, and also buggy.

I'll be re-testing with the current browser range shortly - probably next week - and fixing some of the CSS oddities like that.

Posted by: Pixy Misa at Thursday, March 17 2011 03:08 PM (PiXy!)

8 Oh, is that what's happening? I've had that problem with Wonderduck's site for a long time now.

Posted by: Steven Den Beste at Thursday, March 17 2011 03:09 PM (+rSRq)

9 Yeah, sorry. It will happen on any page using the default 1.1 templates if you load up enough images - if you don't use size specifications (and frankly, who does?) and you don't have a lengthy sidebar.

It's not supposed to do that, but it does. Not sure if it was a bug at the time it was deployed, but it happens across multiple browsers now, so it needs to get fixed.

Posted by: Pixy Misa at Thursday, March 17 2011 03:15 PM (PiXy!)

10 At least, I think it does. It's definitely the culprit on Wonderduck's blog, anyway.

It's a race condition, and if you wait for all the images to load and then refresh, it will show up fine.

Posted by: Pixy Misa at Thursday, March 17 2011 03:16 PM (PiXy!)

11 I made it a rule to supply a height attribute for all pictures that I post, and thus my blog is immune to the height problem. BTW, Firefox breaks too at certain points. Old Brickmuppet's travel posts have to be reloaded in Firefox.

Posted by: Pete Zaitcev at Thursday, March 17 2011 06:12 PM (9KseV)

12 I'm sorry The Pond is such a pain. If it's any consolation, friends, it does the same thing to me when I try to read my own blog.

Posted by: Wonderduck at Friday, March 18 2011 02:49 PM (W8Men)

13 The Pond is awesome. This is entirely the fault of a conflict between my CSS and Javascript and recent browsers.

Posted by: Pixy Misa at Friday, March 18 2011 03:48 PM (PiXy!)

14 I think one easy fix would be to update the upload code so that the suggested <img> always includes a height= attribute. The uploader knows the dimensions of the image.

Posted by: Pete Zaitcev at Friday, April 01 2011 05:18 AM (9KseV)

15 Pete's idea is an interesting one. In the file upload frame, when it generates the cut-and-paste code for using an image, it could include size parameters instead of just the filename.

Posted by: Steven Den Beste at Friday, April 01 2011 10:26 AM (+rSRq)

Friday, March 11

Huffwin's Law

As an online discussion grows longer, the probability of someone citing the Huffington Post approaches one.

Depending on local statute, you may be allowed to shoot the offender. In Texas, this is actually mandatory.

Posted by: Pixy Misa at 05:44 PM | Comments (1)

Tuesday, March 08

Hiccups

Sorry about the hiccups earlier - incoming DDoS from Turkey (again) and I accidentally screwed up the networking while blocking it.

Posted by: Pixy Misa at 07:32 PM | Comments (10)

Saturday, March 05

Looking At Lupa

I'm doing some testing on Lupa:

Calling a LuaJIT function from Python: 363ns
Calling a Python function from LuaJIT: 447ns
Calling a LuaJIT function from Psyco: 253ns
Calling a Psyco function from LuaJIT: 730ns
Calling a Python function from Python: 177ns
Calling a Psyco function from Psyco: 3ns (!)

I also tested some sample code that calls a Lua function from Python and passes it a Python function as a parameter; that takes about 1.8µs in Python and 2.1µs in Psyco (jumping into and out of the JIT clearly has some overhead).
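Per-call figures like these could be gathered with a small harness along the following lines. This is a sketch, not the code behind the numbers above: it assumes the third-party `lupa` package for the Lua side (skipped if it isn't installed), the helper names are made up, and the results will differ with the machine, the Python version, and the Lua build.

```python
import timeit


def per_call_ns(func, arg, number=200_000, repeat=5):
    """Best-of-`repeat` wall time for one call of func(arg), in nanoseconds."""
    # Passing the callable through `globals` avoids timing an extra lambda call.
    timer = timeit.Timer("f(a)", globals={"f": func, "a": arg})
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


def py_increment(x):
    return x + 1


def main():
    print(f"Python -> Python: {per_call_ns(py_increment, 1):.0f} ns")
    try:
        from lupa import LuaRuntime  # third-party; optional for this sketch
    except ImportError:
        print("lupa not installed; skipping the Lua measurements")
        return
    lua = LuaRuntime()
    lua_increment = lua.eval("function(x) return x + 1 end")
    print(f"Python -> Lua: {per_call_ns(lua_increment, 1):.0f} ns")
    # A Lua function that receives a Python callable and calls back into it.
    lua_apply = lua.eval("function(f) return f(1) end")
    print(f"Python -> Lua -> Python: {per_call_ns(lua_apply, py_increment):.0f} ns")


if __name__ == "__main__":
    main()
```

Taking the best of several repeats, rather than the average, reduces the influence of other load on the machine.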

The worst case, unfortunately, is likely to be the most common one - calling back to Python/Psyco (specifically the Minx API) to get data for the Lua script. Lupa has some nice wrappers for using data structures rather than functions, so I'm going to see how they go.

That said, the worst case is 730 nanoseconds.

The one hiccup is that creating a Lupa LuaRuntime instance leaks about 30kB, and crashes Python after 13,000 to 15,000 instances - even if I force garbage collection. I've posted that to the Lupa mailing list, and will follow up and see if I can help find the problem and fix it.

That can be solved using a worker pool on the web server, with worker processes being retired after (say) 100 requests. The overhead on the server would be quite small, it would make for much better scalability, and would keep potentially buggy libraries or library use under control. (A careless PIL call can use a huge amount of memory.)
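The retirement idea can be sketched with the standard library's `multiprocessing.Pool`, whose `maxtasksperchild` argument replaces a worker process after it has completed that many tasks. This is an illustration of the technique, not the server's actual design; `handle_request` and `serve` are hypothetical names, and the limit of 100 simply follows the figure suggested above.

```python
import multiprocessing
import os


def handle_request(request_id):
    """Stand-in for real request handling; reports which process did the work."""
    return request_id, os.getpid()


def serve(requests, workers=4, max_requests=100, context=None):
    """Run requests through a pool whose workers are replaced after max_requests tasks."""
    ctx = context or multiprocessing.get_context()
    with ctx.Pool(processes=workers, maxtasksperchild=max_requests) as pool:
        # chunksize=1 so each request counts as one task towards the retirement limit.
        return pool.map(handle_request, requests, chunksize=1)
```

Any memory leaked inside a worker is returned to the operating system when that worker exits, so a leak of a few tens of kilobytes per request stays bounded. On platforms that spawn rather than fork new processes, `serve` needs to be called from under an `if __name__ == "__main__":` guard.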

Update: The author has fixed the problem and released a new version of Lupa (0.19) - on the weekend. It now works flawlessly.

Posted by: Pixy Misa at 10:04 PM | No Comments

Friday, March 04

Extra Crunchy

I just realised that with Lupa and the new internal Minx API, I can compile templates down to machine code.

Posted by: Pixy Misa at 06:16 PM | No Comments
