Aah lasagna's gone!

Sunday, July 05


Semantic UI

Semantic UI is Bootstrap done right.

Not that Bootstrap is awful; it's quite good and very useful.  It's just that Semantic UI actually makes sense.  And in the world of HTML, CSS, and JavaScript, that's a rare thing.

Posted by: Pixy Misa at 12:12 AM | Comments (1) | Add Comment | Trackbacks (Suck)
Post contains 40 words, total size 1 kb.

Wednesday, June 17


Nano Nano

So AMD paper-launched their new video card lineup at E3 yesterday.  We already knew that most of the 300-series were just 200-series cards with new stencils and (in some cases) more memory.

The extra memory is welcome, though: with 2GB the 285 came up a bit short, while with 4GB the 380 is a much better card, even though it's the exact same chip.

The real excitement was around the new Fury cards - the top end cards now get a name and not just a number.  We knew that the Fury and Fury X were coming, because AMD announced their use of HBM (High Bandwidth Memory) months ago, but they still kept a surprise up their corporate sleeves.

HBM is a new answer to video cards' ever-growing demand for memory bandwidth.  If you look at Nvidia's current high-end cards, they use a 384-bit memory bus running at an effective 7GHz.  With HBM, AMD have flipped that around and only run at 1GHz - which demands far less power - but on a bus that's 4096 bits wide.
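To put rough numbers on that trade-off, here is a back-of-the-envelope sketch.  It assumes peak bandwidth is simply bus width times effective data rate, with one bit per pin per effective cycle; real-world throughput will be lower.  The function name is made up for illustration.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, effective_rate_ghz: float) -> float:
    """Theoretical peak in GB/s: bytes per transfer times billions of transfers per second."""
    return bus_width_bits / 8 * effective_rate_ghz


# Narrow-and-fast: the 384-bit bus at an effective 7GHz described above.
narrow_fast = peak_bandwidth_gb_per_s(384, 7.0)

# Wide-and-slow: the 4096-bit HBM bus at 1GHz.
wide_slow = peak_bandwidth_gb_per_s(4096, 1.0)

print(f"384-bit @ 7GHz:  {narrow_fast:.0f} GB/s")
print(f"4096-bit @ 1GHz: {wide_slow:.0f} GB/s")
```

So the wide, slow bus comes out ahead on paper (512 vs. 336 GB/s) while clocking at one seventh of the rate.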

And they achieve that by attaching the memory not to a circuit board, but to a silicon interposer.  4096 traces on a circuit board would be hugely expensive - and just plain huge - but on silicon it's easy.  The interposer is far larger than a normal chip, but since it only carries wires and not transistors, it can be built easily on old, reliable equipment, and doesn't have the size restrictions of actual logic chips.

Anyway, AMD showed four cards: the water-cooled Fury X, which offers 50% more performance than their previous high-end card at the same power consumption (8.6 TFLOPS vs. 5.6 TFLOPS); the air-cooled Fury, about 15% slower and 15% cheaper; the forthcoming Fury X2, which is two Fury Xs on a single card; and the R9 Nano, which came as a complete surprise.

Essentially, the Fury X is the fastest single-GPU card AMD can currently make; the Fury is the best price/performance they can achieve; the Fury X2 is the fastest card they can make that can actually fit in a normal computer.

The Nano is designed to deliver the best possible performance per watt.  The Fury X delivers 50% better performance per watt than the previous generation (using the same 28nm silicon process at TSMC), but the Nano is designed to run not at the optimum settings for performance, but at the optimum settings for power consumption, and the result is that it's faster than AMD's previous high-end card at about half the power.

And half the size.  By high-end video card standards, it's tiny, about 6" long.

AMD haven't yet released final specs and pricing for the Nano, but I'll be watching for it eagerly.  I don't need the absolute fastest video card I can get, but the card I have barely fits in the case, and makes upgrades incredibly awkward.  The Nano should be about twice as fast as the card I have, use less power, and take up half the room.  And give me more DisplayPort outputs so I can run a full 4K triple-monitor setup.

The real breakthroughs in performance will come in the next couple of years, as AMD and Nvidia combine HBM and HBM2 (twice as fast) with the next-generation 14nm silicon processes that are finally coming on line for them.  But AMD with their Fury range and Nvidia with their Maxwell lineup (960, 970, 980, and Titan) have given us a tantalising taste of the near future.  Moore's law isn't dead quite yet.

Posted by: Pixy Misa at 09:12 PM | Comments (8) | Add Comment | Trackbacks (Suck)
Post contains 593 words, total size 4 kb.

Tuesday, June 16


Apparently Some Classic JRPG Fan Found A Magic Lamp...

They could have wished for world peace, a cure for cancer, and a really big pie.  But no...

At E3, Sony announced an actual release of the much-delayed The Last Guardian, a PS4 remake of Final Fantasy VII, and (here much of the audience lost their minds) Shenmue III.

One commenter on Reddit described it as a fanfic version of an E3 event.

Posted by: Pixy Misa at 10:12 PM | Comments (9) | Add Comment | Trackbacks (Suck)
Post contains 73 words, total size 1 kb.

Wednesday, May 27



Glitchr explains Unicode.

Ben Frederickson explains Unicode.

Ai Shinozaki explains Unicode.


Posted by: Pixy Misa at 09:47 AM | Comments (6) | Add Comment | Trackbacks (Suck)
Post contains 12 words, total size 1 kb.

Sunday, May 24


So I Was Wondering

Note to self: Implement auto-save, dammit.

I already knew that LMDB supported multiple values per key.  Reading the docs last week, I noticed that values within a key were stored in sorted order.  This evening, I was wondering if it were possible to seek to a particular value within a key.

It is.

This is neat.  It means you can use LMDB as an engine to implement a two-dimensional database like Cassandra, or a data structure server like Redis, with the elements of lists, sets, and hashes individually addressable.
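To make the idea concrete, here is a toy in-memory model of the behaviour: each key holds a sorted set of values, and you can seek to the first value at or after a given one.  This is not LMDB code and says nothing about how LMDB does it internally; the class and method names are invented for illustration.  (In the py-lmdb binding, I believe the matching cursor call is set_range_dup, on a database opened with dupsort enabled.)

```python
import bisect
from collections import defaultdict


class DupSortedStore:
    """Toy model: every key maps to a sorted list of unique values."""

    def __init__(self):
        self._values = defaultdict(list)

    def put(self, key, value):
        """Insert `value` under `key`, keeping the values sorted and unique."""
        values = self._values[key]
        i = bisect.bisect_left(values, value)
        if i == len(values) or values[i] != value:
            values.insert(i, value)

    def seek(self, key, value):
        """Return the values under `key` that are >= `value`, in sorted order."""
        values = self._values.get(key, [])
        return values[bisect.bisect_left(values, value):]


# One key used as a Redis-style set whose members are individually addressable.
users = DupSortedStore()
for member in [b"carol", b"alice", b"dave", b"bob"]:
    users.put(b"set:users", member)

from_c = users.seek(b"set:users", b"c")   # members at or after b"c"
print(from_c)
```

With byte-string keys and values, the same pattern covers lists (values prefixed with an index), sets (the member itself), and hashes (field name plus payload).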

Plus the advantage that unlike Redis, it can live on disk and have B-tree indexes rather than just a hash table.  (Though of course Redis has the advantage of predictable performance - living in memory and accessed directly, it's very consistent.)

The other big advantage of LMDB (for me, anyway) is that it natively supports multiple processes - not just multiple threads, but independent processes - handling transactions and locking automatically.  I love Python, but it has what is known as the Global Interpreter Lock - while you can have many threads, only one of them can be executing Python code at any time.  The other threads can be handling I/O, or calling C libraries that don't access shared data, but can't actually be running your code at the same time.

That puts a limit on the performance of any single Python application, and breaking out into multiple processes means you need to find a way to share data between those processes, which is a lot more fiddly than it is with threads.

LMDB don't care.  Thread, process, all the same, just works.


It does have limitations - it's a single-writer/multiple-reader design, so it will only scale so far unless you implement some sort of sharding scheme on top of it.  But I've clocked it at 100,000 transactions per second, and a million bulk writes per second, so it's not bad at all.

Admittedly that was with the write safety settings disabled, so a server crash could have corrupted my database.  But my main use for LMDB is as a smart distributed data structure cache, so if one node dies it can just be flushed and restarted.  In practical use, as a robust database, the numbers are somewhat lower (though with a smart RAID controller you should still be able to do pretty well).

It also supports a rather nice hot backup facility, where the backup format is either a working LMDB database ready to go (without needing to restore) or a cdbmake format backup (which is plain text if you're using JSON for keys and values), and it can back up around 1GB per second - if you have the I/O bandwidth - and only about 20% slower if the database is in heavy use at the time.


Posted by: Pixy Misa at 01:08 AM | No Comments | Add Comment | Trackbacks (Suck)
Post contains 472 words, total size 3 kb.

Friday, May 22


A Few Of My Favourite Things

Once I got past the segfaults, anyway.  You should be using these things.

MongoDB 3.0 (not earlier versions, though)
Elasticsearch (though their approach to security is remarkably ass-backwards)
LZ4 (and its friend, LZ4_HC)
LMDB and its magical set_range_dup


Ai Shinozaki

Posted by: Pixy Misa at 05:10 PM | Comments (11) | Add Comment | Trackbacks (Suck)
Post contains 65 words, total size 2 kb.

Wednesday, May 13


Some Pig

So, I'm tinkering with what will become Minx 1.2, and testing various stuff, and I'm pretty happy with the performance.

Then I run the numbers, and realise that I'm flooding a 10GbE connection with HTTP requests using a $15 cloud server.

I think we can count that part of the problem space as solved.


Posted by: Pixy Misa at 05:30 PM | Comments (4) | Add Comment | Trackbacks (Suck)
Post contains 56 words, total size 1 kb.


Hard Things

There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors.

Posted by: Pixy Misa at 11:27 AM | Comments (2) | Add Comment | Trackbacks (Suck)
Post contains 18 words, total size 1 kb.

Tuesday, May 12


That'll Do

I was getting about 1000 random record reads per second.
I needed to achieve 10,000 reads per second to make things work.
I wanted to reach 100,000 reads per second to make things run nicely.
I'm currently at 1,000,000.*

That'll do.


* Best test run so far was ~1.6 million records per second, with some special-case optimisations.**  Without optimisations, around 300k. Per thread.

** Since you asked, the problem was with unpacking large cached records into native objects.  A common case in templates is that you only want to access one or two fields in a record - perhaps just the user's name - but unless the record is already a native object you need to load the external representation and parse it to find the field you need.  The solution was to keep an immutable version of the object in the process, sign it with SHA-256, and sign the matching cache entry.  Then, when we need to access the record, we can read the binary data from the cache, compare the signatures, and if they match, we're safe to continue using the existing native structure.  If they don't match, we take the payload, decrypt it (if encryption is enabled), check that the decrypted payload matches the signature (if not, something is badly wrong), uncompress the payload (if compression is enabled), parse it (MsgPack or JSON), instantiate a new object, freeze it, and put it back into the native object cache.  This can take as long as 20 microseconds.
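A stripped-down sketch of that signature check follows.  It is an illustration under stated assumptions, not the actual implementation: the shared cache is a plain dict, the "signature" is a bare SHA-256 digest of the payload, the payload is JSON, the encryption and compression steps are left out, and all the names are made up.

```python
import hashlib
import json
from types import MappingProxyType

shared_cache = {}   # stand-in for the shared cache: key -> (digest, payload bytes)
native_cache = {}   # per-process cache: key -> (digest, frozen native object)


def cache_put(key, record):
    """Serialise a record and store it alongside the SHA-256 digest of the payload."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    shared_cache[key] = (hashlib.sha256(payload).digest(), payload)


def cache_get(key):
    """Return a read-only native object, reparsing only when the digest has changed."""
    digest, payload = shared_cache[key]
    cached = native_cache.get(key)
    if cached is not None and cached[0] == digest:
        return cached[1]                          # fast path: no parsing at all
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("payload does not match its signature")
    # MappingProxyType gives a shallow read-only view, standing in for "freeze".
    frozen = MappingProxyType(json.loads(payload.decode("utf-8")))
    native_cache[key] = (digest, frozen)
    return frozen


cache_put("user:1", {"name": "Alice", "posts": 3})
first = cache_get("user:1")
second = cache_get("user:1")      # digest matches, so the same object comes back
cache_put("user:1", {"name": "Alice", "posts": 4})
third = cache_get("user:1")       # digest changed, so the payload is reparsed
```

The fast path costs one cache read and one digest comparison; the parse only happens when the entry has actually changed.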

Posted by: Pixy Misa at 05:53 PM | Comments (1) | Add Comment | Trackbacks (Suck)
Post contains 253 words, total size 2 kb.

Sunday, April 26


Needs For Speeds

Testing various libraries and patterns on Python 2.7.9 and PyPy 2.5.1

Test          Python   PyPy     Gain
Loop          0.27     0.017    1488%
Strlist       0.217    0.056    288%
Scan          0.293    0.003    9667%
Lambda        0.093    0.002    4550%
Pystache      0.213    0.047    353%
Markdown      0.05     0.082    -39%
ToJSON        0.03     0.028    7%
FromJSON      0.047    0.028    68%
ToMsgPack     0.023    0.012    92%
FromMsgPack   0.02     0.013    54%
ToSnappy      0.027    0.032    -16%
FromSnappy    0.027    0.024    13%
ToBunch       0.18     0.016    1025%
FromBunch     0.187    0.016    1069%
CacheSet      0.067    0.046    46%
CacheGet      0.037    0.069    -46%
CacheMiss     0.017    0.015    13%
CacheFast     0.09     0.067    34%
CachePack     0.527    0.162    225%
PixyMarks     13.16    40.60    209%
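The PixyMarks row is, per the notes that follow, the inverse of the geometric mean of the individual scores.  As a sanity check, here is that calculation applied to the rounded figures in the table.  The function name is made up, and because the inputs are rounded the results only approximate the published row.

```python
import math

# Per-test scores copied from the table, in row order (Loop through CachePack).
python_times = [0.27, 0.217, 0.293, 0.093, 0.213, 0.05, 0.03, 0.047, 0.023, 0.02,
                0.027, 0.027, 0.18, 0.187, 0.067, 0.037, 0.017, 0.09, 0.527]
pypy_times = [0.017, 0.056, 0.003, 0.002, 0.047, 0.082, 0.028, 0.028, 0.012, 0.013,
              0.032, 0.024, 0.016, 0.016, 0.046, 0.069, 0.015, 0.067, 0.162]


def pixymark(times):
    """Inverse of the geometric mean, computed via the mean of the logarithms."""
    log_mean = sum(math.log(t) for t in times) / len(times)
    return 1.0 / math.exp(log_mean)


print(f"Python: {pixymark(python_times):.2f}")
print(f"PyPy:   {pixymark(pypy_times):.2f}")
```

The geometric mean is used rather than the arithmetic mean so that one slow test (CachePack, say) doesn't swamp all the fast ones.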

  • The benchmark script runs all the tests once to warm things up, then runs them three times and takes the mean.  The warm-up matters for PyPy, because it takes some time for the JIT compiler to engage.  The PixyMark score is simply the inverse of the geometric mean of the scores.

    Tests were run on a virtual machine on what I believe to be a Xeon E3 1230, though it might be a 1225 v2 or v3.

  • The Python Markdown library is very slow. The best alternative appears to be Hoep, which is a wrapper for the Hoedown library, which is a fork of the Sundown library, which is a fork of the unfortunately named Upskirt library.   (The author of which is not a native English speaker, and probably had not previously run into the SJW crowd.)

    Hoep is slower for some reason in PyPy than CPython, but still plenty fast.

  • cPickle is an order of magnitude slower than a good JSON or MsgPack codec.

  • The built-in JSON module in CPython is the slowest Python JSON codec. The built-in JSON module in PyPy appears to be the fastest.  For CPython I used uJSON, which seems to be the best option if you're not using PyPy.

  • CPython is very good at appending to strings. PyPy, IronPython (Python for .Net) and Jython (Python for Java) are uniformly terrible at this. This is due to a clever memory allocation optimisation that is tied closely to CPython's garbage collection mechanism, and isn't available in the other implementations.

    I removed the test from my benchmark because for large strings it's so slow that it overwhelms everything else.  Instead, append to a list and join it when you're done, or something along those lines.

  • I generally see about a 6x speedup from PyPy.  In these benchmarks I've been focusing on getting the best possible speed for various functions, using C libraries wherever possible.  A C library called from Python runs at exactly the same speed as a C library called from PyPy, so this has inherently reduced the relative benefits of PyPy.  PyPy is still about 3x faster, though; in other words, migrating to PyPy effectively turns a five-year-old mid-range CPU into 8GHz next-gen unobtainium.  

  • That is, if you are very careful about selecting your libraries.  There's an alternate Snappy compression library available.  It's about the same speed under CPython, but 30x slower under PyPy due to inefficiencies in PyPy's CTypes binding.

  • uWSGI is pretty neat.  The cache tests are run using uWSGI's cache2 module; it's the fastest caching mechanism I've seen for Python so far.  Faster than the native caching decorators I've tested - and it's shared across multiple processes.  (It can also be shared across multiple servers, but that is certain to be slower, unless you have some seriously fancy networking hardware.)

    One note, though: The uWSGI cache2 Python API is not binary-safe.  You need to JSON-encode or Base64-encode your values, or something along those lines.

  • The Bleach package - a handy HTML sanitiser - is so slow that it's useless for web output - you have to sanitise on input, which means that you either lose the original text or have to store both.  Unless, that is, you have a caching mechanism with a sub-microsecond latency.

  • The Bunch package on the other hand - which lets you use object notation on Python dictionaries, so you can say customer.address rather than customer['address'] - is really fast.  I've been using it a lot recently and knew it was fast, but 1.6us to wrap a 30-element dictionary under PyPy is a pretty solid result.

  • As an aside, if you can retrieve, uncompress, unpack, and wrap a record with 30 fields in 8us, it's worth thinking about caching database records.  Except then you have to worry about cache invalidation.  Except - if you're using MongoDB, you can tail the oplog to automatically invalidate cached records.  And if you're using uWSGI, you can trivially fork that off as a worker process.

    Which means that if you have, say, a blogging platform with a template engine that frequently needs to look up related records (like the author or category for a post) this becomes easy, fast, and almost perfectly consistent.  
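Going back to the string-appending point above, here is a minimal sketch of the two patterns side by side.  The function names are made up, and no timings are claimed here; the only point is that both produce the same result, while the list-and-join version doesn't depend on CPython's append optimisation.

```python
def build_by_append(parts):
    """Repeated string append: fine on CPython, painful on the other implementations."""
    out = ""
    for part in parts:
        out += part
    return out


def build_by_join(parts):
    """Collect the pieces in a list and join once at the end."""
    chunks = []
    for part in parts:
        chunks.append(part)
    return "".join(chunks)


pieces = [f"<li>{n}</li>" for n in range(1000)]
appended = build_by_append(pieces)
joined = build_by_join(pieces)
print(appended == joined)
```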

Posted by: Pixy Misa at 01:28 PM | No Comments | Add Comment | Trackbacks (Suck)
Post contains 1403 words, total size 15 kb.
