Thursday, September 03
Never Mind The Quality, Feel The Waveforms
This is the core of the wavetable synthesis algorithm, in Imagine assembler. The advantage of this approach is that it runs on a completely general CPU as long as the multiply instruction is reasonably fast - not more than, say, 20x slower than the TMS32010.
The problem is that it uses memory-mapped registers, and makes 23 memory accesses per sample to read the instructions and data. With five voices and an 18.75kHz sample rate it would use about 70% of the 3MHz system bus.
Only one of those 23 memory accesses actually reads an audio sample, though. So as little as 2k of RAM attached directly to the audio chip would reduce that from 70% to 3%.
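Here's a quick sanity check of those bus numbers in Python rather than Imagine assembler. It's only a back-of-envelope sketch: the function name and parameters are mine, and the figures are the ones quoted above. It comes out at roughly 72% with everything on the system bus and roughly 3% with only the sample fetch left there.

```python
# Back-of-envelope bus load for the memory-mapped version of the loop.
BUS_HZ = 3_000_000      # 3MHz system bus
SAMPLE_RATE = 18_750    # 18.75kHz output rate
VOICES = 5


def bus_load(accesses_per_sample, voices=VOICES,
             sample_rate=SAMPLE_RATE, bus_hz=BUS_HZ):
    """Fraction of system bus cycles eaten by the audio loop."""
    return accesses_per_sample * voices * sample_rate / bus_hz


if __name__ == "__main__":
    print(f"all 23 accesses on the bus: {bus_load(23):.3f}")
    print(f"sample fetch only:          {bus_load(1):.3f}")
```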
It uses four registers - AB and CD as 20 bit accumulators, W and X for indexing.
I'd like to work out some DSP instructions that would reduce this count, and also figure out a nice way to handle stereo. This code is mono only, though it can assign a voice to either left or right the way the Amiga did.
Instruction      Cycles  Comment
CLR CD           1       'Clear the audio accumulator
LD W, ($100)     2       'Load the base register for the voice
LD X, ($102)     2       'Load the current offset
ADD X, ($104)    2       'Add the step
CMP XH, ($107)   2       'Compare the high byte with the sample size
IFGE 1           1       'Next instruction only executes if the offset is past the end of the sample
CLR X            1       'Set the offset back to the start
ST X, ($102)     2       'And save it
ADD W, XH        1       'Add the high byte of the current offset to the address
CLR A            1       'Clear the high byte of AB
LD B, (W)        2       'Load the sample into the low byte of AB
MUL B, ($108)    2+N     'Multiply the sample by the volume and store in AB
ADD CD, AB       1       'And accumulate the resulting sample
OUT $01000, C    3       'Write the high byte of the result to the audio DAC
Registers per LFO:
$0 Sample base
$2 Scaled offset
$4 Step size
$6 Sample size
$8 Volume
$9 Balance
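For anyone who'd rather read the loop as Python, here is a rough model of it. It is a sketch under assumptions, not the real thing: I'm assuming 10 bit bytes (so XH is the offset shifted right by 10), unsigned samples, and memory as a plain list. The Voice class and the function names are made up for illustration; the real values live in the memory-mapped registers at $100 onwards.

```python
from dataclasses import dataclass

BYTE_BITS = 10                      # assumption: 10 bit bytes, 20 bit register pairs
BYTE_MASK = (1 << BYTE_BITS) - 1
ACC_MASK = (1 << (2 * BYTE_BITS)) - 1


@dataclass
class Voice:
    base: int     # $0 sample base address
    offset: int   # $2 scaled offset, integer part in the high byte
    step: int     # $4 step size, same fixed-point format as offset
    size: int     # $6 sample size, in whole samples
    volume: int   # $8 volume


def tick(voice, memory):
    """Advance one voice by one output sample and return sample * volume."""
    voice.offset += voice.step                          # ADD X, ($104)
    if (voice.offset >> BYTE_BITS) >= voice.size:       # CMP XH, ($107) / IFGE 1
        voice.offset = 0                                # CLR X
    # ST X, ($102) is implicit: voice.offset is already updated in place
    address = voice.base + (voice.offset >> BYTE_BITS)  # ADD W, XH
    sample = memory[address]                            # CLR A / LD B, (W)
    return sample * voice.volume                        # MUL B, ($108)


def mix(voices, memory):
    """Sum all voices and return the byte that would go to the DAC."""
    acc = 0                                             # CLR CD
    for voice in voices:
        acc = (acc + tick(voice, memory)) & ACC_MASK    # ADD CD, AB
    return (acc >> BYTE_BITS) & BYTE_MASK               # OUT $01000, C


if __name__ == "__main__":
    memory = [0, 100, 200, 300]
    # A step of 512 is 0.5 in this format, so each sample is held for two ticks.
    voice = Voice(base=0, offset=0, step=512, size=4, volume=2)
    print([tick(voice, memory) for _ in range(8)])
```

Note that, like the assembler, this throws away the fractional part of the offset when it wraps back to the start of the sample.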
While this approach is limited in the number of voices unless I can speed up that loop - five or six at most - modulators can run at a much lower sample rate. For example, the ADSR envelope might be defined as 128 values with a sample rate of 500 Hz or so. Twenty such modulators would use fewer cycles than a single voice.
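The arithmetic behind that last claim, as a tiny Python sketch. The big assumption (mine) is that one modulator update costs about the same as one pass of the voice loop; the helper name is hypothetical.

```python
VOICE_RATE_HZ = 18_750      # audio sample rate for a voice
MODULATOR_RATE_HZ = 500     # update rate for an envelope or similar
ENVELOPE_LENGTH = 128       # values in the example ADSR table


def loop_passes_per_second(count, rate_hz):
    """Loop passes needed per second for `count` oscillators at `rate_hz`."""
    return count * rate_hz


if __name__ == "__main__":
    one_voice = loop_passes_per_second(1, VOICE_RATE_HZ)
    twenty_mods = loop_passes_per_second(20, MODULATOR_RATE_HZ)
    print(one_voice, twenty_mods)
    # Stepping one table entry per update, a 128 value envelope plays out in:
    print(ENVELOPE_LENGTH / MODULATOR_RATE_HZ, "seconds")
```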
How about if we gave our audio processor 64 bytes of on-chip RAM, accessible as registers R0-R63? Just a simple single-ported register file, but with a 20 bit datapath and available same-cycle rather than next-cycle for selected operations only. We now have:
Instruction      Cycles  Comment
LD W, R0         1       'Load the base register for the voice
LD X, R2         1       'Load the current offset
ADD X, R4        1       'Add the step
CMP XH, R7       1       'Compare the high byte with the sample size
IFGE 1           1       'Next instruction only executes if the offset is past the end of the sample
CLR X            1       'Set the offset back to the start
ST X, R2         1       'And save it
ADD W, XH        1       'Add the high byte of the current offset to the address
LD B, (W)        2       'Load the sample into the low byte of AB
MAC AL, B, R8    1+N     'Multiply the sample by the volume and add to the left accumulator
MAC AR, B, R9    1+N     'And the same for the right
This more DSP-ish version is a fair bit faster and requires fewer registers. I've added two audio accumulators, AL and AR, and a MAC instruction. I've removed the initial clear and final write to DAC operations here, because they only operate once all five voices have been calculated.
So we can reduce 19+N cycles for mono down to 12+2N for stereo. How well this works rather depends on the value of N. I'd like to have a balance register rather than two volume controls, but it makes no sense algorithmically, requiring more than double the computation. It can be a virtual virtual virtual register though - if you update the virtual balance and mono volume registers in software, the virtual DSP updates the virtual stereo volume registers in the virtual hardware to comply.
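To put numbers on "depends on the value of N", here is a small Python sketch of the two cycle counts, plus one possible way software could turn a balance and a mono volume into the two stereo volumes. The linear pan law, the 0 to 1023 balance range and all the function names are my assumptions for illustration; nothing here is fixed by the design above.

```python
def mono_cycles(n):
    """Per-voice cycles for the memory-mapped mono loop."""
    return 19 + n


def stereo_cycles(n):
    """Per-voice cycles for the register-file stereo loop with two MACs."""
    return 12 + 2 * n


def stereo_volumes(volume, balance, balance_max=1023):
    """Split a mono volume into (left, right) with a linear pan.

    balance 0 is hard left, balance_max is hard right (assumed convention).
    """
    right = volume * balance // balance_max
    left = volume - right
    return left, right


if __name__ == "__main__":
    for n in (4, 7, 10):
        print(n, mono_cycles(n), stereo_cycles(n))
    print(stereo_volumes(100, 0), stereo_volumes(100, 1023))
```

The stereo loop is cheaper than the old mono loop while N is below 7, breaks even at 7, and costs more beyond that.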
8 of the 12 base cycles here are used just to calculate an address. But it's a really complicated address - essentially a 30 bit fixed-point value that we map back to a 20 bit address space. You can see we're calculating a 20 bit offset in X but then adding just the high byte of X to the low end of W. This allows us freedom to slide around the wavetables however we want - hang on one sample for 10 cycles, or advance 10 steps at a time. Or 10.001, if that's what we want. Just using integer addition.
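The fixed-point stepping is easier to see with numbers. This Python sketch assumes the low byte of X is a 10 bit fraction, so a step of 1.0 is 1024 and the smallest step is 1/1024, which is where the 10.001 comes from. The helper names are mine, and register width and end-of-sample wrapping are ignored to keep it short.

```python
FRAC_BITS = 10              # assumption: low byte of X is a 10 bit fraction
ONE = 1 << FRAC_BITS        # 1.0 in this fixed-point format


def to_step(rate):
    """Convert a playback rate in samples per tick to a fixed-point step."""
    return round(rate * ONE)


def walk(base, step, count):
    """Addresses read over `count` ticks, using integer adds only."""
    offset = 0
    visited = []
    for _ in range(count):
        offset += step                                # ADD X, step
        visited.append(base + (offset >> FRAC_BITS))  # ADD W, XH
    return visited


if __name__ == "__main__":
    print(walk(0, to_step(0.1), 12))    # hangs on one sample for several ticks
    print(walk(0, to_step(10), 3))      # advances ten samples per tick
    print(to_step(10.001))              # one fraction unit above ten
```

With a step one unit above 10.0, the extra fraction adds up to exactly one extra sample after 1024 ticks.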
I tried modifying the code to save the current address rather than using base+offset, but the 30 bit virtual virtual addresses make everything funky.
As to the value of N, I can set it to anything I can find an excuse for. I can also reduce the precision of the volume multiplier - the Amiga had eight bit samples but a six bit volume control. If there are only 32 volume levels the multiply would take half as long to execute.
I can also pipeline it. If there are two MAC units, left and right, then once the instructions are issued the processor can continue ticking until another instruction accesses AL or AR. The modulator algorithms typically don't use multiplication, so 10 or 12 cycles for a 10x10+20 MAC operation is plenty fast enough. If we pretend to do one bit worth of shift and add per cycle, then an add at the end, that's 11 cycles. I think the 68000 did this, with a couple of tricks like handling two bits at a time, and skipping cycles if both bits were zero. It handled clock cycles very differently, though, which is why it clocked so much higher than a 6809.
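Here's the pretend multiplier as a Python sketch: one multiplier bit per cycle, then one cycle for the accumulate. It is a plain shift-and-add model with a 20 bit accumulator, not a description of any real chip, and the function name and cycle accounting are my own. It leaves out the two-bits-at-a-time and skip-on-zero tricks.

```python
ACC_MASK = (1 << 20) - 1    # 20 bit accumulator


def mac_shift_add(acc, sample, volume, bits=10):
    """Return (new_acc, cycles) for acc + sample * volume.

    One bit of `volume` is examined per cycle, then one cycle does the
    final add into the accumulator.
    """
    product = 0
    cycles = 0
    for bit in range(bits):
        if (volume >> bit) & 1:
            product += sample << bit    # shift and add for this bit
        cycles += 1
    acc = (acc + product) & ACC_MASK    # final accumulate
    cycles += 1
    return acc, cycles


if __name__ == "__main__":
    print(mac_shift_add(5, 3, 7))           # full 10 bit volume
    print(mac_shift_add(5, 3, 7, bits=5))   # 32 volume levels
```

Dropping to a 5 bit volume takes the multiply part from ten cycles to five, which is the "half as long" from earlier.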
The other trick is that this version mostly doesn't interleave data and instruction fetches on main memory, which means that my page mode trick to read two bytes per cycle is effective. The first eight instructions can be grouped into four pairs, with each pair taking a single cycle to fetch. Same for the two MAC operations. So this version accesses main memory 7 times, down from 19 originally. That's still 20% of the system bus eaten for five stereo voices plus ten or so modulators, but if we optimise the CPU code similarly it will likely have 20% of its cycles spare anyway.
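Counting those fetches out in Python, as a sketch: the pairing rule is as described above, the constant and function names are mine, and the modulators are left out.

```python
import math

BUS_HZ = 3_000_000
SAMPLE_RATE = 18_750
VOICES = 5

ADDRESS_INSTRUCTIONS = 8    # LD W through ADD W, XH
MAC_INSTRUCTIONS = 2        # MAC AL and MAC AR


def accesses_per_voice():
    """Main memory accesses for one pass of the register-file loop."""
    paired_fetches = math.ceil(ADDRESS_INSTRUCTIONS / 2)  # two instructions per fetch
    sample_load = 2                                       # fetch LD B, (W) plus the sample read
    mac_fetches = math.ceil(MAC_INSTRUCTIONS / 2)         # both MACs in one fetch
    return paired_fetches + sample_load + mac_fetches


def bus_share(accesses):
    """Fraction of the system bus used by five voices."""
    return accesses * VOICES * SAMPLE_RATE / BUS_HZ


if __name__ == "__main__":
    n = accesses_per_voice()
    print(n, f"{bus_share(n):.3f}")
```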
I think this works.
Update: Oops, don't have enough bits in the instruction word to access that new register file. Instead of a flat 64 registers it will be six banks of ten, and we'll need one more cycle in our loop to select the bank.
We can also remove three instructions if samples are simply fixed to 1k in length, but with a pipelined multiplier there's not that much to be gained; we can't issue another MAC instruction any sooner no matter how short the inner loop.
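My guess at why the fixed length saves three instructions, sketched in Python: if the offset is a 20 bit value with a 10 bit fraction (an assumption), its integer part runs from 0 to 1023, so a 1k sample wraps by plain overflow and the CMP, IFGE and CLR X can go. The names are hypothetical.

```python
OFFSET_BITS = 20                        # assumption: X is a 20 bit register
FRAC_BITS = 10                          # assumption: low 10 bits are the fraction
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def advance(offset, step):
    """Add the step and let the 20 bit register overflow do the wrap."""
    return (offset + step) & OFFSET_MASK


def index(offset):
    """Sample index within the fixed 1k wavetable."""
    return offset >> FRAC_BITS


if __name__ == "__main__":
    offset = (1023 << FRAC_BITS) + 512  # sample 1023.5
    offset = advance(offset, 1 << FRAC_BITS)
    print(index(offset), offset & ((1 << FRAC_BITS) - 1))
```

Unlike the CLR X version, the overflow keeps the fractional part across the wrap.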
I'm also going to memory map the 64 word register file, for two reasons: first, it means any regular instruction can access it and I don't need to define a full set of DSP instructions for corner cases. And second, really sneaky programmers can load code into it and use it as a cache.
I'm going to do the same thing for the I/O controller - minus the dedicated MAC hardware. It will have another trick up its sleeve though.
The audio, video, and I/O processors will share a subset of the CPU's instruction set, both at assembler and binary level. I'll try to make it so there's no overlap so that an imaginary future version could have the full works in one package.
Posted by: Pixy Misa at 04:30 PM
Post contains 1347 words, total size 10 kb.
57kb generated in CPU 0.0882, elapsed 0.1576 seconds.
56 queries taking 0.1463 seconds, 358 records returned.
Powered by Minx 1.1.6c-pink.









