FS#11759 - Rearrange libmad synthesis memory accesses for ARM

Attached to Project: Rockbox
Opened by MichaelGiacomelli (saratoga) - Monday, 15 November 2010, 04:55 GMT
Task Type Patches
Category Codecs
Status New
Assigned To No-one
Operating System All players
Severity Low
Priority Normal
Reported Version Release 3.6
Due in Version Undecided
Due Date Undecided
Percent Complete 0%
Votes 0
Private No


Work in progress patch. Currently decodes audio but with some glitches. Has a small mountain of debug code included.

The basic idea is to rearrange the D filter coefficients in the synthesis filter so that pairs of them are used sequentially. This is not easy because the taps need to be loaded in the seemingly random order needed by the audio samples. However, this rearrangement seems to be possible:

0 1 2 3 4 5 6 7 (original sequence)
0 2 1 3 4 6 5 7 (new sequence)
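
As a rough illustration of the reordering above (not code from the patch): the sketch below swaps the two middle entries of every group of four, which turns 0 1 2 3 4 5 6 7 into 0 2 1 3 4 6 5 7. The helper name reorder_groups_of_four and the flat int32_t table are assumptions made for the example, as is the idea that the same pattern repeats for every group of four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the patch: reorder a coefficient
 * table in place so that each group of four entries a b c d becomes
 * a c b d. Applied to the indices 0..7 this gives 0 2 1 3 4 6 5 7. */
static void reorder_groups_of_four(int32_t *coef, size_t n)
{
    assert(n % 4 == 0); /* the sketch only handles whole groups of four */

    for (size_t i = 0; i < n; i += 4) {
        int32_t tmp = coef[i + 1];
        coef[i + 1] = coef[i + 2];
        coef[i + 2] = tmp;
    }
}
```

Calling it on a table holding 0..7 leaves the table in the "new sequence" order shown above.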

The complication is that the code assumes that it can start a new filter at any offset, even odd ones, which means each and every filter needs to be rewritten 4 times, once for each of the 4 possible alignments. This patch does that.

Once I'm certain that it works, I intend to convert the D coefficients to packed 16-bit values, then use packed 16-bit multiply instructions on ARMv5E+. This should lead to a small speedup on ARMv4 (just because ldm instructions can be used instead of ldr) and a very large speedup on ARM9E and ARM11 (because packed multiplies are tremendously faster and much easier to pipeline).
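
To make the packed-multiply idea concrete, here is a minimal portable C sketch of what a pair of packed 16-bit multiply-accumulates computes. It is not the patch's code: the names pack16, bottom16, top16 and mac_packed are made up for the example, and the halfword layout (first value in the low half) is an assumption. On ARMv5E the two steps in mac_packed correspond roughly to an SMLABB followed by an SMLATT.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word:
 * lo goes in bits 0..15, hi goes in bits 16..31. */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Extract the signed bottom and top halfwords of a packed word. */
static int16_t bottom16(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }
static int16_t top16(uint32_t w)    { return (int16_t)(uint16_t)(w >> 16); }

/* acc + bottom(x)*bottom(d) + top(x)*top(d).
 * One 32-bit load per operand feeds two multiply-accumulates. */
static int32_t mac_packed(int32_t acc, uint32_t x, uint32_t d)
{
    acc += (int32_t)bottom16(x) * bottom16(d);
    acc += (int32_t)top16(x) * top16(d);
    return acc;
}
```

The point of the layout is that each 32-bit load brings in two coefficients at once, so the load count is halved compared with one ldr per 32-bit coefficient.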

Comment by MichaelGiacomelli (saratoga) - Tuesday, 16 November 2010, 02:20 GMT
Corrected a bug in one of the filters.
Comment by MichaelGiacomelli (saratoga) - Tuesday, 16 November 2010, 02:31 GMT
Above patch is confirmed to produce bit-for-bit identical output to SVN using lame_128k.mp3.
Comment by MichaelGiacomelli (saratoga) - Tuesday, 16 November 2010, 02:38 GMT
Above patch without debug code.
Comment by MichaelGiacomelli (saratoga) - Wednesday, 17 November 2010, 20:35 GMT
Converted to use 16-bit D coefficients. The C code has an RMS error of 1.3 PCM levels and a peak error of 8 levels for lame_128k.mp3. This seems more than acceptable.
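
For reference, error figures like these can be computed by comparing the decoded PCM against the reference output sample by sample. The sketch below is only an illustration and not the tool actually used for the numbers above; the struct and function names are hypothetical, and it assumes both buffers hold 16-bit samples of equal length. It reports the mean squared error rather than the RMS so that it needs no math library; the RMS is the square root of that value.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result type: RMS error is the square root of mean_square. */
struct pcm_error {
    double mean_square; /* mean of squared sample differences */
    int    peak;        /* largest absolute sample difference */
};

/* Compare n samples of test output against a reference decode. */
static struct pcm_error compare_pcm(const int16_t *ref,
                                    const int16_t *test, size_t n)
{
    struct pcm_error e = { 0.0, 0 };
    double sum_sq = 0.0;

    for (size_t i = 0; i < n; i++) {
        int diff = (int)test[i] - (int)ref[i];
        int mag = abs(diff);

        if (mag > e.peak)
            e.peak = mag;
        sum_sq += (double)diff * (double)diff;
    }
    if (n > 0)
        e.mean_square = sum_sq / (double)n;
    return e;
}
```

A peak of 0 means the two outputs are bit-for-bit identical, which is the check used for the earlier 32-bit version of the patch.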

Edit: Note that volume is off in that patch. I'll correct this later.
Comment by MichaelGiacomelli (saratoga) - Saturday, 04 December 2010, 21:21 GMT
Thought of a better way to rearrange the D coefficients. This one is both much simpler and should give significantly better performance on ARM11. In this version the D coefficients are split into two tables, D_even and D_odd, which unsurprisingly contain the even and odd coefficients from the old table. The dewindowing code is then unrolled and rearranged to accommodate the new even and odd tables.

As a result, all memory accesses are now fully sequential, pairs of D coefficients can be packed into a single 32-bit word, and each load of windowed sample data is used to generate 2 output samples.
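
A minimal sketch of the even/odd split, for illustration only. The real D table in libmad is two-dimensional and the patch's layout may differ; the flat array, the 32-bit element type and the helper name split_even_odd are all assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy the even-indexed entries of d into d_even
 * and the odd-indexed entries into d_odd. Each output table holds n / 2
 * entries and can then be walked strictly sequentially. */
static void split_even_odd(const int32_t *d, size_t n,
                           int32_t *d_even, int32_t *d_odd)
{
    assert(n % 2 == 0); /* the sketch only handles an even entry count */

    for (size_t i = 0; i < n / 2; i++) {
        d_even[i] = d[2 * i];
        d_odd[i]  = d[2 * i + 1];
    }
}
```

With the tables split this way, a loop that needs only even (or only odd) taps reads consecutive addresses instead of striding through the original table.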

Remove about 50KB of debug code from that patch.
Write ASM version.
Comment by MichaelGiacomelli (saratoga) - Sunday, 05 December 2010, 04:27 GMT
Above but with a lot of bugs fixed. Output should be correct now.
Comment by MichaelGiacomelli (saratoga) - Tuesday, 07 December 2010, 16:38 GMT
Overlooked some code in the above patch. Now fixed.

Comment by MichaelGiacomelli (saratoga) - Thursday, 09 December 2010, 21:08 GMT
Finally converted all filters to use the new even/odd coefficients. Removed old 'sorted' coefficients introduced in the original patch. Output is identical to SVN.
Comment by MichaelGiacomelli (saratoga) - Saturday, 11 December 2010, 22:46 GMT
* Delete a lot of debug code
* Reintroduce macros for code that won't be moved into the .S file
Comment by MichaelGiacomelli (saratoga) - Monday, 13 December 2010, 00:53 GMT
* Introduce ASM code for the 4 macro functions that won't be included in the .S file.
Comment by MichaelGiacomelli (saratoga) - Tuesday, 14 December 2010, 03:56 GMT
* Clean up most of the debug and dead code
* Finish reordering the body of the for loop

Pretty much all that's left is actually converting the core of each loop to ASM.
Comment by MichaelGiacomelli (saratoga) - Sunday, 19 December 2010, 22:24 GMT
* Rearranged arrays in memory to consolidate pointers and save 2 registers
* Wrote the first half of the first sb_sample function in assembly
Comment by MichaelGiacomelli (saratoga) - Friday, 08 April 2011, 02:11 GMT
Added a simple test file to try debugging the ASM code. Not sure why it currently crashes on decode; probably a dumb mistake somewhere.