FS#11702 - mpc filterbank synthesis optimization
Attached to Project:
Rockbox
Opened by Andree Buschmann (Buschel) - Monday, 25 October 2010, 19:36 GMT
Details: This patch re-sorts the v-array within mpc's synthesis filter. As a result, the data is placed more locally for the performance-critical function mpc_decoder_windowing_D(), which also allows ldm usage in the asm'ed parts.
This first patch works for simulation but not for ARM or CF builds. An update with more optimized ARM asm will follow; the CF asm needs to be corrected by someone with knowledge of CF assembly. The output is binary identical.
Total speedup is ~0.9MHz.
svn PP5022: 22.2 MHz S5L: 36.0 MHz (svn is using the ARM asm comparable to 9b)
9a) PP5022: 22.3 MHz S5L: 41.9 MHz
9b) PP5022: 23.1 MHz S5L: 39.8 MHz
Both patches also exchange the "global" memmove with a loop to reduce the moved data. This has no measurable effect on PP5022, but saved ~3.1 MHz on S5L.
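A minimal sketch of the memmove idea, not the code from the patch: the buffer layout, sizes and function names below are made up for illustration. It assumes the buffer is split into rows of which only the first `keep` words are still needed, so a per-row loop moves far less data than one memmove over the whole buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Whole-buffer variant: a single memmove shifts every word of the
 * buffer up by `shift` positions, whether it is still needed or not. */
static void shift_global(int32_t *v, size_t total, size_t shift)
{
    memmove(v + shift, v, (total - shift) * sizeof *v);
}

/* Per-row variant (hypothetical layout): only the first `keep` words of
 * each row are live, so only those are moved.
 * Requires keep + shift <= row_len. */
static void shift_rows(int32_t *v, size_t rows, size_t row_len,
                       size_t keep, size_t shift)
{
    for (size_t r = 0; r < rows; r++) {
        int32_t *row = v + r * row_len;
        memmove(row + shift, row, keep * sizeof *row);
    }
}
```

With 4 rows of 8 words, keep = 3 and shift = 2, the global variant moves 30 words while the loop moves 12, and the live words end up in the same places in both cases.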
I had another idea for armv6: the set of most-significant-word multiply instructions can probably be used there to get a little more speed and free up a register, but I don't want to code that before I know which version we will keep.
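For readers unfamiliar with these instructions, here is a portable C model of what a most-significant-word multiply computes. ARMv6 provides this as SMMUL, with SMMLA as the accumulating form. The function names are made up for illustration, and whether the filter would use the instructions exactly like this is not stated in the thread.

```c
#include <stdint.h>

/* Upper 32 bits of the signed 64-bit product a * b.
 * Assumes an arithmetic right shift for negative values, which is what
 * GCC and Clang implement. */
static inline int32_t mul_msw(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Accumulating form: acc plus the upper 32 bits of a * b.
 * The addition is done in unsigned arithmetic so it wraps instead of
 * triggering signed overflow. */
static inline int32_t mla_msw(int32_t acc, int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)acc + (uint32_t)mul_msw(a, b));
}
```

Because only the high word is produced, no second register is needed for the low half of the product, which is presumably the register saving mentioned above.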
nano 2g: 36.0 MHz (svn), 34.6 MHz (patch v10)
GigaBeat (beast) was tested with a former patch version: 29.6 MHz (svn), 26.2 MHz (patched).
Tests on any other CPUs -- arm9 and above -- are welcome!
Oh, and good job! You clearly had more success than I did with this :)
Because ARM_ARCH==4 is also valid for arm9 CPUs like S5L870x (iPod nano 2g). As the stall reduced path is faster for those as well, I needed to search for a way to only #ifdef the "ARM_ARCH==4 and arm7" CPUs. As there is no such #define available -- at least to my knowledge -- I simply check for the two CPU-types which are relevant.
#ifdef CPU_ARM7TDMI would work.
(It's defined in config.h for PP, pnx0101 and dsc25; the latter is not used in any functional port.)
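The suggested check could look roughly like this. ARM_ARCH and CPU_ARM7TDMI are the names used in this thread. The two #define lines at the top are stand-ins so the fragment compiles on its own, and SYNTH_STALL_REDUCED and the helper function are made-up names for illustration.

```c
/* Stand-ins for what the build configuration would provide (assumption:
 * an ARMv4 target with an ARM7TDMI core, e.g. a PP-based player). */
#define ARM_ARCH 4
#define CPU_ARM7TDMI

/* SYNTH_STALL_REDUCED is a hypothetical name, not taken from the patch.
 * Only "ARM_ARCH == 4 and arm7" targets skip the stall-reduced path;
 * arm9 targets that also report ARM_ARCH == 4 keep it. */
#if (ARM_ARCH == 4) && defined(CPU_ARM7TDMI)
#define SYNTH_STALL_REDUCED 0
#else
#define SYNTH_STALL_REDUCED 1
#endif

static int synth_uses_stall_reduced_path(void)
{
    return SYNTH_STALL_REDUCED;
}
```

Removing the CPU_ARM7TDMI stand-in models an arm9 target such as the S5L870x, for which the function then returns 1.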
Decoding mpc 17kbps:
svn: 36.0 MHz
v11: 34.6 MHz
+icode in mpcdec.h: 33.5 MHz
+icode in synth_filter.S: 33.4 MHz
Decoding speed with the final patch is ~24.2 MHz, so a speedup of ~2 MHz over svn on the beast.
Btw, one small optimization from this patch could be ported to the other ARM versions: using indexed addressing in the second store in the loop saves one instruction per iteration. Also, the very last instruction, add r1, r1, #4, can be deleted.
I'll do that after I commit this. Do you think it looks good?
The only reason I see for this to be slower is that the stores are in reversed order, which might affect the cache.
1) use alignment of 16 bytes for the major arrays
2) re-activate your commit
With the alignment set to 16 bytes, there is not much difference in the performance anymore when comparing r28545 against r28544 of the synthesis filter. Now the difference is ~0.05 MHz, before it was ~0.8 MHz. In total mpc with this patch is even a bit faster (33.40 MHz) than svn (r28545: 33.45 MHz).
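A minimal sketch of requesting 16-byte alignment for an array, assuming GCC's aligned attribute. The array name and size are placeholders, not the actual declarations from the codec.

```c
#include <stdint.h>

/* Hypothetical array name and size, for illustration only. */
#define V_WORDS 1024

/* GCC/Clang extension: place the array on a 16-byte boundary. */
static int32_t v_buffer[V_WORDS] __attribute__((aligned(16)));

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A check such as is_aligned16(v_buffer) can be used in a debug build to confirm the alignment took effect.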
Submitted with r28561/r28562/r28563.