Rockbox

Tasklist

FS#11702 - mpc filterbank synthesis optimization

Attached to Project: Rockbox
Opened by Andree Buschmann (Buschel) - Monday, 25 October 2010, 19:36 GMT
Task Type Patches
Category Codecs
Status Assigned
Assigned To Andree Buschmann (Buschel)
Operating System All players
Severity Low
Priority Normal
Reported Version Release 3.6
Due in Version Undecided
Due Date Undecided
Percent Complete 0%
Votes 0
Private No

Details

This patch resorts the v-array within mpc's synthesis filter. Through this data is placed more locally for the performance critical function mpc_decoder_windowing_D() and allows ldm-usage in the asm'ed parts.

This first patch does work for simulation but not for ARM or CF builds. An update with more optimized ARM asm will follow, CF asm needs to be corrected by someone with knowledge on CF assembly.

The output is binary identical.
This task depends upon

Comment by Nils Wallménius (nls) - Tuesday, 26 October 2010, 12:20 GMT
here's a patch that adapts the cf asm. checksum for the 128k test sample matches before and after.
Total speedup is ~0.9MHz.
Comment by Andree Buschmann (Buschel) - Tuesday, 26 October 2010, 19:12 GMT
Some interesting (and somehow distracting) results after bringing the ARM asm back to work. Measured on PP5022 (iPod Video) and S5L870x (iPod nano 2G)

svn PP5022: 22.2 MHz S5L: 36.0 MHz (svn is using the ARM asm comparable to 9b)
9a) PP5022: 22.3 MHz S5L: 41.9 MHz
9b) PP5022: 23.1 MHz S5L: 39.8 MHz

Both patches also exchange the "global" memmove with a loop to reduce the moved data. This has no measurable effect on PP5022, but saved ~3.1 MHz on S5L.
Comment by Nils Wallménius (nls) - Sunday, 31 October 2010, 23:21 GMT
I tested some rearrangements of the asm code for arm and got some minor posiyive results on the beast but nothing close to reclaiming the speed loss of the patch so i think the issue is with the memory access changes themselves.
I had another idea for armv6, the set of most significant word multiply instrs can probably be used there to get a little more speed and free up a register but i don't want to code that before i know which version we will keep.
Comment by Andree Buschmann (Buschel) - Monday, 01 November 2010, 09:41 GMT
I am in favour of keeping the svn implementation. We should keep the above patches as documentation in case we may work on this re-arranging again.
Comment by Andree Buschmann (Buschel) - Monday, 01 November 2010, 21:39 GMT
This patch drops the changes from above and reduces stalls for >=arm9. Use two variants of synthesis filterbank: a) arm7, b) arm9 and above.

nano 2g: 36.0 MHz (svn), 34.6 MHz (patch v10)

GigaBeat (beast) was tested with a former patch version: 29.6 MHz (svn), 26.2 MHz (patched).

Test on any other CPU's -- arm9 and above -- is welcome!
Comment by Nils Wallménius (nls) - Tuesday, 02 November 2010, 06:38 GMT
why not test ARM_ARCH == 4 instead of defined(CPU_PP) || (CONFIG_CPU == PNX0101) ?

oh and good job! you clearly had more success than i with this :)
Comment by Andree Buschmann (Buschel) - Tuesday, 02 November 2010, 07:07 GMT
> why not test ARM_ARCH == 4 instead of defined(CPU_PP) || (CONFIG_CPU == PNX0101) ?

Because ARM_ARCH==4 is also valid for arm9 CPUs like S5L870x (iPod nano 2g). As the stall reduced path is faster for those as well, I needed to search for a way to only #ifdef the "ARM_ARCH==4 and arm7" CPUs. As there is no such #define available -- at least to my knowledge -- I simply check for the two CPU-types which are relevant.
Comment by Jonas Häggqvist (rasher) - Tuesday, 02 November 2010, 12:14 GMT
I know nothing of ARM architectures and differences, but maybe it would be better to invent a new define, than to slightly abuse an already existing one? Checking for specific CPUs rather than CPU features is the kind of thing that will get forgotten when adding a new target, isn't it?
Comment by Nils Wallménius (nls) - Tuesday, 02 November 2010, 14:49 GMT
Sorry, hadn't had enough coffee. There is a CPU_ARM7TDMI define though so checking
#ifdef CPU_ARM7TDMI would work
(it's defined in config.h for PP pnx0101 and dsc25) (the latter one not used in any functional port)
Comment by Andree Buschmann (Buschel) - Tuesday, 02 November 2010, 19:59 GMT
You right, it seems I was also low on coffee ;o)
Comment by Andree Buschmann (Buschel) - Tuesday, 02 November 2010, 21:10 GMT
Committed v11 and plus some additional changes to iram configuration.

Decoding mpc 17kbps:
svn: 36.0 MHz
v11: 34.6 MHz
+icode in mpcdec.h: 33.5 MHz
+icode in synth_filter.S: 33.4 MHz
Comment by Nils Wallménius (nls) - Tuesday, 02 November 2010, 23:21 GMT
I made a quick test using the smmul instr and friends, that cut off another 1.2MHz and using the freed up regs in the ldmia's saved another .5MHz so with that 24.5 MHz on the beast i haven't yet checked how much tat changes the output nut it sounds fine to me.
Comment by Andree Buschmann (Buschel) - Wednesday, 03 November 2010, 22:00 GMT
Could you measure the difference in output? If the results is kept in a single (32 bit) register, I would expect to loose about 4 bits of resolution -- which is within the fract part and therefore fine. The difference in the 16 bit output should be max +/-1.
Comment by Nils Wallménius (nls) - Thursday, 04 November 2010, 12:27 GMT
I measured output difference and the diff is +/-1 on the file i checked (128k sample from test_files) using truncating smmul/smmla 28k samples differed (0.18%) using the rounding variants (as in this patch) only 3.6k samples differed (0.02%).
decoding speed with the final patch is ~24.2MHz so a speedup of ~2MHz over svn on the beast.
btw one small optimization from this patch chould be portable to the other arm versions, using indexed adressing in the second store in the loop, which saves one instr in the loop, also the very last instr add r1, r1, #4 can be deleted.
I'll do that after i commit this, do you think it looks good?
Comment by Nils Wallménius (nls) - Thursday, 04 November 2010, 12:40 GMT
um, forgot to attach it
Comment by Andree Buschmann (Buschel) - Tuesday, 09 November 2010, 22:21 GMT
Nils, I rolled back your 2nd commit (indexed adressing) because it slowed down decoding by 2% on my nano 2G (S5L8701 CPU). Quite strange though...
Comment by Nils Wallménius (nls) - Wednesday, 10 November 2010, 12:29 GMT
I saw that but there was no need to revert the ARM7TDMI case too if this affected only the nano2g's core. Also i tested this on my fuzev1 with an arm9tdmi core and there was nodifference there, also that last add insruction that just increases a pointer in a caller saved reg that is just subsequently discarded is still not needed.

The only reason i see for this to be slower is that the stores are in the reversed order which might affect the cache.
Comment by Andree Buschmann (Buschel) - Wednesday, 10 November 2010, 14:30 GMT
Ok, I will re-measure the arm7tdmi change on my iPod Video. Maybe this was a too-fast shot... I will come back to you on irc.
Comment by Andree Buschmann (Buschel) - Wednesday, 10 November 2010, 20:33 GMT
The attached patch does 2 things:
1) use alignment of 16 bytes for the major arrays
2) re-activate your commit

With the alignment set to 16 bytes, there is not much difference in the performance anymore when comparing r28545 against r28544 of the synthesis filter. Now the difference is ~0.05 MHz, before it was ~0.8 MHz. In total mpc with this patch is even a bit faster (33.40 MHz) than svn (r28545: 33.45 MHz).
Comment by Andree Buschmann (Buschel) - Wednesday, 10 November 2010, 21:03 GMT
With the alignment setting there are further optimizations possible which slowed down decoding on S5L8701 before. This v03-version speeds up both S5L8701 and PP5022 by another 0.1 MHz.

Submitted with r28561/r28562/r28563.

Loading...