• Status Assigned
  • Percent Complete
  • Task Type Patches
  • Category Codecs
  • Assigned To
  • Operating System All players
  • Severity Low
  • Priority Very Low
  • Reported Version Release 3.6
  • Due in Version Undecided
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Rockbox
Opened by Buschel - 2010-10-25

FS#11702 - mpc filterbank synthesis optimization

This patch resorts the v-array within mpc’s synthesis filter. Through this data is placed more locally for the performance critical function mpc_decoder_windowing_D() and allows ldm-usage in the asm’ed parts.

This first patch does work for simulation but not for ARM or CF builds. An update with more optimized ARM asm will follow, CF asm needs to be corrected by someone with knowledge on CF assembly.

The output is binary identical.

nls commented on 2010-10-26 12:20

here’s a patch that adapts the cf asm. checksum for the 128k test sample matches before and
speedup is ~0.9MHz.

Some interesting (and somehow distracting) results after bringing the ARM asm back to work. Measured on PP5022 (iPod Video) and S5L870x (iPod nano 2G)

svn PP5022: 22.2 MHz S5L: 36.0 MHz (svn is using the ARM asm comparable to
PP5022: 22.3 MHz S5L: 41.9
PP5022: 23.1 MHz S5L: 39.8 MHz

Both patches also exchange the “global” memmove with a loop to reduce the moved data. This has no measurable effect on PP5022, but saved ~3.1 MHz on S5L.

nls commented on 2010-10-31 23:21

I tested some rearrangements of the asm code for arm and got some minor posiyive results on the beast but nothing close to reclaiming the speed loss of the patch so i think the issue is with the memory access changes
had another idea for armv6, the set of most significant word multiply instrs can probably be used there to get a little more speed and free up a register but i don’t want to code that before i know which version we will keep.

I am in favour of keeping the svn implementation. We should keep the above patches as documentation in case we may work on this re-arranging again.

This patch drops the changes from above and reduces stalls for >=arm9. Use two variants of synthesis filterbank: a) arm7, b) arm9 and above.

nano 2g: 36.0 MHz (svn), 34.6 MHz (patch v10)

GigaBeat (beast) was tested with a former patch version: 29.6 MHz (svn), 26.2 MHz (patched).

Test on any other CPU’s – arm9 and above – is welcome!

nls commented on 2010-11-02 06:38

why not test ARM_ARCH == 4 instead of defined(CPU_PP) || (CONFIG_CPU == PNX0101) ?

oh and good job! you clearly had more success than i with this :)

why not test ARM_ARCH == 4 instead of defined(CPU_PP) || (CONFIG_CPU == PNX0101) ?

Because ARM_ARCH==4 is also valid for arm9 CPUs like S5L870x (iPod nano 2g). As the stall reduced path is faster for those as well, I needed to search for a way to only #ifdef the “ARM_ARCH==4 and arm7” CPUs. As there is no such #define available – at least to my knowledge – I simply check for the two CPU-types which are relevant.

I know nothing of ARM architectures and differences, but maybe it would be better to invent a new define, than to slightly abuse an already existing one? Checking for specific CPUs rather than CPU features is the kind of thing that will get forgotten when adding a new target, isn’t it?

nls commented on 2010-11-02 14:49

Sorry, hadn’t had enough coffee. There is a CPU_ARM7TDMI define though so checking #ifdef CPU_ARM7TDMI would
defined in config.h for PP pnx0101 and dsc25) (the latter one not used in any functional port)

You right, it seems I was also low on coffee ;o)

Committed v11 and plus some additional changes to iram configuration.

Decoding mpc
in mpcdec.h: 33.5
in synth_filter.S: 33.4 MHz

nls commented on 2010-11-02 23:21

I made a quick test using the smmul instr and friends, that cut off another 1.2MHz and using the freed up regs in the ldmia’s saved another .5MHz so with that 24.5 MHz on the beast i haven’t yet checked how much tat changes the output nut it sounds fine to me.

Could you measure the difference in output? If the results is kept in a single (32 bit) register, I would expect to loose about 4 bits of resolution – which is within the fract part and therefore fine. The difference in the 16 bit output should be max +/-1.

nls commented on 2010-11-04 12:27

I measured output difference and the diff is +/-1 on the file i checked (128k sample from test_files) using truncating smmul/smmla 28k samples differed (0.18%) using the rounding variants (as in this patch) only 3.6k samples differed
speed with the final patch is ~24.2MHz so a speedup of ~2MHz over svn on the
one small optimization from this patch chould be portable to the other arm versions, using indexed adressing in the second store in the loop, which saves one instr in the loop, also the very last instr add r1, r1, #4 can be
do that after i commit this, do you think it looks good?

nls commented on 2010-11-04 12:40

um, forgot to attach it

Nils, I rolled back your 2nd commit (indexed adressing) because it slowed down decoding by 2% on my nano 2G (S5L8701 CPU). Quite strange though...

nls commented on 2010-11-10 12:29

I saw that but there was no need to revert the ARM7TDMI case too if this affected only the nano2g’s core. Also i tested this on my fuzev1 with an arm9tdmi core and there was nodifference there, also that last add insruction that just increases a pointer in a caller saved reg that is just subsequently discarded is still not needed.

The only reason i see for this to be slower is that the stores are in the reversed order which might affect the cache.

Ok, I will re-measure the arm7tdmi change on my iPod Video. Maybe this was a too-fast shot... I will come back to you on irc.

The attached patch does 2 things: 1) use alignment of 16 bytes for the major
re-activate your commit

With the alignment set to 16 bytes, there is not much difference in the performance anymore when comparing r28545 against r28544 of the synthesis filter. Now the difference is ~0.05 MHz, before it was ~0.8 MHz. In total mpc with this patch is even a bit faster (33.40 MHz) than svn (r28545: 33.45 MHz).

With the alignment setting there are further optimizations possible which slowed down decoding on S5L8701 before. This v03-version speeds up both S5L8701 and PP5022 by another 0.1 MHz.

Submitted with r28561/r28562/r28563.


Available keyboard shortcuts


Task Details

Task Editing