FS#11702 : mpc filterbank synthesis optimization

Quick Actions

View Dependency Graph

Status Assigned
Percent Complete
0%
Task Type Patches
Category Codecs
Assigned To

Buschel
Operating System All players
Severity Low
Priority Very Low
Reported Version Release 3.6
Due in Version Undecided
Due Date Undecided
Votes
Private

Attached to Project: Rockbox
Opened by Buschel - 2010-10-25

FS#11702 - mpc filterbank synthesis optimization

This patch resorts the v-array within mpc’s synthesis filter. Through this data is placed more locally for the performance critical function mpc_decoder_windowing_D() and allows ldm-usage in the asm’ed parts.

This first patch does work for simulation but not for ARM or CF builds. An update with more optimized ARM asm will follow, CF asm needs to be corrected by someone with knowledge on CF assembly.

The output is binary identical.

mpc_synth_resort_v05.patch (12.7 KiB)

Comments (20)
Related Tasks (0/0)

nls commented on 2010-10-26 12:20

here’s a patch that adapts the cf asm. checksum for the 128k test sample matches before and
speedup is ~0.9MHz.

mpc_cf_asm.diff (2.1 KiB)

Buschel commented on 2010-10-26 19:12

Some interesting (and somehow distracting) results after bringing the ARM asm back to work. Measured on PP5022 (iPod Video) and S5L870x (iPod nano 2G)

svn PP5022: 22.2 MHz S5L: 36.0 MHz (svn is using the ARM asm comparable to
PP5022: 22.3 MHz S5L: 41.9
PP5022: 23.1 MHz S5L: 39.8 MHz

Both patches also exchange the “global” memmove with a loop to reduce the moved data. This has no measurable effect on PP5022, but saved ~3.1 MHz on S5L.

mpc_synth_resort_v09a.patch (25.7 KiB)

mpc_synth_resort_v09b.patch (25.5 KiB)

nls commented on 2010-10-31 23:21

I tested some rearrangements of the asm code for arm and got some minor posiyive results on the beast but nothing close to reclaiming the speed loss of the patch so i think the issue is with the memory access changes
had another idea for armv6, the set of most significant word multiply instrs can probably be used there to get a little more speed and free up a register but i don’t want to code that before i know which version we will keep.

Buschel commented on 2010-11-01 09:41

I am in favour of keeping the svn implementation. We should keep the above patches as documentation in case we may work on this re-arranging again.

Buschel commented on 2010-11-01 21:39

This patch drops the changes from above and reduces stalls for >=arm9. Use two variants of synthesis filterbank: a) arm7, b) arm9 and above.

nano 2g: 36.0 MHz (svn), 34.6 MHz (patch v10)

GigaBeat (beast) was tested with a former patch version: 29.6 MHz (svn), 26.2 MHz (patched).

Test on any other CPU’s – arm9 and above – is welcome!

mpc_stall_v10.patch (9.3 KiB)

nls commented on 2010-11-02 06:38

why not test ARM_ARCH == 4 instead of defined(CPU_PP) || (CONFIG_CPU == PNX0101) ?

oh and good job! you clearly had more success than i with this :)

Buschel commented on 2010-11-02 07:07

why not test ARM_ARCH == 4 instead of defined(CPU_PP) || (CONFIG_CPU == PNX0101) ?

Because ARM_ARCH==4 is also valid for arm9 CPUs like S5L870x (iPod nano 2g). As the stall reduced path is faster for those as well, I needed to search for a way to only #ifdef the “ARM_ARCH==4 and arm7” CPUs. As there is no such #define available – at least to my knowledge – I simply check for the two CPU-types which are relevant.

rasher commented on 2010-11-02 12:14

I know nothing of ARM architectures and differences, but maybe it would be better to invent a new define, than to slightly abuse an already existing one? Checking for specific CPUs rather than CPU features is the kind of thing that will get forgotten when adding a new target, isn’t it?

nls commented on 2010-11-02 14:49

Sorry, hadn’t had enough coffee. There is a CPU_ARM7TDMI define though so checking #ifdef CPU_ARM7TDMI would
defined in config.h for PP pnx0101 and dsc25) (the latter one not used in any functional port)

Buschel commented on 2010-11-02 19:59

You right, it seems I was also low on coffee ;o)

mpc_stall_v11.patch (9.3 KiB)

Buschel commented on 2010-11-02 21:10

Committed v11 and plus some additional changes to iram configuration.

Decoding mpc
36.0
34.6
in mpcdec.h: 33.5
in synth_filter.S: 33.4 MHz

nls commented on 2010-11-02 23:21

I made a quick test using the smmul instr and friends, that cut off another 1.2MHz and using the freed up regs in the ldmia’s saved another .5MHz so with that 24.5 MHz on the beast i haven’t yet checked how much tat changes the output nut it sounds fine to me.

Buschel commented on 2010-11-03 22:00

Could you measure the difference in output? If the results is kept in a single (32 bit) register, I would expect to loose about 4 bits of resolution – which is within the fract part and therefore fine. The difference in the 16 bit output should be max +/-1.

nls commented on 2010-11-04 12:27

I measured output difference and the diff is +/-1 on the file i checked (128k sample from test_files) using truncating smmul/smmla 28k samples differed (0.18%) using the rounding variants (as in this patch) only 3.6k samples differed
speed with the final patch is ~24.2MHz so a speedup of ~2MHz over svn on the
one small optimization from this patch chould be portable to the other arm versions, using indexed adressing in the second store in the loop, which saves one instr in the loop, also the very last instr add r1, r1, #4 can be
do that after i commit this, do you think it looks good?

nls commented on 2010-11-04 12:40

um, forgot to attach it

mpc_armv6.diff (8.9 KiB)

Buschel commented on 2010-11-09 22:21

Nils, I rolled back your 2nd commit (indexed adressing) because it slowed down decoding by 2% on my nano 2G (S5L8701 CPU). Quite strange though...

nls commented on 2010-11-10 12:29

I saw that but there was no need to revert the ARM7TDMI case too if this affected only the nano2g’s core. Also i tested this on my fuzev1 with an arm9tdmi core and there was nodifference there, also that last add insruction that just increases a pointer in a caller saved reg that is just subsequently discarded is still not needed.

The only reason i see for this to be slower is that the stores are in the reversed order which might affect the cache.

Buschel commented on 2010-11-10 14:30

Ok, I will re-measure the arm7tdmi change on my iPod Video. Maybe this was a too-fast shot... I will come back to you on irc.

Buschel commented on 2010-11-10 20:33

The attached patch does 2 things: 1) use alignment of 16 bytes for the major
re-activate your commit

With the alignment set to 16 bytes, there is not much difference in the performance anymore when comparing r28545 against r28544 of the synthesis filter. Now the difference is ~0.05 MHz, before it was ~0.8 MHz. In total mpc with this patch is even a bit faster (33.40 MHz) than svn (r28545: 33.45 MHz).

mpc_align_v02.patch (6.5 KiB)

Buschel commented on 2010-11-10 21:03

With the alignment setting there are further optimizations possible which slowed down decoding on S5L8701 before. This v03-version speeds up both S5L8701 and PP5022 by another 0.1 MHz.

Submitted with r28561/r28562/r28563.

mpc_align_v03.patch (8.9 KiB)

	Tasks related to this task (0)

Loading...

Keyboard shortcuts

Available keyboard shortcuts

Alt + ⇧ Shift + l Login Dialog / Logout
Alt + ⇧ Shift + a Add new task
Alt + ⇧ Shift + m My searches
Alt + ⇧ Shift + t focus taskid search

Tasklist

o open selected task
j move cursor down
k move cursor up

Task Details

n Next task
p Previous task
Alt + ⇧ Shift + e ↵ Enter Edit this task
Alt + ⇧ Shift + w watch task
Alt + ⇧ Shift + y Close Task

Task Editing

Alt + ⇧ Shift + s save task