FS#11365 - eabi: libmad experiments on arm

Attached to Project: Rockbox
Opened by Andree Buschmann (Buschel) - Sunday, 06 June 2010, 21:08 GMT
Last edited by Andree Buschmann (Buschel) - Thursday, 10 June 2010, 19:07 GMT
Task Type Patches
Category Codecs
Status Closed
Assigned To No-one
Operating System SW-codec
Severity Low
Priority Normal
Reported Version Release 3.4
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No


When using the new eabi toolchain it is possible to disable the asm version of dct32 within libmad and to use the C version instead. The non-eabi compiled libmad hard-crashed when doing so.
From my experience with mpc, the asm version of dct32 is _slower_ than the C version compiled with -O1. I have measured the decoding speed for a specific mp3 file with eabi and different libmad configurations:

1) -O2 and asm-dct32 => 21.1 MHz (svn)
2) -O2 and C-dct32 => 24.84 MHz
3) -O1 and asm-dct32 => 21.1 MHz
4) -O1 and C-dct32 => 20.5 MHz

So, on arm (arm7tdmi) there is no measurable difference in speed between -O1 and -O2 when using the asm'ed dct32. Combining C-dct32 with -O1 results in the fastest decoding.

The following patch can be used for evaluation on other targets as well. It does not yet distinguish between different CPU types or arm architectures.

Closed by  Andree Buschmann (Buschel)
Thursday, 10 June 2010, 19:07 GMT
Reason for closing:  Accepted
Additional comments about closing:  Submitted with r26746.
Comment by MichaelGiacomelli (saratoga) - Sunday, 06 June 2010, 21:32 GMT
You should disable running the synth filter on COP if you do these tests on PP, otherwise it's hard to measure the true speedup due to waiting on the COP thread.
Comment by Andree Buschmann (Buschel) - Monday, 07 June 2010, 00:17 GMT
You are right, I will do so for clear results. But in the end the dualcore measurement is what counts. This is what happens in the real world, and what will affect the boost/unboost behaviour of the player. Or did I make a logical mistake?
Comment by MichaelGiacomelli (saratoga) - Monday, 07 June 2010, 00:25 GMT
Dual core is what counts for PP. But if you find a way to make the dct faster in general, it would be nice to have it, even if on PP the improvement is hidden by waiting on the synth filter for dual core.
Comment by Andree Buschmann (Buschel) - Monday, 07 June 2010, 04:44 GMT
Here are the results on arm7tdmi without COP:

1) -O2 and asm-dct32 => 37.9 MHz (svn)
2) -O2 and C-dct32 => 41.0 MHz
3) -O1 and asm-dct32 => 37.9 MHz
4) -O1 and C-dct32 => 37.0 MHz

So the savings are ~0.9 MHz in total (37.9 MHz with svn vs. 37.0 MHz with -O1 and C-dct32). Quite a lot when taking into account the latest efforts and results.
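
For anyone who wants to redo the arithmetic, here is a small Python sketch (not part of the patch). The MHz figures are copied from the two measurement lists in this task; the dictionary and function names are made up for illustration, and labelling the first list as "with COP" is an assumption based on the discussion above. Lower MHz is treated as better, as in the rest of this task.

```python
# MHz figures copied from the measurements in this task (lower is better).
# Keys are (optimization level, dct32 implementation).
WITH_COP = {  # first list; assumed to be the dual-core measurement
    ("-O2", "asm"): 21.1,
    ("-O2", "C"): 24.84,
    ("-O1", "asm"): 21.1,
    ("-O1", "C"): 20.5,
}
WITHOUT_COP = {  # second list, arm7tdmi without COP
    ("-O2", "asm"): 37.9,
    ("-O2", "C"): 41.0,
    ("-O1", "asm"): 37.9,
    ("-O1", "C"): 37.0,
}
BASELINE = ("-O2", "asm")  # the configuration marked "(svn)"


def fastest(results):
    """Return the (opt level, dct32) pair with the lowest MHz figure."""
    return min(results, key=results.get)


def savings(results, config, baseline=BASELINE):
    """Return (MHz saved, percent saved) of config relative to baseline."""
    saved = results[baseline] - results[config]
    return saved, 100.0 * saved / results[baseline]


if __name__ == "__main__":
    for name, results in (("with COP", WITH_COP), ("without COP", WITHOUT_COP)):
        best = fastest(results)
        mhz, pct = savings(results, best)
        print(f"{name}: best={best}, saves {mhz:.1f} MHz ({pct:.1f}%)")
```

Results from other targets can be added as further dictionaries in the same shape and compared against their own svn baseline.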

The original flyspray entry that introduced the asm'ed dct32 described this solution as optimized for size rather than for cycles (to better fit into cache, I guess). It seems that current CPUs do not benefit from smaller code size that much anymore.
Comment by MichaelGiacomelli (saratoga) - Monday, 07 June 2010, 21:17 GMT
Testing on AS3525v2:

34.94 MHz stock for 192k.
34.11 MHz without ASM + w/ EABI

Also, without synth_full:


Will try some other tests when I get a chance.
Comment by Andree Buschmann (Buschel) - Thursday, 10 June 2010, 17:59 GMT
This patch version will use -O1 for ARM and -O2 for other CPUs. When using ARM it will also disable the dct32-asm implementation.

Interesting: On PP502x (arm7tdmi, iPod Video) this patch causes a hard crash when not using the eabi toolchain ('*Panic* Stkov mp3dec (1)'). The crash vanishes when disabling the multicore option for this CPU (not defining MPA_SYNTH_ON_COP in codecs/mpa.c). Any ideas?
Comment by Andree Buschmann (Buschel) - Thursday, 10 June 2010, 18:45 GMT
The reason for the crash with non-eabi in combination with COP was that the COP thread's stack was too small. The following patch simply doubles the COP thread's stack size. With it, mp3 decoding on arm7tdmi now takes 20.7 MHz (svn: 21.1 MHz). On other (non-multicore) arm targets the savings are around 0.9 MHz.