This is the bug/patch tracker for Rockbox. Click here for more information.
Quick links: Bugs · Patches · Rockbox frontpage
FS#7166 - musepack decoder performance optimization
Attached to Project:
Rockbox
Opened by Andree Buschmann (Buschel) - Sunday, 13 May 2007, 21:15 GMT+2
Last edited by Thom Johansen (preglow) - Thursday, 30 August 2007, 13:41 GMT+2
Opened by Andree Buschmann (Buschel) - Sunday, 13 May 2007, 21:15 GMT+2
Last edited by Thom Johansen (preglow) - Thursday, 30 August 2007, 13:41 GMT+2
|
DetailsFinalized my "first shot" on musepack performance optimization. I compiled against current Trunk (13378) and tested on my iPOD-Video (30G). Tesfile was a .mpc-File with avg. 196kbps and 348 seconds duration and the "test-codec"-plugin.
Measurements: Trunk-Version decodes in 154s (2.26x realtime) Trunk-Version with -O1 decodes in 140s (2.49x realtime, +10%) Changes in code and their results on top of Trunk/-O1 (synth_filter.c, makefile): Apply 32bit-multiplies for calc_new_V() -> decodes in 131s (2.65x realtime, +17%) Apply 32bit-multiplies for windowing -> decodes in 128s (2.72x realtimem, +20%) Remark: The 32bit-multiplies are faster than the ARM-assembler for 64bit-multiplies on -O1 I also checked the decoder output (Thanks to linuxstb for the wav-write functionality) against the output of my WinAmp-Decoder -- I will do further testing against Trunk-version. The results show that there are differences with a maximum of +/- 30 (for 16Bit samples). In spectral view the additional noise is connected to the filterbank borders -- this was expected -- and has its maximum at roundabout -90dB. The signal itself it always roundabout 40-60dB above -- so it shall be masked anyway. What is contained is this patch: - enable the "test_codec"-feature for performance measurements and comparing decoder output - set compiler option from -O2 to -O1 for libmusepack - several changes in synth_filter.c: - small refacturation of synthesis-filter - minor code clean-up - add 32bit-multiply for fast-DCT-algorithm - add 32bit-multiply for windowing-algorithm - add two new #defines to configure usage of new multiplies (32bit or 64bit) ToDo: The switch to 32bit-multiplies was a first shot to get in touch with the possibilities we have on optimizations. The current implementation will do fixed defined shifts for dct- and window-coefficient-multiplies. For fewer loss of accuracy I propose to do further tweaking for multiplications of samples and coefficients, e.g. taking into account the value of the coefficient to do optimized shifts before multiplications. ToDo: Test with other targets. I adjusted the compile switch as well as the multiplies to optimal performance on my iPOD-Video. If there has gone anythin wrong with this patch, please be patient -- this is my first patch ;o) regards, Andree |
This task depends upon
Closed by Thom Johansen (preglow)
Thursday, 30 August 2007, 13:41 GMT+2
Reason for closing: Accepted
Additional comments about closing: Committed with small changes.
Thursday, 30 August 2007, 13:41 GMT+2
Reason for closing: Accepted
Additional comments about closing: Committed with small changes.
target normal build patched build
nano 170.64s, 230.75% realtime 140.54s, 280.17% realtime
h140 87.44s, 450.32% realtime 87.41s, 450.47% realtime
So from my limited testing your patch makes decoding on my nano faster without any negative effect on speed of my h140.
as such after a clean build it turns out that the real figures are as follows:
ipod nano
normal build 170.64s, 230.75% realtime
patched build 140.54s, 280.17% realtime
iriver h140
normal build h140 87.44s, 450.32% realtime
patched build 121.35s, 324.48% realtime
So your patch dose make decode faster on the ipod, but it also makes it slower on the iriver h140.
As I do not have other hardware than my iPOD can you try the following:
a) undefine USE_32BIT_MULTIPLY_FOR_D in synth_filter.c (apps/codecs/libmusepack) and still use -O1 in makefile
b) undefine USE_32BIT_MULTIPLY_FOR_D in synth_filter.c (apps/codecs/libmusepack) and use -O2 in makefile
c) still use USE_32BIT_MULTIPLY_FOR_D in synth_filter.c (apps/codecs/libmusepack) and use -O2 in makefile
As far as I understand from your measurements the coldfire-assembler for the windowing with D-coefficients
seems to scale quite different to the ARM-assembler.
Thanks in adcance :)
Different CPUs:
Regarding the results by Dwyloc and comments by preglow I added the following changes in configuration: When using Coldfire-CPU the 32bit-multiplies are disabled, Coldfire should now use all the assembler optimizations (and 64bit) as before. When using ARM or other CPU all multiplications are set to 32bit (ARM is definately faster with 32bit, for other CPUs I assume the same as there is no further assembler code within the mpc-decoder).
Performance on iPOD Video:
The decoder now performs with 2,96x realtime on my sample (last patch: 2,72x official version: 2,26x), which is +31% compared to official version.
Question:
Can anyone (maybe Dwyloc) check this patch with his Coldfire-target?
Can anyone check this patch with non-ARM and non-Coldfire-target?
What are the influences of the -O1-switch in makefile, is -O2 faster for non-ARM?
iriver h140
normal build decode time 87.47s, 450.16% realtime
patched build decode time 94.66s, 415.97% realtime
I assume the patch-result was measured with -O1 in makefile.
Can you please check with -O2 (/libmusepack/makefile)?
with -O2 I get decode time 92.52s, 425.59% realtime.
Shouldn't be too much difference at all.
synth_filter.c:
- defining V-coefficients instead of creating variables
- using 32bit-shift for V-coefficients in 64bit-mode to allow usage of MPC_MULTIPLY_FRACT-macro (optimized for Coldfire via assembler)
math.h:
- removed unused macros
makefile:
- use -O1 for iPOD only
- use -O2 for all other targets
This version should not decrease performance on Coldfire-targets anymore.
The performance-gap for coldfire will be further investigated.
The patch needs to be tested (performance measurments) on different platforms to be able to tweak the
compiler setting and the usage of 32x32- vs. 64x64-multiplies towards performance on each platform.
@preglow, dwyloc: Feel free to test and post the results. Thanks a lot for the time you've already spent.
patched v5 decode time 84.30, 467.09% realtime
64bit-standard: 64x64 = 64, then >>32, still 64bit result -> convert to 32bit, then <<5 (due to normalization)
Coldfire-assembler: 32x32 = 64, take upper 32bit -> 32bit-result -> <<5 (due to normalization)
I thought possible overflows would also occure with the standard implementation and not only with Coldfire-assembler.
enable line //#define MPC_MULTIPLY_V(sample, vcoef) MPC_MULTIPLY_EX(sample, vcoef, V_COEFFICIENT_EXPAND)
disable line #define MPC_MULTIPLY_V(sample, vcoef) ( (MPC_MULTIPLY_FRACT(sample, vcoef)) << (32-V_COEFFICIENT_EXPAND) )
This will use another ARM-macro for the multiply.
@preglow: at least v-coef will not reach 0x80000000, so there shall be no overflow at all :/ (*confused*)
iriver h140
patched v5 mod decode time 89.47s, 440.10% realtime
normal build decode time 87.45s, 450.26% realtime
Can you test the following:
use #define MPC_MULTIPLY_V(sample, vcoef) ( (MPC_MULTIPLY_FRACT(sample, vcoef)) )
This way we do not shift-left after multiply (result should have much lower volume then). If the result is still noisy, then the overflow occurs during the multiplication itself. In this case try the following additionally:
use #define _MAKE_V_FAC(value) (((MPC_SAMPLE_FORMAT_MULTIPLY)(value>>X))), whereas X is >=1
This way we can check whether reduction of sample-value will avoid the overflow.
The macro MPC_MULTIPLY_FRACT is extracted to Coldfire-assembler without any explicit type-conversion, the other multiply-macros for Coldfire use explicit type-conversion via function calls (e.g. MPC_MULTIPLY_EX). The only logical difference I've found is that my former patches used these macros with 64-bit integers for DCT-coefficents, the Trunk-version always uses 32bit-integer when using these macros. So, I assume that I also need to use 32-bit coefficients when using MPC_MULTIPLAY_FRACT.
The following patch was changed to take into account this issue. It would be great, if dwyloc and/or preglow could double-check this change (audible quality and performance).
Thank you in advance!
1. dec_trunk_vs_org.png -> this one shows trunk-decoded signal (pink) versus coding noise from encoder (blue)
2. dec_patch_vs_org.png -> the same for patch-decoded signal
3. dec_patch_vs_trunk.png -> this one shows the additional noise which is added through the 32bit-multiply compared to 64bit-multiply
As a first result you will see the additional noise has its maximum at low and very high frequencies. For the high frequencies this additional noise is by far below masking threshold in quiet -- so, not audible. For the low frequencies this noise is still below the simultaneous masking threshold (masking threshod = similar to blue graph in first two pictures).
1. MPC_MULTIPLY_FRACT(X,Y) -> int32 X, int32 Y -> int64 A = X*Y -> result = take upper 32bit of A
2. MPC_MULTIPLY_EX_NOTRUNCATE(X,Y,Z) -> int32 X, int32 Y, int32 Z -> int64 A = X*Y -> result = take 32 lower bits of (A>>Z)
3. MPC_MULTIPLY_NOTRUNCATE(X,Y) -> same as MPC_MULTIPLY_EX_NOTRUNCATE(X,Y,Z) but with Z = 14
I would like to check whether such assembler code would then be of comparable speed to the 32bit-multiplies I've added with in my patch. If so, we could still use the full precision of the decoder then.
So, any assembler is welcome :)
The MAD decoder we use for MP3 has those implemented for both ARM and Coldfire:
http://svn.rockbox.org/viewvc.cgi/trunk/apps/codecs/libmad/fixed.h?revision=9199&view=markup&pathrev=13533
Also, I suspect you'll find they're a lot faster. At least for WMA the c and asm fixed multiply the difference was huge.
New version's changes:
- bring in two defines (MPC_OPTIMIZATION_QUALITY and MPC_OPTIMIZATION_SPEED) int synth_filter.c, default for now is MPC_OPTIMIZATION_SPEED
- MPC_OPTIMIZATION_SPEED -> pure 32bit multiplication for synthesis filter (not for Coldfire), loss of accuracy as described before
- MPC_OPTIMIZATION_QUALITY -> 32bit multiplies with 64bit results (as trunk-version uses it)
- added ARM-assembler for MPC_MULTIPLY_FRACT, MPC_MULTIPLY_EX and MPC_MULTIPLY
Actual performance (on iPOD 5.5G) measured with test_codec:
- Trunk: 2.30x realtime
- MPC_OPTIMIZATION_QUALITY: 2.60x realtime (+13%)
- MPC_OPTIMIZATION_SPEED: 2.94x realtime (+28%)
The ARM-assembler brought some additional performance for full quality decoding.
ToDo's:
One or two ideas left for tweaking the macros, but not today...
After some code reviews, investigation and measurements I've dropped the idea of full 32bit-multiplies due to either severe accuracy issues (especially for high energy signals) or due to performance which compares to optimized ARM-assembler for iPOD.
So, this patch version dropped all of the 32bit-multiplication stuff and concentrates on assembler optimizations for ARM as well as for Coldfire. The last results showed a performance increase of +13% for iPOD and +3% for Coldfire.
I've also added some comments and cleaned up unused math-macros.
So, what's new:
- corrected the output scaling for non-ARM, non-Coldfire
- renamed constants for fast-dct algorithm (now better match to the real meaning)
- added comment how these dct-constants were calculated in detail
- recalculated these dct-constants (were quite inaccurate before)
- brought floating point decoder back to work (to test it: undef MPC_FIXED_POINT in math.h)
- make my changes work with floating point
- corrected major bug which is also contained in current trunk (distorted audio output with floating point)
- commented new ARM-assembler code
Request for test: Can anyboy with a ARM-based (not PP!) player test whether -O1 or -O2 is faster on such devices?
- makefile: Thanks to IRC a better way to use -O1 for all ARM-targets was found.
- math.h: small typo correction in comment
Performance on 5.5G iPOD -> 3.1x - 3.26x realtime (+41% vs. trunk, +20% compared to my former patch-version with full accuracy)
Accuracy -> output accuracy is comparable to and slightly better than real 15 bit output
Remarks:
- speed optimization is only available (and set by default) for ARM and non-Coldfire (so, in fact only for ARM :)
- speed optimization can be switched off via commenting OPTIMIZE_FOR_SPEED in synth_filter.c
- when speed optimization is switched off the full accuracy is available
- Removed TABs
So, a final test on Coldfire target would be a good idea.
How do other codecs handle this? Or is it maybe planned to adapt the dsp-source to cope with either float or int input?