Rockbox

Tasklist

FS#7166 - musepack decoder performance optimization

Attached to Project: Rockbox
Opened by Andree Buschmann (Buschel) - Sunday, 13 May 2007, 19:15 GMT
Last edited by Thom Johansen (preglow) - Thursday, 30 August 2007, 11:41 GMT
Task Type Patches
Category Codecs
Status Closed
Assigned To No-one
Operating System iPod 5G
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 2
Private No

Details

Finalized my "first shot" on musepack performance optimization. I compiled against current Trunk (13378) and tested on my iPOD-Video (30G). Tesfile was a .mpc-File with avg. 196kbps and 348 seconds duration and the "test-codec"-plugin.

Measurements:
Trunk-Version decodes in 154s (2.26x realtime)
Trunk-Version with -O1 decodes in 140s (2.49x realtime, +10%)

Changes in code and their results on top of Trunk/-O1 (synth_filter.c, makefile):
Apply 32bit-multiplies for calc_new_V() -> decodes in 131s (2.65x realtime, +17%)
Apply 32bit-multiplies for windowing -> decodes in 128s (2.72x realtimem, +20%)
Remark: The 32bit-multiplies are faster than the ARM-assembler for 64bit-multiplies on -O1

I also checked the decoder output (Thanks to linuxstb for the wav-write functionality) against the output of my WinAmp-Decoder -- I will do further testing against Trunk-version.
The results show that there are differences with a maximum of +/- 30 (for 16Bit samples). In spectral view the additional noise is connected to the filterbank borders -- this was expected -- and has its maximum at roundabout -90dB. The signal itself it always roundabout 40-60dB above -- so it shall be masked anyway.

What is contained is this patch:
- enable the "test_codec"-feature for performance measurements and comparing decoder output
- set compiler option from -O2 to -O1 for libmusepack
- several changes in synth_filter.c:
- small refacturation of synthesis-filter
- minor code clean-up
- add 32bit-multiply for fast-DCT-algorithm
- add 32bit-multiply for windowing-algorithm
- add two new #defines to configure usage of new multiplies (32bit or 64bit)

ToDo: The switch to 32bit-multiplies was a first shot to get in touch with the possibilities we have on optimizations. The current implementation will do fixed defined shifts for dct- and window-coefficient-multiplies. For fewer loss of accuracy I propose to do further tweaking for multiplications of samples and coefficients, e.g. taking into account the value of the coefficient to do optimized shifts before multiplications.

ToDo: Test with other targets. I adjusted the compile switch as well as the multiplies to optimal performance on my iPOD-Video.

If there has gone anythin wrong with this patch, please be patient -- this is my first patch ;o)

regards,
Andree
This task depends upon

Closed by  Thom Johansen (preglow)
Thursday, 30 August 2007, 11:41 GMT
Reason for closing:  Accepted
Additional comments about closing:  Committed with small changes.
Comment by Douglas Valentine (Dwyloc) - Thursday, 17 May 2007, 23:16 GMT
After some quick testing on my ipod nano and iriver h140 (coldfire target) the results are as follows:

target normal build patched build
nano 170.64s, 230.75% realtime 140.54s, 280.17% realtime
h140 87.44s, 450.32% realtime 87.41s, 450.47% realtime

So from my limited testing your patch makes decoding on my nano faster without any negative effect on speed of my h140.
Comment by Douglas Valentine (Dwyloc) - Thursday, 17 May 2007, 23:49 GMT
Oops it looks like I failed to do a clean build for the iriver h140 as I typed "make claen; make; make zip" the previous time for the patched build
as such after a clean build it turns out that the real figures are as follows:

ipod nano
normal build 170.64s, 230.75% realtime
patched build 140.54s, 280.17% realtime

iriver h140
normal build h140 87.44s, 450.32% realtime
patched build 121.35s, 324.48% realtime

So your patch dose make decode faster on the ipod, but it also makes it slower on the iriver h140.
Comment by Andree Buschmann (Buschel) - Friday, 18 May 2007, 08:02 GMT
Thank you for the testing -- interesting results.
As I do not have other hardware than my iPOD can you try the following:

a) undefine USE_32BIT_MULTIPLY_FOR_D in synth_filter.c (apps/codecs/libmusepack) and still use -O1 in makefile
b) undefine USE_32BIT_MULTIPLY_FOR_D in synth_filter.c (apps/codecs/libmusepack) and use -O2 in makefile
c) still use USE_32BIT_MULTIPLY_FOR_D in synth_filter.c (apps/codecs/libmusepack) and use -O2 in makefile

As far as I understand from your measurements the coldfire-assembler for the windowing with D-coefficients
seems to scale quite different to the ARM-assembler.

Thanks in adcance :)
Comment by Thom Johansen (preglow) - Friday, 18 May 2007, 08:59 GMT
The assembler code using 64 bit multiplies on Coldfire is very fast, and while it can be changed to do 32 bit multiplies too, I don't think it'll have any effect on performance. Replacing it with C code is always going to be a lot slower, no matter what kind of multiplies you use. Using 32 bit multiplies instead of 64 bit multiplies will have a small positive effect on performance for Coldfire code where we have no assembler, though.
Comment by Andree Buschmann (Buschel) - Friday, 18 May 2007, 17:25 GMT
Hello again, here's the second version of the patch. I played around a lot with the 32x32-multiplies and had to learn there is not much difference in the ouput signal when tweaking the shifts for each coefficient. The output's accuracy is nearly the same when using fixed shifts for multiplies of samples with a set of coefficients (like V-/DCT-coefs and D-/Windowing-coefs). So, I optimized the shifts towards output accuracy and came around the settings which are included in this patch. Compared to the first patch the additional noise is overall lowered by ~2dB, the noise floor is lowered from -84dB to ~-96dB and there are only few harmonic distortions at the subband borders.

Different CPUs:
Regarding the results by Dwyloc and comments by preglow I added the following changes in configuration: When using Coldfire-CPU the 32bit-multiplies are disabled, Coldfire should now use all the assembler optimizations (and 64bit) as before. When using ARM or other CPU all multiplications are set to 32bit (ARM is definately faster with 32bit, for other CPUs I assume the same as there is no further assembler code within the mpc-decoder).

Performance on iPOD Video:
The decoder now performs with 2,96x realtime on my sample (last patch: 2,72x official version: 2,26x), which is +31% compared to official version.

Question:
Can anyone (maybe Dwyloc) check this patch with his Coldfire-target?
Can anyone check this patch with non-ARM and non-Coldfire-target?
What are the influences of the -O1-switch in makefile, is -O2 faster for non-ARM?
Comment by Douglas Valentine (Dwyloc) - Saturday, 19 May 2007, 15:39 GMT
musepack_optimized_v2.patch

iriver h140
normal build decode time 87.47s, 450.16% realtime
patched build decode time 94.66s, 415.97% realtime
Comment by Andree Buschmann (Buschel) - Saturday, 19 May 2007, 15:43 GMT
Ok, looks much better now.
I assume the patch-result was measured with -O1 in makefile.
Can you please check with -O2 (/libmusepack/makefile)?
Comment by Douglas Valentine (Dwyloc) - Saturday, 19 May 2007, 16:11 GMT
yes it was with -O1
with -O2 I get decode time 92.52s, 425.59% realtime.
Comment by Andree Buschmann (Buschel) - Saturday, 19 May 2007, 17:42 GMT
Hmm, gotta check the differences left from old to new code for Coldfire.
Shouldn't be too much difference at all.
Comment by Andree Buschmann (Buschel) - Sunday, 20 May 2007, 13:08 GMT
Ok, reworked some minor issues:

synth_filter.c:
- defining V-coefficients instead of creating variables
- using 32bit-shift for V-coefficients in 64bit-mode to allow usage of MPC_MULTIPLY_FRACT-macro (optimized for Coldfire via assembler)

math.h:
- removed unused macros

makefile:
- use -O1 for iPOD only
- use -O2 for all other targets

This version should not decrease performance on Coldfire-targets anymore.
Comment by Andree Buschmann (Buschel) - Monday, 21 May 2007, 21:05 GMT
Sorry, I messed up the Coldfire-optimization (could not test @home), there will occure heavy overflows and aliasing. An Update will follow...
Comment by Andree Buschmann (Buschel) - Monday, 21 May 2007, 21:34 GMT
Corrected the 32bit-overflow for coldfire. Current patch-version is about 3% slower than trunk-version for coldfire, and about 30% faster for iPOD.
The performance-gap for coldfire will be further investigated.
Comment by Andree Buschmann (Buschel) - Tuesday, 22 May 2007, 18:46 GMT
Correct usage of MPC_MULTIPLY_FRACT-macro for Coldfire assembler (no overflow after type-conversion).

The patch needs to be tested (performance measurments) on different platforms to be able to tweak the
compiler setting and the usage of 32x32- vs. 64x64-multiplies towards performance on each platform.

@preglow, dwyloc: Feel free to test and post the results. Thanks a lot for the time you've already spent.
Comment by Matt M (Chesteta) - Wednesday, 23 May 2007, 00:56 GMT
whenever trying to apply this patch in cygwin, the file is 'erased' and if i look at the patch file in a text editor, there is nothing there... (yes, the patch data was contained in the file earlier. I am compiling for an ipod nano if that helps. Thanks
Comment by Matt M (Chesteta) - Wednesday, 23 May 2007, 12:04 GMT
I updated my svn, and tried to patch again, it worked... Never mind my prior post. Thanks!
Comment by Douglas Valentine (Dwyloc) - Thursday, 24 May 2007, 10:18 GMT
iriver h140
patched v5 decode time 84.30, 467.09% realtime
Comment by Douglas Valentine (Dwyloc) - Thursday, 24 May 2007, 10:48 GMT
Sadly all I get is noise when I try to listen to my test track with your optimization on my h140, so I guess it still needs some work on coldfire targets.
Comment by Andree Buschmann (Buschel) - Friday, 25 May 2007, 08:04 GMT
The Coldfire-stuff really gives me a hard time :/ Can you explain the difference of the assembler coder for MPC_MULTIPLY_FRACT and the "standard" calculation? As far as i understood it is as follows:

64bit-standard: 64x64 = 64, then >>32, still 64bit result -> convert to 32bit, then <<5 (due to normalization)
Coldfire-assembler: 32x32 = 64, take upper 32bit -> 32bit-result -> <<5 (due to normalization)

I thought possible overflows would also occure with the standard implementation and not only with Coldfire-assembler.
Comment by Thom Johansen (preglow) - Friday, 25 May 2007, 08:23 GMT
Coldfire uses the emac unit for that, which does a saturating (32x32) << 1 operation. We shift this down one bit again since Musepack expects all the top 32 bits. The only case where the Coldfire code should overflow is if both multiplicands are 0x80000000, because of the additional right shift needed, but I don't think we'll see that case often. Unfortunately, I don't have time to help with this right now :/
Comment by Andree Buschmann (Buschel) - Friday, 25 May 2007, 10:44 GMT
@dwyloc: could you please try to make the following changes in synth_filter.c (patch v5):

enable line //#define MPC_MULTIPLY_V(sample, vcoef) MPC_MULTIPLY_EX(sample, vcoef, V_COEFFICIENT_EXPAND)
disable line #define MPC_MULTIPLY_V(sample, vcoef) ( (MPC_MULTIPLY_FRACT(sample, vcoef)) << (32-V_COEFFICIENT_EXPAND) )

This will use another ARM-macro for the multiply.

@preglow: at least v-coef will not reach 0x80000000, so there shall be no overflow at all :/ (*confused*)
Comment by Douglas Valentine (Dwyloc) - Friday, 25 May 2007, 12:38 GMT
That fixed the audio problem and speed results are as follows.

iriver h140
patched v5 mod decode time 89.47s, 440.10% realtime
normal build decode time 87.45s, 450.26% realtime

Comment by Andree Buschmann (Buschel) - Friday, 25 May 2007, 12:48 GMT
so, the overflow-bug is connected to usage of the MPC_MULTIPLY_FRACT-assembler code for Coldfire...

Can you test the following:
use #define MPC_MULTIPLY_V(sample, vcoef) ( (MPC_MULTIPLY_FRACT(sample, vcoef)) )

This way we do not shift-left after multiply (result should have much lower volume then). If the result is still noisy, then the overflow occurs during the multiplication itself. In this case try the following additionally:

use #define _MAKE_V_FAC(value) (((MPC_SAMPLE_FORMAT_MULTIPLY)(value>>X))), whereas X is >=1

This way we can check whether reduction of sample-value will avoid the overflow.
Comment by Andree Buschmann (Buschel) - Friday, 25 May 2007, 21:13 GMT
Ok, did another review and found the following issue which I cannot check without your help:
The macro MPC_MULTIPLY_FRACT is extracted to Coldfire-assembler without any explicit type-conversion, the other multiply-macros for Coldfire use explicit type-conversion via function calls (e.g. MPC_MULTIPLY_EX). The only logical difference I've found is that my former patches used these macros with 64-bit integers for DCT-coefficents, the Trunk-version always uses 32bit-integer when using these macros. So, I assume that I also need to use 32-bit coefficients when using MPC_MULTIPLAY_FRACT.
The following patch was changed to take into account this issue. It would be great, if dwyloc and/or preglow could double-check this change (audible quality and performance).

Thank you in advance!
Comment by Andree Buschmann (Buschel) - Saturday, 26 May 2007, 13:27 GMT
To get an impression of the effect when using 32bit-multiplies for the ARM-based portables, I've attached some spectral views.
1. dec_trunk_vs_org.png -> this one shows trunk-decoded signal (pink) versus coding noise from encoder (blue)
2. dec_patch_vs_org.png -> the same for patch-decoded signal
3. dec_patch_vs_trunk.png -> this one shows the additional noise which is added through the 32bit-multiply compared to 64bit-multiply

As a first result you will see the additional noise has its maximum at low and very high frequencies. For the high frequencies this additional noise is by far below masking threshold in quiet -- so, not audible. For the low frequencies this noise is still below the simultaneous masking threshold (masking threshod = similar to blue graph in first two pictures).
Comment by Andree Buschmann (Buschel) - Friday, 01 June 2007, 20:22 GMT
If anyone is capable of writing ARM-assembler code I would be very interested in three macros which are also available for Coldfire-assembler. These macros are (in order of effect on performance):
1. MPC_MULTIPLY_FRACT(X,Y) -> int32 X, int32 Y -> int64 A = X*Y -> result = take upper 32bit of A
2. MPC_MULTIPLY_EX_NOTRUNCATE(X,Y,Z) -> int32 X, int32 Y, int32 Z -> int64 A = X*Y -> result = take 32 lower bits of (A>>Z)
3. MPC_MULTIPLY_NOTRUNCATE(X,Y) -> same as MPC_MULTIPLY_EX_NOTRUNCATE(X,Y,Z) but with Z = 14

I would like to check whether such assembler code would then be of comparable speed to the 32bit-multiplies I've added with in my patch. If so, we could still use the full precision of the decoder then.
So, any assembler is welcome :)
Comment by MichaelGiacomelli (saratoga) - Sunday, 03 June 2007, 05:38 GMT
Andree:

The MAD decoder we use for MP3 has those implemented for both ARM and Coldfire:

http://svn.rockbox.org/viewvc.cgi/trunk/apps/codecs/libmad/fixed.h?revision=9199&view=markup&pathrev=13533

Also, I suspect you'll find they're a lot faster. At least for WMA the c and asm fixed multiply the difference was huge.
Comment by Andree Buschmann (Buschel) - Sunday, 03 June 2007, 22:22 GMT
Saratoga: Thanks a lot, this helped to get some ARM-assembler working for libmusepack :)

New version's changes:
- bring in two defines (MPC_OPTIMIZATION_QUALITY and MPC_OPTIMIZATION_SPEED) int synth_filter.c, default for now is MPC_OPTIMIZATION_SPEED
- MPC_OPTIMIZATION_SPEED -> pure 32bit multiplication for synthesis filter (not for Coldfire), loss of accuracy as described before
- MPC_OPTIMIZATION_QUALITY -> 32bit multiplies with 64bit results (as trunk-version uses it)
- added ARM-assembler for MPC_MULTIPLY_FRACT, MPC_MULTIPLY_EX and MPC_MULTIPLY

Actual performance (on iPOD 5.5G) measured with test_codec:
- Trunk: 2.30x realtime
- MPC_OPTIMIZATION_QUALITY: 2.60x realtime (+13%)
- MPC_OPTIMIZATION_SPEED: 2.94x realtime (+28%)

The ARM-assembler brought some additional performance for full quality decoding.

ToDo's:
One or two ideas left for tweaking the macros, but not today...
Comment by Andree Buschmann (Buschel) - Tuesday, 05 June 2007, 19:53 GMT
Hi all, this will (with most probability) be my last commit for this patch.

After some code reviews, investigation and measurements I've dropped the idea of full 32bit-multiplies due to either severe accuracy issues (especially for high energy signals) or due to performance which compares to optimized ARM-assembler for iPOD.

So, this patch version dropped all of the 32bit-multiplication stuff and concentrates on assembler optimizations for ARM as well as for Coldfire. The last results showed a performance increase of +13% for iPOD and +3% for Coldfire.

I've also added some comments and cleaned up unused math-macros.
Comment by Andree Buschmann (Buschel) - Tuesday, 05 June 2007, 20:07 GMT
Additional remark: There is no loss of precision or accuracy in the output compared to trunk.
Comment by Andree Buschmann (Buschel) - Thursday, 07 June 2007, 17:50 GMT
Hi again, this thing still is too interesting for me :)

So, what's new:
- corrected the output scaling for non-ARM, non-Coldfire
- renamed constants for fast-dct algorithm (now better match to the real meaning)
- added comment how these dct-constants were calculated in detail
- recalculated these dct-constants (were quite inaccurate before)
Comment by Andree Buschmann (Buschel) - Friday, 08 June 2007, 17:15 GMT
New update:

- brought floating point decoder back to work (to test it: undef MPC_FIXED_POINT in math.h)
- make my changes work with floating point
- corrected major bug which is also contained in current trunk (distorted audio output with floating point)
- commented new ARM-assembler code

Request for test: Can anyboy with a ARM-based (not PP!) player test whether -O1 or -O2 is faster on such devices?

Comment by Andree Buschmann (Buschel) - Friday, 08 June 2007, 19:50 GMT
Small thing:

- makefile: Thanks to IRC a better way to use -O1 for all ARM-targets was found.
- math.h: small typo correction in comment
Comment by Thom Johansen (preglow) - Thursday, 14 June 2007, 09:34 GMT
This looks all fine and dandy now, but what's with all the moving around with code (especially in math.h)? Nothing bad in itself, but we'll always have the problem of syncing to libmpcdec, so I'd like it if the patch changed as little as possible (unless strictly necessary, of course). Then again, parts of this patch might be very applicable to libmpcdec itself, so we might want to ask the Musepack people to include it too.
Comment by Andree Buschmann (Buschel) - Thursday, 14 June 2007, 10:55 GMT
The (big) changes come from inserted indentions for all the #ifdef's for better code readability. Math.h's content was changed regarding removal of unused macros and adding of new ARM-assembler.
Comment by Andree Buschmann (Buschel) - Monday, 09 July 2007, 18:17 GMT
New update: Again I came along the 32=32x32 multiplications. The second attempt takes a lot more care about correct rounding, uses ARM-assembler code for multiply-add-operations, gives more accurate results and is even faster than my first experiments.

Performance on 5.5G iPOD -> 3.1x - 3.26x realtime (+41% vs. trunk, +20% compared to my former patch-version with full accuracy)
Accuracy -> output accuracy is comparable to and slightly better than real 15 bit output

Remarks:
- speed optimization is only available (and set by default) for ARM and non-Coldfire (so, in fact only for ARM :)
- speed optimization can be switched off via commenting OPTIMIZE_FOR_SPEED in synth_filter.c
- when speed optimization is switched off the full accuracy is available
Comment by Andree Buschmann (Buschel) - Saturday, 11 August 2007, 17:39 GMT
- Sync'ed against current SVN (#14284)
- Removed TABs
Comment by MichaelGiacomelli (saratoga) - Monday, 27 August 2007, 23:02 GMT
Andree: Is this patch completed? When you are comfortable that it is working properly (on ARM and Coldfire!) I would like to commit it.
Comment by Andree Buschmann (Buschel) - Tuesday, 28 August 2007, 06:08 GMT
On ARM and PC-Sim it is working very well (using this patch for weeks now), the last test on Coldfire that I know were made with patch_v11.
So, a final test on Coldfire target would be a good idea.
Comment by Thom Johansen (preglow) - Tuesday, 28 August 2007, 09:35 GMT
I don't see the need for the MPC_OUTPUT_FORMAT define. There's no reason to use a separate data type for output than for internal processing. If we want floating point output, it kind of follows that floating point is efficient enough that we'll also want to use it internally in the codec for computations. I'll do a Coldfire test today.
Comment by Andree Buschmann (Buschel) - Tuesday, 28 August 2007, 13:10 GMT
The MPC-OUTPUT_FORMAT is only used for the hand-over of the float-result to the integer-based ADC/DSP. This is the only way right now to get the decoder-to-ADC chain work with either float or int. Can be tested via PC-Sim.
How do other codecs handle this? Or is it maybe planned to adapt the dsp-source to cope with either float or int input?
Comment by Thom Johansen (preglow) - Tuesday, 28 August 2007, 14:38 GMT
The reason I asked you to make sure it works in floating point mode as well was to make sure we'd have no trouble in the future if we ever decide to use a floating point DSP chain (gigabeat S, for example, has a good FPU). Right now, we don't expect any codecs to be able to work in floating point mode, but we do want them to be able to do so easily when/if the time comes, and when it does, we'll want codecs to output in floating point as well as use it internally. So I can see no reason to use a different format internally than for the output buffer. Besides, the fewer modifications we have as compared to Musepack SVN, the better :)
Comment by Andree Buschmann (Buschel) - Tuesday, 28 August 2007, 15:26 GMT
Understood. At least we should then throw a warning during compilation when using float as internal computing type. Because in this case the output will be something like near-white noise with a little correlation to the audio signal :)
Comment by Thom Johansen (preglow) - Tuesday, 28 August 2007, 20:58 GMT
Works fine on Coldfire, even boosting performance a little. I don't think we'll need a warning during compilation, no other codecs currently do this and we haven't been confused by the lack of it so far ;) I will commit this soon if I can't find anything worth mentioning.

Loading...