Rockbox

Tasklist

FS#11498 - Speed optimization to libwmapro

Attached to Project: Rockbox
Opened by Andree Buschmann (Buschel) - Monday, 26 July 2010, 06:26 GMT
Last edited by MohamedTarek (mtarek16) - Thursday, 05 August 2010, 18:56 GMT
Task Type Patches
Category Codecs
Status Closed
Assigned To No-one
Operating System All players
Severity Low
Priority Normal
Reported Version Release 3.6
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

This flyspray contains speed optimizations to libwmapro.

v01 introduces asm routines for multiplications. Furthermore it adds fixmul31 and fixmul16 as faster routines can be used for those.

Decoding speed for wmapro_173k.wma on PP5022: 43.1 MHz (svn: 49.6 MHz)

Test on Coldfire is needed.
This task depends upon

Closed by  MohamedTarek (mtarek16)
Thursday, 05 August 2010, 18:56 GMT
Reason for closing:  Accepted
Additional comments about closing:  Committed in r27703.
Comment by Nils Wallménius (nls) - Monday, 26 July 2010, 11:36 GMT
For coldfire it probably makes sense to create two variants of vector_fixmul_scalar for the two different shifts used (16 and 24) since the fixmulshift function is pretty slow with 2 branches (othoh those branches could be eliminated here anyway since the shift is known to be less than 31).

edit: the coldifre fixmul32 function in libwma/wmafixed.h is doing the same as the fixmul16 in this patch but is 2 cycles faster and uses one register less
Comment by Andree Buschmann (Buschel) - Monday, 26 July 2010, 12:11 GMT
I did not see there are only used two variant of vector_fixmul_scalar(). You are right, we should definately define a 24 and a 16 bitshift variant of either vector_fixmul_scalar().

I will change Coldfire's fixmul16() to your proposed implementation. Is this also valid for other codecs implementations (e.g. atrac, mpc)? Or does this faster implementation use any knowledge about the codec's fixed point representation (e.g. 14 bits fract part)...
Comment by Nils Wallménius (nls) - Monday, 26 July 2010, 14:50 GMT
it does what the one in this patch does, takes the lower 16 bits of the high half of the 64bit result and make them the high 16 bits of the 32 bit final result and the high 16 bits of the lower half and make them the lower 16 bits of the final 32 bit result or (int32_t)((int64_t)x*y>>16), but you are right that it will not work for shifts other than 16.
Comment by Andree Buschmann (Buschel) - Monday, 26 July 2010, 18:21 GMT
Define fixmul24(), define two specific variants of vector_fixmul_scalar() (vector_fixmul_scalar_16 and vector_fixmul_scalar_24). Speed is same as measured with v01.
Comment by Andree Buschmann (Buschel) - Monday, 26 July 2010, 18:46 GMT
Define IRAM macros for different CPU types. Use IRAM for <WMAProDecodeCtx.tmp> and as many <WMAProChannelCtx.out> as fit to IRAM of the specific model.

Decoding speed for wmapro_173k.wma on PP5022: 37.5 MHz (svn: 49.6 MHz)
Comment by Andree Buschmann (Buschel) - Monday, 26 July 2010, 19:56 GMT
Differ between
a) models with large IRAM -> put <WMAProDecodeCtx.tmp> to IRAM.
b) models with normal IRAM -> cannot put <WMAProDecodeCtx.tmp> to IRAM, but move several window tables to IRAM as second best option.

Edit: Histogram of window length for all rockbox wmapro samples is:
len 128 = 1600 calls
len 256 = 370 calls
len 512 = 370 calls
len 1024 = 960 calls
len 2048 = 5700 calls
len 4096 = 0 calls

Decoding speed for wmapro_173k.wma on PP5022: 37.3 MHz (svn: 49.6 MHz)

Edit 2:
mcf5249 wmapro_173k.wma; svn: 179.40MHz 69.22% realtime; wmapro_v04.patch: 117.63MHz 105.57% realtime
Comment by MohamedTarek (mtarek16) - Monday, 26 July 2010, 21:18 GMT
This patch produces non interleaved output from wma pro decoder. Works directly on the buffers in wmaprodec.c instead of copying the data.
Comment by Andree Buschmann (Buschel) - Monday, 26 July 2010, 21:37 GMT
Minor change. We do not need vector_fixmul_scalar_16(), use vector_fixmul_scalar_24() instead. vector_fixmul_scalar_16() was only used to scale with sqrt(2), we can use higher precision for this rare use case.
Comment by Andree Buschmann (Buschel) - Monday, 26 July 2010, 21:43 GMT
Submitted wmapro_v05 with r27582.

Edit:
Submitted non-interleaving patch with r27583.
Comment by Andree Buschmann (Buschel) - Wednesday, 28 July 2010, 06:05 GMT
Add some comments and unroll loop in vector_fixmul_scalar(). Next step is to add some arm asm for vector_fixmul_scalar() and maybe vector_fixmul_window().

Decoding speed for wmapro_173k.wma on PP5022: 35.6 MHz (svn r27589: 35.9 MHz)
Comment by Andree Buschmann (Buschel) - Wednesday, 28 July 2010, 18:19 GMT
v06 submitted with r27595.
Comment by Andree Buschmann (Buschel) - Monday, 02 August 2010, 05:57 GMT
v07: Approach to use larger fract part for mdct and windowing. This preserves higher precision. The larger fract part is introduced via using fixmul16 instead of fixmul24 in the vector_fixmul_scalar() function. The asm for Coldfire has not been changed yet, especially Coldfire will speedup a bit through this change as well.

ToDo: Somebody to change the Coldfire asm and some cleanup of this change.
Comment by Nils Wallménius (nls) - Monday, 02 August 2010, 09:06 GMT
Doing this with the current coldfire fixmul16 asm is pretty trivial but it think there's a better way: switching the emac to integer mode because we can then get the result with only one multiplication instead of the 2 used in the fixmul16 function, otoh, if this multiplication overflows 48 bits that will not work very well i think, do you know if this can overflow into the top 16 bits of the 64 bit result?
Comment by Andree Buschmann (Buschel) - Monday, 02 August 2010, 10:05 GMT
If would overflow into the upper 16 bits the result after fixmul16 would overflow anyway. The deeper I think about this change the less convinced I am it will not internally overflow...

1) Requantized values were seen to be <600 * (1<<shift), shift's worst case is for 128-size window -> shift = 17 - (8-3) = 12. => 600*(1<<12) = ~(1<<22)
2) quant (the scaling) was seen to go up to idx 106 => 10^(156/20) = ~(1<<26)
3) This results in (1<<22)*(1<<26) = (1<<48)

quite tight....
Comment by Andree Buschmann (Buschel) - Monday, 02 August 2010, 17:59 GMT
Some cleanup. Still the fixmul16-implementation must be merged to the Coldfire asm of vector_fixmul_scalar().
Comment by MohamedTarek (mtarek16) - Wednesday, 04 August 2010, 23:27 GMT
v08 was committed in r27703. Thanks !
Task left open till coldfire vector_fixmul_scalar is done.
Comment by MohamedTarek (mtarek16) - Thursday, 05 August 2010, 05:52 GMT
here are my takes at the CF asm of fixmul_scalar : http://www.pastie.org/1076754
Comment by Nils Wallménius (nls) - Thursday, 05 August 2010, 10:57 GMT
Hi, i tested a version similar to your version 1 and also another approach switching the emac to unsigned integer mode and using that to get the lower half of the result but both turned out to be slower than the c loop using the fixmul16 macro so i think we can leave this as it is and close this task.
edit: my idea above about using the extension word of the emac to the the full result with one multiplication can not work.
Comment by MohamedTarek (mtarek16) - Thursday, 05 August 2010, 13:00 GMT
Is version 2 correct then ? If so, it would be faster than the c loop.
Comment by Nils Wallménius (nls) - Thursday, 05 August 2010, 13:20 GMT
no version 2 is not correct, we cannot get away with doing only one multiply because the multiplier yields only a 40 bit result, i was confused and thought it gave a 48 bit result but that is not the case
Comment by MohamedTarek (mtarek16) - Thursday, 05 August 2010, 14:06 GMT
just one thing iiuc, you're talking about integer mode, version 2 is intended to be in fractional mode.
Comment by MohamedTarek (mtarek16) - Thursday, 05 August 2010, 18:43 GMT
It's wrong .. totally wrong. I'll just remove the coldfire part.

Loading...