• Status Closed
  • Percent Complete
  • Task Type Patches
  • Category Codecs
  • Assigned To No-one
  • Operating System All players
  • Severity Low
  • Priority Very Low
  • Reported Version Release 3.6
  • Due in Version Undecided
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Rockbox
Opened by Buschel - 2010-07-26
Last edited by mtarek16 - 2010-08-05

FS#11498 - Speed optimization to libwmapro

This flyspray contains speed optimizations to libwmapro.

v01 introduces asm routines for multiplications. Furthermore it adds fixmul31 and fixmul16 as faster routines can be used for those.

Decoding speed for wmapro_173k.wma on PP5022: 43.1 MHz (svn: 49.6 MHz)

Test on Coldfire is needed.

Closed by  mtarek16
2010-08-05 18:56
Reason for closing:  Accepted
Additional comments about closing:   Warning: Undefined array key "typography" in /home/rockbox/flyspray/plugins/dokuwiki/inc/parserutils.php on line 371 Warning: Undefined array key "camelcase" in /home/rockbox/flyspray/plugins/dokuwiki/inc/parserutils.php on line 407

Committed in r27703.

nls commented on 2010-07-26 11:36

For coldfire it probably makes sense to create two variants of vector_fixmul_scalar for the two different shifts used (16 and 24) since the fixmulshift function is pretty slow with 2 branches (othoh those branches could be eliminated here anyway since the shift is known to be less than 31).

edit: the coldifre fixmul32 function in libwma/wmafixed.h is doing the same as the fixmul16 in this patch but is 2 cycles faster and uses one register less

I did not see there are only used two variant of vector_fixmul_scalar(). You are right, we should definately define a 24 and a 16 bitshift variant of either vector_fixmul_scalar().

I will change Coldfire's fixmul16() to your proposed implementation. Is this also valid for other codecs implementations (e.g. atrac, mpc)? Or does this faster implementation use any knowledge about the codec's fixed point representation (e.g. 14 bits fract part)…

nls commented on 2010-07-26 14:50

it does what the one in this patch does, takes the lower 16 bits of the high half of the 64bit result and make them the high 16 bits of the 32 bit final result and the high 16 bits of the lower half and make them the lower 16 bits of the final 32 bit result or (int32_t)((int64_t)x*y»16), but you are right that it will not work for shifts other than 16.

Define fixmul24(), define two specific variants of vector_fixmul_scalar() (vector_fixmul_scalar_16 and vector_fixmul_scalar_24). Speed is same as measured with v01.

Define IRAM macros for different CPU types. Use IRAM for <WMAProDecodeCtx.tmp> and as many <WMAProChannelCtx.out> as fit to IRAM of the specific model.

Decoding speed for wmapro_173k.wma on PP5022: 37.5 MHz (svn: 49.6 MHz)

Differ between
a) models with large IRAM → put <WMAProDecodeCtx.tmp> to IRAM.
b) models with normal IRAM → cannot put <WMAProDecodeCtx.tmp> to IRAM, but move several window tables to IRAM as second best option.

Edit: Histogram of window length for all rockbox wmapro samples is:
len 128 = 1600 calls
len 256 = 370 calls
len 512 = 370 calls
len 1024 = 960 calls
len 2048 = 5700 calls
len 4096 = 0 calls

Decoding speed for wmapro_173k.wma on PP5022: 37.3 MHz (svn: 49.6 MHz)

Edit 2:
mcf5249 wmapro_173k.wma; svn: 179.40MHz 69.22% realtime; wmapro_v04.patch: 117.63MHz 105.57% realtime

This patch produces non interleaved output from wma pro decoder. Works directly on the buffers in wmaprodec.c instead of copying the data.

Minor change. We do not need vector_fixmul_scalar_16(), use vector_fixmul_scalar_24() instead. vector_fixmul_scalar_16() was only used to scale with sqrt(2), we can use higher precision for this rare use case.

Submitted wmapro_v05 with r27582.

Submitted non-interleaving patch with r27583.

Add some comments and unroll loop in vector_fixmul_scalar(). Next step is to add some arm asm for vector_fixmul_scalar() and maybe vector_fixmul_window().

Decoding speed for wmapro_173k.wma on PP5022: 35.6 MHz (svn r27589: 35.9 MHz)

v06 submitted with r27595.

v07: Approach to use larger fract part for mdct and windowing. This preserves higher precision. The larger fract part is introduced via using fixmul16 instead of fixmul24 in the vector_fixmul_scalar() function. The asm for Coldfire has not been changed yet, especially Coldfire will speedup a bit through this change as well.

ToDo: Somebody to change the Coldfire asm and some cleanup of this change.

nls commented on 2010-08-02 09:06

Doing this with the current coldfire fixmul16 asm is pretty trivial but it think there's a better way: switching the emac to integer mode because we can then get the result with only one multiplication instead of the 2 used in the fixmul16 function, otoh, if this multiplication overflows 48 bits that will not work very well i think, do you know if this can overflow into the top 16 bits of the 64 bit result?

If would overflow into the upper 16 bits the result after fixmul16 would overflow anyway. The deeper I think about this change the less convinced I am it will not internally overflow…

1) Requantized values were seen to be <600 * (1«shift), shift's worst case is for 128-size window → shift = 17 - (8-3) = 12. ⇒ 600*(1«12) = ~(1«22)
2) quant (the scaling) was seen to go up to idx 106 ⇒ 10^(156/20) = ~(1«26)
3) This results in (1«22)*(1«26) = (1«48)

quite tight….

Some cleanup. Still the fixmul16-implementation must be merged to the Coldfire asm of vector_fixmul_scalar().

v08 was committed in r27703. Thanks !
Task left open till coldfire vector_fixmul_scalar is done.

here are my takes at the CF asm of fixmul_scalar :

nls commented on 2010-08-05 10:57

Hi, i tested a version similar to your version 1 and also another approach switching the emac to unsigned integer mode and using that to get the lower half of the result but both turned out to be slower than the c loop using the fixmul16 macro so i think we can leave this as it is and close this task.
edit: my idea above about using the extension word of the emac to the the full result with one multiplication can not work.

Is version 2 correct then ? If so, it would be faster than the c loop.

nls commented on 2010-08-05 13:20

no version 2 is not correct, we cannot get away with doing only one multiply because the multiplier yields only a 40 bit result, i was confused and thought it gave a 48 bit result but that is not the case

just one thing iiuc, you're talking about integer mode, version 2 is intended to be in fractional mode.

It's wrong .. totally wrong. I'll just remove the coldfire part.


Available keyboard shortcuts


Task Details

Task Editing