Rockbox

  • Status Closed
  • Percent Complete 100%
  • Task Type Patches
  • Category Codecs
  • Assigned To No-one
  • Operating System All players
  • Severity Low
  • Priority Very Low
  • Reported Version Daily build (which?)
  • Due in Version Undecided
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Rockbox
Opened by Buschel - 2009-08-27
Last edited by Buschel - 2009-09-27

FS#10565 - atrac3 performance optimization (mostly ARM)

The atrac3 decoding speed of the svn implementation (r22519) is very slow. It needs about 165MHz on PP502x for realtime decoding (138MHz of that is used for the iqmf filterbank).

This patch introduces some optimizations:
1) ARM asm for fixmul16, fixmul31 and fixmul32 → realtime decoding @98MHz
2) minor loop unrolling in dewindowing → realtime decoding @97MHz
3) ARM asm for whole matrixing → realtime decoding @92MHz
The patched iqmf synthesis is more than twice as fast as the svn version.
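
For readers unfamiliar with the fixmul helpers named in the list above, here is a minimal portable C sketch of what such functions typically compute. It assumes the numeric suffix is the right shift applied to the 64-bit product; the `_ref` names are hypothetical and this is not the code from the patch, which uses ARM assembly.

```c
#include <stdint.h>

/* Assumption: the suffix (16, 31, 32) is the right shift applied to the
 * full 64-bit product. Hypothetical reference versions for illustration. */
static inline int32_t fixmul16_ref(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 16);
}

static inline int32_t fixmul31_ref(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 31);
}

static inline int32_t fixmul32_ref(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 32);
}
```

On ARM, a single long multiply produces the 64-bit product in two registers, which is why an asm version can avoid the generic 64-bit shift sequence a compiler may emit for the C form.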

Next step is asm’ing of the dewindowing with usage of multiply-add instructions.

Closed by  Buschel
2009-09-27 19:13
Reason for closing:  Accepted
Additional comments about closing:

r22561

Hi Buschel. One minor point, libmad is under the GPL, while libatrac is LGPL, so if possible could you take the fixed mul functions from one of the LGPL or BSD codecs (Vorbis, Cook, etc) instead? You've already rewritten them so it should be pretty trivial I think to remove the couple lines still from libmad and of course the comments.

This version contains asm'ed dewindowing and some minor cleanups. Realtime decoding on PP502x is possible @78.5MHz now.

Some measurements:
- matrixing needs ~1MHz → not of interest for further optimization right now
- dewindowing needs ~54.5MHz → further optimization needed
- all other parts need ~23MHz → further optimization needed

Some more ideas and questions:
1. It is possible to speed up the dewindowing by ~8MHz simply by reducing the precision of the window coefficients from 31 bits to 16 bits (quick-n-dirty → v02b). Question: Will this have an effect on the output? If so, what is the lowest precision without loss of output precision?
2. The same is valid for the windowing in the imdct. Additionally, the symmetry within the window should be used (win[i] = win[511-i]), and no multiplies are needed for i=128…383 because win[i] = 1 there. After changing this in the C code, asm'ing this section will save some more MHz. Question: Why is the window defined with inverted sign?
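
The symmetry idea in point 2 can be sketched as follows. This is an illustration under assumptions, not the decoder's code: `apply_window`, `mul_q31` and the half-table layout are hypothetical, the coefficients are assumed to be Q31 values, and the inverted sign mentioned above is ignored.

```c
#include <stdint.h>

#define WIN_LEN   512
#define WIN_SLOPE 128   /* only i = 0..127 (and the mirror) differ from 1 */

/* Hypothetical helper: multiply a sample by a Q31 coefficient. */
static inline int32_t mul_q31(int32_t x, int32_t c)
{
    return (int32_t)(((int64_t)x * c) >> 31);
}

/* half_win holds win[0..127] only. win[511-i] == win[i] supplies the
 * falling slope, and the flat middle (win == 1) is left untouched. */
static void apply_window(int32_t *buf, const int32_t *half_win)
{
    for (int i = 0; i < WIN_SLOPE; i++) {
        buf[i]               = mul_q31(buf[i], half_win[i]);
        buf[WIN_LEN - 1 - i] = mul_q31(buf[WIN_LEN - 1 - i], half_win[i]);
    }
    /* buf[128..383] would be multiplied by 1, so nothing to do */
}
```

Compared with a full 512-entry table, this stores a quarter of the coefficients and performs 256 multiplies per block instead of 512.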

So, there is still some potential of >10MHz left…

The Sony PDFs on their ATRAC DSPs claim 32x16 bit precision multiplies used for dewindowing, so I'm guessing they mean 16 bit window coefficients accumulated into a 32 bit buffer, though I don't know how they implemented it. When I converted the coefficients, most were well conditioned (>24 bits non-zero) but a few were not. It's possible they used different precisions for the middle coefficients in order to control rounding error.

I did some further code walkthroughs and tests and needed to change the patch a bit. This patch does not drop 1 bit of precision in fixmul31. Reason for this is that it seems to me like the internal sample representation is s31.0. In this case samples will lose precision in each filter stage (like iqmf, windowing, overlap-add…) because there is no fract part. MT/saratoga: Is this correct or am I wrong? If so, this should be changed. After such a change the av_clip16-stuff should also be removed and dwords should be handed over to the dsp-routines, which already have fast truncation methods.

"Reason for this is that it seems to me like the internal sample representation is s31.0."

I scaled the iQMF window coefficients by 2^31 since the original range was -1<x<1. I would call this s0.31 not s31.0 since there is no integer part to the original numbers but I'm not too experienced with the terminology here. I don't think precision is a problem though. The coefficients are all well conditioned, I think intentionally, to make this easier to implement on <32 bit systems.

Also, looking more closely, they're actually -0.5<x<0.5 so they could be scaled by 2^32 to maximize accuracy.
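
A minimal sketch of the scaling being discussed, assuming the coefficients start out as floating-point values; `to_fixed` is a hypothetical helper, not a function from the codec. Scaling by 2^32 only fits in a signed 32-bit word because the stated range is -0.5<x<0.5, and the matching multiply would then have to shift the product by 32 instead of 31.

```c
#include <stdint.h>

/* Hypothetical helper: convert a coefficient to fixed point with the
 * given number of fractional bits (31 or 32 in the discussion above).
 * The caller must make sure the scaled value fits in an int32_t. */
static int32_t to_fixed(double x, int frac_bits)
{
    return (int32_t)(x * (double)((int64_t)1 << frac_bits));
}
```

For example, 0.25 becomes 0x20000000 with 31 fractional bits and 0x40000000 with 32, so the second form keeps one more significant bit of the coefficient.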

Hi Michael, yes, the coefficients are s0.31 (btw, they are scaled by another <<1 in the init-function). But the signal samples seem to use 16 bits without any fract part. The effect is that e.g. s1 += fixmul(x, y) will truncate the fract part of the multiplication result before each add. A lot of precision is lost with this implementation. The asm'ed part will not lose any precision, as the full 64 bit result (with its 31 bit fract part) is used during the multiply-add. The truncation is only done on the final result. So, the asm'ed optimized version is faster and its result is more precise.
A simple proof of my assumption is that a simple change of operand order in the fixmul-parts speeds up the decoder by another 8 MHz (patch attached). This happens because the second operand influences the number of cycles needed for the multiplication (e.g. 4 non-zero bytes → 7 cycles, 3 non-zero bytes → 6 cycles, 2 non-zero bytes → 5 cycles, 0 non-zero bytes → 4 cycles). Using the window coefficients shifted by <<31 as second operand results in the slowest implementation. When using <<16 shifted coefs (like the v02b patch did) the code becomes much faster. As the same happens when using the samples as second operand, this means the samples only use 16 bits out of 32. So, they do not seem to have any fract part.
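
The precision argument above can be illustrated in portable C. This is a sketch under assumptions, not the decoder's code: both function names are hypothetical, and the coefficients are assumed to be Q31 values.

```c
#include <stdint.h>

/* Truncate each product to an integer before adding, as in
 * s1 += fixmul(x, y): the fract part is dropped on every step. */
static int32_t mac_truncate_each(const int32_t *x, const int32_t *c, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (int32_t)(((int64_t)x[i] * c[i]) >> 31);
    return sum;
}

/* Accumulate the full 64-bit products and truncate only once at the
 * end, which is what a long multiply-accumulate makes cheap. */
static int32_t mac_truncate_once(const int32_t *x, const int32_t *c, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)x[i] * c[i];
    return (int32_t)(acc >> 31);
}
```

With four samples of value 1 and a coefficient of 0.5 (0x40000000), the first function returns 0 because every product truncates to zero, while the second returns 2.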

Additional optimization in imdct windowing. Just create and use the relevant coefficients to save RAM and to avoid multiplication with 1 (or -1, respectively).

Realtime decoding possible @69MHz now.

This version was submitted with r22548. I keep this Flyspray entry open for further optimizations.

v06 changes the DSP configuration and the buffer handling towards dsp routines. Through this a 4KB buffer and the resorting/copying/clipping routines in the atrac decoder could be removed (now the highly optimized dsp routines are used for this). The decoder is 1MHz faster now.

As a side effect I needed to add a fract part to the internal sample representation within the atrac synthesis. Now the samples are in s15.2 format. My intention was to have this fract part at the end of the spectral synthesis and before imdct and iqmf. Another interesting effect: The channels switched. I am not sure whether they were right or wrong before. Are there any known test files?

v07 adds usage of very large IRAM capabilities (PP5022/PP5024/MCF5250). Additional speed up by 2MHz on PP502x.

Can someone test the effect on a X5 or M5?

M5 results:

[20:48] <pixelma> speed test result of the fun_rm track with path: 19.31% realtime, 643.11 MHz needed
[20:48] <pixelma> patch too
[20:53] <pixelma> without patch: 18.24% realtime, 680.84 MHz
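
The two figures in each log line appear to be related by simple arithmetic: the clock needed for realtime is the clock the test ran at divided by the realtime fraction. The helper below is a hypothetical sketch of that relation, not the test tool's code. Both logged pairs are consistent with a test clock of roughly 124 MHz (643.11 × 0.1931 ≈ 124.18 and 680.84 × 0.1824 ≈ 124.19); that clock value is inferred from the numbers, not stated in the log.

```c
/* Hypothetical helper: clock (MHz) needed for realtime decoding, given
 * the clock the test ran at and the measured percentage of realtime. */
static double mhz_needed(double cpu_mhz, double realtime_percent)
{
    return cpu_mhz * 100.0 / realtime_percent;
}
```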

MPlayer has some atrac samples hosted here: http://samples.mplayerhq.hu/real/AC-atrc/

Small changes to quick-n-dirty fract part, add Coldfire ASM. Patch submitted with r22561.
