FS#10565 - atrac3 performance optimization (mostly ARM)
Attached to Project: Rockbox
Opened by Andree Buschmann (Buschel) - Friday, 28 August 2009, 00:35 GMT+2
Last edited by Andree Buschmann (Buschel) - Sunday, 27 September 2009, 21:13 GMT+2
Details
The atrac3 decoding speed of the svn implementation (r22519) is very slow. It needs about 165MHz on PP502x for realtime decoding, of which 138MHz are used for the iqmf filterbank.
This patch introduces some optimizations:
1) ARM asm for fixmul16, fixmul31 and fixmul32 -> realtime decoding @98MHz
2) minor loop unrolling in dewindowing -> realtime decoding @97MHz
3) ARM asm for whole matrixing -> realtime decoding @92MHz
The patched iqmf synthesis is more than twice as fast as the svn version. The next step is asm'ing the dewindowing using multiply-add instructions.
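For readers who do not have the patch at hand, here is a minimal portable C sketch of the three fixed-point multiplies named in item 1. It assumes the number in each name is the right shift applied to the 64-bit product; that reading is inferred from the names and is not confirmed in this task, and the bodies below are not the ARM asm from the patch.

```c
#include <stdint.h>

/* Assumption: fixmulN(x, y) == (x * y) >> N, computed on a 64-bit product.
 * Right-shifting a negative value is implementation-defined in C; the
 * usual compilers perform an arithmetic shift, which is what this relies on. */
static inline int32_t fixmul16(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 16);
}

static inline int32_t fixmul31(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 31);
}

static inline int32_t fixmul32(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 32);
}
```

On ARM the 64-bit product maps to a single long multiply, which is why hand-written asm can pay off here if the compiler does not already generate it.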
Closed by Andree Buschmann (Buschel)
Sunday, 27 September 2009, 21:13 GMT+2
Reason for closing: Accepted
Additional comments about closing: r22561
Some measurements:
- matrixing needs ~1MHz -> not of interest for further optimization right now
- dewindowing needs ~54.5MHz -> further optimization needed
- all other parts need ~23MHz -> further optimization needed
Some more ideas and questions:
1. It is possible to speed up the dewindowing by ~8MHz simply by reducing the precision of the window coefficients from 31 bit to 16 bit (quick-n-dirty -> v02b). Question: Will this have an effect on the output? If so, what is the lowest precision that does not cost output precision?
2. The same is valid for the windowing in the imdct. Additionally, the symmetry within the window should be used (win[i] = win[511-i]), and no multiplies are needed for i=128...383 because win[i] = 1 there. After changing this in the C code, asm'ing this section will save some more MHz. Question: Why is the window defined with inverted sign?
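The symmetry idea from item 2 can be sketched as follows. This is only an illustration under assumptions: the names `apply_window`, `mul_q31` and `slope` are hypothetical, the coefficients are taken to be s0.31 values for win[0..127], and the inverted sign asked about above is ignored.

```c
#include <stdint.h>

#define WIN_LEN   512
#define WIN_SLOPE 128   /* win[0..127]; win[384..511] is its mirror image */

/* Hypothetical helper: multiply a sample by an s0.31 coefficient. */
static inline int32_t mul_q31(int32_t x, int32_t c)
{
    return (int32_t)(((int64_t)x * c) >> 31);
}

/* Window buf[0..511] in place using only the 128 slope coefficients.
 * One coefficient load serves two samples because win[i] == win[511-i]. */
static void apply_window(int32_t *buf, const int32_t *slope)
{
    for (int i = 0; i < WIN_SLOPE; i++) {
        buf[i]               = mul_q31(buf[i], slope[i]);
        buf[WIN_LEN - 1 - i] = mul_q31(buf[WIN_LEN - 1 - i], slope[i]);
    }
    /* buf[128..383] is left untouched because win[i] == 1 there. */
}
```

Compared with a full 512-entry loop this does 256 multiplies instead of 512 and needs a quarter of the table.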
So, there is still some potential of >10MHz left...
I scaled the iQMF window coefficients by 2^31 since the original range was -1<x<1. I would call this s0.31 not s31.0 since there is no integer part to the original numbers but I'm not too experienced with the terminology here. I don't think precision is a problem though. The coefficients are all well conditioned, I think intentionally, to make this easier to implement on <32 bit systems.
Also, looking more closely, they're actually -0.5<x<0.5 so they could be scaled by 2^32 to maximize accuracy.
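To make the scaling concrete, here is a small sketch of how a coefficient could be converted to fixed point with a chosen number of fractional bits. The helper name `to_fixed` is hypothetical and this is not the script that generated the actual table.

```c
#include <stdint.h>

/* Convert x to fixed point with frac_bits fractional bits, rounding to
 * nearest. Assumption: the scaled value fits in 32 bits, i.e.
 * -1 <= x < 1 for 31 fractional bits and -0.5 <= x < 0.5 for 32. */
static int32_t to_fixed(double x, int frac_bits)
{
    double scaled = x * (double)((int64_t)1 << frac_bits);
    return (int32_t)(int64_t)(scaled + (x >= 0 ? 0.5 : -0.5));
}
```

With 32 fractional bits a coefficient such as 0.25 becomes 0x40000000 instead of 0x20000000, so one more bit of the original value is kept; the product then has to be shifted right by 32 instead of 31.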
A simple proof of my assumption is that a simple change of operand order in the fixmul parts speeds up the decoder by another 8 MHz (patch attached). This happens because the second operand influences the number of cycles needed for the multiplication (e.g. 4 non-zero bytes -> 7 cycles, 3 non-zero bytes -> 6 cycles, 2 non-zero bytes -> 5 cycles, 0 non-zero bytes -> 4 cycles). Using the window coefficients shifted by <<31 as the second operand results in the slowest implementation. When using <<16 shifted coefs (like the v02b patch did) the code becomes much faster. As the same happens when using the samples as the second operand, this means the samples only use 16 bits out of 32. So, they do not seem to have any fractional part.
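A quick way to see which operand is the "cheap" one is to count how many bytes of it are significant. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it treats leading bytes that are all zeros (or all ones for negative values) as free, which is my understanding of early-terminating ARM multipliers rather than something stated in this task. It deliberately does not map the result to cycle counts.

```c
#include <stdint.h>

/* Number of low-order bytes (1..4) needed to hold v once leading
 * sign-extension bytes are dropped. Fewer bytes in the second multiply
 * operand should mean an earlier termination of the multiply. */
static int significant_bytes(int32_t v)
{
    uint32_t u = (uint32_t)v;
    if (v < 0)
        u = ~u;             /* leading ones become leading zeros */
    if ((u >> 8) == 0)
        return 1;
    if ((u >> 16) == 0)
        return 2;
    if ((u >> 24) == 0)
        return 3;
    return 4;
}
```

A coefficient scaled by <<31 typically needs all 4 bytes, while a sample that only uses 16 bits needs 2, which fits the observation that passing the samples as the second operand is faster.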
Realtime decoding possible @69MHz now.
This version was submitted with r22548. I will keep this Flyspray entry open for further optimizations.
As a side effect I needed to add a fractional part to the internal sample representation within the atrac synthesis. Now the samples are in s15.2 format. My intention was to have this fractional part at the end of the spectral synthesis and before imdct and iqmf. Another interesting effect: the channels switched. I am not sure whether they were correct or wrong before. Are there any known test files?
Can someone test the effect on a X5 or M5?
[20:48] <pixelma> speed test result of the fun_rm track with path: 19.31% realtime, 643.11 MHz needed
[20:48] <pixelma> patch too
[20:53] <pixelma> without patch: 18.24% realtime, 680.84 MHz