FS#10565 - atrac3 performance optimization (mostly ARM)

Attached to Project: Rockbox
Opened by Andree Buschmann (Buschel) - Thursday, 27 August 2009, 22:35 GMT
Last edited by Andree Buschmann (Buschel) - Sunday, 27 September 2009, 19:13 GMT
Task Type Patches
Category Codecs
Status Closed
Assigned To No-one
Operating System All players
Severity Low
Priority Normal
Reported Version Daily build (which?)
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No


The atrac3 decoder in the current svn implementation (r22519) is very slow. Realtime decoding needs about 165MHz on PP502x, of which 138MHz is spent in the iqmf filterbank.

This patch introduces some optimizations:
1) ARM asm for fixmul16, fixmul31 and fixmul32 -> realtime decoding @98MHz
2) minor loop unrolling in dewindowing -> realtime decoding @97MHz
3) ARM asm for whole matrixing -> realtime decoding @92MHz
The patched iqmf synthesis is more than twice as fast as the svn version.
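
For readers without the patch at hand, here is a minimal portable C sketch of what the three fixed-point multiplies in item 1 could look like. The function names come from the list above, but the exact shift semantics are an assumption based on the names, not a copy of the Rockbox sources. On ARM the 64 bit product maps onto a single smull, which is where the asm versions gain their speed.

```c
#include <stdint.h>

/* Assumed semantics: 32x32 -> 64 bit signed product, then drop the
 * lowest 16, 31 or 32 bits. Right-shifting a negative value is
 * implementation-defined in C; gcc uses an arithmetic shift. */
static inline int32_t fixmul16(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 16);
}

static inline int32_t fixmul31(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 31);
}

static inline int32_t fixmul32(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 32);
}
```

With a coefficient of 0.5 in s0.31 (0x40000000), fixmul31(0x40000000, 1000) halves the sample.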

The next step is asm'ing the dewindowing using multiply-add instructions.

Closed by  Andree Buschmann (Buschel)
Sunday, 27 September 2009, 19:13 GMT
Reason for closing:  Accepted
Additional comments about closing:  r22561
Comment by MichaelGiacomelli (saratoga) - Friday, 28 August 2009, 00:46 GMT
Hi Buschel. One minor point, libmad is under the GPL, while libatrac is LGPL, so if possible could you take the fixed mul functions from one of the LGPL or BSD codecs (Vorbis, Cook, etc) instead? You've already rewritten them so it should be pretty trivial I think to remove the couple lines still from libmad and of course the comments.

Comment by Andree Buschmann (Buschel) - Friday, 28 August 2009, 21:46 GMT
This version contains asm'ed dewindowing and some minor cleanups. Realtime decoding on PP502x is now possible @78.5MHz.

Some measurements:
- matrixing needs ~1MHz -> not of interest for further optimization right now
- dewindowing needs ~54.5MHz -> further optimization needed
- all other parts need ~23MHz -> further optimization needed

Some more ideas and questions:
1. It is possible to speed up the dewindowing by ~8MHz simply by reducing the precision of the window coefficients from 31 bits to 16 bits (quick-n-dirty -> v02b). Question: Will this have an effect on the output? If so, what is the lowest precision that does not reduce output precision?
2. The same applies to the windowing in the imdct. Additionally, the symmetry within the window can be exploited (win[i] = win[511-i]), and no multiplies are needed for i=128...383 because win[i] = 1 there. After changing this in the C code, asm'ing this section will save some more MHz. Question: Why is the window defined with inverted sign?

So, there is still some potential of >10MHz left...
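
The symmetry idea from point 2 can be sketched in C as follows. This is only an illustration under stated assumptions: window_symmetric and the slope table are hypothetical names, fixmul31 is assumed to be a 64 bit product shifted right by 31, and the inverted sign of the real window is ignored here.

```c
#include <stdint.h>

#define WIN_LEN   512   /* window length, from win[i] = win[511-i] */
#define SLOPE_LEN 128   /* only i = 0..127 differ from 1 */

/* Assumed semantics: (x * y) >> 31 with a 64 bit intermediate. */
static inline int32_t fixmul31(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 31);
}

/* Hypothetical helper: windows buf[0..511] in place using only the
 * 128 slope coefficients (s0.31). Each coefficient is loaded once
 * and applied to both mirrored positions. */
static void window_symmetric(int32_t *buf, const int32_t *slope)
{
    for (int i = 0; i < SLOPE_LEN; i++) {
        buf[i]               = fixmul31(buf[i], slope[i]);
        buf[WIN_LEN - 1 - i] = fixmul31(buf[WIN_LEN - 1 - i], slope[i]);
    }
    /* buf[128..383]: win[i] == 1, so no multiply is needed */
}
```

Compared with a full 512-entry table this needs a quarter of the coefficients and half as many multiplies, and the flat middle part is never touched.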
Comment by MichaelGiacomelli (saratoga) - Friday, 28 August 2009, 22:08 GMT
The Sony PDFs on their ATRAC DSPs claim 32x16 bit precision multiplies are used for dewindowing, so I'm guessing they mean 16 bit window coefficients accumulated into a 32 bit buffer, though I don't know how they implemented it. When I converted the coefficients, most were well conditioned (>24 bits non-zero) but a few were not. It's possible they used different precisions for the middle coefficients in order to control rounding error.
Comment by Andree Buschmann (Buschel) - Saturday, 29 August 2009, 13:44 GMT
I did some further code walkthroughs and tests and needed to change the patch a bit. This patch does not drop 1 bit of precision in fixmul31. The reason is that the internal sample representation seems to me to be s31.0. In that case samples will lose precision in each filter stage (iqmf, windowing, overlap-add...) because there is no fract part. MT/saratoga: Is this correct or am I wrong? If it is correct, this should be changed. After such a change the av_clip16 stuff should also be removed and dwords should be handed over to the dsp routines, which already have fast truncation methods.

Comment by MichaelGiacomelli (saratoga) - Saturday, 29 August 2009, 15:01 GMT
>Reason for this is that it seems to me like the internal sample representation is s31.0.

I scaled the iQMF window coefficients by 2^31 since the original range was -1<x<1. I would call this s0.31 not s31.0 since there is no integer part to the original numbers but I'm not too experienced with the terminology here. I don't think precision is a problem though. The coefficients are all well conditioned, I think intentionally, to make this easier to implement on <32 bit systems.

Also, looking more closely, they're actually -0.5<x<0.5 so they could be scaled by 2^32 to maximize accuracy.
Comment by Andree Buschmann (Buschel) - Saturday, 29 August 2009, 17:44 GMT
Hi Michael, yes, the coefficients are s0.31 (btw, they are scaled by another <<1 in the init function). But the signal samples seem to use 16 bits without any fract part. The effect is that e.g. s1 += fixmul(x, y) truncates the fract part of the multiplication result before each add. A lot of precision is lost with this implementation. The asm'ed part will not lose any precision, as the full 64 bit result (with its 31 bit fract part) is used during the multiply-add. The truncation is only done on the final result. So the asm-optimized version is faster and its result is more precise.
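
The precision argument can be illustrated with a small C sketch. The two helper names are hypothetical and fixmul31 is assumed to be a 64 bit product shifted right by 31. The first variant truncates after every product, the second keeps a 64 bit accumulator (as smlal would) and truncates once.

```c
#include <stdint.h>

/* Assumed semantics: (x * y) >> 31 with a 64 bit intermediate. */
static inline int32_t fixmul31(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 31);
}

/* Hypothetical helper: the fract part of each product is dropped
 * before the add, as in s1 += fixmul(x, y). */
static int32_t mac_truncating(const int32_t *x, const int32_t *c, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += fixmul31(x[i], c[i]);
    return sum;
}

/* Hypothetical helper: accumulate the full 64 bit products and
 * truncate only the final result. */
static int32_t mac_wide(const int32_t *x, const int32_t *c, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)x[i] * c[i];
    return (int32_t)(acc >> 31);
}
```

For four samples of value 1 and four coefficients of 0.25 (0x20000000), the truncating version returns 0 because every single product rounds down to zero, while the wide version returns the expected 1.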
A simple proof of my assumption is that a simple change of operand order in the fixmul parts speeds up the decoder by another 8MHz (patch attached). This happens because the second operand influences the number of cycles needed for the multiplication (e.g. 4 non-zero bytes -> 7 cycles, 3 non-zero bytes -> 6 cycles, 2 non-zero bytes -> 5 cycles, 0 non-zero bytes -> 4 cycles). Using the window coefficients shifted by <<31 as second operand results in the slowest implementation. When using <<16 shifted coefs (like the v02b patch did) the code becomes much faster. As the same happens when using the samples as second operand, this means the samples only use 16 bits out of 32. So they do not seem to have any fract part.
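
The operand-order effect can be modelled roughly in C. The cycle figures (4 to 7) are taken from the comment above. The rule that leading all-zero or all-one bytes of the second operand shorten a signed multiply is my understanding of ARM7TDMI early termination and should be treated as an assumption, as should both function names.

```c
#include <stdint.h>

/* Hypothetical model: number of 8 bit multiplier steps needed for the
 * second operand rs of a signed long multiply. Leading bytes that are
 * pure sign extension (all zero or all one) are skipped. */
static int multiplier_steps(int32_t rs)
{
    uint32_t u = (uint32_t)rs;
    uint32_t top24 = u >> 8, top16 = u >> 16, top8 = u >> 24;

    if (top24 == 0 || top24 == 0xFFFFFFu) return 1;
    if (top16 == 0 || top16 == 0xFFFFu)   return 2;
    if (top8  == 0 || top8  == 0xFFu)     return 3;
    return 4;
}

/* Cycle estimate matching the 4..7 cycle figures quoted above. */
static int multiply_cycles(int32_t rs)
{
    return multiplier_steps(rs) + 3;
}
```

Under this model a coefficient scaled by <<31 always costs 7 cycles as second operand, while a 16 bit sample (positive or negative) costs 5, which is consistent with the speedup seen after swapping the operands.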
Comment by Andree Buschmann (Buschel) - Saturday, 29 August 2009, 19:32 GMT
Additional optimization in imdct windowing. Just create and use the relevant coefficients to save RAM and to avoid multiplication with 1 (or -1 respectively).

Realtime decoding possible @69MHz now.

This version was submitted with r22548. I will keep this Flyspray entry open for further optimizations.
Comment by Andree Buschmann (Buschel) - Saturday, 29 August 2009, 23:12 GMT
v06 changes the DSP configuration and the buffer handling towards the dsp routines. This allowed a 4KB buffer and the resorting/copying/clipping routines in the atrac decoder to be removed (the highly optimized dsp routines are now used for this). The decoder is 1MHz faster now.

As a side effect I needed to add a fract part to the internal sample representation within the atrac synthesis. Now the samples are in s15.2 format. My intention was to have this fract part at the end of the spectral synthesis and before imdct and iqmf. Another interesting effect: the channels switched. I am not sure whether they were right or wrong before. Are any test files known?
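
To make the s15.2 notation concrete, here is a small C sketch of what two fractional bits mean for a 16 bit sample. It is not the Rockbox dsp code. The helper names are hypothetical, and the rounding and clipping shown are assumptions about one reasonable way to get back to 16 bit output.

```c
#include <stdint.h>

#define FRACT_BITS 2   /* s15.2: sign, 15 integer bits, 2 fract bits */

/* Hypothetical helper: 16 bit PCM sample -> s15.2 (multiply by 4). */
static inline int32_t to_s15_2(int16_t sample)
{
    return (int32_t)sample * (1 << FRACT_BITS);
}

/* Hypothetical helper: s15.2 -> 16 bit PCM with round-half-up and
 * saturation. Assumes an arithmetic right shift for negative values
 * (implementation-defined in C, true for gcc). */
static inline int16_t from_s15_2(int32_t v)
{
    int32_t r = (v + (1 << (FRACT_BITS - 1))) >> FRACT_BITS;
    if (r > 32767)  r = 32767;
    if (r < -32768) r = -32768;
    return (int16_t)r;
}
```

The two extra bits let intermediate filter stages carry quarter-sample resolution, so rounding happens once at the output instead of in every stage.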
Comment by Andree Buschmann (Buschel) - Sunday, 30 August 2009, 00:09 GMT
v07 adds usage of very large IRAM capabilities (PP5022/PP5024/MCF5250). Additional speed up by 2MHz on PP502x.

Can someone test the effect on a X5 or M5?
Comment by MichaelGiacomelli (saratoga) - Sunday, 30 August 2009, 01:00 GMT
M5 results:

[20:48] <pixelma> speed test result of the fun_rm track with path: 19.31% realtime, 643.11 MHz needed
[20:48] <pixelma> patch too
[20:53] <pixelma> without patch: 18.24% realtime, 680.84 MHz
Comment by MohamedTarek (mtarek16) - Sunday, 30 August 2009, 08:40 GMT
MPlayer has some atrac samples hosted here :
Comment by Andree Buschmann (Buschel) - Sunday, 30 August 2009, 14:18 GMT
Small changes to quick-n-dirty fract part, add Coldfire ASM. Patch submitted with r22561.