 Status Closed
 Percent Complete
 Task Type Patches
 Category Codecs
 Assigned To Noone
 Operating System All players
 Severity Low
 Priority Very Low
 Reported Version Daily build (which?)
 Due in Version Undecided

 Due Date Undecided
FS#10565  atrac3 performance optimization (mostly ARM)
The atrac3 decoder in current svn (r22519) is very slow. It needs about 165MHz on PP502x for realtime decoding, of which 138MHz are spent in the iqmf filterbank.
This patch introduces some optimizations:
1) ARM asm for fixmul16, fixmul31 and fixmul32 → realtime decoding @98MHz
2) minor loop unrolling in dewindowing → realtime decoding @97MHz
3) ARM asm for whole matrixing → realtime decoding @92MHz
The patched iqmf synthesis is more than twice as fast as the svn version.
The next step is asm'ing the dewindowing using multiply-add instructions.
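For readers unfamiliar with the fixmul helpers named in item 1, here is a minimal portable C sketch of what such functions typically compute. The shift amounts are an assumption derived from the function names, not copied from the patch, and the actual patch replaces this kind of C with ARM assembly.

```c
#include <stdint.h>

/* Illustrative fixed-point multiplies. Assumption: the numeric suffix is the
 * number of fractional bits dropped from the 64-bit product. Right-shifting a
 * negative value is implementation-defined in C; common compilers perform an
 * arithmetic shift, which is what this sketch relies on. */
static inline int32_t fixmul16(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 16);
}

static inline int32_t fixmul31(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 31);
}

static inline int32_t fixmul32(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 32);
}
```

On ARM the 64-bit product comes from a single long multiply instruction, which is why hand-written asm can help when the compiler emits a generic 64-bit multiply and shift sequence.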
Closed by Buschel
2009-09-27 19:13
Reason for closing: Accepted
Additional comments about closing:
r22561
Hi Buschel. One minor point: libmad is under the GPL, while libatrac is LGPL, so if possible could you take the fixed mul functions from one of the LGPL or BSD codecs (Vorbis, Cook, etc) instead? You've already rewritten them, so I think it should be pretty trivial to remove the couple of lines still taken from libmad, and of course the comments.
This version contains asm'ed dewindowing and some minor cleanups. Realtime decoding on PP502x is possible @78,5MHz now.
Some measurements:
 matrixing needs ~1MHz → not of interest for further optimization right now
 dewindowing needs ~54,5 MHz → further optimization needed
 all other parts need ~23MHz → further optimization needed
Some more ideas and questions:
1. It is possible to speed up the dewindowing by ~8MHz simply by reducing the precision of the window coefficients from 31 bit to 16 bit (quick'n'dirty → v02b). Question: Will this have an effect on the output? If so, what is the lowest precision that does not reduce output precision?
2. The same applies to the windowing in the imdct. Additionally, the symmetry of the window should be exploited (win[i] = win[511-i]), and no multiplies are needed for i=128…383 because win[i] = 1 there. Once this is changed in the C code, asm'ing this section will save some more MHz. Question: Why is the window defined with inverted sign?
So, there is still some potential of >10MHz left…
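A rough C sketch of idea 2, assuming a 512-tap window whose middle section (i=128…383) has unit gain, so only the 128 coefficients of one slope need to be stored. The names `window_symmetric`, `mul_s031` and `slope` are hypothetical, and the sign inversion asked about above is ignored here.

```c
#include <stdint.h>

#define WIN_LEN   512   /* full window length */
#define WIN_SLOPE 128   /* taps on each sloped edge (assumed) */

/* Multiply a sample by an s0.31 coefficient, keeping the integer part. */
static inline int32_t mul_s031(int32_t x, int32_t c)
{
    return (int32_t)(((int64_t)x * c) >> 31);
}

/* Apply a symmetric window in place. Only the rising slope is stored;
 * the falling slope reuses it via win[i] == win[WIN_LEN - 1 - i]. */
static void window_symmetric(int32_t *buf, const int32_t *slope)
{
    for (int i = 0; i < WIN_SLOPE; i++) {
        buf[i]               = mul_s031(buf[i], slope[i]);
        buf[WIN_LEN - 1 - i] = mul_s031(buf[WIN_LEN - 1 - i], slope[i]);
    }
    /* buf[128..383] is left untouched: the window is assumed to be 1 there,
     * so no multiply is needed. */
}
```

Compared with a full 512-entry table this does 256 multiplies instead of 512 per window and needs a quarter of the coefficient storage.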
atrac_v02b.patch (13.7 KiB)
The Sony PDFs on their ATRAC DSPs claim 32x16 bit precision multiplies used for dewindowing, so I'm guessing they mean 16 bit window coefficients accumulated into a 32 bit buffer, though I don't know how they implemented it. When I converted the coefficients, most were well conditioned (>24 bits nonzero) but a few were not. It's possible they used different precisions for the middle coefficients in order to control rounding error.
I did some further code walkthroughs and tests and needed to change the patch a bit. This patch does not drop 1 bit of precision in fixmul31. The reason is that the internal sample representation seems to be s31.0. In that case samples will lose precision in each filter stage (iqmf, windowing, overlap-add…) because there is no fract part. MT/saratoga: Is this correct or am I wrong? If it is correct, this should be changed. After such a change the av_clip16 stuff should also be removed and dwords should be handed over to the dsp routines, which already have fast truncation methods.
I scaled the iQMF window coefficients by 2^31 since the original range was -1<x<1. I would call this s0.31 not s31.0 since there is no integer part to the original numbers but I'm not too experienced with the terminology here. I don't think precision is a problem though. The coefficients are all well conditioned, I think intentionally, to make this easier to implement on <32 bit systems.
Also, looking more closely, they're actually -0.5<x<0.5 so they could be scaled by 2^32 to maximize accuracy.
Hi Michael, yes, the coefficients are s0.31 (btw, they are scaled by another <<1 in the init function). But the signal samples seem to use 16 bit without any fract part. The effect is that e.g. s1 += fixmul(x, y) will truncate the fract part of the multiplication result before each add. A lot of precision is lost with this implementation. The asm'ed part will not lose any precision, as the full 64 bit result (with its 31 bit fract part) is used during the multiply-add. The truncation is only done on the final result. So the asm'ed optimized version is faster and its result is more precise.
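The difference between truncating every product and truncating only the final sum can be illustrated in plain C. This is only a sketch of the arithmetic described above; the function names are hypothetical and the real code does the 64 bit accumulation with ARM multiply-accumulate instructions.

```c
#include <stdint.h>

/* C-style path: each product is shifted down to an integer before the add,
 * so the 31 fractional bits of every product are discarded. */
static int32_t mac_truncate_each(const int32_t *x, const int32_t *c, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)(((int64_t)x[i] * c[i]) >> 31);
    return acc;
}

/* Asm-style path: products are summed at full 64 bit width and the
 * fractional bits are dropped once, on the final result. */
static int32_t mac_truncate_once(const int32_t *x, const int32_t *c, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)x[i] * c[i];
    return (int32_t)(acc >> 31);
}
```

For example, with four samples of value 1 and four coefficients of 0.5 (0x40000000 in s0.31), the first function returns 0 because each 0.5 is truncated away, while the second returns 2.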
A simple proof of my assumption is that merely changing the operand order in the fixmul parts speeds up the decoder by another 8 MHz (patch attached). This happens because the second operand influences the number of cycles needed for the multiplication (e.g. 4 nonzero bytes → 7 cycles, 3 nonzero bytes → 6 cycles, 2 nonzero bytes → 5 cycles, 0 nonzero bytes → 4 cycles). Using the window coefficients shifted by <<31 as second operand results in the slowest implementation. When using <<16 shifted coefs (like the v02b patch did) the code becomes much faster. As the same happens when using the samples as second operand, this means the samples only use 16 bits out of 32. So they do not seem to have any fract part.
Additional optimization in imdct windowing. Just create and use the relevant coefficients to save RAM and to avoid multiplication with 1 (or -1, respectively).
Realtime decoding possible @69MHz now.
This version was submitted with r22548. I keep this Flyspray entry open for further optimizations.
v06 changes the DSP configuration and the buffer handling towards the dsp routines. This made it possible to remove a 4KB buffer and the resorting/copying/clipping routines in the atrac decoder (the highly optimized dsp routines are now used for this). The decoder is 1MHz faster now.
As a side effect I needed to add a fract part to the internal sample representation within the atrac synthesis. Now the samples are in s15.2 format. My intention was to have this fract part at the end of the spectral synthesis and before imdct and iqmf. Another interesting effect: the channels are now swapped. I am not sure whether they were right or wrong before. Are there any known test files?
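To make the s15.2 notation concrete, here is a small C sketch of converting between 16 bit PCM and a sample with 2 fractional bits. It only illustrates the number format: in the patch the final rounding and clipping is left to the dsp routines, and the helper names below are hypothetical.

```c
#include <stdint.h>

#define FRACT_BITS 2   /* s15.2: sign + 15 integer bits + 2 fractional bits */

/* Promote a 16 bit PCM sample to s15.2 (value scaled by 4). */
static inline int32_t to_s15_2(int16_t pcm)
{
    return (int32_t)pcm * (1 << FRACT_BITS);
}

/* Round an s15.2 sample to the nearest integer and clip it to 16 bit.
 * Relies on arithmetic right shift for negative values, as common
 * compilers provide. */
static inline int16_t from_s15_2(int32_t s)
{
    int32_t v = (s + (1 << (FRACT_BITS - 1))) >> FRACT_BITS;
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return (int16_t)v;
}
```

The two extra bits let intermediate filter stages keep quarter-step resolution, so rounding happens once at the output instead of at every stage.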
v07 adds usage of very large IRAM capabilities (PP5022/PP5024/MCF5250). Additional speed up by 2MHz on PP502x.
Can someone test the effect on an X5 or M5?
M5 results:
[20:48] <pixelma> speed test result of the fun_rm track with path: 19.31% realtime, 643.11 MHz needed
[20:48] <pixelma> patch too
[20:53] <pixelma> without patch: 18.24% realtime, 680.84 MHz
MPlayer has some atrac samples hosted here: http://samples.mplayerhq.hu/real/ACatrc/
Small changes to the quick'n'dirty fract part, plus added Coldfire ASM. Patch submitted with r22561.