Faster MDCT Experiment
Work has started on writing a new MDCT for the codec library, that would hopefully be faster than the current one, originally ported from Tremor. The whole thing started out as an attempt to modify liba52 to use rockbox's MDCT rather than its own one. The main milestone for achieving a fast MDCT - for an FFT-based transform - is a fast FFT.
Current MDCT performance
The current rockbox MDCT is a heavily optimized version of the Tremor MDCT. On ARM it has been almost entirely rewritten in ASM. Unfortunately, its still not as fast as we would like. On fast transform codecs such as Vorbis and WMA, it makes up >50% of the entire decode time. For 192k WMA on an e200v1, it uses 16.79MHz out of a total of less then 31MHz for the entire decoder.
These are the developers currently working on this, of course anyone is welcome to add their name here and help with the development.
- Isolated ffmpeg's latest FFT and converted to fixed-point arithmetic.
- Factored out liba52's FFT.
- Various comparisons and plots have been done between both FFTs; liba52's FFT is currently about 28% faster than ffmpeg's, due to being aware of some special forms of input data. But since ffmpeg's FFT is all written out of macros, there is still room for optimizations there, and would probably be eventually faster than liba52's. Also, licensing-wise, ffmpeg's FFT is better since it is LGPL'd, while liba52's is GPL'd.
- ffmpeg's FFT was further tested by integrating it to the MDCT of an older revision of rockbox. The audio was distorted a little, and it was later discovered that the bit-reversal tables changed between ffmpeg's old and new FFTs.
- [12 Dec 2009] : The FFT/MDCT now work correctly without distortion. The problem was that bit-reversal was handled in the MDCT along with the pre-rotation. Now the reversal is handled in the FFT itself. The new MDCT has been branched into svn://svn.rockbox.org/rockbox/branches/mdctexp/
- [14 Dec 2009] : Note that in the branch (as of 14Dec09) no codecs yet make use of the new mdct. However with some further patching it ought to be 'straightforward' to drop in the fft-based mdct in place of the existing Tremor-based one. Very rough proof-of-concept for WMA and VORBIS at www.beermex.com/wma_and_vorbis_fft_mdct.patch (this is a patch on-top-of the mdctexp branch) . This appears to work on sim - I wouldn't assume yet that the patched version works ok on target . Even if it does I have not yet measured performance..
- Note - above wma_and_vorbis_fft_mdct.patch updated . Confirmed working on target (ipod 5g). Performance is around 10MHz slower than current build at the moment. To successfully build for target, I needed to eliminate some of the larger-sized fft tables (which are unused in current codecs anyway). I've also made it so that the fsincos setup only takes place once for each fft tablesize (so long as the codec remains loaded). The vorbis codec in particular is also crying out for some further optimisations: making ff_imdct_calc able to operate inplace rather than require separate input&output buffers and memcpying results around; copying some common tables into iram at runtime.
- [17 Dec 2009] - above patch merged into branch, along with accuracy improvements originally in r17783.
- [18 Dec 2009] - fft-ffmpeg twiddle factors table initialization code removed, and rewritten to use static fixed twiddle factor tables (in fact reusing same trig tables from Tremor). I've left this as a #define option with the original behaviour still there as, on target, the new behaviour seems oddly slightly slower, even though the new trig tables are in iram.
- [31 Jan 2010] - Several changes including hand-written asm for fft4 and fft8, and loop unrolling and some pre/post twiddle changes in mdct - vorbis performance seems on a par (maybe <1% slower) compared to trunk.
- [8 Feb 2010] - Changes committed to support operating the 'full' mdct (although not the 'half' mdct) in-place, with some modified final reflection code, means that the mdctexp-version of Vorbis codec no longer needs a memcpy back/forth to a temp buffer. This, combined with a handful of miscellaneous tweaks across fft and mdct, means vorbis is now faster than trunk (see test_codec output attachments). WMA also confirmed faster than trunk (even more so relative to vorbis performance). FasterMDCT_20100208_vorbis.txt FasterMDCT_20100208_wma.txt
- [12 Feb 2010] - Modified Cook and ATRAC3 codecs to use the new mdct library. Performance gain is ~1.5MHz. cook_FasterMDCTvsTrunk.txt atrac3_FasterMDCTvsTrunk.txt
- [12 Feb 2010] - or thereabouts - significant performance gains on ARM (~1Mhz) by hand-optimising most of the TRANSFORM and BUTTERFLIES code (used in pass() function)
- [17 Feb 2010] - mdctexp branch merged into trunk!
- [20 Feb 2010] - Doh! FasterMDCT wasn't running from icode on targets that support this. Enabling that gives an extra 0.5Mhz boost on ARM (pp ipod video) and 5MHz boost on coldfire (iriver h120)
- Get the MDCT working in Rockbox without distortion (DONE! decoded audio quality also confirmed via comparison with oggdec - there was a sign error which has now been fixed in r24739)
- Decide which tables should go into IRAM, and how to allocate them (different codecs will require different amounts of IRAM). In progress - trig tables currently using iram, bitrev not using iram.
- Remove any remaining dynamic initialization (on assumption that we should be able to use statically allocated fixed tables for everything). (DONE! mdct and fft use the same trig table (as the old tremor code) - bitrev table is now static for the largest fft size we care about)
- Maybe consider slightly-reduced (interpolated) trig for the largest block sizes (Tremor does this) (DONE!)
- Maybe consider reordering/transposing between FFT stages so that end result is a regular bit reverse as opposed to a complex addressing scheme (may be faster, and would eliminate need for existing bitrev/digitrev table .. but hard to do)
- Consider replacing the current 4 mul / 4 add complex multiply function with a 3 mul 3 add version and one additional lookup table, as other fast MDCTs seem to do this
- Determine if there are additional algebraic short cuts available in the transform itself
- ARM/Coldfire/MIPS optimization of the basic FFT functions (probably 4, 8 and 16 point FFTs will be enough) (significant amounts of ARM optimisation already done, but nothing for coldfire other than the cross-product blocks)
- Improve MDCT pre/post rotation code if possible (mostly done - unrolling and using iram trig tables has brought significant improvements - could still do more ldm/stm work on arm)
- Port codecs over to the ffmpeg "mdct_half" trick, where the symmetric parts of the mdct are not returned, and instead the windowing code duplicates the symmetric part later to save memory and multiplies (note - likely will not save multiplies for e.g. vorbis, due to shape of window function; but would save memory)
- ffmpeg has an AAC-HE decoder that uses an MDCT and an FFT to do SBR. We should look into using the new MDCT/FFT for libfaad.
libtremor upstream merge
We have prepared a working back port of our changes to libtremor current SVN:
svn co http://svn.xiph.org/trunk/Tremor/
./ivorbisfile_example < file.ogg > out.raw
sox --rate 44100 -c 2 --bits 16 -s out4raw out.wav
It has been cleaned up to match Tremor's style. However more testing (particularly on non-x86 linux or x86 Windows systems) is welcome.
Copyright © by the contributing authors.