Rockbox

Tasklist

FS#11666 - codeclib arm hand-tuned assembly in fft-ffmpeg doesn't work on Android/-PIC builds

Attached to Project: Rockbox
Opened by Dave Hooper (stripwax) - Monday, 11 October 2010, 23:53 GMT
Last edited by Dave Hooper (stripwax) - Friday, 12 November 2010, 10:19 GMT
Task Type Patches
Category Codecs
Status Closed
Assigned To No-one
Operating System Another
Severity Low
Priority Normal
Reported Version Release 3.6
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

See mailing list thread entitled "Segfault with FasterMDCT patch and -fPIC" for more details.

Background:
The hand-rolled asm in fft-ffmpeg_arm.h uses a combination of preprocessor macros and individual named register selection, in order to enable most use of stm/ldm (requiring register ordering constraints) without writing all of the asm manually - in theory enabling gcc to still make register allocations, local optimisations, etc.
However, it doesn't seem to work very well. In particular, using -fPIC in a regular ARM environment means that gcc needs to use r10 for its own needs, and the current code doesn't seem to enable gcc to realise that I am using r10 for my needs. As a result, gcc doesn't preserve r10 when it ought to, resulting in data aborts.

This patch serves three purposes-
1. Remove a number of the preprocessor macros in fft-ffmpeg_arm.h , in favour of static inlines, hopefully resulting in increased readability
2. Propagate address increments outside the scope of the local function (by simply returning the updated address pointer) for a slight reduction in stacking/unstacking and pointer arith (observed thru disassembly)
3. Reduce number of manually-selected specific register allocations. In particular, the register used for "t1" in the TRANSFORM_xxx macros does not need to be manually allocated if we make some small rearrangements. This also means the code no longer needs to specifically reference r10, which hopefully means gcc will track the PIC register correctly now.
This task depends upon

Closed by  Dave Hooper (stripwax)
Friday, 12 November 2010, 10:19 GMT
Reason for closing:  Accepted
Additional comments about closing:  submittied in r28262
Comment by Dave Hooper (stripwax) - Tuesday, 12 October 2010, 07:01 GMT
sim seems to work fine still, so I think just need confirmation that the -fPIC problem has been resolved
Comment by Andree Buschmann (Buschel) - Tuesday, 12 October 2010, 08:00 GMT
is there any change in decoding speed?
Comment by Dave Hooper (stripwax) - Tuesday, 12 October 2010, 12:48 GMT
I did not test rigorously (i.e. didn't run test-codec) but didn't observe any obvious boost% differences from "debug buffering thread". Ideally these changes should actually increase decoding speed (fractionally) as I'm hard-coding fewer manual register allocations.

There actually do appear to be some additional ('loosely-documented'...) constraints for use with gcc inline asm - some of which sound fully appropriate to the class of stm/ldm problems. I hadn't realised this before and it doesn't seem to be universal knowledge.
http://hardwarebug.org/2010/07/06/arm-inline-asm-secrets/
If these work, it might be worth playing with "=M"-type constraints to simplify the code somewhat and (ideally) fully remove the manual register allocation. However, the info in the above link seems to disagree with the info in the 'standard' docs (below). If anybody has any insight/first-hand experience on this, would be great to know..
http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/Machine-Constraints.html#Machine-Constraints
Comment by Dave Hooper (stripwax) - Tuesday, 12 October 2010, 12:52 GMT
(actually, rereading that hardwarebug link, seems that the way to use this isn't "=M" in the input/output/clobbers, but rather %M0, %M1 etc in the asm itself. "undocumented feature" indeed...)
Comment by Dave Hooper (stripwax) - Tuesday, 12 October 2010, 22:45 GMT
Well the above patch didn't seem to work successfully for me on an alternative arm platform (NSLU2, if you were wondering).

Updated patch attached. Also test_codec results suggest this is faster by about 2% compared to July data on the wiki. (I didn't actually run a comparison of svn vs patched but will do that shortly)
Comment by Dave Hooper (stripwax) - Tuesday, 12 October 2010, 22:59 GMT
I suspect the wiki data is stale. http://www.rockbox.org/wiki/CodecPerformanceComparison#Portal_Player_ARM7TDMI
Without above patch, I get almost identical test_codec results as the file attached earlier.
Comment by Dave Hooper (stripwax) - Tuesday, 12 October 2010, 23:33 GMT
commited .. but i'll keep the patch tracker open to discuss any issues
Comment by Thomas Martitz (kugel.) - Sunday, 17 October 2010, 17:27 GMT
I think this can be closed, the bug is fixed.

Loading...