Rockbox

Tasklist

FS#11398 - libmad: changes to mad_bit_read()

Attached to Project: Rockbox
Opened by Andree Buschmann (Buschel) - Sunday, 13 June 2010, 10:01 GMT
Last edited by Andree Buschmann (Buschel) - Tuesday, 15 June 2010, 06:34 GMT
Task Type Patches
Category Codecs
Status Closed
Assigned To Andree Buschmann (Buschel)
Operating System All players
Severity Low
Priority Normal
Reported Version Release 3.4
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

When reviewing the libmad code segment mad_bit_read() it became obvious that on LITTLE_ENDIAN targets the function betoh32() is called in each call of mad_bit_read(). This is not needed as the address from which the data is loaded does not change on each call of mad_bit_read(). So, we can buffer betoh32'ed data.

Patch v01 adds two global variables. The first holds the address, the second holds the data. If the address did not change in the current call of mad_bit_read(), the old -- already betoh32'ed -- data is used. In case of the need to load the new address the global variables are set accordingly.

Measurements on a sample file showed that buffered data is used in 2/3 of all calls. So, on LITTLE_ENDIAN targets 2/3 of betoh32()-calls are eliminated, on Coldfire targets it might be of interest to measure speed as well. Reason is that the buffered variables are located in IRAM. A very similar approach in libmpc sped up Coldfire by several MHz.
Strange thing is that the effect on arm7tdmi is (slightly) negative. Without patch 37.0 MHz (single core), with patch 37.15 MHz (single core).

Patch v02 gets rid of global variables and adds both variables to the struct mad_bitptr. Libmad's code defined two different structs of this name! Therefore patch v02 needs to add dummy variables to the "other" struct as well to avoid overwriting of memory. I will check for a better solution which cleans up this mess...

Needed:
a) Tests on Coldfire targets!
b) Ideas why there is a slight slowdown on arm (which should be LITTLE_ENDIAN)?
c) Clean up of struct mess.

This task depends upon

Closed by  Andree Buschmann (Buschel)
Tuesday, 15 June 2010, 06:34 GMT
Reason for closing:  Rejected
Additional comments about closing:  Slower an arm and coldfire.
Comment by Andree Buschmann (Buschel) - Sunday, 13 June 2010, 10:49 GMT
There were several double dedfined macros and structs in mad.h. This patch cleans up mad.h in addition the the buffering change.
Comment by Andree Buschmann (Buschel) - Monday, 14 June 2010, 05:58 GMT
Submitted the clean up of mad.h with r26838. Added a detaioed comment and updated patch.
Comment by Nils Wallménius (nls) - Monday, 14 June 2010, 08:37 GMT
Hi, i'm not familiar with the mad code but wouldn't it make sense to byteswap that data in one go when it's loaded intsead of this?
Comment by Andree Buschmann (Buschel) - Monday, 14 June 2010, 09:05 GMT
nls, I was thinking of the same. Maybe this could be a future solution for LITTLE_ENDIAN targets. Nevertheless this change needs to be tested on Coldfire due to the DRAM-vs-IRAM accesses.
Comment by Marcin Bukat (MarcinBukat) - Monday, 14 June 2010, 11:44 GMT
tested on MPIO HD200 (mcf5249)
virgin svn: 463.19% realtime 26.81MHz needed for realtime
libmad_v04.patch: 458.22% realtime 27.10MHz needed for realtime
Comment by Andree Buschmann (Buschel) - Monday, 14 June 2010, 11:59 GMT
I'm puzzled :/
Comment by Nils Wallménius (nls) - Monday, 14 June 2010, 12:35 GMT
for coldfire it doesn't seem so surprising, iiuc your patch introduces an extra load from iram (bitptr->buffered_addr), a comparison and a conditional branch while replacing a load from dram with one from iram.
Comment by Andree Buschmann (Buschel) - Monday, 14 June 2010, 13:04 GMT
Ok. I expected at least some gain as DRAM access is sloooow on Coldfire... In mpc's bit fiddling there were 3 DRAM accesses per call instead of only 1. Maybe this made up the difference...
Comment by Andree Buschmann (Buschel) - Tuesday, 15 June 2010, 06:34 GMT
Just disassembled the resulting asm for betoh32() -- it is only 4 cycles. When taking into account the added code and compare it to the unpatched code the break-even for arm is reached when (betoh32() + ldr from DRAM) is ~10 cycles. If it takes more cycles, the patch will speed up, otherwise it will slow down.

I'll close this entry.

Loading...