Rockbox

  • Status Closed
  • Percent Complete
    100%
  • Task Type Patches
  • Category Codecs
  • Assigned To No-one
  • Operating System Sansa e200
  • Severity Low
  • Priority Very Low
  • Reported Version Daily build (which?)
  • Due in Version Undecided
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Rockbox
Opened by Aoyumi - 2009-02-08
Last edited by saratoga - 2009-08-02

FS#9882 - Speed-up mod for large block (Vorbis)

This patch speeds up the decode of the OggVorbis file of mainly “q-1(/-2)[32kHz~48kHz]” equivalency encoded in libvorbis.

I tested this patch in Sansa e200, but probably even Sansa c200, iaudio M5/X5, iPod color/Photo/Mini 2nd gen effective.

Closed by  saratoga
2009-08-02 18:31
Reason for closing:  Accepted
Additional comments about closing:  

Accepted in r20783 and r20843 so I think its safe to close this now.

nls commented on 2009-02-08 11:15

It seems like attaching the actual patch file failed.

Sorry, I try the upload of the file again.

I like the idea of this patch but I don’t think we should be allocating IRAM for special cases that are rarely used (very low bitrates). I think it should be shared with other buffers and moved via memcpy at runtime if low bit rate files are played.

The new patch can use IRAM more flexibly.
Under the present conditions, it is used only in the window lookup table and a PCM buffer.

Converted in unix format

nls commented on 2009-03-17 20:49

hi, i made a test run with test_codec with this patch on my h300 and got some varying results.
some of my test files are decodign faster with your patch and some without… attatched are the test codec logs

Updated patch with following additions

1. Create pcm back buffer (used by block.c) in IRAM if possible also. if we can do this, we can then use a true double-buffer approach (flip between bufferA and bufferB when synthesising), meaning vect_add can operate iram-to-iram, and save a memcpy for each block. If we can’t do this, we use the old back buffer approach where we just memcpy the right-hand subframe of the current frame
2. replace call to vect_copy with call to memcpy instead, as they are surely interchangable and memcpy ought to be optimal
3. move vorbis_synthesis_blockin into ICODE so that vect_add operations (which are inline) operate in icode.
4. move vorbis_apply_window into ICODE so that vect_mult operations (which are also inline) operate in icode
5. On ARM, save one instruction for each register multiply in vect_mult_bw, by removing the left-shift (i.e. effectively use MULT32 instead of MULT31), and we then do the left-shift inline in vect_add_xxx. Implemented C code version (misc.h) also, although not yet for Coldfire (not sure if coldfire would have a benefit from that approach anyway)
6. Remove memset in vorbis_apply_window, since we already only add the parts of the frames that we know overlap, and the data outside that range is not used for lapping anyway

I would be very interested to know what improvement people observe on targets, especially non ARM targets. I see about 1.5MHz imrprovement over svn, with greatest improvement on lower bitrates.

Curious - on Coldfire (iriver H120) I see about 1.5MHz improvement on 96kbps file according to test_codec, but all other rates seem to run slower with these changes in fact. I can’t think why that should be, unless the allocation is borked and we end up with buffers in mem that are in iram on svn

I made quite a stupid mistake in synthesis.c - which probably caused the Coldfire problem - updated patch

And I just made another stupid mistake in oggmalloc. I think that’s the last of the silly mistakes but would appreciate a second pair of eyes on this.. did I accidentally move something into/outof icode?
Also seems my assertion #2 is flawed for coldfire - vect_copy indeed seems to run faster than memcpy (anyone know why that should be). Hacking it to use vect_copy again (only on coldfire; memcpy on all other targets) gives me the attached test_codec output. Still seems odd to me that coldfire should now run 2MHz slower on some bitrates than before the change, which is I guess the same thing nls observed earlier. Not sure if I’ve made it worse in fact.

did I accidentally move something into/outof icode

Perhaps its worth comparing the .map files to see how memory allocation differs with this patch.

Also seems my assertion #2 is flawed for coldfire - vect_copy indeed seems to run faster than memcpy (anyone know why that should be)

amiconn once told me that different code should be used for IRAM access then DRAM access in order to maximize write performance. I don’t really understand why, but it seems to make a pretty big difference for WMA:

http://svn.rockbox.org/viewvc.cgi?view=rev&revision=18228

This could be an issue where the memcpy isn’t optimal because of IRAM or DRAM use. You may want to ask amiconn about it.

Wonder if some of that could be down to a buggy alignment check in the (svn) asm_mcf5249.h - surely should be testing addr&15 not addr&16 . Trying that out first

Ah-hah…

Killed another typo - if prevwindow and thiswindow are both the same size, we were overlap/adding based on the large window size, rather than the actual window size. It meant previously I was overlapping a wider range than the minimum necessary.
Updated patch. This also now incorporates both the patches from  FS#10140  also.

Patch also now has /*comment*/ instead of // comments, and also includes patched apps/codecs/lib/* (to also incorporate  FS#10140 , plus using memcpy for vect_copy on arm). Note that apps/codecs/lib/* does NOT have the vorbis trick of delaying the shift in vect_mult until we do the vect_add : thinking that apps/codecs/lib/* is generic, I shouldn’t assume that anyone using the lib would always and only call vect_add after vect_mult. So that optimisation is only in libtremor.

On coldfire-h120 I now see an extra 1MHz improvement (rather than worse performance than svn). (That’s extra on top of  FS#10140 ). So about 2MHz improvement over svn.
On ipod-video-64MB I now see about 2MHz improvement over current svn.

nls - would be interested to know what performance difference you get on your test files.

Sigh, fixed patch (and removed memcpy in codecs/lib/asm_arm.h)
Does anything actually use those vect_xx functions in codecs/lib/ ? I couldn’t see anything..

Aoyumi - would of course also be very interested to know if this still retains your original performance speedup on q1/2 files on Sansa e200 ..!

nls commented on 2009-04-19 09:08

Ran a fresh set of benchmarks on my h300, it looks good now, q1-q5 get a speedup of 1-2MHz and the higher qualities are pretty much unaffected.

Dave - I tested your patch in e200. The result was good generally.
Good work!

Darn, there appears to be something that causes occasional clicks/pops with this patch :( verified on sim.
I’ll want to do more tests before committing

Ok, can’t reproduce after all. Committing but if anyone notices audio degradation let me know

Dave - I noticed a problem to occur with your latest patch.
It causes a noise sound with high probability after processing to seek.
When a noise occurs once, my e200 continues producing a noise till rockbox initialize the libtremor again.

I have similar noise problems here (e200 too, playing low-bitrate mono files).

Well, I don’t understand why, but limited testing (on a low-bitrate file) suggests that removing the reset_pcmb flag removes the noise after seek… :)

Rather, it fixed the noise on seek, but now I get it on track change instead… I think I understand why, so I’ll try with only removing the set of reset_pcmb in vorbis_synthesis_restart (or maybe that should be done depending on the iram_pcm_doublebuffer flag?). Can’t test it right now though.

Right. The logic’s not completely obvious:
If iram_pcm_storage > pcmend (i.e. we have at least one pcm buffer in iram), then we need to setup the buffer pointer, but only once.
From that point onwards, it should either flip automatically after each block (if we have two pcm buffers in iram), or the buffer will remain unchanged (if we do a memcpy between that buffer and the one in regular mem). Either way, we shouldn’t reset the buffer pointer in synthesis - the buffer pointer is under the control of block.c
Except - on track change when bitrate changes and we now need a pcm buffer that doesn’t fit in iram. (and from that point it should behave as before, with buffer pointer managed by block.c - in this case if the pcm buffer doesn’t fit in iram then the buffer pointer doesn’t need to change at all since we will always memcpy rather than buffer flip)
Except - on track change when the bitrate changes to something smaller again, and we go back to a pcm buffer that does fit into iram, when we need to reset the buffer pointer again to an initial state that is consistent between both synthesis.c and block.c.

I think I hadn’t appreciated that seeking will go through vorbis_synthesis_restart - so maybe after a seek we just need to reset buffer pointers again to be consistent in both block.c and synthesis.c

Now actually, if the pcm buffer doesn’t fit in iram (synthesis.c) we probably shouldn’t care about reset_pcmb at all - we should always create a buffer in mem. So I think one other thing that’s missing is the logic to selectively oggfree the vb→pcm buffer if it wasn’t in iram. I’m not sure where that went, thought it was there but can’t see it right now ..

I discovered the cause of the problem. And I revised it.
The 2nd PCM buffer is initialized after processing to seek now.

Yep, that works in the tests I’ve done.

Great - I’ll commit that with a small change, we don’t need the iram_pcm_doublebuffer flag anymore, just observe v→iram_double_pcm non-null.

Loading...

Available keyboard shortcuts

Tasklist

Task Details

Task Editing