FS#9882 - Speed-up mod for large block (Vorbis)

Attached to Project: Rockbox
Opened by Sei Aoyumi (Aoyumi) - Sunday, 08 February 2009, 07:23 GMT
Last edited by MichaelGiacomelli (saratoga) - Sunday, 02 August 2009, 18:31 GMT
Task Type Patches
Category Codecs
Status Closed
Assigned To No-one
Operating System Sansa e200
Severity Low
Priority Normal
Reported Version Daily build (which?)
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No


This patch speeds up the decode of the OggVorbis file of mainly "q-1(/-2)[32kHz~48kHz]" equivalency encoded in libvorbis.

I tested this patch in Sansa e200, but probably even Sansa c200, iaudio M5/X5, iPod color/Photo/Mini 2nd gen effective.
This task depends upon

Closed by  MichaelGiacomelli (saratoga)
Sunday, 02 August 2009, 18:31 GMT
Reason for closing:  Accepted
Additional comments about closing:  Accepted in r20783 and r20843 so I think its safe to close this now.
Comment by Nils Wallménius (nls) - Sunday, 08 February 2009, 11:15 GMT
It seems like attaching the actual patch file failed.
Comment by Sei Aoyumi (Aoyumi) - Sunday, 08 February 2009, 13:58 GMT
Sorry, I try the upload of the file again.
Comment by MichaelGiacomelli (saratoga) - Tuesday, 10 February 2009, 02:11 GMT
I like the idea of this patch but I don't think we should be allocating IRAM for special cases that are rarely used (very low bitrates). I think it should be shared with other buffers and moved via memcpy at runtime if low bit rate files are played.
Comment by Sei Aoyumi (Aoyumi) - Thursday, 12 February 2009, 14:20 GMT
The new patch can use IRAM more flexibly.
Under the present conditions, it is used only in the window lookup table and a PCM buffer.
Comment by Rosso Maltese (asettico) - Tuesday, 03 March 2009, 13:29 GMT
Converted in unix format
Comment by Nils Wallménius (nls) - Tuesday, 17 March 2009, 20:49 GMT
hi, i made a test run with test_codec with this patch on my h300 and got some varying results.
some of my test files are decodign faster with your patch and some without...
attatched are the test codec logs
Comment by Dave Hooper (stripwax) - Saturday, 28 March 2009, 14:33 GMT
Updated patch with following additions

1. Create pcm back buffer (used by block.c) in IRAM if possible also. if we can do this, we can then use a true double-buffer approach (flip between bufferA and bufferB when synthesising), meaning vect_add can operate iram-to-iram, and save a memcpy for each block. If we can't do this, we use the old back buffer approach where we just memcpy the right-hand subframe of the current frame
2. replace call to vect_copy with call to memcpy instead, as they are surely interchangable and memcpy ought to be optimal
3. move vorbis_synthesis_blockin into ICODE so that vect_add operations (which are inline) operate in icode.
4. move vorbis_apply_window into ICODE so that vect_mult operations (which are also inline) operate in icode
5. On ARM, save one instruction for each register multiply in vect_mult_bw, by removing the left-shift (i.e. effectively use MULT32 instead of MULT31), and we then do the left-shift inline in vect_add_xxx. Implemented C code version (misc.h) also, although not yet for Coldfire (not sure if coldfire would have a benefit from that approach anyway)
6. Remove memset in vorbis_apply_window, since we already only add the parts of the frames that we know overlap, and the data outside that range is not used for lapping anyway

I would be very interested to know what improvement people observe on targets, especially non ARM targets. I see about 1.5MHz imrprovement over svn, with greatest improvement on lower bitrates.

Comment by Dave Hooper (stripwax) - Sunday, 29 March 2009, 17:22 GMT
Curious - on Coldfire (iriver H120) I see about 1.5MHz improvement on 96kbps file according to test_codec, but all other rates seem to run slower with these changes in fact. I can't think why that should be, unless the allocation is borked and we end up with buffers in mem that are in iram on svn
Comment by Dave Hooper (stripwax) - Sunday, 29 March 2009, 23:01 GMT
I made quite a stupid mistake in synthesis.c - which probably caused the Coldfire problem - updated patch
Comment by Dave Hooper (stripwax) - Sunday, 29 March 2009, 23:48 GMT
And I just made another stupid mistake in oggmalloc. I think that's the last of the silly mistakes but would appreciate a second pair of eyes on this.. did I accidentally move something into/outof icode?
Also seems my assertion #2 is flawed for coldfire - vect_copy indeed seems to run faster than memcpy (anyone know why that should be). Hacking it to use vect_copy again (only on coldfire; memcpy on all other targets) gives me the attached test_codec output. Still seems odd to me that coldfire should now run 2MHz slower on some bitrates than before the change, which is I guess the same thing nls observed earlier. Not sure if I've made it worse in fact.
Comment by MichaelGiacomelli (saratoga) - Sunday, 29 March 2009, 23:57 GMT
>did I accidentally move something into/outof icode

Perhaps its worth comparing the .map files to see how memory allocation differs with this patch.

>Also seems my assertion #2 is flawed for coldfire - vect_copy indeed seems to run faster than memcpy (anyone know why that should be)

amiconn once told me that different code should be used for IRAM access then DRAM access in order to maximize write performance. I don't really understand why, but it seems to make a pretty big difference for WMA:

This could be an issue where the memcpy isn't optimal because of IRAM or DRAM use. You may want to ask amiconn about it.
Comment by Dave Hooper (stripwax) - Tuesday, 14 April 2009, 22:20 GMT
Wonder if some of that could be down to a buggy alignment check in the (svn) asm_mcf5249.h - surely should be testing addr&15 not addr&16 . Trying that out first
Comment by Dave Hooper (stripwax) - Saturday, 18 April 2009, 15:27 GMT
Comment by Dave Hooper (stripwax) - Saturday, 18 April 2009, 16:33 GMT
Killed another typo - if prevwindow and thiswindow are both the same size, we were overlap/adding based on the large window size, rather than the actual window size. It meant previously I was overlapping a wider range than the minimum necessary.
Updated patch. This also now incorporates both the patches from  FS#10140  also.

Patch also now has /*comment*/ instead of // comments, and also includes patched apps/codecs/lib/* (to also incorporate  FS#10140 , plus using memcpy for vect_copy on arm). Note that apps/codecs/lib/* does NOT have the vorbis trick of delaying the shift in vect_mult until we do the vect_add : thinking that apps/codecs/lib/* is generic, I shouldn't assume that anyone using the lib would always and only call vect_add after vect_mult. So that optimisation is only in libtremor.

On coldfire-h120 I now see an extra 1MHz improvement (rather than worse performance than svn). (That's extra on top of  FS#10140 ). So about 2MHz improvement over svn.
On ipod-video-64MB I now see about 2MHz improvement over current svn.

nls - would be interested to know what performance difference you get on your test files.
Comment by Dave Hooper (stripwax) - Saturday, 18 April 2009, 16:45 GMT
Sigh, fixed patch (and removed memcpy in codecs/lib/asm_arm.h)
Does anything actually use those vect_xx functions in codecs/lib/ ? I couldn't see anything..
Comment by Dave Hooper (stripwax) - Saturday, 18 April 2009, 16:58 GMT
Aoyumi - would of course also be very interested to know if this still retains your original performance speedup on q1/2 files on Sansa e200 ..!
Comment by Nils Wallménius (nls) - Sunday, 19 April 2009, 09:08 GMT
Ran a fresh set of benchmarks on my h300, it looks good now, q1-q5 get a speedup of 1-2MHz and the higher qualities are pretty much unaffected.
Comment by Sei Aoyumi (Aoyumi) - Sunday, 19 April 2009, 15:26 GMT
Dave - I tested your patch in e200. The result was good generally.
Good work!
Comment by Dave Hooper (stripwax) - Saturday, 25 April 2009, 11:01 GMT
Darn, there appears to be something that causes occasional clicks/pops with this patch :( verified on sim.
I'll want to do more tests before committing
Comment by Dave Hooper (stripwax) - Saturday, 25 April 2009, 11:23 GMT
Ok, can't reproduce after all. Committing but if anyone notices audio degradation let me know
Comment by Sei Aoyumi (Aoyumi) - Sunday, 26 April 2009, 09:24 GMT
Dave - I noticed a problem to occur with your latest patch.
It causes a noise sound with high probability after processing to seek.
When a noise occurs once, my e200 continues producing a noise till rockbox initialize the libtremor again.
Comment by Magnus Holmgren (learman) - Sunday, 26 April 2009, 20:43 GMT
I have similar noise problems here (e200 too, playing low-bitrate mono files).
Comment by Magnus Holmgren (learman) - Tuesday, 28 April 2009, 19:13 GMT
Well, I don't understand why, but limited testing (on a low-bitrate file) suggests that removing the reset_pcmb flag removes the noise after seek... :)
Comment by Magnus Holmgren (learman) - Wednesday, 29 April 2009, 07:03 GMT
Rather, it fixed the noise on seek, but now I get it on track change instead... I think I understand why, so I'll try with only removing the set of reset_pcmb in vorbis_synthesis_restart (or maybe that should be done depending on the iram_pcm_doublebuffer flag?). Can't test it right now though.
Comment by Dave Hooper (stripwax) - Wednesday, 29 April 2009, 07:37 GMT
Right. The logic's not completely obvious:
If iram_pcm_storage > pcmend (i.e. we have at least one pcm buffer in iram), then we need to setup the buffer pointer, but only once.
From that point onwards, it should either flip automatically after each block (if we have two pcm buffers in iram), or the buffer will remain unchanged (if we do a memcpy between that buffer and the one in regular mem). Either way, we shouldn't reset the buffer pointer in synthesis - the buffer pointer is under the control of block.c
Except - on track change when bitrate changes and we now need a pcm buffer that doesn't fit in iram. (and from that point it should behave as before, with buffer pointer managed by block.c - in this case if the pcm buffer doesn't fit in iram then the buffer pointer doesn't need to change at all since we will always memcpy rather than buffer flip)
Except - on track change when the bitrate changes to something smaller again, and we go back to a pcm buffer that does fit into iram, when we need to reset the buffer pointer again to an initial state that is consistent between both synthesis.c and block.c.

I think I hadn't appreciated that seeking will go through vorbis_synthesis_restart - so maybe after a seek we just need to reset buffer pointers again to be consistent in both block.c and synthesis.c

Now actually, if the pcm buffer doesn't fit in iram (synthesis.c) we probably shouldn't care about reset_pcmb at all - we should always create a buffer in mem. So I think one other thing that's missing is the logic to selectively oggfree the vb->pcm buffer if it wasn't in iram. I'm not sure where that went, thought it was there but can't see it right now ..
Comment by Sei Aoyumi (Aoyumi) - Wednesday, 29 April 2009, 14:43 GMT
I discovered the cause of the problem. And I revised it.
The 2nd PCM buffer is initialized after processing to seek now.
Comment by Magnus Holmgren (learman) - Wednesday, 29 April 2009, 17:57 GMT
Yep, that works in the tests I've done.
Comment by Dave Hooper (stripwax) - Saturday, 02 May 2009, 10:38 GMT
Great - I'll commit that with a small change, we don't need the iram_pcm_doublebuffer flag anymore, just observe v->iram_double_pcm non-null.