Rockbox

Tasklist

FS#12223 - sinc resampling for rockbox

Attached to Project: Rockbox
Opened by Stefan Keller (drezon) - Tuesday, 09 August 2011, 10:36 GMT
Task Type Patches
Category Music playback
Status Unconfirmed
Assigned To No-one
Operating System All players
Severity Low
Priority Normal
Reported Version Daily build (which?)
Due in Version Undecided
Due Date Undecided
Percent Complete 0%
Votes 0
Private No

Details

The attached patch implements sinc based resampling in rockbox.
I have only tested it stand-alone so far (i.e. the code outside of rockbox -- I am waiting for a replacement for my broken android phone to test it there).
There is (of course) quite some room for improvement.
Things which could be improved:
- buffer handling
- calculate the correct sinc spanning the whole table for downsampling. Then the downsampling and upsampling code would be the same (no need for sinc_increment, saving some additions). Also the result would be more accurate resampling.
- Save some memory by always calculating the correct sinc and putting it into resample_data. Would need to use fp_sincos() for that.
- Save some more memory by taking advantage of sinc's symmetry, only storing half of it in the table.

I know that I'm breaking the coding-style by using more than 80 chars in some lines. I will fix that when I have time. I wanted to get this code finally out to get some feedback after it sitting for two weeks (basically unchanged) on my HD.
This task depends upon

Comment by Stefan Keller (drezon) - Tuesday, 09 August 2011, 16:30 GMT
Fixed memset call (duh!). Don't compile asm versions of linear resampling.

Now it WorksForMe on my Android phone.
Comment by MichaelGiacomelli (saratoga) - Tuesday, 09 August 2011, 19:32 GMT
I just looked at this quickly, but one thing stood out. Why do you do "current>>16" and "dsp->data.resample_data.current = FILTER_SIZE<<16;"? Wouldn't it be a lot easier to read if you just stored the actual index value?
Comment by Stefan Keller (drezon) - Wednesday, 10 August 2011, 11:28 GMT
current is a fixed-point 16.16 value. Of course you could store the fractional part in another var (complicating addition -- at least in C).
Comment by Thomas Martitz (kugel.) - Wednesday, 10 August 2011, 11:44 GMT
I think he means that you do "currentX = current >> 16" only once and use currentX for the rest of the calculation. The same goes for (current>>8)&0xff and other places I guess.
I also think the sinc tables should be int16_t to save some space.
Comment by MichaelGiacomelli (saratoga) - Wednesday, 10 August 2011, 15:28 GMT
I think I understand what you're doing a little better. I see that you have current as a fixed point value because you might need to increment it by a fractional value. In sinc_upsample I think this can be avoided easily enough by just passing in the nearest integer. In sinc_downsample this looks more awkward, and the index calculations for each tap of the filter depend on a fixed point add and round thats not easily removed from the filter kernel.

However, since sinc_increment depends only on the sample rate conversion, theres only a few possible values it can take for downsampling (corresponding to 48->44.1, 88.2->44.1 and 96->44.1). Would it be possible to just store the needed coefficients for each of these three conversions (and perhaps also for the upsample cases 11.025->44.1, 22.05->44.1 and 32->44.1)? This would have the added advantage that we'd have regular stride lengths, which are much easier to optimize, particularly on systems like ours with small cache memory.
Comment by Stefan Keller (drezon) - Thursday, 11 August 2011, 09:33 GMT
As I wrote above in the initial post, we could get rid of sinc_downsample() and sinc_increment, if the sinc is precalced (scaled and stretched) according to the ratio in resampler_new_delta().
How would you get rid of the fixed-point current in sinc_upsample? Note, that the highest 8 bit of the fractional part are used as an index into the sinc (to get the sinc's value between the two samples / zero-crossings).
Comment by Stefan Keller (drezon) - Thursday, 11 August 2011, 09:38 GMT
New version fixes sinc calculation / windowing in the octave script used to create sinc.c, plus new sinc.c. Also used temporary current_int & _frac vars where it made sense to improve readability.
Comment by Stefan Keller (drezon) - Monday, 15 August 2011, 15:35 GMT
New version:
- exploit sinc's symmetry and only store one half
- stretch and scale for downsampling in resampler_new_delta to have unified core up/downsampling code
- reverse sinc indices to help keeping related data in a cache line
Comment by Thomas Martitz (kugel.) - Monday, 15 August 2011, 15:37 GMT
Have you done any performance testing by now? We have the test_codec plugin which can also test with and without dsp effects for comparison.
Comment by Stefan Keller (drezon) - Thursday, 18 August 2011, 21:32 GMT
Fixed memset again (double-duh!) -- messed up cherry-pick
Fixed resample_data usage to not have to adapt all assembler code
sinc is alloced using buffer_alloc (not enough IRAM on Clip+)
Corrected input/output sample number calculations
Comment by Stefan Keller (drezon) - Thursday, 18 August 2011, 21:35 GMT
I did some test_codec benchmarks on the Clip+ (48KHz, 320kbps MP3):
Without DSP: 668.14% 35.92 MHz
With DSP linear interpolation: 561.49% 42.74 MHz
With DSP sinc interpolation: 364.41% 65.85 MHz
Linear interpolation used the ARM assembler code, for sinc there is only C so far -- I don't know how much speedup one could expect by using assembler.
Comment by Andree Buschmann (Buschel) - Monday, 29 August 2011, 08:33 GMT
The filter operation in sinc_resample() can be optimized a lot in assembler. Multiply-add (ARM: mla, mlal) can be used to save the additions, load-multiple (ARM: ldm) can be introduced to save memory access cycles, operations could be re-ordered to avoid stalls. For ARM I would expect a factor of 2-3.

Edit: I just tried v1.4 in my sim for a 48kHz-file, down-sampled to 44.1kHz and several upsampled files. The upsampling results in heavy distortions...
Comment by Andree Buschmann (Buschel) - Monday, 29 August 2011, 10:19 GMT
v1.5 does a little refactoring of sinc_resample(). v1.5 uses FRACMUL which has ASM-code for ARM and CF and provides higher precision. Right now this new macro pre-scales the sinc-coefficients by <<12, this can be removed via re-generating the coefficients in apps/sinc.c . The new code does not result in the clipping artifacts which the v1.4 patch showed with my upsampled files.
Comment by Andree Buschmann (Buschel) - Monday, 29 August 2011, 21:00 GMT
The attached patches demonstrates ARM ASM for the most CPU consuming function sinc_resample(). v04 uses the 15-tap filter as defined in all former patches, v05 uses a 11-tap filter. I could not hear a significant difference between the results of a 11-tap and a 15-tap filter on my sample file (8 kHz upsampled to 44.1 kHz). I have attached screenshots (sorry for the quality) of spectral analysis of svn, 11-tap and 15-tap sinc-resapling. You can easily see how much better the sinc-resampling is (and sounds!) and that 11-tap does not loose too much aliasing suppression compared to the 15-tap sinc-filter.

Some performance measurements on my iPod Video or 48 kHz to 44.1 kHz downasampling of a VBR MP3 file:
svn (no resampling): ~24 MHz
svn (linear interpolation): ~31 MHz
v1.5 (no ASM): ~58 MHz
v1.5 (v04 ASM): ~49 MHz
v1.5 (v05 ASM): ~44 MHz

The speedup from v1.5 to v1.5_v04 is smaller as I have expected.

@drezon: Could you review all the memcpy's that are inserted for sinc-resampling. Those will most likely hurt some of our targets. Is it possible to get rid of them?
Comment by MichaelGiacomelli (saratoga) - Monday, 29 August 2011, 21:17 GMT
Thats not bad for ARMv4. I suspect on v5E it'll be 2-3 MHz faster, and probably another 2-3 MHz on top of that if we can use 16 bit packed multiplies. A ~10-15MHz hit for better resampling would be quite reasonable I think.
Comment by Stefan Keller (drezon) - Tuesday, 06 September 2011, 06:32 GMT
@Buschel: Thanks for reviewing, testing an enhancing the patch!
I was (and currently am) on vacation and don't really have internet access. I did a quick check on your patches and noticed some things:
- the sinc is static again, but you need two of them because there are two dsp states (for voice and audio) with possibly different sampling rates. Maybe we should use linear interpolation for voice.
- the cache-align code was left in but then one need to allocate CACHEALIGN-1 more Bytes than actually needed
- for the 11 tap version the sinc should be calculated differently (the windowing should be done differently). Also for this version one only needs 256*6 int32_ts, saving some memory in addition to cpu cycles.

Changing the octave script to scale the sinc by 2^12 to get rid of the shifts is easy. Once I'm back from vacation I will do this (and enhance the other things mentioned above).

Getting rid of the memcpys isn't that easy. By merging the two buffers we could get rid of some memcpys but the code would become pretty complex and hard to read I guess. I'll ponder over this. :)
Comment by Michael Sevakis (MikeS) - Tuesday, 13 September 2011, 17:23 GMT
Perhaps it's time to consider moving stuff out of dsp.c into their own source files. It's getting messy in there.

DSP stages must obey output constraints as any output buffer request may in fact get the size truncated (crossfade will do this at all times). tdspeed is already a troublesome thing in this respect due to storing data and then spitting out whatever it wants when it's ready to output a frame (bad, bad, bad). Input/output counts based upon frequency ratios aren't sufficient hints.

Another thing is it's not a good idea to use this resampling for all codecs on all processors. Some are already near the limit of cababilities of the CPU (like SPC). Linear must remain available and hints given by the metadata parsers or some other mechanism as to which to use.
Comment by Michael Sevakis (MikeS) - Wednesday, 28 March 2012, 00:50 GMT
Now have a patch (g200) that does address what I commented upon in the prior post and should make integration easier for more "exotic" effect and this alternate resampler to share space with the fast one.

Loading...