Rockbox

FS#8894 - Speeding playback up/down without affecting pitch

Attached to Project: Rockbox
Opened by Stephane Doyon (sdoyon) - Tuesday, 15 April 2008, 03:23 GMT+2
Task Type Patches
Category Music playback
Status New
Assigned To No-one
Player type All players
Severity Low
Priority Normal
Reported Version current build
Due in Version Undecided
Due Date Undecided
Percent Complete 0%
Private No

Details

Speeding playback up/down without affecting pitch
aka time scaling.

The good news is: this actually works.
The bad news: it still needs work.

I've been using this on my players for many months now. Life has made it
so that my Rockbox spare time has been reduced to very little,
unfortunately. And this works just well enough that I have not felt
a pressing need to complete this. Therefore I am putting this up in its
current state so that it can be useful to others and in the hope that
someone will pick it up and give it the love it needs.

This work is based on a previously unreleased implementation by
Nicolas Pitre <nico@cam.org>. So the credit for this mostly goes to him.
It's loosely based on the WSOLA algorithm. Nicolas implemented this from
scratch, working from a good understanding of the general algorithm. My
contributions: I helped with a bit of tuning and a bug report or two, and
I started on this half-baked integration into Rockbox which is still
pretty rough.

Nicolas and I both used this implementation for a few years on our
Linux'ified iPAQ H3600 handhelds, to speed up talking books. Those have a
StrongArm processor running at 206MHz, which is relatively modest and
does not support floating point operations.

Nicolas released the code to me under GPL, with the explicit
understanding that I would post it here for integration into Rockbox. He
is not himself a Rockbox user at this time.

This patch has been tested on X5 and E200. It works well enough for
speeding up audio books (which are typically lower bit rate and mono). I
cannot stress enough how tremendously useful this feature is to me.

Slowing down speech works, but intelligibility is not much
improved. Music can also be sped up or slowed, but with significant
distortion.

I can speed up low bit rate speech to about a factor of 3, although in actual
use one would normally use a factor of 1.6 to 2. Speeding up high bit
rate music runs out of CPU at a somewhat lower factor.

Since I was familiar with Nicolas's implementation and I knew it did not
require too much CPU power, I naturally used that when trying to speed up
playback on Rockbox. Nowadays there are other implementations that could
potentially be used. The only one I have actually tried is SoundTouch,
and that comparison was admittedly done in haste. My findings were that
for speeding up speech, Nicolas's algorithm appeared to sound somewhat
better (less clicking distortion), while for slowing down music,
SoundTouch was better. Since my goal is to speed up audio books, and
since this implementation works well enough for me, I am not really
motivated to further investigate alternatives.

The main difficulty in integrating this algorithm into Rockbox is that it
needs a relatively large sound buffer to work on, which means a latency of
about 0.1s. This would be the first Rockbox DSP effect to have this kind of
requirement AFAICT. Also, the implementation was meant to process larger
chunks at a time, and I do not have a very accurate estimate of the
required input buffer size for the algorithm, so I am feeding it
larger chunks than absolutely necessary.

Some latency can be felt in the UI: little or none for low bitrate files,
but pretty bad for high bit rate files. A better integration with dsp.c
and better buffering estimations would presumably prevent this.

I haven't measured the effect on my battery life. Subjectively, it
doesn't feel disastrous, but I imagine it could be improved.

I've bypassed the IRAM buffer that was too small for my needs. It should
be easy to add logic to use the IRAM buffer at least when time scaling is
not in effect.

I've also left a bunch of debugging macros in there.

The algorithm has several tunables that trade quality for CPU
utilization. I imagine some DSP gurus might like to tinker with these and
with the code. I have played with this a bit and I think the current
quality level is (subjectively) just about right for speech.

Another interesting feature to add would be a true pitch shift function:
combining this time scaling function with the sampling rate alteration
effect (what Rockbox currently calls pitch) to produce an effect that
shifts pitch without affecting speed, or that allows controlling both
speed and pitch independently. I imagine musicians would find that useful.
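
As a rough illustration of how the two effects could be combined: resampling
by a factor changes both speed and pitch by that factor, while time scaling
changes speed only. So a target (speed, pitch) pair can be split into a
resampling factor and a time scaling factor. The sketch below is only that
arithmetic; the function name split_factors is hypothetical and nothing here
is taken from the patch.

```python
from fractions import Fraction


def split_factors(speed, pitch):
    """Split a desired (speed, pitch) pair into two effect factors.

    Hypothetical helper, for illustration only.
    - resample:   factor for the sampling rate alteration effect, which
                  changes speed AND pitch by the same factor.
    - time_scale: factor for the time scaling effect, which changes
                  speed only.
    Overall speed = resample * time_scale, overall pitch = resample.
    """
    resample = Fraction(pitch)
    time_scale = Fraction(speed) / resample
    return resample, time_scale
```

For example, raising pitch by a factor of 3/2 at unchanged speed would mean
resampling by 3/2 and time scaling by 2/3.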

I hope this will make other speech listeners as happy as it's made me.

Comment by Stephane Doyon (sdoyon) - Tuesday, 15 April 2008, 03:32 GMT+2
If/when this is committed, don't forget to add Nico's name to CREDITS.
Separate patch because this one is likely to go stale quickly...
Comment by Michael Sevakis (MikeS) - Tuesday, 15 April 2008, 10:09 GMT+2
Do the buffers have to be so large that they require removal from IRAM?

EDIT: I know you said it expects that but why? It could keep its own history buffer and interact with main dsp in smaller chunks, no?
Comment by Steve Bavin (pondlife) - Tuesday, 15 April 2008, 11:29 GMT+2
Hi Stephane,

Nice work! I'd be very interested in working on this and getting it committed, and would like to start by understanding the algorithm properly. Could you (or Nicolas) point me to a site (or book) which might act as a gentle introduction to the algorithm? I've tried googling WSOLA, but nothing remotely beginner-level appears.

p.s. Apologies for spelling your name wrong in CREDITS...
Comment by Thom Johansen (preglow) - Tuesday, 15 April 2008, 12:28 GMT+2
Ouch, moving those buffers out of IRAM for the sake of one feature is definitely out of the question. If it needs to buffer audio for lookahead, it should do so in its own buffer.
Comment by Stephane Doyon (sdoyon) - Wednesday, 16 April 2008, 06:09 GMT+2
pondlife wrote:
>Could you (or Nicolas) point me to a site (or book) which
>might act as a gentle introduction to the algorithm?

Indeed it's not exactly obvious. Unfortunately I'm not aware of any good
explanation/introduction to this.

There's this:
http://en.wikipedia.org/wiki/Audio_timescale-pitch_modification#Time_domain
but it's far from detailed.

Let me try to introduce the general idea in simple terms.

Roughly the idea is to pretend that your sound wave is a series of
repeating sequences, that repeat at whatever the fundamental frequency of
your sound is, and assuming that consecutive sequences are similar to one
another. On each iteration of the algorithm, we take a bit of the audio
we are about to process, and look ahead through the upcoming samples to
find a similar sequence. (You can look up autocorrelation.) We then
proceed to clip out (sort of) one of the sequences, to speed up the
sound, or duplicate one, to slow it down.

How long the compared sequence is and how far ahead we look depends on
the speed up factor, and in this implementation things are bounded by the
rough assumption that our fundamental frequency is higher than 100Hz. You
could do fancier stuff I'm sure. The best correlation is found by
searching for the offset at which we get the minimum of the sum of the
square of the deltas between corresponding samples. That's the costly
part. We actually have to skip a lot of those samples, and amazingly it
still works pretty well. There are possible variants for this step as
well of course, and you can tune quality vs processing time.
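
The search described above can be sketched roughly as follows. This is a
minimal illustration and not the code from the patch: the function name
best_offset and its parameters (window, min_offset, max_offset, step) are
hypothetical. It uses integer arithmetic only, and the step parameter stands
in for the sample skipping that trades quality for processing time.

```python
def best_offset(samples, start, window, min_offset, max_offset, step=1):
    """Find the offset whose window best matches the window at `start`.

    For each candidate offset, sum the squared differences between
    corresponding samples and keep the offset with the smallest sum.
    `step` > 1 compares only every step-th sample, which is cheaper
    but rougher. The caller must ensure that
    start + max_offset + window <= len(samples).
    """
    best = None
    best_cost = None
    for offset in range(min_offset, max_offset + 1):
        cost = 0
        for i in range(0, window, step):
            delta = samples[start + i] - samples[start + offset + i]
            cost += delta * delta
        if best_cost is None or cost < best_cost:
            best, best_cost = offset, cost
    return best
```

If one assumes a 44.1 kHz sample rate, a fundamental above 100Hz has a
period of at most 441 samples, which gives an idea of how far ahead such a
search might need to look.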

Now we don't actually clip parts out, or just repeat them, since that
would click like crazy, and of course our alignment and estimation of the
frequency are very rough. We do a kind of crossfading. When speeding up,
we take two consecutive sequences, and replace them with one that is the
result of mixing the two: giving most weight to the first sequence at the
beginning and shifting the weight to the second sequence as we go. So the
beginning matches the beginning of the first sequence and the end matches
the end of the second.
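
A minimal sketch of that mixing step, assuming a simple linear weighting
(the actual weighting curve used by the implementation is not stated here)
and two sequences of equal length. The function name crossfade is
hypothetical.

```python
def crossfade(first, second):
    """Blend two equal-length sequences into one.

    The weight moves linearly from the first sequence to the second,
    so the output starts exactly on first[0] and ends exactly on
    second[-1]. Integer arithmetic only; needs at least 2 samples.
    """
    n = len(first)
    if n != len(second) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    out = []
    for i in range(n):
        w_second = i              # weight of second: 0 .. n-1
        w_first = (n - 1) - i     # weight of first:  n-1 .. 0
        out.append((first[i] * w_first + second[i] * w_second) // (n - 1))
    return out
```

When speeding up, the two input sequences would be replaced by this single
blended one, which is what shortens the audio.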

Now I wrote this in 5mins, I'm no expert, and I'm only trying to
introduce the general idea here, so please be tolerant of this
explanation :-).

The part of the code that does what I just explained should be somewhat
understandable. However I admit some of the buffer management around it
is still somewhat obscure to me.

>p.s. Apologies for spelling your name wrong in CREDITS...

:-) No worries. And at least YOU had put in the accented E ;-) !
Comment by Stephane Doyon (sdoyon) - Wednesday, 16 April 2008, 06:10 GMT+2
MikeS wrote:
>Do the buffers have to be so large that they require removal from IRAM?
>EDIT: I know you said it expects that but why? It could keep its own
>history buffer and interact with main dsp in smaller chunks, no?

So the autocorrelation phase needs a large chunk of input to operate
on. This isn't just history exactly: we're considering what chunk to clip
out and crossfade, and we can't start outputting it until we've decided.

The current implementation will work fine if you feed it tiny inputs: it
will buffer them internally. Empirically (not sure how to calculate
precisely) the minimum required buffer size is 3524 samples
(3524 * 4 bytes). Some memory copies are of course involved in maintaining
that buffer, so this isn't terribly efficient.

A limitation of the current implementation is that when it has decided on
a correlation, it outputs an entire "frame" in one call. With some work,
it's conceivable the implementation could be made to split that operation
across successive calls. Alternatively the output could be buffered again
(with some more memory copies).

The current implementation has an output buffer of 4096 samples.
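
To put those buffer sizes in perspective, here is the arithmetic as a small
sketch. The 4 bytes per sample and the two sample counts come from the
figures above; the 44.1 kHz sample rate is an assumption, and the helper
names are hypothetical.

```python
SAMPLE_RATE_HZ = 44100    # assumption: typical CD-quality rate
BYTES_PER_SAMPLE = 4      # from the minimum input buffer figure above
MIN_INPUT_SAMPLES = 3524  # empirical minimum input buffer
OUTPUT_SAMPLES = 4096     # current output buffer


def buffer_bytes(samples):
    """Memory needed to hold the given number of samples."""
    return samples * BYTES_PER_SAMPLE


def buffer_ms(samples, rate_hz=SAMPLE_RATE_HZ):
    """Duration of audio held in the buffer, in milliseconds."""
    return 1000.0 * samples / rate_hz
```

Under that assumption the input buffer is 14096 bytes and holds roughly
80 ms of audio, and the output buffer is 16384 bytes.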

So we could feed it smaller chunks, but then the dsp/playback stack must
be ready for multiple iterations with 0 output. And we could possibly
have it emit smaller output, but then again you'd have several iterations
with 0 input and as much output as the rest of the pipeline can
handle. And the whole thing would probably be less efficient.
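
The "multiple iterations with 0 output" behaviour can be illustrated with a
small buffering wrapper. This is a sketch under stated assumptions, not the
patch's code: the class FrameBuffer and its feed method are hypothetical,
and the frame is passed through unchanged where the real code would do the
time scaling.

```python
class FrameBuffer:
    """Accumulate small input chunks until a whole frame is available."""

    def __init__(self, frame_size):
        self.frame_size = frame_size
        self.pending = []

    def feed(self, chunk):
        """Add a chunk; return one full frame, or [] if not enough yet."""
        self.pending.extend(chunk)
        if len(self.pending) < self.frame_size:
            return []  # caller must cope with zero output
        frame = self.pending[:self.frame_size]
        del self.pending[:self.frame_size]
        # Placeholder: real code would time-scale the frame here.
        return frame
```

With a frame size of 8 and chunks of 3 samples, the caller would see output
lengths of 0, 0, 8, 0 over four calls, which is the pattern the rest of the
pipeline would have to tolerate.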

I'm not used to dealing with IRAM, so you'll have to educate me.
Also I'm not a DSP guru and I'm not particularly familiar with the
hardware either.

The autocorrelation is the costly step, as we look at the samples several
times. If someone were able to arrange this so we could do that step from
IRAM, I imagine there would be a huge gain. But I'm not sure how big a
buffer that would require, and how much IRAM we have left. In any case
this task is beyond me given my current availability.

Even if the required buffer is too large for IRAM in the general case, it
might perhaps be worth trying to make it work for cases where the audio
was downsampled to 22.05 kHz or lower, as presumably the buffer
requirements might be more manageable, and this may be the case often
enough for speech.

So assuming autocorrelation cannot be done from IRAM, because it requires
too much space or because no one figures out the code, it means that this
costly step is done from ordinary RAM. I thought that the other DSP
effects (at least those I use) probably had a relatively smaller cost
compared to this one. I had speculated that in that case, the benefits of
IRAM were perhaps not so interesting, since they might be offset by
multiple calls and memory copies around the time scaling code.

The one big TODO that I left hanging however is to make sure to use the
existing IRAM buffer whenever time scaling is not in effect.
Comment by Glenn (DancemasterGlenn) - Wednesday, 16 April 2008, 22:30 GMT+2
Not sure if this is helpful, but there are two time/pitch independent plugins I've used before to great success, that might be either used as reference or possibly be ported themselves if useful. Here are the links: SoundTouch (http://www.surina.net/soundtouch/index.html) and Rubber Band (http://www.breakfastquay.com/rubberband/). They're made for PCs, but perhaps looking through their sources will be helpful.

I've been hoping for an independent pitch-shifting plugin for a while (I'm a musician), I hope this will be implemented!
Comment by Daniel Dalton (ddalton) - Wednesday, 18 June 2008, 13:11 GMT+2
Hi Stephane,

The version of this patch you emailed me (it may already be up
here) works well for me.
I tested it on some music and it didn't lock up the player like older
versions did. Worked very well and the playback did speed up and slow
down. Some problems:
- Sometimes the player was slow to react when I held up/down in the
speed screen for a few seconds. As in it would start speeding right up
and I don't think the voice or the player could keep up.
- When the music was very fast, voice sometimes cut in and out, but
was still intelligible.

One thing I would suggest, do you think making the screen act like the
pitch one would be a good idea? So you could hit select and the speed
would go back to the default (100).

But looks nice from a quick test.

Thanks.
Comment by Stephane Doyon (sdoyon) - Sunday, 29 June 2008, 04:30 GMT+2
Here's an updated version. Sync, and a change to keep using
the IRAM resampling/dsp buffers when speed is not being altered.

Daniel Dalton wrote:
>The version of this patch you emailed me (it may already be up
>here) works well for me.

Here it is.

>I tested it on some music and it didn't lock up the player like older
>versions did.

Erm well I didn't fix anything. This thing may stress buffering
somewhat, and I believe there have been a few fixes in that area in
the past weeks. (Mind you I'm not saying this code is perfect and
wouldn't ever freeze your player... but personally I still see
occasional freezes both with and without speed alteration.)

>Worked very well and the playback did speed up and slow
>down.

Good! At least one person tried it :-).

What player was this?

>Some problems:
>- Sometimes the player was slow to react when I held up/down in the
>speed screen for a few seconds. As in it would start speeding right up
>and I don't think the voice or the player could keep up.

Yes. That's what happens when you speed up high bitrate files beyond
what your player's CPU can handle. Ideally the range of the setting
would be clamped to something that doesn't let you shoot yourself in
the foot too much.

>One thing I would suggest, do you think making the screen act like the
>pitch one would be a good idea? So you could hit select and the speed
>would go back to the default (100).

You can do this by pressing CONTEXT, probably LONG SELECT. This is one
of the main reasons I added that feature to reset a setting to its
default.

Thanks for testing and reporting!
