FS#8894 - Speeding playback up/down without affecting pitch
Attached to Project: Rockbox
Opened by Stephane Doyon (sdoyon) - Tuesday, 15 April 2008, 03:23 GMT+2
Details

Speeding playback up/down without affecting pitch, aka time scaling.

The good news is: this actually works. The bad news: it still needs work. I've been using this on my players for many months now. Life has made it so that my Rockbox spare time has been reduced to very little, unfortunately. And this works just well enough that I have not felt a pressing need to complete it. Therefore I am putting this up in its current state so that it can be useful to others, and in the hope that someone will pick it up and give it the love it needs.

This work is based on a previously unreleased implementation by Nicolas Pitre <nico@cam.org>, so the credit for this mostly goes to him. It's loosely based on the WSOLA algorithm. Nicolas implemented this from scratch, working from a good understanding of the general algorithm. My contributions: I helped with a bit of tuning and a bug report or two, and I started on this half-baked integration into Rockbox, which is still pretty rough.

Nicolas and I both used this implementation for a few years on our Linux'ified iPAQ H3600 handhelds, to speed up talking books. Those have a StrongARM processor running at 206 MHz, which is relatively modest and does not support floating point operations.

Nicolas released the code to me under GPL, with the explicit understanding that I would post it here for integration into Rockbox. He is not himself a Rockbox user at this time.

This patch has been tested on X5 and E200. It works well enough for speeding up audio books (which are typically lower bit rate and mono). I cannot stress enough how tremendously useful this feature is to me. Slowing down speech works, but intelligibility is not much improved. Music can also be sped up or slowed, but with significant distortion. I can speed up low bit rate speech by a factor of about 3, although in actual use one would normally use a factor of 1.6 to 2. Speeding up high bit rate music runs out of CPU at a somewhat lower factor.
Since I was familiar with Nicolas's implementation and I knew it did not require too much CPU power, I naturally used that when trying to speed up playback on Rockbox. Nowadays there are other implementations that could potentially be used. The only one I have actually tried is SoundTouch, and that comparison was admittedly done in haste. My findings were that for speeding up speech, Nicolas's algorithm appeared to sound somewhat better (less clicking distortion), while for slowing down music, SoundTouch was better. Since my goal is to speed up audio books, and since this implementation works well enough for me, I am not really motivated to further investigate alternatives.

The main difficulty in integrating this algorithm into Rockbox is that it needs a relatively large sound buffer to work on, which means a latency of about 0.1s, and this would be the first Rockbox DSP effect to have this kind of requirement AFAICT. Also, the implementation was meant to process larger chunks at a time, and I do not have a very accurate estimate of the input buffer size the algorithm requires, so I am feeding it larger chunks than absolutely necessary. Some latency can be felt in the UI: little or none for low bit rate files, but pretty bad for high bit rate files. A better integration with dsp.c and better buffer size estimates would presumably prevent this.

I haven't measured the effect on my battery life. Subjectively, it doesn't feel disastrous, but I imagine it could be improved.

I've bypassed the IRAM buffer, which was too small for my needs. It should be easy to add logic to use the IRAM buffer at least when time scaling is not in effect. I've also left a bunch of debugging macros in there.

The algorithm has several tunables that trade quality for CPU utilization. I imagine some DSP gurus might like to tinker with these and with the code. I have played with this a bit and I think the current quality level is (subjectively) just about right for speech.
Another interesting feature to add would be a true pitch shift function: combining this time scaling function with the sampling rate alteration effect (what Rockbox currently calls pitch) to produce an effect that shifts pitch without affecting speed, or that allows controlling both speed and pitch independently. I imagine musicians would find that useful.

I hope this will make other speech listeners as happy as it's made me.
Separate patch because this one is likely to go stale quickly...
Do the buffers have to be so large that they require removal from IRAM?
EDIT: I know you said it expects that but why? It could keep its own history buffer and interact with main dsp in smaller chunks, no?
Nice work! I'd be very interested in working on this and getting it committed, and would like to start by understanding the algorithm properly. Could you (or Nicolas) point me to a site (or book) which might act as a gentle introduction to the algorithm? I've tried googling WSOLA, but nothing remotely beginner-level appears.
p.s. Apologies for spelling your name wrong in CREDITS...
>Could you (or Nicolas) point me to a site (or book) which
>might act as a gentle introduction to the algorithm?
Indeed it's not exactly obvious. Unfortunately I'm not aware of any good
explanation/introduction to this.
There's this:
http://en.wikipedia.org/wiki/Audio_timescale-pitch_modification#Time_domain
but it's far from detailed.
Let me try to introduce the general idea in simple terms.
Roughly the idea is to pretend that your sound wave is a series of
repeating sequences, that repeat at whatever the fundamental frequency of
your sound is, and assuming that consecutive sequences are similar to one
another. On each iteration of the algorithm, we take a bit of the audio
we are about to process, and look ahead through the upcoming samples to
find a similar sequence. (You can look up autocorrelation.) We then
proceed to clip out (sort of) one of the sequences, to speed up the
sound, or duplicate one, to slow it down.
How long the compared sequence is and how far ahead we look depend on
the speed up factor, and in this implementation things are bounded by the
rough assumption that our fundamental frequency is higher than 100 Hz. You
could do fancier stuff I'm sure. The best correlation is found by
searching for the offset at which we get the minimum of the sum of the
square of the deltas between corresponding samples. That's the costly
part. We actually have to skip a lot of those samples, and amazingly it
still works pretty well. There are possible variants for this step as
well of course, and you can tune quality vs processing time.
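
The search step described above can be sketched roughly as follows. This is only an illustration in C, not the code from the patch: the names `ssd` and `find_best_offset`, the 16-bit mono sample format, and the `stride` parameter (standing in for "skipping a lot of those samples") are all assumptions made for the example. It uses integer arithmetic only, in keeping with targets that have no floating point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of squared differences between two windows,
 * looking only at every `stride`-th sample to save CPU time. */
static uint64_t ssd(const int16_t *a, const int16_t *b,
                    size_t len, size_t stride)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i += stride) {
        int64_t d = (int64_t)a[i] - (int64_t)b[i];
        sum += (uint64_t)(d * d);
    }
    return sum;
}

/* Return the offset in [min_off, max_off] at which the window starting at
 * buf + offset is most similar to the window starting at buf.
 * The caller must make sure buf holds at least max_off + len samples.
 * On ties, the smallest offset wins. */
size_t find_best_offset(const int16_t *buf, size_t len,
                        size_t min_off, size_t max_off, size_t stride)
{
    assert(stride > 0 && min_off <= max_off);

    size_t best = min_off;
    uint64_t best_score = UINT64_MAX;

    for (size_t off = min_off; off <= max_off; off++) {
        uint64_t score = ssd(buf, buf + off, len, stride);
        if (score < best_score) {
            best_score = score;
            best = off;
        }
    }
    return best;
}
```

A larger `stride` makes each comparison cheaper at the cost of a rougher match, which is one way to trade quality for processing time.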
Now we don't actually clip parts out, or just repeat them, since that
would click like crazy, and of course our alignment and estimation of the
frequency are very rough. We do a kind of crossfading. When speeding up,
we take two consecutive sequences, and replace them with one that is the
result of mixing the two: giving most weight to the first sequence at the
beginning and shifting the weight to the second sequence as we go. So the
beginning matches the beginning of the first sequence and the end matches
the end of the second.
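
As a rough sketch of that crossfade, again in C and again not taken from the patch: the function name `crossfade`, the 16-bit samples, and the linear weight ramp are assumptions for illustration. The actual implementation may well use a different weighting curve.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend two equal-length sequences into one of the same length.
 * The weight shifts linearly (an assumption) from `first` to `second`,
 * so out[0] equals first[0] and out[len - 1] equals second[len - 1]. */
void crossfade(const int16_t *first, const int16_t *second,
               int16_t *out, size_t len)
{
    assert(len >= 2);

    int64_t total = (int64_t)(len - 1);
    for (size_t i = 0; i < len; i++) {
        int64_t w2 = (int64_t)i;   /* weight of the second sequence */
        int64_t w1 = total - w2;   /* weight of the first sequence */
        int64_t mixed = (w1 * first[i] + w2 * second[i]) / total;
        out[i] = (int16_t)mixed;
    }
}
```

When speeding up, two consecutive sequences of `len` samples each go in and a single sequence of `len` samples comes out, which is where the time saving comes from.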
Now I wrote this in 5 minutes, I'm no expert, and I'm only trying to
introduce the general idea here, so please be tolerant of this
explanation :-).
The part of the code that does what I just explained should be somewhat
understandable. However I admit some of the buffer management around it
is still somewhat obscure to me.
>p.s. Apologies for spelling your name wrong in CREDITS...
:-) No worries. And at least YOU had put in the accented E ;-) !
>Do the buffers have to be so large that they require removal from IRAM?
>EDIT: I know you said it expects that but why? It could keep it's own
>history buffer and interact with main dsp in smaller chunks, no?
So the autocorrelation phase needs a large chunk of input to operate
on. This isn't just history exactly: we're considering what chunk to clip
out and crossfade, and we can't start outputting it until we've decided.
The current implementation will work fine if you feed it tiny inputs: it
will buffer them internally. Empirically (not sure how to calculate
precisely) the minimum required buffer size is 3524 samples
(3524 * 4 bytes = 14096 bytes). Some memory copies are of course involved in maintaining
that buffer, so this isn't terribly efficient.
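
To put those figures in perspective, here is a small arithmetic sketch in C. The helper names are made up for the example, and the 44100 Hz sample rate is an assumption (the comment above does not state the rate at which 3524 samples was measured).

```c
#include <assert.h>
#include <stddef.h>

/* Memory needed for a buffer of `samples` samples. */
size_t buffer_bytes(size_t samples, size_t bytes_per_sample)
{
    return samples * bytes_per_sample;
}

/* How much audio a buffer of `samples` samples holds, in microseconds. */
unsigned long buffer_micros(size_t samples, unsigned long sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return (unsigned long)((unsigned long long)samples * 1000000ULL
                           / sample_rate_hz);
}
```

With the numbers quoted above, `buffer_bytes(3524, 4)` is 14096 bytes, and at an assumed 44100 Hz, `buffer_micros(3524, 44100)` comes to just under 80 ms of audio, which is in the same range as the latency of about 0.1s mentioned in the task description.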
A limitation of the current implementation is that when it has decided on
a correlation, it outputs an entire "frame" in one call. With some work,
it's conceivable the implementation could be made to split that operation
across successive calls. Alternatively the output could be buffered again
(with some more memory copies).
The current implementation has an output buffer of 4096 samples.
So we could feed it smaller chunks, but then the dsp/playback stack must
be ready for multiple iterations with 0 output. And we could possibly
have it emit smaller output, but then again you'd have several iterations
with 0 input and as much output as the rest of the pipeline can
handle. And the whole thing would probably be less efficient.
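
To make the "multiple iterations with 0 output" point concrete, here is a toy sketch in C of a stage that accumulates small input chunks and emits nothing until it has enough. Everything here is hypothetical: `struct stage`, `stage_feed`, the use of `int32_t` for a 4-byte sample, and the pass-through standing in for the real time scaling step. It is not how the patch or dsp.c is actually structured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIN_FILL 3524  /* minimum buffered samples, figure quoted above */

/* Hypothetical stage state, not the patch's actual structure. */
struct stage {
    int32_t buf[MIN_FILL];
    size_t fill;           /* samples currently buffered */
};

/* Offer up to n input samples to the stage.
 * *consumed reports how many were taken. The return value is the number
 * of samples written to out (which must hold MIN_FILL samples):
 * 0 while the stage is still accumulating, MIN_FILL once it is full. */
size_t stage_feed(struct stage *s, const int32_t *in, size_t n,
                  int32_t *out, size_t *consumed)
{
    assert(s && in && out && consumed);

    size_t room = MIN_FILL - s->fill;
    size_t take = n < room ? n : room;

    memcpy(s->buf + s->fill, in, take * sizeof *in);
    s->fill += take;
    *consumed = take;

    if (s->fill < MIN_FILL)
        return 0;          /* caller must cope with no output yet */

    /* Placeholder for the real time scaling step: plain pass-through. */
    memcpy(out, s->buf, MIN_FILL * sizeof *out);
    s->fill = 0;
    return MIN_FILL;
}
```

Fed 1024 samples at a time, this returns 0 for the first three calls and only produces output on the fourth, taking just 452 of the samples offered, so the caller has to handle both zero-output calls and partially consumed input.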
I'm not used to dealing with IRAM, so you'll have to educate me.
Also I'm not a DSP guru and I'm not particularly familiar with the
hardware either.
The autocorrelation is the costly step, as we look at the samples several
times. If someone were able to arrange this so we could do that step from
IRAM, I imagine there would be a huge gain. But I'm not sure how big a
buffer that would require, and how much IRAM we have left. In any case
this task is beyond me given my current availability.
Even if the required buffer is too large for IRAM in the general case, it
might perhaps be worth trying to make it work for cases where the audio
was downsampled to 22.05 kHz or lower, as presumably the buffer
requirements might be more manageable, and this may be the case often
enough for speech.
So assuming autocorrelation cannot be done from IRAM, because it requires
too much space or because no one figures out the code, it means that this
costly step is done from ordinary RAM. I thought that the other DSP
effects (at least those I use) probably had a relatively smaller cost
compared to this one. I had speculated that in that case, the benefits of
IRAM were perhaps not so interesting, since they might be offset by
multiple calls and memory copies around the time scaling code.
The one big TODO that I left hanging however is to make sure to use the
existing IRAM buffer whenever time scaling is not in effect.
I've been hoping for an independent pitch-shifting plugin for a while (I'm a musician). I hope this will be implemented!