FS#11891 - Add mp3 gap skipping support to improve gapless playback
|
DetailsAs we know, rockbox will use the information provided by "good" mp3 encoders (e.g. LAME) to ignore the extra padding that occurs at the beginning and end of each mp3 file, so that true gapless playback can occur. Unfortunately we do not always have the luxury of encoding our mp3 files in this way but we might still want true gapless playback. This patch compensates for the use of "dumb" mp3 encoders that do not record the necessary information for this feature.
In cases where it does not detect the proper header information to trim off the extra padding, it will optionally skip up to a full frame of silence (or near silence) at the beginning and end of each track (if the trimming is already specified, this patch does not do any additional trimming). This can potentially eliminate up to a frame of "real" silence at the beginning or end of a track, but in terms of audibility, that would be completely unnoticeable (and it would only happen on these badly-encoded tracks), while the extra silence is an audible annoyance. I have added an option to the Playback Settings menu so that the user can disable it if desired. |
This task depends upon
For mp3 this should result in something like
#define GAP_SKIP_THRESH_NATIVE 60 /* samples smaller +/-GAP_SKIP_THRESH_NATIVE in native 16 bit format are skipped */
#define GAP_SKIP_THRESH (GAP_SKIP_THRESH_NATIVE*(1<<(MAD_F_FRACBITS-NATIVE_DEPTH+1))) /* threshold when in fixed point fract format */
However, if my understanding (this being codec-specific) is correct, I don't understand why it's necessary to skip silence at the beginning of a track. If two tracks run together, surely the second one will not start with unintentional silence?
In response to sideral - you are exactly correct and I am in agreement that it was in no way meant to be a general-purpose gap killer.
I was surprised when I started this patch to find that in addition to the last-frame padding, mp3 encoders always have to have some empty space at the beginning of the track too which is an artifact of the encoding process (http://lame.sourceforge.net/tech-FAQ.txt). The LAME encoder of course puts the size of this padding in its header and rockbox already skips it.
Andree, thanks for the comments on GAP_SKIP_THRESH - I arbitrarily chose a number that worked well in my test tracks. It is undoubtedly higher than it needs to be but it gave me usable results. How does your calculation relate to what 'junk' may be in the resulting pcm data after the mp3 frame is decoded?
http://yirkha.foobar2000.org/fimg/mp3-gapless.png
What you see is that the first frame's delay is constant and depends on the encoder delay (which is fixed for each layer and a consequence of the analysis filter design) and the decoder delay (which is fixed as well due to synthesis filter design). The only variable delay is the padding in the last frame.
In mpa.c the decoder latency is 529 for layer3, the encoder latency should be 576 for a layer3 encoder. For layer2 I need to check, but I remember the overall latency (enc + dec) was 481 -- but this needs verification.
The padding in the last frame cannot be derived from the filter designs of a MPEG layer as the padding depends on the amount of samples of the original source file.
Removing samples via a fixed threshold decision is hard to configure. Unwanted signal parts can be of high amplitude when the source signal has high amplitude. On the other hand you might remove parts of the source signal when it is of low energy. E.g. using your number from the patch you will cut samples if < -35 dBFS -- which is still a significant level.
I think of my approach of removing the samples via a high fixed threshhold as a compromise to try to deal with the case where we can't really know for sure what is really correct and just have to do what will not sound bad. The ideal condition is when we know the encoder delay and use it, which was already being done. In cases where it is not known, not accounting for the encoder delay at all leads to an audible gap. The other extreme would be to skip the entire first frame when the encoder delay is not known, and in reality at most this would lose only a few samples in most cases since it seems very few encoders allow for small latency -- and this would be completely unnoticeable. With some encoders it is even possible to have more than a full frame of total latency that the beginning, which still may result in an audible gap with this patch.
See the logs around this time:
http://www.rockbox.org/irc/log-20110211#21:22:57
Your patch and approach is not removing inaudible information, but tries to make the transition between two songs continuous to avoid audible glitches. It does this based on very raw assumptions (= drop signals until a minimum amplutide has been reached). I understand that this approach seems to work for quite some signals, but I admit that I am not a fan of such approach. I will try to explain why:
Technically the used FIR filters will lead to a transient oscillation that adds waveform before the desired signal itself occurs. This oscillation has a growing character and its amplitude depends on the desired signal and its amplitude. As you do not know what the signal and its amplitude is your approach may drop too many samples (e.g. when the desired signal is of low amplitude). This might also results in audible clicks as the signal becomes discontinous.
Furthermore your approach does not fade-in the signal after you estimated where the desired signal begins. This way you might just start with a positive non-zero sample that might very likely result in a discontinious waveform (=clicking) as well.
Even if the exact samples at which the actual audio starts and ends are known (as is the case when the LAME tags are present), a lossy encoder such as MP3 has only limited control over the actual sample value produced after decoding. The encoder typically does not ensure that there is no discontinuity (click) between one track's last sample and the next track's first sample, right?
As far as I know, Rockbox does not attempt to smooth out these unavoidable discontinuities for any lossy format even when the track boundaries are perfectly known, so asking the MP3 decoder to do it when these boundaries are not known is a bit much. Even if Rockbox were to do this kind of smoothing, it arguably should be done in a codec-independent fashion (in the PCM driver perhaps?).
I think that detecting silence based on an absolute sample value is fine for a first version of this feature. If needed, it can be refined later on.
Another point of contention has been whether this should be configurable or always on. Like dryrock, I respect people who never want to lose a single sample of audio, which is why I support this being configurable.
Recall that this patch applies only to MP3 tracks that weren't well encoded, and improves playback of these tracks for many users. Nothing is lost by adding the feature as is.
One additional idea I remember from the IRC discussion is that the gap killer should never remove an entire audio frame. If the last sample of the first frame (or the first sample of the last frame) is classified as “silince”, it could be argued that the silence is indeed intentional and shouldn't be shortened.
Assuming 16 bit source audio, and that our threshold is the 17th bit, can this still happen? I guess it can if the FIR filter attenuates the first non-zero sample, but how many samples are likely to be lost?
Again: Gapless information allows to cut the exact amount of oscillating samples before (and after) the signal. Therefor this also avoids discontinuous waveforms. This patch does not do this. This patch blindly removes waveform parts below a certain amplitde/energy threshold. In effect the resulting waveform when playing two consecutive files may be not be continuous which leads to clicking.
This patch does not make gapless playback happen. It cuts sample parts of a fix defined low amplitude in the hope to shorten the gap and make it _less audible_. All of this at the cost of possible false positive detection (removing too much of the signal) and discontinuity (could be fixed). It is not exact, it cannot be.
There is opposition (some significant, some lukewarm) against including this feature. Here are the reasons given, with my responses to them; please chime in if you have further concerns or responses:
* It does not solve the problem precisely, but rather exchanges one kind of decoding artifact (audible gap/click) for another (possible click, possibly too much audio cut out). — My view is that this is a tradeoff many users are willing to make as the introduced artifact is much less severe than the artifact that this patch removes.
* The possible introduced discontinuity (click) could be removed by precisely cross-fading the tracks into each other at the computed boundary. — Discontinuities could be intended, as gapless playback does not imply that the waveform is continuous. Also, discontinuities are unavoidable with any lossy codec even if the track boundaries are known, and removing them would be a separate feature from gap elision.
* The implementation is codec-specific despite other codecs having the same problem. — Correct. There could be a generic function to remove samples based on temporal limits and an amplitude threshold, which is called by the codec-specific part. (Implementing this feature outside the codec (in the pcm driver) would be quite some headache because the track and frame boundaries currently are known only inside the codec; refactoring this probably is not worth the hassle.)
* Users can add the LAME tags themselves. — Asking users to add precise markers to an audio file for which there is no way to decide where these markers belong seems to be the wrong approach, and is also impractical.
* This is a kludge. Users should reencode the tracks properly, or ask their source to reencode them, or ask for a refund. — I find these suggestions impractical. Yes, the feature is for pure user convenience and satisfaction without asking anyone to jump through any additional hoops.
* This is irrelevant because misencoded MP3 tracks are rare. — My own private statistics contradict this statement. I've managed to purchase misencoded MP3 tracks from Amazon.
I was able to establish consensus (I think) that this feature must be configurable if it ever goes in. The name of the configuration option should sound generic to allow it to enable the same feature for other codecs later on. Rather than being a simple toggle, this could allow more fine-grain selection of the cut-off threshold.
* Users can add the LAME tags themselves. — Asking users to add precise markers to an audio file for which there is no way to decide where these markers belong seems to be the wrong approach, and is also impractical.
If imprecise automated gap removal is "good enough", why isn't imprecise LAME tags? Is there some reason LAME tags cannot be at least as good as this patch?
-> The problem is that files with such imprecise LAME tags would only *pretend* to have the exact gapless information. From now on nobody can decide whether the gapless information is correct or not... For the listening experience the result is the same, but it devalues the LAME tag usage for a collection of files in general.
I don't get why it's fine for the player to silently add this data during playback, but it's *not* fine for the user to manually and intentionally add this data. Is the idea that Joe User would add this tag while somehow not realizing he's doing it?
But the present feature is for users who prefer to not touch up their files with potentially incorrect tags. I agree with Buschel that adding the tags gives a false impression of perfect gap information. It basically casts a user preference into the files, so that sharing these files with someone like gevaerts who'd rather not have these gaps removed becomes impossible.
Also, I'd like to avoid to have to go through an extra manual step for something that Rockbox can do automatically at runtime.
With the tags, if you find the gap skipping algorithm gets it wrong, you can remove or adjust the tag. If the gap skipping is wrong in Rockbox, you can't adjust it on a per-file basis.
If this algorithm is good enough why is it *not* good enough for a per-file solution, but at the same time, good enough for an "all files" solution?
In my opinion, Rockbox should strive to enhance the user experience. This does just that. It's not a perfect solution, but those users who want the perfect solution can just as well edit their tracks or reencode them. This feature is for the other 99% of the user base.
My use case - I occaionally find a badly-encoded album but inevitably I'm miles from a PC and have to make a note or try and remember; a plugin that could re-parse a given directory/playlist/file would be most useful. (In this case I'm more specifically talking about a gapless album, if that helps.)