• Status Unconfirmed
  • Percent Complete
  • Task Type Patches
  • Category Codecs
  • Assigned To No-one
  • Operating System All players
  • Severity Low
  • Priority Very Low
  • Reported Version Release 3.7.1
  • Due in Version Undecided
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Rockbox
Opened by dryrock - 2011-01-18

FS#11891 - Add mp3 gap skipping support to improve gapless playback

As we know, rockbox will use the information provided by "good" mp3 encoders (e.g. LAME) to ignore the extra padding that occurs at the beginning and end of each mp3 file, so that true gapless playback can occur. Unfortunately we do not always have the luxury of encoding our mp3 files in this way but we might still want true gapless playback. This patch compensates for the use of "dumb" mp3 encoders that do not record the necessary information for this feature.

In cases where it does not detect the proper header information to trim off the extra padding, it will optionally skip up to a full frame of silence (or near silence) at the beginning and end of each track (if the trimming is already specified, this patch does not do any additional trimming).

This can potentially eliminate up to a frame of "real" silence at the beginning or end of a track, but in terms of audibility, that would be completely unnoticeable (and it would only happen on these badly-encoded tracks), while the extra silence is an audible annoyance.

I have added an option to the Playback Settings menu so that the user can disable it if desired.

Shouldn't this be done in such a way as to affect all tracks rather than being MP3-exclusive if it's to be considered for commit?

Well, my thinking on that is that some formats (ogg, wav, flac) already work properly and there is not a general difficulty with gapless playback with them. Other formats (wma?) make no promise of gapless playback at all, do they? Fixing this for mp3 makes sense because of the inherent issues in encoding, but I'm not sure about the other formats.

You're skipping silence *and* near silence. You're skipping "gaps" even in non-gapless tracks, which is something that could work (and be used) for any of those formats.

Project Manager

Sure, it might be better to do this on the generic PCM level, but then we probably lose the context. This is something we only want to do in the first frame (or at least in the first X milliseconds of a track. Does the PCM driver have the track boundary information?

A more generic solution to set GAP_SKIP_THRESH would be using the same value as used for setting DSP_SET_SAMPLE_DEPTH – see ci→configure(DSP_SET_SAMPLE_DEPTH, x). The sample depth is different – at least for different codecs. The samples are shifted by "»(x-16+1)" before sending them to the DAC.

For mp3 this should result in something like

#define GAP_SKIP_THRESH_NATIVE 60 /* samples smaller +/-GAP_SKIP_THRESH_NATIVE in native 16 bit format are skipped */
#define GAP_SKIP_THRESH (GAP_SKIP_THRESH_NATIVE*(1«(MAD_F_FRACBITS-NATIVE_DEPTH+1))) /* threshold when in fixed point fract format */

As this patch addresses a specific deficiency of the MP3 format (track length must be an integral multiple of the frame length) in an MP3-specific way (drop silence only in frames to which the deficiency applies), I'd welcome an MP3-specific solution. This patch does not seem to aim to be a general gap killer. If that's what other people want, they should provide it as a separate, codec-independent feature.

However, if my understanding (this being codec-specific) is correct, I don't understand why it's necessary to skip silence at the beginning of a track. If two tracks run together, surely the second one will not start with unintentional silence?

Oops, managed to repeat my previous comment.

In response to sideral - you are exactly correct and I am in agreement that it was in no way meant to be a general-purpose gap killer.

I was surprised when I started this patch to find that in addition to the last-frame padding, mp3 encoders always have to have some empty space at the beginning of the track too which is an artifact of the encoding process ( The LAME encoder of course puts the size of this padding in its header and rockbox already skips it.

Andree, thanks for the comments on GAP_SKIP_THRESH - I arbitrarily chose a number that worked well in my test tracks. It is undoubtedly higher than it needs to be but it gave me usable results. How does your calculation relate to what 'junk' may be in the resulting pcm data after the mp3 frame is decoded?

Andree's calculation gives me a result of 491520, which appears to skip few too-few samples in the first frame. In my test file it skips about half the frame, where it needs to skip approximately the full frame to avoid an audible gap.

I just thought through this approach a bit. Take a look at the following graph that was created for a discussion of lame's mp3 gapless function.

What you see is that the first frame's delay is constant and depends on the encoder delay (which is fixed for each layer and a consequence of the analysis filter design) and the decoder delay (which is fixed as well due to synthesis filter design). The only variable delay is the padding in the last frame.

In mpa.c the decoder latency is 529 for layer3, the encoder latency should be 576 for a layer3 encoder. For layer2 I need to check, but I remember the overall latency (enc + dec) was 481 – but this needs verification.

The padding in the last frame cannot be derived from the filter designs of a MPEG layer as the padding depends on the amount of samples of the original source file.

Removing samples via a fixed threshold decision is hard to configure. Unwanted signal parts can be of high amplitude when the source signal has high amplitude. On the other hand you might remove parts of the source signal when it is of low energy. E.g. using your number from the patch you will cut samples if < -35 dBFS – which is still a significant level.

Yes, the first-frame decoder latency of 529 is already taken into account and skipped. The problem with the encoder latency appears to be that it can be different between encoders. For LAME the default is 576, but the LAME documentation lists some other examples that are different.

I think of my approach of removing the samples via a high fixed threshhold as a compromise to try to deal with the case where we can't really know for sure what is really correct and just have to do what will not sound bad. The ideal condition is when we know the encoder delay and use it, which was already being done. In cases where it is not known, not accounting for the encoder delay at all leads to an audible gap. The other extreme would be to skip the entire first frame when the encoder delay is not known, and in reality at most this would lose only a few samples in most cases since it seems very few encoders allow for small latency – and this would be completely unnoticeable. With some encoders it is even possible to have more than a full frame of total latency that the beginning, which still may result in an audible gap with this patch.

Just a note to say this patch was discussed briefly on IRC, and one conclusion was that there is no need for a setting:

See the logs around this time:

fg commented on 2011-02-11 20:51

I'm very strongly opposed to any feature that can unconditionally drop actual audio.

I think I would like this. Some of my supposedly gapless albums have such an annoying micro gap.

To take a clear position: I also do not like to possibly drop (parts of) audio signals based on amplitude limits.

Frank & Andree - I understand purism, which is why this is an option, and why it does not in any way affect well-encoded tracks. But the mere act of encoding to mp3 discards actual audio. The idea is to discard inaudible stuff, which I still argue is what is happening here. As it is now, rockbox in some situations actually ADDS audible audio, and I would argue that is the bigger problem.

I totally agree that encoding lossless is destryoing information :)
Your patch and approach is not removing inaudible information, but tries to make the transition between two songs continuous to avoid audible glitches. It does this based on very raw assumptions (= drop signals until a minimum amplutide has been reached). I understand that this approach seems to work for quite some signals, but I admit that I am not a fan of such approach. I will try to explain why:
Technically the used FIR filters will lead to a transient oscillation that adds waveform before the desired signal itself occurs. This oscillation has a growing character and its amplitude depends on the desired signal and its amplitude. As you do not know what the signal and its amplitude is your approach may drop too many samples (e.g. when the desired signal is of low amplitude). This might also results in audible clicks as the signal becomes discontinous.
Furthermore your approach does not fade-in the signal after you estimated where the desired signal begins. This way you might just start with a positive non-zero sample that might very likely result in a discontinious waveform (=clicking) as well.


Even if the exact samples at which the actual audio starts and ends are known (as is the case when the LAME tags are present), a lossy encoder such as MP3 has only limited control over the actual sample value produced after decoding. The encoder typically does not ensure that there is no discontinuity (click) between one track's last sample and the next track's first sample, right?

As far as I know, Rockbox does not attempt to smooth out these unavoidable discontinuities for any lossy format even when the track boundaries are perfectly known, so asking the MP3 decoder to do it when these boundaries are not known is a bit much. Even if Rockbox were to do this kind of smoothing, it arguably should be done in a codec-independent fashion (in the PCM driver perhaps?).

I think that detecting silence based on an absolute sample value is fine for a first version of this feature. If needed, it can be refined later on.

Another point of contention has been whether this should be configurable or always on. Like dryrock, I respect people who never want to lose a single sample of audio, which is why I support this being configurable.

Recall that this patch applies only to MP3 tracks that weren't well encoded, and improves playback of these tracks for many users. Nothing is lost by adding the feature as is.

One additional idea I remember from the IRC discussion is that the gap killer should never remove an entire audio frame. If the last sample of the first frame (or the first sample of the last frame) is classified as “silince”, it could be argued that the silence is indeed intentional and shouldn't be shortened.

Technically the used FIR filters will lead to a transient oscillation that adds waveform before the desired signal itself occurs. This oscillation has a growing character and its amplitude depends on the desired signal and its amplitude. As you do not know what the signal and its amplitude is your approach may drop too many samples (e.g. when the desired signal is of low amplitude).

Assuming 16 bit source audio, and that our threshold is the 17th bit, can this still happen? I guess it can if the FIR filter attenuates the first non-zero sample, but how many samples are likely to be lost?

Let's make an easy assumption and only think of the filterbank effects, and not taking into account the mdct synthesis. In this case you might just check the D[]-table from the synthesis filter and look at it as the symmetric (it is FIR) impulse response – if having the correct order of the coefficients. The maximum in the middle of the impulse response will equal the signal pulse, all the stuff on the left side (=255 samples) is the transient oscillation before the signal. If the signal itself has an amplitude of 2^15 _all_ oscillations are >17th bit. In reality the signal is not a pulse and not full amplitude, in effect the oscillaton will have lower amplitude. It heavily depends on the signal.

Again: Gapless information allows to cut the exact amount of oscillating samples before (and after) the signal. Therefor this also avoids discontinuous waveforms. This patch does not do this. This patch blindly removes waveform parts below a certain amplitde/energy threshold. In effect the resulting waveform when playing two consecutive files may be not be continuous which leads to clicking.

This patch does not make gapless playback happen. It cuts sample parts of a fix defined low amplitude in the hope to shorten the gap and make it _less audible_. All of this at the cost of possible false positive detection (removing too much of the signal) and discontinuity (could be fixed). It is not exact, it cannot be.

I'd love to see this feature committed, so I brought it up on the IRC channel to see what kind of objections it would face. Here's my summary of the discussion status a few IRC conversations later:

There is opposition (some significant, some lukewarm) against including this feature. Here are the reasons given, with my responses to them; please chime in if you have further concerns or responses:

* It does not solve the problem precisely, but rather exchanges one kind of decoding artifact (audible gap/click) for another (possible click, possibly too much audio cut out). — My view is that this is a tradeoff many users are willing to make as the introduced artifact is much less severe than the artifact that this patch removes.

* The possible introduced discontinuity (click) could be removed by precisely cross-fading the tracks into each other at the computed boundary. — Discontinuities could be intended, as gapless playback does not imply that the waveform is continuous. Also, discontinuities are unavoidable with any lossy codec even if the track boundaries are known, and removing them would be a separate feature from gap elision.

* The implementation is codec-specific despite other codecs having the same problem. — Correct. There could be a generic function to remove samples based on temporal limits and an amplitude threshold, which is called by the codec-specific part. (Implementing this feature outside the codec (in the pcm driver) would be quite some headache because the track and frame boundaries currently are known only inside the codec; refactoring this probably is not worth the hassle.)

* Users can add the LAME tags themselves. — Asking users to add precise markers to an audio file for which there is no way to decide where these markers belong seems to be the wrong approach, and is also impractical.

* This is a kludge. Users should reencode the tracks properly, or ask their source to reencode them, or ask for a refund. — I find these suggestions impractical. Yes, the feature is for pure user convenience and satisfaction without asking anyone to jump through any additional hoops.

* This is irrelevant because misencoded MP3 tracks are rare. — My own private statistics contradict this statement. I've managed to purchase misencoded MP3 tracks from Amazon.

I was able to establish consensus (I think) that this feature must be configurable if it ever goes in. The name of the configuration option should sound generic to allow it to enable the same feature for other codecs later on. Rather than being a simple toggle, this could allow more fine-grain selection of the cut-off threshold.

I don't understand this one:

* Users can add the LAME tags themselves. — Asking users to add precise markers to an audio file for which there is no way to decide where these markers belong seems to be the wrong approach, and is also impractical.

If imprecise automated gap removal is "good enough", why isn't imprecise LAME tags? Is there some reason LAME tags cannot be at least as good as this patch?

If imprecise automated gap removal is "good enough", why isn't imprecise LAME tags? Is there some reason LAME tags cannot be at least as good as this patch?

→ The problem is that files with such imprecise LAME tags would only *pretend* to have the exact gapless information. From now on nobody can decide whether the gapless information is correct or not… For the listening experience the result is the same, but it devalues the LAME tag usage for a collection of files in general.

Since people are manually adding this tag to their file, isn't it something they *want*? If they would use our feature, wouldn't it be better for them to use the tag so that their files are 'fixed' for more than just a single player?

I don't get why it's fine for the player to silently add this data during playback, but it's *not* fine for the user to manually and intentionally add this data. Is the idea that Joe User would add this tag while somehow not realizing he's doing it?

This has probably already been suggested in IRC or on the NG, but might a better solution not be a viewer plugin to add (or optionally adjust) the LAME tags? i.e. Open a directory/playlist with a plugin and it'll reparse the track transitions based on configured thresholds.

I can understand the sentiment to fix the problematic files once and for all, for all players, and agree that some users may prefer to do this despite ending up with imprecise LAME tags. I'm not stopping them: They are free to do this.

But the present feature is for users who prefer to not touch up their files with potentially incorrect tags. I agree with Buschel that adding the tags gives a false impression of perfect gap information. It basically casts a user preference into the files, so that sharing these files with someone like gevaerts who'd rather not have these gaps removed becomes impossible.

Also, I'd like to avoid to have to go through an extra manual step for something that Rockbox can do automatically at runtime.

See, I still don't understand. "potentially incorrect tags" is bad, but "potentially incorrect always" is good?

With the tags, if you find the gap skipping algorithm gets it wrong, you can remove or adjust the tag. If the gap skipping is wrong in Rockbox, you can't adjust it on a per-file basis.

If this algorithm is good enough why is it *not* good enough for a per-file solution, but at the same time, good enough for an "all files" solution?

Project Manager

This is a simple solution for users that aren't able or willing to add fake LAME tags to their files. It is a convenience feature.

In my opinion, Rockbox should strive to enhance the user experience. This does just that. It's not a perfect solution, but those users who want the perfect solution can just as well edit their tracks or reencode them. This feature is for the other 99% of the user base.

I'm not saying don't offer it as an automatic (albeit disabled by default) solution, more to implement it in a way so it could be driven by a plugin that can update the tags.

My use case - I occaionally find a badly-encoded album but inevitably I'm miles from a PC and have to make a note or try and remember; a plugin that could re-parse a given directory/playlist/file would be most useful. (In this case I'm more specifically talking about a gapless album, if that helps.)

This patch is broken against the current svn.


Available keyboard shortcuts


Task Details

Task Editing