Rockbox

  • Status Assigned   Reopened
  • Percent Complete
    0%
  • Task Type Bugs
  • Category Database
  • Assigned To
    sideral
  • Operating System All players
  • Severity Low
  • Priority Very Low
  • Reported Version Release 3.8.1
  • Due in Version Undecided
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Rockbox
Opened by tenfoot - 2011-04-19
Last edited by roolku - 2011-05-12

FS#12076 - Autoresume feature can start from wrong offset due to resurrection of old runtime stats

I've found a problem with the new autoresume feature in 3.8. I use it when listening to podcasts and sometimes find that it resumes from near the end of a new episode - e.g. on a 30 minute podcast it will resume from 28 or 29 minutes - usually about the place where a previous epidsode finished and I skipped past the end credits.

What I think is happening is that there is some code in the database that tries to resurrect the runtime statistics for files that have been deleted and then restored - as the last offset is in this data, it is reusing the last offset from an old episode.

I've put some more logf()s in tagcache.c in build_numeric_indices(). I can see that if the filenames don't match (line 2306) it then tries to match any 2 tags out of the 3: artist, album or title. For the podcasts, the artist and album are always the same (i.e. the producer and which one of their podcasts it is), but the title is different (i.e. "Episode 100", "Episode 101", etc). So if I've listened to Episode 100 and then deleted the file, and then add Episode 102 to the player, it matches the artist and album and so resurrects the wrong data from Episode 100.

I've disabled the resurrection based on tags (just commented it out) and the problem seems to have gone away. A better fix would be to ensure that all 3 tags or matched or to ensure that title is one of the matched tags

Thanks for your bug report and your thorough analysis!

(Looks like I've never noticed that DB resurrection was broken this way because I always “complete” my podcasts, either by listening to its end or by calling a custom context-menu function I've written, to make them appear in a “completed podcasts” list I've added to my tagnavi menus.)

I wonder what the use cases for resurrection might be and what motivated the existing behavior. So far I've encountered the following cases:

* User renames a file
* User removes and later reinserts SD card, files reappear under the same name
* User changes tags without renaming the files

The existing rules also tolerate files that reappear under a different name, with one (but not more) of their author/title/album tags changed. Why? One possible explanation would be tagging tools such as Picard that also move files into a specific directory hierarchy; but assuming that at most one tag is changed seems somewhat of a stretch, and wouldn't users typically run these tools before syncing the files to their DAP?

I've asked Slasheri, the original author of the database, to comment on this issue. He responded that resurrecting renamed and retagged files was meant to carefully carry the statistics along in this case, but, given this is causing trouble, that it would be safe to limit resurrection to either renamed or retagged files.

(He also said that the filename's CRC used for filename comparisons has a nonzero checksum-collision probability, and that at least some other tags would have to match as well to safely identify a nonrenamed, but retagged file. AFAICT, the database currently uses the track length for that, which always must match for resurrection to take place. Looks like all your podcasts have exactly the same length?)

I have attached an experimental patch that requires that all three tags (artist, album, and title) match if the filename has changed. Tenfoot, could you please test whether it works for you? Thanks!

This patch works for me. Yes, the affected podcasts are always the same length (30 minute radio programmes from the BBC).

I am not happy with this 'fix' as it removes existing functionality and I rely on the "fuzzy" matching when correcting meta data. Please consider a different solution to the problem.

roolku, 'we' are a bit 'late' to the 'party' aren't we? ;-)

The fuzzy matching is still present; I just removed an invalid heuristic. You can still correct the metadata and keep all your runtime stats as long as you do not rename the file *and* change the metadata at the same time.

Sideral wrote:

roolku, 'we' are a bit 'late' to the 'party' aren't we? ;-)

I don't understand what you are trying to say here? It is not a party it is a serious step backwards that was only committed yesterday?

Maybe you don't remember, but at the time of implementation of the feature this was an important issue (the previous database implementation used a CRC of the audio data as a unique identifier for tracks, so one could change meta data and name without problem, and the fuzzy matching was the way the new tagcache implemented this feature).

as long as you do not rename the file *and* change the metadata at the same time."

And that is exactly the problem I have, as typically the meta data is also present in the file name (e.g. track title). For non-ascii text one often finds that characters are substituted by ascii "equivalents" (e.g. oe for ö) and fixing these now is a real hassle.

For the original problem wouldn't a better solution be to discard resume information for "resurrected" files?

roolku writes:
> Sideral wrote:
» [roolku:]

I am not happy with this 'fix' […]
»

roolku, 'we' are a bit 'late' to the 'party' aren't we? ;-)


I don't understand what you are trying to say here? It is not a
> party it is a serious step backwards that was only committed
> yesterday?

Relax, it's no big deal: I was just trying to make fun of you chiming in only after the issue has been closed and then ridiculing the way I've addressed it with irony marks ('fix'). I apologize if you didn't like my punning attempt. I do take your concern seriously.

Maybe you don't remember, but at the time of implementation of the
> feature this was an important issue

That's why I was asking for use cases. I did my due diligence and even interviewed Slasheri about it.

as long as you do not rename the file *and* change the metadata at
» the same time.
>
> And that is exactly the problem I have, as typically the meta data
> is also present in the file name (e.g. track title). […]

I see. I take it it's indeed common to change the file name and tags at the same time. But is it common enough to change the filename and only one of the {album,artist,title} tags (and not the two others) to warrant its own prone-to-break heuristic? I have my doubts.

For the original problem wouldn't a better solution be to discard
> resume information for "resurrected" files?

Unfortunately that won't work. The resume info is just as important as the other runtime information, and I wouldn't want to lose it in the resurrection case, for example, just because I eject and later reinsert an SD card. Also, it's simply wrong to resurrect anything (runtime stats or resume info) in the case described in the original bug report.

Clearly we need to get better at uniquely identifying resurrected tracks.

the previous database implementation used a CRC of the audio data
> as a unique identifier for tracks, so one could change meta data and
> name without problem, and the fuzzy matching was the way the new
> tagcache implemented this feature.

Why was the audio-data matching removed? Was it perhaps a wee bit too expensive and maybe had robustness issues of its own (just guessing)? If there was really no reason, perhaps that behavior could be, er, resurrected.

Barring that, we could use more tags or file properties to make fuzzy matching more accurate, in addition to length, filename, title, album, and artist. Ideally, these tags and properties are sufficiently unique per file and stay the same across retagging and renaming.
Candidates:

* File size without headers as returned by metadata handlers — Is it unique enough for exact-length radio programs?
* Track number — Doesn't qualify if removed or first added
* Bit rate — Doesn't qualify if removed or first added. Also, probably not unique enough.

Perhaps requiring the file size to match for resurrection to take place, in addition to the track length, would be sufficient?

Relax, it's no big deal: I was just trying to make fun of you chiming in only after the issue has been closed and then ridiculing the way I've addressed it with irony marks ('fix'). I apologize if you didn't like my punning attempt. I do take your concern seriously.

No worries. :)

And that is exactly the problem I have, as typically the meta data
» is also present in the file name (e.g. track title). […]
>I see. I take it it's indeed common to change the file name and tags at the same time. But is it common enough to change the filename and only one of the {album,artist,title} tags (and not the two others) to warrant its own prone-to-break heuristic? I have my doubts.

It is not ideal and I have been caught out myself when I forgot the limitation and changed two meta data items at a time. But in the vast majority of my cases it has worked out fine (it's typically the track title that needs changing; band and album are usually correct.).

For the original problem wouldn't a better solution be to discard
» resume information for "resurrected" files?
>Unfortunately that won't work. The resume info is just as important as the other runtime information

Yes, I suppose it is. :(

Clearly we need to get better at uniquely identifying resurrected tracks.

That would be the best solution.

Why was the audio-data matching removed?

It wasn't removed as such. tagcache was a completely new/different implementation that never had this feature.

Was it perhaps a wee bit too expensive and maybe had robustness issues of its own (just guessing)?

I am fairly sure it was performance - tagcache works on target, whereas the old DB was created offline. If I remember correctly - there was discussion at the time to only use an initial block of the audio data, but I think it was felt that there could be hash collisions for silent beginnings.

Barring that, we could use more tags or file properties to make fuzzy matching more accurate, in addition to length, filename, title, album, and artist. Ideally, these tags and properties are sufficiently unique per file and stay the same across retagging and renaming.

I would have thought that length would be a good enough contribution to 'uniqueness' and I was very surprised to learn that this is not the case.

Candidates:
>
>* File size without headers as returned by metadata handlers — Is it unique enough for exact-length radio programs?

Not for constant bitrate files as it would be proportional to length.

* Track number — Doesn't qualify if removed or first added
>* Bit rate — Doesn't qualify if removed or first added. Also, probably not unique enough.

Agreed. Not unique enough - radio shows from the same source are likely encoded the same.

Perhaps requiring the file size to match for resurrection to take place, in addition to the track length, would be sufficient?

I fear not - see above.


BTW, it occurred to me that your fix might not even work if the audio files are untagged (or tagged with the same information, e.g. just the name of the show, not episode).

Any more thoughts on this? In my opinion it should be reverted, as it doesn't completely solve the initial problem and removes existing functionality.

roolku writes:
> [sideral:]

* File size without headers as returned by metadata handlers — Is it
» unique enough for exact-length radio programs?
>
> Not for constant bitrate files as it would be proportional to length.
>
» Perhaps requiring the file size to match for resurrection to take
» place, in addition to the track length, would be sufficient?
>
> I fear not - see above.

I think this is worth a try. I have looked at various fixed-length radio shows encoded to a constant rate that I have access to, and the file length for all of these shows is different. It looks like CBR encoding typically does *not* yield down-to-the-byte exact stream sizes.

I'll come up with a diagnostic patch to log the metadata-provided file size sans headers. Tenfoot, would you be willing to run a test on your radio shows (or to make some of the tracks available [privately if necessary]) to determine whether this helps telling the tracks apart?

BTW, it occurred to me that your fix might not even work if the audio
> files are untagged (or tagged with the same information, e.g. just the
> name of the show, not episode).

Yes, even with my fix there are still cases where a track can be mistaken for another deleted one. Nonetheless, the fix addresses a real issue, whereas asking the database to discern tracks that are tagged entirely identical, or even not at all, seems rather contrived to me, especially in the case of tag-based resurrection (where the unique file name is not available any longer). In other words, this is simply a problem that's not addressed by my fix; it was there to begin with.

For the untagged case, this would be easy to fix: don't count "<Untagged>" == "<Untagged>" as a match. The question is whether anyone cares: It's not like the runtime DB information is particularly useful for untagged tracks.

The same-tag case is fuzzy by design :-) and it could be fixed only by removing fuzzy matching altogether IMHO.

Any more thoughts on this? In my opinion it should be reverted, as it
> doesn't completely solve the initial problem and removes existing
> functionality.

I disagree with your assessment that this doesn't completely solve the initial problem. It actually does. But the present fix does not address two other, more hypothetical cases that you've brought up.

I fully agree in principle that limiting existing functionality is undesirable. But if a feature reduces the robustness of another feature, it has to be questioned. That said, I'd like to help restoring the more loose fuzzy matching that we had previously if it can be made more robust.

Let's not rush this unnecessarily, please. The next release is still about a month out.

I did some investigation of what it would take to use the file size sans metadata as a mandatory match for resurrection.

I took a look at how various metadata parsers implement id3→filesize, which is defined as “file size without headers in bytes.” It looks like only the MP3 parser makes at least a half-hearted attempt to remove the variable-sized metadata header from this size, making this a metadata-independent value for tracks tagged only with ID3v2; however, adding or removing a trailing ID3v1 currently breaks the scheme. Worse, it looks like all other codecs simply use the file size for this value, so any metadata change resulting in a file-size change would be going to break resurrection. Fixing this shouldn't be too difficult, though.

Also, the file size would have to be saved in the on-disk database. That would mean another DB column has to added to the master index. (The extra space could be offset in various ways. The most obvious ones would be to fold CD number and track number into a single 32-bit integer, or similarly compressing the year and the rating into a single column.)

To verify whether the approach of using the file size works at all for the original problem, given this situation, doing a diagnostic patch for MP3 files tagged with only an ID3v2 tag would be simple to do. Tenfoot, how are your podcast files encoded and tagged?

I've looked at id3→filesize for some of my same-length radio recordings encoded in CBR MP3 format, to verify whether this size would be sufficiently unique to help discern similar, but nonidentical tracks. Recall that for MP3, id3→filesize already has the length of the header stripped off. I made sure my MP3 files had no trailing ID3v1 tag so as to not distort the results.

As I suspected, the file size can differ slightly despite CBR encoding. Unfortunately, there were duplicate file sizes nonetheless: In my sample of 13 same-length tracks, there were three clusters containing 3, 3, and 2 tracks, respectively, with the same id3→filesize.

In summary, it looks like roolku was right: The file size is not a sufficient fingerprint to tell similar tracks apart.

Another approach I'd like to check is to compute a CRC of the first 512 non-header bytes of the files. I'm somewhat optimistic that these bytes can help fingerprinting the file even if they contain only silence, because silence tends to be nonstatic/nonzero for most digital audio processing.

In the meantime, I've verified with tenfoot that his podcast tracks are in MP3 format as well. Tenfoot agreed to make some of his tracks available for testing.

I'm not quite sure how to handle this issue for the next release, which I expect to happen before it is resolved. I'm inclined to leave fuzzy matching for renamed tracks disabled, given that we don't have a robust way of telling similar tracks apart, yet. But I agree that it may be viewed as a regression. We should probably bring this up on the developers mailing list unless there's a quick and unexpected fix soon.

I now have two tracks from tenfoot available for testing – Thanks tenfoot!

As suspected, the id3→filesize method fails to tell them apart. On to the CRC ID method…

Thank you for not giving up with this.

I've experimented with IDing tracks by computing a CRC of the encoded track data after the metadata header, as indicated by id3→first_frame_offset (set by the metadata parser). The results were… interesting. :-)

Depending on how much I seek behind id3→first_frame_offset, I still get some dupes (i.e., CRC failing to tell tracks apart). When not seeking at all, many podcasts yield dupes, presumably because they start with an audio intro that is similarly encoded. Seeking 10 sectors after the header, I still get some duplicates; and even at an offset of 20 sectors, there are still some left. In most of these tracks, there's some strange padding filled with 0x55 bytes, sometimes interspersed with a "LAME3.98." strings and other junk; not sure what to make of this. In others, the data looks like encoded audio, but is actually identical; it's plausible in this case because these are a few episodes of a podcast with a very long intro.

In all of these cases, seeking even further behind the header (i tried 1 MB) eliminates all duplicates. But I find long seeks a bit worrisome because it will likely make database creation noticeably slower on hard-disk-based DAPs. I haven't tested that yet, though.

There are also other issues with non-MP3 tracks:

* The M4A parser doesn't report id3→first_frame_offset yet; when using a short seek, the CRC yields duplicates of data that looks like a header or padding. This likely won't be an issue once id3→first_frame_offset is set correctly (which is needed for this approach anyway).

* Chip music tracks often aren't even 10 sectors long, let alone 1 MB (tested with an Atari SAP collection); but I'd guess that doesn't really matter for the use case we're discussing (telling apart tracks with very similar metadata).

In summary, I think the CRC approach can be made to work, but if the seek times turn out to be a problem, we should perhaps make fuzzy resurrection depend on a configuration setting (which it arguably should have been all along.)

Loading...

Available keyboard shortcuts

Tasklist

Task Details

Task Editing