Rockbox

  • Status Closed
  • Percent Complete
    100%
  • Task Type Bugs
  • Category Music playback
  • Assigned To No-one
  • Operating System All players
  • Severity High
  • Priority Low
  • Reported Version Release 3.7.1
  • Due in Version Undecided
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Rockbox
Opened by Buschel - 2011-01-06
Last edited by Buschel - 2011-04-06

FS#11863 - Playback / buffer freezes

Extracted from  FS#11830 . Since the issues described below also occur when  FS#11830  is not applied this bug report is generated.

On iPod nano 1G and iPod Video’s the following were issues observed:
1) Device does not fully start (no menu visible, background image loaded)
2) Playback stopps or does not start (WPS playtime is 0:00)
3) “data abort” on white screen

Except (1) and (3) the iPod’s UI responds. In situation (2) loading mpegplayer or shut down via longpress Play/Pause is not possible. In general plugins (tested with test_fps and test_mem) can be loaded and performed normally.

On iPod Video this happened with builds which based on r28915, r28940 and r28960. So, before the ATA but after some kernel/thread/priority changes.

I could provoke this failure twice when generating the DB during playback. But these failure also happened randomly during standard playback (mp3 and mpc in my case).

As long as we do not know what is happening I will assume this is not specific for PP502x processors.

Closed by  Buschel
2011-04-06 19:59
Reason for closing:  Fixed
Additional comments about closing:   Warning: Undefined array key "typography" in /home/rockbox/flyspray/plugins/dokuwiki/inc/parserutils.php on line 371 Warning: Undefined array key "camelcase" in /home/rockbox/flyspray/plugins/dokuwiki/inc/parserutils.php on line 407

All relevant issues have been fixed (also see 'related tasks') in the past weeks.

I have started tests with 3 test builds for iPod Video:
A) r28992
B) r28992 w/o r28877
C) r28992 w/o r28877 r28928 r28923 r28918 r28909 r28901

Test case is playing a large playlist from my DB, starting with the same track for all builds.

A) locked up after about 30 minutes (during 1st or 2nd rebuffering after playback started)

B) ran ~17h without any problems

I will focus on a test of A) now to get more confidence regarding influence of r28877.

MikeS commented on 2011-01-09 09:02

I'm not sure what to make of what happened here with H10. As you know, when I tested at first, it ran fine for hours. Today, it took about 30 min before I needed the paperclip. The difference was that the first time it was plugged-in and the second I was running from the battery only (almost fully charged).

I had a freeze after reboot from emergency USB connection (those aren't uncommon though). Whether it was that is questionable since it played a little bit before freezing and those tend to happen almost immediate upon booting. The freeze described in the previous paragraph was the firmware load after paperclipping from post-emergency USB freeze.

See if the attached patch reveals anything since if priority changes more often, any circular waits might have a greater occurance. But, I don't see how plugging would affect that.

FYI: I've got the H10 plugged again and it seems alright so far.

Short update: Yesterday I had no issues with A) build for ~12h.

This is getting weird… Hopefully we will find a way to reproduce this. There definately _is_ something going wrong….

MikeS commented on 2011-01-09 17:45

Thought I might have been onto something here with the external power. H10 ran fairly long on the battery afterword before I had to shut it off for the night.

cpu98 commented on 2011-01-10 11:16

I had this problem long long time ago.
Most noticeable problem was ffw not working: Disk spinning forever, no responding UI, no sound, buffering section in debug menu showing negative value. Only savior was menu+center. A suspicious buffering related thing.
Don't remeber from what revision it happened. But after few hundreds of revision it gets mysteriously fixed!
And guess what? That happens again with some update and gets fixed again with some update after that. * 5 or more
That ruthless fucker which whispered 0dBFS screaming direct to my ear while I was alone on a bed at dark night might not be little piece of buggy codes but gcc and linker bugs that screw up some 'chosen' revision.
BTW I use iPod Video 80G and enjoyed r28831. I haven't encountered any problems with that.

Next idea to be thrown in: Could this be caused by r28652 (3.7 branch) or r28608 (trunk)? What if "tagcache_is_usable()" does not become true? From what I understand tagtree_buffer_event() would then block. This sounds a bit like what we experienced. This also explains why I got the "DB not initialized" screen when I tried to enter the DB in this situation. Even though I have played from a playlist created from the DB.

Edit: Also interesting → http://www.rockbox.org/irc/log-20110113#22:56:55 "the problem" is a strange playback issues as well.

See  FS#11865  for an issue cause by r28652 / r28608 which seems to be related (and is the one discussed in the IRC conversation mentioned).

MikeS commented on 2011-01-16 10:50

I did find something with H10. I suddenly got some data aborts during directory cache initialization. I thought it was my change to the processor cache init at first but it turns out not so. Somtimes, directory cache acts up and appear to init and yet not be initialized, with the setting reverting to "No" despite my having set it to "Yes". Sometimes, it will data abort during init every time (at 58B44 or 58A2C which is dircache code in an H10 build). Other times, it will init normally. I could cause different behavior by varying the padding in an unrelated file (FYI: system-pp502x.c just happened to be the one because I was looking into my changes).

AFAICT, nothing like this appears to be happening on my SD devices.

There is a bug for sure that appears alignment-related.

cpu98 commented on 2011-01-16 23:39

I got the problem MikeS described above while using r29012.
It behaves as if dircache is disabled although setting says it is enabled.
I transferred 100 songs and unplgged and got a data abort (at 5xxxx).

It seems flac cuesheet (or skipping track) doesn't work too. (Not sure it is caused by this)
(Load single flac → 'Browse cuesheet' to change track → Playback stops as soon as I change track (w or w/o full scale noise), and UI freezes.)

The attached patch by peadar was posted in IRC. It was reported to fix reproducible crashes on clip+ when skipping during playback. It may be connected especially to the tasks that I linked as "related tasks".

Edit: This patch is not working correct. It has negative side-effects on my simulation and iPod Video.

Yeah: I can see a few issues that it might cause. Can you describe the side-effects? Hanging at the end of a track, maybe? I'll post a replacement once I'm sure its fixed properly.

Try this version. It can probably be made a bit cleaner, but testing with this is much more stable for me.

The failure was a playback lock-up when manually switching from track A to track B. In this case the buffering was stuck in a loop.

The test case which always failed for me on iPod Video target and simulation works fine on simulation with this patch. It also survived some intense seeking/skipping. I will further test on target.

Tested a while on iPod Video now. The tests included long time playback (>10h) and heavy skipping/seeking forward/backward to force rebuffering. I experienced no side effects so far.

Peter, can you describe the issues you had in detail? And can you explain what your patch does and what the failure (code-wise) was?

MikeS commented on 2011-02-03 23:56

There are fishy things inside there. Things like "new_widx + sizeof(struct memory_handle) > buffer_len" look like they should be "new_widx + sizeof(struct memory_handle) >= buffer_len" since the guard buffer starts at buffer_len bytes offset. ">" instead of ">=" would put things one byte into the guard buffer.

MikeS commented on 2011-02-04 20:19

Hmmm….I don't think what I said is the problem per se but I'm also not sure exactly how it messes up to bring widx all the way back to ridx. Avoiding that state seems to be programmed in already.

MikeS commented on 2011-02-09 05:25

I decided I'd go through every line of code meticulously. I have done some partial renovation buffering in the patch attached and still have some more checking to do. Wouldn't mind having it all in for the next release. I found some errors and overcomplications.

Edit: change .diff to .patch because IE is ok with viewing that

Michael, I really appreciate that you walk through this piece of code and found several issues. I will be happy to test your changes over weekend :)

Some questions so far:
1) Is the patch by Peter Edwards included as well or did you find another way?
2) If you should submit those changes: Can you split this to several independent commits? E.g.

a) changing variables / structs (e.g. "data_counters" to "dc" or "h->id" to "handle_id")
b) removing mutexes
c) the logical changes?

If there should occur any side effects we can then better bisect and track down any issues.

Thanks for your work!

MikeS commented on 2011-02-09 08:04

Thank you. Please note: I haven't validated the handle movement code in its entirety yet.

1) No, though I did find one spot so far where > should be >= to avoid the said situation. It's the one I know of ATM. Rebuffering check needed to be >= as well since file position is 0-based and then account for seeking to EOF.
2) a) b) c) Not completely possible since things like changing the locking scheme and fixing up problems need changing things for thread safety. Perhaps the minor things mentioned in 1) could be separated from thread stuff.

Not calling find_handle after a queue_send call was not correct but the need largely obviated (and there's no need to worry if a compaction-triggering event won't be in the queue before the sender's message).

handle_id is, for example, to cache the id through possible buffer compactions because h→id would no longer be valid thus lost entirely.Logical changes are rather minor except moving rebuffer logic almost entirely onto the buffer thread, which avoids having to re-find the handle after each queue_send and passes fewer requests to buffering.

Naturally, I'll remove whatever becomes obsolete as a consequence of tigtening things up.

MikeS commented on 2011-02-09 11:44

Resynced because I committed some of it already. Also, I had the logic in seek_handle incorrect.

MikeS commented on 2011-02-09 22:58

Resync to addition of embedded album art.

MikeS commented on 2011-02-10 11:32

Sync yet again :) to r29275.

MikeS commented on 2011-02-11 05:10

Urgh, yes, it does need a bit more substance to the locking so that handles aren't added while bitmaps are being loaded because the bitmaps haven't finalized the total handle size until the load is finished.

MikeS commented on 2011-02-12 13:01

Guess what? r29289 resync. :p

MikeS commented on 2011-02-14 11:40

Patches are now obsolete. r29303 has the changes. There's still some danger on USB connect but alot of nasty stuff was squashed. Onto bigger and better things… :D

MikeS commented on 2011-02-14 14:14

There is something odd. I couldn't even get my H10 to commit the database (data abort 4D524 or 4D510 over and over). But, I disabled DMA such that only one instruction changed but the count remained the same. Doing that it finished init just fine and no dircache issue. So here's a patch that does that to try out.

If it's not a cacheline issue perhaps there could there be a stray drive interrupt?

The issues described in this bug report also happen on my iPod Nano 2nd Gen, but instead of saying "Data abort" it says "Invalid instruction at 00FF00FF00" where 00FF00FF00 is some hexadecimal reference that always change.

I wrote a huge and confuse (but very detailed) forum post on what happens with my device, here: http://forums.rockbox.org/index.php/topic,27260.msg178068.html#msg178068

MikeS commented on 2011-02-19 13:42

Please check this out, at least on a target (patched or otherwise) than doesn't seem to be subject to other issues. Gets the codec thread under control of the audio thread and make it a dumb thread that simply obeys commands instead of worring about PCM, playback state or anything else about the engine itself and take care of most of the horrid raciness. Also, fix codecs' checking of metadata readiness which can cause them to loop forever waiting for it. Some were using it before it was ready or checking the pointer variable for NULL instead of its referenced value. Some unifying of encoder/decoder interface to facilitate all this.

Note: There are bugs that exist in SVN that I will not bother to fix without at least a minor renovation of the codec format since it's almost impossible to fix them otherwise but that's a bit more extreme than I would like to get at this time. They have to do with depending upon the presence of a codec to signal seeks and skips at a time when it's quite impossible to have one handy.

Said existing bugs are the only way playback seems to break for me but with those it's stable enough to stop and restart to recover. I haven't been able to actually crash anything skipping/seeking on an otherwise stable target.

MikeS commented on 2011-02-19 20:11

One change had to be made– codec must always seek the handle for the next one when exiting and changing over to a new one. Also, I split the metadata wait fix to the codecs because that can be committed separately.

So, you want me to try that out? I warn you that my build is patched, thus it is possible that part of these problems I have are potentiated by the changes I made (I enabled the piezo, added two games and the moonrock plugin - this last one isn't, BTW, working properly with my latest builds), but the playback bugs I see (or better, listen) seem to not be related to those changes as they also existed on older builds I complied (before I added so many patches), also none of my bugs change any of the files the patches above change.

I'll try patching that against my trunk and when I'm done I'll tell you the results.

MikeS commented on 2011-02-23 15:47

I have committed to SVN with further refinement after the 3.8 branch. Everything deals with longstanding issues of the design, some a result of having too basic a kernel API at the time. Further changes are planned to make the playback system robust against changes outside it and to seal it up to make other code robust against changes to it. As mentioned above, not every bug can be properly squashed at this point.

There are still issues when codecs fail for some reason. Not much change there except maybe the specifics of what goes on. Seeking/skipping at the end of the last track still doesn't work as it should (by seeking/skipping) but now the playback won't go dead.


In the technical side, my plan is to divide codec operation into several exported entrypoints:
1) The init code which does the main crt0/codec initialization and returns. If it creates threads, it must do it here or notify of any failures. If it fails this, 2) won't be called but 3) will.
2) The decoding code which returns after every track completes, possibly subdivided:

 a) seeking 
 b) decoding

3) Cleanup code (most important if it creates threads internally or uses any other system resources)

The codec will stay resident and ready for more tracks until 3) is called. It must reset itself if it previously errored in 2) for any reason.

All this places the presence of the codec under playback control and not codec control, which lead to the mentioned difficulties. Right now, the entrypoint cannot be called again without a full restart. Other side effects include simpler and more consistent codec implementations and the ability to decide precisely when it is running or not; it can more easily listen to its message queue at all times.

(Sorry for the ramble! :)

I updated my code trunk after you committed the changes to SVN, compiled and updated the build on the player.

It seems much more stable. It played three tracks I chose with the file browser, at the end of the last one I let it playing (shuffle enabled) and it played 8 more tracks without hanging, clicks or other strange noises.
It was at the end of the 8th music, I had to go quickly check something and let the iPod playing, when I came back, it was with the screen turned off, no sound on the speakers and when I pressed the scrollwheel the screen didn't turn on, so in conclusion the player was hung again. I did a Menu+Select reset and it's playing again. Now let's see for how much time.
All the tracks the iPod has been playing are OGG tracks encoded at an average of 160 kbps and an average duration of 4 minutes.

Anyways, the playback seems more stable. I needed to change of language and Rockbox usually pauses playback for 0,5s while changing language (it also happens sometimes on the themes), this time it paused and sound didn't come on until I went to the WPS. There, it acted like it had been muted: playback had paused at aprox. 0:58s and it resumed at 1:05s. I changed of language again, and this time it kept playing after the short (0,5s) interruption.

In conclusion:
Your improvements seem to have a good effect. So I think your plan is the way to follow in order to have a more consistent code and stable playback :)
For now, Rockbox is able to play at least 8 tracks without hanging which is a pretty good mark, taking into account that before it had times that I could only listen to a minute and half of audio.

Skipping tracks backwards repeatedly causes the sim to segfault - no idea if this is a new issue or not, but thought you should know! My config (crossfade enabled) is attached and gdb details follow…

Program received signal SIGSEGV, Segmentation fault.
0x0044aa54 in pcmbuf_monitor_track_change (monitor=false) at /home/Steve/rockbox/apps/pcmbuf.c:496
496 read_end_chunk→end_of_track = monitor;
(gdb) print read_end_chunk
$1 = (struct chunkdesc *) 0x0
(gdb) print last_chunksize
$2 = 34304
(gdb) bt
#0 0x0044aa54 in pcmbuf_monitor_track_change (monitor=false) at /home/Steve/rockbox/apps/pcmbuf.c:496
#1 0x0044ab1d in pcmbuf_start_track_change (auto_skip=false) at /home/Steve/rockbox/apps/pcmbuf.c:548
#2 0x0044e6e5 in audio_check_new_track () at /home/Steve/rockbox/apps/playback.c:1727
#3 0x0044ef35 in audio_thread () at /home/Steve/rockbox/apps/playback.c:2199
#4 0x0047149c in start_thread () at /home/Steve/rockbox/firmware/target/hosted/thread-win32.c:53
#5 0x7c80b729 in KERNEL32!GetModuleFileNameA () from /cygdrive/c/WINDOWS/system32/kernel32.dll
#6 0x7c82ffb1 in KERNEL32!ConvertThreadToFiber () from /cygdrive/c/WINDOWS/system32/kernel32.dll
Cannot access memory at address 0xbaadf011

MikeS commented on 2011-02-25 03:13

HA! This whole thing is whack-a-mole at this point. There is a data abort in vorbis that I thought was caused by the major changes but it turns out it's still there going back to at least the metadata wait redo– it just changes where it happens. Lemme re-evaluate that function 'cause I think I misread in all the mayhem about read_end_chunk not being NULL if last_chunksize is not 0.

MikeS commented on 2011-02-25 03:55

I think last_chunksize just needed to be cleared at pcm stop since having a current buffer going out on DMA of size 0 is pure bullocks.

Coming back to the theory of ATA DMA messing up the playback / buffering on PP502x targets. On IRC dewdroid and JdGordon both reported to not have any issues anymore when commenting HAVE_ATA_DMA. JdGordon tested for ~25h, dewdroid for ~1 week in daily use (and stated he had the lockups and static noise several times a week). If you check the "Related Tasks" to this one you will find that the lockup + static noise is only reported for PP-targets.

In my opinion we should switch off ATA_DMA on Trunk and for a 3.8.1 until the underlying issue is solved.

Buschel, I remember -something- about the Rio Karma which is also portal player based in this thread by one of the developers:
http://forums.rockbox.org/index.php?topic=14226.15

"Thirdly, there is a very serious silicon bug in the cache on the PP5003. If you ever flush the cache (or a region of it), it can get into a state where the wrong cache line gets written randomly somewhere in external memory. And, because the two CPUs on the 5003 are not cache-coherent, you end up flushing the cache a lot if you try and run stuff on both CPUs simultaneously. (Which you need to do, because the two clocks are locked together, so if you've got 90MHz worth of work to do, it uses much less power to run both CPUs at 45MHz than it does to do it all on one CPU and have the other doing 90MHz worth of nothing.) There is a workaround, which is only available because (bizarrely but fortunately) the 5003 memory-maps the entire internal logic of the cache controller, so you can go and rewrite all the tags and dirty bits to point to somewhere innocuous. Do not issue a cache flush, on either CPU, without following it up (running from IRAM) with the workaround sequence: arbitrary code or data will be overwritten, and your program will randomly crash. (Like, *cough*, the Karma 1.0 firmware did until we fixed it in 1.2.) This workaround is in the application code, not Ecos itself, and so does not appear in the ecos-last-rhepl diff."

I also remember reading from the same developer that the PP5003 is internally more a PP5020 than a PP5002. Hope that helps.

Was the v07 patch in  FS#11830  (http://www.rockbox.org/tracker/task/11830?getfile=23052) not committed - only v06? Guessing v07 would be harmless now that ATA DMA is disabled ?

without a reliable repro this is going to be difficult to track down but im giving it a go anyway. I'm prettty sure 3.7 was good so i've gone through the changesets since then and looks for obvious candidates and built a dozen builds to try over the weekend. (I'm AFK untill monday so i'm doing a bunch of builds and loading them up and going to swicth whenever a build shows problems).

rockbox-28525.zip rockbox-28953.zip rockbox-29248.zip
rockbox-28700.zip rockbox-29056.zip rockbox-r28435.zip
rockbox-28885.zip rockbox-29106.zip rockbox-r29510.zip

those are the builds im starting with, if anyone has better suggestions i can build tomorow while at work. hopefully we can track it down to at least a few dozen changes instead of ~1100

Jonathan, there are bug reports which state those issues have been first experienced with 3.7 (coming from 3.4/3.5 – see  FS#11764 ). So, there seems to be evidence that 3.7 shows the error as well. In  FS#11775  it is stated that 3.6 does not show this effect. I also had local software builds which definately showed this lockup and noise issue, but they did only once. I never were able to reproduce…

yeah, thats what is worying me, 3.7 seems to be fine on my mini

MikeS commented on 2011-03-08 08:39

@Thomas: I'm not sure what they're getting at with "the two clocks are locked together" thing since the cores are independently able to be unclocked and placed in WFI mode or slept for various specified periods of time.

Anyway, the cache has seemed reliable with the likes of mpegplayer, fft and DMA audio playback, and loading plugins in general which very much rely on the cache coherency routines in places. I had a tester plugin that would ping-pong stuff between cores that ran for extended periods without trouble.

*Car adapter only*

I started having very similar issues when I went from 3.7.1 to 3.8 Data Aborts, Splash screen hung at "Scanning" etc. Once I CTR+Menu enough times to get it playing, it would put out faint clicking sounds or no sound at all (out of the headphone jack.) Sometime the playback would stop/start irratically as well. I have treid the 3-11-2011 daily build as well; same results. I have an 80GB Video.

In my case, using the player on battery with headphones seems to work normally, although I generally dont use my player that way. I own it specifically to use with the OEM Mini (BMW) iPod interface. Previous verisons have had varying levels of support for the OEM control functions, but I could always at least hear the sound and the car would keep the player charged…

Loading...

Available keyboard shortcuts

Tasklist

Task Details

Task Editing