Rockbox

Tasklist

FS#12391 - PP502x commit_discard_idcache() causes memory corruption

Attached to Project: Rockbox
Opened by Boris Gjenero (dreamlayers) - Thursday, 17 November 2011, 18:04 GMT
Last edited by Boris Gjenero (dreamlayers) - Sunday, 26 May 2013, 17:56 GMT
Task Type Bugs
Category Operating System/Drivers
Status Closed
Assigned To No-one
Operating System PortalPlayer-based
Severity Low
Priority Normal
Reported Version Daily build (which?)
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

If I add "if (!write) cpucache_invalidate();" to ata_transfer_sectors in firmware/drivers/ata.c, a database commit almost always crashes. I suspect this is related to instability seen in PP502x IDE DMA ( FS#9708 ) (which was disabled in r29476 due to instability). Note that here, I'm just adding the cache_invalidate calls and not turning on IDE DMA.

The crash does not always happen during the same stage (number) of the database commit, but it always happens at the same spot: near the start of tempbuf_sort, currently starting at line 2088 in apps/tagcache.c:
idlist = &lookup[i]->idlist;
while (idlist->next != NULL)
idlist = idlist->next;
The compiler inlines tempbuf_sort into build_index. One time I printed the first pointer pointing outside of RAM and it was 0xc0edbabe.

I'm attaching a patch I used to add the code to ata_transfer_sectors. If you want to reproduce this bug, you should probably keep the LCD awake manually or disable LCD sleep so that you can see the crash message. Note that /.rockbox/database_tmp.tcd will remain, and Rockbox will again attempt to commit the database on next startup. If /.rockbox/rockbox.ipod contains the patch, you may have to restart into disk mode to fix this, so you might want to put the patched rockbox.ipod elsewhere and load it from within Rockbox. If others can't reproduce this, I can attach my database_tmp.tcd file.

I've confirmed this with r30989 and r31001 on my 5G 30GB iPod. Defining FORCE_SINGLE_CORE and increasing the tagcache stack did not help. I suspect this is a problem with PP502x cpucache_invalidate(), and not with the database.
This task depends upon

Closed by  Boris Gjenero (dreamlayers)
Sunday, 26 May 2013, 17:56 GMT
Reason for closing:  Fixed
Additional comments about closing:  Fixed in 0fec841
Comment by Boris Gjenero (dreamlayers) - Saturday, 19 November 2011, 08:18 GMT
I might have a solution here. I've been able to commit the database multiple times, both with just the cpucache_invalidate() and with UDMA 4.

The theory behind this is that the cache must be filled with valid data before being flushed, or else some unused cache lines trash memory. Apparently, the PP5020 may have a bug that's also in the PP5003. Thanks to pdh11 for this post: http://forums.rockbox.org/index.php/topic,14226.msg110998.html#msg110998 and thanks to thomasjfox for pointing it out in  FS#11863  discussion.

I've only seen the OF do this after enabling the cache and I've never seen a flush+invalidate used except when turning off the cache. Here, for a successful database commit I just need to do it after invalidating, but I guess it should also be done after the init.

Unfortunately, this workaround is horrible: it involves loading 8KB of probably useless data into the cache after every invalidate. I'll try to create a better alternative which just rewrites the cache status data to send that corruption "somewhere innocuous". The "PortalPlayer PP5024 memory controller & cache(s) v0.4" document by MrH should help here: http://daniel.haxx.se/sansa/memory_controller.txt
Comment by Michael Sevakis (MikeS) - Saturday, 19 November 2011, 08:45 GMT
Argh, I forgot about that bit with the filling. It is present in the PP5020 OF and does it in the beginning of its memory map. It didn't seem to be required to produce proper dual-core results so was taken out (it used to be in the RB source).

As far as I know a flush+invalidate is mandatory when invalidating and plain invalidate isn't possible. I'm not sure if the deliberate fill is needed on all PP502x (like 5024).

I have seen flush+invalidate in the e200 emulator used with every DMA transfer on the SD card during reads.
Comment by Boris Gjenero (dreamlayers) - Sunday, 20 November 2011, 15:54 GMT
I first created a patch (pp502x_point_cache_to_buffer.patch) which points all the cache lines to a 2KB buffer. (Since the cache is 4-way set associative, 4 lines can be set to the same memory location, and so only 2KB is needed for the 8KB of cache.) For this to actually help, the lines had to be marked as valid. This prevented the database crash, but I did not see the buffer being altered unless the lines are marked as dirty.

Since the buffer wasn't altered, I thought it wasn't necessary so I instead pointed the cache lines past the end of memory (pp502x_cache_point_past_end.patch). This also fixed the database crash.

Based on observations, it seems like the problem can be triggered by just a cache flush. Maybe lines which aren't marked as valid can get flushed to locations other than where their address field points. I suspect only DMA seems to trigger it because other cache invalidates in Rockbox are rare enough that there's enough time for the cache to refill.

Unfortunately, I found another way to reproduce the problem, and only cache filling reliably fixes it. I'm attaching a patch which causes a lockup at startup, right after the screen is cleared. This happens in r31026 (on my 5G iPod with 32MB RAM and the stock 30GB disk) with just this patch, with all database and configuration files deleted. The if condition in ata.c is required; if the cpucache_invalidate() happens always, I don't get the lockup. The variable in debug_menu.c can be any size, including multiples of 16. What matters is that it's CACHEALIGN_ATTR and initialized.

Edit: I think cache_crash_without_database.diff is another issue: objcopy -O binary outputs sections in the order they are in the .elf file, without caring about their addresses and alignment. Rockbox moves sections, but only a few targets move ".data". If a target doesn't move it, and it's more than word aligned, ".data" can end up at the wrong place. I think pp502x_cache_point_past_end.patch is the better solution, so I spent more time cleaning up and commenting that. The other one is attached in case someone wants to experiment with it to see if the buffer gets trashed.
Comment by Michael Sevakis (MikeS) - Sunday, 20 November 2011, 21:51 GMT
I am a bit puzzled because DMA for audio has been used with writeback for coherency until the mixer was introduced, which now uses IRAM for DMA. Direct PCM users still use their own buffers. No major faults seemed to happen, aside from the possibility of the FIFO going empty and causing a click in certain circumstances. The flushes of course happened often and at any time since it was called from the FIQ handler.

Who knows though. Perhaps some oddball mystery bugs had something to do with it.
Comment by Boris Gjenero (dreamlayers) - Monday, 21 November 2011, 01:22 GMT
The problem only happens when flushing cache lines which are not marked as valid. There is just 8 KB of cache per core, so it gets filled with valid data quickly. Once a line is valid, it should stay valid until an invalidate. After the invalidate, everything is okay as long as the cache gets re-filled before the next flush or flush+invalidate.

MP3 playback uses IRAM to move data between cores, and it seems only MPEGplayer and FFT use frequent invalidates for that. Both may do enough work between invalidates to fill the cache. In MPEGplayer, audio playback seems to be from SHAREDBSS_ATTR, so unpredictably timed flushes from the FIQ handler shouldn't happen. Flushes from FIQ also never affected FFT, because it is newer than the mixer. Other invalidates happen rarely.

It should be possible to avoid use of cpucache_invalidate() by invalidating manually. I can think of two possibilities using the memory-mapped status words:
- flush normally and then clobber all cache status words that might need to be invalidated based on their location in the cache
- test all status words that might point to the region being invalidated, and clobber only those that point to locations being invalidated

You can ignore cache_crash_without_database.diff. It's another bug:  FS#12397  - .data section alignment ignored on some targets
Comment by Michael Sevakis (MikeS) - Monday, 21 November 2011, 02:39 GMT
It used to be the case the PCM was played directly out of the PCM buffer in pcmbuffer.c. So in fact it did invalidate quite frequently (several Hz). MPEGPlayer (38Hz) was also the case but it uses the mixer these days. Other plugins play PCM directly from their buffers.

One plugin that hits the cache functions frequently is FFT since it's dual-core on PP.

If we can invalidate manually, can we not introduce range operations? That would be preferable and wouldn't prevent interrupts for extended periods.

Also, I attached a plugin for messing around with buffers between cores. Not sure if it still compiles.
Comment by Boris Gjenero (dreamlayers) - Monday, 21 November 2011, 04:50 GMT
I don't understand why you write about invalidating when PCM is playing. Are you confusing flushing and invalidating?

When writing or outputting stuff via DMA , you just need to flush the cache. That means: all cache lines which contain changes which haven't been written to memory must be written out to memory. You're just ensuring that the DMA sees an up to date copy of the data you want to write. When you're writing or outputting, the DMA doesn't change RAM, and so the cache can continue assuming things about the contents of memory. The cpucache_flush() function only performs a flush; it does not perform an invalidate.

Invalidation means telling the cache to stop assuming it knows the contents of memory. That means, if you attempt to read some data that was cached before the invalidate, the cache won't give you a cached copy, and it will instead re-read the data from memory instead. It's only needed when you're inputting or reading via DMA. It is done because DMA changes memory but doesn't update the cache. It is required to ensure that the current core sees up-to-date data from memory, and not old cached data.

Invalidation requires a flush to be performed first, unless the invalidation is confined to a specific area where data from before doesn't have to be preserved. If the whole cache is invalidated and a flush isn't performed first, important changes to RAM outside the desired area may be lost. The cpucache_invalidate() function does perform a flush first.

Of course, if there was PCM DMA for recording, invalidates would be needed, but fiq_record() in firmware/target/arm/pcm-pp.c doesn't use DMA.

Yes, manual invalidation can support cache range operations.

To compile test_cache.c, just remove PLUGIN_HEADER, PLUGIN_IRAM_DECLARE and PLUGIN_IRAM_INIT(rb). Those are not needed anymore. It's explained in the apps/plugin.h revision log. I left test_cache running for about half an hour and it didn't find any mismatches. I think the test isn't very useful for spotting this kind of memory corruption. It ensures that the data being moved is valid, but the problem seems to be memory corruption elsewhere. The surprising thing is that Rockbox doesn't crash after all those cache_invalidate() calls. If each one had a significant chance of memory corruption, there would be a lot of corruption. I wonder what's going on.
Comment by Michael Sevakis (MikeS) - Monday, 21 November 2011, 05:15 GMT
Indeed, I meant "flush" or "writeback", not "invalidate" with the audio DMA (just silly typos). Didn't you mention something about an issue with that as well?

I wrote the test_cache thing when making dualcore work in a general-ish way in the kernel but have not updated or tried to compile it since that time.

One of my big "what the fuuuu's" is you can call those flush/invalidate functions alot in other contexts without any problems manifesting. What's so special about the disk transfers? That's all I'm saying. Yes it is a big, as you say,"I wonder what's going on."

For all I know, it might be the case that OF goes through lengths to avoid cache flush/invalidate use. It's very expensive and buffer copying using an intermediate can actually be cheaper. But, I haven 't investigated. Perhaps the ATA controller has some nasty effect on these things. Other issues come up in regard to accessing the drive on iPod Video, hence that strange object hack that prevents the CPU core from sleeping in ata.c.
Comment by Boris Gjenero (dreamlayers) - Monday, 21 November 2011, 15:44 GMT
Flushes by themselves are not dangerous. I've never seen a crash just due to flushes, and the OF uses flushes in many places without any special workaround.

I thought that flushes might be dangerous soon after cache initialization or an invalidate, when some cache lines didn't have valid data in them. However, those observations were confused by  FS#12397  so ignore this.

I'm not sure if disk transfers are special. In my original patch here which reproduces the problem, the invalidate is in ata_transfer_sectors, but there's no ATA DMA. Maybe I should try moving that invalidate elsewhere to check if I can still reproduce the problem then.

I just had a cache test running all night from the bootloader. It checked the first 16MB of RAM for unexpected changes. I didn't find any corruption. Maybe a very specific memory access pattern is needed to reproduce this bug. That would explain why the crash in tempbuf_sort is so easy to reproduce, but other functionality such as MP3 playback seems unaffected by added cache invalidates.

Edit: I did a RAM dump from tempbuf_sort when things go wrong. I see: i = 2360, &lookup[i] = 0x1CB150, lookup[i] = 0xEA000003. All neighbouring pointers in the lookup array seem reasonable, but that particular one has been replaced by an ARM instruction. 0xEA000003 occurs in many places, including one place, 0x6A150, where address & 0x7FF == 0x150, which means it can get cached to the same place.
Comment by Boris Gjenero (dreamlayers) - Tuesday, 22 November 2011, 01:12 GMT
I finally have a tool which reliably reproduces the problem! :)

This is a plugin. Put it in apps/plugins and add it to SOURCES there. It uses the preprocessor to create a lot of code which does writes and invalidates. After running that for a while, altered words are counted and the cycle repeats. When a certain number of altered words is reached or select is pressed, the loop exits. If any altered words were found, a memory dump is written to /ram.dump before exiting. There are some compile-time options near the start of the source.

It seems that a lot of code reads and data writes, both involving the cache, are required to trigger this bug. I didn't see corruption when there are just data reads. Words from code being executed overwrite data, and when a larger amount of code is being read into the cache, there is more corruption.

The pp502x_cache_point_past_end.patch was pointing the cache past 8 MB, not past the end of RAM. A fixed version is attached. Both that fixed patch and pp502x_point_cache_to_buffer.patch seem to fix the problem, though I didn't run them for huge lengths of time. I'm also attaching dump_cachetrash.patch, which dumps the buffer where the cache gets pointed by pp502x_point_cache_to_buffer.patch. (It requires the buffer patch, because the buffer doesn't exist in the patch that points cache lines past the end of RAM.) I didn't see any modifications in the buffer.

There is some unreliability when using the select button to exit the plugin. You may have to press it multiple times. Rockbox appears to still work even after a lot of corruption is reported, probably because corruption is confined to where writes happened. Nevertheless, it's a good idea to at least reboot after corruption is reported.

Edit: an updated point past end patch is attached. There is no functional change. It is just a resync with some comment edits.

Edit: I'm attaching a further simplified fix which removes the code which tells cache hardware to invalidate/discard and instead uses commit_dcache(). The workaround performs the function of invalidation.

Loading...