Rockbox

Tasklist

FS#8663 - Data corruption on usb write on sansa (SD driver bugs)

Attached to Project: Rockbox
Opened by Frank Gevaerts (fg) - Thursday, 28 February 2008, 23:00 GMT
Last edited by Alex Parker (BigBambi) - Sunday, 06 June 2010, 10:52 GMT
Task Type Patches
Category Drivers
Status Closed
Assigned To No-one
Operating System Sansa c200
Severity Low
Priority Normal
Reported Version Daily build (which?)
Due in Version Version 3.1
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

The udelay() in ata-c200_e200.c is not needed on my c250, and actually seems to cause write errors during USB connections. This patch makes it conditionnal on SANSA_E200.

Not committed yet because of the potential problems if I'm wrong
This task depends upon

Closed by  Alex Parker (BigBambi)
Sunday, 06 June 2010, 10:52 GMT
Reason for closing:  Out of Date
Comment by Michael Sevakis (MikeS) - Friday, 29 February 2008, 02:46 GMT
We really need to find out why this creepy problem occurs at all. :\
Comment by Mark Arigo (lowlight) - Friday, 29 February 2008, 16:01 GMT
The OF uses the DMA engine for transfers so it's hard to say if it's needed.
If someone would like to experiment with DMA read/writes, I've attached the
disassembly from the c200 bootloader.
Comment by Tomasz Moń (desowin) - Monday, 03 March 2008, 16:55 GMT
Is that udelay really needed on e200?

I've removed that udelay, build with USE_ROCKBOX_USB and USB_HIGH_SPEED.
Then I've connected sansa to computer, copied xubuntu-7.10pl-desktop-i386.iso to microSD card, during copying dmesg said many high speed device reset and two "usb 2-3: device descriptor read/all, error -110" between device resets.
I've put card in card reader then, diffed file against the original - it was exactly the same!
Then, I've started sansa, put microsd card into it, copied file from card to internet memory on sansa (in rockbox), rebooted sansa into OF, and diffed that file against the original one at my computer - it was eactly the same as well.
Comment by Frank Gevaerts (fg) - Monday, 10 March 2008, 18:47 GMT
What happens is that "something" inserts two bytes at an offset of 0x220 into the write-block. The bytes are copied from 32 bytes earlier. Note that they are _inserted_, not overwritten. It only happens if there is a USB transfer still active (so it might be a USB related ISR that causes this, or USB related DMA).

Further testing shows that on my c250 it doesn't seem to happen (or happens much less) on the internal flash. Only the micro-sd slot seems to be affected.
Comment by Toni (ahellmann) - Friday, 28 March 2008, 17:05 GMT
I implemented rudimentary DMA write access on the SD driver for e260 and observed exactly the same effect (2 Bytes inserted / copied from 32 Bytes ahead at random places) at 30MHz and 24MHz, while writing data to disk (without USB dma involved!). I don't know the consequence of this.
Comment by Toni (ahellmann) - Friday, 28 March 2008, 20:19 GMT
After changing PLL_CONTROL from 0x8a220501 to 0x8a120501 at 30MHz the dma file transfer seems to be much more stable now on the sansa. At 24MHz I still get sporadically this '2 bytes inserted' effect. I didn't check other negative effects after changing the PLL_CONTROL setting.
Comment by Frank Gevaerts (fg) - Sunday, 20 April 2008, 13:53 GMT
The problem seems to go away if the write data is in IRAM
Comment by Martin Ritter (MartinR) - Monday, 21 April 2008, 11:46 GMT
This controller is very similar to the one in PXA255 / PXA27x series.
I would like to change the register definitions accordingly to the documentation of these chips to make further work a bit easier.
They would read like MMC_STRPCL, MMC_STAT, MMC_CLKRT and so on.
Does anybody mind?
Comment by Martin Ritter (MartinR) - Tuesday, 22 April 2008, 13:53 GMT
Here is my work on changing the register and bit definitions accordingly to the PXA27x Developers Manual.
Most of them obviously make sense, some unused registers have been identified by probing their default values with e200tool.
The rest is still guessing, especially the CMDAT and I_MASK bits.
It would be nice, if someone would consider to commit this or comment on what to change.
It shouldn't harm anything, as there are no functional changes yet.
Comment by Martin Ritter (MartinR) - Thursday, 12 June 2008, 08:38 GMT
This might be kind of a workaround for this bug. Surprisingly, it can be circumvented by filling the FIFO of the controller very fast, using a chunk of asm. Though, there has to be a short delay between detecting FIFO_EMPTY and filling the FIFO. Without this I got lockups and corruption. But if it's too long, the 'two bytes inserted' bug reappears. Probably this udelay has to be adjusted for c200 and GoGear. It has to be as short as possible.

The asm code will only work with an aligned buffer. With an unaligned buffer, the old slow routine is used along with an additional udelay to prevent the bug (that's odd, yes). But do we actually ever have an unaligned buffer with ata_write_sectors?

I did very intense testing and it turned out to be absolutely reliable with my e260 and a Sandisk 4GB SDHC card for several weeks now. Write speed is around 3.5 to 4 MB/s on the SD card, the internal flash is slower.

Though, I'm quite uncertain on how it will behave with different players or SD cards because of this fragile timing. Maybe even the speed of the SD card might have its impact (I only own one card). The speed of the internal flash might differ between devices as well. So it will still need a lot of testing. Backups!
Comment by Jonathan Backer (jrbil) - Thursday, 12 June 2008, 16:45 GMT
I just tried Martin's patch on my e280 with 2GB microSD card and found no corruption. Not extensive (about 1 GB) but seems to work.
Comment by Antoine Cellerier (dionoea) - Thursday, 12 June 2008, 17:03 GMT
Here's the result of my tests with and without the patch (udelay removal) on a Sansa e260 with a 4GB microSD HC card. All of these tests used full speed USB (high speed is a bit unreliable). Test protocol was simple:
1/ compile rockbox for clean svn trunk
2/ install on player using original firmware for transfer
3/ disconnect player
4/ connect using rockbox and copy 550 files (mostly music tracks)
5/ disconnect player
6/ connect using rockbox and run md5sum -c reference.md5 to check if files were ok or not.

Just to make sure that the microsd errors weren't due to a crappy partition, I reformated the disk (with udelay removed) and ran the test (with udelay included) again.

svn-trunk.17708.internal.check
550 OK
svn-trunk.17708.microsd.check
481 FAILED
27 FAILED open or read
42 OK
svn-trunk.17708.microsd.validation.check
465 FAILED
7 FAILED open or read
78 OK
svn-trunk.17708.no-udelay-in-sd-ata.internal.check
460 OK
svn-trunk.17708.no-udelay-in-sd-ata.microsd.check
550 OK

As you can no doubt see, removing the udelay fixes writes to the microsd card.
Comment by Jonathan Backer (jrbil) - Thursday, 12 June 2008, 19:18 GMT
Just with internal flash (removed SD), I formatted, copied 280 files and 1.2 GB. Did a diff and fsck, which found no corruption. Built the database and rebooted. Now I have corruption (*PANIC* Dir entry 6 in sector 3 is not free!).
Comment by Jonathan Backer (jrbil) - Thursday, 12 June 2008, 19:48 GMT
Just with internal flash (removed SD), I formatted, copied 280 files and 1.2 GB. Did a diff and fsck, which found no corruption. Built the database and rebooted. Now I have corruption (*PANIC* Dir entry 6 in sector 3 is not free!).
Comment by Martin Ritter (MartinR) - Friday, 13 June 2008, 07:44 GMT
Jonathan, thanks for testing. Can you please increase the udelay from 2 to 3 and repeat the database rebuild? Also try copying from internal flash to SD and back again.
Antoine, I assume you are referring to Franks initial patch. For me, that udelay was definitely required on my e260. But I always used high speed USB for my tests.
Comment by Antoine Cellerier (dionoea) - Friday, 13 June 2008, 10:53 GMT
Yes, I'm referring to the original patch. It works fine with Full Speed. I'll give it a try with high speed (I get lots of connection resets with high speed but it's still kind of usable) and post the results here. Anyway, since this seems to be the only reliable way to copy files without corruption on my microSD HC card (no support from OF), I'm using the udelay-less build for everyday stuff so I'll be able to see if it breaks at some point in the future.

(I already was able to copy files without the udelay over high speed, but I haven't had time to do the md5sums yet. What kind of issues were you getting? Are/were they easily reproducible?)
Comment by Martin Ritter (MartinR) - Friday, 13 June 2008, 12:46 GMT
Antoine, I just checked that again. Removing the initial udelay causes corruption on both the SD card and the internal flash for me. Especially on internal writes like building the database or when copying files within Rockbox. Just now Rockbox completely locked up on a file copy. USB transfers to the internal flash do apparently working though.
Comment by Martin Ritter (MartinR) - Friday, 13 June 2008, 13:06 GMT
New version of my patch:
- increased the udelay from 2 to 3 (Apparently, I was too optimistic)
- cleaned up the asm code
It's quite a bit slower but should be more reliable.
Comment by Michael Sevakis (MikeS) - Friday, 13 June 2008, 15:46 GMT
This wasn't a controlled check but perhaps worth noting: I had copied some movies over to my e260 for mpegplayer development with the retailos USB mode and one of them had some corruption in it. Copying over it again from the same source file fixed it. I can't say whether or not rockbox had some role in it but the file never had writes made to it by rockbox itself. It could just be yet another PP hardware bug.
Comment by Jonathan Backer (jrbil) - Friday, 13 June 2008, 18:20 GMT
Just tried the latest patch:

From computer to internal flash I transferred 257 files totaling 1.2 GB. Only one file was corrupted.
From SD to internal flash I transferred 234 files totaling 1.6 GB. Only three were corrupted.

Almost there. Even with the additional delays, high speed transfer is easily 3 times faster than full speed.
Comment by Jonathan Backer (jrbil) - Friday, 13 June 2008, 21:25 GMT
Upped the udelay to 4 and still get corruption in 1 of the 257 files when I retried the transfer from the computer to the internal flash. Without your patch 196 of the 257 files get corrupted.

I just tried to do the tests with the OF. Although I verified that there was no corruption, my computer lost connection several times (KDE automatically unmounted and remounted) which caused me to restart my verification several times. I did not have this problem with rockbox. I'm curious if other people can reproduce my problems with your latest patch.
Comment by Jonathan Backer (jrbil) - Saturday, 14 June 2008, 00:40 GMT
OK. Testing on my linux box at home. With your patch, I've copied over 750 files and 3.5 GB without any corruption whatsoever.

Without the patch, running hi-speed I get corruption on 203 of 253 files. Unfortunately, I left my SD card at work, so I can't test it.
Comment by Jonathan Backer (jrbil) - Saturday, 14 June 2008, 02:26 GMT
Copied approx 500 files (3.5 GB) and got one error. So the problem seems less frequent on my home machine, but still persists.
Comment by Martin Ritter (MartinR) - Monday, 16 June 2008, 08:20 GMT
Michael, I too believe that this must be some kind of hardware bug. Or the controller runs at an inappropriate clock rate or something. Maybe it would help if we had the interrupts working. But since the bug also occurs with DMA, it will probably do with interrupts as well.
Would have been interesting, which kind of corruption your file had.

Jonathan, thanks again. Did you retry the database rebuild? And can you somehow confirm that the remaining corruptions are in fact from the 'two bytes inserted' bug? In my tests it either occurred quite frequently, or it was completely gone. It never happened that sporadically to me. I don't know how to check that with Linux though. Hmm, if the file sizes would be different, then it's definitely not this bug. If they're the same, it probably is.
Comment by Linus Nielsen Feltzing (linusnielsen) - Monday, 16 June 2008, 09:14 GMT
Maybe it would be better if we set the CPU core speed to match the USB rate?
Comment by Martin Ritter (MartinR) - Wednesday, 18 June 2008, 15:25 GMT
Linus, this bug has not much to do with USB actually. If that udelay is slightly changed, it occurs with non USB writes as well. The fact that it is triggered by USB is probably just because the timing is a bit different then. The USB buffer resides in uncached memory, so the FIFO filling loop runs a bit slower than usual. But that's just my guess, I may be completely wrong.
Comment by Martin Ritter (MartinR) - Monday, 23 June 2008, 11:33 GMT
Final version of this patch (probably):
As I don't have any corruptions at all anymore, I don't know how to improve this any further.
- #ifdef only for e200 as no other targets have been tested yet.
- Not using sd_poll_status() inside the loop because yielding may disturb the timing.
This will hopefully fix the last remaining file corruptions.
- The new code will be used only in USB mode (usb_exclusive_ata), the current code remains completely unchanged.
This way
- We don't have to care for the buffer alignment as the USB buffer will be aligned always.
- Testing will be much easier because only USB mode has to be tested.
- It will not break anything when not in USB mode.
Comment by Jonathan Backer (jrbil) - Saturday, 28 June 2008, 02:06 GMT
I just did some testing on my e280, and the latest patch looks good. With USB transfer, I sent over 1000 files (5 GB) with no errors. I did a bunch of copying around (w/ rockbox not USB) to exercise the other code (1.4 GB), and found no errors there either. Looks solid to me. I haven't tried it with my micro SD card yet because it is at the office.
Comment by Toni (ahellmann) - Saturday, 28 June 2008, 13:27 GMT
I tried the latest patch here on e250 with external SD-card: No errors!
Comment by Jonathan Backer (jrbil) - Monday, 30 June 2008, 21:24 GMT
I was finally able to try the latest patch with my 2GB microSD card. Good news: the internal flash works flawlessly transferring large amounts of data by USB and directly with Rockbox (no USB) when the SD card is present. Bad news: reading and writing to the SD card causes corruption. Doesn't matter if the transfer is through the USB or Rockbox. To test Rockbox, I copied from known good internal flash to SD and back again and got corruption. Oh, and it's definitely a read _and_ write issue to SD: comparing with the internal flash (copy to SD and then back to internal using Rockbox) showed fewer errors then comparing with the SD (just copy from internal to SD). Moreover, the difference in detected corruption (just one file) disappeared when I tried that particular file again, indicating that it was a one-off read error.
Comment by Jonathan Backer (jrbil) - Tuesday, 01 July 2008, 03:48 GMT
Ok. I remembered from previous corruption issues that it was important to reformat. Once I reformatted the SD card, all of my corruption problems vanished. With the latest patch on my e280 with 2GB microSD card, the high speed USB is flawless. Please put it in SVN. Thanks.
Comment by Frank Gevaerts (fg) - Wednesday, 09 July 2008, 20:10 GMT
USB changes in r17997, while not fixing the issue, at least makes it not be triggered by USB any more. I copied 2GB to my microSD card without errors. Before r17997 I would get errors every few MBs.

Note that if there was any non-USB related corruption, this will still be there

(I didn't apply any of the patches from this task, to maximise my chances of triggering the corruption)
Comment by Jonathan Backer (jrbil) - Thursday, 10 July 2008, 00:25 GMT
I just tried the USB changes committed in r17997. Repeating the tests I used before, I found no corruption on my e280: 6 GB to internal from computer, 1.7 GB to SD from computer, 1.7 GB from SD to internal.

Seems to work pretty well. The write speeds seems slower but the read speeds are comparable. On write, I'm getting just over 2 MB/s.

Thanks Frank.
Comment by Frank Gevaerts (fg) - Thursday, 10 July 2008, 08:34 GMT
Lower write speed is to be expected : the usb storage driver now is careful not to overlap usb and sd transfers. The read path is unchanged
Comment by Martin Ritter (MartinR) - Friday, 11 July 2008, 06:55 GMT
Frank, unfortunately your change in r17997 did not fix this bug for me (e260 + 4GB Sandisk SDHC, plain SVN, High Speed enabled). I only copied 3 files (15MB) to the SD card and each of them were corrupted. Transfers to the internal flash seem to work, but they also worked before.

Jonathan, I wonder why our test results are so different. Just to make sure, did you remove my patch before testing r17997?
Comment by Jonathan Backer (jrbil) - Friday, 11 July 2008, 16:17 GMT
Yup. I did a clean build from SVN. However, my card is just a 2GB SD (not SDHC). Based on its size, I'm guessing that Frank's card is also a regular SD card. I don't know how big a difference that makes, but it's worth looking into.
Comment by Frank Gevaerts (fg) - Friday, 11 July 2008, 16:50 GMT
Can both of you (MartinR, jrbil) check if the corruption _only_ occurs while using usb, i.e. copy some large files within rockbox (from/to sd, sd-to-sd, internal-to-internal...) ?
Comment by Jonathan Backer (jrbil) - Friday, 11 July 2008, 17:01 GMT
Sure. Which revision should I use? Is anything prior to r17997 OK?
Comment by Frank Gevaerts (fg) - Friday, 11 July 2008, 17:14 GMT
r17997 only changed usb behaviour, so that doesn't really matter when testing non-usb behaviour.
Comment by Jonathan Backer (jrbil) - Friday, 11 July 2008, 20:31 GMT
By copying large files within Rockbox, I tested SD->SD (1 GB), SD->internal (2 GB), internal->SD (2 GB), and internal->internal (2 GB) and found no errors.
Comment by Martin Ritter (MartinR) - Monday, 14 July 2008, 08:24 GMT
All internal write operations seem to work properly for me, as long as that udelay in the SD driver is left untouched. That's why I decided to activate my patch only in USB mode.
I just double checked with current SVN and a fresh formatted SD card, but the corruption is still there. Then I did a comparison without SERIALIZE_WRITES and found no significant difference. Within a 10MB test file I had the 2 bytes inserted up to 10 times with SERIALIZE_WRITES and up to 8 times without (5 tries on each).
Comment by Jonathan Backer (jrbil) - Sunday, 03 August 2008, 03:18 GMT
For what it's worth, I got a new micro SD card today. It's a class 6 SDHC. With SERIALIZE_WRITES it works flawlessly.
Comment by Frank Gevaerts (fg) - Sunday, 17 August 2008, 20:56 GMT
Can someone who still has the problem with usb try copying some files within rockbox while the cpu is boosted (this can be done from the System/Debug menu), to see if that makes a difference?
Comment by Martin Ritter (MartinR) - Tuesday, 19 August 2008, 07:22 GMT
Hi Frank, I did the following tests with plain SVN r18310 (just high speed USB enabled):
- copied a directory of 415 MB (95 files) within rockbox from internal flash to sd in 2:24 min
- bitwise compared all files via USB - no error
- deleted the directory on sd
- raised the boost counter in the Debug/CPU frequency screen
- copied same files again in 2:21 min
- bitwise compared all files via USB - no error again
- copied a 12 MB file from PC to sd via USB and got 2 bytes inserted at 15 random positions
Since the measured times don't differ much, I believe the CPU is boosted during a file copy as well. So that's probably not the reason.
Another difference between internal copying and USB transfers might be the buffer size. I'll do some tests on it ASAP.
Comment by Martin Ritter (MartinR) - Tuesday, 19 August 2008, 12:39 GMT
The buffer size doesn't seem to matter. But if I put the buffer used for copying in uncached memory, like the USB buffer is, I get corruption also when copying. (clipboard_pastefile in onplay.c)
So that's probably the significant difference which makes the bug appear with USB only. But what to do about it? Buffer in IRAM?
Oops, did I say that? ;-)
Comment by Frank Gevaerts (fg) - Tuesday, 19 August 2008, 13:27 GMT
If the issue is only caching, just using the cacheable address in ata_write_sectors() should do the trick (with probably a cache flush before). I'll prepare a test patch later today
Comment by Frank Gevaerts (fg) - Tuesday, 19 August 2008, 17:01 GMT
Could you try the attached patch? It uses a cached address for the actual write.
Comment by Martin Ritter (MartinR) - Thursday, 21 August 2008, 09:59 GMT
Test results with cached.diff using 3.8 GB of data:

SERIALIZE_WRITES enabled
SD card: 2.5 MB/s, no corruption
internal flash: 2.1 MB/s, no corruption

SERIALIZE_WRITES disabled
SD card: 2.8 MB/s, no corruption
internal flash: 2.3 MB/s, no corruption

So I think, SERIALIZE_WRITES can be reverted. If the existing code in ata_write_sectors() is considered as proven to work, this patch should be safe. For gaining more speed (up to 3.6 MB/s on SD), my above patch might be considered. But that would surely need some public testing phase, as it introduces new time critical code. Basically, both patches do the same thing: speeding up the FIFO feeding by either using cache or ASM. I wonder what if both patches would be combined...
Comment by Jonathan Backer (jrbil) - Thursday, 21 August 2008, 19:13 GMT
Test results with latest SVN (cached.diff committed) copying 2.5 GB of data:

SERIALIZE_WRITES disabled
SD card: 1.76 MB/s, 5/506 files corrupted

SERIALIZE_WRITES enabled
SD card: 1.57 MB/s, 17/506 files corrupted

What a weird bug...
Comment by Frank Gevaerts (fg) - Thursday, 21 August 2008, 21:06 GMT
Could you run that a few more times, to check if the difference in number of corrupted files is really there or is just an accident?

Did you also check the filesystem for errors between tests, or did you reformat? I'd like to rule out pre-existing errors as a cause for the higher number of errors in the SERIALIZE_WRITES case.

Just to be sure : is this the same sd card that you used previously when you found no errors? Also, how big was your test then?
Comment by Steve (TheBashar) - Thursday, 21 August 2008, 21:26 GMT
I have a c200 and a 4GB micro SDHC card. I'm running on Vista. If someone could post a set of steps to follow, I'd be happy to provide additional testing with the latest build. Specifically I'm in the dark on how one goes about verifying the file integrity / finding the corruptions.
Comment by Jonathan Backer (jrbil) - Thursday, 21 August 2008, 21:59 GMT
I reformatted the card between tests. I will run a few more times to make sure that the increase in errors is not just an accident. I was dumbfounded, so I double checked by reverting the cache patch and defining SERIALIZE_WRITES. Indeed there were no errors.
Comment by Martin Ritter (MartinR) - Friday, 22 August 2008, 07:50 GMT
Jonathan, I'm wondering why your write rates are quite lower than mine. Can you please measure the speed with the setup that is working for you? Maybe that will give a hint.
Steve, any tool for testing USB memory sticks will do. I often used this one: ftp://ftp.heise.de/pub/ct/ctsi/h2testw_1.4.zip
Start testing 10 MB, if that works 100 MB, then the whole card.
Comment by Martin Ritter (MartinR) - Friday, 22 August 2008, 08:15 GMT
Jonathan, still wondering what the difference between our players might be. Maybe that's silly, but please do the following: Go to System/Debug/CPU frequency. Turn the wheel left one click. Frequency should show 80000000. Go back one level, then to View HW info. What's the Est. clock shown there? Mine is 81804.
Comment by J. Bilbrey (JB) - Friday, 22 August 2008, 11:39 GMT
Just out of curiosity, for those still getting SD corruption, how are you formatting your cards? I have seen elsewhere that using an OS to format SD cards can give them inappropriate cluster sizes, resulting in poor speed, and data corruption on some players.

Panasonic has a utility for properly formatting SD cards here: http://panasonic.jp/support/global/cs/sd/download/sd_formatter.html See if this makes any difference.
Comment by Frank Gevaerts (fg) - Friday, 22 August 2008, 13:17 GMT
At least in my tests, formatting has nothing to do with it as I use the raw unformatted partition
Comment by Tomasz Moń (desowin) - Saturday, 23 August 2008, 00:32 GMT
I'm experiencing unusual problem (Sansa e280)

when I have my usb printer connected then there's corruption when copying files to internal flash, but if I unplug the printer all data is ok.
(I've done 4 tests using 5.3gb of data - two with printer connected (different files were corrupted), two without (all data matched))
Comment by Jonathan Backer (jrbil) - Saturday, 23 August 2008, 03:33 GMT
Martin: I checked my CPU frequency and it reads 81803.

I used the program that Martin recommended for testing USB sticks and I get far fewer errors. Using SERIALIZE_WRITES and the caching patch, I twice filled my 4GB SDHC with no detected error. Once it detected about 800K of sectors that were wrong.

Before, I tested by copying my music library over to the card, and then comparing each file bit-by-bit. This is fairly easy under the linux command prompt. I just repeated this test for SERIALIZE_WRITES with caching just to be sure. 10 of 509 music files were corrupted. No such corruption occurs with just SERIALIZE_WRITES.

I have no idea why many files would cause a problem, but just a few wouldn't.

Regarding speed: my write rates are much lower than Martin's. This seems to be independent of whether I access the SD or built-in flash. It is much slower than the OF.
Comment by Jonathan Backer (jrbil) - Saturday, 23 August 2008, 05:09 GMT
Write speeds:

OF - 4.81 MB/s
SERIALIZE_WRITES - 1.52 MB/s
SERIALIZE_WRITES w / caching - 1.67 MB/s
caching w/o SERIALIZE_WRITES - 1.90 MB/s
Comment by Steve (TheBashar) - Sunday, 24 August 2008, 07:31 GMT
Just another data point. I formatted my 4GB micro with SD_Formater and then stuck it in my c250. I connected it to my vista laptop and used the h2testw utility to test all 3854MB. My c250 was running r18338 except I turned on USE_ROCKBOX_USB. I had no errors at all. My write speed was oddly low though at only 757KB/s. But slow is better than corrupted.
Comment by Martin Ritter (MartinR) - Monday, 25 August 2008, 08:01 GMT
Thanks for all the tests, guys. I more and more come to the conclusion, that we are chasing more than only one bug here.
Yesterday I had a corrupted file where a range of sectors hasn't been written at all. It contained the contents of the file which was stored previously at that position. The screenshot desowin did at http://www.rockbox.org/irc/log-20080822#22:25:13 looks pretty much the same. I wasn't able to reproduce this yet. It seems to happen extremely rarely, at least with my hardware.
Jonathan, do you have the ability to prove what kind of corruption you're still experiencing?
Steve, for high speed USB you have to define USE_HIGH_SPEED as well.
Comment by Frank Gevaerts (fg) - Monday, 25 August 2008, 12:06 GMT
This could just be the "usual" corruption, but in the FAT structure instead of in the file
Comment by Martin Ritter (MartinR) - Monday, 25 August 2008, 14:55 GMT
Possibly. But chkdsk should easily detect this as a FAT mismatch. So if we have wrong sectors in a file and chkdsk doesn't complain, we probably have another bug. Unfortunately, I didn't run chkdsk then.
Comment by Tomasz Moń (desowin) - Monday, 25 August 2008, 15:05 GMT
fsck didn't complain for me.

I wanted to chkdsk it, but when I connected sansa to my laptop, it just froze whole Vista. (it unfreezed after I unplugged sansa)
Comment by Frank Gevaerts (fg) - Sunday, 31 August 2008, 15:18 GMT
I reverted r18327 for now, as it made things break hard on ipod (and it didn't solve the sansa issue completely anyway)
Comment by Jonathan Backer (jrbil) - Friday, 05 September 2008, 03:41 GMT
I just returned from holidays and did some more testing. The corruption that I'm witnessing only occurs when copying from my computer to the SD card (not from the computer to internal memory). When it happens the corruption is contiguous and spans about 6k (exact numbers are 6066, 6046, and 6026 bytes). This was the latest SVN with the cache patch applied.
Comment by Martin Ritter (MartinR) - Friday, 05 September 2008, 07:45 GMT
Jonathan, thanks again. Actually, the number of corrupted (because shifted) bytes should be a multiple of 32, as the 'two bytes inserted' bug always occurs at a 32 byte boundary.
But anyway, as we see, it's hardly possible to find a workaround that will work for everyone. So we really have to find the underlying problem, instead of trying to cure the symptoms. Maybe I already found out something, but yet it raises another problem. Investigating further...
Comment by Linus Nielsen Feltzing (linusnielsen) - Friday, 05 September 2008, 09:18 GMT
32 bytes sounds very much like a cache line size...
Comment by Martin Ritter (MartinR) - Friday, 05 September 2008, 10:36 GMT
It's also the size of the controller's FIFO (actually 16 words). What probably happens is, that after all 16 words are sent to the card, the first word is sent again. So it's probably a ring buffer which does one step too much sometimes.
Comment by Jonathan Backer (jrbil) - Friday, 05 September 2008, 14:50 GMT
I'm pretty sure about the size of the corruption (I did a "cmp -l file1 file2 | wc -l"), but I'm not sure that it is contiguous (I checked that by glancing over the the output by hand; if it helps, I can investigate that further). It did occur away from the file boundaries (neither at the beginning nor end). As a side note, fsck and chkdsk found no errors.
Comment by Michael Sevakis (MikeS) - Friday, 05 September 2008, 15:12 GMT
Could the card possibly continue clocking-out data on its own from words remaining in the FIFO? Is it simply a FIFO underrun condition?
Comment by Michael Sevakis (MikeS) - Friday, 05 September 2008, 16:57 GMT
I might also suggest (as I've done several times) to compare this SDHC controller to that of the one for the imx31. The similarity seems worthy of note and could perhaps be a good guide.
Comment by Martin Ritter (MartinR) - Monday, 08 September 2008, 08:04 GMT
This really might be some FIFO underrun condition. But actually, the controller should stop the card clock automatically in that case. There's nothing we can do about it at software side, besides correct initialization.

I already compared this controller to various SoC. The closest match is probably the PXA27x series plus a bit of PXA255 (search Google for PXA27x developer's manual). My patch at http://www.rockbox.org/tracker/task/8663#comment23382 was about renaming the register and bit definitions accordingly to this manual. It's out of sync now, but I'd resync it, if it would have a chance for committing. Actually, there is too much work in it to be dumped silently.
Comment by Toni (ahellmann) - Sunday, 25 January 2009, 11:24 GMT
This patch includes a modified sd_write_sectors() routine and related test functionality, which -I hope- avoids the 'two byte inserted' bug from  FS#8663 . Because of missing specifications the solution provided here is all 'try and error' and needs a lot of testing before committing. I misused the debug menu "CPU frequency" for test purpose.

Warning: Because the write test procedure uses raw sector access, some/many files will get destroyed, when running the "CPU frequency" debug menu!!!

If you still want to help, then please test your internal/external cards on e200/c200 in the following way:
- run debug menu "CPU frequency" with this patch (leave the test with 'context menu', long press of DOWN button)
- change number of write sectors
- change delay (0 ... 200 loops => 0...10usec @80MHz for interrupt simulation)
- change int/ext card
- change 30/80MHz
Always watch the number of bad sector writes in the display. Of course there should be no bad sector writes at all. Report your results back.

Additional infos:
I have experimented with the 'two bytes inserted' bug on the e200 and I come to the same conclusion as MartinRitter: There seems to be a hw related bug in the sd fifo handling. I used MartinRitter's copy-optimized patch with a modified ordering of FIFO_EMPTY poll status (to avoid the lockup problem) and replaced the udelay() with a more precise small asm wait loop, which needs 4 cpu cycles. By increasing the loop counter I got following results:

CPU @ 80MHz:
loop counter < 12: bug appears frequently
loop counter 12 ... 86: bug disappears
loop counter 86 ... 106: bug reappears
loop counter > 106: bug disappears

CPU @3 0MHz:
loop counter < 10: bug appears frequently
loop counter > 10: bug disappears

dram<->iram: has no effect.
int.card<->ext.card one external card showed the same bug, another did not show the bug
30MHz<->80MHz has effect. See above.

In a second approach, see this patch, I modified the UNKNOWN register in SD_WRITE_MULTIPLE_BLOCK call. The udelay() was no more necessary with the slow copy routine. I didn't get the bug at 30MHz/80MHz, dram/iram, int.card/ext.card and all number of sectors tested. The slow copy routine copies 1Byte/8cycles, which makes 10MB/s @80MHz.
Comment by Toni (ahellmann) - Sunday, 25 January 2009, 12:06 GMT
Correction: The slow copy routine copies 2Byte/8cycles, which makes 20MB/s @80MHz. So the speed limiting factor is the SD programming not the transfer itself I think.
Comment by Frank Gevaerts (fg) - Sunday, 25 January 2009, 22:42 GMT
I've tested this patch with USE_ROCKBOX_USB on my c200 and e200. Playing (briefly) with the "CPU frequency" settings and copying several gigabytes of data to internal storage and two different microSD cards (one microSD, one microSDHC) showed no errors at all.
Comment by Frank Gevaerts (fg) - Sunday, 25 January 2009, 23:03 GMT
I'm a bit confused here. You mention modifying the UNKNOWN register, but I don't see that in the patch?
Comment by Thomas Martitz (kugel.) - Monday, 26 January 2009, 00:51 GMT
Ok, results of my test:

I coped the windows 7 beta image from my ntfs partition to the sansa in linux.

$ md5sum /media/XP/Windows7/7000.0.081212-1400_client_de-de_Ultimate-GB1CULXFRE_DE_DVD.iso
52173cce05f14f14f6516ca4b5fe3482 /media/XP/Windows7/7000.0.081212-1400_client_de-de_Ultimate-GB1CULXFRE_DE_DVD.iso
Filesize: 3,357,288,448

$ md5sum /media/SANSA\ E200/7000.0.081212-1400_client_de-de_Ultimate-GB1CULXFRE_DE_DVD.iso
52173cce05f14f14f6516ca4b5fe3482 /media/SANSA E200/7000.0.081212-1400_client_de-de_Ultimate-GB1CULXFRE_DE_DVD.iso
Filesize: 3,357,288,448

Rockbox md5sum plugin:
D41D8CD98F00B204E9800998ECF8427E /7000.0.081212-1400_client_de-de_Ultimate-GB1CULXFRE_DE_DVD.iso
Filesize: -937,678,848 (Rockbox file properties plugin)

Several resets occured:
[37955.448216] sd 53:0:0:1: Attached scsi generic sg8 type 0
[38347.840024] usb 3-1: reset high speed USB device using ehci_hcd and address 51
[39124.828069] usb 3-1: reset high speed USB device using ehci_hcd and address 51
[39680.436020] usb 3-1: reset high speed USB device using ehci_hcd and address 51
[39716.480021] usb 3-1: reset high speed USB device using ehci_hcd and address 51
[40524.172035] usb 3-1: reset high speed USB device using ehci_hcd and address 51
[41218.636014] usb 3-1: reset high speed USB device using ehci_hcd and address 51
[41238.420016] usb 3-1: reset high speed USB device using ehci_hcd and address 51
[41240.148018] usb 3-1: reset high speed USB device using ehci_hcd and address 51
[41240.416021] usb 3-1: reset high speed USB device using ehci_hcd and address 51

Another note: The first 1,7GB transfered very fine (no resets), but then it got utterly slow, with many pauses and the resets occuring.
Comment by Martin Ritter (MartinR) - Monday, 26 January 2009, 10:35 GMT
Toni, I had no time to run your tests yet, but want to post the findings I did during the last months.

1. The critical time span seems to be from signaling FIFO_EMPTY until the FIFO is filled completely. It is possible to start filling the FIFO immediately after detecting FIFO_EMPTY without any delay at the maximum possible speed. But the last word must not be written to DATA_REG before at least 3 usec have elapsed since FIFO_EMPTY.

2. Setting SD_STATE_REG to SD_RCV right before the 'for' loop in sd_write_sectors seems to prevent the bug in case of higher delays. It somehow changes the behavior of the controller. As a side effect, we additionally need to wait for FIFO_EMPTY right before waiting for DATA_DONE at the end of the loop. Without this, the last sector was not written sometimes.
Comment by Toni (ahellmann) - Monday, 26 January 2009, 17:42 GMT
@fg: Pls have a look at the last parameter in the SD_WRITE_MULTIPLE_BLOCK call
@kugel: Did you use the standard ata_write_sectors() or ata_write_sectors()_mod for the copy action
@MartinR: Great findings. Can you post a patch here?
Comment by Thomas Martitz (kugel.) - Monday, 26 January 2009, 17:43 GMT
I don't know. I just applied patch and enabled USB, I didn't change any other thing. So, whatever the default is, I guess.
Comment by Jonathan Backer (jrbil) - Monday, 26 January 2009, 22:56 GMT
I did a bit of testing. I began by testing the first approach. I did this by I switching the last parameter of SD_WRITE_MULTIPLE_BLOCK in sd_write_sectors_mod back to 0x1c2d (the value in SVN). This caused sensitivity to the delay parameter when testing through the debug menu. At 30 MHz, a value greater than 7 was safe. At 80MHz values in the range of 5-106 were safe.

Applying the patch as is (with the last parameter of SD_WRITE_MULTIPLE_BLOCK set to 0xd), I did not notice any sensitivity to the delay parameter. Encouraged by this, I replaced sd_write_sectors in ata-sd-pp.c with the sd_write_sectors_mod included in the patch. After extensive testing, I have not found any corruption copying files over USB to and from internal storage and a 2GB microSD card (copying GB of data and 100s of files).

What's the difference in the new magic number (0xd) used in the call to SD_WRITE_MULTIPLE_BLOCK? Taking the SVN code and just changing 0x1c2d to 0xd caused corruption. So it's not just the magic number that makes the new code work.
Comment by Martin Ritter (MartinR) - Tuesday, 27 January 2009, 08:26 GMT
Toni, I ran your test with all possible settings and the NumBads counter remained 0. No time for an USB test yet, sorry.
Attached is the patch I'm using for weeks now. Comments inside. Your modified CMDAT flags seem to have a similar result as my SD_RCV approach, but without the side effect of the missing last sector. So I would prefer your approach here.
Comment by Toni (ahellmann) - Tuesday, 27 January 2009, 21:43 GMT
MartinR, I got no corruptions here with your patch even after inserting wait cycles at several places. So both yours and mine patch seems to work fine. Your patch has the advantage of higher transfer speed with 4byte aligned buffers especially at 30MHz.
Because there seems to be no proper bugfix in the near future I am all for including yours or mine (or a combination of both) workarounds in SVN, if there is no negative feedback here.
Comment by Martin Ritter (MartinR) - Wednesday, 28 January 2009, 09:51 GMT
Toni, I'd vote for your sd_write_sectors routine as it doesn't add code, plus my copy_write_sectors as it doesn't depend on the CPU clock, plus disable SERIALIZE_WRITES.
What do you think about my register renaming patch here: http://www.rockbox.org/tracker/task/8663#comment23382
It got no attention so far. Do you want me to resync it?
Comment by Toni (ahellmann) - Wednesday, 28 January 2009, 21:51 GMT
MartinR, it indeed sounds like a good idea, to combine the two patches in your proposed way. I personally like the register renaming too. But the actual implemented hardware clearly differs in some items, so please resync the renaming patch and add a comment, saying, that all the namings have been retrieved from a close match hardware specification. This is my opinion.
Comment by Martin Ritter (MartinR) - Thursday, 29 January 2009, 14:23 GMT
Toni, attached is a rework of my renaming patch as proposed. Additionally, I replaced the numerical cmdat parameters of the sd_command calls by their bit namings, for the bits which appear clear to me. After applying the patch, the same .o file is generated as before. So, I'm pretty sure that it will not harm anything.
Comment by MichaelGiacomelli (saratoga) - Friday, 30 January 2009, 17:29 GMT
Since this bug has occasionally managed to "brick" players (or rather force the use of recovery mode to fix the file system), I would prefer that a working fix go in as soon as possible, even if it is imperfect and needs to be updated later.
Comment by Thomas Martitz (kugel.) - Saturday, 31 January 2009, 14:28 GMT
I agree. Anything that makes it not cause filesystem corruption is better than what's in SVN, regardless of speed or whatever. I'd also have the fix in SVN, and do further work (whatever that needs to be done) in svn.

I bet I'm not the only one who's keen on getting rid of the OF entirely.

Also, re the #defines patch. I like to have that in too. Hopefully it'll make the pp sd driver more understandable and more adoptable (with an eye on the yet-missing bank switching mechanism in the as3525 sd driver).
Comment by Thomas Martitz (kugel.) - Sunday, 01 February 2009, 01:19 GMT
So, this patch what Martin proposed, i.e. using his copy_write_sectors and Toni's sd_write_sectors.

I've did some tests:

copy 1.4GB music files from PC to Sansa: MD5 all fine.
move 1.3GB music files from MicroSD to internal storage within Rockbox: All md5s are fine.
move 1.3GB back from internal storage to MicroSD *while music is playing* to get alternating CPU clocks: all md5s are fine.
edit: 2 more tests:
move the same 1.3GB back to internal again with music playing: all md5s are fine
copy the same 1.3GB to Microsd with music playing (music from the folder to be copied, I'm fairly sure I began copying before buffering ended): all md5s are fine
I made the SERIALIZE_WRITES dependant on CPU_PP for now. If it's the safer transfer method, it should probably be used by work-in-progress targets.

I'm all for committing!
Comment by Toni (ahellmann) - Sunday, 01 February 2009, 10:40 GMT
Same as above, but resynched and with test functionality included.

!!!!Warning!!!!:
Because the write test procedure uses raw sector access, the file system may get corrupted when running the "CPU frequency" debug menu!

Sansa e200, c200 and GoGear testers are needed to test this patch, because the timings may differ between these players. To do the test:
- run debug menu "CPU frequency" with this patch
- change number of write sectors (1-128)
- change aligned/misaligned data access
- change int/ext card
- change 30/80MHz
and watch the number of bad sector writes on the display. Preferably the test should run with music playing. Report your results back.
Comment by Thomas Martitz (kugel.) - Sunday, 01 February 2009, 13:50 GMT
Just out of curiosity. What's holding a working patch back? Or do you not believe that it's working? I mean, more work can always be done on code in SVN too.
Comment by Jonathan Backer (jrbil) - Monday, 02 February 2009, 01:52 GMT
I just did some testing of the latest patch and it seems pretty solid. At first I used the debug menu and tried various settings. I ran this for about an hour while music played in the background without any problems. Then I tried copying data over the USB. The transfer hung after about 100 MB transferred. Rebooting rockbox and trying again fixed that. I don't know how the test bench ties in. Could running that first have caused the problem? I just tried reproducing the problem without any luck.

Anyway, after the reboot, I transferred several GB of data without a hitch. I will note that my transfer speed with the patch is 1.1 MB/s, whereas I was getting closer to 2 MB/s before applying the patch, and close to 4 MB/s with the OF.
Comment by Steve (TheBashar) - Monday, 02 February 2009, 19:33 GMT
Tested sd_write_sectors_diff_v2 against svn r19905. Results in short: almost no problems. Ran the "cpu frequency menu" test for a couple minutes against internal storage with no problems. Ran test for about 30 minutes against external storage (1-128 sectors, y&n misaligned, normal boosting & boost locked) with no bad writes reported. Both int and ext tests were done while playing MPC audio from the internal storage.

Used h2testw to test write & verify over USB 17MB on internal storage with no errors. Wrote & verified 7.41GB on external storage with no errors. Copied test files off via USB and verified on hard drive with no errors. Copied files back over USB and verified on player with no errors. Write speed ~3.2 MB/s. Read speed ~6.1 MB/s.

The only odd behavior seen was with moving the h2testw generated files off the ext storage to my pc over USB. A short time after starting the operation I decided to cancel. The move process and explorer became unresponsive while attempting to cancel. Unplugging the USB cable unstuck the processes and after replugging the cable I was able to reinitiate the move with no problems.

+1 to commit!
http://www.codinghorror.com/blog/images/works-on-my-machine-starburst.png
Comment by Toni (ahellmann) - Monday, 02 February 2009, 21:37 GMT
OK, even I would like to see more test results from c200 and GoGear users, I will probably commit tomorrow, if there are no negative reports here and no objections in irc.
Comment by Thomas Martitz (kugel.) - Monday, 02 February 2009, 21:39 GMT
What about SERIALIZE_WRITES? You last patch doesn't deactivate it.
Comment by Steve (TheBashar) - Tuesday, 03 February 2009, 01:03 GMT
Reran my h2testw run against the external storage after commenting out SERIALIZE_WRITES. Again, no errors, but also no perceivable increase in write speed.
Comment by Jonathan Backer (jrbil) - Tuesday, 03 February 2009, 01:49 GMT
I've been copying music files back and forth on my e200 after disabling SERIALIZE_WRITES. I've noticed a substantial increase in write speed. I'm getting about 2.1 MB/s now, whereas I was getting about 1.1 MB/s before. About 7 GB transferred without error.
Comment by Aitor (Hito) - Thursday, 05 February 2009, 16:35 GMT
Hi, I'm a c200 user and tested the latest patch, also after disabling SERIALIZE_WRITES. I've registered to post my results.

I copied 7GB of data to external storage (SDHC) and found no errors at all (I tested all md5 sums). However, the transfer hanged the first time after sending 20MB and I had to unplug the usb cable to recover. This didn't happen again, and apparently caused no error on filesystem. Transfer speeds were about 3MB/s.
Comment by William (lee321987) - Monday, 09 February 2009, 03:40 GMT
I guess this is where to report problems with r19911.
Before I do will someone please clarify for me -- As of r19911, all I have to do is enable USB transfer, through my make file? Or is there also a patch on this page I need?
That is all I have done (change make file), and I'm getting bugs (not file/file system corruption though). Should I post them here?
Comment by Jonathan Backer (jrbil) - Monday, 09 February 2009, 15:08 GMT
If you make the changes to the Makefile as described in http://www.rockbox.org/twiki/bin/view/Main/PortalPlayerUsb, your player should connect and transfer data over USB. The patches here are intended for corruption problems. I've seen some work on USB detection problems in the past few weeks (plug in the DAP and it doesn't connect). That work is under Filespray #9831.
Comment by Toni (ahellmann) - Monday, 09 February 2009, 21:34 GMT
lee321987, I have some questions on the bugs: What kind of bugs did you get? Can you reproduce the bugs, even after reformatting (with original firmware)? Are the bugs on both internal/external drives? The bug described here, can cause file and/or file system corruptions. A binary compare between original and copied files will tell.
Comment by William (lee321987) - Monday, 09 February 2009, 22:51 GMT
Jonathan Backer:

I understand what the patches here, are for. The only change I have made before compiling is to enable USB transfer through the make file, but since r19911 is coming from work on this page, and this page is referenced in r19911, I figured this is where to report bugs on it. (I figured a normal bug report was wrong since I had to change the make file in order to encounter my bugs)

-------------------------------------------------------------------------------------------------------------------------------------------------

Toni:

Before I tell you the two bugs, I want you to know that I have not used any patches from this page! The only change I made was to enable USB transfer through my make file. Do I also need a patch from this page?

Two bugs, but neither is file or file system corruption.
They both occur after disconnect (when the connection was through RBs' firmware).
One is, right after disconnect, I get a screen that I cannot exit (requiring a hard reset). The message on that screen is:
========================
data abort
at 00009240 (0)
========================

The other is that when I click "Resume Playback" OR open an audio file through the "Files" menu, I get no playback (the time position stays at "00:00"). The only way to get playback working again, is to open a bookmark -- after that I can open files through the Files menu again.
Comment by Jonathan Backer (jrbil) - Monday, 09 February 2009, 22:59 GMT
lee321987: I reviewed the patches from this page. It is indeed the revisions that occurred in r19911. It slipped my attention. Sorry.
Comment by John Gu (einmus) - Tuesday, 10 February 2009, 14:21 GMT
I've experienced the same "data abort" problem as lee321987. And it is always exact at 00009240 (0).
Comment by John Gu (einmus) - Tuesday, 10 February 2009, 14:28 GMT
Sorry, tested again, data abort not exact at the same location (random).
Comment by Thomas Martitz (kugel.) - Tuesday, 10 February 2009, 16:23 GMT
The data aborts sound more like http://www.rockbox.org/tracker/task/9827 to me.

Loading...