FS#10319 - File corruption on USB writes when using the HID driver at the same time

Attached to Project: Rockbox
Opened by Martin Ritter (MartinR) - Friday, 12 June 2009, 08:28 GMT
Last edited by Frank Gevaerts (fg) - Monday, 14 June 2010, 18:49 GMT
Task Type Bugs
Category Operating System/Drivers
Status Closed
Assigned To No-one
Operating System Sansa e200
Severity Low
Priority Normal
Reported Version Daily build (which?)
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No


Observed with Windows XP and e200 internal storage. Press buttons or turn the wheel while copying a big (~20MB) file.
The file will most likely be corrupted in the way that some data is replaced by a short block starting with 'USBS'. That would probably mean that the receive buffer was overwritten by a CSW before storage_write_sectors was called. The next block after this corruption (0x4000 later) looks like it was not written at all. The rest of the file is intact. Reading is not affected.
This task depends upon

Closed by  Frank Gevaerts (fg)
Monday, 14 June 2010, 18:49 GMT
Reason for closing:  Fixed
Comment by Frank Gevaerts (fg) - Friday, 12 June 2009, 12:36 GMT
Could you try going back to usb_storage r20962 to see if you can reproduce this there? You'll need to remove the #ifdef USB_STORAGE near the top, and the corresponding #endif at the end.
Comment by Martin Ritter (MartinR) - Friday, 12 June 2009, 14:28 GMT
I already tried to get r20964 running, but it locked up after inserting the USB plug. In usb_core.h I swapped #define USB_CHARGING_ONLY with #define USB_HID. In which file would I need to remove the #ifdef USB_STORAGE? Unfortunately, there will probably be no spare time over the weekend.
Comment by Frank Gevaerts (fg) - Friday, 12 June 2009, 14:54 GMT
I meant just roll back usb_storage.c, nothing else.
Comment by Frank Gevaerts (fg) - Friday, 12 June 2009, 16:04 GMT
I don't seem to get this on my e200 with linux. I've now copied several hundreds of megabytes to my e200 while pressing buttons and turning the wheel.

I do get some bus resets though, so there's definitely something that's not entirely right yet.
Comment by Martin Ritter (MartinR) - Monday, 15 June 2009, 09:14 GMT
Now I did a current build with usb_storeage.c of r20962 with #ifdef USB_STORAGE removed.
There are still corruptions, but now they occur less often. Instead of 'USBS' I now see 'USBC' in the corrupted file, which would mean that another CBW was received before the data of the previous command have been stored. Content of this CBW is a write command for LUN 0, which is not surprising.
There are some lags in the transfer, which usually indicate bus resets. My theory is, that these resets are causing the corruptions if they occur in some bad moment. It's the only way I can think of, that a new command is received before the data of the previous command have been stored. The HID driver is probably only triggering the resets somehow.
I could also reproduce this with Linux (Open Suse 11.1) with an unmodified current build, but I needed much more tries. With Windows I get it at almost every try.
However, disabling the HID driver seems to reliably fix the problem.
Comment by Martin Ritter (MartinR) - Monday, 15 June 2009, 13:09 GMT
I managed to catch it in a logf dump. This is with an unmodified current build. Reading from bottom to top, things start going wrong at line 286 with an unexpected length in a write command. Thereafter, host and device seem to disagree on the status of the conversation ('OUT received in SENDING_CSW' and 'IN received in RECEIVING'). Then after a reset ('set config') things continue normally. So the resets are probably not causing this, something different is going wrong before. However, by reading the log I don't seem to understand how a CSW would end up in the file. But it's there.
   logf.txt (29.1 KiB)
Comment by Frank Gevaerts (fg) - Monday, 15 June 2009, 13:16 GMT
My guess would be a bug in firmware/target/arm/usb-drv-arc.c. I haven't properly looked yet though.
Comment by Martin Ritter (MartinR) - Thursday, 25 June 2009, 07:54 GMT
This is really strange. I get the same type of corruptions if the HID driver is disabled, but a HID mouse is connected to the same hub (front panel with built in 4 port hub).
Maybe there is a general problem when interrupt transfers are present on the same bus. Would it be possible that the host cancels/stalls bulk transfers in order to maintain the poll interval of the HID? I noticed that the status field of the transfer descriptor(s) is not checked in transfer_completed in usb-drv-arc.c. Don't know whether this would improve things.
Anyway, simply ignoring the 'unexpected length' events prevents the corruptions and apparently the resets as well. That's probably not the right way to do it but it shouldn't harm either. Patch is attached.
There are still resets while reading and turning the wheel though.
Comment by Frank Gevaerts (fg) - Friday, 26 June 2009, 15:16 GMT
I've committed this patch. It won't hurt, and I do prefer occasional bus resets to occasional bus resets *and* write corruption.

The status field is indeed not checked at the moment. I need to fix that.
Comment by Tomer Shalev (tomers) - Friday, 22 January 2010, 21:36 GMT
In usb-drv-arc.c, prime_transfer(), when endpoints completes too fast (due to an interrupt transfer caused by HID, which probably has greater priority), then the transfer halts. This is a bug in the arc driver.

On Ubuntu, while transferring the file and pressing buttons on the DAP (initiating HID transfers), the transfer halts. The file operation progress bar is stuck at the same percentage level. The OS seems to wait for it.
Windows handles this badly. When the DAP is disconnected after such scenario, Windows Explorer is still erratic, and windows needs reboot.

I'm still investigating...
Comment by Frank Gevaerts (fg) - Tuesday, 26 January 2010, 23:03 GMT
From my research, I get the impression that the fast completes are not the cause.

My test setup is (on e200 for me)
- run dd if=/dev/sd<your-sap> of=/dev/null bs=64k
- do lots of HID (I turn the wheel)

If the e200 is freshly booted, MSC stops immediately when turning the wheel. After ROLO, it's a lot more stable, but eventually it also stops

What I'm seeing is that it stops because REG_ENDPTCOMPLETE has the endpoint-specific bit set prime_transfer(), which should have been cleared by transfer_completed(). First I suspected that usb_hid.c did not always wait for completion of the previous transfer before sending the new one (I think this can actually happen, but it doesn't seem to be the cause of this issue. See hid.patch for a possible fix, but this should be reviewed more), but this seems to be wrong.

I haven't managed to get much further yet. See debug.patch for all modifications and extra debug infor currently in my tree. The "REG_ENDPTCOMPLETE shows that the endpoint is still busy!" logf is what I'm concentrating on.

Comment by Tomer Shalev (tomers) - Wednesday, 27 January 2010, 05:56 GMT
I think that USB drivers should be able to send transfers whenever they need to. This should be abstracted by the abstraction level below (the arc driver).

I found that each requested transfer starts at the first corresponding item in td_array, regardless of whether the pipe is currently transferring or not. I think this what caused the corruption. The HID probably got priority over the MS, which cause it not to be completed before the next MS transfer is requested, thus the relevant td_array is overwritten while they are still part of the td's linked-list pointed to by the pipe's queue head.

The solution I'm proposing is to concatenate the newly requested transfer to the end of the list pointed to by the pipe's queue head. We will need to keep pointers to the first and last pointers of the pipe's td array items (this is explained in the data-sheet, but it is called 'head' and 'tail'). We should just start with the first (head) item, it the pipe is not transferring. NUM_TDS_PER_EP might be needed to be bigger.

I've implemented this part yesterday and still debugging. I'll post my patch (still work-in-progress) soon, when I get back home.
Comment by amaury pouly (pamaury) - Wednesday, 27 January 2010, 19:37 GMT
I have tested my e200 under windows in virtualbox (with ubuntu )(yes that's strange) and HID fails to work. Also only the first driver seems to be reported. But sometimes, it's simply doesn't at all also.
With with same e200 under linux, doing sending HID reort during a big BULK transfer breaks the transfer as described.

Also tomers, from what I seen during the development of my simulator usb driver, I think rockbox never send/recv on a endpoint which is already transferring some data. But perhaps I misunderstand what you say (and also I haven't got all the usb-arc driver details in mind)
Comment by Frank Gevaerts (fg) - Wednesday, 27 January 2010, 20:02 GMT
tomers: are things really overwritten in td_array? Every endpoint/direction should have its own range.

I don't mind multiple requests per endpoint as such, but how are you going to manage them? Just allocate lots, and hope that there are enough, or limit every EP or driver to a number of TDs? The rule of one transfer at a time might be slightly restrictive, but I'm afraid that actually using the alternative from a class driver is not going to be simple
Comment by Tomer Shalev (tomers) - Wednesday, 27 January 2010, 22:25 GMT
> Also tomers, from what I seen during the development of my simulator usb driver, I think rockbox never send/recv on a endpoint which is already transferring some data.
> But perhaps I misunderstand what you say (and also I haven't got all the usb-arc driver details in mind)

I think that our stack does send on an endpoint which is already transferring some data, but it does so badly, possible overwriting the current transfer. That's just an assumption at the moment, just from reading the code.
I don't know about the receive part, but now we focus on sending, i.e. copying a big file from the DAP to the host, while sending HID. I didn't tested if that is also occurs when copying a file to the DAP, but let's fix one thing first.

I will send my (partial) patch tomorrow. I couldn't get to that today.

>tomers: are things really overwritten in td_array? Every endpoint/direction should have its own range.

My solution is exactly the one suggested by the data sheet. Have separate array of TDs for each pipe (endpoint+direction pair). The array is used as a circular array, and it is actually being used to store a linked list of TDs being transferred. The linked-list is composed of adjacent items in the array (it may overlap from last item to first item in the array).

The QH store pointers to both head and tail, which are actually the first and last items of the corresponding array, in order to help in the implementation of the functionality explain here.
The QH points to NULL (-1) if it doesn't currently transfer anything, or points to the current array item, which is the current TD being transferred, i.e. the first item in the TDs linked list.

Adding transfer to a pipe which is currently transferring extends the linked list by concatenating the TDs of the new transfer at the end of the current linked list.
(a test should be made to check if all TDs are in use, which means buffer overflow).


TDs array size: 10
Current TD is marked with underline.
Each TD item in use (not 0), points to the one to the right. The last TD in the array points to the first. If a TD is not in use (marked by 0 here), it points to NULL (-1).

Initial state:

Transfer A, takes 3 TDs:

1 TD was sent when the driver is requested to send transfer B, which takes 4 TDs:

4 TD were sent when the driver is requested to send transfer C, which takes 7 TDs:

I hope this clears things up. Please let me know what you think.
Comment by Tomer Shalev (tomers) - Thursday, 28 January 2010, 20:38 GMT
This patch is an initial work for implementing the suggested algorithm from previous comment. It doesn't work yet, so please don't try to run it. It is here just for reference.
Comment by Frank Gevaerts (fg) - Saturday, 30 January 2010, 21:57 GMT
tomers: while I'm definitelly not opposed to reworking the driver that way, all information I've seen seems to point out that this particular bug isn't related to the driver now only handling one transfer per direction per endpoint at a time.
Shouldn't your rework be in a separate task?
Comment by Martin Ritter (MartinR) - Monday, 29 March 2010, 09:08 GMT
After testing r25329 I think, this can be closed. At least the subject doesn't apply any more. There are still stray resets when constantly turning the wheel while reading from DAP. But they're so rarely that they might have a different cause. Maybe just one reset per 200MB. Writing to the DAP works perfectly.
I also tried reverting r24333. It seemed to have no ill effects and didn't change the number of resets. So, if this bug was the only reason for r24333, I'd say revert it.
Comment by Frank Gevaerts (fg) - Monday, 29 March 2010, 09:28 GMT
r24333 is not because of this. It's because the code before that was incorrect in a way that *probably* doesn't affect things, but it's independent of HID