This is the bug/patch tracker for Rockbox. Click here for more information.
Quick links: Bugs · Patches · Rockbox frontpage
FS#10319 - File corruption on USB writes when using the HID driver at the same time
Attached to Project:
Rockbox
Opened by Martin Ritter (MartinR) - Friday, 12 June 2009, 10:28 GMT+2
Last edited by Frank Gevaerts (fg) - Monday, 14 June 2010, 20:49 GMT+2
Opened by Martin Ritter (MartinR) - Friday, 12 June 2009, 10:28 GMT+2
Last edited by Frank Gevaerts (fg) - Monday, 14 June 2010, 20:49 GMT+2
|
DetailsObserved with Windows XP and e200 internal storage. Press buttons or turn the wheel while copying a big (~20MB) file.
The file will most likely be corrupted in the way that some data is replaced by a short block starting with 'USBS'. That would probably mean that the receive buffer was overwritten by a CSW before storage_write_sectors was called. The next block after this corruption (0x4000 later) looks like it was not written at all. The rest of the file is intact. Reading is not affected. |
This task depends upon
I do get some bus resets though, so there's definitely something that's not entirely right yet.
There are still corruptions, but now they occur less often. Instead of 'USBS' I now see 'USBC' in the corrupted file, which would mean that another CBW was received before the data of the previous command have been stored. Content of this CBW is a write command for LUN 0, which is not surprising.
There are some lags in the transfer, which usually indicate bus resets. My theory is, that these resets are causing the corruptions if they occur in some bad moment. It's the only way I can think of, that a new command is received before the data of the previous command have been stored. The HID driver is probably only triggering the resets somehow.
I could also reproduce this with Linux (Open Suse 11.1) with an unmodified current build, but I needed much more tries. With Windows I get it at almost every try.
However, disabling the HID driver seems to reliably fix the problem.
Maybe there is a general problem when interrupt transfers are present on the same bus. Would it be possible that the host cancels/stalls bulk transfers in order to maintain the poll interval of the HID? I noticed that the status field of the transfer descriptor(s) is not checked in transfer_completed in usb-drv-arc.c. Don't know whether this would improve things.
Anyway, simply ignoring the 'unexpected length' events prevents the corruptions and apparently the resets as well. That's probably not the right way to do it but it shouldn't harm either. Patch is attached.
There are still resets while reading and turning the wheel though.
The status field is indeed not checked at the moment. I need to fix that.
On Ubuntu, while transferring the file and pressing buttons on the DAP (initiating HID transfers), the transfer halts. The file operation progress bar is stuck at the same percentage level. The OS seems to wait for it.
Windows handles this badly. When the DAP is disconnected after such scenario, Windows Explorer is still erratic, and windows needs reboot.
I'm still investigating...
My test setup is (on e200 for me)
- run dd if=/dev/sd<your-sap> of=/dev/null bs=64k
- do lots of HID (I turn the wheel)
If the e200 is freshly booted, MSC stops immediately when turning the wheel. After ROLO, it's a lot more stable, but eventually it also stops
What I'm seeing is that it stops because REG_ENDPTCOMPLETE has the endpoint-specific bit set prime_transfer(), which should have been cleared by transfer_completed(). First I suspected that usb_hid.c did not always wait for completion of the previous transfer before sending the new one (I think this can actually happen, but it doesn't seem to be the cause of this issue. See hid.patch for a possible fix, but this should be reviewed more), but this seems to be wrong.
I haven't managed to get much further yet. See debug.patch for all modifications and extra debug infor currently in my tree. The "REG_ENDPTCOMPLETE shows that the endpoint is still busy!" logf is what I'm concentrating on.
I found that each requested transfer starts at the first corresponding item in td_array, regardless of whether the pipe is currently transferring or not. I think this what caused the corruption. The HID probably got priority over the MS, which cause it not to be completed before the next MS transfer is requested, thus the relevant td_array is overwritten while they are still part of the td's linked-list pointed to by the pipe's queue head.
The solution I'm proposing is to concatenate the newly requested transfer to the end of the list pointed to by the pipe's queue head. We will need to keep pointers to the first and last pointers of the pipe's td array items (this is explained in the data-sheet, but it is called 'head' and 'tail'). We should just start with the first (head) item, it the pipe is not transferring. NUM_TDS_PER_EP might be needed to be bigger.
I've implemented this part yesterday and still debugging. I'll post my patch (still work-in-progress) soon, when I get back home.
With with same e200 under linux, doing sending HID reort during a big BULK transfer breaks the transfer as described.
Also tomers, from what I seen during the development of my simulator usb driver, I think rockbox never send/recv on a endpoint which is already transferring some data. But perhaps I misunderstand what you say (and also I haven't got all the usb-arc driver details in mind)
I don't mind multiple requests per endpoint as such, but how are you going to manage them? Just allocate lots, and hope that there are enough, or limit every EP or driver to a number of TDs? The rule of one transfer at a time might be slightly restrictive, but I'm afraid that actually using the alternative from a class driver is not going to be simple
> Also tomers, from what I seen during the development of my simulator usb driver, I think rockbox never send/recv on a endpoint which is already transferring some data.
> But perhaps I misunderstand what you say (and also I haven't got all the usb-arc driver details in mind)
I think that our stack does send on an endpoint which is already transferring some data, but it does so badly, possible overwriting the current transfer. That's just an assumption at the moment, just from reading the code.
I don't know about the receive part, but now we focus on sending, i.e. copying a big file from the DAP to the host, while sending HID. I didn't tested if that is also occurs when copying a file to the DAP, but let's fix one thing first.
I will send my (partial) patch tomorrow. I couldn't get to that today.
gevaerts:
>tomers: are things really overwritten in td_array? Every endpoint/direction should have its own range.
My solution is exactly the one suggested by the data sheet. Have separate array of TDs for each pipe (endpoint+direction pair). The array is used as a circular array, and it is actually being used to store a linked list of TDs being transferred. The linked-list is composed of adjacent items in the array (it may overlap from last item to first item in the array).
The QH store pointers to both head and tail, which are actually the first and last items of the corresponding array, in order to help in the implementation of the functionality explain here.
The QH points to NULL (-1) if it doesn't currently transfer anything, or points to the current array item, which is the current TD being transferred, i.e. the first item in the TDs linked list.
Adding transfer to a pipe which is currently transferring extends the linked list by concatenating the TDs of the new transfer at the end of the current linked list.
(a test should be made to check if all TDs are in use, which means buffer overflow).
Example:
TDs array size: 10
Current TD is marked with underline.
Each TD item in use (not 0), points to the one to the right. The last TD in the array points to the first. If a TD is not in use (marked by 0 here), it points to NULL (-1).
Initial state:
_0_,0,0,0,0,0,0,0,0,0,0
Transfer A, takes 3 TDs:
_A_,A,A,0,0,0,0,0,0,0
1 TD was sent when the driver is requested to send transfer B, which takes 4 TDs:
0,_A_,A,B,B,B,B,0,0,0
4 TD were sent when the driver is requested to send transfer C, which takes 7 TDs:
C,C,C,C,0,_B_,B,C,C,C
I hope this clears things up. Please let me know what you think.
Shouldn't your rework be in a separate task?
I also tried reverting r24333. It seemed to have no ill effects and didn't change the number of resets. So, if this bug was the only reason for r24333, I'd say revert it.