Rockbox

  • Status Closed
  • Percent Complete
    100%
  • Task Type Bugs
  • Category Operating System/Drivers
  • Assigned To No-one
  • Operating System Sansa e200
  • Severity Low
  • Priority Very Low
  • Reported Version Daily build (which?)
  • Due in Version Undecided
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Rockbox
Opened by Martin Ritter - 2009-06-12
Last edited by Frank Gevaerts - 2010-06-14

FS#10319 - File corruption on USB writes when using the HID driver at the same time

Observed with Windows XP and e200 internal storage. Press buttons or turn the wheel while copying a big (~20MB)
file will most likely be corrupted in the way that some data is replaced by a short block starting with ‘USBS’. That would probably mean that the receive buffer was overwritten by a CSW before storage_write_sectors was called. The next block after this corruption (0×4000 later) looks like it was not written at all. The rest of the file is intact. Reading is not affected.

Closed by  Frank Gevaerts
2010-06-14 18:49
Reason for closing:  Fixed
Admin
Frank Gevaerts commented on 2009-06-12 12:36

Could you try going back to usb_storage r20962 to see if you can reproduce this there? You’ll need to remove the #ifdef USB_STORAGE near the top, and the corresponding #endif at the end.

Martin Ritter commented on 2009-06-12 14:28

I already tried to get r20964 running, but it locked up after inserting the USB plug. In usb_core.h I swapped #define USB_CHARGING_ONLY with #define USB_HID. In which file would I need to remove the #ifdef USB_STORAGE? Unfortunately, there will probably be no spare time over the weekend.

Admin
Frank Gevaerts commented on 2009-06-12 14:54

I meant just roll back usb_storage.c, nothing else.

Admin
Frank Gevaerts commented on 2009-06-12 16:04

I don’t seem to get this on my e200 with linux. I’ve now copied several hundreds of megabytes to my e200 while pressing buttons and turning the wheel.

I do get some bus resets though, so there’s definitely something that’s not entirely right yet.

Martin Ritter commented on 2009-06-15 09:14

Now I did a current build with usb_storeage.c of r20962 with #ifdef USB_STORAGE
are still corruptions, but now they occur less often. Instead of ‘USBS’ I now see ‘USBC’ in the corrupted file, which would mean that another CBW was received before the data of the previous command have been stored. Content of this CBW is a write command for LUN 0, which is not
are some lags in the transfer, which usually indicate bus resets. My theory is, that these resets are causing the corruptions if they occur in some bad moment. It’s the only way I can think of, that a new command is received before the data of the previous command have been stored. The HID driver is probably only triggering the resets
could also reproduce this with Linux (Open Suse 11.1) with an unmodified current build, but I needed much more tries. With Windows I get it at almost every
disabling the HID driver seems to reliably fix the problem.

Martin Ritter commented on 2009-06-15 13:09

I managed to catch it in a logf dump. This is with an unmodified current build. Reading from bottom to top, things start going wrong at line 286 with an unexpected length in a write command. Thereafter, host and device seem to disagree on the status of the conversation (’OUT received in SENDING_CSW’ and ‘IN received in RECEIVING’). Then after a reset (’set config’) things continue normally. So the resets are probably not causing this, something different is going wrong before. However, by reading the log I don’t seem to understand how a CSW would end up in the file. But it’s there.

   logf.txt (29.1 KiB)
Admin
Frank Gevaerts commented on 2009-06-15 13:16

My guess would be a bug in firmware/target/arm/usb-drv-arc.c. I haven’t properly looked yet though.

Martin Ritter commented on 2009-06-25 07:54

This is really strange. I get the same type of corruptions if the HID driver is disabled, but a HID mouse is connected to the same hub (front panel with built in 4 port
there is a general problem when interrupt transfers are present on the same bus. Would it be possible that the host cancels/stalls bulk transfers in order to maintain the poll interval of the HID? I noticed that the status field of the transfer descriptor(s) is not checked in transfer_completed in usb-drv-arc.c. Don’t know whether this would improve
simply ignoring the ‘unexpected length’ events prevents the corruptions and apparently the resets as well. That’s probably not the right way to do it but it shouldn’t harm either. Patch is
are still resets while reading and turning the wheel though.

Admin
Frank Gevaerts commented on 2009-06-26 15:16

I’ve committed this patch. It won’t hurt, and I do prefer occasional bus resets to occasional bus resets *and* write corruption.

The status field is indeed not checked at the moment. I need to fix that.

Tomer Shalev commented on 2010-01-22 21:36

In usb-drv-arc.c, prime_transfer(), when endpoints completes too fast (due to an interrupt transfer caused by HID, which probably has greater priority), then the transfer halts. This is a bug in the arc driver.

On Ubuntu, while transferring the file and pressing buttons on the DAP (initiating HID transfers), the transfer halts. The file operation progress bar is stuck at the same percentage level. The OS seems to wait for
handles this badly. When the DAP is disconnected after such scenario, Windows Explorer is still erratic, and windows needs reboot.

I’m still investigating...

Admin
Frank Gevaerts commented on 2010-01-26 23:03

From my research, I get the impression that the fast completes are not the cause.

My test setup is (on e200 for
run dd if=/dev/sd<your-sap> of=/dev/null
do lots of HID (I turn the wheel)

If the e200 is freshly booted, MSC stops immediately when turning the wheel. After ROLO, it’s a lot more stable, but eventually it also stops

What I’m seeing is that it stops because REG_ENDPTCOMPLETE has the endpoint-specific bit set prime_transfer(), which should have been cleared by transfer_completed(). First I suspected that usb_hid.c did not always wait for completion of the previous transfer before sending the new one (I think this can actually happen, but it doesn’t seem to be the cause of this issue. See hid.patch for a possible fix, but this should be reviewed more), but this seems to be wrong.

I haven’t managed to get much further yet. See debug.patch for all modifications and extra debug infor currently in my tree. The “REG_ENDPTCOMPLETE shows that the endpoint is still busy!” logf is what I’m concentrating on.

Tomer Shalev commented on 2010-01-27 05:56

I think that USB drivers should be able to send transfers whenever they need to. This should be abstracted by the abstraction level below (the arc driver).

I found that each requested transfer starts at the first corresponding item in td_array, regardless of whether the pipe is currently transferring or not. I think this what caused the corruption. The HID probably got priority over the MS, which cause it not to be completed before the next MS transfer is requested, thus the relevant td_array is overwritten while they are still part of the td’s linked-list pointed to by the pipe’s queue head.

The solution I’m proposing is to concatenate the newly requested transfer to the end of the list pointed to by the pipe’s queue head. We will need to keep pointers to the first and last pointers of the pipe’s td array items (this is explained in the data-sheet, but it is called ‘head’ and ‘tail’). We should just start with the first (head) item, it the pipe is not transferring. NUM_TDS_PER_EP might be needed to be bigger.

I’ve implemented this part yesterday and still debugging. I’ll post my patch (still work-in-progress) soon, when I get back home.

amaury pouly commented on 2010-01-27 19:37

I have tested my e200 under windows in virtualbox (with ubuntu )(yes that’s strange) and HID fails to work. Also only the first driver seems to be reported. But sometimes, it’s simply doesn’t at all
with same e200 under linux, doing sending HID reort during a big BULK transfer breaks the transfer as described.

Also tomers, from what I seen during the development of my simulator usb driver, I think rockbox never send/recv on a endpoint which is already transferring some data. But perhaps I misunderstand what you say (and also I haven’t got all the usb-arc driver details in mind)

Admin
Frank Gevaerts commented on 2010-01-27 20:02

tomers: are things really overwritten in td_array? Every endpoint/direction should have its own range.

I don’t mind multiple requests per endpoint as such, but how are you going to manage them? Just allocate lots, and hope that there are enough, or limit every EP or driver to a number of TDs? The rule of one transfer at a time might be slightly restrictive, but I’m afraid that actually using the alternative from a class driver is not going to be simple

Tomer Shalev commented on 2010-01-27 22:25


Also tomers, from what I seen during the development of my simulator usb driver, I think rockbox never send/recv on a endpoint which is already transferring some
But perhaps I misunderstand what you say (and also I haven’t got all the usb-arc driver details in mind)

I think that our stack does send on an endpoint which is already transferring some data, but it does so badly, possible overwriting the current transfer. That’s just an assumption at the moment, just from reading the
don’t know about the receive part, but now we focus on sending, i.e. copying a big file from the DAP to the host, while sending HID. I didn’t tested if that is also occurs when copying a file to the DAP, but let’s fix one thing first.

I will send my (partial) patch tomorrow. I couldn’t get to that today.


are things really overwritten in td_array? Every endpoint/direction should have its own range.

My solution is exactly the one suggested by the data sheet. Have separate array of TDs for each pipe (endpoint+direction pair). The array is used as a circular array, and it is actually being used to store a linked list of TDs being transferred. The linked-list is composed of adjacent items in the array (it may overlap from last item to first item in the array).

The QH store pointers to both head and tail, which are actually the first and last items of the corresponding array, in order to help in the implementation of the functionality explain
QH points to NULL (-1) if it doesn’t currently transfer anything, or points to the current array item, which is the current TD being transferred, i.e. the first item in the TDs linked list.

Adding transfer to a pipe which is currently transferring extends the linked list by concatenating the TDs of the new transfer at the end of the current linked
test should be made to check if all TDs are in use, which means buffer overflow).

Example:

TDs array size:
TD is marked with
TD item in use (not 0), points to the one to the right. The last TD in the array points to the first. If a TD is not in use (marked by 0 here), it points to NULL (-1).

Initial

Transfer A, takes 3

1 TD was sent when the driver is requested to send transfer B, which takes 4

4 TD were sent when the driver is requested to send transfer C, which takes 7

I hope this clears things up. Please let me know what you think.

Tomer Shalev commented on 2010-01-28 20:38

This patch is an initial work for implementing the suggested algorithm from previous comment. It doesn’t work yet, so please don’t try to run it. It is here just for reference.

Admin
Frank Gevaerts commented on 2010-01-30 21:57

tomers: while I’m definitelly not opposed to reworking the driver that way, all information I’ve seen seems to point out that this particular bug isn’t related to the driver now only handling one transfer per direction per endpoint at a
your rework be in a separate task?

Martin Ritter commented on 2010-03-29 09:08

After testing r25329 I think, this can be closed. At least the subject doesn’t apply any more. There are still stray resets when constantly turning the wheel while reading from DAP. But they’re so rarely that they might have a different cause. Maybe just one reset per 200MB. Writing to the DAP works
also tried reverting r24333. It seemed to have no ill effects and didn’t change the number of resets. So, if this bug was the only reason for r24333, I’d say revert it.

Admin
Frank Gevaerts commented on 2010-03-29 09:28

r24333 is not because of this. It’s because the code before that was incorrect in a way that *probably* doesn’t affect things, but it’s independent of HID

Loading...

Available keyboard shortcuts

Tasklist

Task Details

Task Editing