Rockbox

Tasklist

FS#9708 - PP5020 ATA (IDE) DMA

Attached to Project: Rockbox
Opened by Boris Gjenero (dreamlayers) - Thursday, 25 December 2008, 01:09 GMT
Last edited by Torne Wuff (torne) - Tuesday, 23 March 2010, 20:50 GMT
Task Type Patches
Category Drivers
Status Closed
Assigned To No-one
Operating System iPod 5G
Severity Low
Priority Normal
Reported Version Daily build (which?)
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

I got Ultra DMA working on my 5th generation 30 gig iPod. Mode 2 (33.3 MB/s) seems stable, and it doesn't need CPU frequency boost. Mode 4 (66.7 MB/s) is the highest supported, but modes above 2 require CPU boosting.

The main benefit I can see from DMA is somewhat faster buffer filling. For example, with one large unfragmented FLAC file and the FS#9621 FAT read-ahead patch, the initial load is about twice as fast with DMA. Test_disk says 1 meg reads are 14102 KB/s and writes aren't sped up. I suspect the reported read speed is from the drive's cache, not the media. (BTW. Why are PIO reads slower than PIO writes in test_disk?)

The main cost of DMA is having to flush and invalidate the cache. I note that pp5020.h defines a CACHE_FLUSH_MASK, which makes me think it may be possible to only affect part of the cache. However, I don't know how to use that yet.

While implementing this I significantly altered firmware/drivers/ata.c. I merged the read and write functions into a transfer function which can do both reads and writes with both PIO and DMA. This removes some code duplication and IMHO enhances writes by using "write multiple" and doing retries if needed. I plan to make other changes to ata.c and add interrupt support. However, I feel the right thing to do is to redo the patch with minimum modifications to ata.c. The ata.c changes can become another, separate patch.

Due to interest shown on the developer mailing list, I am submitting this preliminary patch so I can get it out ASAP. I know it needs to be cleaned up, and it won't compile correctly for other targets, and things like detection of DMA modes need to be added. I'm sorry if this bothers you. I will release a cleaned up patch.

Note that messing with disk-related code can lead to filesystem corruption, losing all of your files, and having to restore your iPod. If you're concerned about this, you may want to at first disable writes at "//if (write)" in ata.c. I didn't have any problems, the test_disk write verify test passed, single sector transfers (like the FAT) are still done via PIO, and Ultra DMA CRC-checks DMA bursts.
This task depends upon

Closed by  Torne Wuff (torne)
Tuesday, 23 March 2010, 20:50 GMT
Reason for closing:  Fixed
Additional comments about closing:  submitted in r24405
Comment by Andree Buschmann (Buschel) - Thursday, 25 December 2008, 11:53 GMT
I did test runs with svn (r19589) against svn+fs#9708 and svn+fs#9708+fs#9621 on my 5.5G 30GB iPod. I measured a similar reduction in buffer filling time.

1st test: start to play an album (>80MB)
svn: ~10s
+fs#9708: ~7s
+fs#9708+fs#9621: ~7s

2nd test: start to play an large unfragmented track (>60MB)
svn: ~10s
+fs#9708: ~7s
+fs#9708+fs#9621: ~6s

The test_disk plugin shows:
read (4K, aligned): 7.653 / 12.171 KB/s (unboosted 24MHz / boosted 100MHz)
read (1M, aligned): 14.098 / 14.077 KB/s (unboosted 24MHz / boosted 100MHz)
Speed of writes is similar to svn.

I'll battery bench the fully patched version over night now -- let's see whether the theory of missing DMA being one of the battery eaters on iPods is true :-)
Comment by Seheon Ryu (cpu98) - Thursday, 25 December 2008, 12:19 GMT
My 80GB doesn't boot with 9708+9621 saying 'no partition found'.
Currently I'm not using official bootloader but few weeks old svn bootloader.
Comment by Seheon Ryu (cpu98) - Thursday, 25 December 2008, 13:49 GMT
(svn or 3.0) bootloader + 9621 bin = works
(svn or 3.0) bootloader + (9621+9708) bin = 'no partition found'
(9621+9708) bootloader + any bin = stuck at apple logo
Comment by Boris Gjenero (dreamlayers) - Thursday, 25 December 2008, 17:27 GMT
Currently, the code only uses DMA for transfers of more than one sector. Because of this, you can get to the main menu even without functional DMA. Failures would happen when you try to play music, use plugins or perform other activities which involve longer disk accesses. Problems before that point would probably be due to initialization and transfer mode setting in firmware/target/arm/ata-pp5020.c or the changes in firmware/drivers/ata.c. I will now work on integrating DMA into the current ata.c with minimum changes. If that doesn't make it work, we'll then have to look at what needs to change in ata-pp5020.c. I don't have access to an 80GB iPod.
Comment by Andree Buschmann (Buschel) - Thursday, 25 December 2008, 17:56 GMT
Boris, maybe the error reported by cpu98 is connected to the use of "#define MAX_PHYS_SECTOR_SIZE 1024" (config_ipodvideo.h)? This was introduced for the 60/80GB drives of the 5.5G iPods. The 60/80GB HDD have some kind of different behaviour when comparing them to the 30GB drives.

When using this patch I have some issues:
1. Sometimes the default font is loaded and not the theme's font.
2. Loading a plugin for the first time fails (tested with battery_bench and calculator). When re-trying to load the same plugin it works.

Bad thing:
Performing a battery bench is not possible right now. With r19589 the playback stops after 1h 15min. This is reproducible with the downloaded build.
Comment by Sanggon, Lee (isanggon) - Friday, 26 December 2008, 06:17 GMT
On my 5.5th 80gb, this patch doesn't work, too.
FS#9621 works fine.
Comment by Boris Gjenero (dreamlayers) - Friday, 26 December 2008, 23:54 GMT
Here's a cleaned up patch without ata.c refactoring. Here are some notes on the patch:

firmware/export/: DMA is enabled on the video iPod by defining HAVE_ATA_DMA in config-ipodvideo.h. If you want to test the code on another PP502x device, add that to your config file and report the results here.

debug_menu.c: Code to display DMA modes supported by the device was already present. I enabled it and changed it to check word 88 validity and check for UDMA 6 (133 MB/s). I also added a line to show what DMA mode is currently enabled. (Go to the "View disk info" debug screen for this.)

ata.c: The fastest DMA mode supported by both the device and the controller is used. PIO is used if DMA is unavailable. Some devices have design flaws and fail at modes which they claim to support (eg. http://www.linuxdevices.com/articles/AT5102023409.html ).

ata-pp5020.c: DMA compatible initialization and code for working with the DMA controller. This is the PP502x specific stuff.

ata-target.h: If you want UDMA 4, define ATA_UDMA_MAX to 4 here. I don't see any benefit however. Note that this causes boosting, so it's unfair to compare it to UDMA 2 without boosting.

About speed:
- Since boosting does not increase speed of 1 MB reads I assume CPU speed doesn't slow down UDMA reads. However, CPU speed limits what modes work reliably. UDMA 4 is unstable at 30 Mhz and UDMA 2 is unstable at 24 Mhz.
- UDMA writes are slower than PIO writes at 30 Mhz. At 80 Mhz, both are much faster and comparable in speed. (Should there be boosting or automatic picking of the fastest alternative?)
- If the address is not 32-bit aligned or the transfer is just one sector, DMA is not used.
- The Rockbox 0x10 PIO timing is faster than the PIO timings I have, so I'm not changing it.
Comment by Sanggon, Lee (isanggon) - Saturday, 27 December 2008, 02:38 GMT
Now works with 5.5th 80gb. Great job, dreamlayers. Thank you:)
Comment by Boris Gjenero (dreamlayers) - Saturday, 27 December 2008, 02:58 GMT
Andree, you're probably right about MAX_PHYS_SECTOR_SIZE. All transfers are for a multiple of phys_sector_mult sectors. If phys_sector_mult > 1, all transfers are via DMA (because they're for more than a single logical sector). I set phys_sector_mult = 2 to experiment. Apparently, if the very first transfer after the drive was woken up is via DMA, it causes problems from that point on. I have to investigate this. Meanwhile I've attached a simple patch (on top of the previous one) which makes transfers of 1024 bytes or less use PIO.

In the first patch I flushed the cache after a read instead of invalidating it. That may have something to do with the plugin and font issues. I also put the DMA transfer setup before the timeout check, which would probably cause problems in case of a timeout. These issues have been fixed in the second patch.

Regarding FS#9621: the only thing that patch has in common with this one is that it also tries to optimize disk access to speed up buffer filling. The two patches affect different things and they do not depend on each other. If Rockbox doesn't run correctly with this patch, adding 9621 almost certainly won't help.


Comment by Boris Gjenero (dreamlayers) - Saturday, 27 December 2008, 03:03 GMT
Oops, sorry, I should have reloaded the page before posting that.
Can you please make sure that on the 80gb transfers work properly after the drive has been asleep? Do they become much slower? If I set phys_sector_mult to 2 on mine, the difference is very visible when loading plugins.
Comment by Sanggon, Lee (isanggon) - Saturday, 27 December 2008, 03:15 GMT
Without "possible_80G_workaround.patch", 80gb device was frozen when I connect usb cable on rockbox and some times I feel slower than before.
I'll try "possible_80G_workaround.patch" and report again.
Comment by Sanggon, Lee (isanggon) - Saturday, 27 December 2008, 03:38 GMT
After "possible_80G_workaround.patch", above problem was solved. I think disk access after asleep works fine, too.
I'll upload test disk result after do test.
Comment by Sanggon, Lee (isanggon) - Saturday, 27 December 2008, 04:37 GMT
I think it works correctly on 80g now.
Comment by Andree Buschmann (Buschel) - Saturday, 27 December 2008, 07:03 GMT
The battery bench worked fine now. I have tested svn r19589 +  FS#9708  + FS#9621 (+  FS#8668 ).

Total runtime is 14:37 (90-10: 12:30). Typical runtime achieved before was 14:22 (90-10: 12:18), but I need to retest svn (+  FS#8668 ) withouth your patches.
Current savings calculated from this runtime are ~0.5mA. Which is less than I have expected.

Note1: This is an iPod 5.5G 30Gb, playing mpc. The default clock is 24MHz.

Note2: Boris, you have told the DMA transfer is unstable at 24Mhz. How does this instability show up, retries of read access while buffering?

Note3: I will first battery bench svn (+  FS#8668 ) to have a clean test, afterwards I will test your actual patch with boosted DMA.

Question: Does anybody of you have the problems I have described above (error on loading battery bench or calculator plugin the first time)?
Comment by Seheon Ryu (cpu98) - Saturday, 27 December 2008, 13:40 GMT
Mine works flawlessly.
I don't have problems Buschel mentioned.

By the way CPU_FREQ in config-ipodvideo.h looks like for coldfire targets,
though that is likely to be ignored.
Comment by Seheon Ryu (cpu98) - Saturday, 27 December 2008, 18:02 GMT
Sorry that CPU_FREQ was off-topic.
Patched bootloader says no partition found. (of course 3.0 bootloader works)
It doesn't stuck at apple logo now.
Comment by Boris Gjenero (dreamlayers) - Sunday, 28 December 2008, 04:16 GMT
Sanggon, thanks for the benchmarks. I note that some things are unreasonably slow. 512 byte writes are 23 KB/s or less! Create and delete counts at the top are low, and 1 meg boosted create with DMA is slow. Because the problem with 512 byte writes occurs even before the patch, I assume that it's either a hardware problem or a bug already present in Rockbox. I wonder if the hard drive spun down before the test or retries happened.

Andree, thanks for performing a battery bench. It's good to see that there is at least some savings.

If the CPU frequency isn't fast enough for the UDMA mode, I've seen various things happen. Sometimes it's just a lockup or read error. I've also been told that a plugin is for the wrong model. This makes me wonder if unreported errors occur. I imagine the data might get across the interface correctly but then the buffer holding data to be written to memory overflows. A flow control method is defined for UDMA, but the way 1 meg read speed doesn't change with CPU speed makes me wonder if it is works correctly. Note that music playback is not a valid test for unboosted disk access because the buffering code boosts while filling the buffer.

Regarding the bootloader, sorry, I forgot about that. The issues were: DRAM isn't remapped to 0 and the CPU runs at 24 Mhz. I'm including a small patch. UDMA1 seems to work at 24 Mhz. I'm not sure if cpu_boost works in the bootloader.
Comment by Boris Gjenero (dreamlayers) - Sunday, 28 December 2008, 06:11 GMT
I found why corrupt data was being read: errors weren't being checked for! Sorry!

The refactored ata.c checked for errors, and when putting my code in ata.c I just assumed that wait_for_end_of_transfer() returned zero if something is wrong. Actually, it doesn't. Oh, and when reads fail because of an overly low CPU frequency, I was getting:
ATA_STATUS == ( STATUS_RDY | 0x10 |STATUS_ERR ) (0x10 is "na" in the spec, so just ignore that)
ATA_ERROR == ( ICRC | ABRT ) (UDMA CRC error, command aborted)

When I saw this I looked looked at ata_write_sectors and found that write errors aren't checked, even in SVN without any patches! This should probably be submitted as a separate bug. I'll deal with that tomorrow.

I'm attaching a fix for error checking, currently only changing things if DMA support is compiled in.
Comment by Seheon Ryu (cpu98) - Sunday, 28 December 2008, 09:49 GMT
Applied all patches bootloader says that again.
Maybe it's another xxxx-bytes sector problem?
I can't test any device except 5.5G 80GB.
Comment by Andree Buschmann (Buschel) - Sunday, 28 December 2008, 14:44 GMT
Some new results.

The battery bench of r19589 (+  FS#8668 ) showed nearly identical runtime compared to the patched version (90-10: 12:24h). This leads to the assumption that there is no real current saving.
Question: Do you explicitly enable the DMA-controller somewhere? If so: Can it it be disabled after transfer and re-enabled before the next transfer? Maybe an enabled controller eats up some mA?

I have compiled and tested your second version of the patch combined with the error checking fix in two versions: UDMA 2 and UDMA 4. Both work, but the UDMA 2 version still shows an error when trying to load the battery bench plugin the first time -- the UDMA 4 worked flawless until now. As I am also using  FS#8668  which boosts the (with timeout) CPU after using the any button: Is it possible that boosting/unbossting or the CPU-delays between both clocks are the cause of DMA problems? This may also happen during normal playback and buffering. The UDMA 4 version boost during the whole DMA transfer, so it is not "disturbed" by re-clocking...

Update: The UDMA 4 version crashed with "data error" after some hours of playback.
Comment by Boris Gjenero (dreamlayers) - Monday, 29 December 2008, 02:33 GMT
It seems PCF ADC channel 7 shows battery current (absolute value of current in/out, from ISTAT of the LTC4066). Using this, I can't see a DMA vs. no DMA difference when the drive is sleeping or IDE power is off. I didn't examine current during disk activity yet, and this data wouldn't show differences of fractions of a milliamp. I could try to configure for DMA when needed, and undo that as soon as DMA is finished.

I added  FS#8668  and tested. If the hard drive is spun down, plugin load fails, and if it's spinning, plugin load succeeds. I think it fails if button boost time from  FS#8668  expires (due to spinup) and the read is attempted at 24 Mhz. To see if frequency changing causes problems, I removed 8668, made ata_wait_intrq keep toggling boost (30 to 80 and back) and disabled retries. I did get to a state once where disk access didn't work anymore until I restarted, but other than that I did a lot of plugin and image loading and it all worked.

I'm not sure what should be my top priority regarding this patch now. I'd like to fix the problem that happens if DMA is the first read after the drive has been off or asleep.
Comment by Robert Guzik (bobbyguzik) - Monday, 29 December 2008, 03:28 GMT
Just wanted to report that, testing with a 5.5G 80 GB iPod, with the possible_80gb_workaround patch, the bug where it would be slow when the first read is DMA and the drive is asleep seems to be fixed. One problem I am having, however, is when I have #define USE_ROCKBOX_USB defined in the config-ipodvideo.h file to use ROckbox's USB stack, when I plug it in, windows sees it, but no files show in My Computer. I have to test to see if this is a bug in SVN, with this patch or the fat preload patch. I'll let you know tomorrow.
Comment by Andree Buschmann (Buschel) - Monday, 29 December 2008, 07:33 GMT
Hi Boris,

the information about PCF ADC ch7 showing the battery current is new to me. Do you have a patch at hand? I would like to check with some current relevant settings.

It seems you are right with your assumption regarding  FS#8668  and the plugin load. Boosting the CPU manually and then load the plugin always seem to work.

UDMA Mode 0: I have changed your patch to simply use UDMA Mode 0. It works with 24 and 100MHz and does not show any problems loading the pluging while CPU is unboosted. I have attached the disk-speeds for 24/100MHz. You will see that boosted UDMA 0 is nearly as fast as boosted UDMA 2.
Update: UDMA 0 crashed during battery bench...

Configuration for 24MHz: In future there might be some kind of gui boost in svn (there are several talks about it). So, the default clock will be most likely lowered to 24MHz. In this case the DMA transfer should be either limited to UDMA 0 or at least use boosting for all modes >0, right?
Comment by Robert Guzik (bobbyguzik) - Monday, 29 December 2008, 13:07 GMT
I have tested, and the USB bug in my last comment is not present in SVN, but is present with all of your patches applied. Nice job on the patch so far though.
Comment by Boris Gjenero (dreamlayers) - Monday, 29 December 2008, 14:40 GMT
With r19610 from SVN (and no changes other than defining USE_ROCKBOX_USB in config-ipodvideo.h) I can't get USB to work with Vista. The mass storage device and drive appear in Device Manager. "Computer" (My Computer in former versions of Windows) and disk management both hang, and the iPod has short bursts of disk activity at regular intervals.
Comment by Robert Guzik (bobbyguzik) - Monday, 29 December 2008, 15:29 GMT
Hi Boris,
At least on my system, windows XP, with Rockbox SVN, it does what you mentioned for a couple minutes, but in my case, if I leave it, the drive will mount. With your patch, I don't get the bursts of disk access, neither does the drive mount.
Comment by Boris Gjenero (dreamlayers) - Wednesday, 31 December 2008, 05:50 GMT
With Rockbox SVN and USE_ROCKBOX_USB, I haven't been able to get the drive to appear in Vista. This patch didn't seem to change anything; I still got the same pulses of disk activity. I'll have to see what happens from another OS.

Regarding the bootloader, if it doesn't work for you, here are two things to try:
- Run it from within Rockbox. (Put bootloader-ipodvideo.ipod on the drive as a file and play it. Rockbox should start up again.)
- Decrease ATA_MAX_UDMA for the bootloader to 0 in firmware/target/arm/ata-target.h. (Note that there are separate bootloader and normal settings.)

I'm attaching a new version of the patch. It's a collection of all the fixes above, except with a proper fix instead of the 80G workaround. I've tested it on my 30G iPod both with phys_sector_mult set to 1 (which is normal) and with it forced to 2 (which should be like on the 80G). Note that with phys_sector_mult > 1, all aligned transfers are done via DMA. This includes the FAT. If you'd rather have FAT transfers via PIO, you can add the 80G workaround.
Comment by Robert Guzik (bobbyguzik) - Wednesday, 31 December 2008, 14:07 GMT
Hi boris,
On my system, USE_ROCKBOX_USB works with the last patch you uploaded. It's still slow, but it's slow in SVN when mounting, so it has nothing to do with your patch.
Nice job on the patch so far.
Comment by Robert Guzik (bobbyguzik) - Wednesday, 31 December 2008, 23:17 GMT
Hi Boris,
When using your latest patch on the 80 GB ipod, at least on mine, when playing music, after it's done buffering, the disk never spins down until I stop the music. This did not occur with the previous patch, that I'm aware of.
Thanks
Comment by Boris Gjenero (dreamlayers) - Thursday, 01 January 2009, 04:17 GMT
This also happens for me with the downloaded r19629 builds, so I think it's not due to my patch. I submitted it as  FS#9739  .

In ata.c I see a possibility for the disk to not spin down if setup commands or all data transfer commands after a spinup fail. I don't think this is what is going on however.

Comment by Jonathan Gordon (jdgordon) - Thursday, 01 January 2009, 05:11 GMT
yep, my bad... that little problem should be fixed
Comment by Boris Gjenero (dreamlayers) - Thursday, 01 January 2009, 05:20 GMT
Yep, fixed. Thanks, that sure was fast!
Comment by William (lee321987) - Friday, 02 January 2009, 00:14 GMT
Any chance this would help get USB transfers working for the Sansa c200 series (It's PP5022, and it's flash based)?
Comment by Boris Gjenero (dreamlayers) - Friday, 02 January 2009, 04:45 GMT
I don't think so this will help with Sansa devices. This patch works with the parallel ATA (commonly known as IDE) controller and its own DMA controller. The driver is firmware/drivers/ata.c. The Sansa devices don't use it. You might be interested in FS#8830. That patch seems to use a general-purpose DMA channel (a different part of the PortalPlayer chip) in the SD driver.
Comment by Andree Buschmann (Buschel) - Saturday, 03 January 2009, 12:13 GMT
Just did a short test with your latest patch configured to UDMA 1 using 24MHz (unboosted) and 100MHz (boosted). The disk_test plugin shows results which are very similar to UDMA 2 or UDMA 4. Loading plugins in an unboosted state worked fine.
Comment by Andree Buschmann (Buschel) - Sunday, 04 January 2009, 13:47 GMT
Quick draft of a theory why UDMA does not result in better battery lifetime in my tests. I am using mpc-format which is quite efficient and needs ~23MHz CPU-clock for playback (optimized dual-core mp3 needs a bit less than that). So, during buffering situation (which means boosted CPU) residual clock = boost-clock - playback-clock is left for loading via UDMA or PIO (e.g. 77MHz = 100-23 for my configuration). For efficient codecs this means the PIO-buffering is nearly as fast as the UDMA-buffering. This is valid for looping folders. When starting to play a new folder the codec needs to fill the pcm-buffer with decoded samples. In this case the codec's CPU load is much higher than e.g. 23MHz. This is why we see fatser loading with UDMA -- PIO is much slower than UDMA with lower clocks.
Anybody available to battery bench with high load codecs like AAC or APE?
Comment by David Hall (Soap) - Sunday, 04 January 2009, 18:25 GMT
Buschel, would not loading the CPU with the EQ have the same effect as using APE in this case?
Comment by Andree Buschmann (Buschel) - Sunday, 04 January 2009, 19:04 GMT
Soap, I do not know what how much CPU the EQ needs. But in general: yes, this should do the same.
Comment by Boris Gjenero (dreamlayers) - Sunday, 04 January 2009, 20:06 GMT
The buffer is loaded in 32 KB chunks (see BUFFERING_DEFAULT_FILECHUNK in apps/buffering.c). I just modified test_disk to benchmark that, and DMA reads were 10034 KB/s at 30 Mhz. I see that 100 Mhz gives a significant PIO performance boost, but unaligned 1 MB reads are still only 6681 KB/s in test_disk_log_udma1_100MHz.txt. (In PIO benchmarks I see aligned and unaligned speed is very close, and unaligned uses PIO even if DMA is available.) So, it seems PIO reads couldn't be sped up to be as fast as the DMA benchmark numbers. I wonder if DMA reads might slow down when the CPU uses memory bandwidth. While waiting for DMA to finish, this patch yields, and I think the codec might run then and take a significant amount of memory bandwidth.

I'll write some code to gather statistics. Those may help explain what is going on.
Comment by MichaelGiacomelli (saratoga) - Sunday, 04 January 2009, 20:28 GMT
I don't think codecs should have a lot of contention for DRAM. Particularly MP3 which runs almost entirely out of IRAM on the Ipod Video.
Comment by Boris Gjenero (dreamlayers) - Monday, 05 January 2009, 07:23 GMT
From my data collection results it seems that a lot of buffer filling reads are done via PIO. I think this has to be because they're not word-aligned (((unsigned long)addr & 3) != 0). I will investigate tomorrow.
Comment by Andree Buschmann (Buschel) - Monday, 05 January 2009, 08:05 GMT
Next battery bench result with r19589 +  FS#9708  (v0.2 UDMA1) +  FS#9749  + FS#9746 +  FS#8668  attached. The last bench I did (same as this without  FS#9708 ) lead to the assumption that neither  FS#9749  nor FS#9746 had an impact to the runtime.
So, there are two conclusions after this latest bench (total runtime 14:45h, 90-10: 12:36h):
1. It ran stable over >14:45h
2. UDMA seems to save at least some power (~0.5mA)
Comment by Boris Gjenero (dreamlayers) - Monday, 05 January 2009, 18:37 GMT
I understand the alignment issue now. In apps/buffering.c, the offset passed to read is aligned. However from the point of view of storage code, what matters is not alignment of the offset passed to read but alignment of sector boundaries. If first_frame_offset is not a multiple of 4, the whole file is read unaligned. It would be easy to change alignment (at least for TYPE_PACKET_AUDIO) so it suits storage code. However that would mean codecs may get unaligned data, and I don't know if that can cause problems or increased power consumption.

I'm attaching my current statistics code. Results are accessible via the debug menu. Some things may be viewed in "View disk info". All stats may be dumped to a file or reset via corresponding debug menu entries. The patch requires the v0.2 DMA patch, but it should also build with HAVE_ATA_DMA undefined. Time values are in microseconds from USEC_TIMER.

I'm also attaching a simple apps/buffering.c patch which I made now to change alignment. All I can say about it is that it works for me with MP3 now and all PIO reads are single sector reads now. Otherwise it is untested.

Andree, thanks for the battery bench results.
Comment by Andree Buschmann (Buschel) - Monday, 05 January 2009, 19:05 GMT
So, with this alignment patch all buffer-reads (except single sector reads) are DMA now. Therefor the buffering should be nearly completely done via DMA?
Comment by Boris Gjenero (dreamlayers) - Tuesday, 06 January 2009, 02:54 GMT
Oops, it was still possible to get unaligned reads if the user causes rebuffering by seeking within a track. I'm attaching a new buffer alignment patch which handles that too. Because I still can't say I know the buffering code very well, I'll just explain why I think this is correct:

In the buffering code, the initial file-to-memory alignment can be set in two places: bufopen when opening a new file and reset_handle when seeking. Additional reads are sequential on disk and in memory, so alignment is preserved. The buffer start is aligned and buffer length is a multiple of 4, so wrapping should preserve alignment. If all these things are correct, all buffer reads should be properly aligned for DMA.

Reads of file data when buffering a file are of BUFFERING_DEFAULT_FILECHUNK size unless reaching the buffer wrapping point or end of file. So those reads can only be PIO when only one sector needs to be transferred in those cases. However, remember that there are also other single sector reads such as FAT and audio metadata reads.
Comment by Andree Buschmann (Buschel) - Tuesday, 06 January 2009, 11:35 GMT
Next battery bench result with r19589 +  FS#9708  (v0.2 UDMA1 + v0.1 alignment) +  FS#9749  + FS#9746 +  FS#8668  attached.

Total runtime: 14:44h (90-10: 12:42h). The software also ran stable again, the 90-10 runtime slightly improved -- I am not sure whether the improvement is within measurement precision.

Boris, could you already compare the HDD spin-time of svn against patched version when performing battery benchs or longtime runs?
Comment by MichaelGiacomelli (saratoga) - Tuesday, 06 January 2009, 14:45 GMT
Could you use the microsecond timer to measure the buffering time with and without the patch? Might be neat to know the exact timing details.
Comment by Boris Gjenero (dreamlayers) - Wednesday, 07 January 2009, 23:53 GMT
Here is some data with charts: http://spreadsheets.google.com/ccc?key=pkeRcfM0sg949P3a5EYXtew . I'm comparing:
PIO: v0.2 DMA patch with HAVE_ATA_DMA undefined. Buffering is unmodified.
DMA,al: v0.2 DMA patch and align_buffer-v0.2.patch.
DMA,al,ch: v0.2 DMA patch, align_buffer-v0.2.patch, plus buffer filling in 0x100 sector chunks with no yields between blocks (because DMA yields).

It's all done on my 30 GB 5th generation iPod. H2BH8, Gravity and EP7 are shorthand for albums. H2BH8 and Gravity consist of multiple MP3 files and EP7 is a single FLAC file. Data is collected via the ATA statistics patch, with a few modifications to collect more data. PIO was done with the DMA patch included because the statistics patch depends on it. It shouldn't mess up the results because it's almost entirely not compiled due to HAVE_ATA_DMA not being defined. To really call this statistics, I'd probably have to collect more data. However, I think what I have now pretty clearly shows what's going on.

(BTW. It's interesting that the EP7 initial load PIO time includes a lot more unused time than other tests. I ran it a few times and got very similar results. I wonder if that's a bug. I think I've observed that happening occasionally with older unmodified versions of Rockbox.)
Comment by Andree Buschmann (Buschel) - Thursday, 08 January 2009, 18:53 GMT
The measurements and statistics are very interesting. Your patch reduces "Time powered" by ~30% -- this is the duration of the IDE-enable, right?

Do I understand the "Setup time" as the spin-up time corectly? It nearly matches the setup time of the HDD's data sheet (3s from standby).

Another question: There is difference of ~4-5s per buffer access between sum of all PIO+DMA transfers and the spinning time. Is this because of the spin down delay? Is it configured to 4-5s on your iPod?

I am still not sure why the impact on battery runtime is that low... HDD's data sheet sais power consumption is near 1000mW while spinning. This equals about 300mA of current. If I adapt your measurements to my tests we will save about 4-5s per buffer access. My test suite has 3 buffer access per hour -> 3*4-5 = 12s of 3600s -> 1-1,25mA average savings. From my battery benchs I can see half of this gain. Maybe the HDD uses less power?
Comment by Boris Gjenero (dreamlayers) - Friday, 09 January 2009, 00:59 GMT
Yes, "Time powered" is time during which IDE power is enabled.

Setup time includes all time from ide_power_enable(true) to when the HD is ready for its first data transfer command. During this time the hard drive spins up and a bunch of commands are executed to set up parameters (freeze lock, set transfer modes, etc.).

My spindown delay is configured to 5s. However, that ought to be irrelevant because the buffer filling code does storage_sleep(). Except for the anomalous but repeatable EP7 PIO result, I don't see 4-5s of unused time. I see 1 to 1.5 s without DMA and 0.75 to 1 s with DMA plus buffer filling in 0x100 sector chunks with no yields between blocks. I guess you may have been misled by how the spinning time also includes setup time and sleep command time.

According to http://www.toshibastorage.com/main.aspx?Path=StorageSolutions/1.8-inchHardDiskDrives/MK3008GAL/MK3008GALSpecifications , the MK3008GAL uses 1.0 W when reading or writing but only 0.4 W when idle. Here's what I see at the battery debug screen with  FS#9728  and a battery voltage between 4.0V and 4.1V:
HD off, playback stopped: 30-33 mA
HD off, playing MP3: 34 mA (Is this more than it should be? Why such a small difference compared to above? Is it just the debug screen wasting power?)
Spinning, heads loaded: ~260 mA for a few seconds, then ~120 mA
After a while there's a head unload click and current goes down to: ~100 mA
Sleeping but still powered: 70 mA
I think the highest number is the relevant one because I don't think the HD gets enough idle time to decrease current. So, if you believe  FS#9728 , it seems the MK3008GAL is using 230 mA of battery current. (This would mean more current supplied to the HD at 3.3 V if the DC-DC converter powering it has decent efficiency.)

An hour of 34 mA is 34 mAh and 12s of 230 mA is just 0.77 mAh. Plus the HD uses additional power for spinups, seeks and other operations, and the patch wouldn't affect that.

Where do your milliamp numbers come from?
Comment by MichaelGiacomelli (saratoga) - Friday, 09 January 2009, 01:31 GMT
>HD off, playing MP3: 34 mA (Is this more than it should be? Why such a small difference compared to above? Is it just the debug screen wasting power?)

MP3 runs almost completely unboosted, so the only additional draw is whatever is required to power up the DAC and the tiny increase in going from 24mhz (idle) to 30mhz (unboosted).
Comment by Boris Gjenero (dreamlayers) - Friday, 09 January 2009, 02:35 GMT
>MP3 runs almost completely unboosted, so the only additional draw is whatever is required to power up the DAC and the tiny increase in going from 24mhz (idle) to 30mhz (unboosted).

I was thinking the CPU would be doing more work, but I guess it's always 100% active? What about the COP, is it not sleeping normally? (Sorry about the off-topic questions.)

I don't think 24 MHz is ever set during normal operation. It would only be set if cpu_idle_mode(true) was called, and that is never called (I only see it in radio.c and in code which applies to other targets in usb.c.). If 24 MHz was set, UDMA2 reads would fail and UDMA1 or boosting for DMA would be necessary.
Comment by Andree Buschmann (Buschel) - Friday, 09 January 2009, 09:06 GMT
Boris, you are right regarding the current consumption. I simply calculated the 300mA from 1W/3.3V, but the voltage is 4V...

Does the HDD consume the same power as for reading while it is spinning up? I would expect more power to be needed.

If we would present the different steps for each buffering cycle, would it look like this?
Step Duration mA mAh
Setup 3.0s 230 0.19
Buffering 3.0s 230 0.19
Idle spin 5.0s 90 0.13

Total 0.51 mAh per buffer cycle

With PIO the buffering takes ~7s, total energy consumption is 0.76 mAh per buffer cycle then.

Where do Ih have the numbers from:
230 mA = 260-30mA
90 mA = 120-30mA
Setup time = 3s (from table)
Buffering = all DMA+PIO transfer time / 4 (whole ablum play, Gravity)
Idle spin = from your settings

Did I get the numbers right from your table?


Comment by MichaelGiacomelli (saratoga) - Friday, 09 January 2009, 15:03 GMT
>I was thinking the CPU would be doing more work, but I guess it's always 100% active? What about the COP, is it not sleeping normally? (Sorry about the off-topic questions.)

The COP sleeps normally, but the difference is very small between sleeping and full load: http://www.duke.edu/~mgg6/rockbox/30mhz-2.png

These are taken on PP5024 but should be similar.

>I don't think 24 MHz is ever set during normal operation.

I assumed it dropped back to that when not playing but I never looked. In that case the couple mA is probably the DAC and the extra CPU load.

Comment by Boris Gjenero (dreamlayers) - Friday, 09 January 2009, 21:34 GMT
Here are the sections I see, in order:
Setup: 2.8 s, power usage unknown. I see a brief peak of 400 mA and some time at >300 mA. MK3008GAL datasheet says 1.8 W.
Usable time: Time as measured, power usage >=230 mA.
Sleep: 0.7 s, power usage unknown. I see a peak of >300mA when unloading the heads.
Power off timeout: 2 s, 40 mA

Regarding the usable time, I don't know the exact sequence there, just the totals.
Idle time: except in the anomalous EP7-PIO result, there is well under 2 s of idle time. I think it's safe to assume that current is 230 mA because there is no activity to increase it and not enough idle time to decrease it. It takes more than 2 s for it to go down to 90 mA.
Reads, writes, seeks: > 230 mA. I've seen > 300 mA for a significant length of time even.

Yes, the current measurements are incomplete. In order to get accurate measurements, continuous sampling is needed, not just a measurement here and there. It might be possible to do this by making the COP fill a buffer with 4066_ISTAT and USEC_TIMER readings while the other core does normal work. It would definitely be possible to do the measurements with a digital storage oscilloscope or other ADC hardware with a computer interface (but I don't have any of this). The data could be graphed and integration could be used to figure out the amount of power used. Is this really important?
Comment by Boris Gjenero (dreamlayers) - Saturday, 10 January 2009, 04:25 GMT
In the EP7 PIO test, the storage_sleep() part of fill_buffer() (in apps/buffering.c) is executed early in the buffering and it is not executed at the end. That's why the disk doesn't spin down right after buffering is complete and it waits for the disk timeout. This delayed spindown happens with the downloaded r19739 build if dircache is enabled. I've submitted FS#9775 and I will investigate further.
Comment by Andree Buschmann (Buschel) - Sunday, 25 January 2009, 13:05 GMT
Just to give some feedback about stability: I am using dma_v0.2 + align_v0.2 for ~2 weeks now without any problems. Only change is to use UDMA 1 instead of UDMA 2 (as I am using 24MHz normal clock). I also did another battery bench which resulted in comparable results to the bench attached above.

Maybe a developer with more knowledge of and experience in the ata-part can review this patch?
Comment by Boris Gjenero (dreamlayers) - Wednesday, 28 January 2009, 19:28 GMT
I'm unable to see any difference in FireWire input current when idle and at the main menu. That's evidence that the changes made by this patch don't increase current draw when IDE power is off. It would be nice to see a current graph as Rockbox performs normal operations, but I would have to build a circuit for that.
Comment by Boris Gjenero (dreamlayers) - Saturday, 31 January 2009, 02:19 GMT
Here are two photos of FireWire input current without this patch:
http://picasaweb.google.com/boris.gjenero/RockboxRelated#5297270003826303634
http://picasaweb.google.com/boris.gjenero/RockboxRelated#5297270005596450130
I'm not sure if this is useful. I was thinking of adding more filtering, using it to AM modulate an audio frequency carrier, and recording that to WAV on a different device, but that might be overkill.
Comment by Andree Buschmann (Buschel) - Thursday, 12 February 2009, 05:27 GMT
Sync'ed against svn r19983 + minor changes:
- chunk size set to 256KB in alignment patch
- maximum UDMA mode set to 1 to have stable functionality even with 24MHz clock
Comment by Boris Gjenero (dreamlayers) - Thursday, 12 February 2009, 22:19 GMT
Thanks Andree

Regarding BUFFERING_DEFAULT_FILECHUNK: FAT code transfers up to 128KB at a time, doing multiple transfers if more is requested (or if data is fragmented). Also buffering code stops filling the buffer when a full chunk won't fit, so increasing chunk size may decrease the average amount of buffer space that is filled. I am attaching a small patch on top of the 0.3 patches above which enables 1 meg blocks and fills the whole buffer, as an experiment. I don't know if it is beneficial.

Technical explanation: ATA transfers are done in 512 byte blocks called logical sectors. (Physical sector size may differ, eg. on the 80G iPod.) LBA28 commands only allow an 8 bit number of logical sectors, and 0 means 256, so the maximum transfer size is 256 * 512 = 128KB. LBA48 commands allow a 16 bit value, or up to 65536 * 512 = 32MB. However, DMA hangs if the number of sectors is greater than 2048, so the limit is 2048 * 512 = 1MB.
Comment by Boris Gjenero (dreamlayers) - Tuesday, 17 February 2009, 02:01 GMT
http://www.rockbox.org/tracker/task/9910#comment28343 made me realize the cache handling is not correct for reads. The problem is that if the start or end isn't aligned to a cache line boundary, writes to memory just before or after the destination area can overwrite the start or end of what's been read with old data. I also understand why if the start and end is cache line aligned, I could just invalidate at the start of the read (instead of flushing before and invalidating later, as I do now.).

So, how do I solve the cache line alignment issue? Do I do 3 ATA commands, a PIO read at the start, a DMA read for the middle, and a PIO read at the end? Do I shut off the cache in the affected area during DMA?
Comment by Michael Sevakis (MikeS) - Thursday, 19 February 2009, 20:56 GMT
I'd just use PIO for alignment on "dangling ends". I think the cache management should be moved into the target code as well. If I use the generic part on gigabeat S, caching and transfers would be handled quite differently than for PP.

Nothing should be looking at the buffer anyway. If something must probe the buffer during that time, it should be done from the physical address.

cpucache_invalidate does writeback first so there's no need to worry about performing that. cpucache_flush before writes and cpucache_invalidate before reads is fine.

EDIT: Make response more coherent.
Comment by Frank Gevaerts (fg) - Thursday, 26 February 2009, 17:28 GMT
On my video 5G 30GB with pp5020-dma-v0.3.patch increases USB read speed from 5.0 MB/s to 8.3 MB/s. Write speed also goes up a bit, 8.3 MB/s to 8.9 MB/s. This is with r20109.
On my ipod mini 2G with a 32GB transcend 133x CF card, USB read speed goes from 3.2 MB/s to 7.2 MB/s. Write speed goes from 3.0 MB/s to 2.8 MB/s (obviously I added the HAVE_ATA_DMA define there too)
Comment by Frank Gevaerts (fg) - Thursday, 26 February 2009, 17:51 GMT
More tests, all over USB, using dd to the raw device with a 128k blocksize.
The USB code does 16K reads and writes. I suspect that this is the limiting factor on the video (but that's a bit hard to test without rewriting lots of USB mass storage code)

read write
mini 2G 32GB 133X CF
no dma 3.2 3.0
udma1 7.2 2.8
udma2 8.2 3.5
udma3 8.6 3.6
udma4 9.6 3.6

video 5G 30GB
no dma 5.0 8.3
udma1 8.3 8.9
udma2 8.9 8.9
udma3 8.9 8.9
udma4 8.9 8.9
Comment by Frank Gevaerts (fg) - Thursday, 26 February 2009, 20:23 GMT
I was wrong. At least the read speed seems to be limited by the USB side, not by the ATA side. For the video, this is also the case for write.
Comment by Boris Gjenero (dreamlayers) - Saturday, 28 February 2009, 18:03 GMT
I finally did my own battery bench. I compared unmodified r20111 with r20111 patched with pp5020-dma-v0.3.patch, align_buffer-v0.3.patch and 1_meg_blocks_experiment.patch, with UDMA mode set to 2.

For disk time measurements done with an earlier version of the patch, see the album "gravity" in http://spreadsheets.google.com/ccc?key=pkeRcfM0sg949P3a5EYXtew . (You can go to the bottom of the spreadsheet and click on the "Whole album chart" tab.)

I'm not impressed. Because the shutdown happened at 20% with DMA vs. 15% without, the time was shorter. If line up the start of the graphs (see .png), the difference very small, similar to FS#9746.

I guess if there's any reason for this patch now it'd be faster USB transfers. So, what should be the future of this patch? Is it important or not? What work needs to be done?
Comment by Andree Buschmann (Buschel) - Monday, 02 March 2009, 06:37 GMT
I just want to mention my opinion and wishes about how to proceed with this patch. I think this patch is absolutely valuable and should make it into svn -- even though the expected positive impact on runtime is not visible. The reasons are: it still gives faster buffering, it speeds up USB transfers (which is important now that RB-USB has been enabled on most PP-targets), it is bringing into work some more of the undocumented PP-hardware.
Nevertheless I am not too deep into the area of ata driver. That's why I hope that some of the other experts will find some time to review the code, maybe make some changes and submit this patch.

Good work so far!
Comment by Michael Sevakis (MikeS) - Monday, 02 March 2009, 09:45 GMT
I'm working on getting the DMA modes working on imx31 with the generic portion of this patch as a basis and that should help to identify any futher target-oriented splits that are required or possible simplifications. Still the cache-alignment on reads should be addressed and handled in ata-pp5020.c along with the cache handling. The main reason is the range operations that can be done with the ARM MMU.

Another thing I noticed is addition to obtain the physical address but what about IRAM buffers or addresses already physical (mpegplayer uses an uncached disk buffer)? Addresses in those ranges should have no transformation applied.
Comment by Boris Gjenero (dreamlayers) - Thursday, 05 March 2009, 03:03 GMT
>Another thing I noticed is addition to obtain the physical address but what about IRAM buffers or addresses already physical (mpegplayer uses an uncached disk buffer)? Addresses in those ranges should have no transformation applied.

Uncached RAM and IRAM should be handled properly in this patch, though I still need to test DMA to/from IRAM. Flash remapping isn't handled however, and so an attempt to dump the flash will hang. Rockbox needs functions which convert between physical and virtual addresses.

Regarding cache handling, if i wanted to do it efficiently, I'd have to handle it in fat.c and file.c. (In ata.c you don't know if the overlap is with another part of the same read or other data.) Should I just do it in a simpler way in ata.c so that less platform-independent code is changed? I'm a bit concerned about changing too much stuff to accomodate DMA and because of that ruining chances for acceptance of this patch.

When BUFFERING_DEFAULT_FILECHUNK is increased, if you resume playback near the end of a track that can lead to looping audio and playback continuing past the end. This also happens without any of the patches here. See:
http://www.rockbox.org/tracker/task/8206#comment28813
(Sorry, I didn't mean to include the patch in that comment. I don't think it works right. See the comment.)
Comment by Michael Sevakis (MikeS) - Thursday, 05 March 2009, 06:03 GMT
I'd do any cache alignment/address stuff in ata-pp5020.c. Dumping flash could use a temp buffer. I did introduce functions for gigabeat S (mmu-imx31.c) if you wish to use the API for consistency. Gigabeat S will still need virtual addresses passed to ata_dma_setup to do cache range operations properly and efficiently.

Edit: remove bit I'm sure was plain wrong (name confusion)
Comment by Boris Gjenero (dreamlayers) - Thursday, 05 March 2009, 06:51 GMT
DMA without cache coherency, cache lines and cache line interference may be found on many platforms. Because of this, I think it this be handled by platform-independent code. The only platform-specific facts seem to be: does the problem exist and what's the cache line size.

I agree that ata_dma_setup should get virtual addresses. I just thought it should use something external (eg. a function in system-pp502x.c or a define in system-target.h) to translate them to physical addresses. If the translation is hard-coded in ata_dma_setup, that may cause problems in the future.

I wish I could do cache range operations on the PP502x, but I've never seen functions for that. One day I'll read through http://daniel.haxx.se/sansa/memory_controller.txt and see what I can do.
Comment by Michael Sevakis (MikeS) - Thursday, 05 March 2009, 08:01 GMT
I gave some further thought to the problem and it's possible to scatterlist the DMA transfers by using a cache-aligned temp buffer for the first and last sectors if it's not aligned and DMA direct for what's in between. It is possible to initiate several transfers without issuing further commands (as opposed to mixing PIO and DMA)? Since I'm a bit new to this, I seem to have finally figured what you meant by "sector-aligned"-- each command has to transfer a whole sector. I guess that can make it easier or harder-- depending.

My thoughts really aren't on playback usage alone, btw. It should work in all cases and I wouldn't be too shy about making it work in general. I'm not sure yet but alignment may be immaterial to whether DMA can be used and only the cache alignment matters for imx31.

I'll just keep working independently over there from now on since I have enough to get through already for the beast.
Comment by Michael Sevakis (MikeS) - Thursday, 05 March 2009, 08:06 GMT
> I wish I could do cache range operations on the PP502x, but I've never seen functions for that. One day I'll read through http://daniel.haxx.se/sansa/memory_controller.txt and see what I can do.

There is no "standard" way I know of to do it with the flexibility of the ARM MMU on other chips. Every retailos code I've disassembled just does a full op. One possibility is to flush/invalidate by index into CACHE_FLUSH/CACHE_INVALIDATE register arrays if the cached address is in the range but I don't know if that would have side effects by just touching it. The mask and match method doesn't seem quite flexible enough (that wasn't known to work for sure anyway).
Comment by Boris Gjenero (dreamlayers) - Thursday, 05 March 2009, 08:30 GMT
Yes, I agree that scatter-gather would be faster, but I don't think PP502x supports scatter-gather DMA or single sector DMA transfers.

Regarding alignment, I said alignment of sector boundaries is what matters. When a read happens at the file level, both the current file offset and memory address matter. If the file offset isn't at the start of a sector, file-level code will take care of that, doing a single sector transfer into a separate buffer if necessary. Then the rest of the transfer starts at the offset of the first whole sector. This is the offset which must be word aligned for DMA to be possible on PP502x. If it's also cache line aligned, that's good. If it's not cache line aligned, but the rest of the cache line is used for partial sector data mentioned earlier, that's also good. If any data outside of what is being read shares the cache line, that requires special handling. At the end the same concerns apply. Oh, and if the file is fragmented or the request is large, FAT level code will split up the request into multiple ATA level calls.

Yes, I would also like to make this work in general. For example the database or some plugins may benefit. The things that really need absolute maximum speed can maybe care about cache line alignment themselves, but DMA should work regardless.
Comment by Michael Sevakis (MikeS) - Thursday, 05 March 2009, 09:57 GMT
Think it might not pay off to align sector caches in both file.c and fat.c then (at least for PP)?

Edit: typo
Comment by Boris Gjenero (dreamlayers) - Friday, 06 March 2009, 15:35 GMT
Rockbox has many buffers for disk sectors. There's are at least the large physical sector cache, FAT cache, directory cache, file sector cache, buffers in disk.c and buffers used by the USB stack. This patch uses PIO for single logical sector transfers, which probably means only the large physical sector cache needs to be aligned to a cache line boundary. (The USB stack already cache line aligns its buffers.) If you do DMA for single sector transfers, then at least the buffers which are used regularly should be cache line aligned (assuming not cache aligned cases are handled correctly but slower).
Comment by Michael Sevakis (MikeS) - Monday, 09 March 2009, 13:03 GMT
What restrictions for min/max DMA size are inherent to the standard? I've no restrictions as far as DMA itself on i.MX31.
Comment by Boris Gjenero (dreamlayers) - Monday, 09 March 2009, 17:03 GMT
According to the ATA standard, DMA does not place additional restrictions on size. The limits come from what you can specify via the command block registers: the minimum is one logical sector, and the maximum is 0x100 logical sectors with LBA28 and 0x10000 logical sectors with LBA48.
Comment by Michael Sevakis (MikeS) - Tuesday, 10 March 2009, 02:10 GMT
Well, I have UDMA4 (at least it should have chosen that since the drive supports up to UDMA5) running on i.IMX31 and buffering speed is pretty obscene. :-) test_disk later after actually doing some optimizing. A buffer fill only takes a few seconds.

There are in fact no restrictions whatsoever on the DMA engine (the SDMA core is involved by chip design). Trying to handle cache issues in a generic way would be a big negative for this processor since I can do it all within ata-imx31.c by scattering.

Edit: Fix driver/drive
Comment by Boris Gjenero (dreamlayers) - Tuesday, 10 March 2009, 04:44 GMT
Impressive! I'm curious about the test_disk scores and the effect on battery_bench results.

It would be easy to move the cache flush and invalidate into ata_dma_setup() and do away with the "#ifndef HAVE_DMA_CACHE_COHERENCY" code in ata.c.
I guess it's possible to read the start and end of a transfer via PIO from ata_dma_setup() and ata_dma_finish(). That is sometimes not needed, because ata_read_sectors() calls can be for data in the middle of a read() call, where the cache lines are only shared with other data from the same read() call. However, I guess that's not too bad.
Comment by Michael Sevakis (MikeS) - Tuesday, 10 March 2009, 07:21 GMT
Here it all is including PP5020 code. I did move the cache handling out at well so if I messed something up, feel free to correct it. This task should really be a more comprehensive one for a general DMA ATA framework.

Timings are for test_disk on Gigabeat S.

Edit: Attach speed test with optimized writing enabled for comparison.



Comment by Michael Sevakis (MikeS) - Tuesday, 10 March 2009, 08:02 GMT
I should mention that after invalidate for a read, a prefetch of the data is needed at the DMA boundary or the line contents not written by the MCU but written by DMA are undefined IIUC. Dummy reads take care of that. i.IMX31 has a prefetch instruction which help this without performing a full register load.
Comment by Michael Sevakis (MikeS) - Tuesday, 10 March 2009, 11:04 GMT
This just corrects a small error handling mistake (not that I"m completely positive about how I want to specifically handle errors) and adds enabling DMA in the Gigabeat S bootloader.

PS. I mucked up enabling it for the S bootloader and had a nice fast restore of my backed-up files. :)
Comment by Michael Sevakis (MikeS) - Tuesday, 10 March 2009, 22:46 GMT
I separated it into common patches and CPU type patches to aid collaboration. Apply the common part and the patch for the particular processor.
Comment by Michael Sevakis (MikeS) - Thursday, 12 March 2009, 02:24 GMT
Resync for pp502x. No update for i.MX31 because I plan on adding that soon and that target is not subject to the feature freeze.The common portion is in SVN as of r20298 thus only this patch is needed now.
Comment by Przemysław Hołubowski (p.h.) - Thursday, 12 March 2009, 09:34 GMT
I have tested the patch on H10-20 with 60GB HDD using UDMA1 mode. There is a problem - RB quite frequently crashes during coping large files from PC onto H10's HDD.

USB wite (PC->H10) spped is 9.4MB/s (about 1.5MB/s faster then using OF), and USB read speed is 8MB/s (OF attains 14-18MB/s).

I attach test_disk results too.
Comment by Andree Buschmann (Buschel) - Thursday, 12 March 2009, 10:27 GMT
Does RB also crash during copying via USB without using this patch?
Comment by Przemysław Hołubowski (p.h.) - Thursday, 12 March 2009, 10:31 GMT
No it does not.
Comment by Boris Gjenero (dreamlayers) - Thursday, 12 March 2009, 20:33 GMT
Mike, thanks for your work on this patch.

I see that the common portion got committed. Removing  FS#9721  code causes a problem: no ATA_STATUS error checking after DMA reads. PIO reads do their own error check, but DMA reads relied on wait_for_end_of_transfer(). There ought to be an error check after DMA reads. While DMA code is undergoing development, I think it is a good idea to also check write errors, but once the code is known to be stable, that issue may be left to  FS#9721 .

BTW I don't like how a single write failure can lead to a panic with the current  FS#9721  patch. I think the best solution is to retry failed writes like reads are currently retried, and the best way to do that is to merge the read and write functions similarly to 19581-dma-preview.patch. Any comments on this idea are welcome on the 9721 page.
Comment by Boris Gjenero (dreamlayers) - Friday, 13 March 2009, 05:25 GMT
I was unable to crash my 5G 30GB iPod running r20307 + ata-dma-pp5020-v0.6.patch. Using Vista, I moved 5 gigs back and forth a few times and I created and copied several 1.3 gig files.

How does RB crash on the H10? Is any message displayed? If the drive automatically unloads the heads after a period of inactivity, did that happen?
Comment by Przemysław Hołubowski (p.h.) - Friday, 13 March 2009, 09:32 GMT
It happened quite frequently and when it happened RB sometimes froze during USB write and the only thing I could do was to reset the player. (with pp5020-imx31-dma-v0.5.patch)

I have tested this once more using current using clean r20316 and with ata-dma-pp5020-v0.6 and  FS#10015  patches applied and RB worked stable during USB writes.
Comment by Frank Gevaerts (fg) - Friday, 13 March 2009, 09:41 GMT
OK. If it crashes during USB write and not elsewhere (test_disk plugin, copying files,...), blame USB :) Can you file a separate bug for that, and possibly also add your  FS#10015  related findings to  FS#10015 ?
Comment by Andree Buschmann (Buschel) - Saturday, 14 March 2009, 13:23 GMT
I have rebuilt rockbox against r20323 with v0.6-patch for PP502x. The interesting effect ist that UDMA 4 works stable now even for 24MHz! :-)
Attached are the results for 24MHz UDMA4 (01) and 100MHz UDMA4 (02).

We should re-test UDMA4 for other targets as well as this shows a measurable positive effect on the USB transfer rate (reading from target).

Edit: Just saw that I talked nonsense, UDMA4 boosts the CPU. So, it is not really working with 24MHz. Nevertheless UDMA4 configuration worked stable for >10 days now.
Comment by Przemysław Hołubowski (p.h.) - Sunday, 15 March 2009, 22:34 GMT
I think there's no need to add a bug report regarding USB write as it is stable with the ata-dma-pp5020-v0.6.patch. R20325 with the patch works stable up to UDMA4 80MHz on H10. UDMA5 is unstable.

Setting UDMA1 and 80MHz results in the best performance. Interestingly higher UDMA modes not only don't give better transfer rates but significantly lower.
Comment by Przemysław Hołubowski (p.h.) - Sunday, 15 March 2009, 22:48 GMT
Mentioned above H10 test_disk results for various UDMA modes.

Theoretically 1.8" HDDs have transfer rates below 33MB/s (Toshiba HDDs up to 80GB have, as far as I remember, about 25MB/s read speed) so UDMA2 and higher modes aren't justified and shouldn't give any significant speed improvement.

In opposite to the higher UDMA modes higher CPU frequency makes difference.
Comment by Andree Buschmann (Buschel) - Wednesday, 25 March 2009, 08:19 GMT
Boris, can you try to explain some details about ata_dma_set_mode()? I am interested in the yet unknown register that you set to 0xC0000000 as well as the IDE1 timings. Did you reverse engineer these timings or do you have any register information about it? Also the IDE0_CFG is of interest: The ">65MHz"-bit is unset (needs to be changed dependent upon the clock on iPod nanos), and at the very end a ">50MHz"-bit is set. As this setup is a one-time setup: Would maybe changes of some registers be needed when changing the clock? Is it possible to change the timings in a way that they allow UDMA2 or higher without boosting?
Comment by Boris Gjenero (dreamlayers) - Wednesday, 25 March 2009, 17:44 GMT
I don't have any PortalPlayer documentation. The code is based on reverse engineering. Code I examined only uses DMA at 80 Mhz. I set up everything for DMA at 80 Mhz. When I found that UDMA 2 worked at 30 Mhz I just left it that way. One concern is that CPU frequency changes may happen during DMA, and changing timings and other ATA host configuration during DMA might be bad.

I once tried changing 0x600060C4 and the 50 Mhz bit to see if there is a difference in power consumption and to see if I could do UDMA4 at 30 Mhz. I saw no difference in power consumption and I couldn't get UDMA4 to work at 30 Mhz. (0x600060C4 is set to zero at 24 Mhz.)

I don't know what's going on with the ">65MHz"-bit (IDE0_CFG & 0x10000000). It's cleared at both 80 Mhz and 24 Mhz. It is only set in a branch for all other frequencies. That branch doesn't change 0x600060C4 and it uses 0x11C1 for PIO timing and 0x80003261 for DMA timing regardless of mode.
Comment by Andree Buschmann (Buschel) - Wednesday, 25 March 2009, 21:08 GMT
So, from what you see from the code:

@80MHz:
0x600060C4 = 0xC0000000;
IDE0_CFG &= ~0x10000000;
IDE0_CFG |= 0x20000000;
PIO is untouched
DMA = depends upon DMA mode, did you also get the timings via re-engineering (based upon 80MHz?)

@24MHz:
0x600060C4 = 0x0;
IDE0_CFG &= ~0x10000000;
IDE0_CFG |= 0x20000000;
PIO is untouched
DMA is same as for 80MHz

other (this was not examined in DMA context but in context of clock setting, right?):
0x600060C4 is untouched (what is the default?)
IDE0_CFG |= 0x10000000;
IDE0_CFG is untouched
PIO = 0x11C1 (but 0x10 is faster)
DMA = 0x80003261; (for all DMA modes, is similar to UDMA3 setting)

An easy way to avoid clock changes during DMA transfer is of course to boost the CPU (as you do for UDMA > 2). Why not boosting for all UDMA modes? This equals the timings and settings you have examined from the code. :
Question: How often does the code boost/unboost right now (e.g. when loading 1 MB of data). I remember that each boost/unboost costs some hundred microseconds.

Btw, it is interesting that the last three digits of the DMA setting (e.g. 0x271 for UDMA3) are highly correlated to the maximum UDMA transfer rate: N / 0x7C1 ~= 16 (MB/s for UDMA0), N / 0x491 ~= 25 (MB/s for UDMA1), and so on (for N=30000) First and only exception is the setting for UDMA4 (which would lead to 265 MB/s). Coincidence?
Comment by Boris Gjenero (dreamlayers) - Thursday, 26 March 2009, 06:04 GMT
Changes to what you wrote above:
IDE0_CFG & 0x20000000 is the >= 50 Mhz bit, so it would be set at 80 Mhz and cleared at 24 Mhz. In the third case it is set if an only if the frequency is 50 Mhz or more.
At 80 and 24 Mhz, PIO timings are set from the corresponding table. The 80 Mhz table is in a comment in ata-pp5020.c. The 24 Mhz table is identical except for mode 0, where the timing value is 0 at 24 Mhz. However, it seems they are independent of DMA and they are slower than 0x10, so I don't set them.
The DMA timing is set from the tm_udma table at 80 Mhz only. It is set to zero at 24 Mhz. (At other frequencies, it is 0x80003261, as you wrote.)

I never tested the third third set of values. It's weird how those values don't depend on what the device claims to support. I wonder if they're for the Nano, but that's pure speculation.

None of this was found in a CPU clock setting context so far; it is part of IDE initialization code. I will do more research.

If boosting/unboosting was done, there would be one boost/unboost per ATA command. I can't answer in general regarding loading 1 MB. If loading the LBA28 maximum of 256 sectors at a time, that would be 8 times. If loading into the buffer, due to BUFFERING_DEFAULT_FILECHUNK it's 32 times, although the CPU stays boosted while filling the buffer and the cpu_boost calls just increment and decrement the boost counter. It's also possible to load 1 MB in one LBA48 command.

I thought that the timing registers contain multiple fields. I don't think a single number would be useful to the hardware because there are multiple timing parameters, and they don't all scale in similar ways. Plus that UDMA4 exception, which isn't a typo, is a counter-example. With DMA timings, I can draw a connection between some timing parameters and individual digits of the timing value, but that could just be a coincidence.
Comment by Boris Gjenero (dreamlayers) - Sunday, 29 March 2009, 23:37 GMT
Today I deleted everything except the small bit of OF data from my 5G 30GB iPod and then defragmented. Surprisingly, I still saw the 17 MB/s upper limit in test_disk and I still heard disk seeking during the test. I then tried a different test: reading 10 meg chunks via ata_read_sectors at regular intervals along the disk. A graph is attached. Here is a summary of the results:

* When DMA is fast enough, MK3008GAL read speed varies from 20.5 MB/s at the start to 9.7 MB/s at the end of the disk. I assume this is read speed off the media.
* When DMA isn't fast enough, there is a very slight increase in read speed until DMA becomes fast enough. Then speed follows the former curve. I assume the slight increase during the first part is due to decreasing DMA flow control overhead.
* UDMA 1-4 are virtually identical. UDMA 0 limits read speed. MDMA 2 is even slower.
* 30 Mhz and 80 Mhz are virtually identical.
* Reading in 1 meg chunks (with 1_meg_blocks_experiment.patch) instead of 128 k blocks makes virtually no difference.
* 24 Mhz slows down UDMA 1 by less than 2 MB/s in a small area near the start of the disk.
Comment by Andree Buschmann (Buschel) - Monday, 30 March 2009, 06:45 GMT
Very interesting results. The decreasing transfer rates over the disk area is expected, but the effect of clock and DMA-mode to the transfer rates are very valuable.

1.) As we see a connection from the clock to the transfer rates for with the test_disk-plugin: Is this an effect of faster copying within copy_read_sectors() and copy_write_sectors() in ata.c?

2.) So, we should really simply use MAX_UDMA 1 as configuration then (as it works stable for all clocks)?

3.) If 1) is true, is there a way to avoid the copying?
Comment by Boris Gjenero (dreamlayers) - Monday, 30 March 2009, 14:42 GMT
copy_read_sectors and copy_write_sectors are just used for PIO. So, they're only used for the FAT and only when physical sector size is 512. They're located in firmware/target/arm/ata-as-arm.S.

I was thinking about how I blamed the "slight increase in read speed until DMA becomes fast enough" on "DMA flow control overhead". This is wrong. Slower DMA modes have slower transfer rates, and there would be less flow control. I think the issue is what happens when the drive cannot send data as fast as it can read read off the media. The drive can continue filling its buffer. Even if the current command ended, the drive can read ahead (Rockbox enables that feature). However, eventually the buffer is full or the drive decides it has read ahead enough and reading stops. To restart reading, there is rotational latency: the drive needs to wait for the proper spot to rotate to the head. Stopping and restarting reading might also have a bit of other overhead. BTW. This is similar to why interleave mattered on MFM hard drives.

When a sequential read is broken up into multiple ATA commands (eg. by BUFFERING_DEFAULT_FILECHUNK), faster DMA allows more time for other code to execute between DMA operations, while still not interrupting reading. I think this is the only reason for using a DMA mode faster than UDMA 1 with an MK3008GAL. I don't know if this is important right now, because FAT reads often interrupt such reads, and that is a different issue.

Note that I didn't say anything about writes so far. DMA writes are slower than PIO writes at 30 Mhz! At 80 Mhz they are a bit faster.
Comment by Andree Buschmann (Buschel) - Monday, 30 March 2009, 15:47 GMT
> copy_read_sectors and copy_write_sectors are just used for PIO

Oops, I should have a deeper look at the sources before I write a comment. You are of course right.
Comment by Boris Gjenero (dreamlayers) - Thursday, 02 April 2009, 23:20 GMT
Single logical sector (512 byte) DMA transfers are possible on PP502x. The problems I experienced before were due to cache line interference. After aligning sector buffers in disk.c and fat.c, single sector transfers work properly. This leads to an increase in 512 byte aligned read speed, and a decrease in 512 byte aligned write speed. Here are the increases:
30 MHz: 1532 KB/s PIO, 1680 KB/s UDMA2
80 MHz: 2757 KB/s PIO, 3280 KB/s UDMA2
More limited testing using UDMA1 showed similar gains.

I propose aligning such buffers via a STORAGEALIGN_ATTR define, which expands to CACHEALIGN_ATTR when needed, and nothing when not needed. I guess that has to be set based on something from config-*.h, but I'm not sure what would be the proper config directive name. How about CACHE_ALIGN_SECTOR_BUFFERS? Note that this would be beneficial even when ATA properly handles unaligned buffers, because such handling requires some extra time. Does this seem reasonable? If it does, I'll create a patch.
Comment by Frank Gevaerts (fg) - Tuesday, 26 May 2009, 17:28 GMT
Resync after r21083.
What's actually blocking this? Are there remaining serious issues?
Comment by Rafaël Carré (funman) - Friday, 12 June 2009, 18:15 GMT
On the Sansa AMS there is also problems with cache line interference, although the functions in mmu-arm.S should take care of that.

The workaround used is to copy data from/to an aligned, uncached buffer, and use this buffer for DMA transfers, I just thought aligning buffers used in *_read/write_sectors() would be too much work / binsize increase.
Comment by Sascha Wolf (Horscht) - Wednesday, 17 June 2009, 20:17 GMT
currently (rev. 21318) this patch breaks playback, at least on my ipod video (5.5G 80GB).

(Jpeg) album art is displayed garbled and in wrong colors when the albumart is being displayed during disk activity, Playback freezes randomly after a few (5-7) songs.

This happens with either UDMA 1 or UDMA 4 mode.
Comment by Gman (Thecoolgman) - Wednesday, 17 June 2009, 21:38 GMT
Hmm, That's not good. Especially because I have an iPod video.
Comment by Andree Buschmann (Buschel) - Friday, 19 June 2009, 05:55 GMT
Horscht, can you check which change in svn breaks playback when using this patch?
Comment by Sascha Wolf (Horscht) - Saturday, 20 June 2009, 11:01 GMT
I am slightly Confused about what build breaks it.

I tried bisecting between 21250 and 21300 and I narrowed it down to 21261 being the first instance it breaks. But 21261 being a commit to the 3.3 branch confuses me...
Comment by Andree Buschmann (Buschel) - Saturday, 20 June 2009, 15:01 GMT
Horscht, there were some changes to jpeg-decoding (r21254-r21256) as well as the first commit of time pitch/stretch (r21258).
Did you try to use a WPS without cover art? or use .bmp instead?
Are you using time stretch?
Comment by Andree Buschmann (Buschel) - Saturday, 20 June 2009, 16:42 GMT
Horscht, there were some changes to jpeg-decoding (r21254-r21256) as well as the first commit of time pitch/stretch (r21258).
Did you try to use a WPS without cover art? or use .bmp instead?
Are you using time stretch?
Comment by Sascha Wolf (Horscht) - Saturday, 20 June 2009, 16:51 GMT
I did some re-testing and now narrowed it down to a revision between 21250 and 21260. Which one exactly it is, not yet sure, I'm gonna do further testing. I have not tried to use a WPS without coverart or bmp coverart. I will do as soon as I narrowed it down which revision breaks it.
Comment by Sascha Wolf (Horscht) - Saturday, 20 June 2009, 21:09 GMT
21258 is to blame for these issues, I think. I was not able to reproduce the albumart issue with bmp coverart, that one only happens on jpeg files.
Wether or not the timestrechting option is enabled in the options menu or not doesn't matter, it happens on both.
What I will try next: using a no-aa wps to check if album art is at fault for the playback issues.
Comment by Sascha Wolf (Horscht) - Saturday, 20 June 2009, 22:28 GMT
Addition: it is not related to the wps I am using, nor albumart.
I tried with the rockbox default wps (the realy bland one), and it also froze there.
Comment by Robert Keevil (obo) - Saturday, 20 June 2009, 23:12 GMT
A build using the v7 patch above with the latest SVN (r21404) works fine on my ipod g5 (60gb), including jpeg AA, using both my normal WPS and the default. Horscht - are you using that patch as-is, or do I need to make some changes?
Comment by Sascha Wolf (Horscht) - Sunday, 21 June 2009, 00:50 GMT
I am using the patch as is, allthough i have not yet built a new built (as I have the ipod running a few hours to check wether or not it freezes). I have just tried again with a current build (21440) and it is still happening. I added a screenshot of the album art issue I am having. Please take note, that the issue is at its worst when the player has just been started and I start playback the first time. After that it only happens after re-buffering but also not as bad as in that picture, often only 10 percent of the image (from the bottom) are garbled at that point
Comment by Andree Buschmann (Buschel) - Sunday, 21 June 2009, 10:15 GMT
Horscht, when taking a look at the changed files in r21258 and I cannot see any connection to this patch :/ Maybe you experience some other effect like data corruption on data read from HDD? Are you using any other additional patch like overclocking or such? What is DMA configured to (MAX_UDMA = 0, 1, 2, 3 or 4)?
Comment by Sascha Wolf (Horscht) - Sunday, 21 June 2009, 10:18 GMT
I am not using another patch. Also, it works fine without the patch. Also, the jpegs appear fine in the jpeg viewer, this issue only appears on aa. The first time i experienced that issue, I tried UDMA 1 and 4, both not working.

After that, I left it at default (UDMA 1) which should also be the safer mode.
Comment by Steve Bavin (pondlife) - Sunday, 21 June 2009, 11:58 GMT
Perhaps a bug in r21258 is corrupting things?
Comment by Andree Buschmann (Buschel) - Sunday, 21 June 2009, 13:31 GMT
pondlife, I was trying to figure this out through playing around a bit with dsp.c -- without any result yet. Maybe there is some bug in the buffer handling which was heavily changed in dsp.c?
Comment by Sascha Wolf (Horscht) - Sunday, 21 June 2009, 17:17 GMT
Bushel had me making a few changes that made it working again. I am this involved using the files dsp.c, dsp_arm.S, dsp_cf.S und dsp_asm.h from r21257 and adding 4 lines at the bottom of dsp.c.
As far as I understood, this removed the actual signal processing, but Andree is probably able to explain all this further.

Anyways, these changes fixed the albumart issue and also probably the playback issue. I had the player running for 1.5 hours straight right now without it freezing.
It indeed seems to be an issue with r21258 breaking this patch.
Comment by Sascha Wolf (Horscht) - Sunday, 21 June 2009, 20:40 GMT
I think this issue might only happen with dircache enabled... I wiped my settings (as per Andree's idea) and tried again. It worked fine this time. I then went on and recreated my original settings one after one, testing playback each time. Right after enabling dircache and the obligatory reboot, the albumart went garbled.
Comment by Boris Gjenero (dreamlayers) - Tuesday, 23 June 2009, 03:47 GMT
This may have something to do with the cache. Remember, the patch does flush and invalidate the cache as needed, but it ignores cache line interference. Perhaps some arrangement of memory leads to a cache line interference problem.

Also, it would be useful to check for ATA errors, for example using http://www.rockbox.org/tracker/9721?getfile=18233 .
Comment by Torne Wuff (torne) - Saturday, 27 June 2009, 21:53 GMT
Is the v0.3 buffering alignment patch the one to use with the v0.7 dma patch? Or is it no longer needed?
Comment by Torne Wuff (torne) - Sunday, 28 June 2009, 10:20 GMT
The ata stats patch also no longer applies and there's not enough context in some of the failing hunks for me to figure out where they are supposed to go, so I can't use that to work out if buffer alignment makes a difference either ;)
Comment by Torne Wuff (torne) - Sunday, 28 June 2009, 20:55 GMT
...ok, looks like the DMA patch v0.7 makes my 80gb 5.5g ipod hang, a few seconds after it starts buffering audio. My build is r21529 + the v0.7 patch here +  FS#9721  + FS#9746.
Comment by Torne Wuff (torne) - Sunday, 28 June 2009, 21:07 GMT
Actually, it hangs at pretty much random times, including before starting playback. Never had it fail to boot though. I have dircache+tagcache enabled and probably a million other things that interfere :)
Comment by Torne Wuff (torne) - Thursday, 09 July 2009, 13:37 GMT
OK, I've tried again with nothing but this patch and without my settings. Building r21732 with the v0.7 patch above and no other patches, it seems to work ok using the default settings, but if I turn on dircache, after rebooting album art corruption occurs exactly as Horscht reports. I've not played with it long enough to see if it actually hangs/crashes, but something is definitely going wrong there :)

It seems likely that cache line interference could be responsible.
Comment by Torne Wuff (torne) - Thursday, 09 July 2009, 14:17 GMT
I commented out cache_init() in system-pp502x.c and tried again, and though the player is unusably slow/stuttering while playing music, it does appear to work fine, no album art corruption or crashing. So, I guess someone needs to write the code to take care of cacheline alignment. :)
Comment by Torne Wuff (torne) - Thursday, 09 July 2009, 21:56 GMT
Modifying the patch so that it only does DMA for requests which are fully cacheline aligned also stops the problems from occuring for me, though obviously many requests are then not performed using DMA. I've updated the ata stats patch to work with the current code, and it shows that 5-10% of requests are being done with DMA for the testing I did with the stats turned on. I'm going to keep running a build like this for now, to test its stability, then will look into how to get buffer alignments up so that more requests are DMA'ed.
Comment by Torne Wuff (torne) - Thursday, 09 July 2009, 21:57 GMT
Oops, didn't attach the updated stats patch.
Comment by Torne Wuff (torne) - Saturday, 18 July 2009, 20:19 GMT
OK, I've run my player with the v0.7 DMA patch (no buffer alignment patch yet, one thing at a time) but with the added modification that it only does DMA for requests where the address in memory is aligned to a cacheline at both ends. The patch attached does that (and only that, requires the 0.7 patch first). The result is that only a small proportion of requests are done using DMA, but I can no longer get corrupt album art or crashes with any settings I've tried.

This is easier than fixing the code to actually handle the head/tail of the unaligned area properly. I'm now working on a patch to the buffering code so that it will try to align the majority of requests to cacheline boundaries, but the attached patch could probably do with more testing with people who have different configurations to me. You might want to use my updated ATA stats patch to see how many requests are being done with DMA, but it's optional.
Comment by Torne Wuff (torne) - Monday, 20 July 2009, 14:48 GMT
Slightly improved patch: Only require cacheline alignment for reads from disk, as writes to disk cannot cause cacheline interference, and tidy it up slightly.
Comment by Boris Gjenero (dreamlayers) - Tuesday, 18 August 2009, 18:50 GMT
I thought that ATA code had to read the start and end via PIO when not cache aligned. However, I never thought of an appealing implementation, and anyways, making one ATA command into three ATA commands seems very inefficient. The dma_cacheline_alignment patch seems like a better approach. Higher level code (such as buffering) can cache line align where performance matters. I'm now testing an updated buffer align patch.

The "bytes <= 512" test can be removed from ata_dma_setup(). Single sector DMA transfers work. I had problems with that in the past, but it was apparently due to cache line alignment. Currently, the change wouldn't make much difference because many of the buffers into which single sectors are read are not cache line aligned, but aligning them isn't hard.
Comment by Torne Wuff (torne) - Wednesday, 19 August 2009, 09:07 GMT
Unless I'm totally misunderstanding how ATA works, there is no way to read the start and end via PIO and read the middle via DMA: you're reading whole numbers of sectors, no? So if the start of the buffer isn't aligned it will never be aligned, each read just pushes the pointer on by some multiple of 512..

I've not had a chance to finish looking at the buffering alignment yet.
Comment by Boris Gjenero (dreamlayers) - Wednesday, 19 August 2009, 12:05 GMT
Yes, you have to read whole numbers of sectors, and if the start isn't aligned it will never be aligned. However, cache line interference which is entirely within the current read request ought to be harmless. It's all under your control (you decide when to do the reads, when to flush and when to invalidate). Nothing else should be accessing the area.

The buffering alignment code I'm testing is successfully causing all buffer data to be read via DMA. (The ATA stats show that the number of PIO reads equals the number of sectors read via PIO.) I did have one lockup though, so I want to examine and test things a bit more.
Comment by Corey (crakerjac) - Wednesday, 02 September 2009, 02:38 GMT
Can someone update to the latest SVN?

I am trying to convince myself that this patch decreases the time it takes to refill the music buffer.
Comment by Boris Gjenero (dreamlayers) - Wednesday, 09 September 2009, 17:20 GMT
ata-dma-pp5020-v0.7.patch ( http://www.rockbox.org/tracker/task/9708?getfile=19528 ) applies perfectly to r22644, so I see no reason to update it.

I am attaching two patches:
* dma_cacheline_alignment_v3.patch : In addition to the change in v2, it also permits single sector transfers via DMA. Earlier problems with single sector transfers were due to cache line alignment. (First apply ata-dma-pp5020-v0.7.patch.)
* buffer_cacheline_alignment_v1.patch : a re-done buffer alignment patch which cache line aligns the buffer so buffer data may be read via DMA without cache line alignment concerns. It appears correct, but I feel it needs a bit more examination and testing because I'm not familiar with the buffering code.

If all this works, then hopefully it can finally be committed.

When the buffer is being filled via DMA, it definitely fills faster. For example, see the data and charts at https://spreadsheets.google.com/ccc?key=0AsEpnuZKewlIcGtlUmNmTTBzZzk0OVAzYTVFWVh0ZXc&hl=en
Comment by Torne Wuff (torne) - Thursday, 10 September 2009, 23:10 GMT
Buffering patch looks reasonable to me, building with it and the v3 alignment patch now and will test under normal use for a while. I've updated the ATA stats patch again, both to resync it and to fix a few ifdefs that were missing which prevented bootloaders being built while it was applied.
Comment by Andree Buschmann (Buschel) - Friday, 11 September 2009, 06:06 GMT
Patched, built and updated yesterday evening. Overnight test on 5.5G ran smoothly so far, will test further in normal use.

Edit, 14-09-2009: No issues yet, performed a single runtime test which shows comparable results to former UDMA-builds.
Comment by Torne Wuff (torne) - Friday, 18 September 2009, 11:18 GMT
Been using the ipod like this for a week; atastats shows almost all reads are now done with DMA and I have had no functional issues. Not actually done a battery bench on it. dreamlayers, you could post on the dev ML maybe asking if anyone acquainted with buffering.c is willing to review the buffering alignment patch?
Comment by Andree Buschmann (Buschel) - Friday, 25 September 2009, 18:34 GMT
I am using ata-dma-pp5020-v0.7.patch (UDMA4-configured), dma_cacheline_alignment_v3.patch and buffer_cacheline_alignment_v1.patch for 2 weeks now without any problems. I recommend to submit this patch.

One additional (minor) idea: Can the test_disk plugin also be changed to use cache aligned read access (at least for the "aligned" measurements)? Otherwise we cannot see the positive effect of DMA transfers in this plugin for read access.
Comment by MichaelGiacomelli (saratoga) - Friday, 25 September 2009, 18:50 GMT
We are past the feature freeze so there is no reason not to commit it now. Some instability is expected immediately after a release.
Comment by Torne Wuff (torne) - Wednesday, 30 September 2009, 22:35 GMT
Oh, one thing; the current patches still don't handle the flash remapping, so attempting to dump the rom contents from the debug menu with the DMA patch causes a hang...
Comment by Torne Wuff (torne) - Thursday, 26 November 2009, 12:49 GMT
I have merged reading and writing in ata.c and committed  FS#9721  - write errors should now be detected properly, and will be handled by retrying until a timeout, the same as read errors. (PIO writes are also now done with WRITE MULTIPLE). The ATA stats patch no longer applies, though, and I've not had time to fix it up yet.

I am now testing DMA again; there were several crashes that were fairly reproducible on my ipodvideo, but I have good reason to believe they were all caused by the old patch for  FS#9721  (write errors because of player being subject to shaking, causing the FAT code to panic). I'll see how it goes, but I think we're getting close to committing this one :)

Someone (I forget who) suggested that DMA for single sector transfers is perhaps slower than PIO... is there any chance of someone doing some proper benchmarks of the current PIO code in svn (with WRITE MULTIPLE) against the latest patches here?
Comment by Rosso Maltese (asettico) - Friday, 27 November 2009, 16:24 GMT
First of all, a premise... I don't know DMA in detail. Now the question. :-)

I applied the last 2 patches at http://www.rockbox.org/tracker/task/9708#comment32509 and I never got any problem.
But I can't feel any enhancement in loading dircache and playback buffer. The disk info in debug menu shows UDMA 1 mode.
Can I change something to see the other modes or I'm missing something else?

Thanks for any hint.
Comment by Torne Wuff (torne) - Friday, 27 November 2009, 17:19 GMT
DMA is not a huge performance benefit in most cases, is the reason: the main benefit of DMA is not that it is faster, it's that it uses less CPU, and when you are playing cheap codecs (mp3 and friends) you will probably not notice a difference. You may find the difference more profound if you use a very expensive-to-run codec like APE. There are some battery bench results earlier in this discussion which show a consistent but small improvement in runtime with DMA (of the order of a couple of percent at best).

The disks in these players are not very fast to begin with: PIO is not *innately* slow, it's only slow if the CPU can't keep up with the transfer, and the disks are not quick enough (and the CPU not busy enough) that there is a problem in the usual cases we encounter.

Faster DMA modes are probably not worth using on a stock disk because the standard iPod disk is too slow for it to make any difference; a large upgraded disk could benefit from a faster mode, possibly, but the faster modes require the CPU to be running at a higher speed for DMA to be stable: the loss of power efficiency is probably not worth the speed improvement of having to increase the unboosted clock, or to force boosting...

Did you test DMA on a build at least r23741? If not can you update to at least that revision and try again? I want to make sure write errors are handled correctly; DMA is looking pretty safe, anyway, so we probably want to do some better benchmarks just to prove that it's never any *worse* than PIO, and it can go in...
Comment by Andree Buschmann (Buschel) - Friday, 27 November 2009, 21:51 GMT
Some test results from a build based on r23773 on a 5.5G with 30GB HDD.

For the PIO-build I have disabled HAVE_ATA_DMA in config-ipodvideo.h.
For the UDMA-build I have set ATA_MAX_UDMA to 1 in ata-target.h and removed the dma_cacheline_alignment patch to ensure UDMA is always used.

Remark: The build uses 24MHz normal clock and 100MHz boosted clock. Nevertheless I am sure the results can be used for further assumptions when comparing DMA and PIO.

Edit1: Removed UDMA4-tests (UDMA4 will always boost during transfers) and added UDMA1-tests.
Edit2: Forced UDMA mode for UDMA-tests via removal of dma_cacheline_alignment patch.
Comment by Rosso Maltese (asettico) - Saturday, 28 November 2009, 10:07 GMT
@torne

Thanks for the detailed explanation, now I have a clear idea on what's going on.
My latest build is r23570 plus some patches, this one included, so I haven't tested the new write errors handling.
I hope to do it asap.
Thank you.
Comment by Torne Wuff (torne) - Saturday, 28 November 2009, 16:47 GMT
test_disk isn't a suitable benchmark for this, really: it doesn't align to a large enough size to ensure that DMA is really being used...
Comment by Andree Buschmann (Buschel) - Sunday, 29 November 2009, 10:07 GMT
I have remeasured on a build without the dma_alignment_patch. From my understanding this will force the usage of UDMA mode for test_disk and all read/write actions. The results can be seen above.
It seems like writing with DMA is a bit slower for test_disk's "aligned" test cases. Reading for test_disk's "aligned" test cases is (much) faster.
Comment by Torne Wuff (torne) - Sunday, 29 November 2009, 14:19 GMT
The only reason I can think of why DMA should be slower to write is because writing involves a cache flush, but surely the cache flush isn't *that* slow to make that much difference on a 1MB block...
Comment by Torne Wuff (torne) - Sunday, 29 November 2009, 15:12 GMT
I removed the cache flush (very carefully, because doing it unconditionally made my ipod destroy its root directory) for test_disk and there is little difference in the write speed. Writes are way slower when unboosted even without the cache flush. Boosted, the difference is less, as your benchmarks show, but it's still worse.
Comment by Torne Wuff (torne) - Sunday, 29 November 2009, 15:46 GMT
DMA writes are also slower over USB. Hmm...
Comment by Andree Buschmann (Buschel) - Sunday, 29 November 2009, 18:07 GMT
Well, as reading is way faster with DMA, we could configure the patch to only use DMA for reading and keep PIO for writing. On the other hand I remember that I did real life measurements (transferring large files to the iPod via USB) that clearly showed a large positive impact on writing transfer rate.
Comment by Torne Wuff (torne) - Sunday, 29 November 2009, 18:53 GMT
PIO writing is faster in current svn than at the time this patch was developed: my changes to ata.c to use WRITE MULTIPLE sped it up... so it's possible that's no longer true? My USB testing was not particularly thorough, though; you are encouraged to test whatever you can think of ;)

I'm becoming tempted to commit the patch with DMA for reading only (and leave this FS# open while we investigate writes further) - it appears to be stable in all configs and it's not likely to get tested any *better* until it's in SVN.
Comment by Andree Buschmann (Buschel) - Sunday, 29 November 2009, 19:08 GMT
I agree 100%.
Comment by Rosso Maltese (asettico) - Saturday, 12 December 2009, 16:29 GMT
The dma_cacheline_alignment_v3.patch in http://www.rockbox.org/tracker/task/9708#comment32509 doesn't apply any more with r23944.
I have not the necessary knowledge to align it: please, can anyone help me?
TIA
Comment by Andree Buschmann (Buschel) - Saturday, 12 December 2009, 17:47 GMT
Synched against r23943.

ata-dma-pp5020-v0.8 now includes the dma_cacheline_alignment_v3 patch.
buffer_cacheline_alignment_v2 still needs to be applied seperately.
Comment by Torne Wuff (torne) - Sunday, 20 December 2009, 12:16 GMT
OK, I've experimented at length and DMA writes are a good 20-25% slower than PIO writes under basically all conditions, even if you don't bother to flush the cache. This is bizarre but what're you gonna do. :) The only reason to use DMA for writes would be because it can yield while writing, but that's not a huge advantage at this point I don't think, especially since in LBA28 mode it will never be more than 256 sectors at a time.

I have updated the patch to only bother to do DMA for cache-aligned reads, and have added it to the configuration for every disk-based PP502x target. I have also set it to use UDMA 2 if the clock speed is defined sufficiently high to support it - it will work for your GUI-boost build as well, buschel (automatically changing to UDMA 1). The buffer alignment patch is still seperate as above.

This could do with some testing on all the *other* players; it pretty much certainly works on ipodvideo and probably on the other ipods but I've never seen half these targets ;)
Comment by David Hall (Soap) - Sunday, 20 December 2009, 14:35 GMT
Is there any point on testing this on flash targets? How about flash-modified targets?
Comment by Torne Wuff (torne) - Sunday, 20 December 2009, 17:58 GMT
The file listing for the patch shows which targets it affects. It doesn't matter whether it's a rotating disk or a flash device, as long as it's accessed via the PP502x's ATA controller. ATA flash devices might support different DMA modes, perhaps, but the code should handle any combination (it detects it).
Comment by Rosso Maltese (asettico) - Wednesday, 23 December 2009, 11:34 GMT
The last patch (v9) with the last buffer alignment seem stable (ipodvideo 64MB).

With the v8 version, I noticed a short silence in playback when the buffer starts refilling. I mean not the first time the playback starts, but only when the buffer is almost used and needs to be refilled.
I can't tell if the same behaviour happens with the v9, because I need some hours to use the entire buffer: probably I can tell something more this night, as I will travel for a couple of hundreds kilometres.

A question a little bit OT: before making this test, I had applied also  FS#8668 . With this one, playback during buffering hang RB up. Reading the disk info in the debug menu, I saw UDMA mode 4, not 2. IIRC mode 4 was declared unstable in some previous comment. Could it depend on the CPU frequencies modifications with

#define CPUFREQ_SLEEP 32768
#define CPUFREQ_DEFAULT 24000000
-#define CPUFREQ_NORMAL 30000000
+#define CPUFREQ_NORMAL 24000000
#define CPUFREQ_MAX 80000000
Comment by Torne Wuff (torne) - Friday, 25 December 2009, 14:25 GMT
The v8 patch is wrong and is set to use UDMA 4 which causes boosting, this is probably why you saw problems. The latest patch should work with  FS#8668  fine as I am choong the appropriate DMA mode for the configured clock speed.
Comment by Andree Buschmann (Buschel) - Friday, 01 January 2010, 21:31 GMT
Sorry for the UDMA4 misconfigiguration -- I was using this since ever on my 5.5G 32MB. Nevertheless it is interesting that there are reproducible issues with a 64MB 5G... The described failures seem to be connected to the UDMA4 itself and not the boosting, as the boosting is done independently by the buffering code (the CPU is boosted while buffering).

This made me think of another experiment: Why not disabling the "buffering boost" when DMA is used (encapsulate the boosting/unboosting with #ifndef HAVE_ATA_DMA)? Boris's measurements showed there is no big difference in read speed, very short tests with buffering of a new folder on my iPod showed the same result. We would save several seconds of boosting per hour :o)
Comment by Rosso Maltese (asettico) - Friday, 01 January 2010, 22:50 GMT
I applied the v9 patch (as now without  FS#8668 ): it's stable and I never got a system hang up.
Unfortunately, the short playback pause when buffer refill starts is yet present (with v3 patch the pause did not occur).
Comment by Torne Wuff (torne) - Saturday, 02 January 2010, 12:33 GMT
Not boosting for buffering might indeed work out better; there is indeed no real difference in speed (I have quite a lot of benchmarks of raw sector access now).

Rosso, I've never had that problem; the DMA code itself has not changed at all, the changes I have made are simply to disable using DMA for writes, and to tidy up a few things... there's no reason why it should behave differently with regards to buffering with the seperate patches vs now..
Comment by Andree Buschmann (Buschel) - Saturday, 02 January 2010, 13:05 GMT
Here is the buffer aligment patch v2 + removed boost. My iPod is just performing a stability/runtime test since 13h with this (+unchanged dma_v9 +fs#8668). Until now there were no issues, no dropouts, no hangups. Overall runtime is extrapolating towards 20-21h (which is to be expected).

Edit: Worked stable until battery ran dry after 20:11h.
Comment by Seheon Ryu (cpu98) - Saturday, 02 January 2010, 16:12 GMT
8668+v9+noboost+unrelated changes

My one sometimes stopped buffering and kept disk running while I was browsing.(no seeking)
Buffer wasn't fully filled(alloc:54M/59.6M) while there was more song on dynamic playlist.
Presssing >> or stop(long >) caused lockup.

Another one?
Disconnected from PC, pressed > button, got wps, played nothing, debug shows(pcm:-1342873487somthing like this/529200), selected same song, same happens, selected another song, hung up.
Comment by Torne Wuff (torne) - Saturday, 02 January 2010, 16:24 GMT
Can you test it without patches which change the CPU boost behaviour?
Comment by Seheon Ryu (cpu98) - Saturday, 02 January 2010, 16:24 GMT
torne: Compiling. :)
Comment by Seheon Ryu (cpu98) - Saturday, 02 January 2010, 16:45 GMT
v9+noboost
Stable.
v9+noboost+8668
Rock stable even under crazy commands....

Sorry, maybe "unrelated changes" made it!
I'll report when I experience real problem.
Comment by Seheon Ryu (cpu98) - Saturday, 02 January 2010, 16:49 GMT
v9+noboost+8668(no other changes)
I got real lockup, same as above.
Comment by Andree Buschmann (Buschel) - Saturday, 02 January 2010, 17:23 GMT
Could you please retest with the following configurations:
1) v9 + buffer _cache_alignment_v2
2) v9
3) svn
So, all without the boost-disabling.
Comment by Seheon Ryu (cpu98) - Saturday, 02 January 2010, 17:31 GMT
I confirm 1 is stable(I used them for 7+days).
So 2 and 3 will be stable too.
I'll fall back to v9+v2_noboost(without 8668) and test it.
Comment by Seheon Ryu (cpu98) - Sunday, 03 January 2010, 12:15 GMT
v9+v2_noboost(without 8668)
I had similar problem.
Music stopped->debug shows pcm:128123612481/52????(default value), playing static->selected another song->tags aren't read->hit long >->lockup->shut down(by timeout setting))

Right now I don't know it is due to possible fs corruption caused by previous hard reset.

I'll test v9+buffer_v2(Which I said stable for 7+days).
Comment by Seheon Ryu (cpu98) - Sunday, 03 January 2010, 16:34 GMT
v9+buffer_cacheline_alignment_v2

I had the same issue.
Data on disk weren't corrupt.
My one can be defective.

It seems they happen only while filling buffer.

Now I'll try without buffer_cacheline_alignment.


Comment by Rosso Maltese (asettico) - Monday, 18 January 2010, 15:43 GMT
I'm just using WITH NO ERROR the following configuration:
* r24272
* v9
* buffer_cacheline_alignment_v2_no_boost
* FS#9621
* a modified version of  FS#8668  (based on my comment at http://www.rockbox.org/tracker/task/9708#comment33951), in which I removed the redefinition of NORMAL frequency.

Playback during buffering at start up does not hang up any more.
I can't tell anything about the short silent gap when buffer refilling starts, because I don't encounter it yet.
I will report later.
Comment by Rosso Maltese (asettico) - Tuesday, 19 January 2010, 13:39 GMT
Unfortunately, the playback stops when the buffer refilling starts.
The DAP works, but if I stop (press long play), it hangs up.
Now I will try without  FS#8668 .
Comment by Rosso Maltese (asettico) - Tuesday, 19 January 2010, 16:10 GMT
Without  FS#8668  I always got the short gap in playback, but it doesn't stop any more! :-)
As the last test, I will try the same configuration as in http://www.rockbox.org/tracker/task/9708#comment34247, but with buffer_cacheline_alignment_v2.
Comment by Torne Wuff (torne) - Sunday, 31 January 2010, 11:23 GMT
I've committed the ATA v9 patch, slightly tweaked to have a #define which will turn DMA writes back on (in case anyone wants to experiment with it). I'll commit the buffer cacheline alignment patch in a future commit (want to work on it a little more), and we may want to align some other buffers as well, so I'm leaving this task open.
Comment by Thomas Martitz (kugel.) - Sunday, 31 January 2010, 14:20 GMT
do we need introduce the #ifdef's for aligning buffering? buffering.c isn't really easy to understand as of now, so I would prefer if we could do it without (i.e. align unconditionally on all targets)
Comment by Torne Wuff (torne) - Monday, 01 February 2010, 10:56 GMT
Well, we could just set the alignment to 4 bytes on the targets which don't otherwise care, if the tiny code size increase is better than it being ifdef'ed even more (which seems likely) :)
Comment by Torne Wuff (torne) - Monday, 01 February 2010, 15:49 GMT
Slightly tweaked version of the buffering alignment patch: no #ifdefs, it just defines the mask as 0 for platforms which don't need it. The compiler will *probably* optimise out at least most of the code anyway as a result. Added a couple comments also.
Comment by Torne Wuff (torne) - Monday, 01 February 2010, 17:24 GMT
OK, previous patch committed in r24440 (with better comments in config.h). Virtually all buffering reads should be using DMA on pp502x now! Other disk accesses might not be. I'm going to put together a logf build which can track down the most common sources of unaligned reads, and see if we can align those buffers as well, so again I'm leaving this open :)
Comment by Torne Wuff (torne) - Monday, 01 February 2010, 17:25 GMT
Just to be clear to anyone wanting to test these changes: all the relevant patches above are now in svn, with the exception of the ATA stats functionality (never intended to be committed) and the no-boost buffering code (seems to be unstable? anyone want to test this again?). So, you shouldn't need a custom build to test this.
Comment by Andree Buschmann (Buschel) - Monday, 01 February 2010, 18:29 GMT
Thanks for submitting all relevant patches, Torne. Below I am adding the "unboost"-patch against svn.

Loading...