This is the bug/patch tracker for Rockbox. Click here for more information.
Quick links: Bugs · Patches · Rockbox frontpage
FS#9708 - PP5020 ATA (IDE) DMA
Attached to Project:
Rockbox
Opened by Boris Gjenero (dreamlayers) - Thursday, 25 December 2008, 02:09 GMT+2
Last edited by Torne Wuff (torne) - Tuesday, 23 March 2010, 21:50 GMT+2
Opened by Boris Gjenero (dreamlayers) - Thursday, 25 December 2008, 02:09 GMT+2
Last edited by Torne Wuff (torne) - Tuesday, 23 March 2010, 21:50 GMT+2
|
DetailsI got Ultra DMA working on my 5th generation 30 gig iPod. Mode 2 (33.3 MB/s) seems stable, and it doesn't need CPU frequency boost. Mode 4 (66.7 MB/s) is the highest supported, but modes above 2 require CPU boosting.
The main benefit I can see from DMA is somewhat faster buffer filling. For example, with one large unfragmented FLAC file and the FS#9621 FAT read-ahead patch, the initial load is about twice as fast with DMA. Test_disk says 1 meg reads are 14102 KB/s and writes aren't sped up. I suspect the reported read speed is from the drive's cache, not the media. (BTW. Why are PIO reads slower than PIO writes in test_disk?) The main cost of DMA is having to flush and invalidate the cache. I note that pp5020.h defines a CACHE_FLUSH_MASK, which makes me think it may be possible to only affect part of the cache. However, I don't know how to use that yet. While implementing this I significantly altered firmware/drivers/ata.c. I merged the read and write functions into a transfer function which can do both reads and writes with both PIO and DMA. This removes some code duplication and IMHO enhances writes by using "write multiple" and doing retries if needed. I plan to make other changes to ata.c and add interrupt support. However, I feel the right thing to do is to redo the patch with minimum modifications to ata.c. The ata.c changes can become another, separate patch. Due to interest shown on the developer mailing list, I am submitting this preliminary patch so I can get it out ASAP. I know it needs to be cleaned up, and it won't compile correctly for other targets, and things like detection of DMA modes need to be added. I'm sorry if this bothers you. I will release a cleaned up patch. Note that messing with disk-related code can lead to filesystem corruption, losing all of your files, and having to restore your iPod. If you're concerned about this, you may want to at first disable writes at "//if (write)" in ata.c. I didn't have any problems, the test_disk write verify test passed, single sector transfers (like the FAT) are still done via PIO, and Ultra DMA CRC-checks DMA bursts. |
This task depends upon
Closed by Torne Wuff (torne)
Tuesday, 23 March 2010, 21:50 GMT+2
Reason for closing: Fixed
Additional comments about closing: submitted in r24405
Tuesday, 23 March 2010, 21:50 GMT+2
Reason for closing: Fixed
Additional comments about closing: submitted in r24405
1st test: start to play an album (>80MB)
svn: ~10s
+fs#9708: ~7s
+fs#9708+fs#9621: ~7s
2nd test: start to play an large unfragmented track (>60MB)
svn: ~10s
+fs#9708: ~7s
+fs#9708+fs#9621: ~6s
The test_disk plugin shows:
read (4K, aligned): 7.653 / 12.171 KB/s (unboosted 24MHz / boosted 100MHz)
read (1M, aligned): 14.098 / 14.077 KB/s (unboosted 24MHz / boosted 100MHz)
Speed of writes is similar to svn.
I'll battery bench the fully patched version over night now -- let's see whether the theory of missing DMA being one of the battery eaters on iPods is true :-)
Currently I'm not using official bootloader but few weeks old svn bootloader.
(svn or 3.0) bootloader + (9621+9708) bin = 'no partition found'
(9621+9708) bootloader + any bin = stuck at apple logo
When using this patch I have some issues:
1. Sometimes the default font is loaded and not the theme's font.
2. Loading a plugin for the first time fails (tested with battery_bench and calculator). When re-trying to load the same plugin it works.
Bad thing:
Performing a battery bench is not possible right now. With r19589 the playback stops after 1h 15min. This is reproducible with the downloaded build.
FS#9621 works fine.
firmware/export/: DMA is enabled on the video iPod by defining HAVE_ATA_DMA in config-ipodvideo.h. If you want to test the code on another PP502x device, add that to your config file and report the results here.
debug_menu.c: Code to display DMA modes supported by the device was already present. I enabled it and changed it to check word 88 validity and check for UDMA 6 (133 MB/s). I also added a line to show what DMA mode is currently enabled. (Go to the "View disk info" debug screen for this.)
ata.c: The fastest DMA mode supported by both the device and the controller is used. PIO is used if DMA is unavailable. Some devices have design flaws and fail at modes which they claim to support (eg. http://www.linuxdevices.com/articles/AT5102023409.html ).
ata-pp5020.c: DMA compatible initialization and code for working with the DMA controller. This is the PP502x specific stuff.
ata-target.h: If you want UDMA 4, define ATA_UDMA_MAX to 4 here. I don't see any benefit however. Note that this causes boosting, so it's unfair to compare it to UDMA 2 without boosting.
About speed:
- Since boosting does not increase speed of 1 MB reads I assume CPU speed doesn't slow down UDMA reads. However, CPU speed limits what modes work reliably. UDMA 4 is unstable at 30 Mhz and UDMA 2 is unstable at 24 Mhz.
- UDMA writes are slower than PIO writes at 30 Mhz. At 80 Mhz, both are much faster and comparable in speed. (Should there be boosting or automatic picking of the fastest alternative?)
- If the address is not 32-bit aligned or the transfer is just one sector, DMA is not used.
- The Rockbox 0x10 PIO timing is faster than the PIO timings I have, so I'm not changing it.
In the first patch I flushed the cache after a read instead of invalidating it. That may have something to do with the plugin and font issues. I also put the DMA transfer setup before the timeout check, which would probably cause problems in case of a timeout. These issues have been fixed in the second patch.
Regarding FS#9621: the only thing that patch has in common with this one is that it also tries to optimize disk access to speed up buffer filling. The two patches affect different things and they do not depend on each other. If Rockbox doesn't run correctly with this patch, adding 9621 almost certainly won't help.
Can you please make sure that on the 80gb transfers work properly after the drive has been asleep? Do they become much slower? If I set phys_sector_mult to 2 on mine, the difference is very visible when loading plugins.
I'll try "possible_80G_workaround.patch" and report again.
I'll upload test disk result after do test.
FS#9708+ FS#9621 (+FS#8668).Total runtime is 14:37 (90-10: 12:30). Typical runtime achieved before was 14:22 (90-10: 12:18), but I need to retest svn (+
FS#8668) withouth your patches.Current savings calculated from this runtime are ~0.5mA. Which is less than I have expected.
Note1: This is an iPod 5.5G 30Gb, playing mpc. The default clock is 24MHz.
Note2: Boris, you have told the DMA transfer is unstable at 24Mhz. How does this instability show up, retries of read access while buffering?
Note3: I will first battery bench svn (+
FS#8668) to have a clean test, afterwards I will test your actual patch with boosted DMA.Question: Does anybody of you have the problems I have described above (error on loading battery bench or calculator plugin the first time)?
I don't have problems Buschel mentioned.
By the way CPU_FREQ in config-ipodvideo.h looks like for coldfire targets,
though that is likely to be ignored.
Patched bootloader says no partition found. (of course 3.0 bootloader works)
It doesn't stuck at apple logo now.
Andree, thanks for performing a battery bench. It's good to see that there is at least some savings.
If the CPU frequency isn't fast enough for the UDMA mode, I've seen various things happen. Sometimes it's just a lockup or read error. I've also been told that a plugin is for the wrong model. This makes me wonder if unreported errors occur. I imagine the data might get across the interface correctly but then the buffer holding data to be written to memory overflows. A flow control method is defined for UDMA, but the way 1 meg read speed doesn't change with CPU speed makes me wonder if it is works correctly. Note that music playback is not a valid test for unboosted disk access because the buffering code boosts while filling the buffer.
Regarding the bootloader, sorry, I forgot about that. The issues were: DRAM isn't remapped to 0 and the CPU runs at 24 Mhz. I'm including a small patch. UDMA1 seems to work at 24 Mhz. I'm not sure if cpu_boost works in the bootloader.
The refactored ata.c checked for errors, and when putting my code in ata.c I just assumed that wait_for_end_of_transfer() returned zero if something is wrong. Actually, it doesn't. Oh, and when reads fail because of an overly low CPU frequency, I was getting:
ATA_STATUS == ( STATUS_RDY | 0x10 |STATUS_ERR ) (0x10 is "na" in the spec, so just ignore that)
ATA_ERROR == ( ICRC | ABRT ) (UDMA CRC error, command aborted)
When I saw this I looked looked at ata_write_sectors and found that write errors aren't checked, even in SVN without any patches! This should probably be submitted as a separate bug. I'll deal with that tomorrow.
I'm attaching a fix for error checking, currently only changing things if DMA support is compiled in.
Maybe it's another xxxx-bytes sector problem?
I can't test any device except 5.5G 80GB.
The battery bench of r19589 (+
FS#8668) showed nearly identical runtime compared to the patched version (90-10: 12:24h). This leads to the assumption that there is no real current saving.Question: Do you explicitly enable the DMA-controller somewhere? If so: Can it it be disabled after transfer and re-enabled before the next transfer? Maybe an enabled controller eats up some mA?
I have compiled and tested your second version of the patch combined with the error checking fix in two versions: UDMA 2 and UDMA 4. Both work, but the UDMA 2 version still shows an error when trying to load the battery bench plugin the first time -- the UDMA 4 worked flawless until now. As I am also using
FS#8668which boosts the (with timeout) CPU after using the any button: Is it possible that boosting/unbossting or the CPU-delays between both clocks are the cause of DMA problems? This may also happen during normal playback and buffering. The UDMA 4 version boost during the whole DMA transfer, so it is not "disturbed" by re-clocking...Update: The UDMA 4 version crashed with "data error" after some hours of playback.
I added
FS#8668and tested. If the hard drive is spun down, plugin load fails, and if it's spinning, plugin load succeeds. I think it fails if button boost time fromFS#8668expires (due to spinup) and the read is attempted at 24 Mhz. To see if frequency changing causes problems, I removed 8668, made ata_wait_intrq keep toggling boost (30 to 80 and back) and disabled retries. I did get to a state once where disk access didn't work anymore until I restarted, but other than that I did a lot of plugin and image loading and it all worked.I'm not sure what should be my top priority regarding this patch now. I'd like to fix the problem that happens if DMA is the first read after the drive has been off or asleep.
the information about PCF ADC ch7 showing the battery current is new to me. Do you have a patch at hand? I would like to check with some current relevant settings.
It seems you are right with your assumption regarding
FS#8668and the plugin load. Boosting the CPU manually and then load the plugin always seem to work.UDMA Mode 0: I have changed your patch to simply use UDMA Mode 0. It works with 24 and 100MHz and does not show any problems loading the pluging while CPU is unboosted. I have attached the disk-speeds for 24/100MHz. You will see that boosted UDMA 0 is nearly as fast as boosted UDMA 2.
Update: UDMA 0 crashed during battery bench...
Configuration for 24MHz: In future there might be some kind of gui boost in svn (there are several talks about it). So, the default clock will be most likely lowered to 24MHz. In this case the DMA transfer should be either limited to UDMA 0 or at least use boosting for all modes >0, right?
At least on my system, windows XP, with Rockbox SVN, it does what you mentioned for a couple minutes, but in my case, if I leave it, the drive will mount. With your patch, I don't get the bursts of disk access, neither does the drive mount.
Regarding the bootloader, if it doesn't work for you, here are two things to try:
- Run it from within Rockbox. (Put bootloader-ipodvideo.ipod on the drive as a file and play it. Rockbox should start up again.)
- Decrease ATA_MAX_UDMA for the bootloader to 0 in firmware/target/arm/ata-target.h. (Note that there are separate bootloader and normal settings.)
I'm attaching a new version of the patch. It's a collection of all the fixes above, except with a proper fix instead of the 80G workaround. I've tested it on my 30G iPod both with phys_sector_mult set to 1 (which is normal) and with it forced to 2 (which should be like on the 80G). Note that with phys_sector_mult > 1, all aligned transfers are done via DMA. This includes the FAT. If you'd rather have FAT transfers via PIO, you can add the 80G workaround.
On my system, USE_ROCKBOX_USB works with the last patch you uploaded. It's still slow, but it's slow in SVN when mounting, so it has nothing to do with your patch.
Nice job on the patch so far.
When using your latest patch on the 80 GB ipod, at least on mine, when playing music, after it's done buffering, the disk never spins down until I stop the music. This did not occur with the previous patch, that I'm aware of.
Thanks
FS#9739.In ata.c I see a possibility for the disk to not spin down if setup commands or all data transfer commands after a spinup fail. I don't think this is what is going on however.
Anybody available to battery bench with high load codecs like AAC or APE?
I'll write some code to gather statistics. Those may help explain what is going on.
FS#9708(v0.2 UDMA1) +FS#9749+ FS#9746 +FS#8668attached. The last bench I did (same as this withoutFS#9708) lead to the assumption that neitherFS#9749nor FS#9746 had an impact to the runtime.So, there are two conclusions after this latest bench (total runtime 14:45h, 90-10: 12:36h):
1. It ran stable over >14:45h
2. UDMA seems to save at least some power (~0.5mA)
I'm attaching my current statistics code. Results are accessible via the debug menu. Some things may be viewed in "View disk info". All stats may be dumped to a file or reset via corresponding debug menu entries. The patch requires the v0.2 DMA patch, but it should also build with HAVE_ATA_DMA undefined. Time values are in microseconds from USEC_TIMER.
I'm also attaching a simple apps/buffering.c patch which I made now to change alignment. All I can say about it is that it works for me with MP3 now and all PIO reads are single sector reads now. Otherwise it is untested.
Andree, thanks for the battery bench results.
In the buffering code, the initial file-to-memory alignment can be set in two places: bufopen when opening a new file and reset_handle when seeking. Additional reads are sequential on disk and in memory, so alignment is preserved. The buffer start is aligned and buffer length is a multiple of 4, so wrapping should preserve alignment. If all these things are correct, all buffer reads should be properly aligned for DMA.
Reads of file data when buffering a file are of BUFFERING_DEFAULT_FILECHUNK size unless reaching the buffer wrapping point or end of file. So those reads can only be PIO when only one sector needs to be transferred in those cases. However, remember that there are also other single sector reads such as FAT and audio metadata reads.
FS#9708(v0.2 UDMA1 + v0.1 alignment) +FS#9749+ FS#9746 +FS#8668attached.Total runtime: 14:44h (90-10: 12:42h). The software also ran stable again, the 90-10 runtime slightly improved -- I am not sure whether the improvement is within measurement precision.
Boris, could you already compare the HDD spin-time of svn against patched version when performing battery benchs or longtime runs?
PIO: v0.2 DMA patch with HAVE_ATA_DMA undefined. Buffering is unmodified.
DMA,al: v0.2 DMA patch and align_buffer-v0.2.patch.
DMA,al,ch: v0.2 DMA patch, align_buffer-v0.2.patch, plus buffer filling in 0x100 sector chunks with no yields between blocks (because DMA yields).
It's all done on my 30 GB 5th generation iPod. H2BH8, Gravity and EP7 are shorthand for albums. H2BH8 and Gravity consist of multiple MP3 files and EP7 is a single FLAC file. Data is collected via the ATA statistics patch, with a few modifications to collect more data. PIO was done with the DMA patch included because the statistics patch depends on it. It shouldn't mess up the results because it's almost entirely not compiled due to HAVE_ATA_DMA not being defined. To really call this statistics, I'd probably have to collect more data. However, I think what I have now pretty clearly shows what's going on.
(BTW. It's interesting that the EP7 initial load PIO time includes a lot more unused time than other tests. I ran it a few times and got very similar results. I wonder if that's a bug. I think I've observed that happening occasionally with older unmodified versions of Rockbox.)
Do I understand the "Setup time" as the spin-up time corectly? It nearly matches the setup time of the HDD's data sheet (3s from standby).
Another question: There is difference of ~4-5s per buffer access between sum of all PIO+DMA transfers and the spinning time. Is this because of the spin down delay? Is it configured to 4-5s on your iPod?
I am still not sure why the impact on battery runtime is that low... HDD's data sheet sais power consumption is near 1000mW while spinning. This equals about 300mA of current. If I adapt your measurements to my tests we will save about 4-5s per buffer access. My test suite has 3 buffer access per hour -> 3*4-5 = 12s of 3600s -> 1-1,25mA average savings. From my battery benchs I can see half of this gain. Maybe the HDD uses less power?
Setup time includes all time from ide_power_enable(true) to when the HD is ready for its first data transfer command. During this time the hard drive spins up and a bunch of commands are executed to set up parameters (freeze lock, set transfer modes, etc.).
My spindown delay is configured to 5s. However, that ought to be irrelevant because the buffer filling code does storage_sleep(). Except for the anomalous but repeatable EP7 PIO result, I don't see 4-5s of unused time. I see 1 to 1.5 s without DMA and 0.75 to 1 s with DMA plus buffer filling in 0x100 sector chunks with no yields between blocks. I guess you may have been misled by how the spinning time also includes setup time and sleep command time.
According to http://www.toshibastorage.com/main.aspx?Path=StorageSolutions/1.8-inchHardDiskDrives/MK3008GAL/MK3008GALSpecifications , the MK3008GAL uses 1.0 W when reading or writing but only 0.4 W when idle. Here's what I see at the battery debug screen with
FS#9728and a battery voltage between 4.0V and 4.1V:HD off, playback stopped: 30-33 mA
HD off, playing MP3: 34 mA (Is this more than it should be? Why such a small difference compared to above? Is it just the debug screen wasting power?)
Spinning, heads loaded: ~260 mA for a few seconds, then ~120 mA
After a while there's a head unload click and current goes down to: ~100 mA
Sleeping but still powered: 70 mA
I think the highest number is the relevant one because I don't think the HD gets enough idle time to decrease current. So, if you believe
FS#9728, it seems the MK3008GAL is using 230 mA of battery current. (This would mean more current supplied to the HD at 3.3 V if the DC-DC converter powering it has decent efficiency.)An hour of 34 mA is 34 mAh and 12s of 230 mA is just 0.77 mAh. Plus the HD uses additional power for spinups, seeks and other operations, and the patch wouldn't affect that.
Where do your milliamp numbers come from?
MP3 runs almost completely unboosted, so the only additional draw is whatever is required to power up the DAC and the tiny increase in going from 24mhz (idle) to 30mhz (unboosted).
I was thinking the CPU would be doing more work, but I guess it's always 100% active? What about the COP, is it not sleeping normally? (Sorry about the off-topic questions.)
I don't think 24 MHz is ever set during normal operation. It would only be set if cpu_idle_mode(true) was called, and that is never called (I only see it in radio.c and in code which applies to other targets in usb.c.). If 24 MHz was set, UDMA2 reads would fail and UDMA1 or boosting for DMA would be necessary.
Does the HDD consume the same power as for reading while it is spinning up? I would expect more power to be needed.
If we would present the different steps for each buffering cycle, would it look like this?
Step Duration mA mAh
Setup 3.0s 230 0.19
Buffering 3.0s 230 0.19
Idle spin 5.0s 90 0.13
Total 0.51 mAh per buffer cycle
With PIO the buffering takes ~7s, total energy consumption is 0.76 mAh per buffer cycle then.
Where do Ih have the numbers from:
230 mA = 260-30mA
90 mA = 120-30mA
Setup time = 3s (from table)
Buffering = all DMA+PIO transfer time / 4 (whole ablum play, Gravity)
Idle spin = from your settings
Did I get the numbers right from your table?
The COP sleeps normally, but the difference is very small between sleeping and full load: http://www.duke.edu/~mgg6/rockbox/30mhz-2.png
These are taken on PP5024 but should be similar.
>I don't think 24 MHz is ever set during normal operation.
I assumed it dropped back to that when not playing but I never looked. In that case the couple mA is probably the DAC and the extra CPU load.
Setup: 2.8 s, power usage unknown. I see a brief peak of 400 mA and some time at >300 mA. MK3008GAL datasheet says 1.8 W.
Usable time: Time as measured, power usage >=230 mA.
Sleep: 0.7 s, power usage unknown. I see a peak of >300mA when unloading the heads.
Power off timeout: 2 s, 40 mA
Regarding the usable time, I don't know the exact sequence there, just the totals.
Idle time: except in the anomalous EP7-PIO result, there is well under 2 s of idle time. I think it's safe to assume that current is 230 mA because there is no activity to increase it and not enough idle time to decrease it. It takes more than 2 s for it to go down to 90 mA.
Reads, writes, seeks: > 230 mA. I've seen > 300 mA for a significant length of time even.
Yes, the current measurements are incomplete. In order to get accurate measurements, continuous sampling is needed, not just a measurement here and there. It might be possible to do this by making the COP fill a buffer with 4066_ISTAT and USEC_TIMER readings while the other core does normal work. It would definitely be possible to do the measurements with a digital storage oscilloscope or other ADC hardware with a computer interface (but I don't have any of this). The data could be graphed and integration could be used to figure out the amount of power used. Is this really important?
Maybe a developer with more knowledge of and experience in the ata-part can review this patch?
http://picasaweb.google.com/boris.gjenero/RockboxRelated#5297270003826303634
http://picasaweb.google.com/boris.gjenero/RockboxRelated#5297270005596450130
I'm not sure if this is useful. I was thinking of adding more filtering, using it to AM modulate an audio frequency carrier, and recording that to WAV on a different device, but that might be overkill.
- chunk size set to 256KB in alignment patch
- maximum UDMA mode set to 1 to have stable functionality even with 24MHz clock
Regarding BUFFERING_DEFAULT_FILECHUNK: FAT code transfers up to 128KB at a time, doing multiple transfers if more is requested (or if data is fragmented). Also buffering code stops filling the buffer when a full chunk won't fit, so increasing chunk size may decrease the average amount of buffer space that is filled. I am attaching a small patch on top of the 0.3 patches above which enables 1 meg blocks and fills the whole buffer, as an experiment. I don't know if it is beneficial.
Technical explanation: ATA transfers are done in 512 byte blocks called logical sectors. (Physical sector size may differ, eg. on the 80G iPod.) LBA28 commands only allow an 8 bit number of logical sectors, and 0 means 256, so the maximum transfer size is 256 * 512 = 128KB. LBA48 commands allow a 16 bit value, or up to 65536 * 512 = 32MB. However, DMA hangs if the number of sectors is greater than 2048, so the limit is 2048 * 512 = 1MB.
So, how do I solve the cache line alignment issue? Do I do 3 ATA commands, a PIO read at the start, a DMA read for the middle, and a PIO read at the end? Do I shut off the cache in the affected area during DMA?
Nothing should be looking at the buffer anyway. If something must probe the buffer during that time, it should be done from the physical address.
cpucache_invalidate does writeback first so there's no need to worry about performing that. cpucache_flush before writes and cpucache_invalidate before reads is fine.
EDIT: Make response more coherent.
On my ipod mini 2G with a 32GB transcend 133x CF card, USB read speed goes from 3.2 MB/s to 7.2 MB/s. Write speed goes from 3.0 MB/s to 2.8 MB/s (obviously I added the HAVE_ATA_DMA define there too)
The USB code does 16K reads and writes. I suspect that this is the limiting factor on the video (but that's a bit hard to test without rewriting lots of USB mass storage code)
read write
mini 2G 32GB 133X CF
no dma 3.2 3.0
udma1 7.2 2.8
udma2 8.2 3.5
udma3 8.6 3.6
udma4 9.6 3.6
video 5G 30GB
no dma 5.0 8.3
udma1 8.3 8.9
udma2 8.9 8.9
udma3 8.9 8.9
udma4 8.9 8.9
For disk time measurements done with an earlier version of the patch, see the album "gravity" in http://spreadsheets.google.com/ccc?key=pkeRcfM0sg949P3a5EYXtew . (You can go to the bottom of the spreadsheet and click on the "Whole album chart" tab.)
I'm not impressed. Because the shutdown happened at 20% with DMA vs. 15% without, the time was shorter. If line up the start of the graphs (see .png), the difference very small, similar to FS#9746.
I guess if there's any reason for this patch now it'd be faster USB transfers. So, what should be the future of this patch? Is it important or not? What work needs to be done?
Nevertheless I am not too deep into the area of ata driver. That's why I hope that some of the other experts will find some time to review the code, maybe make some changes and submit this patch.
Good work so far!
Another thing I noticed is addition to obtain the physical address but what about IRAM buffers or addresses already physical (mpegplayer uses an uncached disk buffer)? Addresses in those ranges should have no transformation applied.
Uncached RAM and IRAM should be handled properly in this patch, though I still need to test DMA to/from IRAM. Flash remapping isn't handled however, and so an attempt to dump the flash will hang. Rockbox needs functions which convert between physical and virtual addresses.
Regarding cache handling, if i wanted to do it efficiently, I'd have to handle it in fat.c and file.c. (In ata.c you don't know if the overlap is with another part of the same read or other data.) Should I just do it in a simpler way in ata.c so that less platform-independent code is changed? I'm a bit concerned about changing too much stuff to accomodate DMA and because of that ruining chances for acceptance of this patch.
When BUFFERING_DEFAULT_FILECHUNK is increased, if you resume playback near the end of a track that can lead to looping audio and playback continuing past the end. This also happens without any of the patches here. See:
http://www.rockbox.org/tracker/task/8206#comment28813
(Sorry, I didn't mean to include the patch in that comment. I don't think it works right. See the comment.)
Edit: remove bit I'm sure was plain wrong (name confusion)
I agree that ata_dma_setup should get virtual addresses. I just thought it should use something external (eg. a function in system-pp502x.c or a define in system-target.h) to translate them to physical addresses. If the translation is hard-coded in ata_dma_setup, that may cause problems in the future.
I wish I could do cache range operations on the PP502x, but I've never seen functions for that. One day I'll read through http://daniel.haxx.se/sansa/memory_controller.txt and see what I can do.
My thoughts really aren't on playback usage alone, btw. It should work in all cases and I wouldn't be too shy about making it work in general. I'm not sure yet but alignment may be immaterial to whether DMA can be used and only the cache alignment matters for imx31.
I'll just keep working independently over there from now on since I have enough to get through already for the beast.
There is no "standard" way I know of to do it with the flexibility of the ARM MMU on other chips. Every retailos code I've disassembled just does a full op. One possibility is to flush/invalidate by index into CACHE_FLUSH/CACHE_INVALIDATE register arrays if the cached address is in the range but I don't know if that would have side effects by just touching it. The mask and match method doesn't seem quite flexible enough (that wasn't known to work for sure anyway).
Regarding alignment, I said alignment of sector boundaries is what matters. When a read happens at the file level, both the current file offset and memory address matter. If the file offset isn't at the start of a sector, file-level code will take care of that, doing a single sector transfer into a separate buffer if necessary. Then the rest of the transfer starts at the offset of the first whole sector. This is the offset which must be word aligned for DMA to be possible on PP502x. If it's also cache line aligned, that's good. If it's not cache line aligned, but the rest of the cache line is used for partial sector data mentioned earlier, that's also good. If any data outside of what is being read shares the cache line, that requires special handling. At the end the same concerns apply. Oh, and if the file is fragmented or the request is large, FAT level code will split up the request into multiple ATA level calls.
Yes, I would also like to make this work in general. For example the database or some plugins may benefit. The things that really need absolute maximum speed can maybe care about cache line alignment themselves, but DMA should work regardless.
Edit: typo
There are in fact no restrictions whatsoever on the DMA engine (the SDMA core is involved by chip design). Trying to handle cache issues in a generic way would be a big negative for this processor since I can do it all within ata-imx31.c by scattering.
Edit: Fix driver/drive
It would be easy to move the cache flush and invalidate into ata_dma_setup() and do away with the "#ifndef HAVE_DMA_CACHE_COHERENCY" code in ata.c.
I guess it's possible to read the start and end of a transfer via PIO from ata_dma_setup() and ata_dma_finish(). That is sometimes not needed, because ata_read_sectors() calls can be for data in the middle of a read() call, where the cache lines are only shared with other data from the same read() call. However, I guess that's not too bad.
Timings are for test_disk on Gigabeat S.
Edit: Attach speed test with optimized writing enabled for comparison.
PS. I mucked up enabling it for the S bootloader and had a nice fast restore of my backed-up files. :)
USB wite (PC->H10) spped is 9.4MB/s (about 1.5MB/s faster then using OF), and USB read speed is 8MB/s (OF attains 14-18MB/s).
I attach test_disk results too.
I see that the common portion got committed. Removing
FS#9721code causes a problem: no ATA_STATUS error checking after DMA reads. PIO reads do their own error check, but DMA reads relied on wait_for_end_of_transfer(). There ought to be an error check after DMA reads. While DMA code is undergoing development, I think it is a good idea to also check write errors, but once the code is known to be stable, that issue may be left toFS#9721.BTW I don't like how a single write failure can lead to a panic with the current
FS#9721patch. I think the best solution is to retry failed writes like reads are currently retried, and the best way to do that is to merge the read and write functions similarly to 19581-dma-preview.patch. Any comments on this idea are welcome on the 9721 page.How does RB crash on the H10? Is any message displayed? If the drive automatically unloads the heads after a period of inactivity, did that happen?
I have tested this once more using current using clean r20316 and with ata-dma-pp5020-v0.6 and
FS#10015patches applied and RB worked stable during USB writes.FS#10015related findings toFS#10015?Attached are the results for 24MHz UDMA4 (01) and 100MHz UDMA4 (02).
We should re-test UDMA4 for other targets as well as this shows a measurable positive effect on the USB transfer rate (reading from target).
Edit: Just saw that I talked nonsense, UDMA4 boosts the CPU. So, it is not really working with 24MHz. Nevertheless UDMA4 configuration worked stable for >10 days now.
Setting UDMA1 and 80MHz results in the best performance. Interestingly higher UDMA modes not only don't give better transfer rates but significantly lower.
Theoretically 1.8" HDDs have transfer rates below 33MB/s (Toshiba HDDs up to 80GB have, as far as I remember, about 25MB/s read speed) so UDMA2 and higher modes aren't justified and shouldn't give any significant speed improvement.
In opposite to the higher UDMA modes higher CPU frequency makes difference.
I once tried changing 0x600060C4 and the 50 Mhz bit to see if there is a difference in power consumption and to see if I could do UDMA4 at 30 Mhz. I saw no difference in power consumption and I couldn't get UDMA4 to work at 30 Mhz. (0x600060C4 is set to zero at 24 Mhz.)
I don't know what's going on with the ">65MHz"-bit (IDE0_CFG & 0x10000000). It's cleared at both 80 Mhz and 24 Mhz. It is only set in a branch for all other frequencies. That branch doesn't change 0x600060C4 and it uses 0x11C1 for PIO timing and 0x80003261 for DMA timing regardless of mode.
@80MHz:
0x600060C4 = 0xC0000000;
IDE0_CFG &= ~0x10000000;
IDE0_CFG |= 0x20000000;
PIO is untouched
DMA = depends upon DMA mode, did you also get the timings via re-engineering (based upon 80MHz?)
@24MHz:
0x600060C4 = 0x0;
IDE0_CFG &= ~0x10000000;
IDE0_CFG |= 0x20000000;
PIO is untouched
DMA is same as for 80MHz
other (this was not examined in DMA context but in context of clock setting, right?):
0x600060C4 is untouched (what is the default?)
IDE0_CFG |= 0x10000000;
IDE0_CFG is untouched
PIO = 0x11C1 (but 0x10 is faster)
DMA = 0x80003261; (for all DMA modes, is similar to UDMA3 setting)
An easy way to avoid clock changes during DMA transfer is of course to boost the CPU (as you do for UDMA > 2). Why not boosting for all UDMA modes? This equals the timings and settings you have examined from the code. :
Question: How often does the code boost/unboost right now (e.g. when loading 1 MB of data). I remember that each boost/unboost costs some hundred microseconds.
Btw, it is interesting that the last three digits of the DMA setting (e.g. 0x271 for UDMA3) are highly correlated to the maximum UDMA transfer rate: N / 0x7C1 ~= 16 (MB/s for UDMA0), N / 0x491 ~= 25 (MB/s for UDMA1), and so on (for N=30000) First and only exception is the setting for UDMA4 (which would lead to 265 MB/s). Coincidence?
IDE0_CFG & 0x20000000 is the >= 50 Mhz bit, so it would be set at 80 Mhz and cleared at 24 Mhz. In the third case it is set if an only if the frequency is 50 Mhz or more.
At 80 and 24 Mhz, PIO timings are set from the corresponding table. The 80 Mhz table is in a comment in ata-pp5020.c. The 24 Mhz table is identical except for mode 0, where the timing value is 0 at 24 Mhz. However, it seems they are independent of DMA and they are slower than 0x10, so I don't set them.
The DMA timing is set from the tm_udma table at 80 Mhz only. It is set to zero at 24 Mhz. (At other frequencies, it is 0x80003261, as you wrote.)
I never tested the third third set of values. It's weird how those values don't depend on what the device claims to support. I wonder if they're for the Nano, but that's pure speculation.
None of this was found in a CPU clock setting context so far; it is part of IDE initialization code. I will do more research.
If boosting/unboosting was done, there would be one boost/unboost per ATA command. I can't answer in general regarding loading 1 MB. If loading the LBA28 maximum of 256 sectors at a time, that would be 8 times. If loading into the buffer, due to BUFFERING_DEFAULT_FILECHUNK it's 32 times, although the CPU stays boosted while filling the buffer and the cpu_boost calls just increment and decrement the boost counter. It's also possible to load 1 MB in one LBA48 command.
I thought that the timing registers contain multiple fields. I don't think a single number would be useful to the hardware because there are multiple timing parameters, and they don't all scale in similar ways. Plus that UDMA4 exception, which isn't a typo, is a counter-example. With DMA timings, I can draw a connection between some timing parameters and individual digits of the timing value, but that could just be a coincidence.
* When DMA is fast enough, MK3008GAL read speed varies from 20.5 MB/s at the start to 9.7 MB/s at the end of the disk. I assume this is read speed off the media.
* When DMA isn't fast enough, there is a very slight increase in read speed until DMA becomes fast enough. Then speed follows the former curve. I assume the slight increase during the first part is due to decreasing DMA flow control overhead.
* UDMA 1-4 are virtually identical. UDMA 0 limits read speed. MDMA 2 is even slower.
* 30 Mhz and 80 Mhz are virtually identical.
* Reading in 1 meg chunks (with 1_meg_blocks_experiment.patch) instead of 128 k blocks makes virtually no difference.
* 24 Mhz slows down UDMA 1 by less than 2 MB/s in a small area near the start of the disk.
1.) As we see a connection from the clock to the transfer rates for with the test_disk-plugin: Is this an effect of faster copying within copy_read_sectors() and copy_write_sectors() in ata.c?
2.) So, we should really simply use MAX_UDMA 1 as configuration then (as it works stable for all clocks)?
3.) If 1) is true, is there a way to avoid the copying?
I was thinking about how I blamed the "slight increase in read speed until DMA becomes fast enough" on "DMA flow control overhead". This is wrong. Slower DMA modes have slower transfer rates, and there would be less flow control. I think the issue is what happens when the drive cannot send data as fast as it can read read off the media. The drive can continue filling its buffer. Even if the current command ended, the drive can read ahead (Rockbox enables that feature). However, eventually the buffer is full or the drive decides it has read ahead enough and reading stops. To restart reading, there is rotational latency: the drive needs to wait for the proper spot to rotate to the head. Stopping and restarting reading might also have a bit of other overhead. BTW. This is similar to why interleave mattered on MFM hard drives.
When a sequential read is broken up into multiple ATA commands (eg. by BUFFERING_DEFAULT_FILECHUNK), faster DMA allows more time for other code to execute between DMA operations, while still not interrupting reading. I think this is the only reason for using a DMA mode faster than UDMA 1 with an MK3008GAL. I don't know if this is important right now, because FAT reads often interrupt such reads, and that is a different issue.
Note that I didn't say anything about writes so far. DMA writes are slower than PIO writes at 30 Mhz! At 80 Mhz they are a bit faster.
Oops, I should have a deeper look at the sources before I write a comment. You are of course right.
30 MHz: 1532 KB/s PIO, 1680 KB/s UDMA2
80 MHz: 2757 KB/s PIO, 3280 KB/s UDMA2
More limited testing using UDMA1 showed similar gains.
I propose aligning such buffers via a STORAGEALIGN_ATTR define, which expands to CACHEALIGN_ATTR when needed, and nothing when not needed. I guess that has to be set based on something from config-*.h, but I'm not sure what would be the proper config directive name. How about CACHE_ALIGN_SECTOR_BUFFERS? Note that this would be beneficial even when ATA properly handles unaligned buffers, because such handling requires some extra time. Does this seem reasonable? If it does, I'll create a patch.
What's actually blocking this? Are there remaining serious issues?
The workaround used is to copy data from/to an aligned, uncached buffer, and use this buffer for DMA transfers, I just thought aligning buffers used in *_read/write_sectors() would be too much work / binsize increase.
(Jpeg) album art is displayed garbled and in wrong colors when the albumart is being displayed during disk activity, Playback freezes randomly after a few (5-7) songs.
This happens with either UDMA 1 or UDMA 4 mode.
I tried bisecting between 21250 and 21300 and I narrowed it down to 21261 being the first instance it breaks. But 21261 being a commit to the 3.3 branch confuses me...
Did you try to use a WPS without cover art? or use .bmp instead?
Are you using time stretch?
Did you try to use a WPS without cover art? or use .bmp instead?
Are you using time stretch?
Wether or not the timestrechting option is enabled in the options menu or not doesn't matter, it happens on both.
What I will try next: using a no-aa wps to check if album art is at fault for the playback issues.
I tried with the rockbox default wps (the realy bland one), and it also froze there.
After that, I left it at default (UDMA 1) which should also be the safer mode.
As far as I understood, this removed the actual signal processing, but Andree is probably able to explain all this further.
Anyways, these changes fixed the albumart issue and also probably the playback issue. I had the player running for 1.5 hours straight right now without it freezing.
It indeed seems to be an issue with r21258 breaking this patch.
Also, it would be useful to check for ATA errors, for example using http://www.rockbox.org/tracker/9721?getfile=18233 .
FS#9721+ FS#9746.It seems likely that cache line interference could be responsible.
This is easier than fixing the code to actually handle the head/tail of the unaligned area properly. I'm now working on a patch to the buffering code so that it will try to align the majority of requests to cacheline boundaries, but the attached patch could probably do with more testing with people who have different configurations to me. You might want to use my updated ATA stats patch to see how many requests are being done with DMA, but it's optional.
The "bytes <= 512" test can be removed from ata_dma_setup(). Single sector DMA transfers work. I had problems with that in the past, but it was apparently due to cache line alignment. Currently, the change wouldn't make much difference because many of the buffers into which single sectors are read are not cache line aligned, but aligning them isn't hard.
I've not had a chance to finish looking at the buffering alignment yet.
The buffering alignment code I'm testing is successfully causing all buffer data to be read via DMA. (The ATA stats show that the number of PIO reads equals the number of sectors read via PIO.) I did have one lockup though, so I want to examine and test things a bit more.
I am trying to convince myself that this patch decreases the time it takes to refill the music buffer.
I am attaching two patches:
* dma_cacheline_alignment_v3.patch : In addition to the change in v2, it also permits single sector transfers via DMA. Earlier problems with single sector transfers were due to cache line alignment. (First apply ata-dma-pp5020-v0.7.patch.)
* buffer_cacheline_alignment_v1.patch : a re-done buffer alignment patch which cache line aligns the buffer so buffer data may be read via DMA without cache line alignment concerns. It appears correct, but I feel it needs a bit more examination and testing because I'm not familiar with the buffering code.
If all this works, then hopefully it can finally be committed.
When the buffer is being filled via DMA, it definitely fills faster. For example, see the data and charts at https://spreadsheets.google.com/ccc?key=0AsEpnuZKewlIcGtlUmNmTTBzZzk0OVAzYTVFWVh0ZXc&hl=en
Edit, 14-09-2009: No issues yet, performed a single runtime test which shows comparable results to former UDMA-builds.
One additional (minor) idea: Can the test_disk plugin also be changed to use cache aligned read access (at least for the "aligned" measurements)? Otherwise we cannot see the positive effect of DMA transfers in this plugin for read access.
FS#9721- write errors should now be detected properly, and will be handled by retrying until a timeout, the same as read errors. (PIO writes are also now done with WRITE MULTIPLE). The ATA stats patch no longer applies, though, and I've not had time to fix it up yet.I am now testing DMA again; there were several crashes that were fairly reproducible on my ipodvideo, but I have good reason to believe they were all caused by the old patch for
FS#9721(write errors because of player being subject to shaking, causing the FAT code to panic). I'll see how it goes, but I think we're getting close to committing this one :)Someone (I forget who) suggested that DMA for single sector transfers is perhaps slower than PIO... is there any chance of someone doing some proper benchmarks of the current PIO code in svn (with WRITE MULTIPLE) against the latest patches here?
I applied the last 2 patches at http://www.rockbox.org/tracker/task/9708#comment32509 and I never got any problem.
But I can't feel any enhancement in loading dircache and playback buffer. The disk info in debug menu shows UDMA 1 mode.
Can I change something to see the other modes or I'm missing something else?
Thanks for any hint.
The disks in these players are not very fast to begin with: PIO is not *innately* slow, it's only slow if the CPU can't keep up with the transfer, and the disks are not quick enough (and the CPU not busy enough) that there is a problem in the usual cases we encounter.
Faster DMA modes are probably not worth using on a stock disk because the standard iPod disk is too slow for it to make any difference; a large upgraded disk could benefit from a faster mode, possibly, but the faster modes require the CPU to be running at a higher speed for DMA to be stable: the loss of power efficiency is probably not worth the speed improvement of having to increase the unboosted clock, or to force boosting...
Did you test DMA on a build at least r23741? If not can you update to at least that revision and try again? I want to make sure write errors are handled correctly; DMA is looking pretty safe, anyway, so we probably want to do some better benchmarks just to prove that it's never any *worse* than PIO, and it can go in...
For the PIO-build I have disabled HAVE_ATA_DMA in config-ipodvideo.h.
For the UDMA-build I have set ATA_MAX_UDMA to 1 in ata-target.h and removed the dma_cacheline_alignment patch to ensure UDMA is always used.
Remark: The build uses 24MHz normal clock and 100MHz boosted clock. Nevertheless I am sure the results can be used for further assumptions when comparing DMA and PIO.
Edit1: Removed UDMA4-tests (UDMA4 will always boost during transfers) and added UDMA1-tests.
Edit2: Forced UDMA mode for UDMA-tests via removal of dma_cacheline_alignment patch.
Thanks for the detailed explanation, now I have a clear idea on what's going on.
My latest build is r23570 plus some patches, this one included, so I haven't tested the new write errors handling.
I hope to do it asap.
Thank you.
It seems like writing with DMA is a bit slower for test_disk's "aligned" test cases. Reading for test_disk's "aligned" test cases is (much) faster.
I'm becoming tempted to commit the patch with DMA for reading only (and leave this FS# open while we investigate writes further) - it appears to be stable in all configs and it's not likely to get tested any *better* until it's in SVN.
I have not the necessary knowledge to align it: please, can anyone help me?
TIA
ata-dma-pp5020-v0.8 now includes the dma_cacheline_alignment_v3 patch.
buffer_cacheline_alignment_v2 still needs to be applied seperately.
I have updated the patch to only bother to do DMA for cache-aligned reads, and have added it to the configuration for every disk-based PP502x target. I have also set it to use UDMA 2 if the clock speed is defined sufficiently high to support it - it will work for your GUI-boost build as well, buschel (automatically changing to UDMA 1). The buffer alignment patch is still seperate as above.
This could do with some testing on all the *other* players; it pretty much certainly works on ipodvideo and probably on the other ipods but I've never seen half these targets ;)
With the v8 version, I noticed a short silence in playback when the buffer starts refilling. I mean not the first time the playback starts, but only when the buffer is almost used and needs to be refilled.
I can't tell if the same behaviour happens with the v9, because I need some hours to use the entire buffer: probably I can tell something more this night, as I will travel for a couple of hundreds kilometres.
A question a little bit OT: before making this test, I had applied also
FS#8668. With this one, playback during buffering hang RB up. Reading the disk info in the debug menu, I saw UDMA mode 4, not 2. IIRC mode 4 was declared unstable in some previous comment. Could it depend on the CPU frequencies modifications with#define CPUFREQ_SLEEP 32768
#define CPUFREQ_DEFAULT 24000000
-#define CPUFREQ_NORMAL 30000000
+#define CPUFREQ_NORMAL 24000000
#define CPUFREQ_MAX 80000000
FS#8668fine as I am choong the appropriate DMA mode for the configured clock speed.This made me think of another experiment: Why not disabling the "buffering boost" when DMA is used (encapsulate the boosting/unboosting with #ifndef HAVE_ATA_DMA)? Boris's measurements showed there is no big difference in read speed, very short tests with buffering of a new folder on my iPod showed the same result. We would save several seconds of boosting per hour :o)
FS#8668): it's stable and I never got a system hang up.Unfortunately, the short playback pause when buffer refill starts is yet present (with v3 patch the pause did not occur).
Rosso, I've never had that problem; the DMA code itself has not changed at all, the changes I have made are simply to disable using DMA for writes, and to tidy up a few things... there's no reason why it should behave differently with regards to buffering with the seperate patches vs now..
Edit: Worked stable until battery ran dry after 20:11h.
My one sometimes stopped buffering and kept disk running while I was browsing.(no seeking)
Buffer wasn't fully filled(alloc:54M/59.6M) while there was more song on dynamic playlist.
Presssing >> or stop(long >) caused lockup.
Another one?
Disconnected from PC, pressed > button, got wps, played nothing, debug shows(pcm:-1342873487somthing like this/529200), selected same song, same happens, selected another song, hung up.
Stable.
v9+noboost+8668
Rock stable even under crazy commands....
Sorry, maybe "unrelated changes" made it!
I'll report when I experience real problem.
I got real lockup, same as above.
1) v9 + buffer _cache_alignment_v2
2) v9
3) svn
So, all without the boost-disabling.
So 2 and 3 will be stable too.
I'll fall back to v9+v2_noboost(without 8668) and test it.
I had similar problem.
Music stopped->debug shows pcm:128123612481/52????(default value), playing static->selected another song->tags aren't read->hit long >->lockup->shut down(by timeout setting))
Right now I don't know it is due to possible fs corruption caused by previous hard reset.
I'll test v9+buffer_v2(Which I said stable for 7+days).
I had the same issue.
Data on disk weren't corrupt.
My one can be defective.
It seems they happen only while filling buffer.
Now I'll try without buffer_cacheline_alignment.
* r24272
* v9
* buffer_cacheline_alignment_v2_no_boost
* FS#9621
* a modified version of
FS#8668(based on my comment at http://www.rockbox.org/tracker/task/9708#comment33951), in which I removed the redefinition of NORMAL frequency.Playback during buffering at start up does not hang up any more.
I can't tell anything about the short silent gap when buffer refilling starts, because I don't encounter it yet.
I will report later.
The DAP works, but if I stop (press long play), it hangs up.
Now I will try without
FS#8668.FS#8668I always got the short gap in playback, but it doesn't stop any more! :-)As the last test, I will try the same configuration as in http://www.rockbox.org/tracker/task/9708#comment34247, but with buffer_cacheline_alignment_v2.