Revision r9 - 25 Dec 2010 - 08:03 - MichaelGiacomelli
Guide to CPU specific Optimizations of Rockbox Targets
ARM Flavors: See the ARM quick reference card: http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001l/QRC0001_UAL.pdf
- ARMv4 - Basic ARM ISA with support for aligned memory accesses, multiple data transfer to/from memory, and basic 32 = 32x32 and 64 = 32x32 multiplication instructions.
- ARMv5E - Upgraded ARM ISA that adds instructions for accelerating DSP operations. All ARMv5E+ hardware has at least a 16 bit wide multiplication unit, with additional instructions for single cycle packed 16 bit fixed point multiplication and multiply-accumulate, as well as saturating addition.
- ARMv6 - Upgraded with additional DSP-style operations and support for unaligned loads/stores (although these are slower). Adds packed SIMD-style addition and multiplication operations on 32 bit registers, plus additional fixed point multiplication instructions (essentially instructions that multiply and then shift the result).
- ARMv7 w/ NEON - Adds the NEON coprocessor: a separate SIMD core with thirty-two 64 bit registers, which can also be accessed as sixteen 128 bit registers. All NEON instructions operate on the NEON register file; results can only be returned to the main core via coprocessor instructions or by writing out to memory, both of which incur high latency. NEON operations include vector fixed and floating point addition, multiplication, loads, stores and shifts. Most operations are fully pipelined, making them vastly more efficient than standard ARM operations.
With each ARM generation, scheduling becomes increasingly important. Fortunately, code scheduled for a later ARM processor runs at or near optimal speed on earlier ARM cores, with a few exceptions. Therefore one should ideally schedule for the ARM11 when writing code, even when developing on earlier processors.
Slow multiplier and load/store performance makes careful use of multiplication and load/store multiple instructions essential.
- Cache: 8KB Unified, 1 cycle latency (hardware bug)
- IRAM: 96KB, 0 cycle latency
- Multiplier latency: 3-5 cycle typical
- Pipeline Interlocks: All single loads have an unconditional 1 cycle stall
Dual core CPU; work can be split between both cores.
Same as PP5002 except:
- Cache: 8KB Unified, 0 cycle latency
- IRAM: 96KB, 4 blocks x 24KB with 0/1 cycle latency (hardware bug)
(latencies for CPU/COP: block 0: 0/1, block 1: 1/0, blocks 2 + 3: 1/1)
- Examples: Ipod 4G, Nano 1G, H10
Dual core CPU; work can be split between both cores.
Same as PP5020, except IRAM is 128KB with zero latency.
- Examples: e200v1, c200v1, Ipod Video
Similar to ARM7TDMI, but with a 5 stage pipeline and separate caches for instructions and data, which eliminates the unconditional ldr delay under some circumstances. Attention to pipelining becomes more important: on all ARM9 cores, a register should not be used on the cycle immediately after a load into it is issued.
- Multiplier latency: 3-5 cycle typical
- Pipeline Interlocks: All single loads have a 1 cycle interlock if used immediately after load.
- Cache: 8KB I, 8KB D, 0 cycle latency
- IRAM: 320KB, performance is comparable to DRAM
- Boosting: Yes (62/248MHz)
- Examples: e200v2, m200v4, c200v2, fuzev1, clipv1
Memory performance is fairly poor, which hurts battery life if codecs require frequent boosting. IRAM seems no better than DRAM; codecs run entirely from IRAM. The slow 62MHz IRAM/DRAM bus hurts performance when boosting.
- Cache: 16KB I, 16KB D, 0 cycle latency, 32 byte cacheline
Relatively slow main memory compared to clock speed.
- Cache: 4KB I, 4KB D, 16 byte cacheline
- IRAM: 256KB (S5L8700), 176KB (S5L8701)
- Boosting: Yes (48/192MHz)
IRAM is significantly faster than DRAM, but still has higher latency than cache. In general, memory performance is very poor on this CPU. The memory bus speed is limited to 100MHz, so latency increases when boosting.
Similar to ARM922 except that the multiplier has been doubled from 8 bits to 16 bits, and the ARM ISA version is upgraded from v4 to v5E.
- Multiplier latency: 1-2 cycle typical, 32x16 multiply accumulate instructions are fully pipelined with 1 issued per clock
- Pipeline Interlocks: All single loads have a 1 cycle interlock if used immediately after load, single multiplies have a single cycle interlock if used outside the multiplier unit on the next cycle (e.g. multiply accumulate has no interlock on sequential cycles, but a multiply followed by a store does).
- Cache: 8KB I, 8KB D, 0 cycle latency
- IRAM: 1MB, performance significantly better than DRAM; bus speed still limited to 64MHz, so latency increases significantly when boosting.
- Examples: Fuzev2, Clipv2, Clip+
Apparently very similar to AS3525v2.
Similar to ARM9E but with an even longer pipeline, branch prediction, an added L2 cache, a 64 bit load/store unit, separate ALU and memory pipelines, and the ISA upgraded to v6. Load multiple and load double instructions now fetch two registers per clock if they are even word aligned. Load multiple and store multiple instructions issue in one cycle, but will stall if the loaded registers are read before they are available, if registers are written before their contents are stored, or if any other memory access is started. Large interlock latencies mean that attention to pipelining is essential. If properly scheduled, performance is substantially improved over ARM9E due to the improved cache, branch prediction and wider load/store units.
- Multiplier latency: 1-2 cycle typical, 32x16 multiply accumulate instructions are fully pipelined with 1 issued per clock if done to different registers
- Pipeline Interlocks: Load/stores are still single issue, however the load/store and ALU pipelines are independent, and loads can retire out of order if there are no dependencies. Load multiple instructions are single cycle, with memory accesses occupying only the memory pipeline on subsequent cycles, so in principle one can load many registers in just one cycle if subsequent cycles are occupied with independent ALU ops. All single loads have a 3 cycle interlock if used immediately after the load, and interlocks now occur when using some multiplication instructions on sequential cycles (e.g. smlawY accumulating to the same register will stall), so avoid accumulating into the same register on sequential cycles.
- Cache: 16KB I, 16KB D, 0 cycle latency, 128KB L2
High clock speed means memory is fairly slow.
These are RISC variants of the Motorola 68k architecture. Coldfire architecture versions:
- V2 - Basic coldfire ISA (ISA_A), only some models support division and remainder instructions
- V3 - ISA_A with mandatory division instructions
- V4 - ISA_B, adds long branches, extended compare and move instructions, separate supervisor stack
There are also some optional units:
- (E)MAC: (Enhanced) multiply-accumulate unit
- Coldfire V2, hardware division, no FPU, EMAC
- 8 KB direct mapped instruction cache. Be aware that you might observe aliasing effects when optimizing.
- No data cache, therefore it's crucial to put often used data in IRAM
- IRAM: 96 KB single cycle (64 KB + 32 KB; only first block is DMA capable)
- Boosting: Yes (45/124MHz)
- Pipeline interlock for back-to-back single stores. Leave at least two instruction cycles between single stores. Doesn't apply to multiple stores (movem.l).
- Coldfire instructions are variable length (1..3 16 bit words). Too many "long" instructions in sequence will starve the pipeline.
- When accessing data in DRAM, use 16 byte aligned movem.l wherever possible. Line burst transfers are ~2.5 times as fast as 4x 4 byte (longword) transfers.
- Use the EMAC if the algorithm allows it. Standard multiplication instructions use the same multiplier, but they always need several cycles because they're synchronous. EMAC is pipelined; stalls occur only if you're fetching the result from %accN too early.
- EMAC instructions can load from memory in parallel while multiplying, with only one extra cycle. The point above about long instructions applies though, so avoid using offsets if possible.
Same as MCF5249 except IRAM is 128 KB single cycle (64 KB + 64 KB; only first block is DMA capable).
Copyright © by the contributing authors.