Guide to CPU-Specific Optimizations of Rockbox Targets


Devices

ARM

ARM Flavors: See the ARM quick reference card: http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001l/QRC0001_UAL.pdf

  • ARMv4 - Basic ARM ISA with support for aligned memory accesses, multiple data transfer to/from memory, and basic 32x32 multiplication instructions producing 32 bit and 64 bit results.

  • ARMv5E - Upgraded ARM ISA that adds instructions for accelerating DSP operations. All ARMv5E+ hardware has at least a 16 bit wide multiplication unit. Additional instructions provide single cycle packed 16 bit fixed point multiplication, multiply-accumulate, and saturating addition.

  • ARMv6 - Upgraded with additional DSP style operations and support for unaligned loads/stores (although these are slower). Adds packed SIMD style addition and multiplication operations on 32 bit registers, plus additional fixed point multiplication instructions (essentially instructions that multiply and then shift the result).

  • NEON - Adds the NEON coprocessor: a separate SIMD unit with 32 registers of 64 bits each, which can also be viewed as 16 registers of 128 bits. All NEON instructions operate on the NEON register file, and results can only be returned to the ARM core via coprocessor transfer instructions or by writing out to memory, both of which incur high latency. NEON operations include vector fixed and floating point addition, multiplication, loads, stores and shifts. Most operations are fully pipelined, making them vastly more efficient than standard ARM operations. A minimal intrinsics sketch follows this list.
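
For NEON-capable targets, the sketch below uses the GCC/Clang intrinsics from arm_neon.h to add two buffers of 32 bit fixed point samples four at a time. The function name and the assumption that count is a multiple of 4 are illustrative only, not taken from the Rockbox sources.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Hypothetical helper: add two buffers of 32 bit fixed point samples,
     * four per iteration.  Assumes a NEON capable core and that count is
     * a multiple of 4. */
    void vec_add_q31(int32_t *dst, const int32_t *a, const int32_t *b, int count)
    {
        for (int i = 0; i < count; i += 4) {
            int32x4_t va = vld1q_s32(&a[i]);        /* load 4 samples into one NEON register */
            int32x4_t vb = vld1q_s32(&b[i]);
            vst1q_s32(&dst[i], vaddq_s32(va, vb));  /* packed add, store the 4 results */
        }
    }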

ARMv6 and below have 16 general purpose registers, one of which is the program counter (PC) and one of which is the stack pointer (SP). In principle SP could be stored and used as a general purpose register, but this breaks in a hosted environment and should not be done. This leaves 14 general purpose registers. Note that r14 is the link register, which holds the return address to the calling function; it can be used in asm blocks provided it is stacked.

More about registers and gcc: http://www.ethernut.de/en/documents/arm-inline-asm.html
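
As an illustration of the point about r14 above, the hypothetical GCC inline assembly sketch below uses lr as an extra scratch register and stacks it around the sequence; listing lr as a clobber additionally keeps the compiler from allocating an operand to r14. This is a sketch only, not code from the Rockbox sources.

    /* Hypothetical sketch: sum three words using lr (r14) as a scratch
     * register inside an asm block.  lr is pushed/popped so the return
     * address survives, and it is also listed as a clobber so gcc will
     * not place an operand in r14. */
    static inline int sum3(const int *p)
    {
        int r;
        asm volatile (
            "stmfd  sp!, {lr}       \n\t"   /* stack lr before using it   */
            "ldr    %0, [%1]        \n\t"
            "ldr    lr, [%1, #4]    \n\t"   /* lr used as a temporary     */
            "add    %0, %0, lr      \n\t"
            "ldr    lr, [%1, #8]    \n\t"
            "add    %0, %0, lr      \n\t"
            "ldmfd  sp!, {lr}       \n\t"   /* restore the return address */
            : "=&r" (r)
            : "r" (p)
            : "lr", "memory"
        );
        return r;
    }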

With each ARM generation, scheduling becomes increasingly important. Fortunately, code scheduled for a later ARM processor usually runs at or near optimal speed on earlier ARM cores, with a few exceptions. Therefore one should ideally schedule for the ARM11 when writing code, even when developing on earlier processors.

ARM7TDMI

  • ISA: ARMv4

  • Multiplier latency: 3-5 cycle typical (early termination for small numbers)

  • Pipeline Interlocks: All loads have an unconditional 1 cycle stall

Slow multiplier and load/store performance makes careful use of multiplication and load/store multiple instructions essential.
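
As a sketch of the load/store multiple point, the hypothetical copy loop below moves four words per iteration with ldmia/stmia instead of individual ldr/str pairs, so the per-access overhead is paid once per block. The function name is invented and len is assumed to be a non-zero multiple of 16 bytes.

    #include <stddef.h>

    /* Hypothetical ARMv4 sketch: copy len bytes (a non-zero multiple of
     * 16) using load/store multiple to amortise slow memory accesses. */
    static void copy_words(int *dst, const int *src, size_t len)
    {
        asm volatile (
            "1:                               \n\t"
            "ldmia  %[s]!, {r3, r4, r5, r6}   \n\t"   /* load 4 words, post-increment */
            "stmia  %[d]!, {r3, r4, r5, r6}   \n\t"   /* store 4 words                */
            "subs   %[n], %[n], #16           \n\t"
            "bne    1b                        \n\t"
            : [d] "+r" (dst), [s] "+r" (src), [n] "+r" (len)
            :
            : "r3", "r4", "r5", "r6", "cc", "memory"
        );
    }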

PP5002

  • Cache: 8KB Unified, 1 cycle latency (hardware bug)

  • IRAM: 96KB, 0 cycle latency

  • Boosting: Yes (30/80MHz)

  • Examples: Ipod 1-3G

Dual core CPU; work can be split between the two cores.

PP5020

Same as PP5002 except:

  • Cache: 8KB Unified, 0 cycle latency

  • IRAM: 96KB, 4 blocks x 24KB with 0/1 cycle latency (hardware bug)
    (latencies for CPU/COP: block 0: 0/1, block 1: 1/0, blocks 2 + 3: 1/1)

  • Examples: Ipod 4G, Nano 1G, H10

Dual core CPU; work can be split between the two cores.

PP5022/PP5024

Same as PP5020, except IRAM is 128KB with zero latency.

  • Examples: e200v1, c200v1, Ipod Video

ARM922T

  • ISA: ARMv4

  • Multiplier latency: 3-5 cycle typical

  • Pipeline Interlocks: All single loads have a 1 cycle interlock if used immediately after load.

Similar to the ARM7TDMI, but with a 5 cycle pipeline and separate instruction and data caches, which eliminates the unconditional ldr delay under some circumstances. Attention to pipelining becomes more important: on all ARM9 cores, a register should not be used on the cycle immediately after it is loaded.
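
A hypothetical sketch of this scheduling rule: an independent instruction is placed between each load and the first use of the loaded register, so the 1 cycle load-use interlock is hidden. The helper is illustrative only.

    /* Hypothetical ARM9 sketch: neither loaded register is used on the
     * cycle immediately after its load, so no interlock occurs. */
    static inline int scaled_sum(const int *p, int scale)
    {
        int a, b;
        asm volatile (
            "ldr    %[a], [%[p]]        \n\t"
            "ldr    %[b], [%[p], #4]    \n\t"   /* independent load fills the slot        */
            "add    %[a], %[a], %[s]    \n\t"   /* first use of %[a], one instruction later */
            "add    %[a], %[a], %[b]    \n\t"   /* %[b] used two instructions after its load */
            : [a] "=&r" (a), [b] "=&r" (b)
            : [p] "r" (p), [s] "r" (scale)
            : "memory"
        );
        return a;
    }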

AS3525v1

  • Cache: 8KB I, 8KB D, 0 cycle latency

  • IRAM: 320KB, performance is comparable to DRAM

  • Boosting: Yes (62/248MHz)

  • Examples: e200v2, m200v4, c200v2, fuzev1, clipv1

Memory performance is fairly poor when boosted, which hurts battery life if codecs require frequent boosting. IRAM seems no better than DRAM. Codecs run entirely from IRAM.

S3C2440A

  • Cache: 16KB I, 16KB D, 0 cycle latency, 32 byte cacheline

  • IRAM: None

  • Boosting: No

  • Examples: Gigabeat F/X

Relatively slow main memory compared to clock speed.

S5L870x (ARM940T)

  • ISA: ARMv4

  • Cache: 4KB I, 4KB D, 16 byte cacheline

  • Boosting: Yes (48/192MHz)

  • Examples: Nano 2G

IRAM is significantly faster than DRAM, but still has higher latency than cache. In general memory performance is very poor on this CPU. The memory bus speed is limited to 100MHz, so latency increases when boosting.

ARM9E

Similar to the ARM922T except the multiplier width has been doubled from 8 bits to 16 bits, and the ARM ISA version is upgraded from v4 to v5E.

  • Multiplier latency: 1-2 cycle typical, 32x16 multiply accumulate instructions are fully pipelined with 1 issued per clock

  • Pipeline Interlocks: All single loads have a 1 cycle interlock if used immediately after the load. Single multiplies have a 1 cycle interlock if their result is used outside the multiplier unit on the next cycle (e.g. multiply accumulate has no interlock on sequential cycles, but a multiply followed by a store does). A small example follows this list.
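
The hypothetical fragment below shows the ARMv5E packed 16 bit multiply-accumulate instructions (smlabb/smlatt) in a dot product over words that each hold two 16 bit samples; the subs also fills the load-use slot. Overflow handling is omitted, count > 0 is assumed, and the function is a sketch rather than Rockbox code.

    /* Hypothetical ARMv5E sketch: dot product of packed 16 bit samples.
     * smlabb/smlatt multiply the low and high halves and accumulate,
     * one instruction per cycle on ARM9E. */
    static inline int dot_q15(const int *x, const int *y, int count)
    {
        int acc = 0, xv, yv;
        asm volatile (
            "1:                                     \n\t"
            "ldr    %[xv], [%[x]], #4               \n\t"
            "ldr    %[yv], [%[y]], #4               \n\t"
            "subs   %[n], %[n], #1                  \n\t"   /* fills the load-use slot */
            "smlabb %[acc], %[xv], %[yv], %[acc]    \n\t"   /* low halves              */
            "smlatt %[acc], %[xv], %[yv], %[acc]    \n\t"   /* high halves             */
            "bne    1b                              \n\t"
            : [acc] "+r" (acc), [xv] "=&r" (xv), [yv] "=&r" (yv),
              [n] "+r" (count), [x] "+r" (x), [y] "+r" (y)
            :
            : "cc", "memory"
        );
        return acc;
    }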

AS3525v2

  • Cache: 8KB I, 8KB D, 0 cycle latency

  • IRAM: 1MB, performance significantly better than DRAM; bus speed is still limited to 64MHz, so latency increases significantly when boosting.

  • Boosting: Not yet.

  • Examples: Fuzev2, Clipv2, Clip+

TCC780x

Apparently very similar to AS3525v2.

ARM11

  • ISA: ARMv6

  • Multiplier latency: 1-2 cycle typical, 32x16 multiply accumulate instructions are fully pipelined with 1 issued per clock if done to different registers

  • Load/Stores: Still single issue; however, the load/store and ALU pipelines are independent, and loads can retire out of order if there are no dependencies. Load multiple instructions now issue in a single cycle, with memory accesses occupying only the memory pipeline on subsequent cycles, so in principle one can load many registers for just one issue cycle if the subsequent cycles are occupied with independent ALU ops.

  • Pipeline Interlocks: All single loads have a 3 cycle interlock if used immediately after the load. Doubleword aligned multiple loads are much faster than single loads, and multiple loads can be faster than double loads due to the memory pipeline. Interlocks now also occur when using some multiplication instructions on sequential cycles (e.g. smlawY accumulating to the same register will stall), so avoid accumulating into the same register on sequential cycles.

Similar to ARM9E but with an even longer pipeline, branch prediction, an added L2 cache, a 64 bit load/store unit, separate ALU and memory pipelines, and the ISA upgraded to v6. Load multiple and load double instructions now fetch two registers per clock if the address is doubleword aligned. Load multiple and store multiple instructions issue in one cycle, but will stall if a loaded register is read before it is available, if a stored register is written before its contents have been stored, or if another memory access is started before they complete. Large interlock latencies mean that considering pipelining is essential. If properly scheduled, performance is substantially improved over ARM9E due to the improved cache, branch prediction and wider load/store unit.
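
The hypothetical fragment below illustrates both ARM11 points: the loads are grouped up front so their results are not needed on the immediately following cycles, and the two multiplies write different destination registers so consecutive cycles never target the same register. In a real loop the loads would normally be hoisted into the previous iteration; the function and names are illustrative only.

    /* Hypothetical ARMv6 sketch: loads issued early, multiply results
     * written to different registers to avoid back-to-back writes to
     * the same destination. */
    static inline int two_taps(const int *x, const int *c)
    {
        int a0, a1, x0, x1, c0, c1;
        asm volatile (
            "ldr    %[x0], [%[x]]       \n\t"
            "ldr    %[x1], [%[x], #4]   \n\t"
            "ldr    %[c0], [%[c]]       \n\t"
            "ldr    %[c1], [%[c], #4]   \n\t"
            "smulwb %[a0], %[x0], %[c0] \n\t"   /* 32x16 multiply, result in a0 */
            "smulwb %[a1], %[x1], %[c1] \n\t"   /* different destination        */
            "add    %[a0], %[a0], %[a1] \n\t"
            : [a0] "=&r" (a0), [a1] "=&r" (a1),
              [x0] "=&r" (x0), [x1] "=&r" (x1),
              [c0] "=&r" (c0), [c1] "=&r" (c1)
            : [x] "r" (x), [c] "r" (c)
            : "memory"
        );
        return a0;
    }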

iMX31

  • Cache: 16KB I, 16KB D, 0 cycle latency, 128KB L2

  • IRAM: 16KB, not used

  • Boosting: No.

  • Examples: Gigabeat S

The relatively high clock speed means memory is fairly slow in comparison.

Coldfire

These are RISC variants of the Motorola 68k architecture. Coldfire architecture versions:

  • V2 - Basic coldfire ISA (ISA_A), only some models support division and remainder instructions

  • V3 - ISA_A with mandatory division instructions

  • V4 - ISA_B, adds long branches, extended compare and move instructions, separate supervisor stack

There are also some optional units:

  • Floating point unit

  • (E)MAC: (Enhanced) multiply-accumulate unit

MCF5249

  • Coldfire V2, hardware division, no FPU, EMAC

  • 8 KB direct mapped instruction cache. Be aware that you might observe aliasing effects when optimizing.

  • No data cache, therefore it's crucial to put often used data in IRAM

  • IRAM: 96 KB single cycle (64 KB + 32 KB; only first block is DMA capable)

  • Boosting: Yes (45/124MHz)

Further hints:

  • Pipeline interlock for back-to-back single stores. Leave at least two instruction cycles between single stores. Doesn't apply to multiple stores (movem.l).

  • Coldfire instructions are variable length (1..3 16 bit words). Too many "long" instructions in sequence will starve the pipeline.

  • When accessing data in DRAM, use 16 byte aligned movem.l wherever possible. Line burst transfers are ~2.5 times as fast as 4x 4 byte (longword) transfers.

  • Use the EMAC if the algorithm allows it. Standard multiplication instructions use the same multiplier, but they always need several cycles because they're synchronous. EMAC is pipelined; stalls occur only if you're fetching the result from %accN too early.

  • EMAC instructions can load from memory in parallel while multiplying, with only one extra cycle. The point above about long instructions applies though, so avoid using offsets if possible. A short EMAC sketch follows this list.
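
The hypothetical fragment below shows the basic EMAC pattern in GCC inline assembly: multiply-accumulate into %acc0 inside the loop and read the result back once at the end with movclr.l, so the accumulator is never fetched too early. It assumes MACSR has already been set up for the intended (signed integer) mode, assumes count > 0, and is loosely modelled on, but not copied from, the Rockbox codec macros.

    /* Hypothetical Coldfire EMAC sketch: accumulate x[i]*y[i] into
     * %acc0 and fetch the result once after the loop. */
    static inline long emac_dot(const long *x, const long *y, int count)
    {
        long xv, yv, res;
        asm volatile (
            "1:                             \n\t"
            "move.l (%[x])+, %[xv]          \n\t"
            "move.l (%[y])+, %[yv]          \n\t"
            "mac.l  %[xv], %[yv], %%acc0    \n\t"   /* pipelined multiply-accumulate   */
            "subq.l #1, %[n]                \n\t"
            "bne    1b                      \n\t"
            "movclr.l %%acc0, %[res]        \n\t"   /* fetch and clear the accumulator */
            : [res] "=d" (res), [xv] "=&d" (xv), [yv] "=&d" (yv),
              [x] "+a" (x), [y] "+a" (y), [n] "+d" (count)
            :
            : "cc", "memory"
        );
        return res;
    }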

MCF5250

Same as MCF5249 except IRAM is 128 KB single cycle (64 KB + 64 KB; only first block is DMA capable).
