FS#12431 - SH gcc 4.6.2 with link-time optimization, for Archos targets

Attached to Project: Rockbox
Opened by Boris Gjenero (dreamlayers) - Thursday, 08 December 2011, 05:56 GMT
Task Type: Patches
Category: Build environment
Status: New
Assigned To: No-one
Operating System: HW-codec
Severity: Low
Priority: Normal
Reported Version: Daily build (which?)
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 0%
Votes: 0
Private: No

Details

I'm now able to build a working copy of Rockbox r31177 for my Archos Recorder V2 using binutils 2.21.1 and gcc 4.6.2, with -Os -flto. The main advantages are a 7 KB decrease in binary size and memory use, and automatic discarding of unused code. The main disadvantage is much slower linking. I don't know if this is worth it.

The new binutils is required because link-time optimization of object files stored in library archives, like libfirmware.a, needs a linker plugin. Linker plugin support is automatically detected by gcc, so there's no need for -fuse-linker-plugin.

The attached gcc patch is based on the current gcc-4.0.3-rockbox-1.diff by Jens Arnold (amiconn). I still need to investigate whether the workaround in gcc/config/sh/sh.h is actually needed. Including it shouldn't cause any problems. You can find info about it in IRC logs around this date: http://www.rockbox.org/irc/rockbox-20060427.txt

The attached Rockbox patch changes rockboxdev.sh to build this toolchain, changes configure to add -flto for gcc 4.6.0 and above, and makes various other changes so Rockbox builds properly. The gcc patch can't be automatically downloaded by rockboxdev.sh, so put it in the download directory, which is /tmp/rbdev-dl by default. Note that configure will only use -Os if it finds "rockbox" in the sh-elf-gcc version string, so if you want to try an unpatched gcc, you need to edit configure or the generated Makefile.
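The two checks described above can be sketched in C. This is only an illustration of the logic, not the actual configure script, which is a shell script; the function names wants_flto and wants_os are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true for gcc 4.6.0 and above, the threshold
 * at which the patched configure is described as adding -flto. */
bool wants_flto(int major, int minor)
{
    return major > 4 || (major == 4 && minor >= 6);
}

/* Hypothetical helper: true when the compiler's version string
 * contains "rockbox", the condition described for using -Os. */
bool wants_os(const char *version_string)
{
    return strstr(version_string, "rockbox") != NULL;
}
```

With an unpatched gcc the second check would fail, which is why configure or the generated Makefile would need editing in that case.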

Most of the code changes simply add __attribute__((used)) to stuff that gcc -flto would otherwise throw away. When C code is only referenced by assembler code, gcc will throw it away. This even happens for references from inline assembler in the same C file. Functions in apps/plugins/lib/gcc-support.c were also getting discarded, resulting in "defined in discarded section" errors.
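A minimal sketch of the kind of change this involves, assuming a USED_ATTR macro along the lines suggested in the comments below. The macro definition shown here is an assumption, and example_handler and handler_calls are hypothetical names, not code from the patch.

```c
/* Assumed definition; the real USED_ATTR in Rockbox's
 * gcc_extensions.h may be written differently. */
#ifdef __GNUC__
#define USED_ATTR __attribute__((used))
#else
#define USED_ATTR
#endif

/* Hypothetical counter so the effect of the handler is observable. */
int handler_calls;

/* Hypothetical function that, in the real case, would be referenced
 * only from assembler code. With no visible C caller, gcc -flto may
 * discard it unless it is marked as used. */
USED_ATTR void example_handler(void)
{
    handler_calls++;
}
```

The attribute tells gcc to emit the function even when it sees no reference to it, which is the situation when the only caller is assembler.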

Link time optimization shuffles code around and then divides it into several large assembler files. (Note how in rockbox.map, instead of the normal .o files, you see a bunch of .ltrans.o files.) Code from the same source file may end up in different assembler files. This is why the "bsr _UIE" couldn't reach UIE(), and why .global is needed for _start_thread and _UIE4.

Various little notes: I see no improvement with GLOBAL_LDFLAGS=-fwhole-program, so gcc must be detecting that properly. Adding -ffunction-sections -Wl,--gc-sections is also not helpful. The patch doesn't fix some new warnings that appear with gcc 4.6.2, but there are only a few, and they should be easy to deal with. It also doesn't make the changes needed to use -flto on other targets. Without -flto, gcc 4.6.2 generates a binary that's 3 KB bigger than the gcc 4.0.3 binary.

Comment by Thomas Martitz (kugel.) - Thursday, 08 December 2011, 07:48 GMT
Should probably add USED_ATTR to gcc_extensions.h. It's used in a number of places now.
Comment by Nils Wallménius (nls) - Thursday, 08 December 2011, 16:22 GMT
How large is the difference in compile time? Also have you tried with -fno-fat-lto-objects ?
Comment by Boris Gjenero (dreamlayers) - Friday, 09 December 2011, 18:30 GMT
I committed USED_ATTR support in r31188. An updated patch using USED_ATTR is attached.

Here are my benchmarks for building a r31189 Archos Recorder V2 binary. Times are in seconds. I first performed the operation once without timing. Then, I timed three repeated operations and divided the reported time by 3. Both columns use code patched by sh_flto-v2.patch. The first column uses the normal sh-elf-gcc 4.0.3 built using an unpatched rockboxdev.sh, and the second column uses gcc 4.6.2 built via the patched rockboxdev.sh and -flto. The computer has a Q6600 CPU at 2.4 GHz, 2 GB RAM, a WD Black 1 TB hard drive, and Linux Mint Debian Edition with Update Pack 3 from 2011.08.30. I did not use ccache.

make clean, then make -j4
       gcc 4.0.3   gcc 4.6.2 -flto
real   38.3        99.8
user   95.8        293
sys    10.9        16.6

touch main.c then make -j4
       gcc 4.0.3   gcc 4.6.2 -flto
real   1.1         33.1
user   0.7         32.1
sys    0.1         0.8

Yeah, it really is that bad. (At first I wasn't even sure if I should create this tracker entry.) I didn't try -fno-fat-lto-objects, because there hasn't been a gcc release yet with that feature. It shouldn't help the second case much anyways. One thing that ought to help is parallelizing with -flto=jobserver, but unfortunately it causes linking to sometimes fail with:
make[1]: *** read jobs pipe: No such file or directory. Stop.
make[1]: *** Waiting for unfinished jobs....
lto-wrapper: make returned 2 exit status
/opt/sh-2211-462/lib/gcc/sh-elf/4.6.2/../../../../sh-elf/bin/ld: lto-wrapper failed
collect2: ld returned 1 exit status

A further reduction of 132 bytes in binary size and 104 bytes in memory use is possible with -flto-partition=none, which puts everything into one assembler file.
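For reference, the slowdown implied by the "real" times in the benchmarks above can be worked out with a small helper. This is just arithmetic on the reported numbers; the function name slowdown is hypothetical.

```c
/* Ratio of the gcc 4.6.2 -flto build time to the gcc 4.0.3 build time.
 * Both arguments are wall-clock times in seconds. */
double slowdown(double baseline_seconds, double lto_seconds)
{
    return lto_seconds / baseline_seconds;
}
```

Calling slowdown(38.3, 99.8) for the clean build gives roughly 2.6, and slowdown(1.1, 33.1) for the touch main.c case gives roughly 30.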
Comment by Nils Wallménius (nls) - Saturday, 10 December 2011, 08:47 GMT
Yes, the second case is rather bad.
Comment by Rafaël Carré (funman) - Saturday, 17 December 2011, 01:38 GMT
It's something I'll accept.

When building ARM targets in thumb mode, building can already take twice as long.
