I'd never looked at gcc (or any other compiler) internals before, so it was a matter of reading the on-line gcc internals docs and the code (and a certain amount of trial and error).
OTOH I've hand written assembler for quite a few cpus over the years.
In some places I just hacked the opcode strings in order to see exactly where some common instructions were generated.
I was writing a fairly small piece of code (less than 2kb) that is a multi-channel hdlc controller. It has 195 clocks to do a byte's rx and tx on each channel (doing the bit-stuffing and crc in software), so absolutely every clock counts and I needed to minimise the worst-case code paths, not the common ones.
I had a moderate incentive to optimise obvious defects in the code generator!
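For anyone unfamiliar with what "bit-stuffing and crc in software" involves, here is a minimal C sketch of the two per-byte operations. It is not the code described above (that can't be shared), and it is written for clarity rather than for a 195-clock budget. The function names (fcs16_update, fcs16, stuff_byte) are made up for illustration; the CRC parameters are the standard HDLC FCS-16 ones (reflected polynomial 0x8408, initial value 0xffff, final inversion).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold one byte into a running HDLC FCS-16 (bitwise, LSB first). */
static uint16_t fcs16_update(uint16_t fcs, uint8_t byte)
{
    fcs ^= byte;
    for (int i = 0; i < 8; i++)
        fcs = (fcs & 1) ? (uint16_t)((fcs >> 1) ^ 0x8408) : (uint16_t)(fcs >> 1);
    return fcs;
}

/* FCS-16 of a whole buffer: start at 0xffff, invert at the end. */
static uint16_t fcs16(const uint8_t *buf, size_t len)
{
    uint16_t fcs = 0xffff;
    while (len--)
        fcs = fcs16_update(fcs, *buf++);
    return (uint16_t)~fcs;
}

/*
 * Bit-stuff one data byte for transmission, LSB first.
 * *ones carries the count of consecutive 1 bits across byte boundaries.
 * After five consecutive 1s a 0 is inserted. One output bit is written
 * per array element; a byte can expand to at most 10 bits.
 * Returns the number of bits written to out[].
 */
static size_t stuff_byte(unsigned *ones, uint8_t byte, uint8_t out[10])
{
    size_t n = 0;
    for (int i = 0; i < 8; i++) {
        unsigned bit = (byte >> i) & 1u;
        out[n++] = (uint8_t)bit;
        if (bit) {
            if (++*ones == 5) {
                out[n++] = 0;   /* stuffed zero */
                *ones = 0;
            }
        } else {
            *ones = 0;
        }
    }
    return n;
}
```

The worst case for the transmit side is a byte that triggers two stuffed zeros (for example 0xff arriving when three 1s are already pending), which is the kind of path that has to fit in the clock budget.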
The big gain from fixing access to 'small data' structures was reducing 'register pressure' by stopping the compiler allocating a register to contain the start address of the structure - sometimes it would generate the same pointer twice!
Some other stuff I noticed:
1) The gcc 4 config always puts switch statement jump tables directly into the .code segment (something about not having the appropriate relocations for PIC code). I run the nios cpu with tightly coupled instruction and data memory (no caches) and without cpu data access to the code memory - so I need the .code to be 'pure'. The jump tables could probably be written to a .rodata.switch (or .code.switch or ...) section so that the linker script can decide exactly where they end up.
2) The instruction scheduler doesn't know about the delay slots after 'ld' instructions (and a few others). I had to go to great lengths to get delay slots filled in order to avoid any stall cycles.
3) It ought to be possible to generate the switch statement jump table code as a series of rtx to aid instruction scheduling and also to move the 'add' into the load offset removing an instruction.
4) In my code the only references to the stack pointer are in the function prologue where some registers are saved - that seems a waste for a function that doesn't return! The code is compiled in a single unit and all functions are marked __attribute__((always_inline)).
5) The 'global pointer' / 'small data' stuff seems to be based on gcc support for 'page 0' addressing. Although I fixed the code for gp relative access to structures, the code for accessing 'small data' arrays still uses an extra register. I used gp as a register variable pointing to a structure because that generates better code (gcc knows about the 16bit offset in the final memory reference).
5a) I'd arranged for my nios data areas and the 'small io' to be within a 64k block (to get gp relative addressing for everything), but I'd missed a trick! I should have put the 'small io' below 0x7fff - then it could be accessed by offsets from r0. It would be nice if gcc supported such variables - probably need a gcc attribute (or a special section - attribute is probably cleaner).
6) Add a gcc attribute to mark data as 'io', and generate ldio/stio (etc) for accesses to such data.
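To illustrate the idea in (5) of keeping everything in one structure reached through a single base pointer, here is a rough C sketch. It is only a guess at the general shape, not the actual sources: the structure layout, NCHAN, the field names and the HDLC_STATE_IN_GP macro are all hypothetical. The nios-only branch uses gcc's global register variable syntax and assumes the backend accepts "gp" as a register name and that nothing else in the build uses gp for small data; that branch is untested here. Everywhere else the code falls back to an ordinary pointer so it still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCHAN 4  /* hypothetical channel count */

/* Hypothetical per-channel state. */
struct chan {
    uint16_t rx_fcs;
    uint16_t tx_fcs;
    uint8_t  rx_ones;
    uint8_t  tx_ones;
};

/* All mutable state lives in one structure. */
struct hdlc_state {
    struct chan chan[NCHAN];
    uint32_t    ticks;
};

/* Every field must be reachable with a signed 16-bit offset from the base. */
_Static_assert(sizeof(struct hdlc_state) <= 0x8000,
               "state must fit in a 16-bit positive displacement");

static struct hdlc_state storage;

#if defined(__nios2__) && defined(HDLC_STATE_IN_GP)
/* Assumption: pin the base pointer in gp via a global register variable. */
register struct hdlc_state *st __asm__("gp");
#else
/* Portable fallback: an ordinary pointer. */
static struct hdlc_state *st;
#endif

/* Point the base at the state block; call once at startup. */
static inline void state_init(void)
{
    st = &storage;
}

/* Each access below is a constant offset from the single base pointer. */
static inline void rx_reset(unsigned ch)
{
    st->chan[ch].rx_fcs  = 0xffff;
    st->chan[ch].rx_ones = 0;
}

static inline void tick(void)
{
    st->ticks++;
}
```

The point of the layout is that every field access can be expressed as base register plus constant offset, so no extra register is needed to hold the address of each individual variable.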
Unfortunately I can't give you a copy of my sources.