Forum Discussion
19 Replies
- Altera_Forum
Honored Contributor
One problem with that particular code is that you force an additional memory read (probably through %gp) for almost every access.
Something like:

#define uart_pointer ((uart_regs_t *)(AVALON_UART_0_BASE|DATA_CACHE_BYPASS_MASK))

will let the compiler generate (and maybe cache) the 32-bit constant. If you are trying to squeeze every ounce of performance from the Nios, it is worth placing all the internal memory and IO devices in a 64kB range (not necessarily aligned) with %gp set to its middle - then all accesses to the memory and IO devices will be done with %gp-relative addressing. (You need my patches to gcc3 for this - don't know about gcc4.) (Actually you can get even better code by making all your data a single structure and defining a global register variable to point into it.) - Altera_Forum
Honored Contributor
--- Quote Start --- One problem with that particular code is that you force an additional memory read (probably through %gp) for almost every access. Something like:

#define uart_pointer ((uart_regs_t *)(AVALON_UART_0_BASE|DATA_CACHE_BYPASS_MASK))

will let the compiler generate (and maybe cache) the 32-bit constant. If you are trying to squeeze every ounce of performance from the Nios, it is worth placing all the internal memory and IO devices in a 64kB range (not necessarily aligned) with %gp set to its middle - then all accesses to the memory and IO devices will be done with %gp-relative addressing. (You need my patches to gcc3 for this - don't know about gcc4.) (Actually you can get even better code by making all your data a single structure and defining a global register variable to point into it.) --- Quote End ---

dsl, that is good information. Can you explain a bit more? What is %gp? Why does the code cause an additional read? - Altera_Forum
Honored Contributor
Read the NiosII instruction set :-)
Basically the read/write memory addressing modes are a 16-bit signed offset from a register. One of the registers, 'gp' (r26), is reserved to point into the middle of a 'small data' area; variables within this area are accessed using offsets from 'gp', giving single-instruction access. Anything <= 4 bytes, and anything put into sections whose names start .sdata or .sbss, is accessed as an offset from 'gp'.

To access variables that aren't in the small data segment, the compiler will generate a pointer to the data item (2 instructions) and then dereference it. It isn't possible to tell the compiler that the low 16 bits of the constant could be moved to the load/store instruction - and in any case that would only be possible for accesses to non-aggregate items, and they end up in the small data segment. If a data item is used multiple times, the compiler will often keep its address in a register (although I've seen it forget and load the value into a 2nd one!). So placing things in the .sdata area reduces register pressure as well as instruction count.

There is a slight problem though - the code in gcc that is used for the 'small data' is really designed for processors that can access memory either side of address 0 with single instructions. As such it doesn't expect to be able to add in a constant offset. This shows up when you put an array into the small data area: the compiler won't add the array index to 'gp' and then use the array base address (as an offset from gp) as an offset from that result. If you put all your data into a structure, and use a global register variable to point to the start of it, then that optimisation will happen. Now with:

volatile alt_u32 *uart_pointer;
uart_pointer = (volatile alt_u32 *)0xDEADBEEF;

the assignment through the pointer requires the compiler to generate code to read 'uart_pointer' and then write to the uart register - at least two instructions and two memory cycles (and possibly a two-cycle stall waiting for the read, unless the instructions can be reordered). On the other hand, directly indexing the constant address ought to be 2 instructions (one to load the high register bits, the 2nd the uart access). I suspect it is difficult to get the compiler to not generate a 3-instruction sequence - especially without my patches! However, get the uart registers inside the area addressable from 'gp' and you get single-instruction access. - Altera_Forum
Honored Contributor
For those of you masking your pointers with | 0x80000000 please use the cache remapping functions for your pointers instead. Sometimes what you are doing is safe but if you do that with a memory location that is already cached you may (probably) run into a cache coherency issue where your code attempts to bypass a location that is already cached!
The cache remapping functions perform a cache flush to ensure the locations you may be accessing are truly flushed from the cache and safe to access directly. For example, if I did this in my code (not clean, just proving a point...):

int *my_ptr = (int *)malloc(4);
IOWR_32DIRECT((unsigned long)my_ptr, 0, 0xdeadbeef);

You would expect 0xdeadbeef to be written out to memory at whatever location malloc returned. Not necessarily... If, for example, the location returned by malloc contained data that was already cached (taken from the heap), this IOWR at a hardware level will cause the cached value to be written out instead of the software-intended 0xdeadbeef. If you are thinking "well, that's a hardware bug"... not really, because even if 0xdeadbeef went out like you want, eventually a cache line miss at this location will occur and whatever was in the cache will get flushed out, blowing away 0xdeadbeef. Cache coherency problems like these are a big pain to debug, so my recommendation is to write clean code using the proper software APIs provided to you, and then you won't have to debug issues like these later. - Altera_Forum
Honored Contributor
Since IO areas should never be accessed by cacheable memory accesses this is all rather moot.
More likely is that, somewhere, you 'forget' to use the correct access function and do a cacheable access well after initialisation. If you are allocating memory descriptors that will be accessed by other bus masters, then you also need to ensure that the area you use doesn't start/end in the middle of a cache line - otherwise other users might cause the cache line to be dirty. OTOH if you are reserving large chunks of memory in the physical memory map (not at run time), then you really don't need to worry about them being cached before you start. But, yes, problems with unexpected cache operations can be very difficult to resolve. - Altera_Forum
Honored Contributor
The comment about allocations starting/ending in the middle of a cache line is exactly the issue I have been working on this week.
If you use the alt_uncached_malloc function then it does flush the cache and set the cache bypass bit on the returned pointer, but it does not take into account the cache lines when generating the allocation. This means that you will almost certainly have cached and uncached allocations sharing the same cache line(s) which can lead to system hell as mentioned above. Maybe I'm missing something, maybe it's a bug, or maybe it's 'by design' but if you do not account for it things can get nasty. - Altera_Forum
Honored Contributor
You aren't missing anything and it isn't a bug, it is just a side effect of how the cache is designed. You have to take it into account, and the best way to do it is to align all the memory areas that you intend to flush or invalidate on the cache line size.
For a dynamically allocated memory area, you can allocate more than needed and then perform some operation on the pointer to align it. I haven't tested it, but this should align your pointer correctly:

/* to allocate a buffer of n bytes */
my_buffer = malloc(n + NIOS2_DCACHE_LINE_SIZE - 1);
my_pointer = (void *)(((uintptr_t)my_buffer + NIOS2_DCACHE_LINE_SIZE - 1) & -NIOS2_DCACHE_LINE_SIZE);

This code will only work if NIOS2_DCACHE_LINE_SIZE is a power of 2, but I think it always is. It isn't a very elegant piece of code, but it may be made more readable by defining a 'mask' constant first that is applied to the pointer. Remember to keep both pointers, as you will need the original my_buffer to free it. On a static variable, you can use gcc's align attribute:

my_type my_variable __attribute__ ((aligned (NIOS2_DCACHE_LINE_SIZE)));

- Altera_Forum
Honored Contributor
Thanks for confirming I'm not going mad!
I had already come to a similar conclusion and implemented my own wrapper function to code around this. I agree it is a side effect of how the hardware operates, but if you use a function called 'alt_uncached_malloc', which is documented as allocating an uncached region of memory, you might assume that this was already taken into account. The static variable attribute ensures that it is aligned with the cache line size, but is it guaranteed that the compiler will not stack anything else directly on the end? I'm not familiar with the compiler internals, but if it does, you could still have problems. - Altera_Forum
Honored Contributor
I'm not sure... I think that if the next variable in the source code is also aligned, then the compiler will pad the bytes between the two, but this is just an assumption. You could run some tests to check that (and in that case, be sure to check at all optimization levels!).