Hi,
--- Quote Start ---
AFAIK, that would include doing an additional memory interface for this instruction, as the infrastructure of the NIOS design does not allow using the processor's memory interface in a custom instruction. This of course prevents allowing for a cache within the processor. I suppose doing an external cache (aka 2nd level cache) instead of using the 1st level cache provided by Altera will slow down the CPU a lot.
--- Quote End ---
To build an SMP system with the standard Nios II, we must achieve the following two points.
1) An atomic read-write memory instruction.
2) Coherency of the 1st-level data caches.
Atomic memory instructions are like a game of 'Beach Flags' in which there is only one flag, and that flag corresponds to a lock variable. The flag must therefore be set in the 2nd-level cache or main memory, where every CPU sees the same copy, not in the per-CPU 1st-level caches. This means the bus lock for atomic instructions is required between the 1st-level and 2nd-level caches, not between the CPU and the 1st-level cache, so we can achieve 1) without tampering with Altera's data cache. For 2), however, external hardware has no way to flush a targeted line in Altera's data cache, so coherency is impossible to achieve except by removing the standard data cache.
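To illustrate the 'only one flag' idea, here is a minimal sketch of a lock built on a single atomic read-modify-write, written with standard C11 atomics. It is not the actual Nios II custom instruction; the names `beach_flag_t`, `beach_flag_try_grab`, `beach_flag_grab` and `beach_flag_release` are hypothetical and exist only for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock variable: the single "flag" that all CPUs race for. */
typedef atomic_flag beach_flag_t;

/* One atomic test-and-set. Returns true only for the caller that
 * found the flag free and took it; every other caller gets false. */
bool beach_flag_try_grab(beach_flag_t *f)
{
    return !atomic_flag_test_and_set_explicit(f, memory_order_acquire);
}

/* Spin until this caller wins the flag. */
void beach_flag_grab(beach_flag_t *f)
{
    while (!beach_flag_try_grab(f)) {
        /* busy-wait */
    }
}

/* Put the flag back so another CPU can take it. */
void beach_flag_release(beach_flag_t *f)
{
    atomic_flag_clear_explicit(f, memory_order_release);
}
```

The read and the write of the flag must happen as one indivisible operation on the shared copy, which is why the lock has to be enforced where all CPUs meet, not inside a private cache.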
Of course, we must accept the disadvantage of adding an external 1st-level data cache. It makes the CPU slower, but not by much. At present, a read or write between the CPU and the external 1st-level cache takes 3 clocks on a cache hit. But code does not consist entirely of 'load' and 'store' instructions, so the negative impact is limited. (Fewer memory accesses is the major premise of RISC processors, though it is sometimes broken :D.)
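A rough way to see why the impact is limited is to weight the 3-clock hit time by the share of instructions that actually touch memory. The sketch below assumes a simple model in which every instruction costs a fixed number of clocks; the function name `average_clocks` and any input values other than the 3-clock hit time are illustrative assumptions, not measurements.

```c
/* Average clocks per instruction under a simple two-class model:
 *   mem_fraction  share of instructions that are loads or stores (0.0 to 1.0)
 *   mem_clocks    clocks for a load/store that hits the cache
 *   other_clocks  clocks for every other instruction
 * Cache misses and pipeline stalls are ignored in this sketch. */
double average_clocks(double mem_fraction, double mem_clocks, double other_clocks)
{
    return mem_fraction * mem_clocks + (1.0 - mem_fraction) * other_clocks;
}
```

For example, if one instruction in four were a load or store and all other instructions took 1 clock (both hypothetical figures), `average_clocks(0.25, 3.0, 1.0)` gives 1.5 clocks per instruction instead of 3.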
There are also some advantages to adopting an external 1st-level cache. We can make all the caches physically indexed and physically tagged, so the 1st-level data cache can be enlarged beyond 4 KBytes without synonym problems. Moreover, the bus between the 1st-level and 2nd-level caches can be custom-designed, e.g. with a wider bus or with simultaneous read and write. I adopted a 128-bit bus, and the peak data rate reaches 1.6 GBytes/sec (@100 MHz).
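The two numbers above can be checked with a little arithmetic. The sketch below assumes the bus transfers one full word every clock, and that the 4 KByte limit comes from a 4 KByte MMU page with a direct-mapped, virtually indexed cache; both function names are hypothetical.

```c
#include <stdint.h>

/* Peak bytes per second of a bus that moves one full word every clock. */
uint64_t peak_bytes_per_sec(uint32_t bus_bits, uint64_t clock_hz)
{
    return (uint64_t)(bus_bits / 8u) * clock_hz;
}

/* Largest virtually indexed cache that cannot have synonyms:
 * the index bits must lie inside the page offset, so each way
 * can be at most one page. A physically indexed cache has no
 * such limit. */
uint32_t max_synonym_free_bytes(uint32_t page_bytes, uint32_t ways)
{
    return page_bytes * ways;
}
```

With a 128-bit bus at 100 MHz, `peak_bytes_per_sec(128, 100000000)` returns 1600000000, i.e. 1.6 GBytes/sec, and `max_synonym_free_bytes(4096, 1)` returns 4096.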
--- Quote Start ---
Maybe you could use the old A31-trick (A31=1 -> cache bypassed). With that you could define non-cacheable regions using the MMU target address.
--- Quote End ---
Yes, I used the A28 trick.
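In software terms the trick looks roughly like this. The sketch assumes that A28 = 1 means 'bypass the cache', by analogy with the A31 trick quoted above, and that both aliases map to the same memory; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: address bit 28 set means the access bypasses the cache. */
#define UNCACHED_BIT (UINT32_C(1) << 28)

/* Return the cache-bypassing alias of an address. */
uint32_t to_uncached(uint32_t addr)
{
    return addr | UNCACHED_BIT;
}

/* Return the cached alias of an address. */
uint32_t to_cached(uint32_t addr)
{
    return addr & ~UNCACHED_BIT;
}

/* Nonzero if the address would bypass the cache. */
int is_uncached(uint32_t addr)
{
    return (addr & UNCACHED_BIT) != 0;
}
```

Shared data such as lock variables would then be accessed through the `to_uncached()` alias, while ordinary data keeps using the cached one.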
--- Quote Start ---
But I don't think the problem with inter-CPU atomic instructions is solvable :(.
--- Quote End ---
If it were unsolvable, Linux would never boot ;).
Kazu