OK, then your accesses to the peripheral were being cached. To avoid this on a processor with a data cache, use the cache-bypassing macros from io.h instead of plain pointer accesses:
#include "io.h"            // contains the cache-bypassing macros
IORD(base, offset);        // base address of the peripheral and word offset of the register to read
IOWR(base, offset, data);  // base address, word offset, and the data you want to write
Those macros are for accessing native (registered) components like the one you created. If you need to bypass the cache to access memory (dynamic) components, io.h has similar macros for that (the IORD_32DIRECT / IOWR_32DIRECT family and their 8- and 16-bit variants). The difference is that they use byte addressing instead of word addressing. They should be covered in this document:
http://www.altera.com/literature/hb/nios2/n2sw_nii5v2.pdf

Also, the keyword volatile tells the compiler that a value can change outside the program's control, so it must not optimize away accesses to it. For example, if you poll a memory location waiting for a specific value, you should make it volatile, otherwise the compiler may get rid of the polling loop. The compiler sees repeated reads of the same location with nothing in the loop modifying it, and it has no way of knowing that some external event might change the value. So it may read the location only once and remove the loop, on the assumption that the value never changes and the loop isn't necessary.
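To make the polling case concrete, here is a minimal sketch in plain C that builds on a PC as well as on the target. The names poll_until, max_polls and fake_status are made up for illustration and are not part of io.h or the HAL; fake_status is just an ordinary variable standing in for a memory location that hardware or an interrupt handler would update.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from io.h): poll *reg until it equals `expected`,
 * giving up after max_polls reads.
 * Returns the number of reads performed, or -1 if the value never appeared.
 *
 * Because reg points to a volatile object, the compiler has to perform a
 * real read on every pass through the loop. Without volatile it would be
 * allowed to read *reg once and reuse that value for every comparison. */
int poll_until(volatile uint32_t *reg, uint32_t expected, int max_polls)
{
    assert(reg != NULL);

    for (int polls = 1; polls <= max_polls; ++polls) {
        if (*reg == expected) {
            return polls;   /* value seen on this read */
        }
    }
    return -1;              /* gave up: value never showed up */
}

/* Stand-in for a location that something outside the program changes.
 * On real hardware this would be a flag or status word, not a plain
 * variable (assumption made only so the sketch can run anywhere). */
volatile uint32_t fake_status = 0;
```

Calling poll_until(&fake_status, 1, 1000) while fake_status is still 0 returns -1 after 1000 reads; once fake_status has been set to 1, the same call returns 1 because the first read already matches. Note that volatile only deals with the compiler. If the location sits in a peripheral behind the data cache, the read inside the loop would still need to go through IORD as described above.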