Check that your Avalon slave is 32 bits wide and clocked from exactly the same clock as the Nios CPU.
With the /f core you should be able to get an uncached memory read from an Avalon slave in 3 cycles (I'm not sure it is possible to do better), plus the 2-instruction 'result delay'.
Accesses to an M9K block behind a clock-crossing bridge take 10 clocks.
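
To put those numbers side by side, here is a rough back-of-envelope sketch in C. It only multiplies out the figures quoted above. The function and constant names are made up for illustration, and it assumes the worst case where every read's result is needed immediately (so the full result delay is paid each time, counted here as 2 cycles) and that the 10-clock bridged figure is taken as-is.

```c
/* Figures quoted in the post; rough estimates, not measurements. */
enum {
    DIRECT_READ_CYCLES  = 3,  /* uncached read, slave on the CPU clock */
    RESULT_DELAY_CYCLES = 2,  /* assumption: the 2-instruction result delay costs 2 cycles */
    BRIDGED_READ_CYCLES = 10  /* read through a clock-crossing bridge, taken as-is */
};

/* Worst-case cycles for n dependent reads from a directly attached slave. */
static unsigned long direct_read_cost(unsigned long n)
{
    return n * (DIRECT_READ_CYCLES + RESULT_DELAY_CYCLES);
}

/* Cycles for n reads through the bridge (no result delay added on top). */
static unsigned long bridged_read_cost(unsigned long n)
{
    return n * BRIDGED_READ_CYCLES;
}
```

Under these assumptions a polling loop of 1000 reads costs about 5000 cycles directly attached versus about 10000 through the bridge, which is why keeping the slave on the CPU clock matters.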
The fastest way to access peripherals is actually through the 'custom instruction' interface. Assuming everything is synchronised to the CPU clock, unclocked status reads are single cycle with no result delay (just a mux); clocked instructions can be single cycle but are subject to the result delay.
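
On the software side, a status read through a custom instruction might look something like the sketch below. It uses GCC's Nios II builtin `__builtin_custom_in`, which takes the custom instruction index and returns an int. The index `STATUS_CI_N`, the helper names, and the host fallback are all hypothetical, and the `__NIOS2__` / `__nios2__` target macros are my assumption about what your toolchain predefines, so check them against your build.

```c
#include <stdint.h>

/* Hypothetical: index assigned to the status-read custom instruction. */
#define STATUS_CI_N 0

#if defined(__NIOS2__) || defined(__nios2__)
/* On target, the builtin compiles to a single `custom` instruction. */
static inline uint32_t read_status(void)
{
    return (uint32_t)__builtin_custom_in(STATUS_CI_N);
}
#else
/* Host fallback so the sketch builds off-target; fake_status stands in
 * for the hardware status word. */
static uint32_t fake_status = 0;
static inline uint32_t read_status(void)
{
    return fake_status;
}
#endif

/* Return 1 if the given bit (0..31) is set in the status word, else 0. */
static inline int status_bit_set(unsigned bit)
{
    return (int)((read_status() >> bit) & 1u);
}
```

Something like `while (!status_bit_set(0)) { }` then polls a ready flag without going over the Avalon bus at all.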