Clancy,
Interesting theory, but I've seen evidence to the contrary. Jesse seems to know what the problem is, and the fact that the DMA engine doesn't exhibit it pretty much proves the CPU core needn't either.
12 clocks on a 50MHz bus is 240ns (20ns per clock). No competitive processor is going to take that long per access for carefully hand-coded, back-to-back consecutive reads from PC100 SDRAM.
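For anyone checking the arithmetic: at 50MHz one clock is 1000/50 = 20ns. A minimal Python sketch of the conversion (the function name is just illustrative):

```python
def clocks_to_ns(clocks: int, bus_mhz: float) -> float:
    """Convert a count of bus clocks to nanoseconds (1000/MHz = ns per clock)."""
    return clocks * 1000.0 / bus_mhz

# 12 clocks per read on a 50MHz bus
print(clocks_to_ns(12, 50.0))
```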
RAS + CAS for this part should be something like 30-50ns for the initial access, then 1 clock per additional access in the same row.
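Putting rough numbers on that, here's a sketch assuming the 30-50ns initial latency is rounded up to whole 50MHz clocks and every further word from the open row costs one clock. It ignores row misses, refresh and bus turnaround, so treat it as a best case; `burst_read_clocks` is a hypothetical helper, not anything from the real toolchain:

```python
import math

def burst_read_clocks(words: int, first_access_ns: float, bus_mhz: float = 50.0) -> int:
    """Rough best-case bus clocks to read `words` consecutive words from one open SDRAM row.

    Assumption: the initial RAS + CAS latency is rounded up to whole bus clocks,
    and each additional word in the same row costs one clock.
    """
    clk_ns = 1000.0 / bus_mhz
    first = math.ceil(first_access_ns / clk_ns)  # e.g. 50ns / 20ns -> 3 clocks
    return first + (words - 1)

# Compare an 8-word run against 12 clocks per read
for latency_ns in (30, 50):
    n = 8
    clocks = burst_read_clocks(n, latency_ns)
    print(f"{latency_ns}ns first access: {clocks} clocks for {n} words "
          f"(vs {12 * n} at 12 clocks/read)")
```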
As I've written before, I've seen the DMA master read 480 32-bit words out of an on-chip FIFO and write them into SDRAM in about 485 clocks, roughly one clock per word. Granted, that's writing to SDRAM (while also reading the FIFO!), but there's no way 12 clocks per read is what we should expect.
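The DMA figure works out as below; a quick sketch using the 480 words / ~485 clocks numbers above and the 12-clock reads from the traces (the helper name is hypothetical):

```python
def clocks_per_word(total_clocks: float, words: int) -> float:
    """Average bus clocks spent per transferred word."""
    return total_clocks / words

dma = clocks_per_word(485, 480)   # DMA: on-chip FIFO -> SDRAM, figures from the post
cpu = 12                          # clocks per CPU read seen in the traces
print(f"DMA: {dma:.3f} clocks/word; 12-clock reads are ~{cpu / dma:.1f}x slower")
```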
An interesting trace would be a DMA transfer from SDRAM to on-chip memory. That would show us what Dirk's traces should look like.
Ken