I would expect ldw to take 3 cycles on a miss with a 4 byte/line data cache (assuming no pipelining and a read latency of 1). With a 32 byte/line data cache it could be much higher, since the cache doesn't implement critical-word-first loading, so the load may have to wait for much or all of the line to be fetched.
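As a rough sketch of why the line size matters, here is a toy cycle model in C. The formula (1 cycle of cache overhead, 1 cycle to issue the read, then 1 cycle per word of the line) is my own assumption, chosen so that the 4 byte/line case matches the 3 cycle estimate above. It is not taken from the processor documentation, and the function name is made up for illustration.

```c
#include <assert.h>

/* Toy model, NOT from vendor documentation. Assumptions:
 *  - no pipelining in the fabric
 *  - on-chip RAM returns one 32-bit word per cycle (read latency of 1)
 *  - the whole line is filled before the load completes
 *    (no critical-word-first, no early restart) */
enum { WORD_BYTES = 4 };

/* Estimated cycles for an ldw that misses in the data cache. */
unsigned ldw_miss_cycles(unsigned line_bytes)
{
    /* The line size must be a whole number of 32-bit words. */
    assert(line_bytes >= WORD_BYTES && line_bytes % WORD_BYTES == 0);

    unsigned words_per_line = line_bytes / WORD_BYTES;

    /* 1 cycle cache overhead + 1 cycle to issue + 1 cycle per word. */
    return 2u + words_per_line;
}
```

Under these assumptions ldw_miss_cycles(4) gives 3 and ldw_miss_cycles(32) gives 10. Real numbers depend on how the fabric and the memory are configured, so measure on your own system before relying on them.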
I would expect ldwio to take 2 cycles, assuming there is no pipelining in the fabric and the on-chip RAM is set up for 1 cycle of read latency.
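If you want to issue ldwio from C rather than assembly, something like the following minimal sketch should work. It assumes a GCC toolchain targeting Nios II, where the built-in __builtin_ldwio is available; the helper name read_word_uncached is hypothetical. On any other target it falls back to a plain volatile read so the code still compiles, but that fallback does not bypass any cache.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: read one 32-bit word, bypassing the data cache
 * when built for Nios II with GCC. */
static inline uint32_t read_word_uncached(const volatile uint32_t *addr)
{
    assert(addr != 0);
#if defined(__nios2__)
    /* GCC's Nios II built-in; emits an ldwio instruction. */
    return (uint32_t)__builtin_ldwio(addr);
#else
    /* Assumption: host build for testing only, ordinary volatile load. */
    return *addr;
#endif
}
```

You would call it with the address of the word you want to read, for example read_word_uncached(&my_buffer[0]) where my_buffer lives in the on-chip memory.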
So in general, having a data cache doesn't make much sense when your data lives in on-chip memory. The data cache is itself built from on-chip memory, so you would be using on-chip memory to cache on-chip memory, which is redundant. Instead, you can use the on-chip memory as a tightly coupled memory, or remove the data cache altogether if on-chip memory is the only memory connected to the data master.