Hi,
I already tried that too. Let me share a code snippet where I see some of this strange behavior. In the code below, the eight separate memcpy_read_dma calls are significantly faster than wrapping everything in one single DMA call — and by "significantly faster" I mean 1400 ms per execution cycle of my algorithm versus 1600 ms.
/* NOTE(review): truncated snippet — the "for (int y ...)" loop opened below is
 * never closed here ("and so on ..."), so this fragment does not compile as-is.
 *
 * Prefetch the first 8 rows into the lifting cache, one DMA transfer per row.
 * Rows 0..6 are queued asynchronously with EARLY_DONE; row 7 uses the blocking
 * variant so all queued descriptors have drained before the lifting pass runs.
 * assumes c_off(row, col, width) yields an alt_16* into the lifting cache and
 * that &(synth->pixel) advances per transfer on the source side — TODO confirm */
memcpy_read_dma_async(c_off(0,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(1,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(2,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(3,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(4,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(5,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(6,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
/* Blocking transfer: uses the transfer-complete IRQ to wait until done. */
memcpy_read_dma(c_off(7,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16));
/* Vertical lifting across the freshly fetched rows — presumably an inverse
 * wavelet (synthesis) lifting step; verify against the full algorithm. */
for(int n = 0; n < synth->width; n++)
{
// lift 2's
/*c_off(0,n,synth->width) -= HWLIFT2(*c_off(1,n,synth->width), *c_off(1,n,synth->width));
*c_off(2,n,synth->width) -= HWLIFT2(*c_off(1,n,synth->width), *c_off(3,n,synth->width));
*c_off(4,n,synth->width) -= HWLIFT2(*c_off(3,n,synth->width), *c_off(5,n,synth->width));*/
/* Row 1 appears twice in the first update — presumably symmetric boundary
 * extension at the top edge; TODO confirm. */
*c_off(0,n,synth->width) -= ((2 + *c_off(1,n,synth->width) + *c_off(1,n,synth->width)) >> 2);
*c_off(2,n,synth->width) -= ((2 + *c_off(1,n,synth->width) + *c_off(3,n,synth->width)) >> 2);
*c_off(4,n,synth->width) -= ((2 + *c_off(3,n,synth->width) + *c_off(5,n,synth->width)) >> 2);
// lift 3's
*c_off(1,n,synth->width) += ((8 - *c_off(0,n,synth->width) + 9*(*c_off(0,n,synth->width))
+ 9*(*c_off(2,n,synth->width)) - (*c_off(4,n,synth->width))) >> 4);
}
/* Stream the remaining rows two at a time into the lifting cache, which is
 * used as a ring buffer of MEM_LIFTCACHE_DEPTH rows (hence the modulo).
 * The "-4" guard stops fetching 4 iterations before the end — presumably
 * because the tail rows are already resident; TODO confirm. */
for (int y = 0; y < synth->height/2; y++)
{
if(y < (synth->height/2-4))
{
memcpy_read_dma(c_off((8 + 2*y)%MEM_LIFTCACHE_DEPTH,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16));
memcpy_read_dma(c_off((9 + 2*y)%MEM_LIFTCACHE_DEPTH,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16));
}
// and so on ...
My memcpy_read_dma is just a simple wrapper around writing a DMA descriptor:
/* Completion ISR for the mSGDMA read dispatcher: counts one completed
 * IRQ-flagged descriptor, then acknowledges the interrupt so it can
 * re-fire for the next descriptor.
 * NOTE(review): read_isr_fired is shared with the foreground spin in
 * memcpy_read_dma_compl(); its definition is not visible in this snippet,
 * but it must be declared volatile (or _Atomic) or the compiler may hoist
 * the foreground read out of the spin loop — TODO confirm. */
static void sgdma_read_complete_isr(void * context)
{
read_isr_fired++;
clear_irq(MSGDMA_DISPATCHER_READ_CSR_BASE);
}
/*
 * Queue one memory-to-memory read transfer on the mSGDMA dispatcher
 * without waiting for it to finish.
 *
 * dest / src / size describe the copy; control_bits selects the descriptor
 * control flags (e.g. early-done or transfer-complete IRQ). Blocks only
 * while the dispatcher's descriptor FIFO is full.
 */
void memcpy_read_dma_async(void* dest, void* src, alt_u32 size, unsigned long control_bits)
{
    sgdma_standard_descriptor desc;

    /* Wait for a free slot in the dispatcher's descriptor buffer. */
    while ((RD_CSR_STATUS(MSGDMA_DISPATCHER_READ_CSR_BASE) & CSR_DESCRIPTOR_BUFFER_FULL_MASK) != 0) {
        /* spin */
    }

    construct_standard_mm_to_mm_descriptor(&desc, (alt_u32 *)src, (alt_u32 *)dest,
                                           size, control_bits);
    write_standard_descriptor(MSGDMA_DISPATCHER_READ_CSR_BASE,
                              MSGDMA_DISPATCHER_READ_DESCRIPTOR_SLAVE_BASE,
                              &desc);
}
/*
 * Block until at least one IRQ-flagged read descriptor has completed.
 *
 * Fix: consume exactly one completion (decrement) instead of resetting the
 * counter to zero. The old `read_isr_fired = 0` raced with
 * sgdma_read_complete_isr(): a completion arriving between the spin exiting
 * and the reset was silently discarded, so a subsequent wait could deadlock.
 * With a single outstanding IRQ descriptor (the current usage) the observable
 * behavior is unchanged.
 *
 * NOTE(review): read_isr_fired is written from interrupt context; its
 * definition (not visible in this snippet) must be volatile or the compiler
 * may hoist the read out of the spin loop — TODO confirm.
 */
void memcpy_read_dma_compl()
{
    while (read_isr_fired == 0) {
        /* spin until the read-dispatcher ISR signals completion */
    }
    read_isr_fired--; /* consume exactly one completion event */
}
/*
 * Blocking memory-to-memory DMA copy: queue one descriptor with the
 * transfer-complete IRQ enabled, then wait for the completion ISR.
 */
void memcpy_read_dma(void* dest, void* src, alt_u32 size)
{
    const unsigned long ctrl = DESCRIPTOR_CONTROL_TRANSFER_COMPLETE_IRQ_MASK;

    memcpy_read_dma_async(dest, src, size, ctrl);
    memcpy_read_dma_compl(); /* spin until the ISR reports completion */
}