Forum Discussion

Altera_Forum
Honored Contributor
10 years ago

Slow memcpy speed

Hi all,

I have a design based upon the “Lab 4 - Linux FFT Application” from Rocketboard which runs on the Terasic DE0-Nano-SoC (Cyclone V SoC) evaluation board.

First, the data is transferred from the FPGA to the HPS SDRAM using DMA. This transfer is fast: 8 kBytes (1k * 64 bit) takes 21 us => 380 Mbytes/s.

Doing the HPS signal processing on the data while it is stored in SDRAM is a bit slow, so to increase the signal processing speed the 8 kBytes of data are first copied into a local array using memcpy.

Now the signal processing is much faster, but the memcpy "penalty" is high: transferring the 8 kBytes takes 500 us => 16 Mbytes/s with compiler flag -O0, -O2 or -O3.

With -O1 the memcpy improves to 188 us => 42 Mbytes/s, but from what I have read this still seems to be at least 4 times slower than expected.

Has anyone done similar tests, or know if there are any other options that must be set to get a faster memcpy transfer?

All timing measurements are done using an oscilloscope (start/stop trigger signals are written from the HPS to the FPGA-GPIO).

OS: Angstrom v2015.12. Linux real time kernel version 4.1.22-ltsi-rt (PREEMPT RT)

2 Replies

  • Altera_Forum
    Honored Contributor

    An update:

    When defining arrays like this

    int value[2048]; // source array

    int dest[2048];  // destination array

    and running memcpy(dest, value, 2048*4), memcpy speed is high: 446 Mbytes/s.

    And the -Ofast compile flag gives faster speed than -O1, as expected.

    - - - - - -

    My design is based upon the fpga_fft example from Rocketboard where DMA transfers data from FPGA into HPS’s DRAM memory.

    The memory space for these data (*value) is defined using mmap:

    volatile unsigned int *value;

    volatile unsigned int dest[2048*4];

    #define result_base (FFT_SUB_DATA_BASE + (int)mappedBase + (FFT_SUB_DATA_SPAN/2))

    - - - - - -

    In main:

    // We need a pointer to the LW_BRIDGE from the software's point of view,

    // so we need to open a file.

    /* Open /dev/mem */

    if ((mem = open("/dev/mem", O_RDWR | O_SYNC)) == -1)

    fprintf(stderr, "Cannot open /dev/mem\n"), exit(1);

    // now map it into lw bridge space:

    mappedBase = mmap(0, 0x1f0000, PROT_READ | PROT_WRITE, MAP_SHARED, mem, ALT_LWFPGASLVS_OFST);

    if (mappedBase == MAP_FAILED) {

    printf("Memory map failed. error %i\n", (int)mappedBase);

    perror("mmap");

    }

    Run DMA and wait for completion

    ...

    ...

    // And when the DMA is finished the data is available:

    value = (unsigned int *)((int)result_base);

    - - - - - -

    Now, when running memcpy(dest, value, 2048*4) the speed is slow: only 42 Mbytes/s, and the compiler does not respond as expected to the -O flags, i.e. -Ofast is slower than -O1.

    It seems that using mmap really slows down the access to memory. Is it possible to speed this up?

    Any help would be greatly appreciated!

    Thanks,
  • Altera_Forum
    Honored Contributor

    I think my problem is related to the high address (ALT_LWFPGASLVS_OFST = 0xff200000) that is used, and this might have to be fixed in kernel space…

    While waiting for someone to fix this for me :) , I wrote an assembly version of memcpy based on the "NEON memory copy with preload" example from the ARM Information Center.

    I had to add "SUBS r2,r2,#0x40" before the loop; otherwise the loop would run 64 bytes too far (thus overwriting memory).

    Using this "neon memcpy" I got a bit more speed (62 MBytes/s), and I could use the -Ofast flag to optimize the rest of the code.

    This function is called the same way as memcpy, but the buffers must be 64-byte aligned and the size a multiple of 64 bytes (the loop copies in 64-byte chunks):

    void *neon_memcpy(void *ut, const void *in, size_t n)

    neon_memcpy.S:

    .arch armv7-a

    .fpu neon

    .global neon_memcpy

    .type neon_memcpy, %function

    neon_memcpy:

    SUBS r2,r2,#0x40

    neon_copy_loop:

    PLD [r1, #0xC0]

    VLDM r1!,{d0-d7}

    VSTM r0!,{d0-d7}

    SUBS r2,r2,#0x40

    BGE neon_copy_loop

    bx lr