Forum Discussion

Altera_Forum
Honored Contributor
10 years ago

Slow memcpy speed

Hi all,

I have a design based upon the “Lab 4 - Linux FFT Application” from Rocketboard which runs on the Terasic DE0-Nano-SoC (Cyclone V SoC) evaluation board.

First, the data is transferred from the FPGA to the HPS SDRAM using DMA. This transfer is fast: 8 kBytes (1k * 64 bit) takes 21 us => 380 Mbytes/s.

Doing the HPS signal processing on the data while it is stored in SDRAM is a bit slow, so to increase the signal processing speed the 8 kBytes of data are first copied into a local array using memcpy.

Now the signal processing is much faster, but the memcpy "penalty" is high: transferring the 8 kBytes takes 500 us => 16 Mbytes/s with compiler flag -O0, -O2 or -O3.

With -O1 the memcpy improves to 188 us => 42 Mbytes/s, but from what I have read this still seems to be at least 4 times slower than expected.

Has anyone done similar tests, or know if there are any other options that must be set to get a faster memcpy transfer?

All timing measurements are done using an oscilloscope (start/stop trigger signals are written from the HPS to the FPGA-GPIO).

OS: Angstrom v2015.12. Linux real time kernel version 4.1.22-ltsi-rt (PREEMPT RT)

2 Replies

  • Altera_Forum
    Honored Contributor

    An update:

    When defining arrays like this

    int value[2048]; // source array

    int dest[2048];  // destination array

    and running memcpy(dest, value, 2048*4), memcpy speed is high: 446 Mbytes/s.

    And the -Ofast compile flag gives faster speed than -O1, as expected.

    - - - - - -

    My design is based upon the fpga_fft example from Rocketboard where DMA transfers data from FPGA into HPS’s DRAM memory.

    The memory space for these data (*value) is defined using mmap:

    volatile unsigned int *value;

    volatile unsigned int dest[2048*4];

    #define result_base (FFT_SUB_DATA_BASE + (int)mappedBase + (FFT_SUB_DATA_SPAN/2))

    - - - - - -

    In main:

    // We need a pointer to the LW_BRIDGE from the software's point of view,

    // so we need to open a file.

    /* Open /dev/mem */

    if ((mem = open("/dev/mem", O_RDWR | O_SYNC)) == -1)

    fprintf(stderr, "Cannot open /dev/mem\n"), exit(1);

    // now map it into lw bridge space:

    mappedBase = mmap(0, 0x1f0000, PROT_READ | PROT_WRITE, MAP_SHARED, mem, ALT_LWFPGASLVS_OFST);

    if (mappedBase == MAP_FAILED) {

    printf("Memory map failed. error %i\n", (int)mappedBase);

    perror("mmap");

    }

    Run DMA and wait for completion

    ...

    ...

    // And when the DMA is finished the data is available:

    value = (unsigned int *)((int)result_base);

    - - - - - -

    Now, when running memcpy(dest, value, 2048*4) the speed is slow: only 42 Mbytes/s, and the compiler does not respond as expected to the -O flags, i.e. -Ofast is slower than -O1.

    It seems that using mmap really slows down the access to memory. Is it possible to speed this up?

    Any help would be greatly appreciated!

    Thanks,
  • Altera_Forum
    Honored Contributor

    I think my problem is related to the high address (ALT_LWFPGASLVS_OFST = 0xff200000) that is used, and this might have to be fixed in kernel space…

    While waiting for someone to fix this for me :) , I wrote an assembly version of memcpy based on the "NEON memory copy with preload" example from the ARM Information Center.

    I had to add "SUBS r2,r2,#0x40" before the loop; otherwise the loop would run 64 bytes too far (thus overwriting memory).

    Using this "neon memcpy" I got a bit more speed (62 MBytes/s), and I could use the -Ofast flag to optimize the rest of the code.

    This function is called the same way as memcpy, but the buffers must be 64-byte aligned and the size a multiple of 64 bytes (the loop copies in 64-byte chunks):

    void *neon_memcpy(void *ut, const void *in, size_t n)

    neon_memcpy.S:

    .arch armv7-a

    .fpu neon

    .global neon_memcpy

    .type neon_memcpy, %function

    neon_memcpy:

    SUBS r2,r2,#0x40

    neon_copy_loop:

    PLD [r1, #0xC0]

    VLDM r1!,{d0-d7}

    VSTM r0!,{d0-d7}

    SUBS r2,r2,#0x40

    BGE neon_copy_loop

    bx lr