I think my problem is related to the high address that is used (ALT_LWFPGASLVS_OFST = 0xff200000), and this might have to be fixed in kernel space…
While waiting for someone to fix this for me :), I wrote an assembly version of memcpy based on the “NEON memory copy with preload” example from the ARM Infocenter.
I had to add “SUBS r2,r2,#0x40” before the loop. Without it, the loop runs one iteration too many and copies 64 bytes too far (thus overwriting memory).
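To show why the extra SUBS matters, here is a small C model of the loop counter. It is only a sketch of the SUBS/BGE logic, not the real copy, and the name bytes_copied is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: models the counter in r2 and returns how many
   bytes the SUBS/BGE loop would move for a request of n bytes.
   pre_subtract selects whether the extra SUBS before the loop is there. */
size_t bytes_copied(int32_t n, int pre_subtract)
{
    assert(n >= 0);              /* a byte count is never negative */
    int32_t r2 = n;
    size_t copied = 0;

    if (pre_subtract)
        r2 -= 64;                /* SUBS r2,r2,#0x40 before the loop */

    do {
        copied += 64;            /* VLDM/VSTM move one 64-byte block */
        r2 -= 64;                /* SUBS r2,r2,#0x40 inside the loop */
    } while (r2 >= 0);           /* BGE: repeat while result is >= 0 */

    return copied;
}
```

For n = 128 the model gives 192 bytes without the extra SUBS and 128 bytes with it. Note that the loop body always runs at least once, so even n = 0 moves one block.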
Using this "neon memcpy" I got a bit more speed (62 MBytes/s), and I could use the -Ofast flag to optimize the rest of the code.
This function is called the same way as memcpy, but the data must be 64-byte aligned, and n must be a non-zero multiple of 64 because the loop only copies whole 64-byte blocks:
void *neon_memcpy(void *ut, const void *in, size_t n)
neon_memcpy.S:
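.arch armv7-a
.fpu neon
.text
.global neon_memcpy
.type neon_memcpy, %function
neon_memcpy:
    MOV r3,r0               @ copy through r3 so r0 still holds the destination on return, like memcpy
    SUBS r2,r2,#0x40        @ pre-decrement so the loop stops after exactly n bytes
neon_copy_loop:
    PLD [r1,#0xC0]          @ preload source data 192 bytes ahead
    VLDM r1!,{d0-d7}        @ load 64 bytes and advance the source pointer
    VSTM r3!,{d0-d7}        @ store 64 bytes and advance the destination pointer
    SUBS r2,r2,#0x40        @ 64 fewer bytes remaining
    BGE neon_copy_loop      @ loop while the count is still >= 0
    BX lr
.size neon_memcpy, .-neon_memcpy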
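If the 64-byte restriction is a nuisance, a thin C wrapper can hide it. This is only a sketch: copy_with_tail, block_copy_fn and BLOCK are names invented here, and it assumes the block routine is only safe for 64-byte aligned pointers and whole 64-byte blocks. Anything else falls back to plain memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 64u  /* bytes moved per loop iteration in neon_memcpy.S */

/* Hypothetical type: any routine with memcpy's signature, e.g. neon_memcpy. */
typedef void *(*block_copy_fn)(void *dst, const void *src, size_t n);

/* Copies n bytes. Whole 64-byte blocks go through block_copy when both
   pointers are 64-byte aligned; the remainder (or everything, if the
   pointers are not aligned) goes through the standard memcpy. */
void *copy_with_tail(block_copy_fn block_copy,
                     void *dst, const void *src, size_t n)
{
    assert(block_copy != NULL);

    size_t bulk = 0;
    if ((((uintptr_t)dst | (uintptr_t)src) % BLOCK) == 0)
        bulk = n - (n % BLOCK);      /* largest multiple of 64 <= n */

    if (bulk > 0)                    /* never call the block routine with 0 */
        block_copy(dst, src, bulk);

    if (n > bulk)                    /* leftover tail, or unaligned request */
        memcpy((char *)dst + bulk, (const char *)src + bulk, n - bulk);

    return dst;                      /* same return convention as memcpy */
}
```

On the target this would be called as copy_with_tail(neon_memcpy, ut, in, n). Passing the copy routine as a parameter also lets the wrapper be tried on a PC with memcpy standing in for the NEON version.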