Altera_Forum
15 years ago
copy performance in user space vs. kernel
Hi,
I am using Linux 2.6.32 (with MMU) on a system with DDR SDRAM, and I've run into performance problems when copying data. Copying from DDR to DDR with memcpy in an application only reaches 6-10 MB/s. Copying from SRAM (via an mmapped buffer) is equally slow. The same copy done in the kernel is about 3 times faster: a memcpy on vmalloced or kmalloced buffers is about as fast as a DMA copy.

I am using the binary toolchain for MMU Linux. The system has 32 KiB data and instruction caches, 8 uTLB entries for data and instructions, and 256 TLB entries. Both kernel and application are compiled with -O2, and the process runs at real-time priority (SCHED_FIFO).

My tests:
/* user space */
#define BUFLEN 53600
source = (char*)malloc(BUFLEN);
dest = (char*)malloc(BUFLEN);
for (i = 0; i < BUFLEN; i++)
{
    source[i] = i % 100 + 20; // fill with a test pattern
}
*tp = 1;
memcpy(dest, source, BUFLEN); // 5.3ms
*tp = 0;
/* kernel */
char* source = kmalloc(53600, GFP_DMA|GFP_KERNEL);
char* dest = kmalloc(53600, GFP_DMA|GFP_KERNEL);
char* source_io = (char*)(ioremap_nocache((unsigned int)source, 53600)); // uncached alias
char* dest_io = (char*)(ioremap_nocache((unsigned int)dest, 53600));     // uncached alias
char* source_v = vmalloc(53600);
char* dest_v = vmalloc(53600);
volatile unsigned int* dma = (volatile unsigned int*)ioremap_nocache(DDR_TO_DDR_DMA_BASE, DDR_TO_DDR_DMA_SPAN);
int j;
// Register indices below are assumed from the standard Altera DMA controller
// core layout: 0 status, 1 read address, 2 write address, 3 length, 6 control.
dma[0] = 0; // reset status
dma[1] = (unsigned int)virt_to_phys(source);
dma[2] = (unsigned int)virt_to_phys(dest);
dma[3] = 53600;
dma[6] = 0x84;
for (j = 0; j < 53600; j++)
{
    source[j] = j % 100 + 20; // fill with a test pattern
}
*tp = 1;
memcpy(dest, source, 53600); // 1.6ms (kmalloc)
*tp = 0;
mdelay(1);
*tp = 1;
memcpy(dest_v, source_v, 53600); // 1.8ms (vmalloc)
*tp = 0;
mdelay(1);
*tp = 1;
memcpy(dest_io, source_io, 53600); // 4.7ms (uncached)
*tp = 0;
mdelay(1);
*tp = 2;
dma[6] = 0x8C; // start the transfer
while (!(dma[0] & 1)); // 1.7ms (DMA), wait for done
dma[6] = 0x84;
*tp = 0;

Measurements are done by putting out a signal on a GPIO pin (tp). In user space the pin is accessed through mmap on /dev/mem, so no file operations are included in the measurement.

I would expect the user space copy to take about as long as the vmalloc copy (with some performance loss compared to the kmalloc copy because the memory is non-contiguous), but it is 3 times slower. It is in fact even slightly slower than the uncached in-kernel copy (5.3 ms vs. 4.7 ms). The results are consistent between runs.

Any ideas to explain the discrepancy?