Altera_Forum
Honored Contributor
8 years agoExtreme Performance Drop while using local memory
I have two simple kernels both doing the same thing, but the only difference is on of them reads the data from global memory, and the other first copy data into local memory and then read it from there. Here are my two kernels:
#pragma OPENCL EXTENSION cl_khr_fp64: enable __kernel void Test11(__global float *data, __global float *rands, int index, int rand_max){ float2 temp; int gid = get_global_id(0); temp = data[gid]; # pragma unroll for (int i = 1; i < 500; i++) { temp = (float) rands * temp;}
data[gid] = temp.s0;
}
and,
#pragma opencl extension cl_khr_fp64: enable
__kernel void test11(__global float *data, __global float *rands, int index, int rand_max){
float2 temp;
__local float localrands[500];
int depth = 500;
int gid = get_global_id(0);
int lid = get_local_id(0);
int localworksize = get_local_size(0);
int workitemcopyportion = depth / localworksize;
event_t event = async_work_group_copy (localrands, &(rands[lid * workitemcopyportion]), (depth - lid*workitemcopyportion < workitemcopyportion) ? (depth - lid*workitemcopyportion) : workitemcopyportion, 0);
wait_group_events (1, &event);
temp = data[gid];
# pragma unroll
for (int i = 1; i < 500; i++) {
temp = (float) localrands * temp; } data[gid] = temp.s0; } Talking about the OpenCL parameters, I set the total number of work items as 1048576. Looking at the results I see the kernel which reads directly from global memory can achieve around 227 GFLOPS and the one which first copies data to local memory and then read it achieves around 23 GFLOPS. The group size is also set 128. Now doing the same thing on GPU, I can say the one utilizing the local memory achieves around 2.5x higher performance. But in FPGA I see severe performance drops. Can anyone help me what is going wrong in this design? Thanks,