--- Quote Start ---
Please attach the *kernel_name*.log and quartus_sh_compile.log files from the compilation folder.
--- Quote End ---
It seems I ran out of DSPs :(. DSP utilization came out at 162%. Apparently the example was written specifically for the D8 chip rather than the AB chip. Can I remove the loop unrolling to reduce DSP utilization? Here is a snippet of the code:
__kernel
__attribute__((reqd_work_group_size(NUM_THREADS,1,1)))
void black_scholes( int m, int n,
float drift,
float vol,
float S_0,
float K)
{
// running statistics -- use double precision for the accumulator
double sum = 0.0;
// loop over all simulations
for(int path=0;path<m;path++) {
float S = S_0;
float arithmetic_average = 0.0f; // We're not including the initial price in the average
for (int t_i=0; t_i<n/VECTOR; t_i++) {
float U[VECTOR], Z[VECTOR];
vec_float_ty U0 = read_channel_intel(RANDOM_STREAM_0);
vec_float_ty U1 = read_channel_intel(RANDOM_STREAM_1);
vec_float_ty U2 = read_channel_intel(RANDOM_STREAM_2);
vec_float_ty U3 = read_channel_intel(RANDOM_STREAM_3);
#pragma unroll VECTOR_DIV4
for (int i=0; i<VECTOR_DIV4; i++) {
U[i]=U0[i];
U[i+1*VECTOR_DIV4]=U1[i];
U[i+2*VECTOR_DIV4]=U2[i];
U[i+3*VECTOR_DIV4]=U3[i];
}
#pragma unroll VECTOR_DIV2
for (int i=0; i<VECTOR_DIV2; i++) {
float2 z = box_muller(U[2*i], U[2*i+1]);
Z[2*i] = z.x;
Z[2*i+1] = z.y;
}
#pragma unroll VECTOR
for (int i=0; i<VECTOR; i++) {
// convert uniform distribution to normal
float gauss_rnd = Z[i];
// Simulate the path movement using geometric brownian motion
S *= drift * exp(vol * gauss_rnd);
arithmetic_average += S;
}
}
It took close to 24 hours to compile the example on a 16-core, 3.3 GHz machine with 128 GB of RAM! :o
Thanks,
QG