Altera_Forum
8 years ago

--- Quote Start ---
And what about moving the "start_time = getCurrentTimestamp()" call?
--- Quote End ---

Same result. Updated code snippet:

const double start_time = getCurrentTimestamp();

// Launch the kernels
for (int kz = 0; kz < kerns; kz++) {
    status = clEnqueueTask(queue, kernel, 0, NULL, &task_event);
    exitOnFail(status, "Failed to launch kernel");
}

Results with 8 kernels:

$ bin/host 100000000 8
get_plat_info: Intel(R) FPGA SDK for OpenCL(TM)
Reprogramming device [0] with handle 1
Task:0 complete (3437.884 ms)
Task:1 complete (6875.553 ms)
Task:2 complete (10313.318 ms)
Task:3 complete (13751.042 ms)
Task:4 complete (17188.784 ms)
Task:5 complete (20626.527 ms)
Task:6 complete (24064.256 ms)
Task:7 complete (27501.988 ms)
Time: 27501.995 ms (3437.749 ms / kernel)
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000

And results with 4 kernels:

$ bin/host 100000000 4
Reprogramming device [0] with handle 1
Task:0 complete (3437.864 ms)
Task:1 complete (6875.626 ms)
Task:2 complete (10313.367 ms)
Task:3 complete (13751.130 ms)
Time: 13751.135 ms (3437.784 ms / kernel)
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000

I have four boards. In other code, I can launch different numbers of kernels on each of the four boards, and when I do, I see the speedup I'm looking for. For example, if I run 1 kernel on each of the four boards, it takes X ms. But when I run 4 kernels on one board, it takes approximately 4 * X ms (as shown above).
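
To make the pattern in these timings explicit, here is a minimal sketch in plain C that checks whether a list of task completion times looks like back-to-back execution. The helper names (looks_serialized, slowdown_factor) and the tolerance parameter are hypothetical and not part of the original host program. The check only describes what the numbers show and says nothing about the cause.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the original host code.
 * done_ms[i] is the elapsed time (ms since start_time) at which task i
 * was reported complete. Returns 1 when every gap between consecutive
 * completions is within rel_tol of the first task's completion time,
 * which is what back-to-back (serial) execution would produce. */
int looks_serialized(const double *done_ms, size_t n, double rel_tol)
{
    assert(n > 0 && done_ms[0] > 0.0);
    const double single = done_ms[0];      /* time taken by the first task */
    for (size_t i = 1; i < n; i++) {
        double diff = (done_ms[i] - done_ms[i - 1]) - single;
        if (diff < 0.0)
            diff = -diff;                  /* absolute deviation of this gap */
        if (diff > rel_tol * single)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: total time divided by the first task's time.
 * Close to n for serial execution, close to 1 for full overlap. */
double slowdown_factor(const double *done_ms, size_t n)
{
    assert(n > 0 && done_ms[0] > 0.0);
    return done_ms[n - 1] / done_ms[0];
}
```

Fed the eight completion times from the 8-kernel run with a 1% tolerance, looks_serialized reports serial behaviour and slowdown_factor comes out at roughly 8, which matches the observation that N kernels on one board take about N times as long as one.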