Yes, I understand that everything runs on the CPU. I may have explained this poorly: what I meant is that the plain iterative for loop takes less time than the version that is supposed to run in parallel (parallel_for). Doesn't parallel_for process all the rows in the buffers at the same time? If so, why is its performance worse? That is my main question. The iterative for loop runs on the host, and the parallel_for runs as a kernel on the device.
I also tried splitting the input into smaller pieces by using parallel_for_work_group, but it gave the same results. The iterative code takes no more than 40 seconds, while the parallel one takes more than 7 minutes.
Thank you!