Altera_Forum
8 years ago
For my kernels using volatile, I would normally have something arranged like this:
    __kernel void foo(__global const B_parallel * volatile W, int count)
    {
        B_parallel p = W[idx];  /* idx: whatever index the kernel computes */
    }

This works for me. I'm not exactly sure why you are being required to make your variable volatile as well, unless it is a pointer to a struct, which doesn't appear to be the case.

The number of memory blocks used will almost always be higher than the ideal number calculated from the bit count, since each block holds a limited number of bits. How much higher is difficult to find out, because the compiler picks whatever layout it thinks is optimal.

The "1 read and 1 write" refers to the number of accesses made to local memory at a time. If you want more accesses, you will need to unroll the loop or add more local memory accesses in the code. The best practices guide recommends limiting this to four accesses for optimal performance. Having more accesses will likely create a more complex memory structure and most likely cause replication, which consumes a large number of memory blocks so that the memory can be accessed widely in parallel. In addition to the memory replication, performance will likely suffer as well. The M20Ks are capable of operating at twice the FPGA clock speed, so the memory can be double pumped, letting it support twice as many accesses while keeping up with the FPGA clock.

For banking, I usually stick with the default, as that handles what I need pretty well. In some cases tweaking the banking can give some extra performance, but that takes some experimentation. I haven't tried banking with 256 parallel accesses, but I'd imagine there is a point where the memory replication starts to hurt performance.

There are several emulator limitations, but it does seem to have "unlimited" memory, or however much your CPU is willing to use, since it actually runs on the CPU instead of the FPGA.
Although it won't fit on the board, I have run into a number of bugs where the size was too large. I'm not sure whether that size depends on your development environment, but in any case, at that size it isn't practical for an OpenCL FPGA design anyway. There is a limit on how much local memory you can effectively use; if you need more space, you will have to move the data into global memory. I hope that helps.
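To put rough numbers on the block count point above, here is a minimal sketch in plain C (host-side arithmetic, not kernel code). It assumes 20480 bits per M20K block and treats replication as whole copies of the memory. The function names and the replica count are hypothetical, and the compiler's real layout will usually use more blocks than this lower bound.

```c
#include <assert.h>

/* An M20K block holds 20 Kbit = 20480 bits. */
#define M20K_BITS 20480UL

/* Ideal (lower-bound) block count: total bits divided by the block
 * capacity, rounded up. The compiler's actual layout is usually larger. */
unsigned long ideal_m20k_blocks(unsigned long depth_words,
                                unsigned long width_bits)
{
    unsigned long total_bits = depth_words * width_bits;
    return (total_bits + M20K_BITS - 1) / M20K_BITS;
}

/* Assumption for illustration only: each replica is a full copy of the
 * memory, so the block count scales linearly with the replica count. */
unsigned long replicated_m20k_blocks(unsigned long depth_words,
                                     unsigned long width_bits,
                                     unsigned long replicas)
{
    assert(replicas > 0);
    return ideal_m20k_blocks(depth_words, width_bits) * replicas;
}
```

For example, a local array of 1024 32-bit words is 32768 bits, so ideal_m20k_blocks(1024, 32) gives 2 blocks even though the data only fills 1.6 of them. With a hypothetical 4 replicas, replicated_m20k_blocks(1024, 32, 4) gives 8.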