--- Quote Start ---
At the moment the answer is no but we will take this into consideration since I agree being able to use block memory that is globally accessed is a beneficial thing.
--- Quote End ---
I understand that complying with the OpenCL conformance tests came first on the list. However, one should also note that while OpenCL is a cross-platform programming model, it was not originally developed with FPGAs in mind; otherwise its specification would include constructs that are 'hardware-aware'.
AFAIK, BRAMs are used only when __local buffers are instantiated, right? I can think of two possible ways of using BRAMs explicitly. The first is to expand the concept of __local buffers, within AOCL only. For instance, one could define a program-scope variable residing in __local memory. An initialization kernel could then be defined to explicitly move data from DRAM to BRAM. A naive approach, I assume.
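For reference, here is a rough sketch of the staging that standard OpenCL already allows today: a __local array declared inside the kernel and filled from DRAM on every launch. It is not the proposed extension. The names stage_and_sum and TABLE_SIZE are made up for illustration, and I am assuming (per the above) that the compiler places the __local array in BRAM. The shim at the top is only there so the kernel body can be sanity-checked as plain C on the host, emulating a single work-item.

```c
/* Sketch only: stage a table from DRAM (__global) into on-chip memory (__local).
   stage_and_sum, TABLE_SIZE and the host shim are hypothetical names. */
#ifndef __OPENCL_VERSION__
/* Host-side shim: lets the kernel compile as plain C and emulates
   one work-item in one work-group. Not part of any OpenCL SDK. */
#include <stddef.h>
#include <assert.h>  /* only needed for host-side checks */
#define __kernel
#define __global
#define __local
#define CLK_LOCAL_MEM_FENCE 0
static size_t get_local_id(unsigned dim)   { (void)dim; return 0; }
static size_t get_local_size(unsigned dim) { (void)dim; return 1; }
static size_t get_global_id(unsigned dim)  { (void)dim; return 0; }
static void barrier(int flags)             { (void)flags; }
#endif

#define TABLE_SIZE 256

__kernel void stage_and_sum(__global const int *table, __global int *out)
{
    /* In standard OpenCL, __local arrays are declared at kernel-function
       scope, so the copy below is repeated on every kernel launch. */
    __local int cache[TABLE_SIZE];

    /* Work-items of the group cooperatively copy DRAM -> on-chip memory. */
    for (size_t i = get_local_id(0); i < TABLE_SIZE; i += get_local_size(0))
        cache[i] = table[i];
    barrier(CLK_LOCAL_MEM_FENCE);  /* wait until the whole table is staged */

    /* All later reads are served from the local copy. */
    int sum = 0;
    for (int i = 0; i < TABLE_SIZE; ++i)
        sum += cache[i];
    out[get_global_id(0)] = sum;
}
```

The limitation is visible here: the contents of cache do not survive past the kernel invocation, which is exactly why a persistent, globally accessible BRAM buffer would need something beyond the current __local semantics.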
Alternatively, a cl_mem object residing in DRAM could be mapped onto a cl_mem object declared to reside in BRAM, using the existing clEnqueueMapBuffer function, although that function relates device memory to host memory, not device memory to device memory. Instead of a DMA transfer from device to host, some mechanism would fetch the data from DRAM and place it accordingly in the BRAM.
Either way, I think it would take quite an effort to fit explicit BRAM usage within the OpenCL 'boundaries'. How would we state, within OpenCL, how many blocks we need and want our kernels to access? Neither of the above solutions addresses that case.