--- Quote Start ---
What you are describing is not an FPGA-specific problem; the same problem applies to all accelerators. Many people have worked on streaming/pipelining computation between a CPU and a GPU, and there should also be multiple examples of doing this on FPGAs (probably not with OpenCL, though). Usually this is done by double- or multi-buffering, where the input is partitioned into multiple chunks: the CPU processes the first chunk and writes it to buffer A on the accelerator. While the accelerator is processing buffer A, the CPU processes the second chunk and writes it to buffer B on the accelerator. When buffer A is done, the accelerator switches to buffer B and the CPU switches from buffer B back to buffer A, and so on. You can use OpenCL events, or custom locks/flags, to synchronize the accelerator and the CPU in this case.
The concept of host channels has also recently been added to Altera's compiler; it allows you to stream data directly from the host to the FPGA, but it is only available on Altera's reference board.
--- Quote End ---
Thanks for the suggestions.
Yes, I see there is a host channel in the documentation, and it is limited to one input stream and one output stream, if I remember correctly.
There are also io streams, which seem to be used for simulation. I may explore them later.
For now, I will try the double-buffering scheme and see whether it helps improve my design.
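A minimal sketch of the double-buffering scheme described in the quote might look like the following. It is a host-side simulation in Python, not OpenCL code: a worker thread stands in for the accelerator, semaphores stand in for the OpenCL events or custom flags, and the names cpu_preprocess, accelerator_kernel and run_double_buffered are hypothetical placeholders rather than anything from the original design.

```python
import threading

NUM_BUFFERS = 2  # buffer A and buffer B


def cpu_preprocess(chunk):
    # Hypothetical CPU stage: stands in for whatever the host does per chunk.
    return [x + 1 for x in chunk]


def accelerator_kernel(buf):
    # Hypothetical accelerator stage: stands in for the FPGA kernel.
    return sum(x * x for x in buf)


def run_double_buffered(chunks):
    buffers = [None] * NUM_BUFFERS
    # free[i] is signalled when the CPU may overwrite buffer i;
    # full[i] is signalled when the accelerator may read buffer i.
    free = [threading.Semaphore(1) for _ in range(NUM_BUFFERS)]
    full = [threading.Semaphore(0) for _ in range(NUM_BUFFERS)]
    results = []

    def accelerator():
        for i in range(len(chunks)):
            slot = i % NUM_BUFFERS          # alternate A, B, A, B, ...
            full[slot].acquire()            # wait until the CPU has filled it
            results.append(accelerator_kernel(buffers[slot]))
            free[slot].release()            # hand the buffer back to the CPU

    worker = threading.Thread(target=accelerator)
    worker.start()

    for i, chunk in enumerate(chunks):
        slot = i % NUM_BUFFERS
        data = cpu_preprocess(chunk)        # can overlap with the accelerator
        free[slot].acquire()                # wait until the buffer is reusable
        buffers[slot] = data                # the "write to the accelerator"
        full[slot].release()                # tell the accelerator it is ready

    worker.join()
    return results


if __name__ == "__main__":
    print(run_double_buffered([[1, 2], [3, 4], [5]]))  # prints [13, 41, 36]
```

The CPU only blocks when both buffers are still in use, so preprocessing of chunk i+1 can proceed while the accelerator works on chunk i. In a real OpenCL host program the two semaphore arrays would presumably be replaced by events returned from the buffer-write and kernel-launch commands.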
Cheng Liu