Altera_Forum
Honored Contributor
8 years ago
Fast N-Body application in embedded RAM (Cyclone 10 GX)
Hey everyone,
I'm working on a prototype for complex image processing tasks based on the new Cyclone 10 GX family. The pipeline I'm implementing consists of some pre-processing steps (smoother, transformation) and has to write each line individually into an SRAM to initiate the next step.

This next step is a kind of N-Body application: each pixel has to be compared to every other pixel within a relative range to find a line position where some specific properties are matched. The algorithm has to analyze each pixel sequentially, so this step cannot be done out of order. In the current implementation, I wait until the line buffer is filled, then read the first pixel (the reference pixel) and all other pixels (the compare pixels) sequentially and push them into the pipeline.

Now here's the problem: with images 1280 pixels wide and a relative range of 64 pixels, each line takes 1280 * 64 = 81,920 cycles to complete, which slows down the preceding pre-processing steps enormously. I see two solutions: I can process lines in a multiplexed scheme (several line pipelines running side by side), which duplicates RAM and logic for every additional pipeline, or I can read multiple compare pixels in parallel out of each line buffer to parallelize the pixel comparison step. Since I can't implement 64 multiplexed line pipelines to reach full performance, I need fine-grained parallelization in addition.

The next problem: if I want to read multiple pixels in parallel out of a line buffer, I can either duplicate the line buffers to allow multiple read pointers, or increase the clock frequency to speed up the reads (and move the data into the slower clock domain in a second step). Both ways are risky, since SRAM is quite constrained and the base frequency is already 200 MHz.

Now the question: how would you solve such a problem?
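
To put rough numbers on the trade-off described above, here is a minimal back-of-the-envelope sketch in Python. It is only an idealized model, not part of the actual design: the function names and the parameters `line_pipelines` and `reads_per_cycle` are made up for illustration, and it assumes one comparison per read per cycle with no stalls or pipeline fill overhead.

```python
import math


def cycles_per_line(width, search_range, line_pipelines=1, reads_per_cycle=1):
    """Effective cycles per image line in an idealized model.

    width           -- pixels per line (reference pixels)
    search_range    -- compare pixels per reference pixel
    line_pipelines  -- hypothetical number of multiplexed line pipelines
    reads_per_cycle -- hypothetical number of compare pixels read in
                       parallel from one line buffer
    """
    comparisons = width * search_range
    # Both kinds of parallelism divide the work evenly in this model.
    return math.ceil(comparisons / (line_pipelines * reads_per_cycle))


def line_time_us(cycles, clock_mhz):
    """Convert a cycle count to microseconds at the given clock (MHz)."""
    return cycles / clock_mhz


# Baseline from the post: 1280 pixels, range 64, fully sequential.
baseline = cycles_per_line(1280, 64)
print(baseline)  # 81920

# Example mix (assumed values): 2 line pipelines, 4 parallel reads each.
mixed = cycles_per_line(1280, 64, line_pipelines=2, reads_per_cycle=4)
print(mixed)  # 10240

# Per-line time of the baseline at the 200 MHz base frequency.
print(line_time_us(baseline, 200.0))
```

In this model, reaching one reference pixel per cycle (1280 cycles per line) needs the product of `line_pipelines` and `reads_per_cycle` to equal the range of 64, which is why the question comes down to how that factor is split between duplicated buffers, extra pipelines and a faster read clock.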