Fast N-Body application in embedded RAM (Cyclone 10 GX)

Question

Hey everyone,

I'm working on a prototype for complex image processing tasks based on the new Cyclone 10 GX family. The pipeline I'm implementing consists of some pre-processing steps (smoothing, transformation) and needs to write each line individually into an SRAM to initiate the next step. This next step is a kind of N-body application, where each pixel needs to be compared to every other pixel within a relative range to find a line position where some specific properties match. The algorithm requires the pixels to be analyzed sequentially, so this step cannot be done out of order.

In the current implementation, I wait until the line buffer is filled, then read the first pixel (the reference pixel) and all other pixels (the compare pixels) sequentially and push them into the pipeline.

Now here's the problem: with images 1280 pixels wide and a relative range of 64 pixels, each line requires 1280 * 64 = 81,920 cycles to complete, which slows down the preceding pre-processing steps enormously. Two solutions exist: I can either process each line in a multiplexed scheme, effectively doubling RAM and logic requirements, or I can read multiple compare pixels in parallel out of each line buffer to parallelize the pixel comparison step. Since I can't implement 64 multiplexed line pipelines to achieve full performance, I also need fine-grained parallelization.

The next problem: if I want to read multiple pixels in parallel out of a line buffer, I can either duplicate the line buffers to allow multiple read pointers, or increase the clock frequency to speed up reads (and move the data into the slower clock domain in a second step). Both ways are risky, since SRAM is quite constrained and the base frequency is already 200 MHz.

Now the question: how would you solve such a problem?
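To make the cycle budget in the question concrete, here is a minimal Python sketch of the arithmetic. The function names and the `parallel_compares` parameter are illustrative assumptions, not part of the original design. It assumes one compare per cycle per comparator and ignores pipeline fill and drain latency.

```python
def cycles_per_line(width, search_range, parallel_compares=1):
    """Cycles to compare every reference pixel against its search range.

    Assumes each comparator handles one compare pixel per cycle; the
    division is rounded up. Pipeline latency is ignored (assumption).
    """
    total_compares = width * search_range
    return -(-total_compares // parallel_compares)  # ceiling division


def line_time_us(cycles, clock_mhz):
    """Convert a cycle count to microseconds at the given clock in MHz."""
    return cycles / clock_mhz


if __name__ == "__main__":
    # Numbers from the question: 1280-pixel lines, range 64, 200 MHz clock.
    for p in (1, 2, 4, 16, 64):
        c = cycles_per_line(1280, 64, p)
        print(f"{p:2d} compares/cycle: {c:6d} cycles, "
              f"{line_time_us(c, 200):8.2f} us per line")
```

With a single comparator this gives 81,920 cycles per line; only with 64 compares per cycle does the step drop to 1280 cycles, i.e. one reference pixel per cycle.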

altera_forum · Answer

Can you describe your N-Body algorithm in more detail? Perhaps share some links to relevant papers, blogs, ...

Assuming that the processing module accepts all 64 pixels at the same time and is fully pipelined, all you have to do is create a ping-pong buffer where you write the pixel data one (or perhaps 2 or 4) at a time but read 64 pixels on the other side. Assuming 8-bit pixel data, this would require 16 RAM blocks.

Regards,

Josy
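As a rough illustration of the ping-pong buffer suggested above, the following Python sketch models the behaviour in software and reproduces the block count. It is a behavioural model only, not HDL. The class and function names are hypothetical, and the 32-bit per-block data width is an assumption, chosen because it yields the 16 blocks quoted for 64 pixels of 8 bits.

```python
import math


def ram_blocks_for_read_width(pixels_per_read, bits_per_pixel,
                              block_width_bits=32):
    """RAM blocks needed side by side to deliver one wide read word.

    block_width_bits=32 is an assumed per-block data width.
    """
    return math.ceil(pixels_per_read * bits_per_pixel / block_width_bits)


class PingPongLineBuffer:
    """Two line banks: one is filled narrowly while the other is read wide."""

    def __init__(self, width, pixels_per_read):
        assert width % pixels_per_read == 0, "line must split into whole words"
        self.pixels_per_read = pixels_per_read
        self.banks = [[0] * width, [0] * width]
        self.write_bank = 0  # index of the bank currently being filled

    def write_pixel(self, x, value):
        """Narrow write port: store one pixel of the incoming line."""
        self.banks[self.write_bank][x] = value

    def swap(self):
        """Line complete: the filled bank becomes the read bank."""
        self.write_bank ^= 1

    def read_word(self, word_index):
        """Wide read port: return pixels_per_read pixels from the read bank."""
        start = word_index * self.pixels_per_read
        read_bank = self.banks[self.write_bank ^ 1]
        return read_bank[start:start + self.pixels_per_read]


if __name__ == "__main__":
    print(ram_blocks_for_read_width(64, 8))  # 64 * 8 bits / 32 bits per block
    buf = PingPongLineBuffer(width=1280, pixels_per_read=64)
    for x in range(1280):
        buf.write_pixel(x, x % 256)
    buf.swap()
    print(buf.read_word(1)[:4])
```

Note that this model only returns word-aligned groups of 64 pixels. Whether the compare window may start at an arbitrary pixel offset depends on the algorithm, which the thread does not specify.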
