Re: Question about M20K block packing

Thank you for your reply. My issue is not that the tool fails to properly infer the RAMs: I am using Intel's "On-Chip Memory (RAM or ROM)" IP from Platform Designer. The issue is that, if I use pipeline stages and interconnects to connect different agents to different SRAMs, the M20K block packing used to implement these SRAMs is only 80% efficient with 32-bit-wide ports, i.e. 20% of the available M20K capacity is wasted. I noticed that the efficiency was nearly 100% when I had no pipeline stages and only one agent connected to each SRAM. Efficiency dropped to 80% when multiple agents connect to each SRAM (i.e. an interconnect is needed) and/or I add pipeline stages between the agent and the SRAM ports.

Question about M20K block packing

I am using Intel Quartus Prime 21.1, targeting the Stratix 10 MX 2100 device. I have several read/write Avalon memory-mapped interfaces from a Load/Store unit that are connected to True Dual-Port RAMs. I am using double buffering, so each interface is connected to 2 such RAMs through a simple demultiplexer interconnect that lies between the Load/Store unit and the RAMs. The RAMs are implemented using the "On-Chip Memory (RAM or ROM)" IP from Platform Designer. The RAM ports are 32 bits wide and each RAM is 16384 bytes. My simple design has only 28 such RAMs for now. Since M20K blocks with 32-bit-wide ports are configured in the 512x32 mode, a total of 8 M20K blocks is needed to implement each RAM. This yields 80% utilization of the available block memory bits, as 100% utilization requires the 512x40 operating mode. Nonetheless, the compiler is able to optimize the M20K packing and allocate 8 M20Ks for some of the RAMs but fewer for the others, boosting block memory bit utilization to roughly 99%.
However, if I add a pipeline stage for the interface signals between the Load/Store unit and the RAMs (more specifically, between the Load/Store unit and the demultiplexer interconnect), the compiler uses 8 M20Ks for all RAMs, dropping block memory bit utilization back down to 80%. My assumption is that the Fitter does this in order to improve timing. I tried to force a synthesis setting for the maximum number of M20Ks to the count used before adding the pipeline stage, but it gets ignored by the Fitter. Do you know of a way to control this packing and guide the compiler to always try to maximize M20K block memory bit utilization? Saving this significant number of M20Ks will greatly help me fit my final design.

Double buffering with Intel HLS

Hello, I am trying to implement double buffering using Intel HLS, in order to overlap communication with computation. I know that it can be done with Vivado HLS (with the dataflow pragma), but I have not yet figured out a way to do it with Intel HLS. Basically, I have a for loop that looks like this:

```cpp
for (batch = 0; batch < BATCHES; batch++) {
    load_inputs(input, BATCH_SIZE, batch);
    compute(input, output);
    write_outputs(output, batch);
}
```

The input data are fetched from DRAM via a master interface and stored in the input array. The compute() unit does computations on this data and stores results in the output array. The results are written back to DRAM via a master interface. The input and output arrays are implemented as hls_memory variables internal to the component. I am using Intel HLS version 19.3 and the hls_max_concurrency pragma to create 2 private copies of these variables. However, the compiler does not automatically implement the double buffering and I do not know if there is a way to make it do so (similarly to the dataflow pragma of Vivado HLS). Alternatively, I am thinking of defining the load_inputs(), compute() and write_outputs() functions as tasks in a system of tasks, to have better control over launching them.
However, this will introduce a lot of hassle, as I will no longer be able to keep the input and output arrays as local hls_memory variables (tasks cannot take them as input arguments) and I will have to define them as external memories in Platform Designer, manually applying the optimizations that are easily controlled by HLS. The tasks will then access these memories via master interfaces. Is there a way to do double buffering in HLS that avoids all this hassle? If not, does my alternative methodology sound viable? Any suggestions would be very helpful. Thank you in advance.

Re: How to define separate bankbits on each memory replicate?

Hello, thanks for your reply. I am using HLS 19.3 Pro Edition. Best, Dimitris

How to define separate bankbits on each memory replicate?

I am using the Intel HLS tools to design an FPGA accelerator. In my application, I have an input array of 512 elements to my component. In each iteration of a loop, there are 3 concurrent loads from that array, without any stores. I am unrolling the loop by 8, so now there are 24 concurrent loads. By default, the compiler chooses to replicate the array in memory 12 times (2 ports per replicate; we need 24 ports in total). However, based on the access patterns, I have found that this can be optimized with only 3 replicates, each with different bankbits: replicate 1 of the array must have bankbits(0,1,2), replicate 2 must have bankbits(3,4,5) and replicate 3 must have bankbits(6,7,8). Stall-free banking cannot be implemented without replicating the memory in this case. I have gone through the documentation of the HLS tools but did not find anything helpful on how to implement this. What I basically want is to take an input array A and replicate it 3 times into arrays A1, A2 and A3 in local memory, with separate bankbits defined for each. Does anyone have any ideas on this matter? Thank you in advance.
Re: Intel HLS QRD Decomposition Tutorial example not running. https://www.intel.com/content/www/us/en/programmable/support/training/course/ohls7.html

Hello, I have a quick follow-up. I tried changing the floating-point precision from double to single on the project that I am working on, and the simulator is working now. However, it gets very slow as the design grows large, for example when increasing the unrolling factor of my main loop, but I guess this is expected, as the amount of hardware that gets simulated grows rapidly in complexity. I do not know yet whether single precision will be satisfactory for the purposes of my design, so I wanted to ask if anyone else sees the same behavior with double precision vs. single precision. Also, is there any way of running the simulation on more than one core? I will also retry running Part 7 of the tutorial with single precision and see if it works. I will update you once I have the results. Thanks

Re: Intel HLS QRD Decomposition Tutorial example not running. https://www.intel.com/content/www/us/en/programmable/support/training/course/ohls7.html

Hello, thanks for your response. 1) The ModelSim I have is FPGA Edition, version 2019.2. 2) The GCC version is 8.3.0. Using the --simulator none flag, the compilation works correctly and the report.html is successfully generated. However, no executable is generated, because the testbench is omitted. The compilation also works for me even when I specify ModelSim as the simulator, and a report.html file is also generated, without information about the verification statistics (latency etc.). What does not work is the execution of the testbench that is generated by this compilation:

i++ MGS.cpp QRD_Testbench.cpp TestbenchHelpers.cpp -v -ffp-contract=fast -ffp-reassoc -march=Stratix10 -o test-fpga

To execute the testbench I then run:

./test-fpga

However, the execution generates no output (aside from printing the first input matrix) and never terminates.
The MGS.cpp, QRD_Testbench.cpp and TestbenchHelpers.cpp files are in Part 7 of the tutorial (QRD decomposition).

Re: Intel HLS QRD Decomposition Tutorial example not running. https://www.intel.com/content/www/us/en/programmable/support/training/course/ohls7.html

Hello, thank you for your reply. 1) The OS is Debian GNU/Linux 9 (stretch) 64-bit, and the kernel version is 4.9.0-12-amd64. The system has an Intel Core i7 920 CPU and 24 GB of RAM. 2) The HLS compiler is version 19.3 Pro Edition. 3) The compilation produces no error messages. It completes successfully and outputs a report.html file with all the information about resource utilization, initiation intervals etc. However, it has no information about latency or memory arbitration, because the simulation has not yet run. To compile the simulation I used the Makefile provided. For compilation I used:

i++ -march=Arria10 --fpc --fp-relaxed -o test-fpga

(with the source files included as well; I also tried Stratix10, which is my target platform. The compiler also output a warning about the --fpc and --fp-relaxed flags, instructing me to replace them with -ffp-contract=fast and -ffp-reassoc, which I did. Nothing changed when I altered or completely removed the flags.) Regarding the ModelSim simulator, it is properly set up (I already used it in the previous examples of the tutorial, which all worked perfectly). I am not using the -ghdl flag in the case of the QRD, so no waveform files are generated. However, when I run the simulation (both for the QRD example in the tutorial and for the project I am working on now) there is no progress. No error messages are printed. In the case of the QRD, the only output is the first input matrix, which is printed at the beginning of the testbench. No other output is printed, even after letting it run for hours. This indicates that the simulator is stuck somewhere.
My question regarding the waveforms was whether there is some way of seeing the logging signals on the run (while the simulation is running), rather than having to wait until it is finished, in order to see if and where it gets stuck.

Intel HLS QRD Decomposition Tutorial example not running. https://www.intel.com/content/www/us/en/programmable/support/training/course/ohls7.html

I completed the tutorial found here: https://www.intel.com/content/www/us/en/programmable/support/training/course/ohls7.html Everything worked fine, except for Part 7. Compilation (make test-fpga) works for me; however, when I run the simulation (./test-fpga) it never finishes (I had it running for hours with no results), even though the tutorial guide says it is only supposed to take a few minutes. I have the same problem in an application I am developing. It seems that the simulator might be getting stuck somewhere when the workload is somewhat larger. Has anyone else encountered this problem? I also wanted to ask whether it is possible to view the simulation display signals in ModelSim during execution to see if it gets stuck somewhere, because I don't see a .wlf file being created during runtime and I don't know of any other way to see the state of the simulation. Thank you in advance.