OpenCL Host-to-Memory Read Speed varies in Linux kernels

Board: Terasic DE10-Nano, Intel FPGA SDK for OpenCL Standard Edition 18.1

Hi everyone, I am working on a project on the DE10-Nano using OpenCL. I ran Terasic's official board test program on two OSes. One is the official image from Terasic, which gave a host-to-global-memory read speed of 155 MB/s. The other is Linux kernel 4.19 with Ubuntu, which gave a host-to-global-memory read speed of only 60 MB/s. Does anyone know of any kernel configuration options that would affect the host-to-global-memory read speed? The board test reports are attached. Thanks in advance!

Re: FPGA OpenCL caching for CPU access to shared memory through ACP port

Thanks for your reply! However, I am satisfied with the kernel performance right now. What I really want to optimize is the process of transferring data to and from the shared memory, which takes much more time than the kernel execution.

FPGA OpenCL caching for CPU access to shared memory through ACP port

I am working on a project on the Cyclone V SoC that involves transferring a large amount of data from memory to the FPGA. Since DMA needs physically contiguous memory, I first copy the data to the shared memory (allocated by clEnqueueMapBuffer), and then the FPGA consumes it. The problem I am having now is that copying data to the shared memory in user space is very time-consuming. I believe this is because the Intel OpenCL library disables caching for CPU access to the shared memory (as shown in the attached picture). I understand that cache coherence is hard to manage in this case, but it is not impossible to achieve! Since the Intel OpenCL library is not open source, it seems hard for us to make any changes and enable the cache. Can anyone suggest a way around this problem, please?

Re: OpenCL multi-thread error: HAL Kern Error: Read failed from addr 20

Not sure how many active users are still using this forum.
For what it's worth, I want to share my results for solving this problem, in the hope that it helps people who run into the same issue. The problem indeed lies in the MMD library. All current Cyclone V SoC BSPs and RTEs ship MMD version 14.1. The newest MMD is version 18.1, which I found in Intel FPGA SDK for OpenCL Pro 20.4 (it seems this version was only just released). The MMD 18.1 source mentions fixing this 14.1 bug:

// global variables used for handling multi-devices and its helper functions
// Use a DeviceMapManager to manage a heap-allocated map for storing device information
// instead of using a static global map because of a segmentation fault which occurs in
// the following situation:
// 1) Host program contains a global variable which calls clReleaseContext in its destructor.
//    When the program ends the global goes out of scope and the destructor is called.
// 2) clReleaseContext calls a function in the MMD library which modifies the static global map in
//    the MMD library.
// In this situation it was discovered that the destructor of the static global map is called before
// the destructor of the global in the host program, thus resulting in a segmentation fault when
// clReleaseContext calls a function that modifies the internal map after it has been destroyed.
// Using a heap-allocated map avoids this issue as the lifetime of the map persists until it is
// deleted or the process is completely terminated.

So when I removed my global variables, everything worked.

Re: Getting "HAL Kern Error: Read/Write failed from addr x, read y expected z" when running multithread OpenCL application on CycloneV SoC

Hi, I am having a similar problem on the DE10-Nano board. After checking the corresponding source code, I found that the MMD library is version 14.1 and does not have AOCL_MMD_CONCURRENT_READS and AOCL_MMD_CONCURRENT_WRITES defined. Could that be the problem? And how can we update this library? Is there any off-the-shelf code I could take advantage of? Thanks!
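The pattern described in the MMD comment above can be sketched in a few lines. This is a hypothetical reconstruction, not the actual MMD source; all names here are made up for illustration. The map lives on the heap behind an accessor function, so it is never destroyed and remains valid even when a host-program global's destructor calls back into the library at process exit.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of the fix the MMD 18.1 comment describes
// (names are illustrative, not the real MMD internals).
struct DeviceInfo {
    int handle;
    std::string name;
};

static std::map<int, DeviceInfo>& device_map() {
    // Heap-allocated and intentionally never deleted: unlike a static
    // global map object, it cannot be destroyed before the host
    // program's own globals, so there is no use-after-destruction
    // segfault when clReleaseContext runs from a global's destructor.
    static std::map<int, DeviceInfo>* m = new std::map<int, DeviceInfo>();
    return *m;
}

bool register_device(int id, DeviceInfo info) {
    return device_map().emplace(id, info).second;
}

bool release_device(int id) {
    return device_map().erase(id) == 1;
}
```

The trade-off is a deliberate "leak" of one map for the process lifetime, which the OS reclaims at exit; that is exactly what the comment means by "the lifetime of the map persists until ... the process is completely terminated."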
OpenCL multi-thread error: HAL Kern Error: Read failed from addr 20

Platform: Terasic DE10-Nano (Cyclone V SoC)
Software version: OpenCL SDK 18.1, MMD version 14.1

Problem description: I am developing a multi-threaded application (only one thread runs NDRange commands and executes the FPGA kernel) and got these errors:

HAL Kern Error: Read failed from addr 20, read -1237012464 expected 4
HAL Kern Error: Read failed from addr 20, read -1237012464 expected 4
HAL Kern Error: Write failed to addr 1000 with value 0, wrote -1237012464 expected 4
HAL Kern Error: Write failed to addr 20 with value be80a2dc, wrote -1237012464 expected 4

Sometimes the program gets killed and sometimes it does not. I suspected it might be caused by multi-threading, so I tested the official Intel example multithread_vector_operation (attached). A similar error came out, but with read failures only:

HAL Kern Error: Read failed from addr 20, read -1234415600 expected 4

I debugged this with gdb and found that SIG44 was received while two threads were in clWaitForEvents(). I noticed that someone already asked a similar question and mentioned it may be due to the obsolete MMD library, but all csoc5 RTEs ship with MMD 14.1. Can anyone shed some light on this problem, please? Thanks in advance!

Re: Getting a lot of "HAL Kern Error: Read failed from addr x, read y expected z" errors – what to do?
Hi, I am experiencing the same bug on the DE10-Nano as well. Could you let me know whether you have resolved it? Thanks!

Re: OpenCL FPGA: actual results differ from emulation results

I got this problem solved. The access to the local memory B_sub was out of range when loading values into it. It works correctly when I increase the size of B_sub. Strangely, though, the emulator did not report any run-time error.

OpenCL FPGA: actual results differ from emulation results

Platform: DE10-Nano SoC, Intel FPGA SDK for OpenCL 18.1

I am designing a matrix multiplication kernel similar to this one: https://cnugteren.github.io/tutorial/pages/page8.html It uses 3D work-items to multiply many pairs of matrices and output the results. The emulation passes, but the actual on-chip design does not: when running on the FPGA, only the first few values match the correct results. I suspect it has something to do with how the emulator emulates multiple work-items, although I add barriers whenever I load values into local memory. Could anyone provide some insight into the differences between the multi-work-item implementation in emulation and in the actual design?
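The out-of-range local access described in the reply above can be caught on the host before ever running the emulator, by replaying the tile-index arithmetic with an explicit range check. This is a minimal sketch with example tile sizes, assuming MOD2/DIV2 mean id % TSM and id / TSM; all values here are illustrative, not from the actual build:

```cpp
#include <cassert>

// Example tile sizes (assumed, not the real config.h values).
constexpr int TSM = 16, TSN = 8, TSK = 8;

// True iff index i is a valid subscript for a dimension of size dim.
bool in_range(int i, int dim) { return 0 <= i && i < dim; }

// Replays the kernel's tile-load index math for the store
// Bsub[row][col], where Bsub is declared [TSN][TSK + 2].
// row can reach TSM - 1, so the store walks out of range
// whenever row >= TSN -- the bug the emulator never flagged.
bool bsub_store_in_range(int id) {
    int row = id % TSM;   // MOD2(id, TSM)
    int col = id / TSM;   // DIV2(id, TSM)
    return in_range(row, TSN) && in_range(col, TSK + 2);
}
```

On the FPGA such a stray store silently corrupts a neighbouring local buffer, which is consistent with only the first few output values matching; a host-side check like this (or assert calls compiled into the emulator build) makes the corruption visible early.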
#include "config.h"

uint8_t gf_mu_x86(uint8_t a, uint8_t b) {
    uint8_t p = 0; /* the product of the multiplication */
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        if (b & 1)    /* if b is odd, add the corresponding a to p
                         (final product = sum of all a's corresponding to odd b's) */
            p ^= a;   /* since we're in GF(2^m), addition is an XOR */
        if (a & 0x80) /* GF modulo: if a >= 128, it overflows when shifted left, so reduce */
            a = (a << 1) ^ 0x11D; /* XOR with the primitive polynomial
                                     x^8 + x^4 + x^3 + x^2 + 1 (0b1_0001_1101);
                                     you can change it but it must be irreducible */
        else
            a <<= 1;  /* equivalent to a*2 */
        b >>= 1;      /* equivalent to b / 2 */
    }
    return p;
}

int address_interpretor(int x, int y, int offset,
                        __global const uint8_t* restrict sample_idx) {
    // use x to find the index of the required packet (file space) in sample_idx
    uint8_t file_pkt_idx = sample_idx[offset + x];
    // calculate the index of the required data in file space
    return file_pkt_idx * PKT_SIZE + y;
}

// Use 2D register blocking (further increase in work per thread)
__kernel
// __attribute__((num_compute_units(CMP_UNIT)))
// __attribute__((max_work_group_size(256)))
__attribute__((reqd_work_group_size(TSM/WPTM, TSN/WPTN, 1))) // 8, 1, 1
void myGEMM6(__global const uint8_t* restrict A,
             __global const uint8_t* restrict B,
             __global uint8_t* restrict C,
             __global const uint8_t* restrict DEGREE_,
             __global const uint8_t* restrict sample_idx) { // cached
    // Thread identifiers
    const int tidm = get_local_id(0);          // Local row ID (max: TSM/WPTM == RTSM)
    const int tidn = get_local_id(1);          // Local col ID (max: TSN/WPTN == RTSN)
    const int offsetM = TSM * get_group_id(0); // Work-group offset
    const int offsetN = TSN * get_group_id(1); // Work-group offset
    const int batch_id = get_global_id(2);     // max: N_BATCH

    // Local memory to fit a tile of A and B
    __local uint8_t Asub[TSK][TSM];
    __local uint8_t Bsub[TSN][TSK + 2];
    __local uint8_t degrees[MAX_NUM_BATCH];

    // Allocate register space
    uint8_t Areg;
    uint8_t Breg[WPTN];
    uint8_t acc[WPTM][WPTN];
    int deg_offset = 0;
    uint8_t my_deg;

    // Initialise the accumulation registers
    #pragma unroll
    for (int wm = 0; wm < WPTM; wm++) {
        #pragma unroll
        for (int wn = 0; wn < WPTN; wn++) {
            acc[wm][wn] = 0;
        }
    }

    // Load degrees and calculate offsets
    if (tidm == 0 && tidn == 0) {
        #pragma unroll
        for (int i = 0; i < MAX_NUM_BATCH; i++) {
            degrees[i] = DEGREE_[i];
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < batch_id; i++) {
        deg_offset += degrees[i];
    }
    my_deg = degrees[batch_id];

    // Loop over all tiles
    const int numTiles = my_deg / TSK;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int t = 0; t < numTiles; t++) {
        // Load one tile of A and B into local memory
        // #pragma unroll
        for (int la = 0; la < LPTA; la++) {
            int tid = tidn * RTSM + tidm;
            int id = la * RTSN * RTSM + tid;
            int row = MOD2(id, TSM);
            int col = DIV2(id, TSM);
            int tiledIndex = TSK * t + col;
            int A_vec = address_interpretor(tiledIndex, offsetM + row,
                                            deg_offset, sample_idx);
            Asub[col][row] = A[A_vec];
            Bsub[row][col] = B[tiledIndex * BATCH_SIZE + offsetN + row
                               + deg_offset * BATCH_SIZE];
        }

        // Synchronise to make sure the tile is loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Loop over the values of a single tile
        // #pragma unroll
        for (int k = 0; k < TSK; k++) {
            // Cache the values of Bsub in registers
            #pragma unroll
            for (int wn = 0; wn < WPTN; wn++) {
                int col = tidn + wn * RTSN;
                Breg[wn] = Bsub[col][k];
            }
            // Perform the computation
            #pragma unroll
            for (int wm = 0; wm < WPTM; wm++) {
                int row = tidm + wm * RTSM;
                Areg = Asub[k][row];
                #pragma unroll
                for (int wn = 0; wn < WPTN; wn++) {
                    acc[wm][wn] ^= gf_mu_x86(Areg, Breg[wn]);
                }
            }
        }

        // Synchronise before loading the next tile
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Store the final results in C
    // #pragma unroll
    for (int wm = 0; wm < WPTM; wm++) {
        int globalRow = offsetM + tidm + wm * RTSM;
        #pragma unroll
        for (int wn = 0; wn < WPTN; wn++) {
            int globalCol = offsetN + tidn + wn * RTSN;
            C[globalCol * PKT_SIZE + globalRow
              + batch_id * PKT_SIZE * BATCH_SIZE] = acc[wm][wn];
        }
    }
}
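When on-chip results and emulation results disagree, a host-side golden model of the arithmetic helps pin down which side is wrong. The sketch below is a plain C++ copy of the kernel's GF(2^8) multiply (same reduction polynomial 0x11D), rewritten only slightly for the host:

```cpp
#include <cassert>
#include <cstdint>

// Host-side reference copy of the kernel's gf_mu_x86, useful for
// computing a golden result to compare against FPGA output.
uint8_t gf_mul_ref(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) p ^= a;      // GF(2^8) addition is XOR
        bool carry = a & 0x80;  // will the left shift overflow?
        a <<= 1;                // the top bit drops out of the uint8_t
        if (carry) a ^= 0x1D;   // reduce by x^8+x^4+x^3+x^2+1
                                // (0x11D; only the low byte 0x1D remains
                                //  after the shifted-out x^8 term cancels)
        b >>= 1;
    }
    return p;
}
```

For example, 2 * 3 = x * (x + 1) = x^2 + x = 6, and 0x80 * 2 wraps past x^8 and reduces to 0x1D. Running the full C computation with gf_mul_ref on the host and diffing against the device buffer localizes mismatches to specific rows and batches.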