From Custom Instruction to Custom Peripheral

Question

I'm trying to simulate how a camera works by generating lines of data and then applying a spatial 3x3 convolution filter to the data. I have already done this one way, using a Custom Instruction: I assume a 2-D image array is already available in NIOS, and I use two for-loops to send 8 pixels x 8 bits at a time to be processed in a combinatorial hardware module. The module was basically this code (with the inputs combined into 2x32-bit inputs): http://edge.kitiyo.com/2009/codes/sobel-core-verilog-module.html

Now, instead of doing two for-loops with a hardware instruction for each pixel, I want to be able to send an entire row to emulate a camera, and then wait for 3 entire rows before doing the processing in hardware. I'm trying to do something like Figure 2 in this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.8743&rep=rep1&type=pdf

So I think my NIOS C code will look something like this.

#define NUM_ROWS 4
#define NUM_COLS 8
alt_u8 array_image[NUM_ROWS][NUM_COLS] = { /* fill in values */ };
 
alt_u8*  pointer8bit  = 0;
alt_u32* pointer32bit = 0;
int row, n;
int num_32bit_values = NUM_COLS / 4;   // Number of 32-bit values in one row.
 
for (row = 0; row < NUM_ROWS; row++)
{
   pointer8bit  = &(array_image[row][0]);    // Start of the current row
   pointer32bit = (alt_u32*) pointer8bit;    // Convert pointer to interpret memory in chunks of 32 bits
   for (n = 0; n < num_32bit_values; n++)
   {
       HARDWARE_CUSTOM_INSTRUCTION(pointer32bit[n]);  // Moves through the row in 32-bit steps
   }
}
 
But unlike the Custom Instruction I did before, I need HARDWARE_CUSTOM_INSTRUCTION to give a return value only when the transfer of a row has finished. The hardware module should also 'store' two rows of data at any time while it waits for the third one to come, and then do the processing.

For example, with my 4x8 array above, six 32-bit data transfers (2 x 32-bit transfers per row) have to occur before I can do the Sobel operation in hardware. My questions are:

- How do I store the 6 words in hardware and use part of each in the calculation?
- How do I modify the original Sobel code to achieve this? I need something like a for-loop in hardware to go across the three rows.
- After the initial 3 rows have arrived and the calculations are done, I need to drop the 'oldest' row and bring in a new one. Any advice on how to do that?

I don't think I can use Custom Instructions now, right? Do I have to use a Custom Peripheral with Avalon MM, for example? I have tried sketching some Verilog code for this (see below), but I don't know how to do the buffers in hardware. I am not sure how to use the data_en signal either. I only added it because I was trying to make an Avalon MM component and it asked for a write_n. Can somebody please give some guidance on this?

module my_sobel_test_mm (
 // Inputs
 clk,
 reset,
 data_in, //writedata
 data_en, //write_n
 // Outputs
 data_out //readdata
);
 
// Inputs
input clk;
input reset;
input  data_in;
input data_en;
// Outputs
output  data_out;
/*****************************************************************************
 *                 Internal wires and registers Declarations                 *
 *****************************************************************************/
 
 wire  data_in_buffer_1;
 wire  data_in_buffer_2;
 wire  data_in_buffer_3;
 
// Internal Registers
reg  [7:0] line_1 [2:0];
reg  [7:0] line_2 [2:0];
reg  [7:0] line_3 [2:0];
 
//11 bits because max value of gx and gy is 255*4 and last bit for sign      
reg signed [10:0] gx,gy;
//Find the absolute value of gx and gy     
reg signed [10:0] abs_gx,abs_gy;
//Max value is 255*8. here no sign bit needed.  
reg  [10:0] sum;   
reg  [7:0]  result;
 
// Integers
integer    i;
 
/*****************************************************************************
 *                             Sequential logic                              *
 *****************************************************************************/
// Sobel Operator
// 
//      [ -1  0  1 ]           [ -1 -2 -1 ]
// Gx   [ -2  0  2 ]      Gy   [  0  0  0 ]
//      [ -1  0  1 ]           [  1  2  1 ]
//
// |G| = |Gx| + |Gy|
always @(posedge clk)
begin
 if (reset == 1'b1)
 begin
  for (i = 2; i >= 0; i = i-1)
  begin
   line_1[i] <= 8'h00;
   line_2[i] <= 8'h00;
   line_3[i] <= 8'h00;
  end
  gx <= 11'h000;
  gy <= 11'h000;
  abs_gx <= 11'h000;
  abs_gy <= 11'h000;
 
  result    <= 8'h00;
 end
 else if (data_en == 1'b1)
 begin 
 
 ////// Dont know how to do this section //////////////
 line_1[0] <= data_in_buffer_1; line_1[1] <= data_in_buffer_1; line_1[2] <= data_in_buffer_1;
 line_2[0] <= data_in_buffer_2; line_2[1] <= data_in_buffer_2; line_2[2] <= data_in_buffer_2;
 line_3[0] <= data_in_buffer_3; line_3[1] <= data_in_buffer_3; line_3[2] <= data_in_buffer_3;
        //////////////////////////////////////////////////////
 
 //sobel mask for gradient in horizontal direction 
 gx <=((line_1[2]-line_1[0])+((line_2[2]-line_2[0])<<1)+(line_3[2]-line_3[0]));
 //sobel mask for gradient in vertical direction 
 gy <=((line_3[0]-line_1[0])+((line_3[1]-line_1[1])<<1)+(line_3[2]-line_1[2])); 
 // Absolute value of gx (bit 10 is the sign bit)
 abs_gx <= (gx[10]? ~gx+1 : gx);
 // Absolute value of gy  
 abs_gy <= (gy[10]? ~gy+1 : gy); 
 // Sum 
 sum <= (abs_gx+abs_gy);  
 // Max value 255   
 result <= (|sum[10:8])?8'hff : sum[7:0]; 
 end
end
/*****************************************************************************
 *                            Combinational logic                            *
 *****************************************************************************/
assign data_out = result; 
endmodule

altera_forum · Answer

Some people use multiple calls to the custom instruction to do what you are trying to do. Before going down that road: will you be doing this same calculation over an entire frame of data? If so, I recommend implementing this as a hardware accelerator that masters the memory itself, since that should be much more efficient (and the CPU can be doing something else in the meantime). You can also build your hardware to perform just the transform and use DMAs to shove data in and out of it.

Here are some examples of what I'm talking about:

http://www.altera.com/support/examples/nios2/exm-accelerated-fir.html

http://www.altera.com/support/examples/nios2/exm-checksum-acc.html

http://www.altera.com/support/examples/nios2/exm-crc-acceleration.html

altera_forum · Answer

Thanks for the recommendation. Yes, I will eventually have to do the same calculation over the entire frame. But for now I want to work on 3 lines of data at a time, because I want to emulate how a line-scan camera works. I am currently doing multiple calls (one per loop iteration in software) on data generated from NIOS, but I want to do this by looping within the hardware itself. Or perhaps I didn't understand what you meant by multiple calls. Could you please explain?

When this is done, I will move to whole-frame processing with DMA. Actually, I had seen the Accelerating FIR with DMA example before, and my plan was to substitute my own block for transform_block.v. But because I didn't know exactly how to write this block, due to my inexperience with hardware programming, I got stuck! At first I tried a ready-made hardware block from the University Program (this is when I used your SGDMA suggestion), but I could not get it working. I have been able to do a DMA memory-to-memory transfer without any processing block in between, and verified the data at the tx and rx buffers etc. But what I need is for the data to get transformed in between....

Then I thought I might as well learn from scratch and build up my knowledge slowly. So, using that basic Sobel example, I tried PIOs first and then a Custom Instruction. This was already a significant speed-up, but still not enough to justify using an FPGA over a normal computer. Until I am able to write a Sobel transform block that works with DMA (unlikely to happen soon), I have to stick to the custom instruction, and I am now at a worrying impasse ... Any help on how to write this Sobel transform block for DMA usage is greatly appreciated.

altera_forum · Answer

If your hardware takes longer to process the data than Nios II can input the data then perhaps what you could do is put a FIFO in your custom instruction so that your code just keeps calling the custom instruction shoving operands into it without reading the results. Then you can start reading the results back out. At this point though I would have switched it over to be a memory mapped component.
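
The FIFO idea above can be sketched as a plain C software model. This is only an illustration under stated assumptions: `fifo_t`, `accel_write`, `accel_read`, the depth of 8 and the stand-in `transform` function are all hypothetical names and values, not part of any Nios II or Avalon API.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_DEPTH 8u  /* assumed depth; a power of two so indices wrap with a mask */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    unsigned wr, rd;   /* free-running write and read counters */
} fifo_t;

static void fifo_init(fifo_t *f) { f->wr = 0; f->rd = 0; }
static int fifo_empty(const fifo_t *f) { return f->wr == f->rd; }
static int fifo_full(const fifo_t *f)  { return f->wr - f->rd == FIFO_DEPTH; }

/* Returns 1 on success, 0 if the FIFO is full. */
static int fifo_push(fifo_t *f, uint32_t v)
{
    assert(f != 0);
    if (fifo_full(f)) return 0;
    f->data[f->wr++ & (FIFO_DEPTH - 1u)] = v;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty. */
static int fifo_pop(fifo_t *f, uint32_t *v)
{
    assert(f != 0 && v != 0);
    if (fifo_empty(f)) return 0;
    *v = f->data[f->rd++ & (FIFO_DEPTH - 1u)];
    return 1;
}

/* Hypothetical stand-in for the hardware transform; here it inverts the word. */
static uint32_t transform(uint32_t w) { return ~w; }

/* "Write" call: hand over an operand, the result is queued instead of returned. */
static int accel_write(fifo_t *results, uint32_t operand)
{
    return fifo_push(results, transform(operand));
}

/* "Read" call: fetch the oldest queued result, if any. */
static int accel_read(fifo_t *results, uint32_t *out)
{
    return fifo_pop(results, out);
}
```

The software side would call `accel_write` back to back for a whole row and only afterwards drain the results with `accel_read`; a failed write (FIFO full) is the point where the caller has to start reading.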

One suggestion for the HDL at the top: break out your registers into separate always blocks. Anything that doesn't need to be a register I recommend coding as a wire using an assign statement. The way your HDL is currently coded, it looks like it will take 5 clock cycles to complete one result. With some buffering you can get it to perform a calculation every clock cycle (keep stuffing the results into the FIFO to be read later). Also, assuming your logic is functionally correct, you are practically 90% of the way to having a streaming component.
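
As a reference for the buffering, here is a minimal C model of a three-row line buffer that drops the oldest row when a new one arrives and applies the same |Gx| + |Gy| arithmetic as the linked Sobel core. It is a software sketch for checking hardware output, not an HDL implementation; `line_buffer_t`, `lb_push_row`, `lb_sobel_row` and the row width `COLS` are assumed names and values chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define COLS 8  /* assumed row width, matching the 4x8 example in the question */

typedef struct {
    uint8_t row[3][COLS];
    int head;    /* index of the oldest stored row */
    int filled;  /* rows received so far, saturating at 3 */
} line_buffer_t;

static void lb_init(line_buffer_t *lb) { memset(lb, 0, sizeof *lb); }

/* Store a new row, overwriting the oldest once three rows are held.
 * Returns 1 when a full 3-row window is available. */
static int lb_push_row(line_buffer_t *lb, const uint8_t *pixels)
{
    assert(lb != NULL && pixels != NULL);
    if (lb->filled < 3) {
        memcpy(lb->row[lb->filled], pixels, COLS);
        lb->filled++;
    } else {
        memcpy(lb->row[lb->head], pixels, COLS);
        lb->head = (lb->head + 1) % 3;
    }
    return lb->filled == 3;
}

/* Sobel magnitude at centre column c (1 <= c <= COLS-2), saturated to 255. */
static uint8_t sobel_pixel(const uint8_t *top, const uint8_t *mid,
                           const uint8_t *bot, int c)
{
    int gx = (top[c + 1] - top[c - 1]) + 2 * (mid[c + 1] - mid[c - 1])
           + (bot[c + 1] - bot[c - 1]);
    int gy = (top[c - 1] - bot[c - 1]) + 2 * (top[c] - bot[c])
           + (top[c + 1] - bot[c + 1]);
    int sum = abs(gx) + abs(gy);
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Filter the interior columns of the current window into out[COLS - 2]. */
static void lb_sobel_row(const line_buffer_t *lb, uint8_t *out)
{
    assert(lb != NULL && out != NULL && lb->filled == 3);
    const uint8_t *top = lb->row[lb->head];
    const uint8_t *mid = lb->row[(lb->head + 1) % 3];
    const uint8_t *bot = lb->row[(lb->head + 2) % 3];
    for (int c = 1; c < COLS - 1; c++)
        out[c - 1] = sobel_pixel(top, mid, bot, c);
}
```

The rotating `head` index plays the role of "drop the oldest row": no data is copied between rows, only the meaning of top, middle and bottom changes. In hardware the same effect usually comes from chained row-length shift registers or small RAMs.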

altera_forum · Answer

Thanks again, BadOmen, for the input. While searching a bit on FIFOs and shift registers, I came across the altshift_taps component, and after reading its documentation I thought it might be useful in this case. But before I go down that road, I would like someone to advise me whether this component is worth using for my purpose and, if so, whether the following reasoning is correct.

a) Assume my image to be processed consists of 4 rows x 12 columns of 8-bit values stored in the SSRAM or SDRAM of my development board. I use altshift_taps with parameters Width = 32 bits, Number of taps = 4 and Tap distance = 3.

b) Then I start 'feeding' the values from SDRAM into the altshift_taps component in chunks of 32 bits.

c) Once all 48 8-bit (or 12 32-bit) values are inside the shift register, at each of the next 3 clock cycles the output taps will each give 32 bits (i.e. representing 4 8-bit values of each row), or 128 bits in total.

d) These 128 bits are in fact a '2-D' array of 4x4 8-bit values, to which I can apply 4 Sobel filter instantiations to give 4 output 8-bit values. These values are then sent back to memory as a 32-bit chunk.

Is my reasoning above any good? Will I still be able to use the DMA transfer method in this way? Will I be able to extend this 4 rows to full frame later?

Also concerning the hardware logic design itself, how do I know when all the values are inside the altshift components and hence taps ready to be used?

Sorry for the basic questions again... but I don't have any experts around me to ask, and I need your help :)

PS: I don't have the 'Create groups for each tap output' option in my altshift_taps MegaWizard. Why is this?
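
On the question of knowing when the taps are ready: one common approach is to count the words shifted in. The C model below sketches that for the parameters in the post (4 taps, tap distance 3). It is a simplified behavioural model and an assumption about how a tapped shift register behaves, not the exact timing of the altshift_taps megafunction; `tap_shift_t`, `ts_shift` and `ts_tap` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_TAPS     4   /* one tap per image row, as in the post */
#define TAP_DISTANCE 3   /* 32-bit words per row: 12 pixels / 4 */
#define DEPTH        (NUM_TAPS * TAP_DISTANCE)

typedef struct {
    uint32_t word[DEPTH];  /* word[0] is the most recently shifted-in value */
    unsigned count;        /* words shifted in so far, saturating at DEPTH */
} tap_shift_t;

static void ts_init(tap_shift_t *ts) { memset(ts, 0, sizeof *ts); }

/* Shift one word in. Returns 1 once every tap holds real image data,
 * which is the "taps valid" condition a hardware counter would flag. */
static int ts_shift(tap_shift_t *ts, uint32_t in)
{
    assert(ts != NULL);
    memmove(&ts->word[1], &ts->word[0], (DEPTH - 1) * sizeof ts->word[0]);
    ts->word[0] = in;
    if (ts->count < DEPTH)
        ts->count++;
    return ts->count == DEPTH;
}

/* Tap k sits (k + 1) * TAP_DISTANCE stages from the input,
 * so tap 0 sees the newest row and tap NUM_TAPS-1 the oldest. */
static uint32_t ts_tap(const tap_shift_t *ts, unsigned k)
{
    assert(ts != NULL && k < NUM_TAPS);
    return ts->word[(k + 1) * TAP_DISTANCE - 1];
}
```

If the words are numbered 0 to 11 in the order they are fed in, then right after the twelfth shift the four taps hold words 9, 6, 3 and 0, that is, the first 32-bit chunk of each row. Each further shift moves the window one chunk along the rows.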

altera_forum · Answer

Hi,

Here is my situation,

I have an electronic board with an EP2C20F484I8N and an EPCS4N. Programming completes properly, but when I test the output pins, all of them are tristated. When I plug the USB Blaster into its connector, the board works correctly and I get good signals.

The problem occurs when the USB Blaster is not connected to the board.

Could you help me?

Thanks in advance
