Short vs Int vs Floating Point usage in Kernels

Question

I'm just wondering if anyone else has successfully used smaller fixed point representations for data in their kernels compared to floating point. By that I mean I'm playing around with changing my algorithm from floating point to fixed point integer. This obviously lowered the resource usage as it is a simpler operation for the FPGA, however the issue I'm finding is when I attempt to move smaller than 32-bit integers. I actually only need about 8 bits of resolution for my data so I was hoping to save available logic blocks by changing my kernel data from using 'int' to 'short' and finally 'char', however I'm finding the exact opposite is happening. Whenever I move to a smaller data type, it takes up more resources. I wanted to isolate the issue to make sure there was no overhead from my algorithm so I came up with a few tests and got some weird results. I created a simple kernel to do matrix addition and/or matrix dot product:

# define N_VALUES 100

# define TYPE float

typedef TYPE DATATYPE;

__kernel __attribute__((reqd_work_group_size(N_VALUES,1,1)))

__attribute__((num_compute_units(50)))

__attribute__((num_simd_work_items(1)))

void DotProductKernel(__global DATATYPE * restrict a, __global DATATYPE * restrict b, __global DATATYPE * restrict result) {

int id = get_global_id(0);

// Now let's just do the dot product

(result[id]) = (a[id]) * (b[id]);

}

I then amp'd up the number of compute units so that the hardware usage would be significant enough to see and changed the '#define TYPE float' to be 'int', 'short', and 'char'. Here are the results I found during compilation:

Data Type

Logic Elements

Flip Flops

RAMS

DSPs

Logic Utilization %

Dedicated Logic Register %

Memory Block %

DSP %

Floating Point Matrix Addition

168561

221098

2200

0

88%

39%

101%

0%

Floating Point Matrix Dot Product

155611

206598

2100

50

84%

37%

97%

20%

Integer Matrix Addition

141311

181498

2000

0

79%

34%

93%

0%

Integer Matrix Dot Product

140661

192398

2000

100

80%

35%

93%

39%

Short Matrix Addition

176911

208448

2000

0

89%

37%

93%

0%

Short Matrix Dot Product

176511

209248

2000

50

89%

37%

93%

20%

Char Matrix Addition

252311

265398

2000

0

109%

43%

93%

0%

Char Matrix Dot Product

252111

265798

2000

50

109%

43%

93%

20%

Is there some overhead being introduced by the compiler that is using default 32-bit integers and then masking away the excess bits? Or can anyone explain why the resource usage is so much higher for the smaller bit width operations?

altera_forum · Answer

Is the resource usage gotten after the "place and route"? or just the estimation?

the kernel only has one arithmetic operation, whose resource usage is much smaller than that of the kernel overhead. Try more arithmetic operations in the kernel. BTW, more CUs cannot help.

altera_forum · Answer

Thanks for your reply wzk6_3_8.

--- Quote Start ---

Is the resource usage gotten after the "place and route"? or just the estimation?

--- Quote End ---

This is just the estimation. I'm in the process of running the full compilations to compare, however in my experience kernels with estimated usages of over 100% typically fail hardware compilation.

--- Quote Start ---

the kernel only has one arithmetic operation, whose resource usage is much smaller than that of the kernel overhead. Try more arithmetic operations in the kernel. BTW, more CUs cannot help.

--- Quote End ---

But if everything else remains constant (only the type of the data changes), why would the overhead of the kernel change?

altera_forum · Answer

--- Quote Start ---

Thanks for your reply wzk6_3_8.

This is just the estimation. I'm in the process of running the full compilations to compare, however in my experience kernels with estimated usages of over 100% typically fail hardware compilation.

But if everything else remains constant (only the type of the data changes), why would the overhead of the kernel change?

--- Quote End ---

The width of the arithmetic operations impacts the DSP usage as seen in your experiments.

The difference in ALM usage (i.e. logic) is not in the kernel datapath, it is in the load/store units that access the memory. The alignment of loads/stores impacts the ALM usage. With char* pointers, each load/store access is only 1-byte aligned and this does not allow much optimization. With short* pointers, each address is 2-byte aligned (i.e. the least significant address bit is zero) and this allows Quartus to perform some optimizations. The difference for each load/store unit is a few hundred ALMs (depends on the alignment). With 3 load/store * 50 copies, this overhead becomes big, considering there is nothing else in the kernel.

altera_forum · Answer

--- Quote Start ---

The width of the arithmetic operations impacts the DSP usage as seen in your experiments.

The difference in ALM usage (i.e. logic) is not in the kernel datapath, it is in the load/store units that access the memory. The alignment of loads/stores impacts the ALM usage. With char* pointers, each load/store access is only 1-byte aligned and this does not allow much optimization. With short* pointers, each address is 2-byte aligned (i.e. the least significant address bit is zero) and this allows Quartus to perform some optimizations. The difference for each load/store unit is a few hundred ALMs (depends on the alignment). With 3 load/store * 50 copies, this overhead becomes big, considering there is nothing else in the kernel.

--- Quote End ---

That would also explain why the same algorithm with the boolean operator exploded in size. Would it be better to optimize fixed point kernels by loading/storing them as 32-bit integers (as 4 chars packed together) and then separating them only for the internal arithmetic of the kernel to keep the alignment at 4 bytes? Or would a char4 vector data type accomplish the same task?

altera_forum · Answer

In the optimization guide there is a section on fixed point operation, page 14, which suggests statically masking your 32-bit integers to the desired precision. If I understood this right, the AOC would be able to disregard the extra bits during hardware generation thus will reduce the amount of logic (minimal in this case) but still a reduction. You may want to look into that and see if it helps you out any. I was curious so I ran the example listed in the guide for 17-bit precision. I had an increase in logic for the fixed point version over the straight 32-bit version and I don't think this should be the case. I wouldn't think the load/store units that access the memory wouldn't be an issue as Outku suggested in the case of the original poster. Any insight would be appreciated.

Thanks,

Rudy

Forum Discussion

Short vs Int vs Floating Point usage in Kernels

10 Replies

Recent Discussions

New Quartus Prime License - Missing Device and Features

IP Base Suite (NCO, FFT & FIR) not included in ASAP licenses

Connection bit order between hierarchy

A5EG013BB18A OPN is visible in Quartus but not listed in Program File Generator

USB Blaster II Problem