Fixed point optimization

Question

Hello,

I have written two kernels to compare the resource usage of fixed-point and floating-point operations.

a)

__kernel
__attribute__((task))
void test_multiplier(global char *restrict in, global char *restrict weights, global int *restrict out) {
    int output = 0;
    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }
    *out = output;
}

b)

__kernel
__attribute__((task))
void test_multiplier(global float *restrict in, global float *restrict weights, global float *restrict out) {
    int output = 0;
    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }
    *out = output;
}

Both kernels use the same number of DSPs: 100, which equals the unroll factor. I was expecting 25 DSPs in the 8-bit (char argument) case. Does the aoc compiler optimize well for fixed-point quantization?

altera_forum · Answer

Quartus/AOC v16.1.2 and below do not seem to be able to infer 8-bit and 16-bit operations correctly. Your first code example only uses 50 DSPs in 17.0.2 and above. However, it is probably best to define "out" and "output" as short rather than int.
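Before narrowing "output" to short, it may be worth checking the worst-case range of the accumulated sum. The sketch below is plain C rather than kernel code. It assumes VEC_SIZE is 100 (the thread only gives the unroll factor, not VEC_SIZE itself), and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: VEC_SIZE equals the unroll factor; the thread does not state it. */
#define VEC_SIZE 100

/* Largest magnitude of one signed 8-bit by 8-bit product: (-128) * (-128). */
static int32_t max_product_magnitude(void) {
    return (int32_t)INT8_MIN * (int32_t)INT8_MIN;
}

/* Worst-case magnitude of the sum of VEC_SIZE such products. */
static int32_t max_sum_magnitude(void) {
    return (int32_t)VEC_SIZE * max_product_magnitude();
}

/* Bits a signed integer needs to represent values up to +magnitude. */
static int signed_bits_needed(int32_t magnitude) {
    assert(magnitude >= 0);
    int bits = 1; /* sign bit */
    while (magnitude > 0) {
        bits++;
        magnitude >>= 1;
    }
    return bits;
}
```

Under these assumptions a single product fits in 16 bits, but the sum of 100 products needs 22 bits, so a short accumulator could overflow for worst-case inputs. Keeping the products narrow while accumulating in a wider type is one option to try; whether aoc then infers narrower multipliers is not something this sketch can show.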

altera_forum · Answer

I used aoc 17.1.2. The initial report after static analysis predicted 50 DSPs. After synthesis, the Quartus compilation report shows the following:

Kernel 1 - 8-bit (char) resource usage according to Quartus

Total registers 68810

Total pins 173 / 960 ( 18 % )

Total virtual pins 0

Total block memory bits 1,983,656 / 55,562,240 ( 4 % )

Total DSP Blocks 100 / 1,518 ( 7 % )

Total HSSI RX channels 8 / 72 ( 11 % )

Total HSSI TX channels 8 / 72 ( 11 % )

Total PLLs 78 / 144 ( 54 % )

Kernel 2 - 32-bit (float) resource usage according to Quartus

Logic utilization (in ALMs) 128,593 / 427,200 ( 30 % )

Total registers 157318

Total pins 173 / 960 ( 18 % )

Total virtual pins 0

Total block memory bits 10,365,736 / 55,562,240 ( 19 % )

Total DSP Blocks 100 / 1,518 ( 7 % )

Total HSSI RX channels 8 / 72 ( 11 % )

Total HSSI TX channels 8 / 72 ( 11 % )

Total PLLs 78 / 144 ( 54 % )

Why does the DSP usage increase from the static-analysis estimate to the synthesis result? Are there any directives to restrict the number of DSPs?

altera_forum · Answer

I see. I remember someone else reporting a similar situation before. This is indeed strange. Try using short or char for "output" and "out" and see what happens. I suspect that using int for these variables "promotes" all the multiplications to int, since the output is int. You can also take a look at the "Intel FPGA SDK for OpenCL Best Practices Guide", Section 3.3.1, "Floating-Point versus Fixed-Point Representations", and follow its guidelines on masking out bits to see if you can get the desired result. If none of this helps, I recommend opening a ticket with Altera directly.
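As a rough illustration of the bit-masking idea, here is a minimal sketch in plain C (not kernel code). It assumes unsigned 8-bit data held in 32-bit words; the signed char inputs from the question would need extra care with the sign bit. The mask and function names are hypothetical, and whether aoc actually maps such masked code onto fewer DSPs is not verified here.

```c
#include <assert.h>
#include <stdint.h>

#define OPERAND_MASK 0xFFu   /* keep only the low 8 bits of each operand */
#define PRODUCT_MASK 0xFFFFu /* an unsigned 8-bit by 8-bit product fits in 16 bits */

/* Multiply two values after masking them down to 8 bits, so the upper
 * bits of both operands and of the product are provably zero. */
static uint32_t masked_multiply(uint32_t a, uint32_t b) {
    return PRODUCT_MASK & ((OPERAND_MASK & a) * (OPERAND_MASK & b));
}

/* Dot product built from masked multiplies; the accumulator stays 32-bit. */
static uint32_t masked_dot(const uint32_t *in, const uint32_t *weights, int n) {
    assert(n >= 0);
    uint32_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum += masked_multiply(in[i], weights[i]);
    }
    return sum;
}
```

The intent is that constant masks make the unused upper bits visible to the compiler, which can then prune the corresponding hardware. The accumulator is left unmasked here because the sum can grow well beyond 16 bits.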
