Forum Discussion

Altera_Forum
7 years ago

Fixed point optimization

Hello,

I have written two kernels to compare fixed-point and floating-point operations.

a)

__kernel
__attribute__((task))
void test_multiplier(global char *restrict in, global char *restrict weights, global int *restrict out) {
    int output = 0;
    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }
    *out = output;
}

b)

__kernel
__attribute__((task))
void test_multiplier(global float *restrict in, global float *restrict weights, global float *restrict out) {
    int output = 0;
    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }
    *out = output;
}

Both kernels use the same number of DSPs, i.e., 100 (the unroll factor). I was expecting 25 DSPs in the 8-bit (char argument) case. Does the aoc compiler optimize well for fixed-point quantization?

3 Replies

  • Altera_Forum

    Quartus/AOC v16.1.2 and below do not seem to be able to infer 8-bit and 16-bit operations correctly. Your first code example only uses 50 DSPs in 17.0.2 and above. However, it is probably best to define "out" and "output" as short rather than int.

  • Altera_Forum

    I used aoc 17.1.2. The initial report after static analysis predicted 50 DSPs. After synthesis, the Quartus compilation report shows the following:

    Kernel 1 - 8-bit (char) resource usage according to Quartus

    Total registers: 68810
    Total pins: 173 / 960 ( 18 % )
    Total virtual pins: 0
    Total block memory bits: 1,983,656 / 55,562,240 ( 4 % )
    Total DSP Blocks: 100 / 1,518 ( 7 % )
    Total HSSI RX channels: 8 / 72 ( 11 % )
    Total HSSI TX channels: 8 / 72 ( 11 % )
    Total PLLs: 78 / 144 ( 54 % )

    Kernel 2 - 32-bit (float) resource usage according to Quartus

    Logic utilization (in ALMs): 128,593 / 427,200 ( 30 % )
    Total registers: 157318
    Total pins: 173 / 960 ( 18 % )
    Total virtual pins: 0
    Total block memory bits: 10,365,736 / 55,562,240 ( 19 % )
    Total DSP Blocks: 100 / 1,518 ( 7 % )
    Total HSSI RX channels: 8 / 72 ( 11 % )
    Total HSSI TX channels: 8 / 72 ( 11 % )
    Total PLLs: 78 / 144 ( 54 % )

    Why does the resource usage increase from static analysis to synthesis? Are there any directives to restrict the number of DSPs?
  • Altera_Forum

    I see; I remember someone else reporting a similar situation before. This is indeed strange. Try using short or char for "output" and "out" and see what happens. I would expect that using int for these variables might "promote" all the multiplications to int, since the output is int. Furthermore, you can take a look at "Intel FPGA SDK for OpenCL Best Practices Guide, Section 3.3.1 Floating-Point versus Fixed-Point Representations" and follow the guidelines to mask out bits to see if you can get the desired results. If none of these help, I recommend opening a ticket with Altera directly.
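The two suggestions in the replies (a narrower accumulator, and masking out unused bits) might look roughly like the sketch below. It is written as plain C so the arithmetic can be checked on a host; inside a kernel the same loop bodies would sit under `__kernel __attribute__((task))` with `global` pointers and `#pragma unroll`. The function names `dot_i8_acc16` and `dot_u8_masked` are hypothetical, the masked variant assumes unsigned 8-bit data, and nothing in the thread confirms that either form actually lowers the DSP count after synthesis.

```c
#include <stddef.h>

/* Variant 1: 8-bit inputs with a 16-bit accumulator.
 * C and OpenCL C promote char operands to int before multiplying,
 * so the casts spell out the intended 16-bit width. */
short dot_i8_acc16(const signed char *in, const signed char *weights, size_t n)
{
    short acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* An 8-bit x 8-bit signed product always fits in 16 bits. */
        short prod = (short)(in[i] * weights[i]);
        /* The running sum can exceed 16 bits for long vectors; the
         * narrowing cast is then implementation-defined (it wraps on
         * common targets), so this only suits small value ranges. */
        acc = (short)(acc + prod);
    }
    return acc;
}

/* Variant 2: static bit masks on unsigned data.
 * The masks state that only the low 8 bits of each operand and the
 * low 16 bits of each product carry information. */
unsigned int dot_u8_masked(const unsigned char *in, const unsigned char *weights, size_t n)
{
    unsigned int acc = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned int a = in[i] & 0xFFu;
        unsigned int b = weights[i] & 0xFFu;
        acc += (a * b) & 0xFFFFu;
    }
    return acc;
}
```

Comparing the DSP count in the Quartus report for each variant against the original int-accumulator kernel would show whether the narrower types change what gets synthesized.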