Thanks HRZ for the nice and fast response. And sorry for this longer-than-expected reply.
OK, I get it. I thought that by the end of the compilation & synthesis flow, report.html would have been updated with the final results; it seems that's not the case.
I don't know what you folks think, but I can envision (meaning: I would like to have) two different reports here (the estimated one for early feedback, and the post-place-and-route one), so we could check both in the HTML GUI (and ideally integrate the profiler information as well). But that is another story.
I have some other questions on these area results:
(1) I guess the information found in the acl_quartus_report.txt file covers "just" the kernel logic (including the associated work-item ID dispatching, etc.) but excludes the Board Interface and Global Interconnect contributions, right? I am asking because I'd like to know the specific resource consumption of my kernel "logic", i.e., the part implementing the actual functionality. It would be nice to get results for the actual datapath pipeline separated from everything else (LSUs and the like, i.e., all those other elements within the kernel that exist to provide system-level integration). The best way to describe what I mean: imagine a standard VHDL entity of a given circuit that could be plugged into the kernel to do the computation and then be integrated somehow at the system level (this can actually be done through the OpenCL library mechanism in AOC). Those are the post-place-and-route area results I'd like to obtain for just the functional part of my kernel, for comparison with other tools.
(2) I guess that, in the case of more than one kernel (in one or several .cl files) compiled into the same .aocx, the result in this Quartus report file would be the aggregation of all kernels, right? (So far I have only used one kernel.) In that case, the workaround would be to open Quartus and check the resource utilization by entity, but that approach clearly does not scale.
Now, regarding latency, I get it too. What I was looking for was the actual "capability" of the synthesized circuit (pipeline), regardless of external factors such as bus contention, memory bottlenecks and the like: basically, the number of clock cycles needed for an input datum to traverse the whole pipeline until it reaches the output, which is what I would measure in RTL and logic simulation. To check whether I got it right, let me formulate it as follows: from the point of view of an RTL implementation, the minimum (ideal) latency of a datapath (pipeline) processing N inputs, not considering potential stalls caused by external factors (meaning one new datum can enter each clock cycle), can be computed the same way for any pipeline as: clock_cycles_to_fill_pipeline + N clock cycles. So far so good. Now consider the following case in OpenCL (computing a set of dot products):
__kernel void foo (global char * restrict in, global char * restrict out, const uchar m)
{
#define LENGTH 3
    // note: SIZE (number of outputs, SIZE > LENGTH) is assumed to be defined at compile time
    // actual ff values don't matter
    char ff[4][LENGTH] = {
        {  1,  2,  3 },
        {  5,  5,  6 },
        {  7,  8,  9 },
        { 10, 11, 12 }
    };
    char f[LENGTH];
    // fill 'f' according to a kernel parameter, 'm'
    #pragma unroll
    for (int i = 0; i < LENGTH; i++) {
        // 'm' is a kernel input parameter
        f[i] = ff[m][i];
    }
    // hold input samples in a shift-register structure
    char tmp[LENGTH];
    // 'in' is a kernel input parameter (data to be filtered: SIZE > LENGTH)
    #pragma unroll
    for (int j = 0; j < LENGTH; j++) {
        tmp[j] = in[j];
    }
    // filter
    #pragma unroll 1 // prevents unrolling
    for (int j = 0; j < SIZE + 1; j++) {
        // accumulator
        short acc = 0;
        // dot product
        #pragma unroll
        for (int i = 0; i < LENGTH; i++) {
            acc += f[i] * tmp[i];
        }
        // write output: 'out' is a kernel output parameter
        out[j] = acc;
        // update tmp buffer
        // ...shift samples
        #pragma unroll
        for (int i = 0; i < LENGTH - 1; i++) {
            tmp[i] = tmp[i + 1];
        }
        // ...get new sample
        tmp[LENGTH - 1] = in[j + LENGTH];
    }
}
Here things change a bit, the difference coming from how fully unrolled and non-unrolled loops contribute to the latency. When adding up the latencies of the different blocks, blocks corresponding to fully unrolled loops count only once (their latency is added once), whereas loops that are not unrolled contribute their latency multiplied by their trip count (and likewise for partially unrolled loops, which contribute original_trip_count/unroll_factor times their latency). Is this right?
As before, hopefully Intel might do this computation for us at some point, of course clearly stating that it describes the ideal situation without pipeline stalls, which can't be (fully) predicted. However, since the memory access pattern is clear from the kernel code, and assuming this is the only kernel in the system, no other piece of code accesses the main memory feeding the kernel, and you compile/synthesize against a given board, I guess it shouldn't be impossible for the tool to compute the number of clock cycles required to complete the whole computation (given that all constants and loop bounds are known at compile time). How does this sound to you?
Cheers,
Rubén