The "compute units" of the OpenCL standard and the "compute units" of Altera's compiler share a name but are two completely different things. I haven't tried this personally, but I believe that querying the number of compute units on an FPGA (clGetDeviceInfo with CL_DEVICE_MAX_COMPUTE_UNITS) will always return 1, since an FPGA has no fixed, pre-defined compute units, unlike a standard GPU or CPU. If the query is made after the FPGA has been programmed with the kernel, the function might report the actual number, though it probably won't.
Other than the limited FPGA area, there is no restriction on using multiple compute units by adding __attribute__((num_compute_units(N))) to the kernel. Still, getting a performance improvement from this attribute requires at least two conditions to be met:
1- Altera recommends having at least 3 times as many work groups as compute units, to be able to fully utilize the circuit. Fewer work groups will probably not result in much of a performance improvement.
2- Since multiple compute units result in more memory ports and significant memory contention between the units, the total memory bandwidth requirement of all units should stay below the off-chip memory bandwidth to achieve a speed-up. For example, if a single compute unit is already memory-bound on the FPGA, adding more units will not improve performance and will in fact decrease it.
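
The two conditions above can be turned into a rough back-of-the-envelope check before spending hours on a recompile. The following is only a sketch under stated assumptions: the function names, the kernel name in the comment, and the idea of feeding in a per-unit bandwidth estimate are all hypothetical and not part of Altera's tools. It does not model FPGA area, and it treats contention as a simple sum of bandwidth demands, which is optimistic.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Kernel side, shown only for context (OpenCL C, compiled offline);
 * the kernel name and body are placeholders:
 *
 *   __attribute__((num_compute_units(4)))
 *   __kernel void scale(__global float *restrict data) { ... }
 */

/* Condition 1: at least 3 work groups per compute unit.
 * Assumes global_size is a multiple of local_size. */
int enough_work_groups(size_t global_size, size_t local_size, unsigned num_cu)
{
    assert(local_size > 0 && num_cu > 0);
    size_t work_groups = global_size / local_size;
    return work_groups >= 3u * (size_t)num_cu;
}

/* Condition 2: summed bandwidth demand of all units stays below the
 * off-chip bandwidth. per_cu_gbps is an estimate supplied by the caller. */
int bandwidth_fits(double per_cu_gbps, unsigned num_cu, double offchip_gbps)
{
    assert(per_cu_gbps >= 0.0 && offchip_gbps > 0.0);
    return per_cu_gbps * (double)num_cu < offchip_gbps;
}

/* Largest unit count in [2, max_cu] meeting both conditions,
 * or 1 if replication does not look worthwhile. */
unsigned suggest_compute_units(size_t global_size, size_t local_size,
                               double per_cu_gbps, double offchip_gbps,
                               unsigned max_cu)
{
    unsigned best = 1;
    for (unsigned n = 2; n <= max_cu; ++n) {
        if (enough_work_groups(global_size, local_size, n) &&
            bandwidth_fits(per_cu_gbps, n, offchip_gbps))
            best = n;
    }
    return best;
}
```

With made-up numbers, 1024 work-items in groups of 64 (16 work groups), an estimated 5 GB/s per unit and 25.6 GB/s of off-chip bandwidth, suggest_compute_units(1024, 64, 5.0, 25.6, 8) gives 5. If a single unit already needs 30 GB/s, it gives 1, which matches the memory-bound case described above.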