The most common algorithms for computing the Fourier transform:
1) Cooley–Tukey FFT
2) Rader's FFT
3) Bluestein's FFT
4) Computing the DFT directly (slow, O(N^2), since it drops the 'Fast' from FFT)
The Cooley–Tukey algorithm is, I believe, the most widely used for power-of-two sizes due to its efficiency (on the order of O(N log N)). Using a size N that is a power of two enables the reuse of twiddle factors at every stage. The design example uses this algorithm, to my knowledge. Many implementations optimize the FFT by decomposing it into two half-size transforms (radix-2 form), which allows tricks such as the butterfly structure and bit-reversal reordering.
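To make the radix-2 decomposition concrete, here is a minimal recursive sketch (my own illustration, not the design example's implementation): the input is split into even- and odd-indexed halves, and each output pair is combined with a twiddle factor in a butterfly.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    A minimal sketch for illustration, not an optimized implementation."""
    n = len(x)
    if n == 1:
        return list(x)
    # Decimation in time: split into even- and odd-indexed halves.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor: root of unity exp(-2*pi*i*k/n).
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        # Butterfly: combine the two half-size transforms.
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Production implementations typically run this iteratively in place, which is where the bit-reversal reordering of the input comes from.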
Other optimized implementations would be Rader's FFT for prime sizes, or Bluestein's FFT algorithm for arbitrary sizes. Both achieve O(N log N), though with larger constant factors than a power-of-two Cooley–Tukey FFT.
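For a flavor of how arbitrary sizes are handled, here is a sketch of Bluestein's trick (again my own illustration): the length-N DFT is rewritten as a linear convolution with a "chirp" sequence, and that convolution is evaluated with zero-padded power-of-two FFTs.

```python
import cmath

def _fft(x, inverse=False):
    """Recursive radix-2 FFT helper; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1j if inverse else -1j
    even = _fft(x[0::2], inverse)
    odd = _fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2 * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def bluestein_dft(x):
    """DFT of arbitrary length n via Bluestein's chirp-z trick:
    nk = (n^2 + k^2 - (k-n)^2) / 2 turns the DFT into a convolution,
    which is then evaluated with power-of-two FFTs. Sketch only."""
    n = len(x)
    # Chirp exp(-i*pi*k^2/n); k*k % (2*n) keeps the phase argument exact.
    chirp = [cmath.exp(-1j * cmath.pi * (k * k % (2 * n)) / n) for k in range(n)]
    m = 1
    while m < 2 * n - 1:  # pad so the linear convolution is alias-free
        m *= 2
    a = [xi * c for xi, c in zip(x, chirp)] + [0j] * (m - n)
    b = [0j] * m
    for k in range(n):
        b[k] = chirp[k].conjugate()
        b[-k] = chirp[k].conjugate()  # mirror: b[m-k] = b[k]
    fa, fb = _fft(a), _fft(b)
    conv = _fft([p * q for p, q in zip(fa, fb)], inverse=True)
    # Divide by m to normalize the un-scaled inverse FFT.
    return [chirp[k] * conv[k] / m for k in range(n)]
```

This is why Bluestein keeps O(N log N) at any size: all the heavy lifting happens in power-of-two FFTs of roughly twice the original length.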
Zero-padding to the next power of two may be your best bet for optimal FFT performance. The design example supports quite a large FFT size, if I remember correctly, and it could be further tuned toward your target size. The clfft library, I'm assuming, uses some flavor of the FFTs above (but targeted at GPUs, so probably not optimal on an FPGA).
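The padding step itself is trivial; a quick sketch (my own, with a caveat worth noting):

```python
def next_pow2(n):
    """Smallest power of two >= n."""
    m = 1
    while m < n:
        m *= 2
    return m

def zero_pad(x):
    """Pad a signal with zeros up to the next power-of-two length so a
    radix-2 FFT can be used. Caveat: this changes the transform length,
    so the frequency-bin spacing changes too -- fine for convolution or
    filtering workflows, but not a drop-in replacement when you need the
    exact DFT values at the original length."""
    return list(x) + [0.0] * (next_pow2(len(x)) - len(x))
```

If you do need the exact DFT at a non-power-of-two length, that is where Bluestein's algorithm comes in instead.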
The second dimension is added by computing two passes of 1D FFTs: one along the rows and another along the columns. The design example reuses the 1D FFT by computing the FFT along one dimension, performing a non-Hermitian transpose, computing the FFT along the same dimension again, and then performing another non-Hermitian transpose, which gives the same effect.
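The row-transpose-row-transpose scheme can be sketched like this (my own illustration; the naive `_dft_row` stands in for the design example's 1D FFT core):

```python
import cmath

def _dft_row(row):
    """Naive 1D DFT; stands in for a real 1D FFT core."""
    n = len(row)
    return [sum(row[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def transpose(mat):
    """Plain (non-Hermitian) transpose: swap rows and columns, no conjugation."""
    return [list(col) for col in zip(*mat)]

def fft2d_via_transpose(mat):
    """2D DFT built only from row-wise 1D transforms, mirroring a hardware
    pipeline that can stream data in one direction: transform each row,
    transpose, transform each row again (the original columns), transpose back."""
    step1 = [_dft_row(r) for r in mat]    # transform along rows
    step2 = transpose(step1)              # columns become rows
    step3 = [_dft_row(r) for r in step2]  # transform along original columns
    return transpose(step3)               # restore the original orientation
```

The appeal on an FPGA is that only one 1D FFT engine is needed; the transposes are a memory-reordering problem rather than a second compute kernel.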
The FFT IP Core could possibly be packaged as an Altera OpenCL Library, which allows HDL code to be integrated into OpenCL, but I'm not too familiar with that process or how complex it is to implement.