So many thanks for your pic, PARRADO. It has just given me a completely new idea for my correlator: it will use far fewer FPGA cores, although it will be a bit slower.
I made a calculation: if I want to correlate 1000 samples from the ADC with a 66-point reference signal, it would take (1000+66)*66 = 70356 clock cycles to complete the correlation. If a 40 MHz system clock is used, then the correlation rate can reach up to 40,000,000/70356 ≈ 568 Hz. Is my calculation correct? If it is, then 568 Hz is fast enough.
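
To double-check the arithmetic, here is a small Python sketch of the same estimate. It assumes one multiply-accumulate per clock cycle and (1000+66) lag positions of 66 multiply-accumulates each, which is how I read the formula above; the function name `correlation_rate` and its parameters are just made up for illustration. It ignores any ADC readout or control overhead, so treat the result as an upper bound.

```python
def correlation_rate(n_samples, n_ref, clock_hz):
    """Estimate cycles per correlation and the resulting correlation rate.

    Assumption: a serial design doing one multiply-accumulate per clock,
    with (n_samples + n_ref) lag positions and n_ref MACs per position.
    """
    cycles = (n_samples + n_ref) * n_ref
    rate_hz = clock_hz / cycles
    return cycles, rate_hz


cycles, rate_hz = correlation_rate(n_samples=1000, n_ref=66, clock_hz=40_000_000)
print(cycles, round(rate_hz, 1))  # prints: 70356 568.5
```

Changing `n_ref` or `clock_hz` shows quickly how the rate scales if the reference length or the system clock changes.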
Again, thanks a lot for your help, much appreciated.