No, no DLL at all. Just a single PLL with four outputs.
What speeds? If I were to guess, you plan on capturing your data input with eight phases of the clock and having some logic to determine the right phase? The delay of a NOT gate is negligible. It's not done through a LUT but through a dedicated inversion going into the LAB, so it's probably less than 20ps. There are all sorts of other things that are going to hurt you more.
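To put the 20ps figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the four PLL outputs sit at 0/45/90/135 degrees and that the other four phases come from inverting them, and it uses a made-up 250 MHz clock since the actual speed hasn't been stated. All of the names are just for illustration.

```python
# Assumption: four PLL outputs at these phases, each also used inverted.
PLL_PHASES_DEG = [0, 45, 90, 135]
CLOCK_MHZ = 250.0          # hypothetical clock rate, not from the thread
INVERTER_DELAY_PS = 20.0   # upper estimate quoted above

def eight_phases(pll_phases_deg):
    """Inverting a clock shifts it by 180 degrees, doubling the phase count."""
    inverted = [(p + 180) % 360 for p in pll_phases_deg]
    return sorted(pll_phases_deg + inverted)

def phase_step_ps(clock_mhz, n_phases):
    """Time between adjacent phases, in picoseconds."""
    period_ps = 1e6 / clock_mhz
    return period_ps / n_phases

phases = eight_phases(PLL_PHASES_DEG)
step = phase_step_ps(CLOCK_MHZ, len(phases))
inverter_fraction = INVERTER_DELAY_PS / step

print(phases)
print(f"{step:.0f} ps per phase step, inverter is {inverter_fraction:.1%} of a step")
```

Under those assumptions the inverter costs a few percent of one phase step. The faster the clock, the bigger that fraction gets, which is why the actual speed matters.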
First, the PLL has to drive global clock trees. So if you drive four global clock trees, they will naturally have a decent amount of on-die variation between them, just because they are so large. Next, there are only three clock lines per LAB, so at most you'll get three of the eight clocks into any one LAB. So you'll really be capturing in three different LABs. There will be sizable delays in your datapath to each LAB, much larger than the NOT gate. Finally, if you can get it all correct and have minimal variation, you'll need to lock down the placement and routing. All very difficult.
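The LAB count is just a ceiling division. A quick sketch, taking the three-clocks-per-LAB limit above at face value and counting each of the eight phases as its own clock (the grouping below is arbitrary, not what the fitter would actually choose):

```python
import math

CLOCKS_PER_LAB = 3   # limit quoted above
N_PHASES = 8         # eight capture phases, indexed 0..7

def labs_needed(n_phases, clocks_per_lab):
    """Minimum number of LABs so that every phase has a clock line."""
    return math.ceil(n_phases / clocks_per_lab)

def assign_phases(n_phases, clocks_per_lab):
    """Split phase indices into per-LAB groups of at most clocks_per_lab."""
    phases = list(range(n_phases))
    return [phases[i:i + clocks_per_lab]
            for i in range(0, n_phases, clocks_per_lab)]

n_labs = labs_needed(N_PHASES, CLOCKS_PER_LAB)
groups = assign_phases(N_PHASES, CLOCKS_PER_LAB)

print(n_labs, groups)
```

Each group is a separate LAB, so the same data input has to be routed to every one of them, and those routes won't be matched unless you constrain them.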
Is your IO standard LVDS? Could you use the dedicated altlvds silicon, which can sample it at a very high rate, and then do what you want in logic once it's parallelized?