Altera spent money developing Stratix III with a new "fracturable LUT" that can implement any function of 6 inputs. If they do not push that, then they wasted their money.
Long comb paths are generally due to long strings of if/else in the HDL, whereas the LUT is a simple 2-port memory that has the same access time no matter how complex the function. Edge-triggered regs are used because synthesis can only handle them.
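To illustrate the LUT point, here is a rough Python sketch (a software model only, not HDL and not Altera's actual ALM). It tabulates an if/else chain and a plain 6-input AND into the same 64-entry table, so reading either one is a single indexed access. The names build_lut, lut_read, priority_chain and and_all are made up for this example.

```python
from itertools import product

def build_lut(func, n_inputs=6):
    """Tabulate func over every input combination (2**n_inputs entries)."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n_inputs)]

def lut_read(table, *bits):
    """Read the table: the input bits form the address, first bit is the MSB."""
    address = 0
    for b in bits:
        address = (address << 1) | b
    return table[address]

def priority_chain(a, b, c, d, e, f):
    # Hypothetical if/else chain of the kind that becomes a long comb path
    # when it is written out as cascaded logic.
    if a:
        return b
    elif c:
        return d
    elif e:
        return f
    else:
        return 0

def and_all(a, b, c, d, e, f):
    # A trivially simple function for comparison.
    return a & b & c & d & e & f

chain_lut = build_lut(priority_chain)
and_lut = build_lut(and_all)

# Both tables have the same size, and both are read with one lookup.
print(len(chain_lut), len(and_lut))
print(lut_read(chain_lut, 0, 0, 1, 1, 0, 0), lut_read(and_lut, 1, 1, 1, 1, 1, 1))
```

The model ignores routing and timing entirely. It only shows that once a function of 6 inputs is in the table, its complexity no longer affects how it is read.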
So less function between regs means a larger share of each clock period is wasted on clock skew and setup/hold times. Those who think pipelining performance is simply a matter of clock speed are sadly mistaken. If the total clocks to do a function times the clock period is not less than before a stage was added, then there is no gain, and more power is used.
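A back-of-envelope Python sketch of that arithmetic, under assumptions of my own: the logic splits evenly across stages, and every stage pays a fixed overhead for skew plus setup/hold. The 10000 ps logic delay and 1000 ps overhead are invented numbers, not measurements from any device.

```python
def clock_period_ps(logic_delay_ps, stages, overhead_ps):
    """Clock period when the logic is split evenly across the stages."""
    per_stage_logic = -(-logic_delay_ps // stages)  # ceiling division
    return per_stage_logic + overhead_ps

def total_time_ps(logic_delay_ps, stages, overhead_ps):
    """Total clocks to do the function times the clock period."""
    return stages * clock_period_ps(logic_delay_ps, stages, overhead_ps)

LOGIC_PS = 10000     # assumed total combinational delay
OVERHEAD_PS = 1000   # assumed skew + setup/hold cost per register stage

for stages in (1, 2, 4):
    period = clock_period_ps(LOGIC_PS, stages, OVERHEAD_PS)
    total = total_time_ps(LOGIC_PS, stages, OVERHEAD_PS)
    print(f"stages={stages} period={period} ps total={total} ps")
```

In this simple model the clock period drops with every added stage, but the clocks-times-period total for one pass through the function goes up, because the fixed overhead is paid once per stage.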
You are so hung up on the asynchronous notion that I cannot believe it. Long before TTL and edge-triggered flip-flops, there were multiple clock pulses per machine cycle, and regs were simply latches. The embedded memory blocks are somehow not truly edge triggered, yet they synthesize, so I am using the memories with a multi-clock cycle to get what is essentially a latched data flow. That is why you don't see dedicated regs. Use whatever is available from the technology.