Even very small blocks have inputs and outputs, and if you compile or simulate them for a real FPGA, the clock has to cover the distance from input to output, hence the skew.
For speed-estimation purposes, may I suggest double-registering both the inputs and the outputs. Quartus can then place the inner registers close to the logic under test and achieve the best timing. However, if you are using TimeQuest and specify a realistic timing requirement, as you did, the fitter will not try its hardest. You will have to push it by specifying a more stringent requirement that it cannot easily meet.
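If you over-constrain the clock this way, the design will fail timing on purpose, so you need to convert the reported worst-case setup slack back into an achievable frequency. A minimal sketch of that arithmetic follows, assuming a single clock domain and register-to-register paths only. The function name `fmax_mhz` and the numbers are made up for illustration; they are not taken from any real report.

```python
def fmax_mhz(constraint_period_ns: float, worst_setup_slack_ns: float) -> float:
    """Estimate the achievable clock frequency in MHz.

    constraint_period_ns: the (deliberately tight) clock period you constrained.
    worst_setup_slack_ns: worst-case setup slack from the timing report;
                          negative when the constraint was not met.
    """
    # The critical path needs the constrained period minus the slack.
    # Negative slack therefore lengthens the achievable period.
    achievable_period_ns = constraint_period_ns - worst_setup_slack_ns
    if achievable_period_ns <= 0:
        raise ValueError("slack must be smaller than the constrained period")
    # 1 / ns = GHz, so multiply by 1000 to get MHz.
    return 1000.0 / achievable_period_ns


if __name__ == "__main__":
    # Hypothetical example: constrained to 4 ns (250 MHz), missed by 1 ns.
    print(f"{fmax_mhz(4.0, -1.0):.1f} MHz")  # prints "200.0 MHz"
```

The same formula works with a relaxed constraint and positive slack, but the result is then only a lower bound, since the fitter may not have worked as hard as it could.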