I am not sure about Simulink-generated code. I use DSPBuilder, which is based on Simulink blocks but uses blocks specific to the target devices (Altera or Xilinx). DSPBuilder works, but as far as I know plain Simulink is not what industry uses to generate code, since tools such as DSPBuilder exist.
Now the first question is why the order of the blocks is not right.

The second issue is the NCO, which was the original title of the thread. If you suspect the NCO, you can rule it out by rescaling your FIR coefficients, or by using random data or even a single impulse (1,0,0,0...) at the required level. Pass it through the model, then through the hardware, and look at the difference. At this stage you are only checking the model against the hardware, so it does not matter whether the model is functionally correct; you only want to confirm that the implementation matches the model. Once that is done, you can start to question whether the model itself is correct.

You say the simulation works, but how can it if the blocks are in the wrong order? I think you need to explain your setup and results further.
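To make the impulse test concrete, here is a minimal Python sketch of the comparison I have in mind. Everything in it is an assumption for illustration: the coefficient values are made up, `fir_model` is a plain integer FIR standing in for your model, and `hw_out` is a stand-in for samples you would capture from the hardware (here it is just the model output delayed by two cycles to mimic pipeline latency).

```python
def fir_model(coeffs, samples):
    """Integer FIR reference: y[n] = sum over k of coeffs[k] * x[n-k]."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


def find_latency(model_out, hw_out, max_latency):
    """Return the smallest delay d for which hw_out[d:] matches model_out, else None."""
    for d in range(max_latency + 1):
        shifted = hw_out[d:]
        n = min(len(model_out), len(shifted))
        if n > 0 and model_out[:n] == shifted[:n]:
            return d
    return None


# Hypothetical integer coefficients, not taken from the original design.
coeffs = [3, -5, 12, -5, 3]

# Single impulse at the required level (amplitude 1 here), followed by zeros.
impulse = [1] + [0] * 15
model_out = fir_model(coeffs, impulse)

# Stand-in for captured hardware samples: the same response, two cycles late.
hw_out = [0, 0] + model_out

latency = find_latency(model_out, hw_out, max_latency=8)
if latency is None:
    print("Hardware does not match the model at any tested latency")
else:
    print("Hardware matches the model with a latency of", latency, "cycles")
```

With an impulse input the model output is simply the coefficient set, so any scaling or ordering problem in the hardware path shows up directly in the captured samples. If the outputs match here, the FIR implementation agrees with the model and you can move on to the NCO.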