The post-map netlist of your "design" shows that the complete GF_mult can be implemented in two Stratix LUTs (one for each bit). It also fits into Cyclone 4-input LUTs. I don't see a reasonable purpose for preventing this optimization in a real design.
If you nevertheless want to defeat the FPGA's ability to implement complex logic expressions in a single LUT: keeping the intermediate nodes as logic cells won't work inside a function, I fear, because a function is a higher-level behavioural description that abstracts away from logic cells. It should be possible by using a component instead, though.
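
To illustrate why the whole multiplier collapses into one LUT per output bit, here is a small Python sketch (not HDL, and not taken from your design). It assumes GF_mult is a GF(2^2) multiplier with two 2-bit operands, reduced by x^2 + x + 1, which is what "two LUTs, one for each bit" with 4-input LUTs suggests. The names gf4_mult and lut_masks and the input ordering of the LUT index are my own choices for illustration.

```python
def gf4_mult(a: int, b: int) -> int:
    """Multiply two GF(2^2) elements given as 2-bit integers.

    Assumption: the field is built with the polynomial x^2 + x + 1.
    """
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    # (a1*x + a0) * (b1*x + b0), with x^2 replaced by x + 1
    c0 = (a0 & b0) ^ (a1 & b1)
    c1 = (a1 & b0) ^ (a0 & b1) ^ (a1 & b1)
    return (c1 << 1) | c0


def lut_masks() -> list[int]:
    """Return one 16-bit truth table per output bit.

    Each output bit depends only on the four input bits a1, a0, b1, b0,
    so its truth table has 16 entries and fits a single 4-input LUT.
    The index ordering (a in the upper two bits, b in the lower two)
    is an arbitrary choice for this sketch.
    """
    masks = [0, 0]
    for a in range(4):
        for b in range(4):
            index = (a << 2) | b
            product = gf4_mult(a, b)
            for bit in range(2):
                if (product >> bit) & 1:
                    masks[bit] |= 1 << index
    return masks


if __name__ == "__main__":
    for bit, mask in enumerate(lut_masks()):
        print(f"output bit {bit}: LUT mask 0x{mask:04X}")
```

Whatever intermediate AND/XOR terms the source code names, each output is still just a function of four inputs, so the mapper is free to absorb them all into a single LUT unless the nodes are explicitly preserved.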