The only difference (besides the signal names) I can make out in your code is that you replace [clk'event and clk = '1'] (which is what the Altera documentation proposes) with [rising_edge(clk)]. I tried that too, but it gives the same result.
As a next step, I placed EXACTLY the mentioned Altera example in my test design to see how the tool deals with it. I found out that if the same input signal source feeds more than one DSP element, the input register is shared and placed outside the DSP block. When I make sure each MAC gets its own individual signals, a_reg and b_reg are placed inside the DSP block as expected (for both my code and the Altera example).
However, even when using exactly the Altera example, the adder is still placed outside the DSP block in a nearby LAB cell (even for a completely isolated MAC that has all its I/Os connected directly to FPGA I/Os). I have no problem closing timing so far, and as it is a test design on the StarterKit, the FPGA is far from congested.
Why worry further? Because I get an uneasy feeling if I can't get the tool to put a simple MAC entirely into the DSP block, which was designed for exactly that.
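For reference, here is a minimal sketch of the kind of MAC structure I am talking about. It is not the Altera example itself: the entity name, the port names other than clk, the sload control and the bit widths are all assumptions made up for illustration; only a_reg and b_reg are the register names mentioned above. Whether the fitter packs the adder into the DSP block for code like this is exactly the open question.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical stand-alone MAC; names and widths are illustrative assumptions.
entity mac_sketch is
  port (
    clk   : in  std_logic;
    sload : in  std_logic;              -- '1' restarts the accumulation
    a     : in  signed(15 downto 0);    -- each MAC instance gets its own a/b
    b     : in  signed(15 downto 0);
    accum : out signed(39 downto 0)
  );
end entity mac_sketch;

architecture rtl of mac_sketch is
  signal a_reg   : signed(15 downto 0) := (others => '0');
  signal b_reg   : signed(15 downto 0) := (others => '0');
  signal acc_reg : signed(39 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Input registers: the ones expected to end up inside the DSP block.
      a_reg <= a;
      b_reg <= b;

      -- Multiply the registered inputs, then either load or accumulate.
      if sload = '1' then
        acc_reg <= resize(a_reg * b_reg, acc_reg'length);
      else
        acc_reg <= acc_reg + resize(a_reg * b_reg, acc_reg'length);
      end if;
    end if;
  end process;

  accum <= acc_reg;
end architecture rtl;
```

To reproduce the register sharing I described, instantiate this twice and drive both instances from the same a and b source; to avoid it, give every instance its own input signals.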