If you already have the 32-bit value (or, in general, data whose width matches the port width), a single write is the efficient way.
But if you must build the 32-bit word by hand from single bytes (as in my example), that generally costs CPU work, unless your bytes are already packed in memory in the correct order.
Your idea is correct, but you must take care to declare the 16 bytes of data so that in memory they are equivalent to four 32-bit words, which also means the buffer has to start on a 32-bit boundary. Then a pointer trick gives you the words without any CPU effort.
For example:
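alt_u8 array8b[16] __attribute__((aligned(4))); /* GCC attribute: start on a 32-bit boundary */
alt_u32 *array32b;
array32b = (alt_u32 *)array8b; /* view the same 16 bytes as 4 words */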
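Now, on a little-endian CPU such as Nios II, any reference to array32b[n] is equivalent to
(array8b[n*4] | (array8b[n*4+1]<<8) | (array8b[n*4+2]<<16) | ((alt_u32)array8b[n*4+3]<<24))
(the cast on the last term keeps the top byte from being shifted into the sign bit of an int).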
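
If you want to check the equivalence yourself, here is a minimal sketch. It is a variant of the pointer trick, not the exact code above: it assumes plain <stdint.h> types instead of alt_u8/alt_u32 so it compiles on any host, and it uses a union instead of a pointer cast, because the compiler then takes care of the alignment and the access is well defined in C. The names buf16_t, pack_le, is_little_endian and overlay_matches_manual_pack are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical overlay type: the same 16 bytes seen as bytes or as words.
   A union is aligned for its widest member, so w[] is always safe to read. */
typedef union {
    uint8_t  b[16];
    uint32_t w[4];
} buf16_t;

/* Build word n by hand from four bytes, least significant byte first. */
static uint32_t pack_le(const uint8_t *bytes, unsigned n)
{
    assert(n < 4);
    return  (uint32_t)bytes[n * 4]
         | ((uint32_t)bytes[n * 4 + 1] << 8)
         | ((uint32_t)bytes[n * 4 + 2] << 16)
         | ((uint32_t)bytes[n * 4 + 3] << 24);
}

/* Returns 1 on a little-endian CPU, 0 otherwise. */
static int is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}

/* Returns 1 if reading w[n] gives the same value as packing by hand. */
static int overlay_matches_manual_pack(void)
{
    buf16_t buf;
    unsigned i;

    for (i = 0; i < 16; i++)
        buf.b[i] = (uint8_t)(0x80 + i); /* high bit set to exercise the casts */

    for (i = 0; i < 4; i++)
        if (buf.w[i] != pack_le(buf.b, i))
            return 0;
    return 1;
}
```

On a little-endian target overlay_matches_manual_pack() should return 1. On a big-endian one the word view has its bytes reversed, so it returns 0, and in that case you are back to packing by hand.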