First, make sure that the memory area you are using for SHM isn't also used by some other part of the system.
I'd then use a sequence something like:
1) CPUB: Write C0 to SHM(0)
2) CPUB: Wait for SHM(1) to be C1
3) CPUA: Wait for SHM(0) to be C0
4) CPUA: Acquire mutex
5) CPUA: Write C1 to SHM(1)
6) CPUB: Acquire mutex (should spin)
7) CPUA: Loop for a moderate fraction of a second
8) CPUA: Write C2 to SHM(0)
9) CPUA: Release mutex - CPUB should now run
10) CPUB: Check SHM(1) is C1 and SHM(0) is C2
11) CPUB: Release mutex
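As a rough illustration of that sequence, here is a host-side sketch that can be run on a PC before trying the real hardware. It is only a simulation under stated assumptions: two POSIX threads stand in for CPUA and CPUB, a pthread mutex stands in for the hardware mutex, and a two-word atomic array stands in for SHM. The names run_handshake, cpu_a, cpu_b and the values chosen for C0/C1/C2 are made up for the example.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Arbitrary marker values (assumed; any three distinct non-zero words work). */
#define C0 0xC0u
#define C1 0xC1u
#define C2 0xC2u

static _Atomic uint32_t shm[2];                       /* stand-in for SHM(0), SHM(1) */
static pthread_mutex_t hw_mutex = PTHREAD_MUTEX_INITIALIZER; /* stand-in for the hardware mutex */
static int b_saw_expected;                            /* result of step 10 */

static void *cpu_b(void *arg)
{
    (void)arg;
    atomic_store(&shm[0], C0);                        /* 1) write C0 to SHM(0) */
    while (atomic_load(&shm[1]) != C1) { }            /* 2) wait for SHM(1) == C1 */
    pthread_mutex_lock(&hw_mutex);                    /* 6) should block until step 9 */
    b_saw_expected = atomic_load(&shm[1]) == C1 &&
                     atomic_load(&shm[0]) == C2;      /* 10) check both words */
    pthread_mutex_unlock(&hw_mutex);                  /* 11) release */
    return NULL;
}

static void *cpu_a(void *arg)
{
    struct timespec delay = { 0, 200 * 1000 * 1000 }; /* 0.2 s */
    (void)arg;
    while (atomic_load(&shm[0]) != C0) { }            /* 3) wait for SHM(0) == C0 */
    pthread_mutex_lock(&hw_mutex);                    /* 4) acquire */
    atomic_store(&shm[1], C1);                        /* 5) write C1 to SHM(1) */
    nanosleep(&delay, NULL);                          /* 7) hold the mutex for a while */
    atomic_store(&shm[0], C2);                        /* 8) write C2 to SHM(0) */
    pthread_mutex_unlock(&hw_mutex);                  /* 9) release - B should run */
    return NULL;
}

/* Runs the whole sequence once; returns 1 if B saw C1 and C2 after getting the mutex. */
int run_handshake(void)
{
    pthread_t a, b;
    atomic_store(&shm[0], 0);
    atomic_store(&shm[1], 0);
    b_saw_expected = 0;
    pthread_create(&b, NULL, cpu_b, NULL);
    pthread_create(&a, NULL, cpu_a, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return b_saw_expected;
}
```

If the mutex did not actually exclude CPUB, it would get past step 6 during the delay and still find C0 in SHM(0), so the check in step 10 would fail. On the real system the spin loops, the mutex calls and the SHM accesses would have to be replaced with whatever the hardware needs, including any cache bypass.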
I've not used the Altera mutex functions, and I wouldn't consider using any of the IOWR_32DIRECT() family of functions in any code I wrote - they are far, far, far, far too error prone.
You really need to use C pointers and structures (but not bit fields or enums) to map hardware registers. You might need access functions (to do cache bypass, or because the 'pointer' actually references a different memory space), but you need the compiler to verify that the register offsets you use belong to the device the pointer refers to.
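A minimal sketch of what I mean, for an entirely hypothetical device with three 32-bit registers (the struct demo_regs layout, the register names and the demo_* functions are invented for the example, not taken from any Altera header). The structure fixes the offsets once, the static assertions check them against the offsets the hardware documentation would give, and the access functions only accept a pointer of the right device type:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block: plain fixed-width words, no bit fields or enums. */
struct demo_regs {
    volatile uint32_t ctrl;    /* offset 0x0 */
    volatile uint32_t status;  /* offset 0x4 */
    volatile uint32_t data;    /* offset 0x8 */
};

/* Compile-time check that the C layout matches the (assumed) hardware offsets. */
_Static_assert(offsetof(struct demo_regs, ctrl)   == 0x0, "ctrl offset");
_Static_assert(offsetof(struct demo_regs, status) == 0x4, "status offset");
_Static_assert(offsetof(struct demo_regs, data)   == 0x8, "data offset");
_Static_assert(sizeof(struct demo_regs) == 0xC, "register block size");

/* Access functions: a real port could do its cache bypass or address-space
 * translation here; this sketch just does a volatile access. */
static inline void demo_write_data(struct demo_regs *dev, uint32_t value)
{
    dev->data = value;
}

static inline uint32_t demo_read_data(const struct demo_regs *dev)
{
    return dev->data;
}

static inline uint32_t demo_read_status(const struct demo_regs *dev)
{
    return dev->status;
}
```

Passing a pointer to some other device's register structure to demo_write_data() is a compile-time diagnostic, whereas a base address and offset handed to a generic macro are just two integers that the compiler cannot check against each other.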