Kazu,
With no-MMU, Futex is of very limited interest, as the no-MMU toolchain uses gcc3 while the MMU toolchain uses gcc4. Gcc4 can't compile for no-MMU hardware, and gcc3 can't do TLS (thread-local storage: "__thread" variables in C). Gcc3 is thus quite unusable for a decent threaded application, and it does not support NPTL (Native POSIX Thread Library).
As Linux allows userland code to create as many Futexes as it desires, the Futex variables need to be located in normal RAM, and thus the standard Futex user library and kernel code need to be used. The only part that has to be architecture-specific is the userland atomic instructions.
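To illustrate the split, here is a rough sketch (not the real library code) of a lock built on a Futex word that is just an ordinary int in RAM. The names arch_cmpxchg, sketch_lock and sketch_unlock are made up for this mail. arch_cmpxchg stands for the architecture-specific piece; I back it with a GCC builtin here only as a stand-in, which a port without atomic instructions would have to replace. Everything else is the generic futex system call.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical arch hook: the only architecture-specific piece.
 * Stand-in implementation using a GCC builtin (assumption, see above).
 * Returns the value that was in *addr before the operation. */
static int arch_cmpxchg(int *addr, int expected, int desired)
{
    __atomic_compare_exchange_n(addr, &expected, desired, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}

/* Generic exchange built from the single arch hook. */
static int atomic_xchg(int *addr, int newval)
{
    int old;
    do {
        old = *(volatile int *)addr;
    } while (arch_cmpxchg(addr, old, newval) != old);
    return old;
}

/* Sleep in the kernel while *addr still equals val. */
static long futex_wait(int *addr, int val)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

/* Wake up to n threads sleeping on addr. */
static long futex_wake(int *addr, int n)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
}

/* Lock word states: 0 = free, 1 = locked, 2 = locked, maybe waiters. */
void sketch_lock(int *word)
{
    int c = arch_cmpxchg(word, 0, 1);
    if (c == 0)
        return;                     /* uncontended: no system call */
    if (c != 2)
        c = atomic_xchg(word, 2);   /* announce that we will wait */
    while (c != 0) {
        futex_wait(word, 2);
        c = atomic_xchg(word, 2);
    }
}

void sketch_unlock(int *word)
{
    int old = atomic_xchg(word, 0);
    assert(old != 0);               /* unlocking a free lock is a bug */
    if (old == 2)
        futex_wake(word, 1);        /* someone may be sleeping */
}
```

Note that the uncontended path never enters the kernel, which is why the userland atomic operation is the part that really matters.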
Here Hippo suggested following the approach of the Blackfin people, who face the same problem: defining an "atomic region" that holds the appropriate functions and gets special handling by the kernel's interrupt code.
This method is well proven and works without any hardware support.
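To make the idea concrete, here is a small sketch of what the special handling could look like. It assumes a "restart" policy: if an interrupt hits in the middle of the region, the saved program counter is rewound to the start of the sequence so that it runs again from scratch. That policy, and the names atomic_region, user_regs and fixup_atomic_region, are my assumptions for illustration and not necessarily what the Blackfin code does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of one atomic sequence inside the region. */
struct atomic_region {
    uintptr_t start;   /* first instruction of the sequence */
    uintptr_t commit;  /* the final store that publishes the result */
};

/* Minimal stand-in for the saved user registers. */
struct user_regs {
    uintptr_t pc;      /* next instruction to execute on return */
};

/* Would be called on the interrupt return path to user mode.
 * Returns 1 if the program counter was rewound, 0 otherwise. */
int fixup_atomic_region(struct user_regs *regs,
                        const struct atomic_region *r)
{
    assert(r->start <= r->commit);

    /* pc == start: nothing has run yet, nothing to undo.
     * pc >  commit: the store already happened, sequence is done.
     * In between: the sequence was cut short, so restart it. */
    if (regs->pc > r->start && regs->pc <= r->commit) {
        regs->pc = r->start;
        return 1;
    }
    return 0;
}
```

For the restart to be safe, the sequence must not change any memory before the commit store, so that running it a second time is harmless.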
Some hardware support (disabling interrupts for a limited time that is automatically cut off by the hardware) might be constructed at a later time if desired, and optionally be enabled when configuring the system. But this should not be discussed right now.
-Michael