First, DIRECT has nothing to do with the cache: IOWR and the IOWR_*DIRECT macros all bypass it. The main difference between IOWR and IOWR_32DIRECT is that IOWR_32DIRECT takes a byte offset that is added directly to the base address, whereas IOWR takes a register number that is multiplied by 4 before being added to the base address. The macros are described in this document (http://www.altera.com/literature/hb/nios2/n2sw_nii52007.pdf).
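To make the address arithmetic concrete, here is a small sketch. The two helper functions are hypothetical and only mirror the calculation described above. They are not the real macros from io.h and they do not touch any hardware. The alignment check is my own assumption about 32-bit accesses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers that only mirror the address arithmetic described
   above; they are not the real io.h macros and perform no bus access. */

/* IOWR-style: the second argument is a register number, scaled by 4
   (one 32-bit word per register) before it is added to the base. */
static uint32_t iowr_style_address(uint32_t base, uint32_t regnum)
{
    return base + regnum * 4u;
}

/* IOWR_32DIRECT-style: the second argument is a byte offset, added as is. */
static uint32_t iowr_32direct_style_address(uint32_t base, uint32_t offset)
{
    assert(offset % 4u == 0u); /* assumption: 32-bit accesses are word aligned */
    return base + offset;
}
```

With an illustrative base of 0x10000, register number 0x3fe corresponds to byte offset 0xff8, so IOWR with 0x3fe and IOWR_32DIRECT with 0xff8 would target the same word.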
Second, strings in C don't work the way that you expect in those examples. It isn't specific to the Nios platform, it is just how strings work in C.
When you have this line:
IOWR(ONCHIP_MEMORY2_1_BASE, 0x3fe, "THIS IS TO TEST HOW MUCH DATA CAN A MEMORY SPACE CONTAIN");
The C compiler will reserve 57 bytes (56 characters plus the terminating null byte) in a data section and put the string there. With GCC, string literals usually go to .rodata rather than .data. If you use the default settings for the linker script, that section will end up in main RAM with your software code.
The IOWR macro is then called with the *address* of the string, not the string itself. It is that address that is written to the on-chip memory, not the string.
When you do
alt_printf("%s\n", IORD(ONCHIP_MEMORY2_1_BASE, 0x3fe));
The *address* of the string is read back from the on-chip memory and passed to alt_printf(), which then reads the string from the data section in main RAM and prints it.
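The round trip can be reproduced on a PC with a rough sketch like the one below. Everything in it is a stand-in: fake_onchip_slot plays the role of one word of on-chip memory, and the two functions are hypothetical substitutes for the IOWR and IORD calls. It uses uintptr_t so that it also works on a 64-bit host, whereas on Nios II the pointer is 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one slot of the on-chip memory (hypothetical, host only). */
static uintptr_t fake_onchip_slot;

/* Mimics IOWR(..., "text"): only the pointer value is stored,
   none of the characters are copied. */
static void store_string_argument(const char *s)
{
    assert(s != NULL);
    fake_onchip_slot = (uintptr_t)s;
}

/* Mimics IORD(...): the stored value is read back and used as a pointer
   to the original string, which never moved. */
static const char *load_string_argument(void)
{
    return (const char *)fake_onchip_slot;
}
```

The pointer returned by load_string_argument() is the same pointer that was passed in, which is why printing it with %s appears to "work" even though the slot holds only an address.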
The string itself is never copied to the on-chip RAM, and the address always takes 4 bytes on the Nios II platform, whatever the length of the string.
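If the goal is to have the characters themselves in the on-chip RAM, they have to be copied one by one. The following is only a sketch that runs on a PC: fake_onchip, FAKE_ONCHIP_SIZE and copy_string_to_onchip are hypothetical names, and the array stands in for the real memory. On the target, the store inside the loop would be something like IOWR_8DIRECT(ONCHIP_MEMORY2_1_BASE, offset + i, s[i]).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the on-chip RAM so the sketch runs on a host. */
#define FAKE_ONCHIP_SIZE 4096u
static uint8_t fake_onchip[FAKE_ONCHIP_SIZE];

/* Copies the string, including its terminating null byte, byte by byte.
   Returns the number of bytes written, or 0 if it would not fit. */
static size_t copy_string_to_onchip(size_t offset, const char *s)
{
    size_t len = 0;

    assert(s != NULL);
    while (s[len] != '\0') {
        len++;
    }
    len++; /* count the terminating null byte */

    if (offset > FAKE_ONCHIP_SIZE || len > FAKE_ONCHIP_SIZE - offset) {
        return 0; /* not enough room from this offset */
    }
    for (size_t i = 0; i < len; i++) {
        /* On the target this would be a byte-wide write to the memory. */
        fake_onchip[offset + i] = (uint8_t)s[i];
    }
    return len;
}
```

Copied this way, the test string from the question occupies 57 bytes of the memory instead of 4, so the amount of space used now depends on the length of the string.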