I know about DMA and that it would be more efficient. However, the goal here is different: neither efficiency nor consistency is the issue.
On CentOS 5 for x86, pgprot_noncached() is equivalent to setting the _PAGE_PCD | _PAGE_PWT flags:
include/asm-x86_64/pgtable.h:314:#define pgprot_noncached(prot) (__pgprot(pgprot_val(prot) | _PAGE_PCD | _PAGE_PWT))
I verified that these flags are not set when mmap() is called.
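For reference, the effect of that macro on the protection value can be sketched in user space. This is only an illustration, assuming the architectural x86 bit positions (PWT is bit 3, PCD is bit 4). The names X86_PAGE_PWT, X86_PAGE_PCD, prot_noncached() and prot_is_uncached() are made up for this example and are not kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-table entry cache-control bits (architectural positions). */
#define X86_PAGE_PWT 0x008ULL /* bit 3: page write-through */
#define X86_PAGE_PCD 0x010ULL /* bit 4: page cache disable */

/* Mirrors the macro quoted above: OR both cache-control bits into prot. */
static inline uint64_t prot_noncached(uint64_t prot)
{
    return prot | X86_PAGE_PCD | X86_PAGE_PWT;
}

/* Hypothetical check: true only when both PCD and PWT are set. */
static inline bool prot_is_uncached(uint64_t prot)
{
    const uint64_t mask = X86_PAGE_PCD | X86_PAGE_PWT;
    return (prot & mask) == mask;
}
```

A check like prot_is_uncached() applied to the raw protection value before and after the macro is one way to confirm whether a mapping was requested as uncached.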
I think it might be a hardware issue. I found the following document, which says that a particular PCIe device implementation cannot be placed in cacheable memory space because of an implementation limitation:
TI document "KeyStone Architecture Peripheral Component Interconnect Express (PCIe)":
--- Quote Start ---
No support for addressing modes other than incremental for burst transactions.
Thus, the PCIe addresses cannot be in cacheable memory space
--- Quote End ---
Apparently, PCI had support for cache line access:
Wikipedia article about Conventional_PCI#Burst_addressing
So I am wondering whether the Stratix IV implementation of PCIe has any limitations or issues that prevent it from working correctly with processor caching.
P.S. I also tried mmapping /dev/mem at the PCIe BAR address, with the same result: reads work as expected, but writes cause a system reboot.
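For anyone who wants to repeat the /dev/mem test, a minimal sketch of that kind of mapping is below. It is not the exact code I ran. The helper names page_base() and map_bar() are hypothetical, and the BAR physical address has to be taken from your own system (for example from lspci -v). O_SYNC is commonly used to ask for an uncached mapping of /dev/mem, but whether it has that effect depends on the kernel, so treat it as an assumption.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Round a physical address down to its page boundary.
 * mmap() requires a page-aligned file offset; page_size must be a power of two. */
static uint64_t page_base(uint64_t phys, uint64_t page_size)
{
    return phys & ~(page_size - 1);
}

/* Hypothetical helper: map len bytes of a BAR at physical address bar_phys
 * through /dev/mem. Returns NULL on failure. Needs root privileges. */
static volatile uint32_t *map_bar(uint64_t bar_phys, size_t len)
{
    long psz = sysconf(_SC_PAGESIZE);
    if (psz <= 0)
        return NULL;

    uint64_t base = page_base(bar_phys, (uint64_t)psz);
    size_t delta = (size_t)(bar_phys - base); /* offset of the BAR inside its page */

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len + delta, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, (off_t)base);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    return (volatile uint32_t *)((char *)p + delta);
}
```

The pointer is volatile so the compiler does not reorder or merge the register accesses; that is separate from how the CPU caches the page.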