Hi Martin,
If I understand you correctly, the problem appears more frequently when you remove your TX part, including all the functionality required for read completions and DMA transmission. That makes me think missing or wrong completions might be the cause of the problems.
Sending a CA (Completer Abort) as a Cpl is typically faster than sending a CplD. You always have the problem of managing the reception of non-posted requests while you are still handling the last Cpl/CplD, which is probably slowed down even more when the Hard/Soft IP de-asserts tx_ready. Altera's approach to this problem, adding rx_st_mask, is nice in principle but of little practical use: you still have to buffer a whopping 14 more read requests in 64-bit AST mode, and you have to do it inside the application, which contradicts the idea of a wrapper IP.
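To make the buffering requirement concrete, here is a rough behavioural sketch in Python (not RTL, and not Altera's implementation). The class name, the `mask_threshold` parameter and the rule that the mask is asserted once the queue reaches that threshold are assumptions of mine; only the figure of 14 extra requests comes from the point above.

```python
from collections import deque

# From the post: after rx_st_mask is asserted, up to 14 further non-posted
# requests may still arrive in 64-bit Avalon-ST mode.
MASK_LATENCY_REQUESTS = 14


class PendingReadFifo:
    """Toy model of an application-side buffer for non-posted requests.

    Hypothetical helper for illustration only; it is not part of any IP core.
    """

    def __init__(self, mask_threshold):
        # Assumed policy: assert the mask once this many requests are queued.
        self.mask_threshold = mask_threshold
        # The FIFO must absorb everything that arrives after the mask is set.
        self.depth = mask_threshold + MASK_LATENCY_REQUESTS
        self.queue = deque()

    @property
    def rx_st_mask(self):
        """True when the application should ask the IP to hold back requests."""
        return len(self.queue) >= self.mask_threshold

    def push(self, tag):
        """Accept a new read request, identified here only by its tag."""
        if len(self.queue) >= self.depth:
            raise OverflowError("non-posted request lost: FIFO too shallow")
        self.queue.append(tag)

    def pop(self):
        """Hand the oldest request to the completion logic, or None if empty."""
        return self.queue.popleft() if self.queue else None
```

With `mask_threshold=2` the model needs 16 entries in total: two before the mask goes up, plus the 14 that may still follow. That is the storage you end up building in the application.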
What you could actually do is experiment with max_payload_size and the RX buffer space allocation performance parameters. In particular, try to keep the PCIe buffer parameters as close to their defaults as possible.