Just came across your post. I've encountered this situation several times before when communicating via TCP between a Windows computer and some kind of embedded device.
It seems many (most? all?) embedded RTOSs maintain a very small TCP transmit queue. Essentially, they block a new packet from going out until the previous packet has been acknowledged, regardless of the TCP_NODELAY option. Note that this blocking occurs in the RTOS, not in the MAC/hardware. Embedded RTOSs do this to minimize the memory needed to buffer unacknowledged data for automatic retransmission in case a packet gets lost. But the embedded system waiting for an ACK, in combination with delayed ACKs on the Windows side (Windows typically holds back an empty ACK for up to about 200 ms, hoping to piggyback it on outgoing data), can cause the long delays you were seeing. Strictly speaking, delayed ACK is a separate mechanism from the Nagle algorithm, although the two are often mentioned together.
(The problem is made even worse if the Windows end of the pipe has not set the TCP_NODELAY option. In this case the Nagle algorithm can hold back small outgoing packets as well, so both transmit packets and empty ACK packets can be delayed.)
One work-around is to ensure that the Windows end sends out at least one byte of dummy data every time it receives a packet from the embedded side. Windows will then send the ACK immediately, piggybacked on the dummy byte (assuming you're using TCP_NODELAY, so the byte itself isn't held back), thus avoiding the delayed-ACK wait. Note that the embedded end must be smart enough to throw away those dummy bytes. Obviously this only works if you have full control over the software on both ends of the pipe.
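Here is a rough sketch of what the Windows side of that work-around might look like, in Python for brevity. The names make_nodelay, recv_and_poke and DUMMY_BYTE are made up for illustration, and the choice of 0x00 as the filler value is an assumption; use whatever byte your embedded protocol can safely discard.

```python
import socket

# Hypothetical filler value; the embedded side must recognize and discard it.
DUMMY_BYTE = b"\x00"

def make_nodelay(sock):
    """Disable the Nagle algorithm so small writes go out immediately."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

def recv_and_poke(sock, bufsize=4096):
    """Read whatever has arrived, then send one dummy byte straight back.

    The outgoing byte gives the TCP stack something to carry the ACK on,
    so the ACK does not sit waiting for the delayed-ACK timer.
    Returns the received data (b"" if the peer closed the connection).
    """
    data = sock.recv(bufsize)
    if data:
        sock.sendall(DUMMY_BYTE)
    return data
```

You would call make_nodelay() once right after connecting, and then use recv_and_poke() wherever you would otherwise call recv(). If your protocol already sends a real reply to every packet, that reply does the same job and no dummy byte is needed.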
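Another work-around is to use UDP packets instead of TCP. UDP has no ACKs or retransmission at the transport level, so there is nothing to stall on, but your application then has to cope with lost or reordered packets itself. Obviously this also works only when you control software at both ends of the pipe.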
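For the UDP route, a minimal sketch of a request/reply exchange with a simple retry might look like the following. The function name udp_request, the timeout, the retry count and the 2048-byte receive size are all assumptions for illustration, not anything your device necessarily expects.

```python
import socket

def udp_request(sock, addr, payload, timeout=0.5, retries=3):
    """Send one datagram and wait for a reply, resending on timeout.

    UDP does no retransmission of its own, so the retry loop here
    stands in for what TCP would otherwise have done.
    Returns the reply bytes, or None if every attempt timed out.
    """
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(payload, addr)
        try:
            reply, _sender = sock.recvfrom(2048)
            return reply
        except socket.timeout:
            continue  # no answer in time; send the datagram again
    return None
```

Note that a resend means the device may see the same request twice, so in practice you would probably want a sequence number in the payload so duplicates can be ignored.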
Paul