Actually, we are planning to put everything into hardware eventually, on the assumption that if we can get good results in software first, we can then just convert that logic into hardware. When returning data to USB, I take one packet at a time from the SGDMA and wrap it. I also tried wrapping 5 packets into one USB packet and sending that. But the issue is that the receive packet queue fills up before all of the packets have been forwarded to USB. So I think my USB side is slower than the TSE, which results in packet loss.
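To make the queue problem concrete, here is a rough model in C of what I think is happening. It is only a sketch, not my actual code: `QUEUE_DEPTH`, `rx_queue_model`, `model_step` and `model_run` are made-up names, and the arrival and drain rates are arbitrary numbers, not measurements.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_DEPTH 8  /* hypothetical queue depth, not the real value */

typedef struct {
    size_t count;    /* packets currently waiting in the receive queue */
    size_t dropped;  /* packets lost because the queue was full */
    size_t sent;     /* packets forwarded to USB */
} rx_queue_model;

/* One time step: `arrivals` packets come in from the TSE side,
 * then up to `drain` packets are forwarded to USB. */
static void model_step(rx_queue_model *q, size_t arrivals, size_t drain)
{
    for (size_t i = 0; i < arrivals; i++) {
        if (q->count < QUEUE_DEPTH)
            q->count++;
        else
            q->dropped++;  /* queue full: this packet is lost */
    }

    size_t n = (drain < q->count) ? drain : q->count;
    q->count -= n;
    q->sent += n;

    assert(q->count <= QUEUE_DEPTH);
}

/* Run the model for `steps` time steps at fixed rates. */
static rx_queue_model model_run(size_t steps, size_t arrivals, size_t drain)
{
    rx_queue_model q = {0, 0, 0};
    for (size_t s = 0; s < steps; s++)
        model_step(&q, arrivals, drain);
    return q;
}
```

With equal rates nothing is dropped. As soon as the drain rate is even slightly below the arrival rate, the queue fills after a few steps and then loses packets on every step, no matter how deep it is. That would also explain why wrapping 5 packets into one USB packet only helps if it raises the sustained USB rate above the TSE rate.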
I also used the profiler and the high-performance counter, but the execution was done only once, and in that run nothing looks like a bottleneck, and neither USB nor TSE appears to be eating more CPU than the other.
Another option I am now trying is to poll the SGDMA instead of using interrupts, so that packet loss can be controlled at the cost of bandwidth. Do you think this is a viable solution, or is putting everything into hardware the only way?
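Roughly, the polling approach I have in mind looks like the sketch below. This is an assumption-laden outline and does not use the real SGDMA or USB driver calls: `rx_desc`, its `done` flag, `usb_send_fn`, `demo_usb_send` and `poll_rx_ring` are all hypothetical stand-ins. The idea is that a receive buffer is only handed back after its packet has actually gone out over USB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4  /* hypothetical number of receive descriptors */

typedef struct {
    volatile bool done;  /* set when the buffer holds a received packet */
    size_t len;          /* length of that packet in bytes */
} rx_desc;

/* Hypothetical USB send hook: returns false if USB cannot take the packet yet. */
typedef bool (*usb_send_fn)(const rx_desc *d);

/* Stand-in sender for demonstration only: always succeeds, counts bytes. */
static size_t demo_bytes_sent = 0;
static bool demo_usb_send(const rx_desc *d)
{
    demo_bytes_sent += d->len;
    return true;
}

/* Poll the ring once, starting at *next. Completed descriptors are forwarded
 * in order; polling stops at the first incomplete descriptor or when USB
 * refuses a packet. Returns the number of packets forwarded. */
static size_t poll_rx_ring(rx_desc ring[RING_SIZE], size_t *next, usb_send_fn send)
{
    assert(next != NULL && send != NULL && *next < RING_SIZE);

    size_t forwarded = 0;
    while (forwarded < RING_SIZE && ring[*next].done) {
        if (!send(&ring[*next]))
            break;                      /* USB busy: keep the packet, retry later */
        ring[*next].done = false;       /* buffer is free for the receive side again */
        *next = (*next + 1) % RING_SIZE;
        forwarded++;
    }
    return forwarded;
}
```

Calling `poll_rx_ring` from the main loop means the receive side can never get further ahead than `RING_SIZE` buffers, so any loss happens before the queue rather than inside it. The cost is that receive throughput is capped at whatever USB can sustain.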