Altera_Forum
Honored Contributor
20 years ago
I know that this is a little bit late, but I have done some detailed speed measurement and tuning that may be relevant. The environment is a 50MHz NiosII on a board with a Cyclone 1c20 and a 91c111, very similar to the 1c20 demo board. Static RAM is used for program and data during execution. A flash is used for boot only. Builds have been done with the 1.1 tools and the Beta 5.0 tools.
The basic system operation is bulk data collection and transfer to a host PC. Data typically arrives in 4K chunks via DMA. Because the DMA transfers some useless bits, each chunk is quickly reduced to a 2K chunk using a copy operation. Headers are then prepended and the data is sent to the PC via TCP. The data streams out with no application-level acknowledgement.

Some of the code optimizations have been in for a long time, and I don't have a good baseline measurement without them. I started this work with toolkit 1.0. The optimizations were as follows: memcpy (increased the level of loop unrolling), inet/chksum (unrolled the inner loop), and the 91c111 driver (unrolled the inner loop of the transmission algorithm). I am also considering unrolling the 91c111 driver's receive inner loop and getting rid of the rx thread entirely. These latter two steps haven't been taken yet.

I have found that -O3 produces very much the same results as -O2, though it does generate significantly larger code. Space is an issue in this system, so I just use -O2.

Using the 1.1 toolkit, I was able to get about 1.5ms per data chunk, which corresponds to about 11Mbps. With the 5.0 toolkit, I slowed down to about 1.9ms per data chunk, or about 8.5Mbps. Though I bemoan the slowdown with the 5.0 toolkit, I have found that the 1.1.0 lwIP in 5.0 is more robust under packet loss than the 0.7.2 lwIP in the 1.1 toolkit.

The timing measurements were taken with a logic analyzer. I modified os_cpu_c.c to drive some outputs on a port that I could observe. I used one port bit per task, so I was able to get a nice waveform showing active task times. I also added bits to track the time spent in memcpy, the checksum, tcp_write, and tcp_output, though these weren't strictly necessary.

If you really need speed, plan on doing some tuning. Make provisions for measuring time to guide your tuning efforts. If you can get an ethernet controller that operates as a DMA bus master, you should.
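To illustrate the kind of inner-loop unrolling described for inet/chksum, here is a minimal sketch in C. It is not the lwIP code. The function names (cksum_fold, cksum_simple, cksum_unrolled, cksum_selftest) and the unroll factor of 8 are assumptions chosen for illustration. It shows a plain 16-bit one's-complement sum next to an unrolled version, plus a small check that the two agree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator to 16 bits and return its one's complement. */
uint16_t cksum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)(~sum & 0xFFFFu);
}

/* Baseline: one 16-bit word per loop iteration. */
uint16_t cksum_simple(const uint16_t *p, size_t nwords)
{
    uint32_t sum = 0;
    while (nwords--)
        sum += *p++;
    return cksum_fold(sum);
}

/* Unrolled by 8 (hypothetical factor): fewer loop tests and branches per
 * word. The 32-bit accumulator cannot overflow for buffers well under
 * 128 KB, which covers the 2K chunks discussed in the post. */
uint16_t cksum_unrolled(const uint16_t *p, size_t nwords)
{
    uint32_t sum = 0;
    while (nwords >= 8) {
        sum += p[0]; sum += p[1]; sum += p[2]; sum += p[3];
        sum += p[4]; sum += p[5]; sum += p[6]; sum += p[7];
        p += 8;
        nwords -= 8;
    }
    while (nwords--)            /* leftover words, at most 7 */
        sum += *p++;
    return cksum_fold(sum);
}

/* Check that both versions agree on a 2K chunk and on an odd-sized one. */
void cksum_selftest(void)
{
    static uint16_t chunk[1024];
    size_t i;
    for (i = 0; i < 1024; i++)
        chunk[i] = (uint16_t)(i * 31u + 7u);
    assert(cksum_unrolled(chunk, 1024) == cksum_simple(chunk, 1024));
    assert(cksum_unrolled(chunk, 1021) == cksum_simple(chunk, 1021));
}
```

Whether 8 is the right factor depends on code size and the instruction cache, so it is worth measuring on the target, in the same way the post measures everything else.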
It's silly to be transcribing data to a fast ethernet chip the way we do with the 91c111.

If you do plan to use UDP rather than TCP, consider just doing it yourself without involving the stack. It's not a big deal. If the data link is one hop over an ethernet, you could also consider dispensing with the UDP checksum. Since you are protected by the ethernet CRC, the UDP checksum adds little.

Avoid transcriptions to the extent possible. Wherever you have a loop processing your bulk data, make sure that it is unrolled.

If you leave time for this in your project, you'll probably enjoy doing it. Speed tuning is kind of fun if you're not under the gun when you're doing it. Good luck!
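As a rough idea of what "doing UDP yourself" involves, here is a minimal sketch in C that builds only the 8-byte UDP header in front of a payload, with the checksum field set to zero. Over IPv4, a zero UDP checksum means "no checksum computed". The names udp_build and put_be16 are hypothetical. A real sender would still need the IP and Ethernet headers, and would ideally write the header and payload straight to the controller rather than copying into a staging buffer as done here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UDP_HDR_LEN 8u

/* Store a 16-bit value in network (big-endian) byte order. */
void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xFFu);
}

/* Build a UDP datagram (header + payload) in 'out'.
 * Returns the datagram length, or 0 if it does not fit.
 * The checksum field is left at zero, i.e. "not computed" (IPv4 only). */
size_t udp_build(uint8_t *out, size_t out_cap,
                 uint16_t src_port, uint16_t dst_port,
                 const uint8_t *payload, size_t payload_len)
{
    size_t total = UDP_HDR_LEN + payload_len;

    assert(out != NULL && (payload != NULL || payload_len == 0));
    if (total > out_cap || total > 0xFFFFu)
        return 0;

    put_be16(out + 0, src_port);
    put_be16(out + 2, dst_port);
    put_be16(out + 4, (uint16_t)total);  /* length covers header + data */
    put_be16(out + 6, 0);                /* checksum disabled */
    if (payload_len > 0)
        memcpy(out + UDP_HDR_LEN, payload, payload_len);
    return total;
}
```

The memcpy is itself one of the transcriptions the post warns about. It is kept only so the sketch is self-contained.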