Forum Discussion
Altera_Forum
Honored Contributor
21 years ago

I have to agree with BadOmen on this one. The nature of the inner loop makes this achievable in a page or two of Verilog code, at most. If you wanted more control it could be parameterized quite easily with control registers that are driven by Nios.
The peripheral would look like this:

- An Avalon slave port, with a register set that the CPU can access. The first loop would require a bit of setup on the part of the CPU: loading in the source & destination buffer base addresses, the number of times through the loop, and what to do the logic operation against, then kicking off the transfer with a control register. Subsequent transfers (assuming that none of these startup parameters changed) would be done by again telling the control register to start -- very low overhead on the CPU's part. This leaves it free to run other threads to sustain the system.
- An Avalon master port, read only, that reads the source buffer (address & read outputs, readdata input)
- An Avalon master port, write only, that writes to the destination buffer (address, write, and writedata outputs)
- Control registers as described above in the Avalon slave, selected with read/write & address
- Counters to increment the pointers
- A comparator to look at the readdata, AND it with your coefficient, etc.

I think that this is something we (Altera) have to do a better job of hammering on: we're in an FPGA! Our largest value proposition is that we aren't limited to an instruction set & processor architecture to solve a problem. If we take the processor-centric view of this, we are left with increasing clock speed, memory cost, etc. to increase the bandwidth of the transfer (just like you'd do in a non-FPGA-based processor system). However, by spending a small number of LEs (my guess for a peripheral like this, including the Avalon logic that would be generated with it: about 150 LEs), this inner loop would fly along at one transfer per clock, including the comparison to see whether the transfer could occur (assuming that your program & video memories can sustain that throughput).

Here are some resources that may be of interest if you think this is worth studying more.
The first two are excellent articles written by a colleague of mine that demonstrate things such as CRC calculation in C code and converting that to hardware. The last I wrote; it is kind of dated now (from the early Nios I days) but still covers the fundamentals:

http://www.embedded.com/showarticle.jhtml?articleid=17500157&pgno=1
http://www.embedded.com/showarticle.jhtml?articleid=12800116
http://www.altera.com/education/events/northamerica/sdr_forum_2002/pldf097.pdf (look for the checksum portion of the article)

In this last one, look for the Nios vs. custom peripheral comparison at the end. Nios II actually turned out to be much faster at this algorithm than Nios I (which the article is based on), as it is so math-intensive and the 32-bit instruction set helps with that. But the custom hardware -- again, the Verilog equivalent of the C code -- still performs at 10x the Nios II speed at the same clock speed:

http://www.altera.com/literature/wp/wp_qrd.pdf
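
To make the peripheral described above a bit more concrete, here is a rough C reference model of what one "start" command would do. It is only a sketch under assumptions: the register names (`src_base`, `dst_base`, `count`, `coeff`), the struct layout, and the exact test (copy a word only when `readdata & coeff` is non-zero) are placeholders I made up for illustration, since the real inner loop is the one from your own code. Each pass through the `for` loop corresponds to what the hardware would do in one clock, memory bandwidth permitting.

```c
#include <stdint.h>

/* Hypothetical register set behind the Avalon slave port.
 * All names and the layout are assumptions for illustration only. */
typedef struct {
    const uint32_t *src_base; /* source buffer base address            */
    uint32_t       *dst_base; /* destination buffer base address       */
    uint32_t        count;    /* number of times through the loop      */
    uint32_t        coeff;    /* value the logic operation is against  */
} xfer_regs_t;

/* Software model of one transfer kicked off via the control register.
 * Returns the number of words actually written to the destination. */
static uint32_t xfer_run(const xfer_regs_t *regs)
{
    uint32_t written = 0;

    for (uint32_t i = 0; i < regs->count; i++) {
        /* Read master: fetch one word from the source buffer. */
        uint32_t readdata = regs->src_base[i];

        /* Comparator: assumed test, AND with the coefficient and
         * transfer only when the result is non-zero. */
        if ((readdata & regs->coeff) != 0) {
            /* Write master: store the word to the destination buffer. */
            regs->dst_base[i] = readdata;
            written++;
        }
    }
    return written;
}
```

On the CPU side, the first transfer would fill in all four fields and call `xfer_run()`; later transfers with unchanged parameters would only need the call itself, which mirrors the "just hit the control register again" behaviour of the hardware version.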