Forum Discussion
NIOS isn't a particularly fast processor in comparison to, say, an ARM Cortex-M4 at 150 MHz, a Cortex-A8/A9, most modern DSP chips, or pretty much any x86 CPU. It is convenient for SoC designs that need some programmability and can live with performance on the order of 20-100 MIPS, give or take. Where a NIOS-based design can really shine in terms of performance, however, is in the ability to create custom FPGA-logic-based hardware / algorithms and interface them tightly to the NIOS. There are two ways to do that. The first is to create a soft FPGA 'peripheral' that is memory mapped into the NIOS address space, and you may use it as any other "off CPU" peripheral via memory-mapped I/O. The second is to use the custom instruction mechanism offered by NIOS to couple a hardware calculation engine more tightly into NIOS execution, by having the black box act as the execution engine for such a special user-defined instruction. There are recent threads you should look at concerning the benefits and trade-offs of custom instructions versus generic memory-mapped peripherals.
Some algorithms are so structured that they don't parallelize well either in software or in hardware, so there will be severe serial-execution performance limits whether you execute them in software via a sequence of instructions or in hardware via some specially crafted state machine or synchronous logic plus look-up tables. Some algorithms simply need so many registers, memory blocks, and so on that they aren't practical to implement "in hardware" other than via some kind of sequenced state machine, similar to a CPU, which achieves efficiency by using RAM/FLASH for much of the program and data storage once the register and ALU resources are exhausted. If your algorithm is limited in performance by serial paths that cannot be parallelized, but you need to calculate it independently on many distinct inputs, you may be able to compute y1 = f(x1); y2 = f(x2); y3 = f(x3); ... yn = f(xn) in parallel and achieve a speed-up of N even for a serial algorithm, provided that you have the parallel FPGA memory/register/ALU resources available to do that. If you can efficiently parallelize the algorithm within a single invocation y = f(x), then there is a possible high-performance scenario in which you can calculate the result very quickly, limited only by having enough FPGA resources operating at high enough speeds, until you become serial-execution limited or resource/timing limited. There is also a maximum rate at which NIOS can execute a given instruction such as a simple y = f(x), load a new x, and store a new y result for each iteration.
If you can't usefully get your algorithm to execute faster than that limiting rate even with a fully hardware-implemented solution, then using NIOS as an engine to feed data into your algorithm, store the results, and possibly help calculate the non-performance-critical parts may be beneficial. If, on the other hand, NIOS really would hold back your performance and is not performing essential functions for you, maybe it is best to just use a custom hardware / state-machine implementation. You should understand how to efficiently partition your algorithm into hardware logic, state machine, and memory/register resources in order to make an informed decision about how best to accelerate its implementation. You can't neglect the fetching of input data and the storage of output data, since the block RAM or DRAM you might use will also be a throughput-limiting factor in high-performance algorithms. First you probably ought to implement the algorithm in C on x86/SSE (possibly also MATLAB/SCILAB), learn about its algorithmic complexity and bottlenecks via analysis, then construct a Verilog implementation of it that uses the resources available within your target FPGA, trading off resource versus speed efficiency intelligently depending on the performance criticality of the various pieces of the algorithm. Once that works, you should easily be able to tie the "black box" in as either a custom instruction/peripheral or a fully independent hardware engine not using NIOS. If performance is ALL you care about, and the FPGA is just a means to that end, you might be disappointed with the performance vs. cost of FPGA solutions versus what you can garner from x86 or GPGPU implementations.
For many classes of problems, the dedicated-architecture but software-programmable silicon of the latter will exceed the performance of FPGA solutions, especially given factors like RAM bus bandwidth and speed, the quantity of memory available, and gigahertz clock speeds translating into high MIPS for algorithms that can run their core loops in L1/L2 cache. FPGAs will generally win in cases where there is no efficient mapping of a given core ALU operation to CPU/GPU instructions or to fast cache/register-based look-up tables. FPGAs are also good at narrow data widths -- operating on 1-bit, 2-bit, 4-bit data, etc. For 32-bit, 64-bit, floating-point, etc. data types, the mainstream CPUs tend to be pretty highly optimized versus what you can synthesize on a medium-sized FPGA.