Hi all, my name is Vivek Gujari. I have custom-built software written in C# (approx. 50k lines of code) in Visual Studio. I want to design a near-real-time system by accelerating the software's algorithms. My new high-performance laptop takes around 12 min to process a certain amount of data, and I would like a near-real-time system that can process the same amount of data in about 2 min. Please recommend the best Intel product for implementing this design. Any suggestions, or even tutorials on the implementation, would be very helpful. Thank you.
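Editor's sketch: going from 12 min to 2 min is a 6x overall speedup. By Amdahl's law, the part of the program that is left unaccelerated caps what is reachable, so it can help to estimate this before choosing hardware. The function names and the 90% figure below are illustrative assumptions, not measurements of the poster's software.

```python
def minimum_fraction(overall_speedup):
    """Smallest share of the runtime that must be accelerated (Amdahl's law).

    Even if the accelerated part became infinitely fast, the rest still
    runs at its original speed, so it must fit in 1/overall_speedup.
    """
    return 1.0 - 1.0 / overall_speedup


def required_accelerated_speedup(overall_speedup, accelerated_fraction):
    """Speedup the accelerated part needs to reach the overall target.

    Returns None when the target is unreachable for this fraction.
    """
    serial = 1.0 - accelerated_fraction
    # Runtime budget left for the accelerated part, as a fraction of
    # the original total runtime.
    budget = 1.0 / overall_speedup - serial
    if budget <= 0:
        return None
    return accelerated_fraction / budget


target = 12.0 / 2.0  # 12 min down to 2 min
print(minimum_fraction(target))
print(required_accelerated_speedup(target, 0.9))  # 0.9 is a hypothetical profile result
```

Under these assumptions, at least 5/6 (about 83%) of the runtime has to sit in the accelerated code, and if profiling showed 90% there, that part alone would need to run roughly 13.5x faster.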

Hardware acceleration of a custom software | Altera Community

10 Replies

Altera_Forum
Honored Contributor
8 years ago
Look into OpenCL or HLS (high-level synthesis), easy enough to find both at altera.com, including documentation and training.
Altera_Forum
Honored Contributor
8 years ago
@vivelgujari,

I've optimized many algorithms over the years. One thing I've learned over that time is that the best way to optimize an algorithm depends upon the algorithm.

There have been algorithms I've worked with that were memory limited, and so could not be optimized with a (cheap) FPGA.

Other algorithms were easier to build within an FPGA than in software in the first place.

For these reasons, OpenCL may work on some problems, but will probably not work on others--it depends both on the problem and on the hardware.

Dan
Altera_Forum
Honored Contributor
8 years ago
@sstrell- Thank you!
Altera_Forum
Honored Contributor
8 years ago
@Dan - Thank you for the response.
The part of the software I am trying to implement is very iterative. There are lots of matrix operations! I think I will need an FPGA with a big memory!
Do you know of good documentation or a source for algorithm implementation on FPGAs?

Thank you!
Altera_Forum
Honored Contributor
8 years ago
Is it a good idea to use a task subroutine in the RTL? In the algorithm, we frequently use matrix multiplication, so I was thinking of implementing a task for matrix multiplication and calling it wherever required!
Altera_Forum
Honored Contributor
8 years ago
I would tend to avoid them myself. Very few algorithms need the resources that will be generated by using them.
Altera_Forum
Honored Contributor
8 years ago
Hi Dan, thank you for your response.
So is just creating a module for matrix multiplication the best way to do it? Also, do you know of any book or source where one can learn the art of implementing algorithms on an FPGA?
Altera_Forum
Honored Contributor
8 years ago
Many students try to build a one-clock matrix multiply in logic. While this is doable, it will cost you at least N^2 hardware multiplies, and a whole lot of logic to do the additions. If you aren't careful, you'll find most of the FPGA capability that you paid for used up by this approach.

Then, when you step back from the excitement of being able to specify a matrix that can perform one calculation per clock, you'll start to realize that you can't feed data into that matrix multiply at one matrix per clock or even one vector per clock. When you then adjust your algorithm, you'll discover you are using much less of the FPGA than before. Indeed, depending on your application, you may manage to build your entire matrix operation using only one multiply, leaving most of your chip unused and available for other parts of your application.
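Editor's sketch: the trade-off described above can be modelled in software. The example assumes the operation is an N x N matrix times a length-N vector (the case where a one-clock version needs N^2 multipliers) and simulates the other extreme, one shared multiply-accumulate unit that does one product per clock. It is a plain Python model with hypothetical function names, not RTL and not Dan's design.

```python
def multipliers_for_one_clock(n):
    """Hardware multipliers needed to finish an n x n matrix-vector
    product in a single clock: one per matrix element."""
    return n * n


def matvec_single_mac(matrix, vector):
    """Matrix-vector product using one multiply-accumulate per clock.

    Returns the result vector and the number of simulated clocks.
    """
    result = []
    clocks = 0
    for row in matrix:
        acc = 0
        for a, x in zip(row, vector):
            acc += a * x  # the single shared multiplier is used here
            clocks += 1   # one product per simulated clock
        result.append(acc)
    return result, clocks


y, clocks = matvec_single_mac([[1, 2], [3, 4]], [5, 6])
print(y, clocks)
print(multipliers_for_one_clock(64))
```

The single-multiplier version takes N^2 clocks instead of one, but if the data can only arrive at about one element per clock anyway, the extra multipliers of the one-clock version would mostly sit idle.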

At one time I counseled someone who wished to implement a wavelet lifting step within an FPGA. He was excited about the possibilities of working with an FPGA only until he realized that the FPGA he was working with only allowed him to read 16 bits from memory per clock. Then, as I worked with him, it slowly became apparent to him that a soft-core CPU was just about as fast as his FPGA logic would have been, for the specific reason that his particular algorithm was memory bound--it could run no faster than the memory, no matter how much logic was thrown at the problem.
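Editor's sketch: a quick way to check whether an algorithm is memory bound is to compare the clocks needed just to move the data with the clocks the logic needs to compute. The 16-bit bus width comes from the story above; the sample count, the compute figure, and the function name are made-up illustrations.

```python
def min_clocks(total_bits, bus_bits_per_clock, compute_clocks):
    """Lower bound on run time in clocks.

    The design can finish no sooner than the slower of:
    reading all the data over the memory bus, or doing the computation.
    Returns (lower_bound, memory_clocks).
    """
    # Ceiling division: a partial bus word still costs a full clock.
    memory_clocks = -(-total_bits // bus_bits_per_clock)
    return max(memory_clocks, compute_clocks), memory_clocks


samples = 1_000_000                # hypothetical data set size
total_bits = samples * 16          # 16-bit samples, each read once
bound, memory_clocks = min_clocks(total_bits, 16, compute_clocks=1_000)
print(bound, memory_clocks)
```

In this made-up case the logic could finish in 1,000 clocks, yet the bound is 1,000,000 clocks because every sample still has to cross the 16-bit bus. Adding more logic would not change that number; only a wider or faster memory path would.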

As for books or sources, I know of a couple of web references you might like. One of them is fpga4fun.com--they do a good job of making some very complicated concepts into fun projects. A second website I might recommend is my own, zipcpu.com (http://zipcpu.com). I've tried to dedicate the website to keeping newbies out of FPGA hell (http://zipcpu.com/blog/2017/05/19/fpga-hell.html), but I'll let you be the one to decide whether I've met my goal or not. Particular articles others have found valuable include a description of why the clock is so important (http://zipcpu.com/blog/2017/09/18/clocks-for-sw-engineers.html), a discussion of the FPGA design process and how it often differs from what is taught in the classroom (http://zipcpu.com/blog/2017/06/02/design-process.html), as well as a description of several strategies (http://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html) that can be used when pipelining logic. I'm currently, but slowly, working through a series of FIR filter implementations, so you should be able to find valuable examples of filters there as well.

Dan
Altera_Forum
Honored Contributor
8 years ago
Hi Dan,

Thank you for the valuable response. I have read the blog post on clocks for software engineers and it's very helpful. I will also go through some of your projects. They will give me good guidance.

You have a great weekend and a happy new year!
BShar16
New Contributor
6 years ago
Hello,
You have made great software. I was trying to build the same thing, but I failed. Now I will try again, following your steps and the answers above.
Thank you for the suggestions.
