From what I've heard, you need to write your C code in a special way to limit the poor performance of the generated hardware (i.e. write HDL in C ;). In the best cases it generated not-so-efficient hardware (faster than the software version, but a lot slower than a real HDL implementation), and in the other cases it wouldn't recognize part of the C code structure and would fail (with a more or less understandable error message). Besides, as others have said here, it isn't maintained any more and doesn't work with Qsys.
I think your two best options for better performance are either to run the software on a hard-core processor (either outside the FPGA, or by using one of the new Cyclone V devices with ARM cores) or to take the time to convert it to HDL. Converting an algorithm written in C to HDL isn't very straightforward, because you often need to completely rethink how the algorithm is implemented. With hardware you can have more parallelism and a more efficient flow by using pipelining, but on the other hand the order of execution and the data flow can be quite different.
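To give a rough idea of what "rethinking the implementation" means, here is a small sketch in plain C, using a 4-tap FIR filter as a made-up example (the filter, the names `fir_software`, `fir_state`, `fir_clock` and the tap count are all my own assumptions, not anything from your project). The first function is how you would naturally write it in software. The second is shaped the way the hardware would work: one sample enters per call (think "one clock cycle"), a shift register holds the recent samples, and the four multiplies are written out separately because in HDL they would be four multipliers working in parallel, feeding an adder tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NTAPS 4 /* assumed tap count, for illustration only */

/* Software style: for each output, loop over the taps and re-read the
   input array. Natural on a CPU, but the nested loop and the random
   access into in[] do not map directly onto hardware. */
void fir_software(const int16_t *in, int32_t *out, size_t n,
                  const int16_t coeff[NTAPS])
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        int32_t acc = 0;
        for (size_t k = 0; k < NTAPS; k++) {
            if (i >= k) /* samples before the start count as zero */
                acc += (int32_t)coeff[k] * in[i - k];
        }
        out[i] = acc;
    }
}

/* Hardware-shaped style: the state is an explicit shift register. */
typedef struct {
    int16_t delay[NTAPS];
} fir_state;

void fir_reset(fir_state *s)
{
    for (size_t k = 0; k < NTAPS; k++)
        s->delay[k] = 0;
}

/* One call models one clock cycle: shift in a sample, produce an output. */
int32_t fir_clock(fir_state *s, int16_t sample, const int16_t coeff[NTAPS])
{
    /* Shift register: in hardware every stage updates at the same time. */
    for (size_t k = NTAPS - 1; k > 0; k--)
        s->delay[k] = s->delay[k - 1];
    s->delay[0] = sample;

    /* Four independent products: four multipliers in parallel in HDL. */
    int32_t p0 = (int32_t)coeff[0] * s->delay[0];
    int32_t p1 = (int32_t)coeff[1] * s->delay[1];
    int32_t p2 = (int32_t)coeff[2] * s->delay[2];
    int32_t p3 = (int32_t)coeff[3] * s->delay[3];

    /* Two-level adder tree; a pipeline register could sit between levels. */
    return (p0 + p1) + (p2 + p3);
}
```

Both versions compute the same outputs, but only the second one makes the data flow explicit: there is no index into a buffer, just a stream of samples and a fixed amount of state. That is usually the form you want to reach on paper before you start writing the HDL.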
Using a profiler is indeed the first thing to do. If you see that some functions are used a lot more than others, you can start thinking about what kind of hardware could replace them.
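If you don't have a profiler set up yet, a crude first pass is to count calls and accumulate time by hand around the functions you suspect. The sketch below uses only the standard `clock()` function; the `prof_counter` type, the `PROF_CALL` macro and the `checksum` function are names I made up for the illustration, and `clock()` resolution on your target may be too coarse for short functions, so treat the numbers as a rough guide only.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical per-function counter: number of calls and total time. */
typedef struct {
    const char *name;
    unsigned long calls;
    clock_t ticks;
} prof_counter;

/* Wrap a single call, adding its elapsed clock() ticks to the counter. */
#define PROF_CALL(counter, call)                      \
    do {                                              \
        clock_t prof_start_ = clock();                \
        call;                                         \
        (counter).ticks += clock() - prof_start_;     \
        (counter).calls++;                            \
    } while (0)

/* Print one line per counter: name, call count, total time in ms. */
void prof_report(const prof_counter *c)
{
    assert(c != NULL);
    printf("%-12s %8lu calls %10.3f ms\n", c->name, c->calls,
           1000.0 * (double)c->ticks / CLOCKS_PER_SEC);
}

/* Stand-in for a hot function from the real application (assumption). */
unsigned checksum(const unsigned char *buf, unsigned len)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
```

You would declare one counter per candidate function, wrap the calls as `PROF_CALL(c, sum = checksum(buf, len));`, and call `prof_report(&c)` at the end of the run. The functions with both a high call count and a large share of the total time are the ones worth considering for a hardware replacement.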