Many students try to build a one-clock matrix multiply in logic. While this is doable, it will cost you at least N^2 hardware multiplies for an NxN matrix, plus a whole lot of logic to do the additions. If you aren't careful, you'll find most of the FPGA capability that you paid for used up by this approach.
Then, when you step back from the excitement of being able to specify a matrix multiply that completes one calculation per clock, you'll start to realize that you can't feed data into it at one matrix per clock, or even one vector per clock. Once you adjust your algorithm to match the rate at which data actually arrives, you'll discover you are using much less of the FPGA than before. Indeed, depending on your application, you may manage to build your entire matrix operation using only one multiply, leaving most of your chip unused and available for other parts of your application.
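To make the single-multiply idea concrete, here is a rough software model of a matrix-vector product that reuses one multiply-accumulate unit, doing one product per clock. This is only a sketch, not HDL: the function name and the clock counter are made up for illustration, and it assumes one matrix entry and one vector entry are available on every clock.

```python
def matvec_one_multiplier(matrix, vector):
    """Model y = A*x with a single shared multiplier, one product per clock.

    Hypothetical helper for illustration only; returns (result, clocks_used).
    """
    result = []
    clocks = 0
    for row in matrix:
        acc = 0  # accumulator register, cleared at the start of each row
        for a, x in zip(row, vector):
            acc += a * x  # the one hardware multiply, reused every clock
            clocks += 1
        result.append(acc)
    return result, clocks


if __name__ == "__main__":
    y, clocks = matvec_one_multiplier([[1, 2], [3, 4]], [5, 6])
    print(y, clocks)  # [17, 39] 4
```

The cost moves from N^2 multipliers to N^2 clocks for an NxN matrix. That is a fine trade whenever the data can't arrive any faster than that anyway.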
I once counseled someone who wished to implement a wavelet lifting step within an FPGA. He was excited about the possibilities of working with an FPGA until he realized that the one he was working with only allowed him to read 16 bits from memory per clock. As I worked with him, it slowly became apparent to him that a soft-core CPU could run his algorithm just about as fast as dedicated logic could, because his particular algorithm was memory bound: it could run no faster than the memory, no matter how much logic was thrown at the problem.
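A quick back-of-the-envelope model shows why extra logic doesn't help a memory-bound design. The numbers and names here are assumptions for illustration, not details of his project: it counts only memory reads, and it assumes the compute logic handles `parallel_units` samples per clock.

```python
def min_clocks(samples, bits_per_sample, bus_bits_per_clock, parallel_units):
    """Lower bound on run time in clocks: the slower of memory and compute wins.

    Hypothetical model: reads only, perfect pipelining, no other stalls.
    """
    total_bits = samples * bits_per_sample
    # Integer ceiling division: clocks needed just to read the data.
    memory_clocks = -(-total_bits // bus_bits_per_clock)
    # Clocks needed if the logic processes parallel_units samples per clock.
    compute_clocks = -(-samples // parallel_units)
    return max(memory_clocks, compute_clocks)


if __name__ == "__main__":
    # 1024 samples of 16 bits over a 16-bit memory interface.
    print(min_clocks(1024, 16, 16, 1))  # 1024
    print(min_clocks(1024, 16, 16, 8))  # 1024: 8x the logic, no faster
```

With a 16-bit memory interface the read time sets the floor, so eight times the logic buys nothing. In this model only a wider or faster memory path lowers the count.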
As for books or sources, I know of a couple of web references you might like. One of them is fpga4fun.com. They do a good job of turning some very complicated concepts into fun projects. A second website I might recommend is my own, zipcpu.com (http://zipcpu.com). I've tried to dedicate the website to keeping newbies out of FPGA Hell (http://zipcpu.com/blog/2017/05/19/fpga-hell.html), but I'll let you be the one to decide whether I've met my goal or not. Particular articles others have found valuable include a description of why the clock is so important (http://zipcpu.com/blog/2017/09/18/clocks-for-sw-engineers.html), a discussion of the FPGA design process and how it often differs from what is taught in the classroom (http://zipcpu.com/blog/2017/06/02/design-process.html), as well as a description of several strategies (http://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html) that can be used when pipelining logic. I'm currently, but slowly, working through a series of FIR filter implementations, so you should be able to find valuable examples of filters there as well.
Dan