Thank you for your answers.
I increased the instruction and data caches to their maximum sizes and gained a few fps. I think the bottleneck is the Avalon bus.
Before writing my own transport protocol, I want to try separating the data and instruction buses to reduce this bottleneck.
I couldn't find any documentation for that on Google, and when I try it with a simple example (a Nios, a jtag_uart, and two on-chip memories, with the hello_world_small template), it doesn't work.
Did I miss some documentation? Do I have to change something in the configuration of the Nios compiler?