You are right that you can have an Avalon-MM bus with a 1024-bit data width, and if you connect a master and a slave that both have a 1024-bit data width, they will be able to transfer 1024 bits on each clock cycle (see, you aren't dumb ;) ). Be careful with a bus that wide, though: it considerably increases the resources used on the FPGA.
The problem in your case is that the CPU itself is 32-bit. So even if you connect it to a component with a 1024-bit data bus, the CPU will only be able to read or write 32 bits at a time (using byte enables), which means moving one full 1024-bit word takes 32 separate accesses.
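To make the cost concrete, here is a minimal C sketch of what the CPU side ends up doing. The names `write_wide`, `WIDE_BITS` and `WORDS_PER_WIDE` are made up for illustration, and `slave` stands in for whatever base address your component is mapped at; here it is just a pointer, so the code also runs on a PC against a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WIDE_BITS      1024
#define WORDS_PER_WIDE (WIDE_BITS / 32)   /* 32 CPU-sized words per wide word */

/* Copy one 1024-bit value to the slave, one 32-bit access at a time.
 * 'slave' is a hypothetical pointer to the component's address window.
 * Returns the number of bus accesses the CPU had to issue. */
static size_t write_wide(volatile uint32_t *slave,
                         const uint32_t value[WORDS_PER_WIDE])
{
    size_t accesses = 0;

    assert(slave != NULL && value != NULL);
    for (size_t i = 0; i < WORDS_PER_WIDE; i++) {
        slave[i] = value[i];   /* each store is one 32-bit bus write */
        accesses++;
    }
    return accesses;
}
```

Whatever the width of the slave port, the loop still runs 32 times per wide word, so the 1024-bit bus does not buy the CPU anything by itself.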
If you have a lot of operations to perform on those big words, one idea is to implement your IP as a full ALU with a bank of 1024-bit registers. The CPU would then only need to transfer the actual values at the beginning and end of the algorithm; for all the intermediate steps it would only need to send instructions, and the values themselves would stay in your IP component. Of course, it depends entirely on what you want to do and how many intermediate values you need.
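As a rough illustration of the register-bank idea, here is a small software model in C. Everything in it is an assumption made up for the sketch: the `wide_alu` struct, the bank depth of 8, the two opcodes, and the instruction encoding (opcode and three register indices packed into one 32-bit word). It only models the bus traffic and the arithmetic, not real hardware timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS 32   /* 1024 bits / 32 bits per CPU access */
#define NREGS 8    /* hypothetical register-bank depth */

enum { OP_ADD = 0, OP_XOR = 1 };   /* hypothetical opcodes */

typedef struct {
    uint32_t reg[NREGS][WORDS];    /* wide registers, least significant word first */
    unsigned bus_accesses;         /* number of 32-bit CPU transfers so far */
} wide_alu;

/* Load one wide register from the CPU: 32 bus writes. */
static void alu_load(wide_alu *a, unsigned r, const uint32_t v[WORDS])
{
    assert(r < NREGS);
    for (size_t i = 0; i < WORDS; i++) {
        a->reg[r][i] = v[i];
        a->bus_accesses++;
    }
}

/* Read one wide register back to the CPU: 32 bus reads. */
static void alu_store(wide_alu *a, unsigned r, uint32_t out[WORDS])
{
    assert(r < NREGS);
    for (size_t i = 0; i < WORDS; i++) {
        out[i] = a->reg[r][i];
        a->bus_accesses++;
    }
}

/* Pack an instruction into a single 32-bit word (made-up encoding). */
static uint32_t alu_encode(unsigned op, unsigned dst, unsigned s1, unsigned s2)
{
    return ((uint32_t)op << 24) | ((uint32_t)dst << 16) |
           ((uint32_t)s1 << 8) | (uint32_t)s2;
}

/* Issue one instruction: a single bus write. The operands stay in the bank. */
static void alu_exec(wide_alu *a, uint32_t insn)
{
    unsigned op  = insn >> 24;
    unsigned dst = (insn >> 16) & 0xFFu;
    unsigned s1  = (insn >> 8) & 0xFFu;
    unsigned s2  = insn & 0xFFu;
    uint32_t carry = 0;

    assert(dst < NREGS && s1 < NREGS && s2 < NREGS);
    a->bus_accesses++;

    for (size_t i = 0; i < WORDS; i++) {
        uint32_t x = a->reg[s1][i];
        uint32_t y = a->reg[s2][i];
        if (op == OP_ADD) {
            uint64_t sum = (uint64_t)x + y + carry;   /* ripple the carry upward */
            a->reg[dst][i] = (uint32_t)sum;
            carry = (uint32_t)(sum >> 32);
        } else {
            a->reg[dst][i] = x ^ y;
        }
    }
}
```

In this model, loading two operands, issuing two instructions and reading back one result costs 32 + 32 + 1 + 1 + 32 = 98 bus accesses. Each extra intermediate step adds one access, whereas shipping two operands in and a result out for every step would add 96.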