--- Quote Start ---
I know the multiply is unrolled 64 times, so I need to fetch 64 values (64 * 16 = 1024 bits) in one clock cycle,
but since the local memory is optimized to a single read and each read fetches 1024 bits, I am using only 1 RAM block rather than 64 RAM blocks, right?
--- Quote End ---
No. Even without taking replication into account, your buffer is 576 * 1024 = 589824 bits in size. With 20 Kb per Block RAM, you need roughly 30 blocks just to fit the buffer. Furthermore, each Block RAM has two 32-bit ports, and you cannot read 1024 bits per clock from a 32-bit port. The write port has to be connected to every Block RAM used to implement the buffer, and the 1024-bit read port is split across them, which requires a minimum of 1024 / 32 = 32 Block RAMs to provide enough port width. After adding other overheads (address calculation, routing, etc.), the compiler ends up using 64 Block RAMs. This configuration is effectively optimal and is unlikely to be improved further.
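
The arithmetic above can be checked with a short Python sketch. It only reproduces the two lower bounds (capacity and port width) from the numbers in this post; it does not model what the compiler actually does, and it says nothing about the extra overhead that brings the final count to 64. The names are made up for illustration, and the block size is an assumption: "20 Kb" is taken as 20 * 1024 = 20480 bits, which gives 29 blocks for capacity, while taking it as 20000 bits gives 30.

```python
# Lower bounds on Block RAM count, using the figures quoted in this thread.
# All names here are hypothetical; this is not a model of the compiler.

BUFFER_DEPTH = 576         # number of 1024-bit words in the buffer
WORD_BITS = 64 * 16        # 64 values of 16 bits read per clock = 1024 bits
BLOCK_BITS = 20 * 1024     # ASSUMPTION: one "20 Kb" block holds 20480 bits
PORT_BITS = 32             # per-port width stated in the post


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def blocks_for_capacity(depth: int, word_bits: int, block_bits: int) -> int:
    """Blocks needed just to hold depth * word_bits bits."""
    return ceil_div(depth * word_bits, block_bits)


def blocks_for_width(word_bits: int, port_bits: int) -> int:
    """Blocks needed so the read ports together supply word_bits per clock."""
    return ceil_div(word_bits, port_bits)


capacity_blocks = blocks_for_capacity(BUFFER_DEPTH, WORD_BITS, BLOCK_BITS)
width_blocks = blocks_for_width(WORD_BITS, PORT_BITS)

# Both constraints must hold, so the larger one is the lower bound.
lower_bound = max(capacity_blocks, width_blocks)

print("total bits:", BUFFER_DEPTH * WORD_BITS)
print("blocks for capacity:", capacity_blocks)
print("blocks for port width:", width_blocks)
print("lower bound:", lower_bound)
```

Under these assumptions the port-width constraint (32 blocks) dominates the capacity constraint, so a single RAM block cannot serve a 1024-bit read per clock.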