So far I can imagine the input side:
1) You receive your input bits and store the first 256 bits in logic or memory (no problem). Call this input frame #1.
2) You now do two things at once: process frame #1 above while receiving the next 256 bits into another location (no problem). Call this input frame #2.
You need 192 clks (at the input rate, I presume) to finish processing input frame #1 and end up with 3 internal frames (256 x 3). This should be no problem if input frame #2 takes 256 clks to arrive completely (i.e. if you are receiving it serially); otherwise you obviously have to process faster than the input rate.
3) You then convert every 4 bits of the 768 bits (???) to an IQ pair, add noise and get 4 probability values for each pair from somewhere... after that it is not clear to me. It may help if you describe your work sequentially in short bullets.
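To put numbers on the timing argument in step 2, here is a quick sanity check in Python. It is only a sketch under my own assumptions: the function name and the `bits_per_clk` parameter are made up for illustration, and I am assuming the 192 clks of processing and the input arrival are counted on the same clock.

```python
def processing_fits(frame_bits, proc_clks, bits_per_clk=1):
    """Check whether processing one frame finishes before the next arrives.

    frame_bits   : bits per input frame (256 in this thread)
    proc_clks    : clocks needed to process one frame (192 in this thread)
    bits_per_clk : input bits received per clock (1 = serial; an assumption)

    Returns (fits, arrival_clks, speedup), where speedup is how much faster
    than the input clock the processing must run (values <= 1 mean the
    input clock is already fast enough).
    """
    # Clocks for a full frame to arrive, rounded up.
    arrival_clks = -(-frame_bits // bits_per_clk)
    fits = proc_clks <= arrival_clks
    speedup = proc_clks / arrival_clks
    return fits, arrival_clks, speedup


# Serial input: 256 clks to arrive, 192 clks to process, so it fits.
print(processing_fits(256, 192, bits_per_clk=1))   # (True, 256, 0.75)

# Hypothetical byte-wide input: the frame arrives in 32 clks, so the
# processing would have to run 6x faster than the input clock.
print(processing_fits(256, 192, bits_per_clk=8))   # (False, 32, 6.0)
```

With serial input you have 64 clks of slack per frame; with any wider input bus the two-buffer scheme only works if the processing clock is raised accordingly.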
Is it one matrix (size 8x256x32) that you need, or two? Also, why not reduce the probability width from 32 bits to something smaller? You might just be wasting precision internally.
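To show what is at stake with the word width, here is a rough storage estimate in Python. Again this is just a sketch: the 8-bit alternative is a number I picked for illustration, not a recommendation, and whether you need one copy of the matrix or two is exactly the open question above.

```python
def matrix_bits(rows, cols, width_bits, copies=1):
    """Total storage in bits for `copies` matrices of rows x cols words."""
    return rows * cols * width_bits * copies


# One 8 x 256 matrix of 32-bit probabilities.
print(matrix_bits(8, 256, 32))            # 65536 bits

# Two such matrices, if a second copy turns out to be needed.
print(matrix_bits(8, 256, 32, copies=2))  # 131072 bits

# Same matrix with a hypothetical 8-bit probability word.
print(matrix_bits(8, 256, 8))             # 16384 bits
```

Going from 32 bits to 8 bits cuts the memory to a quarter, so it is worth checking how much precision your results actually need before committing to 32.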