You need one low-pass filter per branch (I and Q) to remove the sum term at f1+f2, leaving the difference term (f1-f2) at or near DC. Any LPF will do (including an RRC if you wish).
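As a rough illustration of the point above, here is a minimal NumPy sketch. The sample rate, the two frequencies, the tap count, the cutoff and all variable names (bb_i, bb_q, lowpass_taps, peak_freq) are assumptions made up for the example, not taken from your code. It mixes a tone at f1 with a local oscillator at f2 and low-pass filters each branch, so only the f1-f2 term should survive.

```python
import numpy as np

fs = 48_000.0   # sample rate in Hz (assumed for this sketch)
f1 = 6_000.0    # incoming carrier frequency (assumed)
f2 = 5_900.0    # local NCO frequency (assumed)
t = np.arange(4800) / fs

rx = np.cos(2 * np.pi * f1 * t)

# Mixing produces a difference term at f1-f2 and a sum term at f1+f2.
mixed_i = rx * np.cos(2 * np.pi * f2 * t)
mixed_q = rx * -np.sin(2 * np.pi * f2 * t)

def lowpass_taps(num_taps, cutoff_hz, fs):
    """Windowed-sinc FIR low-pass, normalised to unity gain at DC."""
    m = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * cutoff_hz / fs * m) * np.hamming(num_taps)
    return h / h.sum()

# One identical filter per branch; the cutoff sits well below f1+f2.
taps = lowpass_taps(101, 1_000.0, fs)
bb_i = np.convolve(mixed_i, taps, mode="same")
bb_q = np.convolve(mixed_q, taps, mode="same")

def peak_freq(x):
    """Frequency (Hz) of the strongest spectral line in x."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    return np.fft.rfftfreq(len(x), 1 / fs)[np.argmax(spec)]

print("dominant frequency after LPF (I):", peak_freq(bb_i))
print("dominant frequency after LPF (Q):", peak_freq(bb_q))
```

Any other low-pass design could replace the windowed sinc here, as long as both branches use the same filter so their delays match.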
The loop itself needs an error detector and a loop filter to smooth the error. The loop filter can be an IIR with variable gain.
The IIR filter represents the integral part of your loop control system.
You may need further control terms, such as a proportional term.
The filtered error is then used to shift the phase (or frequency) of the NCO in the opposite sense.
It takes some time for the loop to lock. Observe the error: it should converge and then settle with low jitter.
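To make the loop description above concrete, here is a minimal sketch of a second-order loop with a proportional term plus an integrator, run on a noiseless, unmodulated complex tone. Everything in it is an assumption for illustration: the function name run_loop, the gains kp and ki, the test frequency offset, and the simple angle-based error detector (a modulated signal would need a detector suited to its modulation). It is not your loop, just one way the pieces can fit together.

```python
import numpy as np

def run_loop(rx, kp=0.05, ki=0.002):
    """Track the phase of an unmodulated complex tone (illustrative only).

    kp and ki are assumed proportional and integral gains.
    Returns the per-sample phase error and the final integrator value,
    which estimates the frequency offset in rad/sample.
    """
    phase = 0.0   # NCO phase in radians
    integ = 0.0   # integrator state of the loop filter
    errors = np.empty(len(rx))
    for k, x in enumerate(rx):
        # Rotate the input by the NCO phase in the opposite sense.
        y = x * np.exp(-1j * phase)
        # Error detector: residual phase after derotation.
        err = np.angle(y)
        # Loop filter: integral path accumulates, proportional path is direct.
        integ += ki * err
        # NCO update: advance the phase by the filtered error.
        phase += kp * err + integ
        errors[k] = err
    return errors, integ

n = np.arange(2000)
w0, theta0 = 0.02, 1.0   # assumed frequency offset (rad/sample) and initial phase (rad)
rx = np.exp(1j * (w0 * n + theta0))

errors, freq_est = run_loop(rx)
print("max |error| over first 5 samples:", np.abs(errors[:5]).max())
print("max |error| over last 5 samples: ", np.abs(errors[-5:]).max())
print("frequency estimate (rad/sample): ", freq_est)
```

Plotting errors against the sample index is the quickest way to watch the loop pull in. Larger gains lock faster but pass more noise to the NCO, so on a noisy signal the error settles with more jitter.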
I notice your bb_signal_I has plenty of zeros in it, but the Q branch does not.
I wouldn't advise doing any decimation inside the loop, as it can get complicated.