My problem with the performance counter block is that it is a 16bit slave - so it takes a lot of accesses to do anything.
A 32bit version would be much more friendly.
The lack of 'add with carry' could well be a problem, but one of the cmp instructions would do the job.
So code like:
sum = a + b + carry;
carry = sum < a;
is amost right (needs an extra check for b[i] + carry == 0).
So 5 instructions (+ memory access) including 1 branch that can be staticly predicted 'not taken'.
With care and a combinatorial custom instruction (or 2) could be 3 instruction (I think).