Without seeing the algorithm I can't say for certain that __private memory is causing this, but it's a possibility. Your method of debugging the issue is sound; just keep in mind that if you break your kernel up into pieces, the footprints of those pieces will not necessarily add up to the footprint of the kernel as a whole.
Are you using a fixed work-group size, or do you know how large your work-group size will need to be? The compiler assumes a work-group size of 256, so if you don't need one that large, or you know it's going to be a fixed size, you can specify kernel attributes to let the compiler know. With hints like these the compiler will often create smaller hardware.
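
As a rough sketch of what such a hint can look like, the standard OpenCL attribute reqd_work_group_size fixes the work-group size at compile time. The kernel name, its arguments and the size of 64 below are made up for illustration and are not taken from your code. Some FPGA toolchains also offer a max_work_group_size attribute for an upper bound instead of an exact size; check your SDK's programming guide for its exact form.

```c
// Hypothetical kernel: the name, arguments and the size 64 are illustrative only.
// reqd_work_group_size(X, Y, Z) tells the compiler that every launch of this
// kernel uses a work-group of exactly X*Y*Z work-items (here 64 x 1 x 1).
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void scale(__global float *data, const float factor)
{
    // One work-item handles one element of the buffer.
    const size_t gid = get_global_id(0);
    data[gid] *= factor;
}
```

If you use this attribute, the local work size you pass to clEnqueueNDRangeKernel on the host has to match it exactly (64, 1, 1 in this sketch), otherwise the enqueue fails with CL_INVALID_WORK_GROUP_SIZE.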