1) I don't typically choose my memory type based on resources. Private memory is for variables that only a single work-item has visibility into; local memory is for variables that work-items within the same work-group have access to. If my algorithm benefits from work-items sharing data, I use local memory; otherwise I use private memory, and I size my work-group accordingly to trade off resources against compute efficiency. In general, private memory tends to have more bandwidth than local memory because each work-item has its own copy, but there are always exceptions to these rules depending on the algorithm.
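To make the distinction concrete, here is a hedged sketch (these kernels and names are my own illustration, not from your code): a per-work-item variable lives in private memory automatically, while a `__local` buffer is shared by the work-group and needs a barrier before work-items can read each other's writes.

```c
// Hypothetical kernels for illustration only.

// Private memory: each work-item keeps its own accumulator; no sharing needed.
__kernel void scale(__global const float *in, __global float *out, float k) {
    size_t gid = get_global_id(0);
    float acc = in[gid] * k;   // 'acc' lives in private memory
    out[gid] = acc;
}

// Local memory: work-items in the same work-group share a tile and must
// synchronize before reading values written by their neighbors.
__kernel void group_sum(__global const float *in, __global float *out,
                        __local float *tile) {
    size_t lid = get_local_id(0);
    tile[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);            // make all writes visible

    // Tree reduction within the work-group (assumes power-of-two local size).
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            tile[lid] += tile[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = tile[0];      // one result per work-group
}
```

If your algorithm looks like the first kernel, private memory is the natural fit; only something like the second, where work-items genuinely consume each other's data, pays for the local buffer and barriers.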
2) I can't really answer that since it's algorithm-dependent. I wouldn't worry too much about it; just focus on making the smaller pieces even smaller, since that should have the same impact on the overall kernel.
3) It's hard to explain, but there are times when this benefits both area and processing speed. It's algorithm-specific, and since I haven't seen the kernel it's difficult for me to explain why the utilization remained the same. By specifying a work-group size of (1,1,1) you are saying each work-group has only one work-item. Typically you only do this when there is no need to share data or resources within a work-group.
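For reference, pinning the work-group size is done with the `reqd_work_group_size` kernel attribute; this sketch is my own example (not your kernel) of the case where (1,1,1) is appropriate because work-items are fully independent:

```c
// Hypothetical example: each work-item is its own work-group, so there is
// no local memory, no barriers, and nothing shared between work-items.
__attribute__((reqd_work_group_size(1, 1, 1)))
__kernel void copy(__global const float *in, __global float *out) {
    size_t gid = get_global_id(0);
    out[gid] = in[gid];   // fully independent per work-item
}
```

If a kernel like this ever grows to need a `__local` buffer or a `barrier()`, the (1,1,1) attribute has to go, since a one-work-item group has nothing to share or synchronize with.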
If you are able to share your code through a service request, I think the quality of the answers will improve; it's incredibly difficult to make suggestions like these without seeing the kernel, or at a minimum code fragments. There is no general rule of thumb for optimizations that fits all kernels, so how you optimize a kernel is very much an algorithm-specific thing (regardless of what hardware you use OpenCL on).