#1 Yes, it's telling you how many work-items per second are estimated to progress through the compute unit(s). This number is much more meaningful than FLOPs or other benchmarks because, at the end of the day, you are offloading an NDRange's worth of work for the hardware to perform, and work-items/s tells you how fast that work gets done.
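As a rough illustration of why work-items/s is the useful number, here is a small sketch that turns it into a kernel runtime estimate. The function name and the figures in the usage note are made up for illustration, and the estimate assumes the kernel sustains the reported throughput and ignores launch overhead and host transfers.

```c
#include <assert.h>

/* Hypothetical helper: estimated seconds for the kernel to work through an
 * NDRange, given the throughput figure from the estimator report.
 * Assumes the reported rate is sustained for the whole run. */
double estimated_seconds(double global_work_items, double work_items_per_second)
{
    assert(work_items_per_second > 0.0);
    return global_work_items / work_items_per_second;
}
```

For example, an NDRange of 480 million work-items at an estimated 240 million work-items/s would take about 2 seconds of kernel time.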
#2 This kernel is derated because it needs more global memory bandwidth than the Nallatech A7 board can supply. You may be able to reduce the derating, or eliminate it altogether, through kernel optimizations. The most common approach is to look for data read from global memory that can be temporarily stored in local memory and reused instead of being reloaded over and over. By buffering global memory contents in local memory (commonly referred to as a scratch pad), you reduce the memory bandwidth your kernel requires, and if you can get it below the maximum bandwidth of the board's SDRAM you shouldn't see a derating factor.
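To make the scratch pad idea concrete, here is a minimal sketch in plain C that models it on the host rather than as a real kernel. Everything in it is an assumption for illustration: the sliding-window sum, GROUP_SIZE, RADIUS, and the read_global counter are not from your kernel. It just shows how loading a tile once per work-group cuts the number of global reads. In an actual OpenCL kernel the scratch array would be a __local buffer, and a barrier(CLK_LOCAL_MEM_FENCE) would separate the load phase from the compute phase.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 64  /* hypothetical work-group size */
#define RADIUS 2       /* hypothetical window radius */

/* Counts modeled global-memory reads so the two versions can be compared. */
static size_t global_reads;

/* Models one read from global memory, clamping the index at the edges. */
static float read_global(const float *global, size_t n, long i)
{
    global_reads++;
    if (i < 0) i = 0;
    if (i >= (long)n) i = (long)n - 1;
    return global[i];
}

/* Naive version: every work-item fetches its whole window from global memory. */
void window_sum_naive(const float *in, float *out, size_t n)
{
    for (size_t gid = 0; gid < n; gid++) {
        float sum = 0.0f;
        for (long k = -RADIUS; k <= RADIUS; k++)
            sum += read_global(in, n, (long)gid + k);
        out[gid] = sum;
    }
}

/* Buffered version: each work-group copies its tile plus a halo into a
 * scratch pad once, then every work-item reads only from the scratch pad. */
void window_sum_buffered(const float *in, float *out, size_t n)
{
    float scratch[GROUP_SIZE + 2 * RADIUS]; /* stands in for __local memory */

    assert(n % GROUP_SIZE == 0);
    for (size_t base = 0; base < n; base += GROUP_SIZE) {
        /* Load phase: one global read per scratch pad element. */
        for (long j = 0; j < GROUP_SIZE + 2 * RADIUS; j++)
            scratch[j] = read_global(in, n, (long)base + j - RADIUS);
        /* Compute phase: no global reads at all. */
        for (size_t lid = 0; lid < GROUP_SIZE; lid++) {
            float sum = 0.0f;
            for (size_t k = 0; k <= 2 * RADIUS; k++)
                sum += scratch[lid + k];
            out[base + lid] = sum;
        }
    }
}
```

With these made-up sizes, the naive version issues 2*RADIUS+1 = 5 global reads per work-item, while the buffered version issues GROUP_SIZE + 2*RADIUS = 68 per work-group of 64, so just over one per work-item.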
Sometimes you'll see a derating factor due to local memories as well. This happens either when the kernel hardware performs more local memory accesses than the memory can keep up with (rare, because FPGAs have a lot of local memory bandwidth), or when the local memory interconnect becomes complex enough that the kernel operating frequency has to drop for the hardware to meet timing (the compiler handles this automatically).
I'm not sure what type of sorting you are doing, but it might be possible to break the problem down into a bunch of smaller sorts at work-group granularity, where you perform the "sub-sorts" in local memory instead of accessing global memory (SDRAM) directly. Given the report above, it doesn't look like you are using local memory, since it reports that no local memory banks are in use.
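Since I don't know your algorithm, the following is only a sketch of the "sub-sort" idea, written as plain C that models it on the host. TILE, the insertion sort, and the function names are all hypothetical. Each pass of the outer loop stands in for one work-group: it copies its slice of the data into a scratch pad, sorts it there, and writes it back, so global memory sees one read and one write per element no matter how many compare/swap steps the sort takes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 8 /* hypothetical number of elements handled per work-group */

/* Simple insertion sort; all of its traffic stays inside the scratch pad. */
static void sort_tile(int *tile, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = tile[i];
        size_t j = i;
        while (j > 0 && tile[j - 1] > key) {
            tile[j] = tile[j - 1];
            j--;
        }
        tile[j] = key;
    }
}

/* Sorts each TILE-sized slice of data independently, in ascending order. */
void sort_in_tiles(int *data, size_t n)
{
    int scratch[TILE]; /* stands in for __local memory */

    assert(n % TILE == 0);
    for (size_t base = 0; base < n; base += TILE) {
        memcpy(scratch, data + base, sizeof scratch); /* one global read per element */
        sort_tile(scratch, TILE);
        memcpy(data + base, scratch, sizeof scratch); /* one global write per element */
    }
}
```

Note that this only leaves each tile sorted on its own. You would still need a merge step (or a different decomposition) to get a fully sorted result.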