The code looks good. I still don't know which clock domains your input/output signals (start, count, out, etc.) are in, but I'm assuming they're all synchronous to the same clock, except perhaps start, which is double-registered and so is probably fine even if it's asynchronous.
Note that your failures don't look timing-related. Timing issues (like the hold issues you had earlier) tend to just miss the window they're targeting. If you're supposed to count to 3 but count to 7 or 8, i.e. you're off by 4 or 5 counts (40-50 ns), then it's not timing on those paths. And if everything is in the same clock domain, I really don't think it's a timing issue.
You can run a simulation to check your logic. I did a really quick one and it counted to 3 and then stopped, so it looked right. You could also try SignalTap, which lets you look at the logic "live" while the device is running and see what is physically happening.
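
If you don't have a simulator set up, the kind of check I mean can be roughed out as a cycle-by-cycle model. This is only a sketch in Python, not your HDL and not the testbench I ran: the function name simulate, the rising-edge detect on start, and the terminal count of 3 are all my assumptions about how a counter like this would typically be built. Each loop iteration stands for one rising clock edge, and next-state values are computed from the current registers before anything is updated, the same way non-blocking assignments behave.

```python
def simulate(start_samples, terminal=3):
    """Hypothetical cycle-based model of a start-triggered counter.

    start_samples: value of the (possibly asynchronous) start input
                   sampled at each rising clock edge.
    terminal:      assumed value at which the counter stops.
    Returns the count register value after each clock edge.
    """
    sync1 = sync2 = sync2_prev = 0  # two-flop synchronizer + edge-detect register
    running = False
    count = 0
    trace = []

    for start in start_samples:
        # Combinational logic: look only at current register values.
        rising = (sync2 == 1 and sync2_prev == 0)
        next_count, next_running = count, running
        if rising and not running:
            next_count, next_running = 0, True   # begin a new count
        elif running:
            if count == terminal:
                next_running = False             # stop and hold at terminal
            else:
                next_count = count + 1

        # Clock edge: all registers update together.
        sync2_prev = sync2
        sync2 = sync1
        sync1 = start
        count, running = next_count, next_running
        trace.append(count)

    return trace


# Pulse start for two clocks, then let the design run for a while.
trace = simulate([0, 1, 1] + [0] * 9)
print(trace)
assert max(trace) == 3, "counter overshot the terminal count"
```

If a model like this stops at 3 but the hardware runs on to 7 or 8, that points at a logic or stimulus difference (for example, start being seen more than once) rather than at timing, and SignalTap is the quickest way to see which.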