Long Delays with 300 Hz Linux Timer
Hello,
Here's an odd problem that we would like to understand better:
We have occasionally been experiencing extremely long unexplained thread
delays in our application. The delays are about 72 seconds. I don't know
exactly what's happening, but I've written a test program to try to isolate
the problem. What I've determined so far is that the problem occurs when
the kernel is built with a 300 Hz timer, but apparently does not occur with
a 100 Hz (10 ms) timer.
We originally switched to the 300 Hz timer in an attempt to get better
granularity on "usleep" calls (e.g., with a 100 Hz timer a call like
"usleep(1000)", which should nominally delay for 1 millisecond, actually
sleeps for about 10 milliseconds).
When using the 300 Hz timer the 72-second delays are unpredictable and not
specific to any particular part of the code or to any particular thread.
The problem may be somehow related to I/O using "printf", but that's not clear.
The test program creates 20 threads. Each thread just repeats a sequence of
"printf" and "usleep" calls in a loop, and keeps track of the elapsed time
each time through the loop.
Here' an example of the timing data from a recent 3-day run with the 300 Hz
timer:
Thread Loop Minimum Time Average Time Maximum Time Total Time Start Time Stop Time
Number Count (milliseconds) (milliseconds) (milliseconds) (seconds) (HH:MM:SS) (HH:MM:SS)
------ ---------- -------------- -------------- -------------- ---------- ---------- ----------
1 518400 26.663 524.256 72139.948 271774.210 06:57:49 10:27:59
2 518400 26.666 524.392 1151.931 271844.867 06:57:49 10:29:12
3 518400 26.661 524.683 1154.496 271995.478 06:57:49 10:31:42
4 518400 27.253 524.924 72099.242 272120.595 06:57:49 10:33:47
5 518400 26.653 524.685 1154.547 271996.836 06:57:49 10:31:45
6 518400 26.668 524.565 72086.599 271934.680 06:57:49 10:30:42
7 518400 27.403 524.874 1120.679 272094.928 06:57:49 10:33:22
8 518400 29.938 524.968 1219.240 272143.434 06:57:49 10:34:10
9 518400 30.534 524.759 1198.509 272035.156 06:57:49 10:32:22
10 518400 27.342 525.030 1152.042 272175.764 06:57:49 10:34:43
11 518400 26.691 525.042 72549.944 272181.803 06:57:49 10:34:49
12 518400 28.400 525.097 1146.683 272210.231 06:57:49 10:35:17
13 518400 29.987 524.980 1150.365 272149.708 06:57:49 10:34:17
14 518400 27.798 525.036 1150.462 272178.832 06:57:49 10:34:46
15 518400 30.774 525.026 72070.929 272173.279 06:57:49 10:34:41
16 518400 30.786 525.025 1146.696 272172.777 06:57:49 10:34:40
17 518400 29.714 525.095 1148.486 272209.251 06:57:49 10:35:16
18 518400 27.867 525.021 1151.910 272171.144 06:57:49 10:34:38
19 518400 30.881 525.061 1152.283 272191.771 06:57:49 10:34:59
20 518400 27.827 525.066 1147.158 272194.057 06:57:49 10:35:01
(Sorry the columns didn't line up properly after I pasted the data into the topic window.)
Anyway, notice that 5 of the threads (1, 4, 6, 11, and 15) show maximum
times a little over 72000 milliseconds (roughly 72 seconds). However, for
all threads the average time is about 0.5 seconds per loop iteration, and
for the threads that did not experience the anomalous 72-second interruption,
the worst case time is about 1.2 seconds.
All of the threads are executing the same code.
Although the table above doesn't show it, there were actually a total of 8
72-second delays in the 3-day test run (a few of the threads saw multiple
72-second delays) -- so, as you can see, this doesn't happen very often.
Running the same test with the 100 Hz timer does not show any problems.
For example, here's the data from a 16-hour run:
Thread Loop Minimum Time Average Time Maximum Time Total Time Start Time Stop Time
Number Count (milliseconds) (milliseconds) (milliseconds) (seconds) (HH:MM:SS) (HH:MM:SS)
------ ---------- -------------- -------------- -------------- ---------- ---------- ----------
1 100000 28.666 537.322 1144.769 53732.186 06:31:41 21:27:21
2 100000 28.653 538.288 1180.917 53828.823 06:31:41 21:28:57
3 100000 29.482 540.014 1181.092 54001.416 06:31:41 21:31:50
4 100000 21.679 540.361 1186.108 54036.148 06:31:41 21:32:24
5 100000 28.949 540.056 1184.370 54005.621 06:31:41 21:31:54
6 100000 21.394 540.365 1224.884 54036.465 06:31:41 21:32:25
7 100000 29.413 540.219 1179.373 54021.871 06:31:41 21:32:10
8 100000 28.979 540.397 1175.199 54039.686 06:31:41 21:32:28
9 100000 28.941 540.313 1146.820 54031.288 06:31:41 21:32:19
10 100000 29.876 540.199 1144.930 54019.860 06:31:41 21:32:08
11 100000 29.090 540.241 1185.743 54024.079 06:31:41 21:32:12
12 100000 29.567 540.149 1148.273 54014.862 06:31:41 21:32:03
13 100000 29.531 539.851 1180.193 53985.051 06:31:41 21:31:33
14 100000 29.876 539.879 1178.381 53987.865 06:31:41 21:31:36
15 100000 29.100 539.447 1178.365 53944.686 06:31:41 21:30:53
16 100000 28.992 539.600 1189.743 53959.999 06:31:41 21:31:08
17 100000 29.485 540.172 1147.875 54017.214 06:31:41 21:32:06
18 100000 29.136 539.647 1180.218 53964.660 06:31:41 21:31:13
19 100000 28.906 539.481 1174.130 53948.052 06:31:41 21:30:56
20 100000 29.540 539.752 1180.538 53975.156 06:31:41 21:31:24
All of the tests were done using round robin scheduling (SCHED_RR), but I've
verified that the symptoms are more or less the same when using SCHED_OTHER
(i.e., there are occasional 72-second delays with the 300 Hz timer and
SCHED_OTHER).
A few other facts:
(1) The problem is not related to having 20 threads. Our main
application has 10 threads, and it exhibits the problem.
(2) I'm not positive but I suspect the problem is not related to
any issues with stdio functions either being or not being
thread-safe. The reason I say that is that at one point I put
a mutex lock around "printf" calls and still saw the problem.
(3) I know that the problem is not somehow related to serial and/or
socket I/O specifically. I.e., we've seen the problem when
connected via ethernet, and also when the I/O ("printf" calls)
is directed to a UART.
(4) I have not been able to make the problem occur when the threads
are executing floating-point operations (sin, cos, sqrt calls)
instead of printf calls.
I'd be glad to post or make available the test program source code if anyone
is interested, but I haven't included it here to keep this post to a somewhat
reasonable length.
The question I have is: Does anyone understand what's
causing these huge delays. It's clearly not something we can live with
in an electronic instrument. When a user pushes a button, it's not really
acceptable to respond to the button push 72 seconds later, but that's exactly
what was happening with our instrument!
During these delays the linux kernel appears to be alive and well. E.g.,
we can connect with telnet and do normal stuff like "ps", "ls", etc.
The symptoms are as if the linux scheduler has simply decided not to let a
certain thread run for 72 seconds, but we have no idea why that would be the case.
We can live with the 100 Hz scheduler timer, but it doesn't seem all that
unusual to have a multi-threaded embedded application, and want better than
10 millisecond granularity on "usleep" calls, so I'm a bit puzzled that
the problem hasn't been discussed before (at least I haven't noticed it in
the forum).
Is anyone else building kernels using the 300 Hz timer and running multi-threaded applications?
Thanks for any help.
-- Matt