Long Delays with 300 Hz Linux Timer

Question

Hello,

Here's an odd problem that we would like to understand better:

We have occasionally been experiencing extremely long, unexplained thread delays in our application. The delays are about 72 seconds. I don't know exactly what's happening, but I've written a test program to try to isolate the problem. What I've determined so far is that the problem occurs when the kernel is built with a 300 Hz timer, but apparently does not occur with a 100 Hz (10 ms) timer.

We originally switched to the 300 Hz timer in an attempt to get better granularity on "usleep" calls (e.g., with a 100 Hz timer a call like "usleep(1000)", which should nominally delay for 1 millisecond, actually sleeps for about 10 milliseconds).

When using the 300 Hz timer the 72-second delays are unpredictable and not specific to any particular part of the code or to any particular thread. The problem may be somehow related to I/O using "printf", but that's not clear.

The test program creates 20 threads.
Each thread just repeats a sequence of "printf" and "usleep" calls in a loop, and keeps track of the elapsed time each time through the loop.

Here's an example of the timing data from a recent 3-day run with the 300 Hz timer:

Thread  Loop        Minimum Time    Average Time    Maximum Time    Total Time  Start Time  Stop Time
Number  Count       (milliseconds)  (milliseconds)  (milliseconds)  (seconds)   (HH:MM:SS)  (HH:MM:SS)
------  ----------  --------------  --------------  --------------  ----------  ----------  ----------
1           518400          26.663         524.256       72139.948  271774.210    06:57:49    10:27:59
2           518400          26.666         524.392        1151.931  271844.867    06:57:49    10:29:12
3           518400          26.661         524.683        1154.496  271995.478    06:57:49    10:31:42
4           518400          27.253         524.924       72099.242  272120.595    06:57:49    10:33:47
5           518400          26.653         524.685        1154.547  271996.836    06:57:49    10:31:45
6           518400          26.668         524.565       72086.599  271934.680    06:57:49    10:30:42
7           518400          27.403         524.874        1120.679  272094.928    06:57:49    10:33:22
8           518400          29.938         524.968        1219.240  272143.434    06:57:49    10:34:10
9           518400          30.534         524.759        1198.509  272035.156    06:57:49    10:32:22
10          518400          27.342         525.030        1152.042  272175.764    06:57:49    10:34:43
11          518400          26.691         525.042       72549.944  272181.803    06:57:49    10:34:49
12          518400          28.400         525.097        1146.683  272210.231    06:57:49    10:35:17
13          518400          29.987         524.980        1150.365  272149.708    06:57:49    10:34:17
14          518400          27.798         525.036        1150.462  272178.832    06:57:49    10:34:46
15          518400          30.774         525.026       72070.929  272173.279    06:57:49    10:34:41
16          518400          30.786         525.025        1146.696  272172.777    06:57:49    10:34:40
17          518400          29.714         525.095        1148.486  272209.251    06:57:49    10:35:16
18          518400          27.867         525.021        1151.910  272171.144    06:57:49    10:34:38
19          518400          30.881         525.061        1152.283  272191.771    06:57:49    10:34:59
20          518400          27.827         525.066        1147.158  272194.057    06:57:49    10:35:01

Anyway, notice that 5 of the threads (1, 4, 6, 11, and 15) show maximum times a little over 72000 milliseconds (roughly 72 seconds).
However, for all threads the average time is about 0.5 seconds per loop iteration, and for the threads that did not experience the anomalous 72-second interruption, the worst case time is about 1.2 seconds.

All of the threads are executing the same code.

Although the table above doesn't show it, there were actually a total of eight 72-second delays in the 3-day test run (a few of the threads saw multiple 72-second delays) -- so, as you can see, this doesn't happen very often.

Running the same test with the 100 Hz timer does not show any problems. For example, here's the data from a 16-hour run:

Thread  Loop        Minimum Time    Average Time    Maximum Time    Total Time  Start Time  Stop Time
Number  Count       (milliseconds)  (milliseconds)  (milliseconds)  (seconds)   (HH:MM:SS)  (HH:MM:SS)
------  ----------  --------------  --------------  --------------  ----------  ----------  ----------
1           100000          28.666         537.322        1144.769   53732.186    06:31:41    21:27:21
2           100000          28.653         538.288        1180.917   53828.823    06:31:41    21:28:57
3           100000          29.482         540.014        1181.092   54001.416    06:31:41    21:31:50
4           100000          21.679         540.361        1186.108   54036.148    06:31:41    21:32:24
5           100000          28.949         540.056        1184.370   54005.621    06:31:41    21:31:54
6           100000          21.394         540.365        1224.884   54036.465    06:31:41    21:32:25
7           100000          29.413         540.219        1179.373   54021.871    06:31:41    21:32:10
8           100000          28.979         540.397        1175.199   54039.686    06:31:41    21:32:28
9           100000          28.941         540.313        1146.820   54031.288    06:31:41    21:32:19
10          100000          29.876         540.199        1144.930   54019.860    06:31:41    21:32:08
11          100000          29.090         540.241        1185.743   54024.079    06:31:41    21:32:12
12          100000          29.567         540.149        1148.273   54014.862    06:31:41    21:32:03
13          100000          29.531         539.851        1180.193   53985.051    06:31:41    21:31:33
14          100000          29.876         539.879        1178.381   53987.865    06:31:41    21:31:36
15          100000          29.100         539.447        1178.365   53944.686    06:31:41    21:30:53
16          100000          28.992         539.600        1189.743   53959.999    06:31:41    21:31:08
17          100000          29.485         540.172        1147.875   54017.214    06:31:41    21:32:06
18          100000          29.136         539.647        1180.218   53964.660    06:31:41    21:31:13
19          100000          28.906         539.481        1174.130   53948.052    06:31:41    21:30:56
20          100000          29.540         539.752        1180.538   53975.156    06:31:41    21:31:24

All of the tests were done using round robin scheduling (SCHED_RR), but I've verified that the symptoms are more or less the same when using SCHED_OTHER (i.e., there are occasional 72-second delays with the 300 Hz timer and SCHED_OTHER).

A few other facts:

(1) The problem is not related to having 20 threads. Our main application has 10 threads, and it exhibits the problem.

(2) I'm not positive, but I suspect the problem is not related to any issues with stdio functions either being or not being thread-safe. The reason I say that is that at one point I put a mutex lock around "printf" calls and still saw the problem.

(3) I know that the problem is not somehow related to serial and/or socket I/O specifically. I.e., we've seen the problem when connected via ethernet, and also when the I/O ("printf" calls) is directed to a UART.

(4) I have not been able to make the problem occur when the threads are executing floating-point operations (sin, cos, sqrt calls) instead of printf calls.

I'd be glad to post or make available the test program source code if anyone is interested, but I haven't included it here to keep this post to a somewhat reasonable length.

The question I have is: Does anyone understand what's causing these huge delays? It's clearly not something we can live with in an electronic instrument. When a user pushes a button, it's not really acceptable to respond to the button push 72 seconds later, but that's exactly what was happening with our instrument!

During these delays the Linux kernel appears to be alive and well.
E.g., we can connect with telnet and do normal stuff like "ps", "ls", etc. The symptoms are as if the Linux scheduler has simply decided not to let a certain thread run for 72 seconds, but we have no idea why that would be the case.

We can live with the 100 Hz scheduler timer, but it doesn't seem all that unusual to have a multi-threaded embedded application and want better than 10 millisecond granularity on "usleep" calls, so I'm a bit puzzled that the problem hasn't been discussed before (at least I haven't noticed it in the forum).

Is anyone else building kernels using the 300 Hz timer and running multi-threaded applications?

Thanks for any help.

-- Matt
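
The test program source was not posted in this thread. As a rough illustration only, a program of the kind described (several threads, each looping over "printf" and "usleep" and recording the elapsed time per iteration) might look like the sketch below. The names `run_delay_test`, `loop_stats`, and `MAX_THREADS`, the 1000-microsecond sleep in the usage note, and the use of default scheduling instead of SCHED_RR are all assumptions, not details of the original program.

```c
#define _GNU_SOURCE             /* exposes usleep() under strict -std modes */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define MAX_THREADS 64          /* hypothetical upper bound for this sketch */

/* Per-thread statistics, mirroring the columns of the timing tables. */
struct loop_stats {
    int      thread_no;         /* 1-based thread number */
    long     loops;             /* iterations to run */
    unsigned sleep_us;          /* usleep() argument used in every iteration */
    double   min_ms, max_ms, total_ms;
};

/* Monotonic time in milliseconds; unaffected by wall-clock adjustments. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* One worker: printf + usleep in a loop, timing each pass through the loop. */
static void *worker(void *arg)
{
    struct loop_stats *s = arg;

    for (long i = 0; i < s->loops; i++) {
        double t0 = now_ms();
        printf("thread %d, loop %ld\n", s->thread_no, i);
        usleep(s->sleep_us);
        double dt = now_ms() - t0;

        if (i == 0 || dt < s->min_ms) s->min_ms = dt;
        if (i == 0 || dt > s->max_ms) s->max_ms = dt;
        s->total_ms += dt;
    }
    return NULL;
}

/* Runs the test with default (SCHED_OTHER) scheduling.
 * Returns the largest single-iteration time in ms, or -1.0 on error. */
double run_delay_test(int num_threads, long loops, unsigned sleep_us)
{
    struct loop_stats stats[MAX_THREADS];
    pthread_t tid[MAX_THREADS];
    double worst = 0.0;
    int started = 0;

    if (num_threads < 1 || num_threads > MAX_THREADS || loops < 1)
        return -1.0;

    for (; started < num_threads; started++) {
        stats[started] = (struct loop_stats){ .thread_no = started + 1,
                                              .loops = loops,
                                              .sleep_us = sleep_us };
        if (pthread_create(&tid[started], NULL, worker, &stats[started]) != 0)
            break;
    }
    /* Always join whatever was started before touching the results. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    if (started < num_threads)
        return -1.0;

    printf("%-6s  %10s  %14s  %14s  %14s\n",
           "Thread", "Loops", "Min (ms)", "Avg (ms)", "Max (ms)");
    for (int i = 0; i < num_threads; i++) {
        printf("%-6d  %10ld  %14.3f  %14.3f  %14.3f\n",
               stats[i].thread_no, stats[i].loops, stats[i].min_ms,
               stats[i].total_ms / stats[i].loops, stats[i].max_ms);
        if (stats[i].max_ms > worst)
            worst = stats[i].max_ms;
    }
    return worst;
}
```

Calling `run_delay_test(20, 518400, 1000)` from `main` would approximate the thread and loop counts of the 300 Hz run above; a return value in the tens of thousands of milliseconds would correspond to the anomaly described. Build with `-pthread`.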

altera_forum · Answer

QUOTE (Matt Nicholas @ Sep 11 2009, 02:04 AM)

--- Quote Start ---

We originally switched to the 300 Hz timer in an attempt to get better granularity on "usleep" calls

--- Quote End ---

I suppose it would be better to find a way to avoid the usleep() calls, as doing tight timing in a userland process is not a good idea. Maybe this can be done with an appropriate kernel driver or with a custom "hardware" IP core.

Do you use the new MMU-enabled distribution? It is based on gcc4, and its thread support is a lot better (working gdb, TLS (the "__thread" keyword), NPTL, ...). There is no FUTEX support yet, but we are working on that issue, and I won't start my threaded project before FUTEX is in place.

-Michael

altera_forum · Answer

Hello

I'm quite sensitive to this kind of problem because I'm experiencing a similar one.

My problem is that a threaded application does not work with a 1000 Hz timer but does work with a 100 Hz timer. To be precise, even with the 100 Hz timer the system seems less responsive than with older kernel versions (unchanged application source code, unchanged FPGA project).

Dear mdn, just to understand if we are using the same software, which distribution are you using?

mine: uClinux from nios2-linux-20090730.tar with 2.6.30 kernel (problems were detected also with the previous tar.gz with 2.6.28 kernel)
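
For reference, the sleep granularity implied by the timer rates mentioned in this thread (100, 300, and 1000 Hz) can be estimated with simple arithmetic. The sketch below assumes that sleeps are driven by the periodic kernel tick rather than by high-resolution timers, so a request is rounded up to a whole number of ticks; a real kernel may add up to one further tick. The function names are hypothetical.

```c
/* Tick period in microseconds for a timer running at hz ticks per second
 * (integer division, so 300 Hz gives 3333 rather than 3333.33). */
unsigned long tick_period_us(unsigned hz)
{
    return hz ? 1000000UL / hz : 0;
}

/* Shortest sleep a purely tick-driven timer can deliver for a request:
 * the request rounded up to a whole number of ticks.
 * Assumption: no high-resolution timers; returns 0 if hz is 0. */
unsigned long min_sleep_us(unsigned long requested_us, unsigned hz)
{
    unsigned long tick = tick_period_us(hz);
    if (tick == 0)
        return 0;
    unsigned long ticks = (requested_us + tick - 1) / tick;  /* round up */
    return ticks * tick;
}
```

Under these assumptions, `min_sleep_us(1000, 100)` gives 10000 (the roughly 10 ms reported above for "usleep(1000)" at 100 Hz), `min_sleep_us(1000, 300)` gives 3333, and `min_sleep_us(1000, 1000)` gives 1000.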

altera_forum · Answer

Futex is the top priority on my to-do list, but I am short of time at the moment.

gabrigob, could you give some sample code for a benchmark?

- Hippo

altera_forum · Answer

--- Quote Start ---

Futex is the top prio on my To Do list. But I am short of time at this moment.

--- Quote End ---

While, as you know, I am very interested in FUTEX and definitely eager to help, I must admit that (unlike in recent months) I will be extremely busy with completely unrelated work for the next few weeks, so I can only be of limited help right now :mad:.

-Michael

altera_forum · Answer

--- Quote Start ---

Futex is the top prio on my To Do list. But I am short of time at this moment.

gabrigob, could you give a sample code for benchmark?

- Hippo

--- Quote End ---

Thanks for your interest.

I'll be busy with other projects for a few days. Next week I'll post some benchmarks and some code so the test can be repeated on other systems. I may also repeat mdn's test, if time permits.

I should warn in advance that I cannot post the complete application, both because of the huge amount of code and because it contains third-party code that I am not authorized to share. I'll build a test bench with only the relevant parts.
