Forum Discussion

Altera_Forum's avatar
Altera_Forum
Icon for Honored Contributor rankHonored Contributor
16 years ago

Long Delays with 300 Hz Linux Timer

Hello,

Here's an odd problem that we would like to understand better:

We have occasionally been experiencing extremely long unexplained thread

delays in our application. The delays are about 72 seconds. I don't know

exactly what's happening, but I've written a test program to try to isolate

the problem. What I've determined so far is that the problem occurs when

the kernel is built with a 300 Hz timer, but apparently does not occur with

a 100 Hz (10 ms) timer.

We originally switched to the 300 Hz timer in an attempt to get better

granularity on "usleep" calls (e.g., with a 100 Hz timer a call like

"usleep(1000)", which should nominally delay for 1 millisecond, actually

sleeps for about 10 milliseconds).

When using the 300 Hz timer the 72-second delays are unpredictable and not

specific to any particular part of the code or to any particular thread.

The problem may be somehow related to I/O using "printf", but that's not clear.

The test program creates 20 threads. Each thread just repeats a sequence of

"printf" and "usleep" calls in a loop, and keeps track of the elapsed time

each time through the loop.

Here' an example of the timing data from a recent 3-day run with the 300 Hz

timer:

Thread Loop Minimum Time Average Time Maximum Time Total Time Start Time Stop Time

Number Count (milliseconds) (milliseconds) (milliseconds) (seconds) (HH:MM:SS) (HH:MM:SS)

------ ---------- -------------- -------------- -------------- ---------- ---------- ----------

1 518400 26.663 524.256 72139.948 271774.210 06:57:49 10:27:59

2 518400 26.666 524.392 1151.931 271844.867 06:57:49 10:29:12

3 518400 26.661 524.683 1154.496 271995.478 06:57:49 10:31:42

4 518400 27.253 524.924 72099.242 272120.595 06:57:49 10:33:47

5 518400 26.653 524.685 1154.547 271996.836 06:57:49 10:31:45

6 518400 26.668 524.565 72086.599 271934.680 06:57:49 10:30:42

7 518400 27.403 524.874 1120.679 272094.928 06:57:49 10:33:22

8 518400 29.938 524.968 1219.240 272143.434 06:57:49 10:34:10

9 518400 30.534 524.759 1198.509 272035.156 06:57:49 10:32:22

10 518400 27.342 525.030 1152.042 272175.764 06:57:49 10:34:43

11 518400 26.691 525.042 72549.944 272181.803 06:57:49 10:34:49

12 518400 28.400 525.097 1146.683 272210.231 06:57:49 10:35:17

13 518400 29.987 524.980 1150.365 272149.708 06:57:49 10:34:17

14 518400 27.798 525.036 1150.462 272178.832 06:57:49 10:34:46

15 518400 30.774 525.026 72070.929 272173.279 06:57:49 10:34:41

16 518400 30.786 525.025 1146.696 272172.777 06:57:49 10:34:40

17 518400 29.714 525.095 1148.486 272209.251 06:57:49 10:35:16

18 518400 27.867 525.021 1151.910 272171.144 06:57:49 10:34:38

19 518400 30.881 525.061 1152.283 272191.771 06:57:49 10:34:59

20 518400 27.827 525.066 1147.158 272194.057 06:57:49 10:35:01

(Sorry the columns didn't line up properly after I pasted the data into the topic window.)

Anyway, notice that 5 of the threads (1, 4, 6, 11, and 15) show maximum

times a little over 72000 milliseconds (roughly 72 seconds). However, for

all threads the average time is about 0.5 seconds per loop iteration, and

for the threads that did not experience the anomalous 72-second interruption,

the worst case time is about 1.2 seconds.

All of the threads are executing the same code.

Although the table above doesn't show it, there were actually a total of 8

72-second delays in the 3-day test run (a few of the threads saw multiple

72-second delays) -- so, as you can see, this doesn't happen very often.

Running the same test with the 100 Hz timer does not show any problems.

For example, here's the data from a 16-hour run:

Thread Loop Minimum Time Average Time Maximum Time Total Time Start Time Stop Time

Number Count (milliseconds) (milliseconds) (milliseconds) (seconds) (HH:MM:SS) (HH:MM:SS)

------ ---------- -------------- -------------- -------------- ---------- ---------- ----------

1 100000 28.666 537.322 1144.769 53732.186 06:31:41 21:27:21

2 100000 28.653 538.288 1180.917 53828.823 06:31:41 21:28:57

3 100000 29.482 540.014 1181.092 54001.416 06:31:41 21:31:50

4 100000 21.679 540.361 1186.108 54036.148 06:31:41 21:32:24

5 100000 28.949 540.056 1184.370 54005.621 06:31:41 21:31:54

6 100000 21.394 540.365 1224.884 54036.465 06:31:41 21:32:25

7 100000 29.413 540.219 1179.373 54021.871 06:31:41 21:32:10

8 100000 28.979 540.397 1175.199 54039.686 06:31:41 21:32:28

9 100000 28.941 540.313 1146.820 54031.288 06:31:41 21:32:19

10 100000 29.876 540.199 1144.930 54019.860 06:31:41 21:32:08

11 100000 29.090 540.241 1185.743 54024.079 06:31:41 21:32:12

12 100000 29.567 540.149 1148.273 54014.862 06:31:41 21:32:03

13 100000 29.531 539.851 1180.193 53985.051 06:31:41 21:31:33

14 100000 29.876 539.879 1178.381 53987.865 06:31:41 21:31:36

15 100000 29.100 539.447 1178.365 53944.686 06:31:41 21:30:53

16 100000 28.992 539.600 1189.743 53959.999 06:31:41 21:31:08

17 100000 29.485 540.172 1147.875 54017.214 06:31:41 21:32:06

18 100000 29.136 539.647 1180.218 53964.660 06:31:41 21:31:13

19 100000 28.906 539.481 1174.130 53948.052 06:31:41 21:30:56

20 100000 29.540 539.752 1180.538 53975.156 06:31:41 21:31:24

All of the tests were done using round robin scheduling (SCHED_RR), but I've

verified that the symptoms are more or less the same when using SCHED_OTHER

(i.e., there are occasional 72-second delays with the 300 Hz timer and

SCHED_OTHER).

A few other facts:

(1) The problem is not related to having 20 threads. Our main

application has 10 threads, and it exhibits the problem.

(2) I'm not positive but I suspect the problem is not related to

any issues with stdio functions either being or not being

thread-safe. The reason I say that is that at one point I put

a mutex lock around "printf" calls and still saw the problem.

(3) I know that the problem is not somehow related to serial and/or

socket I/O specifically. I.e., we've seen the problem when

connected via ethernet, and also when the I/O ("printf" calls)

is directed to a UART.

(4) I have not been able to make the problem occur when the threads

are executing floating-point operations (sin, cos, sqrt calls)

instead of printf calls.

I'd be glad to post or make available the test program source code if anyone

is interested, but I haven't included it here to keep this post to a somewhat

reasonable length.

The question I have is: Does anyone understand what's

causing these huge delays. It's clearly not something we can live with

in an electronic instrument. When a user pushes a button, it's not really

acceptable to respond to the button push 72 seconds later, but that's exactly

what was happening with our instrument!

During these delays the linux kernel appears to be alive and well. E.g.,

we can connect with telnet and do normal stuff like "ps", "ls", etc.

The symptoms are as if the linux scheduler has simply decided not to let a

certain thread run for 72 seconds, but we have no idea why that would be the case.

We can live with the 100 Hz scheduler timer, but it doesn't seem all that

unusual to have a multi-threaded embedded application, and want better than

10 millisecond granularity on "usleep" calls, so I'm a bit puzzled that

the problem hasn't been discussed before (at least I haven't noticed it in

the forum).

Is anyone else building kernels using the 300 Hz timer and running multi-threaded applications?

Thanks for any help.

-- Matt

7 Replies

  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    <div class='quotetop'>QUOTE (Matt Nicholas @ Sep 11 2009, 02:04 AM) <{post_snapback}> (index.php?act=findpost&pid=23810)</div>

    --- Quote Start ---

    We originally switched to the 300 Hz timer in an attempt to get better granularity on "usleep" calls[/b]

    --- Quote End ---

    I suppose it would be better to find a way to avoid the usleep() calls, as doing a tight timing in a user land process is not a good idea.

    Maybe this can be done by doing an appropriate Kernel driver or by some custom "hardware" IP-core.

    Do you use the new MMU-enable ,distribution ? Same is based on gcc4 and the thread support is a lot better (working gdb, TLS ("__thread"-keyword), NTPL, ...) There is no FUTEX support yet, but we are working on that issue, and I&#39;ll not start my threaded project, before FUTEX is in place.

    -Michael
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    Hello

    I'm quite sensitive susceptible to this kind of problems because I'm experiencing a similar one.

    My problem is with a 1000 Hz timer not working against a 100 Hz timer working with a threaded application. To be precise, even with 100 Hz timer the reactivity of the system seems worse than older kernel versions (unchanged application source code, unchanged FPGA project).

    Dear mdn, just to understand if we are using the same software, which distribution are you using?

    mine: uClinux from nios2-linux-20090730.tar with 2.6.30 kernel (problems were detected also with the previous tar.gz with 2.6.28 kernel)

  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    Futex is the top prio on my To Do list. But I am short of time at this moment.

    gabrigob, could you give a sample code for benchmark?

    - Hippo
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    --- Quote Start ---

    Futex is the top prio on my To Do list. But I am short of time at this moment.

    --- Quote End ---

    While - as you know - I am very interested in FUTEX and definitely eager to help, I must admit that (other than in the recent months) for some weeks I will be extremely busy with completely unrelated stuff, so I only can be of limited help right now :mad:.

    -Michael
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    --- Quote Start ---

    Futex is the top prio on my To Do list. But I am short of time at this moment.

    gabrigob, could you give a sample code for benchmark?

    - Hippo

    --- Quote End ---

    Thanks for interest,

    for a few days I'm busy with other projects. Next week I'll post some benchmarks and some code to repeat the test on other systems. I could also be interested in repeating mdn's test, if time permits.

    I warn in advance that I can not post the complete application both for the huge amount of code and for the presence of third party code I am not authorized to share. I'll build a testbench only with the interesting parts.
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    two weeks ago I did a Futex testing and timing program that with multiple threads uses as semaphores some coarsely carved homebrew functions based on atomic functions, pthread_mutex from glibc, and System 5 semaphores.

    I'll send the source code on request.

    -Michael
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    Hi Michael,

    That would be helpful. Please send me your code.

    Cheers,

    Hippo