Message-Id: <1255165747.6385.117.camel@doodleydee>
Date: Sat, 10 Oct 2009 02:09:07 -0700
From: Blaise Gassend <blaise@...lowgarage.com>
To: linux-kernel@...r.kernel.org
Cc: Jeremy Leibs <leibs@...lowgarage.com>
Subject: ERESTARTSYS escaping from sem_wait with RTLinux patch
The attached python program, in which 500 threads spin with microsecond
sleeps, crashes with a "sem_wait: Unknown error 512" (conditions
described below). This appears to be due to an ERESTARTSYS generated
from futex_wait escaping to user space (libc). My understanding is that
this should never happen and I am trying to track down what is going on.
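For reference, 512 is the value of ERESTARTSYS in the kernel's include/linux/errno.h. It is a kernel-internal code with no user-space errno name, which would explain why libc prints it as an unknown error. A minimal Python sketch of that check follows; the constant is defined by hand and describe() is a helper made up for this illustration. The exact strerror text depends on the C library, so none is asserted here.

```python
import errno
import os

# Kernel-internal restart code from include/linux/errno.h. Defined by hand
# here because user space is never supposed to see it.
ERESTARTSYS = 512


def describe(code):
    """Return (symbolic name or None, C library strerror text) for an errno value."""
    return errno.errorcode.get(code), os.strerror(code)


# EINTR has a user-space name; 512 does not, so its name comes back as None.
print(describe(errno.EINTR))
print(describe(ERESTARTSYS))
```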
Questions that would help me make progress:
-------------------------------------------
1) Where is the ERESTARTSYS being prevented from getting to user space?
The only likely place I see for preventing ERESTARTSYS from escaping to
user space is in arch/*/kernel/signal*.c. However, I don't see how the
code there is being called if there is no signal pending. Is that a path
for ERESTARTSYS to escape from the kernel?
The following comment in kernel/futex.c in futex_wait makes me wonder if
two threads are getting marked as ERESTARTSYS. The first one to leave
the kernel processes the signal and restarts. The second one doesn't
have a signal to handle, so it returns to user space without getting
into signal*.c and wreaks havoc.
(...)
        /*
         * We expect signal_pending(current), but another thread may
         * have handled it for us already.
         */
        if (!abs_time)
                return -ERESTARTSYS;
(...)
2) Why would this be happening only with RT kernels?
3) Any suggestions on the best place to patch/workaround this?
My understanding is that if I were to treat ERESTARTSYS as an EAGAIN,
most applications would be perfectly happy. Would bad things happen if I
replaced the ERESTARTSYS in futex_wait with an EAGAIN?
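To make the "treat it like EAGAIN" idea concrete, here is a user-space sketch of the same policy: retry the call whenever it fails with EINTR, EAGAIN, or the leaked 512. It is not a kernel patch and not what libc does. retry_on_restart and make_flaky_wait are hypothetical names, and make_flaky_wait only simulates a sem_wait-like call that fails with 512 a few times before succeeding.

```python
import errno

ERESTARTSYS = 512  # kernel-internal value; not exposed by Python's errno module

# Error codes this sketch treats as "nothing actually failed, just try again".
RETRYABLE = {errno.EINTR, errno.EAGAIN, ERESTARTSYS}


def retry_on_restart(call, max_attempts=100):
    """Invoke call() until it stops failing with a retryable errno."""
    for _ in range(max_attempts):
        try:
            return call()
        except OSError as exc:
            if exc.errno not in RETRYABLE:
                raise  # a real error: pass it on unchanged
    raise RuntimeError("gave up after %d attempts" % max_attempts)


def make_flaky_wait(failures):
    """Hypothetical stand-in for sem_wait: fails `failures` times with 512."""
    state = {"left": failures}

    def wait():
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError(ERESTARTSYS, "Unknown error 512")
        return 0

    return wait
```

For example, retry_on_restart(make_flaky_wait(3)) returns 0 after three retries. The attempt cap is there so that a call which keeps failing cannot spin forever.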
Crash conditions:
-----------------
- RTLinux only.
- More cores seem to make things worse. Lots of crashes on a dual
quad-core machine. None observed yet on dual core. At least one crash on a
dual quad-core when run with "taskset -c 1"
- Various versions, including 2.6.29.6-rt23, and whatever the latest was
earlier today.
- Seen on both ia64 and x86
- Ubuntu hardy and jaunty
- Sometimes happens within 2 seconds on a dual quad-core machine, other
times will go for up to 30 minutes to an hour without crashing. I
suspect a dependence on system activity, but haven't noticed an obvious
pattern.
- Time to crash appears to drop fast with more CPU cores.
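For readers without the attachment, the kind of load described at the top of this message (many threads spinning on microsecond sleeps) can be sketched as below. This is not the attached threadprocs8.py. The function names, the stop event, and the iteration counters are assumptions added so the sketch terminates cleanly, and whether it triggers the failure is untested.

```python
import threading
import time


def spin(stop, counts, index, sleep_s):
    """Sleep in a tight loop until asked to stop, counting iterations."""
    n = 0
    while True:
        time.sleep(sleep_s)  # a microsecond sleep forces constant rescheduling
        n += 1
        if stop.is_set():
            break
    counts[index] = n


def run_spinners(num_threads=500, duration_s=60.0, sleep_s=1e-6):
    """Start num_threads spinning threads, let them run, then stop and join them."""
    stop = threading.Event()
    counts = [0] * num_threads
    threads = [
        threading.Thread(target=spin, args=(stop, counts, i, sleep_s))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    return counts
```

Calling run_spinners(500, 60.0) matches the thread count mentioned above; the 60-second duration is arbitrary and can be raised for longer soak runs.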
View attachment "threadprocs8.py" of type "text/x-python" (223 bytes)