lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <7BFDACCD6948EF4D8FE8F4888A91596AA7DFA0@tx14exm60.ds.mot.com>
Date:	Tue, 10 Apr 2007 09:16:11 -0500
From:	"Hawkes Steve-FSH016" <Steve.Hawkes@...orola.com>
To:	<linux-kernel@...r.kernel.org>
Subject: TIF_SIGPENDING set, but signals not being serviced

I am running into an issue where processing of signals appears to be
misbehaving at the kernel level. One or more multithreaded processes are
ending up in a state where a signal is pending for the process, the
signal is blocked at the application level in all threads in the
process, and that one thread has the signal temporarily unblocked in the
kernel since it is in a sigwaitinfo() call, but that thread is not being
awakened to process the signal. Eventually the signal queue limit is hit
as more realtime signals are delivered to the process.

The kernel being used is 2.6.10-based on a PPC 8548 target
(unfortunately, I'm not able to move to a later kernel revision
currently, but I have not found changes in the latest kernel that appear
to be related to this issue).
 
Here is an example of 'ps' output in the failure scenario.
 
ps -emwo pid,tid,sig_pend,sig_block,sig_ignore,sig_catch,stat,wchan
 
  PID   TID           SIGNAL          BLOCKED          IGNORED
CATCHED STAT WCHAN
 1652     - 2000000000000000                -                -
- -    -
    -  1652 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1653 0000000000000000 4000000000000000 0000000000001004
9ffffff0fff8accb Sl   rt_sigtimedwait
    -  1654 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1655 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1656 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1657 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1658 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1659 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   select
    -  1660 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1661 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   select
    -  1662 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1663 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1664 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   nanosleep
    -  1665 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1667 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  1668 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  2178 0000000000000000 6000000000000000 0000000000001004
9ffffff0fff8accb Sl   futex_poll
    -  2410 0000000000000000 ffffffbf7ffbf235 0000000000001004
9ffffff0fff8accb Sl   futex_poll

The signal is pending for the process, as shown by the second line, and
the signal is unblocked in thread 1653, but that thread is not being
awakened.
 
I wrote a kernel module which displays more internal signal
information--as shown below--and it confirms that thread 1653 has the
signal temporarily unblocked and shared_pending has the signal pending
for the process. Oddly enough, it also shows the TIF_SIGPENDING flag is
set for multiple threads in the process (the column labeled "info") even
though those threads do not have and have never had the signal
unblocked.
 
  pid:  tgid stat info           pending #pnd   shared pending #shpnd
blocked     real blocked
 1652:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1653:  1652    2    4  0000000000000000    0 2000000000000000      2
4000000000000000 6000000000000000
 1654:  1652    2    4  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1655:  1652    2    4  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1656:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1657:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1658:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1659:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1660:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1661:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1662:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1663:  1652    2    4  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1664:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1665:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1667:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 1668:  1652    2    4  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 2178:  1652    2    0  0000000000000000    0 2000000000000000      2
6000000000000000 0000000000000000
 2410:  1652    2    0  0000000000000000    0 2000000000000000      2
ffffffbf7ffbf235 0000000000000000

The problem is difficult to analyze because analysis tools are causing
changes in the behavior, but some of the tests I've performed seem to
indicate that TIF_SIGPENDING is being left set on several threads which
do not have the signal unblocked before the system ends up in the state
where the signal in question is left as pending for the process.

I've tried using LTT and kernel instrumentation to determine how the
system gets in this state, but have not been successful so far. Note
that the problem seems to be timing related since even small changes
such as one more custom LTT event or running LTT while the system is
going through the phase where the problem normally appears can cause the
problem to not occur or to occur in a different process after the LTT
trace daemon terminates. The problem is difficult to isolate because we
only see it during the complex initialization sequence of our system and
it does not always occur in the same process.

I'm stumped.
Steve
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ