Message-ID: <8ef39357-06c9-0959-71da-fba80e1fa934@yeslogic.com>
Date: Sun, 23 Jul 2017 12:14:04 +1000
From: Michael Day <mikeday@...logic.com>
To: linux-kernel@...r.kernel.org
Subject: signal not interrupting futex
We have hit an apparent kernel bug where a signal is not interrupting a
futex, leading to a deadlock in our code. Here is the relevant strace
output just before it blocks (complete strace log is attached):
14069 set_robust_list(0x7f7b3e7ee9e0, 24 <unfinished ...>
14061 futex(0x7f7b46721fd8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
14069 <... set_robust_list resumed> ) = 0
14069 futex(0x7f7b46721fd8, FUTEX_WAKE_PRIVATE, 1) = 1
14061 <... futex resumed> ) = 0
14061 futex(0x1585ea0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
14069 tgkill(14061, 14061, SIGPWR) = 0
14069 futex(0x1586280, FUTEX_WAIT_PRIVATE, 0, NULL
Thread 14069 sends SIGPWR to thread 14061 with tgkill(), but the signal is
never delivered: the FUTEX_WAIT in thread 14061 never returns. We have not
been able to figure out why.
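One check that might narrow this down on the affected machine is to look at the per-thread signal masks in /proc/<pid>/task/<tid>/status while the process is hung. The following is a minimal sketch, not something we have run on the customer's system. The helper names (signal_state, read_thread_status) are hypothetical, and it assumes x86_64 Linux, where SIGPWR is signal number 30.

```python
# Sketch: decode the signal masks of one thread from its /proc status file.
# Assumption: x86_64 Linux, where SIGPWR is signal number 30.
SIGPWR = 30

# Hex bitmask fields found in /proc/<pid>/task/<tid>/status.
MASK_FIELDS = ("SigPnd", "ShdPnd", "SigBlk", "SigIgn", "SigCgt")


def signal_state(status_text, signo=SIGPWR):
    """Return {field: bool} saying whether signo is set in each mask."""
    bit = 1 << (signo - 1)  # bit N-1 of the mask corresponds to signal N
    state = {}
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name in MASK_FIELDS:
            state[name] = bool(int(value.strip(), 16) & bit)
    return state


def read_thread_status(pid, tid):
    """Read the raw status text for one thread of a process."""
    with open(f"/proc/{pid}/task/{tid}/status") as f:
        return f.read()


if __name__ == "__main__":
    import sys
    pid, tid = int(sys.argv[1]), int(sys.argv[2])
    for field, is_set in signal_state(read_thread_status(pid, tid)).items():
        print(f"{field}: {'set' if is_set else 'clear'}")
```

Run against the hung process with the pid and the tid of the waiting thread (14061 14061 in the trace above). If SIGPWR shows as set in both SigPnd and SigBlk, the signal is pending but blocked in that thread, which would be consistent with the futex wait not being interrupted. If it is clear in SigCgt, no handler is installed for it.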
Background information: this deadlock is experienced by our customer
running Prince on CentOS 7. The bug happens every time on their system,
but we have not been able to reproduce it on ours yet. They have tried
two different kernel versions:
3.10.0-327.28.2.el7.x86_64
3.10.0-514.26.2.el7.x86_64
Over the past two years we have heard of similar deadlocks from other
customers, always on CentOS and typically involving PHP, although these
are of course very popular systems, so that may be coincidence.
This issue appears to be unrelated to the earlier futex bug affecting
Haswell processors, but could there be another bug along these lines
affecting futexes or signal delivery?
What can we do to help debug this issue?
Best regards,
Michael
--
Prince: Print with CSS!
http://www.princexml.com
View attachment "prince.strace" of type "text/plain" (52230 bytes)