Message-ID: <Zz95qDPU2wcEp26r@localhost.localdomain>
Date: Thu, 21 Nov 2024 19:19:20 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: Anthony Mallet <anthony.mallet@...s.fr>
Cc: Anna-Maria Behnsen <anna-maria@...utronix.de>,
linux-kernel@...r.kernel.org, Oleg Nesterov <oleg@...hat.com>,
Dmitry Vyukov <dvyukov@...gle.com>, Marco Elver <elver@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: posix timer freeze after some random time, under pthread
create/destroy load
Hi Anthony,
On Wed, Nov 06, 2024 at 10:29:44PM +0100, Anthony Mallet wrote:
> Hi,
>
> I'm facing an issue with posix timers configured to send a SIGALRM
> signal upon expiry. The symptom is that the timer randomly freezes
> (the signal handler is no longer triggered). After analysis, this happens
> in combination with pthread creation / destruction.
>
> I have attached a test case that can reliably reproduce my issue on
> affected kernels. It involves creating a timer that increments a
> global counter at each tick, while the main thread is spawning and
> destroying other threads. At some point, the counter gets stalled. In
> the context of this test case, I do heavy thread creation and
> destruction, so that the issue triggers almost immediately. Regarding
> the real-world issue, it happens in the context of aio(7) work, which
> also involves thread creation and destruction but presumably at a much
> lower rate, and the issue consequently triggers much less often.
>
> I could reproduce the issue reliably with mainline kernels from 6.4
> to 6.11 (inclusive), and on several distributions, different hardware
> and glibc versions. Kernels up to and including 6.3 do not exhibit
> the problem at all.
>
> Once the issue triggers, simply resetting the timer (with
> timer_settime(2)) makes it work again, until the next
> stall. timer_gettime(2) does not show garbage and the values are still
> as expected. Only the signal handler is not called. Manually sending
> SIGALRM with raise(SIGALRM) also works and invokes the signal handler
> as expected.
>
> Also note that using setitimer(2) instead of a posix timer does not
> show any problem with the same test program.
>
> Before filing a proper bug report, I wanted to have your opinion
> about this. This e-mail is already probably too long for an
> introduction, but I can of course provide you with any missing detail
> that you would deem necessary.
>
> Thanks for your attention,
> Anthony Mallet
Thanks a lot for your report and the very helpful reliable reproducer.
I think this started with commit:
bcb7ee79029d (posix-timers: Prefer delivery of signals to the current thread)
The problem is that if the current task is exiting and has already been reaped,
its sighand pointer isn't there anymore. And so the signal is ignored even
though it should be queued to and handled by the thread group that has other
live threads to take care of it.
Can you test the following patch? I'm preparing a separate patch with a
changelog for upstream, which has seen recent changes in this area.
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..4cadee618d4b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1984,7 +1984,8 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 	t = pid_task(pid, type);
 	if (!t)
 		goto ret;
-	if (type != PIDTYPE_PID && same_thread_group(t, current))
+	if (type != PIDTYPE_PID && same_thread_group(t, current) &&
+	    !(current->flags & PF_EXITING))
 		t = current;
 	if (!likely(lock_task_sighand(t, &flags)))
 		goto ret;