Message-ID: <87bl0aidjv.fsf@email.froward.int.ebiederm.org>
Date: Mon, 17 Jan 2022 09:31:48 -0600
From: "Eric W. Biederman" <ebiederm@...ssion.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Alexey Gladkov <legion@...nel.org>,
Al Viro <viro@...iv.linux.org.uk>,
Kees Cook <keescook@...omium.org>,
Oleg Nesterov <oleg@...hat.com>
Subject: Re: [GIT PULL] signal/exit/ptrace changes for v5.17
Linus Torvalds <torvalds@...ux-foundation.org> writes:
> On Sat, Jan 15, 2022 at 2:00 AM Eric W. Biederman <ebiederm@...ssion.com> wrote:
>>
>> I am currently investigating to figure out if wake_up_interruptible
>> (instead of simply wake_up) ever makes sense outside of the signal code.
>
> It may not be a huge deal, but it *absolutely* makes sense.
>
> Any subsystem that uses "wait_event_interruptible()" (or variations of
> that) should by default only use "wake_up_interruptible()" to wake the
> wait queue.
>
> The reason? Being (interruptibly) on one wait-queue does *NOT*
> preclude the very same process from waiting non-interruptibly
> on another one at the same time.
>
> It's not hugely common, but the Linux kernel wait-queues have very
> much been designed for the whole "one process can be on multiple wait
> queues for different reasons at the same time" model. That is *very*
> core.
>
> People sometimes think that is just a "poll/select()" thing, but
> that's not at all true. It's quite valid to do things like
>
>	add_wait_queue(..)
>	for (.. some loop ..) {
>		set_current_state(TASK_INTERRUPTIBLE);
>		... do various things, checking state etc ..
>		schedule();
>	}
>	set_current_state(TASK_RUNNING);
>	remove_wait_queue();
>
> and part of that "do various things" is obviously checking for signals
> and other "exit this wait loop", but other things can be things like
> taking a page fault because you copied data from user space etc.
>
> And that in turn may end up having a nested wait on another waitqueue
> (for the IO), and the outer wait queue should basically strive to not
> wake up unrelated "deeper" sleeps, because that is pointless and just
> causes extra wakeups.
>
> So the page fault event will cause a nested TASK_UNINTERRUPTIBLE
> sleep, and when that IO has completed, it goes into TASK_RUNNING, so
> the outer (interruptible) loop above will have a "dummy schedule()"
> and loop around again to be put back into TASK_INTERRUPTIBLE sleep
> next time around.
>
> But note how it would be pointless to use a "wake_up()" on this outer
> wait queue - it would wake up the deeper IO sleep too, and that would
> just see "oh, the IO I'm waiting for hasn't completed, so I'll just
> have to go to sleep again".
>
> Would it still _work_? Yes. Is the above _common_? No. But it's a
> really fundamental pattern in the kernel, and it's fundamentally how
> wait queues work, and how they should work, and an interruptible sleep
> should generally be seen as pairing with an interruptible wake,
> because that's just how things are.
>
> Why would you want to change something basic like that? Yes, using
> "wake_up()" instead of "wake_up_interruptible()" would still result in
> a working kernel, but it's just _pointless_.
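[For reference, the pairing described above can be sketched roughly as
follows. This is a kernel-style C fragment, not buildable standalone;
the wait queue name and the `data_ready` condition are hypothetical,
standing in for whatever condition a real subsystem checks.]

```c
#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static bool data_ready;		/* hypothetical condition */

/* Sleeper side: an interruptible wait on the condition. */
static int my_wait(void)
{
	/* Returns -ERESTARTSYS if a signal interrupts the sleep. */
	return wait_event_interruptible(my_wq, data_ready);
}

/* Waker side: wake only TASK_INTERRUPTIBLE sleepers on this queue.
 * A task that is currently nested in a TASK_UNINTERRUPTIBLE I/O
 * sleep elsewhere is left alone, avoiding the pointless extra
 * wakeup that a plain wake_up() would cause. */
static void my_wake(void)
{
	data_ready = true;
	wake_up_interruptible(&my_wq);
}
```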
Thank you for the detailed reply. I am going to have to spend a little
bit of time digesting it.
The reason why is that I am digging into how to reliably test for, and
get, just the wakeups needed in the coredump code.
As a first approximation, for writing to files, changing
dump_interrupted from checking signal_pending to fatal_signal_pending
worked like a charm. Unfortunately the pipe code is still performing
interruptible waits, and io_uring causes truncated coredumps.
I would like to have a version of pipe_write that sleeps in
TASK_KILLABLE. I want the I/O wake-ups and I want the SIGKILL wake-ups,
but I don't want any other wake-ups. Unfortunately the I/O wake-ups in
the pipe code are sent with wake_up_interruptible. So a task sleeping
in TASK_KILLABLE won't get them.
Which means that the obvious solution of changing
wait_event_interruptible to wait_event_killable breaks coredump support
(as I found out the hard way).
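[In other words, the coredump path wants roughly the following. This is
a hypothetical sketch, not the actual fs/pipe.c code; `pipe_writable()`
here stands in for the real pipe-fullness check.]

```c
/* Sketch: sleep in TASK_KILLABLE so that only a fatal signal
 * (SIGKILL) -- and, ideally, I/O readiness -- can wake us. */
if (wait_event_killable(pipe->wr_wait, pipe_writable(pipe)))
	return -ERESTARTSYS;

/* The catch: the pipe readers signal writability with
 * wake_up_interruptible(&pipe->wr_wait), which wakes only
 * TASK_INTERRUPTIBLE sleepers. TASK_KILLABLE is
 * TASK_WAKEKILL | TASK_UNINTERRUPTIBLE, so the killable sleeper
 * above never sees the I/O wake-ups and the write stalls. */
```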
I understand things well enough that I can change the signal code and
not make the coredump code worse. I am still trying to figure out what
a clean maintainable solution for coredumps writing to pipes is going to
be. So I will dig in and read the code that has the nested wait queues
and see if I can understand that logic, and keep thinking about the
coredump support.
Eric