[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <deb24649-e19a-4117-a3ec-dd6f5ee52f48@amd.com>
Date: Sun, 9 Mar 2025 19:31:15 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Hillf Danton <hdanton@...a.com>, Oleg Nesterov <oleg@...hat.com>
CC: Mateusz Guzik <mjguzik@...il.com>, "Sapkal, Swapnil"
<swapnil.sapkal@....com>, Linus Torvalds <torvalds@...ux-foundation.org>,
<linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still
full
Hello Hillf,
On 3/8/2025 5:26 AM, Hillf Danton wrote:
> On Fri, 7 Mar 2025 13:34:43 +0100 Oleg Nesterov <oleg@...hat.com>
>> On 03/07, Oleg Nesterov wrote:
>>> On 03/07, Hillf Danton wrote:
>>>> On Fri, 7 Mar 2025 11:54:56 +0530 K Prateek Nayak <kprateek.nayak@....com>
>>>>>> step-03
>>>>>> task-118766 new reader
>>>>>> makes pipe empty
>>>>>
>>>>> Reader seeing a pipe full should wake up a writer allowing 118768 to
>>>>> wakeup again and fill the pipe. Am I missing something?
>>>>>
>>>> Good catch, but that wakeup was cut off [2,3]
>>
>> Please note that "that wakeup" was _not_ removed by the patch below.
>>
> After another look, you did cut it.
>
> Link: https://lore.kernel.org/all/20250209150718.GA17013@redhat.com/
> Signed-off-by: Oleg Nesterov <oleg@...hat.com>
> Tested-by: K Prateek Nayak <kprateek.nayak@....com>
So that is not problematic because pipe_write() no longer increments the
head before a successful write. What also changed in that patch above is
the order in which we do copy_page_from_iter() and head increment - now,
the head is incremented only if copy_page_from_iter() actually manages to
write data into the buffer which eliminates the need to do a wakeup if
nothing was found in the buffer ...
> ---
> fs/pipe.c | 45 +++++++++------------------------------------
> 1 file changed, 9 insertions(+), 36 deletions(-)
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 2ae75adfba64..b0641f75b1ba 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -360,29 +360,9 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to)
Above this, we have:
if (!total_len)
break; /* common path: read succeeded */
if (!pipe_empty(head, tail)) /* More to do? */
continue;
And if a read is successful, one of them must be hit and with that
reordering in pipe_write(), and the readers must be able to wake a
writer if !pipe_empty() ...
> break;
> }
> mutex_unlock(&pipe->mutex);
> -
> /*
> * We only get here if we didn't actually read anything.
> *
> - * However, we could have seen (and removed) a zero-sized
> - * pipe buffer, and might have made space in the buffers
> - * that way.
> - *
> - * You can't make zero-sized pipe buffers by doing an empty
> - * write (not even in packet mode), but they can happen if
> - * the writer gets an EFAULT when trying to fill a buffer
> - * that already got allocated and inserted in the buffer
> - * array.
> - *
> - * So we still need to wake up any pending writers in the
> - * _very_ unlikely case that the pipe was full, but we got
> - * no data.
> - */
> - if (unlikely(wake_writer))
> - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
> - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> -
> - /*
> * But because we didn't read anything, at this point we can
> * just return directly with -ERESTARTSYS if we're interrupted,
> * since we've done any required wakeups and there's no need
> @@ -391,7 +371,6 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to)
> if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
> return -ERESTARTSYS;
>
> - wake_writer = false;
> wake_next_reader = true;
> mutex_lock(&pipe->mutex);
> }
>
>> "That wakeup" is another wakeup pipe_read() does before return:
>>
>> if (wake_writer)
>> wake_up_interruptible_sync_poll(&pipe->wr_wait, ...);
>>
>> And wake_writer must be true if this reader changed the pipe_full()
>> condition from T to F.
>>
> Could you read Prateek's comment again, then try to work out why he
> did so?
>
>> Note also that pipe_read() won't sleep if it has read even one byte.
and that is the key why Oleg's optimization that he highlighted above
which is why it cannot cause a hang (with my imagination). Now, let
us take a closer look at the whole sleep and wakeup mechanism. I'll
expand the one for pipe_write() since that was the problematic bit we
analyzed but you can do the same for pipe_read() side too. The
wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
in pipe_write() will boil down to these bits (I'll only highlight the
important ones):
ret = 0;
might_sleep()
if (!pipe_writable()) { /* First line of defense */
init_wait_entry(&__wq_entry, WQ_FLAG_EXCLUSIVE);
for (;;) {
/* Below is expansion of prepare_to_wait_event() success case*/
spin_lock_irqsave(&wq_head->lock, flags);
__add_wait_queue_entry_tail(wq_head, __wq_entry);
set_current_state(TASK_INTERRUPTIBLE);
spin_unlock_irqrestore(&wq_head->lock, flags);
/* Second line of defense */
if (pipe_writable())
break;
schedule();
}
finish_wait(&wq_head, &__wq_entry);
]
return 0;
The sequence for the writer wait is:
o Add on the wait queue.
o Set task state to TASK_INTERRUPTIBLE.
o Check if pipe_writable() one last time.
o Call schedule()
Now why can't we miss a wakeup? The reader increments the tail, and then
does a wakeup of waiting writers in all cases it finds pipe_full().
Wakeup accesses a the wait queues which does:
spin_lock_irqsave(&wq_head->lock, flags);
__wake_up_common();
spin_unlock_irqrestore(&wq_head->lock, flags);
so adding to wait queue and wakeup from wait queue are always done under
the wait queue lock.
Now there are two non trivial cases:
o This is the simple case:
writer reader
====== ======
if (!pipe_writable()) /* True */ {
for (;;) {
... wake_writers = pipe_full(); /* True */
tail = tail + 1
wake_up_interruptible_sync_poll(&pipe->wr_wait) {
/*wr_wait is empty */
}
...
/* Adds itself on the wait queue */
writer->__state = TASK_INTERRUPTIBLE;
if (pipe_writable()) /* True */ {
break;
/*
* Goes and does finish_wait()
*/
}
}
finish_wait() {
p->__state = TASK_RUNNING;
}
/*
* Goes and does a check under
* pipe->mutex to be sure.
*/
}
O This is slightly complicated case:
writer reader
====== ======
if (!pipe_writable()) /* True */ {
for (;;) {
/* Adds itself on the wait queue */
writer->__state = TASK_INTERRUPTIBLE;
if (pipe_writable()) /* False */ {
/* The break is not executed */
}
... wake_writers = pipe_full(); /* True */
tail = tail + 1
wake_up_interruptible_sync_poll(&pipe->wr_wait) {
default_wake_function() {
ttwu_runnable() {
/* Calls ttwu_do_wakeup() which does */
p->__state = TASK_RUNNING
}
}
}
...
schedule() {
if (!p->__state) /* False */ {
/*
* Never called since p->__state
* is RUNNING
*/
try_to_block_task();
}
}
... repeat the loop
if (pipe_writable()) /* True */ {
break;
/*
* Goes and does finish_wait()
*/
}
}
finish_wait() {
p->__state = TASK_RUNNING;
}
/*
* Goes and does a check under
* pipe->mutex to be sure.
*/
}
All in all, the writer will wither see a pipe_writable() and never fully
block, or it will be woken up it will add itself to the wit queue and
will be woken by a reader after moving pipe->tail if pipe_full()
returned true before. If I've missed something, please let me know and
I'll try to go back and convince myself on how that situation can happen
and get back to you but so far, my head cannot think of a situation
where the pipe can hang after the recent set of fixes and optimizations.
>>
>>>> [2] https://lore.kernel.org/lkml/20250304123457.GA25281@redhat.com/
>>>> [3] https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/
>>>
>>> Why do you think
>>>
>>> [PATCH v2 1/1] pipe: change pipe_write() to never add a zero-sized buffer
>>> https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/
>>>
>>> can make any difference ???
>>>
>>> Where do you think a zero-sized buffer with ->len == 0 can come from?
I audited the post_one_notification() code and the paths that lead up
to it:
o remove_watch_from_object() will have the data of size watch_sizeof(n)
so that beffer size is definitely !0 (unless (watch_sizeof(n) & 0x7f is
zero but what are the odds of that?)
o Other is __post_watch_notification() which has a early return for
((n->info & WATCH_INFO_LENGTH) >> WATCH_INFO_LENGTH__SHIFT) == 0
and also a WARN_ON() so it'll not add an empty buffer and will instead
scream at the users if someone tries to do it.
splice() cases too cannot do this based on my understanding so I don't
think we can run into the issue but I'm limited by my own imagination :)
>>
>> Oleg.
--
Thanks and Regards,
Prateek
Powered by blists - more mailing lists