Message-ID: <03a1f4af-47e0-459d-b2bf-9f65536fc2ab@amd.com>
Date: Mon, 3 Mar 2025 15:16:34 +0530
From: "Sapkal, Swapnil" <swapnil.sapkal@....com>
To: Oleg Nesterov <oleg@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>
CC: Mateusz Guzik <mjguzik@...il.com>, Manfred Spraul
<manfred@...orfullife.com>, Linus Torvalds <torvalds@...ux-foundation.org>,
Christian Brauner <brauner@...nel.org>, David Howells <dhowells@...hat.com>,
WangYuli <wangyuli@...ontech.com>, <linux-fsdevel@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, "Shenoy, Gautham Ranjal"
<gautham.shenoy@....com>, <Neeraj.Upadhyay@....com>, <Ananth.narayan@....com>
Subject: Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still
full
Hi Oleg,
On 2/28/2025 10:03 PM, Oleg Nesterov wrote:
> And... I know, I know you already hate me ;)
>
Not at all :)
> but if you have time, could you check if this patch (with or without the
> previous debugging patch) makes any difference? Just to be sure.
>
Sure, I will give this a try.
In the meantime, Prateek and I ran some experiments over the weekend.
We were able to reproduce this issue on a third-generation EPYC system as well as
on an Intel Emerald Rapids system (2 X INTEL(R) XEON(R) PLATINUM 8592+).
We took a heavy-handed tracing approach on top of your debug patch.
I have attached the debug patch below. With tracing we found the following case for
pipe_writable():
hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1
Here,
head = 37
tail = 38
max_usage = 16
pipe_full() returns 1.
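To spell out the arithmetic behind that trace line, here is a minimal userspace
sketch. The two helpers are re-created from memory of the ones in
include/linux/pipe_fs_i.h, so treat their exact form as an assumption;
trace_case_reports_full() is a hypothetical name used only for this illustration.

```c
#include <stdbool.h>

/* Userspace stand-ins modeled on the kernel helpers (assumed form). */
static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail)
{
	/* Unsigned subtraction: wraps to a huge value if tail > head. */
	return head - tail;
}

static inline bool pipe_full(unsigned int head, unsigned int tail,
			     unsigned int limit)
{
	return pipe_occupancy(head, tail) >= limit;
}

/* Hypothetical helper: evaluates the values seen in the trace. */
bool trace_case_reports_full(void)
{
	unsigned int head = 37, tail = 38, max_usage = 16;

	/* 37 - 38 wraps to UINT_MAX, which is >= 16, so this is true. */
	return pipe_full(head, tail, max_usage);
}
```

With head = 37 and tail = 38 the occupancy wraps to the largest unsigned value,
so the pipe is reported as full even though a tail ahead of head is not a state
the pipe can really be in.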
Between the read of head and the later read of tail, the tail seems to have moved
past the (by then stale) head value, so the unsigned subtraction head - tail wraps
around. With the following changes applied, I have not yet run into a hang on the
original machine where I first saw it:
diff --git a/fs/pipe.c b/fs/pipe.c
index ce1af7592780..a1931c817822 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
 /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
 static inline bool pipe_writable(const struct pipe_inode_info *pipe)
 {
-	unsigned int head = READ_ONCE(pipe->head);
-	unsigned int tail = READ_ONCE(pipe->tail);
 	unsigned int max_usage = READ_ONCE(pipe->max_usage);
+	unsigned int head, tail;
+
+	tail = READ_ONCE(pipe->tail);
+	/*
+	 * Since the unsigned arithmetic in this lockless preemptible context
+	 * relies on the fact that the tail can never be ahead of head, read
+	 * the head after the tail to ensure we've not missed any updates to
+	 * the head. Reordering the reads can cause wraparounds and give the
+	 * illusion that the pipe is full.
+	 */
+	smp_rmb();
+	head = READ_ONCE(pipe->head);
 
 	return !pipe_full(head, tail, max_usage) ||
 		!READ_ONCE(pipe->readers);
---
smp_rmb() on x86 is only a compiler barrier (it emits no fence instruction), and
even without the barrier, with just the reads reordered, we were not able to
reproduce the hang after 10000 iterations.
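As a rough illustration of why the read order matters, here is a single-threaded
userspace model. The interleaving in others_run() is hypothetical: it was picked
only to land on the head/tail values from the trace and is not a sequence we
actually captured. The struct and function names are made up for this sketch.

```c
#include <stdbool.h>

struct ring {
	unsigned int head;
	unsigned int tail;
};

static inline bool pipe_full(unsigned int head, unsigned int tail,
			     unsigned int limit)
{
	return head - tail >= limit;
}

/*
 * Hypothetical activity that runs between the waiter's two loads:
 * one more write (head 37 -> 38) and two reads (tail 36 -> 38).
 * Both indices only ever move forward.
 */
static inline void others_run(struct ring *r)
{
	r->head += 1;
	r->tail += 2;
}

/* Old order: head is sampled first and is stale by the time tail is read. */
static inline bool full_head_first(struct ring r, unsigned int limit)
{
	unsigned int head = r.head;

	others_run(&r);
	unsigned int tail = r.tail;

	return pipe_full(head, tail, limit);	/* 37 - 38 wraps: "full" */
}

/* New order: a tail sampled earlier can never exceed a head sampled later. */
static inline bool full_tail_first(struct ring r, unsigned int limit)
{
	unsigned int tail = r.tail;

	others_run(&r);
	unsigned int head = r.head;

	return pipe_full(head, tail, limit);	/* 38 - 36 = 2: not full */
}
```

Starting from head = 37, tail = 36 with a limit of 16, the head-first order
reports a full pipe while the tail-first order does not. This models only the
ordering of the two loads; it says nothing about what the hardware or compiler
actually did on the affected machines.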
If you think this is a genuine bug fix, I will send a patch for this.
Thanks to Prateek, who was actively involved in debugging this.
--
Thanks and Regards,
Swapnil
> Oleg.
> ---
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 4336b8cccf84..524b8845523e 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -445,7 +445,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> return 0;
>
> mutex_lock(&pipe->mutex);
> -
> +again:
> if (!pipe->readers) {
> send_sig(SIGPIPE, current, 0);
> ret = -EPIPE;
> @@ -467,20 +467,24 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> unsigned int mask = pipe->ring_size - 1;
> struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
> int offset = buf->offset + buf->len;
> + int xxx;
>
> if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
> offset + chars <= PAGE_SIZE) {
> - ret = pipe_buf_confirm(pipe, buf);
> - if (ret)
> + xxx = pipe_buf_confirm(pipe, buf);
> + if (xxx) {
> + if (!ret) ret = xxx;
> goto out;
> + }
>
> - ret = copy_page_from_iter(buf->page, offset, chars, from);
> - if (unlikely(ret < chars)) {
> - ret = -EFAULT;
> + xxx = copy_page_from_iter(buf->page, offset, chars, from);
> + if (unlikely(xxx < chars)) {
> + if (!ret) ret = -EFAULT;
> goto out;
> }
>
> - buf->len += ret;
> + ret += xxx;
> + buf->len += xxx;
> if (!iov_iter_count(from))
> goto out;
> }
> @@ -567,6 +571,7 @@ atomic_inc(&WR_SLEEP);
> mutex_lock(&pipe->mutex);
> was_empty = pipe_empty(pipe->head, pipe->tail);
> wake_next_writer = true;
> + goto again;
> }
> out:
> if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
>
View attachment "debug.diff" of type "text/plain" (8448 bytes)