Message-ID: <03a1f4af-47e0-459d-b2bf-9f65536fc2ab@amd.com>
Date: Mon, 3 Mar 2025 15:16:34 +0530
From: "Sapkal, Swapnil" <swapnil.sapkal@....com>
To: Oleg Nesterov <oleg@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>
CC: Mateusz Guzik <mjguzik@...il.com>, Manfred Spraul
	<manfred@...orfullife.com>, Linus Torvalds <torvalds@...ux-foundation.org>,
	Christian Brauner <brauner@...nel.org>, David Howells <dhowells@...hat.com>,
	WangYuli <wangyuli@...ontech.com>, <linux-fsdevel@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, "Shenoy, Gautham Ranjal"
	<gautham.shenoy@....com>, <Neeraj.Upadhyay@....com>, <Ananth.narayan@....com>
Subject: Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still
 full

Hi Oleg,

On 2/28/2025 10:03 PM, Oleg Nesterov wrote:
> And... I know, I know you already hate me ;)
> 

Not at all :)

> but if you have time, could you check if this patch (with or without the
> previous debugging patch) makes any difference? Just to be sure.
> 

Sure, I will give this a try.

In the meantime, Prateek and I ran some experiments over the weekend.
We were able to reproduce this issue on a third-generation EPYC system as well as
on an Intel Emerald Rapids system (2 x Intel(R) Xeon(R) Platinum 8592+).

We took a heavy-handed tracing approach on top of your debug patch.
I have attached the debug patch below. With tracing, we found the following case for
pipe_writable():

   hackbench-118768  [206] .....  1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1

Here,

head = 37
tail = 38
max_usage = 16
pipe_full() returns 1.
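
To spell out the arithmetic, here is a minimal userspace sketch modelled on the
pipe_occupancy()/pipe_full() helpers, assuming head and tail are plain unsigned
ints as in the trace. check_traced_case() is a hypothetical name used only for
this illustration; it is not kernel code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Userspace model of the helpers: occupancy is an unsigned difference. */
static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail)
{
	return head - tail;	/* wraps around if tail is ahead of head */
}

static inline bool pipe_full(unsigned int head, unsigned int tail,
			     unsigned int limit)
{
	return pipe_occupancy(head, tail) >= limit;
}

/* Hypothetical helper: replays the values from the trace line. */
void check_traced_case(void)
{
	/* 37 - 38 wraps to UINT_MAX, which is >= max_usage (16). */
	assert(pipe_occupancy(37, 38) == UINT_MAX);
	assert(pipe_full(37, 38, 16));

	/* A consistent snapshot (head == tail) is an empty pipe. */
	assert(!pipe_full(38, 38, 16));
}
```

Calling check_traced_case() from a test program passes silently: with the
inconsistent snapshot the pipe looks full even though it holds nothing.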

Between the read of head and the later read of tail, the tail seems to have moved
ahead of the head value we had already sampled, so head - tail wraps around. With
the following changes applied, I have not yet run into a hang on the original
machine where I first saw it:

diff --git a/fs/pipe.c b/fs/pipe.c
index ce1af7592780..a1931c817822 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
  static inline bool pipe_writable(const struct pipe_inode_info *pipe)
  {
-	unsigned int head = READ_ONCE(pipe->head);
-	unsigned int tail = READ_ONCE(pipe->tail);
  	unsigned int max_usage = READ_ONCE(pipe->max_usage);
+	unsigned int head, tail;
+
+	tail = READ_ONCE(pipe->tail);
+	/*
+	 * Since the unsigned arithmetic in this lockless preemptible context
+	 * relies on the fact that the tail can never be ahead of head, read
+	 * the head after the tail to ensure we've not missed any updates to
+	 * the head. Reordering the reads can cause wraparounds and give the
+	 * illusion that the pipe is full.
+	 */
+	smp_rmb();
+	head = READ_ONCE(pipe->head);
  
  	return !pipe_full(head, tail, max_usage) ||
  		!READ_ONCE(pipe->readers);
---

smp_rmb() is a nop on x86, and even with only the reads reordered (no barrier)
we were not able to reproduce the hang after 10000 iterations.
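
The ordering argument from the patch comment can be illustrated with a
single-threaded model. This is only a sketch under assumptions: struct
pipe_model, write_then_read() and the looks_full_*() functions are hypothetical
names, the starting state 37/37 is assumed (the trace shows only the sampled
values), and one write plus one read is injected between the two loads to stand
in for concurrent activity. It does not model CPU or compiler reordering, which
is what smp_rmb() addresses.

```c
#include <assert.h>
#include <stdbool.h>

struct pipe_model {
	unsigned int head;	/* advanced by writers */
	unsigned int tail;	/* advanced by readers, never past head */
};

/* Stand-in for concurrent activity: one buffer written, then consumed. */
static void write_then_read(struct pipe_model *p)
{
	p->head++;
	p->tail++;
}

/* Old order: head sampled first, tail sampled after the activity. */
bool looks_full_head_first(struct pipe_model *p, unsigned int max_usage)
{
	unsigned int head = p->head;

	write_then_read(p);

	unsigned int tail = p->tail;

	return head - tail >= max_usage;	/* stale head: may wrap */
}

/* New order: tail sampled first, so the later head is never behind it. */
bool looks_full_tail_first(struct pipe_model *p, unsigned int max_usage)
{
	unsigned int tail = p->tail;

	write_then_read(p);

	unsigned int head = p->head;

	return head - tail >= max_usage;
}

/* Hypothetical check: same start state and activity, different read order. */
void run_model_checks(void)
{
	struct pipe_model a = { 37, 37 };
	struct pipe_model b = { 37, 37 };

	assert(looks_full_head_first(&a, 16));	/* samples head=37, tail=38 */
	assert(!looks_full_tail_first(&b, 16));	/* samples tail=37, head=38 */
}
```

With tail sampled first, the snapshot can only overestimate the occupancy by
the amount written in between; it cannot wrap, because head never trails any
earlier value of tail.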

If you think this is a genuine bug fix, I will send a patch for this.

Thanks to Prateek, who was actively involved in debugging this.

--
Thanks and Regards,
Swapnil

> Oleg.
> ---
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 4336b8cccf84..524b8845523e 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -445,7 +445,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>   		return 0;
>   
>   	mutex_lock(&pipe->mutex);
> -
> +again:
>   	if (!pipe->readers) {
>   		send_sig(SIGPIPE, current, 0);
>   		ret = -EPIPE;
> @@ -467,20 +467,24 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>   		unsigned int mask = pipe->ring_size - 1;
>   		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
>   		int offset = buf->offset + buf->len;
> +		int xxx;
>   
>   		if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
>   		    offset + chars <= PAGE_SIZE) {
> -			ret = pipe_buf_confirm(pipe, buf);
> -			if (ret)
> +			xxx = pipe_buf_confirm(pipe, buf);
> +			if (xxx) {
> +				if (!ret) ret = xxx;
>   				goto out;
> +			}
>   
> -			ret = copy_page_from_iter(buf->page, offset, chars, from);
> -			if (unlikely(ret < chars)) {
> -				ret = -EFAULT;
> +			xxx = copy_page_from_iter(buf->page, offset, chars, from);
> +			if (unlikely(xxx < chars)) {
> +				if (!ret) ret = -EFAULT;
>   				goto out;
>   			}
>   
> -			buf->len += ret;
> +			ret += xxx;
> +			buf->len += xxx;
>   			if (!iov_iter_count(from))
>   				goto out;
>   		}
> @@ -567,6 +571,7 @@ atomic_inc(&WR_SLEEP);
>   		mutex_lock(&pipe->mutex);
>   		was_empty = pipe_empty(pipe->head, pipe->tail);
>   		wake_next_writer = true;
> +		goto again;
>   	}
>   out:
>   	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
> 

View attachment "debug.diff" of type "text/plain" (8448 bytes)
