lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z8dIKGCSRWqUqAEI@example.org>
Date: Tue, 4 Mar 2025 19:36:24 +0100
From: Alexey Gladkov <legion@...nel.org>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Alexander Viro <viro@...iv.linux.org.uk>,
	Christian Brauner <brauner@...nel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Oleg Nesterov <oleg@...hat.com>,
	Swapnil Sapkal <swapnil.sapkal@....com>,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	Jan Kara <jack@...e.cz>, Mateusz Guzik <mjguzik@...il.com>,
	Manfred Spraul <manfred@...orfullife.com>,
	David Howells <dhowells@...hat.com>,
	WangYuli <wangyuli@...ontech.com>, Hillf Danton <hdanton@...a.com>,
	"Gautham R. Shenoy" <gautham.shenoy@....com>,
	Neeraj.Upadhyay@....com, Ananth.narayan@....com
Subject: Re: [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside
 pipe->mutex

On Tue, Mar 04, 2025 at 01:51:38PM +0000, K Prateek Nayak wrote:
> From: Linus Torvalds <torvalds@...ux-foundation.org>
> 
> pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head"
> and "pipe->tail" outside of "pipe->mutex" critical section. When the
> head and the tail are read individually in that order, there is a window
> for interruption between the two reads in which both the head and the
> tail can be updated by concurrent readers and writers.
> 
> One of the problematic scenarios observed with hackbench running
> multiple groups on a large server on a particular pipe inode is as
> follows:
> 
>     pipe->head = 36
>     pipe->tail = 36
> 
>     hackbench-118762  [057] .....  1029.550548: pipe_write: *wakes up: pipe not full*
>     hackbench-118762  [057] .....  1029.550548: pipe_write: head: 36 -> 37 [tail: 36]
>     hackbench-118762  [057] .....  1029.550548: pipe_write: *wake up next reader 118740*
>     hackbench-118762  [057] .....  1029.550548: pipe_write: *wake up next writer 118768*
> 
>     hackbench-118768  [206] .....  1029.55055X: pipe_write: *writer wakes up*
>     hackbench-118768  [206] .....  1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37]
>     ... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces)
> 
>     hackbench-118740  [057] .....  1029.550558: pipe_read:  *reader wakes up: pipe is not empty*
>     hackbench-118740  [057] .....  1029.550558: pipe_read:  tail: 36 -> 37 [head = 37]
>     hackbench-118740  [057] .....  1029.550559: pipe_read:  *pipe is empty; wakeup writer 118768*
>     hackbench-118740  [057] .....  1029.550559: pipe_read:  *sleeps*
> 
>     hackbench-118766  [185] .....  1029.550592: pipe_write: *New writer comes in*
>     hackbench-118766  [185] .....  1029.550592: pipe_write: head: 37 -> 38 [tail: 37]
>     hackbench-118766  [185] .....  1029.550592: pipe_write: *wakes up reader 118766*
> 
>     hackbench-118740  [185] .....  1029.550598: pipe_read:  *reader wakes up; pipe not empty*
>     hackbench-118740  [185] .....  1029.550599: pipe_read:  tail: 37 -> 38 [head: 38]
>     hackbench-118740  [185] .....  1029.550599: pipe_read:  *pipe is empty*
>     hackbench-118740  [185] .....  1029.550599: pipe_read:  *reader sleeps; wakeup writer 118768*
> 
>     ... CPU 206 switches back to writer
>     hackbench-118768  [206] .....  1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38]
>     hackbench-118768  [206] .....  1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes
>     hackbench-118768  [206] .....  1029.550601: pipe_write: *writer goes back to sleep*
> 
>     [ Tasks 118740 and 118768 can then indefinitely wait on each other. ]
> 
> The unsigned arithmetic in pipe_occupancy() wraps around when
> "pipe->tail > pipe->head" leading to pipe_full() returning true despite
> the pipe being empty.
> 
> The case of genuine wraparound of "pipe->head" is handled since pipe
> buffer has data allowing readers to make progress until the pipe->tail
> wraps too after which the reader will wakeup a sleeping writer, however,
> mistaking the pipe to be full when it is in fact empty can lead to
> readers and writers waiting on each other indefinitely.
> 
> This issue became more problematic and surfaced as a hang in hackbench
> after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up
> the writer if the pipe is still full") significantly reduced the number
> of spurious wakeups of writers that had previously helped mask the
> issue.
> 
> To avoid missing any updates between the reads of "pipe->head" and
> "pipe->write", unionize the two with a single unsigned long
> "pipe->head_tail" member that can be loaded atomically.
> 
> Using "pipe->head_tail" to read the head and the tail ensures the
> lockless checks do not miss any updates to the head or the tail and
> since those two are only updated under "pipe->mutex", it ensures that
> the head is always ahead of, or equal to the tail resulting in correct
> calculations.
> 
>   [ prateek: commit log, testing on x86 platforms. ]
> 
> Reported-and-debugged-by: Swapnil Sapkal <swapnil.sapkal@....com>
> Closes: https://lore.kernel.org/lkml/e813814e-7094-4673-bc69-731af065a0eb@amd.com/
> Reported-by: Alexey Gladkov <legion@...nel.org>
> Closes: https://lore.kernel.org/all/Z8Wn0nTvevLRG_4m@example.org/
> Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
> Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
> Tested-by: Swapnil Sapkal <swapnil.sapkal@....com>
> Reviewed-by: Oleg Nesterov <oleg@...hat.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>

With this patch, I'm also not reproducing the problem. Thanks!

> ---
> Changes are based on:
> 
>   git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs-6.15.pipe
> 
> at commit commit ee5eda8ea595 ("pipe: change pipe_write() to never add a
> zero-sized buffer") but also applies cleanly on top of v6.14-rc5.
> 
> The diff from Linus is kept as is except for removing the whitespaces in
> front of the typedef that checkpatch complained about (the warning on
> usage of typedef itself has been ignored since I could not think of a
> better alternative other than #ifdef hackery in pipe_inode_info and the
> newly introduced pipe_index union.) and the suggestion from Oleg to
> explicitly initialize the "head_tail" with:
> 
>     union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }
> 
> I went with commit 8cefc107ca54 ("pipe: Use head and tail pointers for
> the ring, not cursor and length") for the "Fixes:" tag since pipe_poll()
> added:
> 
>     unsigned int head = READ_ONCE(pipe->head);
>     unsigned int tail = READ_ONCE(pipe->tail);
> 
>     poll_wait(filp, &pipe->wait, wait);
> 
>     BUG_ON(pipe_occupancy(head, tail) > pipe->ring_size);
> 
> and the race described can trigger that BUG_ON() but as Linus pointed
> out in [1] the commit 85190d15f4ea ("pipe: don't use 'pipe_wait() for
> basic pipe IO") is probably the one that can cause the writers to
> sleep on empty pipe since the pipe_wait() used prior did not drop the
> pipe lock until it called schedule() and prepare_to_wait() was called
> before pipe_unlock() ensuring no races.
> 
> [1] https://lore.kernel.org/all/CAHk-=wh804HX8H86VRUSKoJGVG0eBe8sPz8=E3d8LHftOCSqwQ@mail.gmail.com/
> 
> Please let me know if the patch requires any modifications and I'll jump
> right on it. The changes have been tested on both a 5th Generation AMD
> EPYC system and on a dual socket Intel Emerald Rapids system with
> multiple thousand iterations of hackbench for small and large loop
> counts. Thanks a ton to Swapnil for all the help.
> ---
>  fs/pipe.c                 | 19 ++++++++-----------
>  include/linux/pipe_fs_i.h | 39 +++++++++++++++++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index b0641f75b1ba..780990f307ab 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -210,11 +210,10 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = {
>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>  static inline bool pipe_readable(const struct pipe_inode_info *pipe)
>  {
> -	unsigned int head = READ_ONCE(pipe->head);
> -	unsigned int tail = READ_ONCE(pipe->tail);
> +	union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) };
>  	unsigned int writers = READ_ONCE(pipe->writers);
>  
> -	return !pipe_empty(head, tail) || !writers;
> +	return !pipe_empty(idx.head, idx.tail) || !writers;
>  }
>  
>  static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
> @@ -403,11 +402,10 @@ static inline int is_packetized(struct file *file)
>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>  static inline bool pipe_writable(const struct pipe_inode_info *pipe)
>  {
> -	unsigned int head = READ_ONCE(pipe->head);
> -	unsigned int tail = READ_ONCE(pipe->tail);
> +	union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) };
>  	unsigned int max_usage = READ_ONCE(pipe->max_usage);
>  
> -	return !pipe_full(head, tail, max_usage) ||
> +	return !pipe_full(idx.head, idx.tail, max_usage) ||
>  		!READ_ONCE(pipe->readers);
>  }
>  
> @@ -649,7 +647,7 @@ pipe_poll(struct file *filp, poll_table *wait)
>  {
>  	__poll_t mask;
>  	struct pipe_inode_info *pipe = filp->private_data;
> -	unsigned int head, tail;
> +	union pipe_index idx;
>  
>  	/* Epoll has some historical nasty semantics, this enables them */
>  	WRITE_ONCE(pipe->poll_usage, true);
> @@ -670,19 +668,18 @@ pipe_poll(struct file *filp, poll_table *wait)
>  	 * if something changes and you got it wrong, the poll
>  	 * table entry will wake you up and fix it.
>  	 */
> -	head = READ_ONCE(pipe->head);
> -	tail = READ_ONCE(pipe->tail);
> +	idx.head_tail = READ_ONCE(pipe->head_tail);
>  
>  	mask = 0;
>  	if (filp->f_mode & FMODE_READ) {
> -		if (!pipe_empty(head, tail))
> +		if (!pipe_empty(idx.head, idx.tail))
>  			mask |= EPOLLIN | EPOLLRDNORM;
>  		if (!pipe->writers && filp->f_pipe != pipe->w_counter)
>  			mask |= EPOLLHUP;
>  	}
>  
>  	if (filp->f_mode & FMODE_WRITE) {
> -		if (!pipe_full(head, tail, pipe->max_usage))
> +		if (!pipe_full(idx.head, idx.tail, pipe->max_usage))
>  			mask |= EPOLLOUT | EPOLLWRNORM;
>  		/*
>  		 * Most Unices do not set EPOLLERR for FIFOs but on Linux they
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 8ff23bf5a819..3cc4f8eab853 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -31,6 +31,33 @@ struct pipe_buffer {
>  	unsigned long private;
>  };
>  
> +/*
> + * Really only alpha needs 32-bit fields, but
> + * might as well do it for 64-bit architectures
> + * since that's what we've historically done,
> + * and it makes 'head_tail' always be a simple
> + * 'unsigned long'.
> + */
> +#ifdef CONFIG_64BIT
> +typedef unsigned int pipe_index_t;
> +#else
> +typedef unsigned short pipe_index_t;
> +#endif
> +
> +/*
> + * We have to declare this outside 'struct pipe_inode_info',
> + * but then we can't use 'union pipe_index' for an anonymous
> + * union, so we end up having to duplicate this declaration
> + * below. Annoying.
> + */
> +union pipe_index {
> +	unsigned long head_tail;
> +	struct {
> +		pipe_index_t head;
> +		pipe_index_t tail;
> +	};
> +};
> +
>  /**
>   *	struct pipe_inode_info - a linux kernel pipe
>   *	@mutex: mutex protecting the whole thing
> @@ -58,8 +85,16 @@ struct pipe_buffer {
>  struct pipe_inode_info {
>  	struct mutex mutex;
>  	wait_queue_head_t rd_wait, wr_wait;
> -	unsigned int head;
> -	unsigned int tail;
> +
> +	/* This has to match the 'union pipe_index' above */
> +	union {
> +		unsigned long head_tail;
> +		struct {
> +			pipe_index_t head;
> +			pipe_index_t tail;
> +		};
> +	};
> +
>  	unsigned int max_usage;
>  	unsigned int ring_size;
>  	unsigned int nr_accounted;
> 
> base-commit: ee5eda8ea59546af2e8f192c060fbf29862d7cbd
> -- 
> 2.34.1
> 

-- 
Rgrds, legion


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ