[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250304135138.1278-1-kprateek.nayak@amd.com>
Date: Tue, 4 Mar 2025 13:51:38 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner
<brauner@...nel.org>, Linus Torvalds <torvalds@...ux-foundation.org>, "Oleg
Nesterov" <oleg@...hat.com>, Swapnil Sapkal <swapnil.sapkal@....com>, "Alexey
Gladkov" <legion@...nel.org>, <linux-fsdevel@...r.kernel.org>,
<linux-kernel@...r.kernel.org>
CC: Jan Kara <jack@...e.cz>, Mateusz Guzik <mjguzik@...il.com>, Manfred Spraul
<manfred@...orfullife.com>, David Howells <dhowells@...hat.com>, WangYuli
<wangyuli@...ontech.com>, Hillf Danton <hdanton@...a.com>, "Gautham R.
Shenoy" <gautham.shenoy@....com>, <Neeraj.Upadhyay@....com>,
<Ananth.narayan@....com>, K Prateek Nayak <kprateek.nayak@....com>
Subject: [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex
From: Linus Torvalds <torvalds@...ux-foundation.org>
pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head"
and "pipe->tail" outside of "pipe->mutex" critical section. When the
head and the tail are read individually in that order, there is a window
for interruption between the two reads in which both the head and the
tail can be updated by concurrent readers and writers.
One of the problematic scenarios observed with hackbench running
multiple groups on a large server on a particular pipe inode is as
follows:
pipe->head = 36
pipe->tail = 36
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wakes up: pipe not full*
hackbench-118762 [057] ..... 1029.550548: pipe_write: head: 36 -> 37 [tail: 36]
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next reader 118740*
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next writer 118768*
hackbench-118768 [206] ..... 1029.55055X: pipe_write: *writer wakes up*
hackbench-118768 [206] ..... 1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37]
... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces)
hackbench-118740 [057] ..... 1029.550558: pipe_read: *reader wakes up: pipe is not empty*
hackbench-118740 [057] ..... 1029.550558: pipe_read: tail: 36 -> 37 [head = 37]
hackbench-118740 [057] ..... 1029.550559: pipe_read: *pipe is empty; wakeup writer 118768*
hackbench-118740 [057] ..... 1029.550559: pipe_read: *sleeps*
hackbench-118766 [185] ..... 1029.550592: pipe_write: *New writer comes in*
hackbench-118766 [185] ..... 1029.550592: pipe_write: head: 37 -> 38 [tail: 37]
hackbench-118766 [185] ..... 1029.550592: pipe_write: *wakes up reader 118766*
hackbench-118740 [185] ..... 1029.550598: pipe_read: *reader wakes up; pipe not empty*
hackbench-118740 [185] ..... 1029.550599: pipe_read: tail: 37 -> 38 [head: 38]
hackbench-118740 [185] ..... 1029.550599: pipe_read: *pipe is empty*
hackbench-118740 [185] ..... 1029.550599: pipe_read: *reader sleeps; wakeup writer 118768*
... CPU 206 switches back to writer
hackbench-118768 [206] ..... 1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38]
hackbench-118768 [206] ..... 1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes
hackbench-118768 [206] ..... 1029.550601: pipe_write: *writer goes back to sleep*
[ Tasks 118740 and 118768 can then indefinitely wait on each other. ]
The unsigned arithmetic in pipe_occupancy() wraps around when
"pipe->tail > pipe->head" leading to pipe_full() returning true despite
the pipe being empty.
The case of genuine wraparound of "pipe->head" is handled since pipe
buffer has data allowing readers to make progress until the pipe->tail
wraps too after which the reader will wakeup a sleeping writer, however,
mistaking the pipe to be full when it is in fact empty can lead to
readers and writers waiting on each other indefinitely.
This issue became more problematic and surfaced as a hang in hackbench
after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up
the writer if the pipe is still full") significantly reduced the number
of spurious wakeups of writers that had previously helped mask the
issue.
To avoid missing any updates between the reads of "pipe->head" and
"pipe->write", unionize the two with a single unsigned long
"pipe->head_tail" member that can be loaded atomically.
Using "pipe->head_tail" to read the head and the tail ensures the
lockless checks do not miss any updates to the head or the tail and
since those two are only updated under "pipe->mutex", it ensures that
the head is always ahead of, or equal to the tail resulting in correct
calculations.
[ prateek: commit log, testing on x86 platforms. ]
Reported-and-debugged-by: Swapnil Sapkal <swapnil.sapkal@....com>
Closes: https://lore.kernel.org/lkml/e813814e-7094-4673-bc69-731af065a0eb@amd.com/
Reported-by: Alexey Gladkov <legion@...nel.org>
Closes: https://lore.kernel.org/all/Z8Wn0nTvevLRG_4m@example.org/
Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
Tested-by: Swapnil Sapkal <swapnil.sapkal@....com>
Reviewed-by: Oleg Nesterov <oleg@...hat.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
Changes are based on:
git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs-6.15.pipe
at commit commit ee5eda8ea595 ("pipe: change pipe_write() to never add a
zero-sized buffer") but also applies cleanly on top of v6.14-rc5.
The diff from Linus is kept as is except for removing the whitespaces in
front of the typedef that checkpatch complained about (the warning on
usage of typedef itself has been ignored since I could not think of a
better alternative other than #ifdef hackery in pipe_inode_info and the
newly introduced pipe_index union.) and the suggestion from Oleg to
explicitly initialize the "head_tail" with:
union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }
I went with commit 8cefc107ca54 ("pipe: Use head and tail pointers for
the ring, not cursor and length") for the "Fixes:" tag since pipe_poll()
added:
unsigned int head = READ_ONCE(pipe->head);
unsigned int tail = READ_ONCE(pipe->tail);
poll_wait(filp, &pipe->wait, wait);
BUG_ON(pipe_occupancy(head, tail) > pipe->ring_size);
and the race described can trigger that BUG_ON() but as Linus pointed
out in [1] the commit 85190d15f4ea ("pipe: don't use 'pipe_wait() for
basic pipe IO") is probably the one that can cause the writers to
sleep on empty pipe since the pipe_wait() used prior did not drop the
pipe lock until it called schedule() and prepare_to_wait() was called
before pipe_unlock() ensuring no races.
[1] https://lore.kernel.org/all/CAHk-=wh804HX8H86VRUSKoJGVG0eBe8sPz8=E3d8LHftOCSqwQ@mail.gmail.com/
Please let me know if the patch requires any modifications and I'll jump
right on it. The changes have been tested on both a 5th Generation AMD
EPYC system and on a dual socket Intel Emerald Rapids system with
multiple thousand iterations of hackbench for small and large loop
counts. Thanks a ton to Swapnil for all the help.
---
fs/pipe.c | 19 ++++++++-----------
include/linux/pipe_fs_i.h | 39 +++++++++++++++++++++++++++++++++++++--
2 files changed, 45 insertions(+), 13 deletions(-)
diff --git a/fs/pipe.c b/fs/pipe.c
index b0641f75b1ba..780990f307ab 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -210,11 +210,10 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = {
/* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
static inline bool pipe_readable(const struct pipe_inode_info *pipe)
{
- unsigned int head = READ_ONCE(pipe->head);
- unsigned int tail = READ_ONCE(pipe->tail);
+ union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) };
unsigned int writers = READ_ONCE(pipe->writers);
- return !pipe_empty(head, tail) || !writers;
+ return !pipe_empty(idx.head, idx.tail) || !writers;
}
static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
@@ -403,11 +402,10 @@ static inline int is_packetized(struct file *file)
/* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
static inline bool pipe_writable(const struct pipe_inode_info *pipe)
{
- unsigned int head = READ_ONCE(pipe->head);
- unsigned int tail = READ_ONCE(pipe->tail);
+ union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) };
unsigned int max_usage = READ_ONCE(pipe->max_usage);
- return !pipe_full(head, tail, max_usage) ||
+ return !pipe_full(idx.head, idx.tail, max_usage) ||
!READ_ONCE(pipe->readers);
}
@@ -649,7 +647,7 @@ pipe_poll(struct file *filp, poll_table *wait)
{
__poll_t mask;
struct pipe_inode_info *pipe = filp->private_data;
- unsigned int head, tail;
+ union pipe_index idx;
/* Epoll has some historical nasty semantics, this enables them */
WRITE_ONCE(pipe->poll_usage, true);
@@ -670,19 +668,18 @@ pipe_poll(struct file *filp, poll_table *wait)
* if something changes and you got it wrong, the poll
* table entry will wake you up and fix it.
*/
- head = READ_ONCE(pipe->head);
- tail = READ_ONCE(pipe->tail);
+ idx.head_tail = READ_ONCE(pipe->head_tail);
mask = 0;
if (filp->f_mode & FMODE_READ) {
- if (!pipe_empty(head, tail))
+ if (!pipe_empty(idx.head, idx.tail))
mask |= EPOLLIN | EPOLLRDNORM;
if (!pipe->writers && filp->f_pipe != pipe->w_counter)
mask |= EPOLLHUP;
}
if (filp->f_mode & FMODE_WRITE) {
- if (!pipe_full(head, tail, pipe->max_usage))
+ if (!pipe_full(idx.head, idx.tail, pipe->max_usage))
mask |= EPOLLOUT | EPOLLWRNORM;
/*
* Most Unices do not set EPOLLERR for FIFOs but on Linux they
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8ff23bf5a819..3cc4f8eab853 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -31,6 +31,33 @@ struct pipe_buffer {
unsigned long private;
};
+/*
+ * Really only alpha needs 32-bit fields, but
+ * might as well do it for 64-bit architectures
+ * since that's what we've historically done,
+ * and it makes 'head_tail' always be a simple
+ * 'unsigned long'.
+ */
+#ifdef CONFIG_64BIT
+typedef unsigned int pipe_index_t;
+#else
+typedef unsigned short pipe_index_t;
+#endif
+
+/*
+ * We have to declare this outside 'struct pipe_inode_info',
+ * but then we can't use 'union pipe_index' for an anonymous
+ * union, so we end up having to duplicate this declaration
+ * below. Annoying.
+ */
+union pipe_index {
+ unsigned long head_tail;
+ struct {
+ pipe_index_t head;
+ pipe_index_t tail;
+ };
+};
+
/**
* struct pipe_inode_info - a linux kernel pipe
* @mutex: mutex protecting the whole thing
@@ -58,8 +85,16 @@ struct pipe_buffer {
struct pipe_inode_info {
struct mutex mutex;
wait_queue_head_t rd_wait, wr_wait;
- unsigned int head;
- unsigned int tail;
+
+ /* This has to match the 'union pipe_index' above */
+ union {
+ unsigned long head_tail;
+ struct {
+ pipe_index_t head;
+ pipe_index_t tail;
+ };
+ };
+
unsigned int max_usage;
unsigned int ring_size;
unsigned int nr_accounted;
base-commit: ee5eda8ea59546af2e8f192c060fbf29862d7cbd
--
2.34.1
Powered by blists - more mailing lists