lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue,  9 May 2023 12:56:31 -0400
From:   Kent Overstreet <kent.overstreet@...ux.dev>
To:     linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-bcachefs@...r.kernel.org
Cc:     Kent Overstreet <kent.overstreet@...il.com>,
        Kent Overstreet <kent.overstreet@...ux.dev>,
        Jan Kara <jack@...e.cz>, "Darrick J . Wong" <djwong@...nel.org>
Subject: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

From: Kent Overstreet <kent.overstreet@...il.com>

This is used by bcachefs to fix a page cache coherency issue with
O_DIRECT writes.

Also relevant: mapping->invalidate_lock, see below.

O_DIRECT writes (and other filesystem operations that modify file data
while bypassing the page cache) need to shoot down ranges of the page
cache - and additionally, need locking to prevent those pages from
pulled back in.

But O_DIRECT writes invoke the page fault handler (via get_user_pages),
and the page fault handler will need to take that same lock - this is a
classic recursive deadlock if userspace has mmaped the file they're DIO
writing to and uses those pages for the buffer to write from, and it's a
lock ordering deadlock in general.

Thus we need a way to signal from the dio code to the page fault handler
when we already are holding the pagecache add lock on an address space -
this patch just adds a member to task_struct for this purpose. For now
only bcachefs is implementing this locking, though it may be moved out
of bcachefs and made available to other filesystems in the future.

---------------------------------

The closest current VFS equivalent is mapping->invalidate_lock, which
comes from XFS. However, it's not used by direct IO.  Instead, direct IO
paths shoot down the page cache twice - before starting the IO and at
the end, and they're still technically racy w.r.t. page cache coherency.

This is a more complete approach: in the future we might consider
replacing mapping->invalidate_lock with the bcachefs code.

Signed-off-by: Kent Overstreet <kent.overstreet@...ux.dev>
Cc: Jan Kara <jack@...e.cz>
Cc: Darrick J. Wong <djwong@...nel.org>
Cc: linux-fsdevel@...r.kernel.org
---
 include/linux/sched.h | 1 +
 init/init_task.c      | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b..f2a56f64f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -869,6 +869,7 @@ struct task_struct {
 
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
+	struct address_space		*faults_disabled_mapping;
 
 	int				exit_state;
 	int				exit_code;
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe..f703116e05 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -85,6 +85,7 @@ struct task_struct init_task
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
+	.faults_disabled_mapping = NULL,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
-- 
2.40.1

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ