Message-ID: <20231204014042.6754-2-neilb@suse.de>
Date: Mon, 4 Dec 2023 12:36:41 +1100
From: NeilBrown <neilb@...e.de>
To: Al Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>,
Jens Axboe <axboe@...nel.dk>, Oleg Nesterov <oleg@...hat.com>,
Chuck Lever <chuck.lever@...cle.com>,
Jeff Layton <jlayton@...nel.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>
Cc: linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-nfs@...r.kernel.org
Subject: [PATCH 1/2] Allow a kthread to declare that it calls task_work_run()

User-space processes always call task_work_run() as needed when
returning from a system call.  Kernel threads generally do not.
Because of this, some work that is best run in the task_works context
(which guarantees that no locks are held) cannot be queued to
task_works from kernel threads, and so is instead queued to a (single)
work item to be managed on a work queue.
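
For context, that pattern looks roughly like the following.  This is
only an illustrative sketch (the example_* names are invented, not
taken from the kernel), but it mirrors the fput()/delayed_fput()
arrangement in fs/file_table.c:

/*
 * Illustrative sketch only -- example_* names are invented; the real
 * code this mirrors is fput()/delayed_fput() in fs/file_table.c.
 */
#include <linux/sched.h>
#include <linux/preempt.h>
#include <linux/slab.h>
#include <linux/task_work.h>
#include <linux/workqueue.h>
#include <linux/llist.h>

struct example_obj {
	struct callback_head twork;	/* used for task_work_add() */
	struct llist_node llist;	/* used for the workqueue fallback */
};

static void example_cleanup(struct example_obj *obj)
{
	kfree(obj);			/* stand-in for the real cleanup */
}

static void example_cleanup_twork(struct callback_head *head)
{
	example_cleanup(container_of(head, struct example_obj, twork));
}

static LLIST_HEAD(example_deferred_list);

static void example_deferred_fn(struct work_struct *w)
{
	struct llist_node *node = llist_del_all(&example_deferred_list);
	struct example_obj *obj, *tmp;

	llist_for_each_entry_safe(obj, tmp, node, llist)
		example_cleanup(obj);
}
static DECLARE_DELAYED_WORK(example_deferred_work, example_deferred_fn);

static void example_release(struct example_obj *obj)
{
	struct task_struct *task = current;

	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
		/* Normal task: cleanup runs from task_work_run() on
		 * return to user space, with no locks held, and the
		 * task itself pays the cost. */
		init_task_work(&obj->twork, example_cleanup_twork);
		if (!task_work_add(task, &obj->twork, TWA_RESUME))
			return;
	}
	/* Kernel thread (or task_work_add() failed): hand the object to
	 * a single shared work item -- nothing slows the producer down. */
	if (llist_add(&obj->llist, &example_deferred_list))
		schedule_delayed_work(&example_deferred_work, 1);
}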

This means that any cost of doing the work is not imposed on the kernel
thread, and, importantly, excessive amounts of work cannot apply
back-pressure to reduce the amount of new work being queued.

I have evidence from a customer site where nfsd (which runs as kernel
threads) was asked to modify many millions of files, causing enough
memory pressure that some cache (in XFS, I think) was cleaned earlier
than would be ideal.  When __dput() (from the workqueue) calls
__dentry_kill(), xfs_fs_destroy_inode() needs to synchronously read
back previously cached information from storage.  This slows down the
single thread that is making all the final __dput() calls for all the
nfsd threads, with the net result that files are added to the
delayed_fput_list faster than they are removed and the system
eventually runs out of memory.

This happens because there is no back-pressure: nfsd isn't forced to
slow down when __dput() is slow for any reason.  To fix this we can
change the nfsd threads to call task_work_run() regularly (much like
user-space processes do) and allow them to declare this so that work
does get queued to task_works rather than to a work queue.

This patch adds a new process flag, PF_RUNS_TASK_WORK, which is now
used instead of PF_KTHREAD to determine whether it is sensible to queue
something to task_works.  This flag is always set for non-kernel
threads.

task_work_run() is also exported so that it can be called from a module
such as nfsd.
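
For illustration, a kernel thread that opts in might look something
like this (the example_* names below are invented for this sketch and
are not taken from the nfsd patch that follows):

/* Illustrative sketch of a kthread that opts in to running task works. */
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/task_work.h>

static void example_handle_one_request(void *data);	/* invented placeholder */

static int example_kthread_fn(void *data)
{
	/* Declare that this thread will call task_work_run() itself, so
	 * fput() and friends may queue work directly to it. */
	current->flags |= PF_RUNS_TASK_WORK;

	while (!kthread_should_stop()) {
		example_handle_one_request(data);

		/* Run any queued task works (e.g. final __fput()s) now,
		 * so this thread pays the cost and feels the
		 * back-pressure. */
		if (task_work_pending(current))
			task_work_run();
	}
	return 0;
}
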
Signed-off-by: NeilBrown <neilb@...e.de>
---
 fs/file_table.c       | 3 ++-
 fs/namespace.c        | 2 +-
 include/linux/sched.h | 2 +-
 kernel/fork.c         | 2 ++
 kernel/task_work.c    | 1 +
 5 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index ee21b3da9d08..d36cade6e366 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -435,7 +435,8 @@ void fput(struct file *file)
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		struct task_struct *task = current;
 
-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+		if (likely(!in_interrupt() &&
+			   (task->flags & PF_RUNS_TASK_WORK))) {
 			init_task_work(&file->f_rcuhead, ____fput);
 			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
 				return;
diff --git a/fs/namespace.c b/fs/namespace.c
index e157efc54023..46d640b70ca9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
 
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
 		struct task_struct *task = current;
-		if (likely(!(task->flags & PF_KTHREAD))) {
+		if (likely((task->flags & PF_RUNS_TASK_WORK))) {
 			init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
 			if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
 				return;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7..e4eebac708e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1747,7 +1747,7 @@ extern struct pid *cad_pid;
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
-#define PF__HOLE__00800000	0x00800000
+#define PF_RUNS_TASK_WORK	0x00800000	/* Will call task_work_run() periodically */
 #define PF__HOLE__01000000	0x01000000
 #define PF__HOLE__02000000	0x02000000
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..d612d8f14861 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2330,6 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
+	else
+		p->flags |= PF_RUNS_TASK_WORK;
 	if (args->user_worker) {
 		/*
 		 * Mark us a user worker, and block any signal that isn't
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..aec19876e121 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -183,3 +183,4 @@ void task_work_run(void)
 		} while (work);
 	}
 }
+EXPORT_SYMBOL(task_work_run);
--
2.43.0