linux-kernel - Re: [PATCH 1/2] Allow a kthread to declare that it calls task_work


Message-id: <170176610023.7109.11175368186869568821@noble.neil.brown.name>
Date:   Tue, 05 Dec 2023 19:48:20 +1100
From:   "NeilBrown" <neilb@...e.de>
To:     "Dave Chinner" <david@...morbit.com>
Cc:     "Al Viro" <viro@...iv.linux.org.uk>,
        "Christian Brauner" <brauner@...nel.org>,
        "Jens Axboe" <axboe@...nel.dk>, "Oleg Nesterov" <oleg@...hat.com>,
        "Chuck Lever" <chuck.lever@...cle.com>,
        "Jeff Layton" <jlayton@...nel.org>,
        "Ingo Molnar" <mingo@...hat.com>,
        "Peter Zijlstra" <peterz@...radead.org>,
        "Juri Lelli" <juri.lelli@...hat.com>,
        "Vincent Guittot" <vincent.guittot@...aro.org>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-nfs@...r.kernel.org
Subject: Re: [PATCH 1/2] Allow a kthread to declare that it calls task_work_run()

On Tue, 05 Dec 2023, Dave Chinner wrote:
> On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> > User-space processes always call task_work_run() as needed when
> > returning from a system call.  Kernel-threads generally do not.
> > Because of this some work that is best run in the task_works context
> > (guaranteed that no locks are held) cannot be queued to task_works from
> > kernel threads and so is queued to a (single) work_item to be managed
> > on a work queue.
> > 
> > This means that any cost for doing the work is not imposed on the kernel
> > thread, and importantly excessive amounts of work cannot apply
> > back-pressure to reduce the amount of new work queued.
> > 
> > I have evidence from a customer site when nfsd (which runs as kernel
> > threads) is being asked to modify many millions of files which causes
> > sufficient memory pressure that some cache (in XFS I think) gets cleaned
> > earlier than would be ideal.  When __dput (from the workqueue) calls
> > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> > previously cached info from storage.
> 
> We fixed that specific XFS problem in 5.9.
> 
> https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/

Good to know - thanks.

> 
> Can you reproduce these issues on a current TOT kernel?

I haven't tried.  I don't know whether I have enough details of the
workload to attempt it.

> 
> If not, there's no bugs to fix in the upstream kernel. If you can,
> then we've got more XFS issues to work through and fix. 
> 
> Fundamentally, though, we should not be papering over an XFS issue
> by changing how core task_work infrastructure is used. So let's deal
> with the XFS issue first....

I disagree.  This customer experience has demonstrated both a bug in XFS
and a bug in the interaction between fput, task_work, and nfsd.

If a bug in a filesystem that only causes a modest performance impact
when used through the syscall API can bring the system to its knees
through memory exhaustion when used by nfsd, then that is a robustness
issue for nfsd.

I want to fix that robustness issue so that unusual behaviour in
filesystems does not cause out-of-proportion bad behaviour in nfsd.

I highlighted this in the cover letter to the first version of my patch:

https://lore.kernel.org/all/170112272125.7109.6245462722883333440@noble.neil.brown.name/

  While this might point to a problem with the filesystem not handling the
  final close efficiently, such problems should only hurt throughput, not
  lead to memory exhaustion.
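
As a rough illustration of that distinction, here is a toy user-space
model, not kernel code.  The function names and the tick-based cost
model are assumptions made for this sketch only: each request leaves
one unit of deferred cleanup (standing in for a final fput), handling a
request costs one tick, and cleaning up costs cleanup_cost ticks.

```c
/*
 * Toy model (hypothetical, not kernel code).
 *
 * Case 1: cleanup is handed to a separate worker and the producer never
 * waits.  The worker gets one tick for every tick the producer spends,
 * so it retires one cleanup every cleanup_cost requests (cleanup_cost
 * must be >= 1).  Returns the peak number of pending cleanups.
 */
static long peak_backlog_async(long requests, long cleanup_cost)
{
    long backlog = 0, peak = 0, worker_progress = 0;

    for (long i = 0; i < requests; i++) {
        backlog++;                      /* request done, cleanup queued */
        if (backlog > peak)
            peak = backlog;
        if (++worker_progress >= cleanup_cost) {
            worker_progress = 0;
            backlog--;                  /* worker finished one cleanup */
        }
    }
    return peak;
}

/*
 * Case 2: the producer runs its own cleanup before accepting the next
 * request.  The cost lands on the producer (reported via *ticks), so
 * the backlog never exceeds one item.
 */
static long peak_backlog_inline(long requests, long cleanup_cost, long *ticks)
{
    long backlog = 0, peak = 0;

    *ticks = 0;
    for (long i = 0; i < requests; i++) {
        *ticks += 1;                    /* handle the request */
        backlog++;
        if (backlog > peak)
            peak = backlog;
        *ticks += cleanup_cost;         /* producer pays for the cleanup */
        backlog--;
    }
    return peak;
}
```

In this model a slow cleanup makes the first case accumulate a backlog
that grows with the number of requests, while the second case only
takes longer to finish: throughput suffers but the pending work stays
bounded.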

Thanks,
NeilBrown


> 
> -Dave.
> -- 
> Dave Chinner
> david@...morbit.com
>