linux-kernel - Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cda4542e4ae8b30a6f5628386388f813d3209558.camel@kernel.org>
Date: Tue, 08 Jul 2025 10:34:15 -0400
From: Jeff Layton <jlayton@...nel.org>
To: Chuck Lever <chuck.lever@...cle.com>, Trond Myklebust
 <trondmy@...nel.org>,  Anna Schumaker	 <anna@...nel.org>, NeilBrown
 <neil@...wn.name>, Olga Kornievskaia	 <okorniev@...hat.com>, Dai Ngo
 <Dai.Ngo@...cle.com>, Tom Talpey <tom@...pey.com>,  Mike Snitzer
 <snitzer@...nel.org>
Cc: linux-nfs@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ,
 stable WRITE or COMMIT

On Thu, 2025-07-03 at 16:07 -0400, Chuck Lever wrote:
> On 7/3/25 3:53 PM, Jeff Layton wrote:
> > Recent testing has shown that keeping pagecache pages around for too
> > long can be detrimental to performance with nfsd. Clients only rarely
> > revisit the same data, so the pages tend to just hang around.
> > 
> > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and
> > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed
> > range.
> > 
> > Signed-off-by: Jeff Layton <jlayton@...nel.org>
> > ---
> >  fs/nfsd/debugfs.c  |  2 ++
> >  fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
> >  fs/nfsd/nfsd.h     |  1 +
> >  fs/nfsd/nfsproc.c  |  4 ++--
> >  fs/nfsd/vfs.c      | 21 ++++++++++++++-----
> >  fs/nfsd/vfs.h      |  5 +++--
> >  fs/nfsd/xdr3.h     |  3 +++
> >  7 files changed, 77 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void)
> >  
> >  	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
> >  			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
> > +	debugfs_create_bool("enable-fadvise-dontneed", 0644,
> > +			    nfsd_top_dir, &nfsd_enable_fadvise_dontneed);
> 
> I prefer that this setting is folded into the new io_cache_read /
> io_cache_write tune-ables that Mike's patch adds, rather than adding
> a new boolean.
> 
> That might make a hybrid "DONTCACHE for READ and fadvise for WRITE"
> pretty easy.
> 

I ended up rebasing Mike's dontcache branch on top of v6.16-rc5 with
all of Chuck's trees in. I then added the attached patch and did some
testing with a couple of machines I checked out internally at Meta.
This is the throughput results with the fio-seq-RW test with the file
size set to 100G and the duration at 5 mins.

Note that:

read and writes buffered:
   READ: bw=3024MiB/s (3171MB/s), 186MiB/s-191MiB/s (195MB/s-201MB/s), io=889GiB (954GB), run=300012-300966msec
  WRITE: bw=2015MiB/s (2113MB/s), 124MiB/s-128MiB/s (131MB/s-134MB/s), io=592GiB (636GB), run=300012-300966msec

   READ: bw=2902MiB/s (3043MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=851GiB (913GB), run=300027-300118msec
  WRITE: bw=1934MiB/s (2027MB/s), 119MiB/s-122MiB/s (124MB/s-128MB/s), io=567GiB (608GB), run=300027-300118msec

   READ: bw=2897MiB/s (3037MB/s), 178MiB/s-183MiB/s (186MB/s-192MB/s), io=849GiB (911GB), run=300006-300078msec
  WRITE: bw=1930MiB/s (2023MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=565GiB (607GB), run=300006-300078msec

reads and writes RWF_DONTCACHE:
   READ: bw=3090MiB/s (3240MB/s), 190MiB/s-195MiB/s (199MB/s-205MB/s), io=906GiB (972GB), run=300015-300113msec
  WRITE: bw=2060MiB/s (2160MB/s), 126MiB/s-130MiB/s (132MB/s-137MB/s), io=604GiB (648GB), run=300015-300113msec

   READ: bw=3057MiB/s (3205MB/s), 188MiB/s-193MiB/s (198MB/s-203MB/s), io=897GiB (963GB), run=300329-300450msec
  WRITE: bw=2037MiB/s (2136MB/s), 126MiB/s-129MiB/s (132MB/s-135MB/s), io=598GiB (642GB), run=300329-300450msec

   READ: bw=3166MiB/s (3320MB/s), 196MiB/s-200MiB/s (205MB/s-210MB/s), io=928GiB (996GB), run=300021-300090msec
  WRITE: bw=2111MiB/s (2213MB/s), 131MiB/s-133MiB/s (137MB/s-140MB/s), io=619GiB (664GB), run=300021-300090msec

reads and writes witg O_DIRECT:
   READ: bw=3115MiB/s (3267MB/s), 192MiB/s-198MiB/s (201MB/s-208MB/s), io=913GiB (980GB), run=300025-300078msec
  WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-138MB/s), io=609GiB (653GB), run=300025-300078msec

   READ: bw=3189MiB/s (3343MB/s), 197MiB/s-202MiB/s (207MB/s-211MB/s), io=934GiB (1003GB), run=300023-300096msec
  WRITE: bw=2125MiB/s (2228MB/s), 132MiB/s-134MiB/s (138MB/s-140MB/s), io=623GiB (669GB), run=300023-300096msec

   READ: bw=3113MiB/s (3264MB/s), 191MiB/s-197MiB/s (200MB/s-207MB/s), io=912GiB (979GB), run=300020-300098msec
  WRITE: bw=2075MiB/s (2175MB/s), 127MiB/s-131MiB/s (134MB/s-138MB/s), io=608GiB (653GB), run=300020-300098msec

RWF_DONTCACHE on reads and stable writes + fadvise DONTNEED after COMMIT:
   READ: bw=2888MiB/s (3029MB/s), 178MiB/s-182MiB/s (187MB/s-191MB/s), io=846GiB (909GB), run=300012-300109msec
  WRITE: bw=1924MiB/s (2017MB/s), 118MiB/s-121MiB/s (124MB/s-127MB/s), io=564GiB (605GB), run=300012-300109msec

   READ: bw=2899MiB/s (3040MB/s), 180MiB/s-183MiB/s (188MB/s-192MB/s), io=852GiB (915GB), run=300022-300940msec
  WRITE: bw=1931MiB/s (2025MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=567GiB (609GB), run=300022-300940msec

   READ: bw=2902MiB/s (3043MB/s), 179MiB/s-184MiB/s (188MB/s-193MB/s), io=853GiB (916GB), run=300913-301146msec
  WRITE: bw=1933MiB/s (2027MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=568GiB (610GB), run=300913-301146msec


The fadvise case is clearly slower than the others. Interestingly it
also slowed down read performance, which leads me to believe that maybe
the fadvise calls were interfering with concurrent reads. Given the
disappointing numbers, I'll probably drop the last patch.

There is probably a case to be made for patch #1, on the general
principle of expediting sending the reply as much as possible. Chuck,
let me know if you want me to submit that individually.
-- 
Jeff Layton <jlayton@...nel.org>

View attachment "0001-nfsd-add-a-NFSD_IO_FADVISE-setting-to-io_cache_write.patch" of type "text/x-patch" (4348 bytes)