Message-Id: <20250703-nfsd-testing-v1-0-cece54f36556@kernel.org>
Date: Thu, 03 Jul 2025 15:53:11 -0400
From: Jeff Layton <jlayton@...nel.org>
To: Trond Myklebust <trondmy@...nel.org>, Anna Schumaker <anna@...nel.org>, 
 Chuck Lever <chuck.lever@...cle.com>, NeilBrown <neil@...wn.name>, 
 Olga Kornievskaia <okorniev@...hat.com>, Dai Ngo <Dai.Ngo@...cle.com>, 
 Tom Talpey <tom@...pey.com>, Mike Snitzer <snitzer@...nel.org>
Cc: linux-nfs@...r.kernel.org, linux-kernel@...r.kernel.org, 
 Jeff Layton <jlayton@...nel.org>
Subject: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after
 READ/WRITE/COMMIT

Chuck and I were discussing RWF_DONTCACHE and he suggested that this
might be an alternative approach. My main gripe with DONTCACHE was that it
kicks off writeback after every WRITE operation. With NFS, we generally
get a COMMIT operation at some point, and batching up writes until that
point has traditionally been considered better for performance.

Instead of RWF_DONTCACHE, this patch has nfsd issue generic_fadvise(...,
POSIX_FADV_DONTNEED) on the appropriate range after any READ, stable
WRITE or COMMIT operation. This means that it doesn't change how and
when dirty data gets flushed to the disk, but still keeps resident
pagecache to a minimum.

For reference, here are some numbers from a fio run doing sequential
reads and writes, with the server in "normal" buffered I/O mode, with
Mike's RWF_DONTCACHE patch enabled, and with
fadvise(..., POSIX_FADV_DONTNEED).

Jobfile:

[global]
name=fio-seq-RW
filename=fio-seq-RW
rw=rw
rwmixread=60
rwmixwrite=40
bs=1M
direct=0
numjobs=16
time_based
runtime=300

[file1]
size=100G
ioengine=io_uring
iodepth=16

::::::::::::::::::::::::::::::::::::

3 runs each.

Baseline (nothing enabled):
Run status group 0 (all jobs):
   READ: bw=2999MiB/s (3144MB/s), 185MiB/s-189MiB/s (194MB/s-198MB/s), io=879GiB (944GB), run=300014-300087msec
  WRITE: bw=1998MiB/s (2095MB/s), 124MiB/s-126MiB/s (130MB/s-132MB/s), io=585GiB (629GB), run=300014-300087msec

   READ: bw=2866MiB/s (3005MB/s), 177MiB/s-181MiB/s (185MB/s-190MB/s), io=844GiB (906GB), run=301294-301463msec
  WRITE: bw=1909MiB/s (2002MB/s), 117MiB/s-121MiB/s (123MB/s-127MB/s), io=562GiB (604GB), run=301294-301463msec

   READ: bw=2885MiB/s (3026MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=846GiB (908GB), run=300017-300117msec
  WRITE: bw=1923MiB/s (2016MB/s), 118MiB/s-122MiB/s (124MB/s-128MB/s), io=563GiB (605GB), run=300017-300117msec

RWF_DONTCACHE:
Run status group 0 (all jobs):
   READ: bw=3088MiB/s (3238MB/s), 189MiB/s-195MiB/s (198MB/s-205MB/s), io=906GiB (972GB), run=300015-300276msec
  WRITE: bw=2058MiB/s (2158MB/s), 126MiB/s-129MiB/s (132MB/s-136MB/s), io=604GiB (648GB), run=300015-300276msec

   READ: bw=3116MiB/s (3267MB/s), 191MiB/s-197MiB/s (201MB/s-206MB/s), io=913GiB (980GB), run=300022-300074msec
  WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-137MB/s), io=609GiB (654GB), run=300022-300074msec

   READ: bw=3011MiB/s (3158MB/s), 185MiB/s-191MiB/s (194MB/s-200MB/s), io=886GiB (951GB), run=301049-301133msec
  WRITE: bw=2007MiB/s (2104MB/s), 123MiB/s-127MiB/s (129MB/s-133MB/s), io=590GiB (634GB), run=301049-301133msec

fadvise(..., POSIX_FADV_DONTNEED):
   READ: bw=2918MiB/s (3060MB/s), 180MiB/s-184MiB/s (188MB/s-193MB/s), io=855GiB (918GB), run=300014-300111msec
  WRITE: bw=1944MiB/s (2038MB/s), 120MiB/s-123MiB/s (125MB/s-129MB/s), io=570GiB (612GB), run=300014-300111msec

   READ: bw=2951MiB/s (3095MB/s), 182MiB/s-188MiB/s (191MB/s-197MB/s), io=867GiB (931GB), run=300529-300695msec
  WRITE: bw=1966MiB/s (2061MB/s), 121MiB/s-124MiB/s (127MB/s-130MB/s), io=577GiB (620GB), run=300529-300695msec

   READ: bw=2971MiB/s (3115MB/s), 181MiB/s-188MiB/s (190MB/s-197MB/s), io=871GiB (935GB), run=300015-300077msec
  WRITE: bw=1979MiB/s (2076MB/s), 122MiB/s-125MiB/s (128MB/s-131MB/s), io=580GiB (623GB), run=300015-300077msec

::::::::::::::::::::::::::::::

The numbers are pretty close, but it looks like RWF_DONTCACHE edges out
the other modes. Also, with the RWF_DONTCACHE and fadvise() modes, the
pagecache utilization stays very low on the server (which is, of course,
the point).

I think next I'll test a hybrid mode. Use RWF_DONTCACHE for READ and
stable WRITE operations, and do the fadvise() only after COMMITs.

Plumbing this in for NFSv4 will be "interesting" if we decide this approach
is sound, but it shouldn't be too bad if we only do it after a COMMIT.

Thoughts?

Signed-off-by: Jeff Layton <jlayton@...nel.org>
---
Jeff Layton (2):
      sunrpc: delay pc_release callback until after sending a reply
      nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT

 fs/nfsd/debugfs.c  |  2 ++
 fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
 fs/nfsd/nfsd.h     |  1 +
 fs/nfsd/nfsproc.c  |  4 ++--
 fs/nfsd/vfs.c      | 21 ++++++++++++++-----
 fs/nfsd/vfs.h      |  5 +++--
 fs/nfsd/xdr3.h     |  3 +++
 net/sunrpc/svc.c   | 19 ++++++++++++++----
 8 files changed, 92 insertions(+), 22 deletions(-)
---
base-commit: 38ddcbef7f4e9c5aa075c8ccf9f6d5293e027951
change-id: 20250701-nfsd-testing-12e7c8da5f1c

Best regards,
-- 
Jeff Layton <jlayton@...nel.org>

