linux-kernel - Writeback fixes for NFS

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87v9miydai.fsf@notabene.neil.brown.name>
Date:   Thu, 02 Apr 2020 10:52:21 +1100
From:   NeilBrown <neilb@...e.de>
To:     Trond Myklebust <trondmy@...merspace.com>,
        "Anna.Schumaker\@Netapp.com" <Anna.Schumaker@...app.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Jan Kara <jack@...e.cz>
Cc:     linux-mm@...ck.org, linux-nfs@...r.kernel.org,
        LKML <linux-kernel@...r.kernel.org>
Subject: Writeback fixes for NFS

Please ignore my previous patch (which this is in reply to), it was
flawed in various ways.

I now understand the code a bit better and have a somewhat simpler patch
which appears to address the same problem.

The problem is that writeback to NFS often produces lots of small writes
(10s of K) rather than fewer large writes (1M).  This pattern can often
hurt throughput, but in certain circumstances it can hurt NFS throughput
more than expected.

Each nfs_writepages() call results in an NFS commit being sent to the
server.  If writeback triggers lots of smaller nfs_writepages calls,
this means lots of COMMITs.  If the server is slow to handle the COMMIT
(I've seen the Ganesha NFS server take over 200ms per commit), these
COMMITs can overlap, queue up, and choke the NFS server and cause
order-of-magnitude drop in throughput.

So we really want to only call nfs_writepages when there are a largish
number of pages to be written - i.e. that are 'dirty'.

For historical reasons that I didn't thoroughly research but I'm
confident are no longer relevant, pages that have been written to the
NFS server but have not yet been the subject of a COMMIT - so-called
"unstable" pages - are effectively accounted that same as "dirty" pages
(sometimes called "reclaimable").
This can result in writeback thinking there are lots of "dirty" pages to
reclaim, while nfs_writepages can only find a few that it can write out.

The second patch following changes the accounting for these "unstable"
pages.  They are now always accounted exactly the same was writeback
pages.  Conceptually they can be thought of as still in writeback, but the
writeback is now happening on the server.

A COMMIT will always automatically follow the writes generated by
nfs_writepages, so from the perspective of the VM, there really is no
difference: It has scheduled the write and there is nothing else it can
do except wait.

Testing this patch showed that loop-back NFS is prone to deadlocks
again.  I cannot see exactly how the change to 'unstable' accounting
affected this, but I can see that the old +25% heuristic can no longer
be justified given the complexity of writeback calculations.
So the first patch following changes how writeback is handled for NFS
servers handling loop-back requests (and other similar services) so that
it is more obviously safe against excessive dirty pages scheduled for
other devices.

Thanks,
NeilBrown

Download attachment "signature.asc" of type "application/pgp-signature" (833 bytes)