Message-ID: <87tv2b7q72.fsf@notabene.neil.brown.name>
Date: Thu, 26 Mar 2020 14:25:21 +1100
From: NeilBrown <neilb@...e.de>
To: Trond Myklebust <trondmy@...merspace.com>,
"Anna.Schumaker\@Netapp.com" <Anna.Schumaker@...app.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Tejun Heo <tj@...nel.org>
cc: linux-mm@...ck.org, linux-nfs@...r.kernel.org,
LKML <linux-kernel@...r.kernel.org>
Subject: [PATCH/RFC] MM: fix writeback for NFS
Page accounting happens a little differently for NFS, and writeback is
aware of this, but handles it incorrectly.
With NFS, pages go through three states which are accounted separately:
NR_FILE_DIRTY - page is dirty and unwritten
NR_WRITEBACK - page is being written back
NR_UNSTABLE_NFS - page has been written, but might need to be written
again if server restarts. Once an NFS COMMIT
completes, page becomes freeable.
writeback combines the NR_FILE_DIRTY and NR_UNSTABLE_NFS counters
together, which doesn't make a lot of sense. NR_UNSTABLE_NFS is more
like NR_WRITEBACK in that there is nothing that can be done for these
pages except wait.
The current behaviour is that when there are a lot of NR_UNSTABLE_NFS
pages, balance_dirty_pages() repeatedly schedules the writeback thread,
even when the number of dirty pages is below the threshold. The result
is that each ->writepages() call into NFS finds relatively few DIRTY
pages, so it creates requests that are well below the maximum size.
Depending on the performance characteristics of the server, this can
substantially reduce throughput.
There are two places where balance_dirty_pages() schedules the writeback
thread, and each of them schedules writeback too often.
1/ When balance_dirty_pages() determines that it must wait for the
number of dirty pages to shrink, it unconditionally schedules
writeback.
This is probably not ideal for any filesystem, but I measured it to be
harmful for NFS. This writeback should only be scheduled if the
number of dirty pages is above the threshold. While it is below,
waiting for the pending writes to complete (and be committed) is a
more appropriate behaviour.
2/ When balance_dirty_pages() finishes, it again schedules writeback
if nr_reclaimable is above the background threshold. This is
conceptually correct, but wrong in practice because nr_reclaimable is
wrong. The implication of this usage is that nr_reclaimable is the
number of pages that could be reclaimed if writeback were triggered.
In fact it is NR_FILE_DIRTY plus NR_UNSTABLE_NFS, which is not a
meaningful number.
So this patch removes NR_UNSTABLE_NFS from nr_reclaimable, and only
schedules writeback when the new nr_reclaimable is greater than
->thresh or ->bg_thresh, depending on context.
The effect of this can be measured by looking at the count of
nfs_writepages calls while performing a large streaming write (larger
than the dirty threshold), e.g.:
mount -t nfs server:/path /mnt
dd if=/dev/zero of=/mnt/zeros bs=1M count=1000
awk '/events:/ {print $13}' /proc/self/mountstats
Each writepages call should normally submit writes for at least 4M
(1024 pages, i.e. MIN_WRITEBACK_PAGES), unless memory size is tiny.
With the patch I typically measure 200-300 writepages calls for the
above, which suggests an average of about 850 pages being made available
to each writepages call. This is less than MIN_WRITEBACK_PAGES, but
enough to assemble large 1MB WRITE requests fairly often.
Without the patch, the numbers range from about 2000 to over 10000,
meaning an average of maybe 100 pages per call, so we rarely get large
1M requests and often get small ones.
Note that there are other places where NR_FILE_DIRTY and NR_UNSTABLE_NFS
are combined (wb_over_bg_thresh() and get_nr_dirty_pages()). I suspect
they are wrong too. I also suspect that unstable-NFS pages should not be
included in the WB_RECLAIMABLE wb_stat, but I haven't explored that in
detail yet.
Note that the current behaviour of NFS w.r.t. sending COMMIT requests is
that as soon as a batch of writes all complete, a COMMIT is scheduled.
Prior to commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is
complete"), other mechanisms triggered the COMMIT. It is possible that
the current writeback code made more sense before that commit.
Signed-off-by: NeilBrown <neilb@...e.de>
---
mm/page-writeback.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2caf780a42e7..99419cba1e7a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1593,10 +1593,11 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
* written to the server's write cache, but has not yet
* been flushed to permanent storage.
*/
- nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS);
+ nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
gdtc->avail = global_dirtyable_memory();
- gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
+ gdtc->dirty = nr_reclaimable +
+ global_node_page_state(NR_WRITEBACK) +
+ global_node_page_state(NR_UNSTABLE_NFS);
domain_dirty_limits(gdtc);
@@ -1664,7 +1665,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
break;
}
- if (unlikely(!writeback_in_progress(wb)))
+ if (nr_reclaimable > gdtc->thresh &&
+ unlikely(!writeback_in_progress(wb)))
wb_start_background_writeback(wb);
mem_cgroup_flush_foreign(wb);
--
2.25.2