Date:   Tue, 23 Feb 2021 23:24:23 -0800 (PST)
From:   Hugh Dickins <hughd@...gle.com>
To:     Roman Gushchin <guro@...com>
cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Vlastimil Babka <vbabka@...e.cz>, linux-mm@...ck.org,
        kernel-team@...com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating
 false warnings

On Thu, 6 Aug 2020, Andrew Morton wrote:
> On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@...com> wrote:

August, yikes, I thought it was much more recent.

> 
> > it seems that Hugh and I haven't reached a consensus here.
> > Can you please not merge this patch into 5.9, so we would have
> > more time to find a solution acceptable to all?
> 
> No probs.  I already had a big red asterisk on it ;)

I've a suspicion that Andrew might be tiring of his big red asterisk,
and wanting to unload
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
into 5.12.

I would prefer not, and reiterate my Nack: but no great harm will
befall the cosmos if he overrules that, and it does go through to
5.12 - I'll just want to revert it again later.  And I do think a
more straightforward way of suppressing those warnings would be just
to delete the code that issues them, rather than brushing them under
a carpet of overtuning.

I've been running mmotm with the patch below (shown as a sign of good
faith, and for you to try, but not ready to go yet) for a few months
now - overriding your max_drift, restoring nr_writeback and friends to
the same checking, and fixing the obvious reason why nr_zone_write_pending
and nr_writeback are occasionally seen negative (an interrupt coming in
to decrement those stats before they have even been incremented).
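
To spell out the window I mean, here is a sketch of the old ordering -
not verbatim code, and assuming writeback completion can arrive by
interrupt between the flag being set and the counts being bumped:

	__test_set_page_writeback()            I/O completion (interrupt)
	---------------------------            --------------------------
	ret = TestSetPageWriteback(page);
	                                       test_clear_page_writeback(page);
	                                         dec_lruvec_page_state(page, NR_WRITEBACK);
	                                         dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
	                                       (both counts now transiently negative)
	if (!ret) {
		inc_lruvec_page_state(page, NR_WRITEBACK);
		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
	}

Hence the patch below incrementing in advance instead.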

Two big BUTs (if not asterisks).  First, since adding that patch, I
have usually forgotten all about it, and so forgotten to run the script
that echoes into /proc/sys/vm/stat_refresh at odd intervals while under
load: so I have less data than I'd intended by now.  And secondly (I've
just checked again this evening), I do still see nr_zone_write_pending
and nr_writeback occasionally caught negative while under load.  So
there's something more at play: perhaps the predicted Gushchin Effect
(but wouldn't they go together if so? I've only seen them separately),
or maybe something else, I don't know.
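
(For reference, all that script really does is the equivalent of the
following, at randomized intervals - a stand-in I'm sketching here, the
real thing being just a shell loop around echo.  Per the vmstat_refresh()
code below, the write fails with EINVAL, and the kernel logs a pr_warn,
whenever a stat is caught negative.)

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/proc/sys/vm/stat_refresh", O_WRONLY);

		if (fd < 0)
			return 1;
		if (write(fd, "1", 1) < 0)	/* any write triggers the refresh */
			return 1;
		close(fd);
		return 0;
	}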

Those are the only stats I've seen caught negative, but I don't have
CMA configured at all.  You mention nr_free_cma as the only(?) other
stat you've seen negative; that of course I won't see, but looking
at the source I now notice that NR_FREE_CMA_PAGES is incremented
and decremented according to page migratetype...

... internally we have another stat that's incremented and decremented
according to page migratetype, and that one has been seen negative too.
Isn't a page's migratetype something that usually stays the same,
whereas the migratetype of the page's block can sometimes change, even
while some of its pages are allocated?  That is not a stable basis for
maintaining stats, though it won't matter much if they are only for
display.
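
(For the curious, this is the accounting I mean - quoted from memory
of include/linux/vmstat.h, so do check the source:)

	static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
						     int migratetype)
	{
		__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
		if (is_migrate_cma(migratetype))
			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
	}

If the pageblock's migratetype changes between a page being freed and
being allocated again, the increment and decrement of NR_FREE_CMA_PAGES
no longer pair up, and the count can drift - negative, when it loses
increments.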

vmstat_refresh could just exempt nr_zone_write_pending, nr_writeback
and nr_free_cma from its warnings, if we cannot find a fix for them:
but I see no reason to suppress the warnings on all the other vmstats.
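
If we did go that way, I imagine something like this shape in
vmstat_refresh() - just a sketch naming the three suspects, not
something I'm proposing yet:

	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
		val = atomic_long_read(&vm_zone_stat[i]);
		if (val < 0 && i != NR_ZONE_WRITE_PENDING &&
			       i != NR_FREE_CMA_PAGES) {
			pr_warn("%s: %s %ld\n",
				__func__, zone_stat_name(i), val);
			err = -EINVAL;
		}
	}

with nr_writeback exempted in the same way in the node stats loop.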

The patch I've been testing with:

--- mmotm/mm/page-writeback.c	2021-02-14 14:32:24.000000000 -0800
+++ hughd/mm/page-writeback.c	2021-02-20 18:01:11.264162616 -0800
@@ -2769,6 +2769,13 @@ int __test_set_page_writeback(struct pag
 	int ret, access_ret;
 
 	lock_page_memcg(page);
+	/*
+	 * Increment counts in advance, so that they will not go negative
+	 * if test_clear_page_writeback() comes in to decrement them.
+	 */
+	inc_lruvec_page_state(page, NR_WRITEBACK);
+	inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+
 	if (mapping && mapping_use_writeback_tags(mapping)) {
 		XA_STATE(xas, &mapping->i_pages, page_index(page));
 		struct inode *inode = mapping->host;
@@ -2804,9 +2811,14 @@ int __test_set_page_writeback(struct pag
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret) {
-		inc_lruvec_page_state(page, NR_WRITEBACK);
-		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+
+	if (WARN_ON_ONCE(ret)) {
+		/*
+		 * Correct counts in retrospect, if PageWriteback was already
+		 * set; but does any filesystem ever allow this to happen?
+		 */
+		dec_lruvec_page_state(page, NR_WRITEBACK);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
 	access_ret = arch_make_page_accessible(page);
--- mmotm/mm/vmstat.c	2021-02-20 17:59:44.838171232 -0800
+++ hughd/mm/vmstat.c	2021-02-20 18:01:11.272162661 -0800
@@ -1865,7 +1865,7 @@ int vmstat_refresh(struct ctl_table *tab
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
-		if (val < -max_drift) {
+		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
 				__func__, zone_stat_name(i), val);
 			err = -EINVAL;
@@ -1874,13 +1874,21 @@ int vmstat_refresh(struct ctl_table *tab
 #ifdef CONFIG_NUMA
 	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < -max_drift) {
+		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
 				__func__, numa_stat_name(i), val);
 			err = -EINVAL;
 		}
 	}
 #endif
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+		val = atomic_long_read(&vm_node_stat[i]);
+		if (val < 0) {
+			pr_warn("%s: %s %ld\n",
+				__func__, node_stat_name(i), val);
+			err = -EINVAL;
+		}
+	}
 	if (err)
 		return err;
 	if (write)
