linux-kernel - Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200804004012.GA1049259@carbon.dhcp.thefacebook.com>
Date:   Mon, 3 Aug 2020 17:40:12 -0700
From:   Roman Gushchin <guro@...com>
To:     Hugh Dickins <hughd@...gle.com>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Vlastimil Babka <vbabka@...e.cz>, <linux-mm@...ck.org>,
        <kernel-team@...com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating
 false warnings

On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote:
> On Fri, 31 Jul 2020, Roman Gushchin wrote:
> > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> > > 
> > > Though another alternative did occur to me overnight: we could
> > > scrap the logged warning, and show "nr_whatever -53" as output
> > > from /proc/sys/vm/stat_refresh: that too would be acceptable
> > > to me, and you redirect to /dev/null.
> > 
> > It sounds like a good idea to me. Do you want me to prepare a patch?
> 
> Yes, if you like that one best, please do prepare a patch - thanks!

Hi Hugh,

I mastered a patch (attached below), but honestly I can't say I like it.
The resulting interface is confusing: we don't generally use sysctls to
print debug data and/or warnings.

I thought about treating a write to this sysctls as setting the threshold,
so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative
entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold
as in my patch. But this breaks  to some extent the current ABI, as passing
an incorrect value will result in -EINVAL instead of passing (as now).

Overall I still think we shouldn't warn on any values inside the possible
range, as it's not an indication of any kind of error. The only reason
why we see some values going negative and some not, is that some of them
are updated more frequently than others, and some are bouncing around
zero, while other can't reach zero too easily (like the number of free pages).

Actually, if someone wants to ensure that numbers are accurate,
we have to temporarily set the threshold to 0, then flush the percpu data
and only then check atomics. In the current design flushing percpu data
matters for only slowly updated counters, as all others will run away while
we're waiting for the flush. So if we're targeting some slowly updating
counters, maybe we should warn only on them being negative, Idk.

Thanks!

--

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6eecfcbbfe79..1d2f2471bb78 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1815,13 +1815,28 @@ static void refresh_vm_stats(struct work_struct *work)
        refresh_cpu_vm_stats(true);
 }
 
+static void warn_vmstat(void *buffer, size_t *lenp, loff_t *ppos,
+                       const char *name, long val)
+{
+       int len;
+
+       len = snprintf(buffer + *ppos, *lenp, "%s %lu\n", name, val);
+       *lenp -= len;
+       *ppos += len;
+}
+
 int vmstat_refresh(struct ctl_table *table, int write,
                   void *buffer, size_t *lenp, loff_t *ppos)
 {
-       long val, max_drift;
+       long val;
        int err;
        int i;
 
+       if (!*lenp || (*ppos && !write)) {
+               *lenp = 0;
+               return 0;
+       }
+
        /*
         * The regular update, every sysctl_stat_interval, may come later
         * than expected: leaving a significant amount in per_cpu buckets.
@@ -1837,35 +1852,24 @@ int vmstat_refresh(struct ctl_table *table, int write,
        /*
         * Since global_zone_page_state() etc. are so careful to hide
         * transiently negative values, report an error here if any of
-        * the stats is negative and are less than the maximum drift value,
-        * so we know to go looking for imbalance.
+        * the stats is negative, so we know to go looking for imbalance.
         */
-       max_drift = num_online_cpus() * MAX_THRESHOLD;
-
        for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
                val = atomic_long_read(&vm_zone_stat[i]);
-               if (val < -max_drift) {
-                       pr_warn("%s: %s %ld\n",
-                               __func__, zone_stat_name(i), val);
-                       err = -EINVAL;
-               }
+               if (!write && val < 0)
+                       warn_vmstat(buffer, lenp, ppos, zone_stat_name(i), val);
        }
 #ifdef CONFIG_NUMA
        for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
                val = atomic_long_read(&vm_numa_stat[i]);
-               if (val < -max_drift) {
-                       pr_warn("%s: %s %ld\n",
-                               __func__, numa_stat_name(i), val);
-                       err = -EINVAL;
-               }
+               if (!write && val < 0)
+                       warn_vmstat(buffer, lenp, ppos, numa_stat_name(i), val);
        }
 #endif
        if (err)
                return err;
        if (write)
                *ppos += *lenp;
-       else
-               *lenp = 0;
        return 0;
 }
 #endif /* CONFIG_PROC_FS */