linux-kernel - Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200806233804.GB1217906@carbon.dhcp.thefacebook.com>
Date:   Thu, 6 Aug 2020 16:38:04 -0700
From:   Roman Gushchin <guro@...com>
To:     Andrew Morton <akpm@...ux-foundation.org>
CC:     Hugh Dickins <hughd@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Vlastimil Babka <vbabka@...e.cz>, <linux-mm@...ck.org>,
        <kernel-team@...com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating
 false warnings


On Wed, Aug 05, 2020 at 08:01:33PM -0700, Hugh Dickins wrote:
> On Mon, 3 Aug 2020, Roman Gushchin wrote:
> > On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote:
> > > On Fri, 31 Jul 2020, Roman Gushchin wrote:
> > > > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> > > > > 
> > > > > Though another alternative did occur to me overnight: we could
> > > > > scrap the logged warning, and show "nr_whatever -53" as output
> > > > > from /proc/sys/vm/stat_refresh: that too would be acceptable
> > > > > to me, and you redirect to /dev/null.
> > > > 
> > > > It sounds like a good idea to me. Do you want me to prepare a patch?
> > > 
> > > Yes, if you like that one best, please do prepare a patch - thanks!
> > 
> > Hi Hugh,
> > 
> > I mastered a patch (attached below), but honestly I can't say I like it.
> > The resulting interface is confusing: we don't generally use sysctls to
> > print debug data and/or warnings.
> 
> Since you confessed to not liking it yourself, I paid it very little
> attention.  Yes, when I made that suggestion, I wasn't really thinking
> of how stat_refresh is a /proc/sys/vm sysctl thing; and I'm not at all
> sure how issuing output from a /proc file intended for input works out
> (perhaps there are plenty of good examples, and you followed one, but
> it smells fishy to me now).
> 
> > 
> > I thought about treating a write to this sysctls as setting the threshold,
> > so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative
> > entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold
> > as in my patch. But this breaks  to some extent the current ABI, as passing
> > an incorrect value will result in -EINVAL instead of passing (as now).
> 
> I expect we could handle that well enough, by more lenient validation
> of the input; though my comment above on output versus input sheds doubt.
> 
> > 
> > Overall I still think we shouldn't warn on any values inside the possible
> > range, as it's not an indication of any kind of error. The only reason
> > why we see some values going negative and some not, is that some of them
> > are updated more frequently than others, and some are bouncing around
> > zero, while other can't reach zero too easily (like the number of free pages).
> 
> We continue to disagree on that (and it amuses me that you who are so
> sure they can be ignored, cannot ignore them; whereas I who am so curious
> to investigate them, have not actually found the time to do so in years).
> It was looking as if nothing could satisfy us both, but...
> 
> > 
> > Actually, if someone wants to ensure that numbers are accurate,
> > we have to temporarily set the threshold to 0, then flush the percpu data
> > and only then check atomics. In the current design flushing percpu data
> > matters for only slowly updated counters, as all others will run away while
> > we're waiting for the flush. So if we're targeting some slowly updating
> > counters, maybe we should warn only on them being negative, Idk.
> 
> I was going to look into that angle, though it would probably add a little
> unjustifiable overhead to fast paths, and be rejected on that basis.
> 
> But in going to do so, came up against an earlier comment of yours, of
> which I had misunderstood the significance. I had said and you replied:
> 
> > > nr_zone_write_pending: yes, I've looked at our machines, and see that
> > > showing up for us too (-49 was the worst I saw).  Not at all common,
> > > but seen.  And not followed by increasingly worse numbers, so a state
> > > that corrects itself.  nr_dirty too (fewer instances, bigger numbers);
> > > but never nr_writeback, which you'd expect to go along with those.
> > 
> > NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them?
> 
> Wow. Now I see what you were pointing out: when v4.8's 75ef71840539
> ("mm, vmstat: add infrastructure for per-node vmstats") went in, it
> missed updating vmstat_refresh() to check all the NR_VM_NODE_STAT items.
> 
> And I've never noticed, and have interpreted its silence on those items
> as meaning they're all good (and the nr_dirty ones I mentioned above,
> must have been from residual old kernels, hence the fewer instances).
> I see the particularly tricky NR_ISOLATED ones are in that category.
> Maybe they are all good, but I have been mistaken.
> 
> I shall certainly want to reintroduce those stats to checking for
> negatives, even if it's in a patch that never earns your approval,
> and just ends up kept internal for debugging.  But equally certainly,
> I must not suddenly reintroduce that checking without gaining some
> experience of it (and perhaps getting as irritated as you by more
> transient negatives).
> 
> I said earlier that I'd prefer you to rip out all that checking for
> negatives, rather than retaining it with the uselessly over-generous
> 125 * nr_cpus leeway.  Please, Roman, would you send Andrew a patch
> doing that, to replace the patch in this thread?  Or if you prefer,
> I can do so.

Hi Andrew,

it seems that Hugh and me haven't reached a consensus here.
Can, you, please, not merge this patch into 5.9, so we would have
more time to find a solution, acceptable for all?

Thank you!

Roman