linux-kernel - Re: [PATCH] mm: memcg: provide accurate stats for userspace reads

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZNOVS0Smp2PHUIuq@dhcp22.suse.cz>
Date:   Wed, 9 Aug 2023 15:31:55 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     Yosry Ahmed <yosryahmed@...gle.com>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Muchun Song <muchun.song@...ux.dev>, cgroups@...r.kernel.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads

On Wed 09-08-23 06:13:05, Yosry Ahmed wrote:
> On Wed, Aug 9, 2023 at 5:58 AM Michal Hocko <mhocko@...e.com> wrote:
> >
> > On Wed 09-08-23 05:31:04, Yosry Ahmed wrote:
> > > On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko <mhocko@...e.com> wrote:
> > > >
> > > > On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > > > > Over time, the memcg code added multiple optimizations to the stats
> > > > > flushing path that introduce a tradeoff between accuracy and
> > > > > performance. In some contexts (e.g. dirty throttling, refaults, etc), a
> > > > > full rstat flush of the stats in the tree can be too expensive. Such
> > > > > optimizations include [1]:
> > > > > (a) Introducing a periodic background flusher to keep the size of the
> > > > > update tree from growing unbounded.
> > > > > (b) Allowing only one thread to flush at a time, and other concurrent
> > > > > flushers just skip the flush. This avoids a thundering herd problem
> > > > > when multiple reclaim/refault threads attempt to flush the stats at
> > > > > once.
> > > > > (c) Only executing a flush if the magnitude of the stats updates exceeds
> > > > > a certain threshold.
> > > > >
> > > > > These optimizations were necessary to make flushing feasible in
> > > > > performance-critical paths, and they come at the cost of some accuracy
> > > > > that we choose to live without. On the other hand, for flushes invoked
> > > > > when userspace is reading the stats, the tradeoff is less appealing
> > > > > This code path is not performance-critical, and the inaccuracies can
> > > > > affect userspace behavior. For example, skipping flushing when there is
> > > > > another ongoing flush is essentially a coin flip. We don't know if the
> > > > > ongoing flush is done with the subtree of interest or not.
> > > >
> > > > I am not convinced by this much TBH. What kind of precision do you
> > > > really need and how much off is what we provide?
> > > >
> > > > More expensive read of stats from userspace is quite easy to notice
> > > > and usually reported as a regression. So you should have a convincing
> > > > argument that an extra time spent is really worth it. AFAIK there are
> > > > many monitoring (top like) tools which simply read those files regularly
> > > > just to show numbers and they certainly do not need a high level of
> > > > precision.
> > >
> > > We used to spend this time before commit fd25a9e0e23b ("memcg: unify
> > > memcg stat flushing") which generalized the "skip if ongoing flush"
> > > for all stat flushing. As far I know, the problem was contention on
> > > the flushing lock which also affected critical paths like refault.
> > >
> > > The problem is that the current behavior is indeterministic, if cpu A
> > > tries to flush stats and cpu B is already doing that, cpu A will just
> > > skip. At that point, the cgroup(s) that cpu A cares about may have
> > > been fully flushed, partially flushed (in terms of cpus), or not
> > > flushed at all. We have no idea. We just know that someone else is
> > > flushing something. IOW, in some cases the flush request will be
> > > completely ignored and userspace will read stale stats (up to 2s + the
> > > periodic flusher runtime).
> >
> > Yes, that is certainly true but why does that matter? Stats are always a
> > snapshot of the past. Do we get an inconsistent image that would be
> > actively harmful.
> 
> That can very well be the case because we may be in a state where some
> cpus are flushed and some aren't. Also sometimes a few seconds is too
> old. We have some workloads that read the stats every 1-2 seconds to
> keep a fresh state, and they certainly do not expect stats to be 2+
> seconds old when they read them.

I hate to repeat myself but please be more specific. This all sounds
just too wavy to me.

> > > Some workloads need to read up-to-date stats as feedback to actions
> > > (e.g. after proactive reclaim, or for userspace OOM killing purposes),
> > > and reading such stale stats causes regressions or misbehavior by
> > > userspace.
> >
> > Please tell us more about those and why should all others that do not
> > require such a precision should page that price as well.
> 
> Everyone used to pay this price though and no one used to complain.

Right, and then the overhead has been reduced and now you want to bring
it back and that will be seen as a regression. It doesn't really matter
what used to be the overhead. People always care when something gets
slower.
-- 
Michal Hocko
SUSE Labs