Message-ID: <CAJD7tkbb5zxEPSW6vvijDpQKCZiDcUJqe8SPvLHfo0RX7iHXAw@mail.gmail.com>
Date:   Thu, 14 Sep 2023 10:56:52 -0700
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Shakeel Butt <shakeelb@...gle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Muchun Song <muchun.song@...ux.dev>,
        Ivan Babrou <ivan@...udflare.com>, Tejun Heo <tj@...nel.org>,
        Michal Koutný <mkoutny@...e.com>,
        Waiman Long <longman@...hat.com>, kernel-team@...udflare.com,
        Wei Xu <weixugc@...gle.com>, Greg Thelen <gthelen@...gle.com>,
        linux-mm@...ck.org, cgroups@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy

On Thu, Sep 14, 2023 at 10:36 AM Shakeel Butt <shakeelb@...gle.com> wrote:
>
> On Wed, Sep 13, 2023 at 12:38 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
> >
> > Stats flushing for memcg currently follows these rules:
> > - Always flush the entire memcg hierarchy (i.e. flush the root).
> > - Only one flusher is allowed at a time. If someone else tries to flush
> >   concurrently, they skip and return immediately.
> > - A periodic flusher flushes all the stats every 2 seconds.
> >
> > This approach is followed because all flushes are serialized by a
> > global rstat spinlock. On the memcg side, flushing is invoked from
> > userspace reads as well as in-kernel flushers (e.g. reclaim,
> > refault, etc.). Skipping when a flush is already in progress avoids
> > serializing all flushers on the global lock, which can cause a
> > significant performance hit under high concurrency.
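
For illustration, the protocol described above boils down to roughly
the following userspace C sketch. This is hypothetical code for
exposition only, not the actual mm/memcontrol.c implementation; the
names flush_ongoing, flush_root_hierarchy, and maybe_flush_stats are
made up:

#include <stdatomic.h>

static atomic_flag flush_ongoing = ATOMIC_FLAG_INIT;

/* Stand-in for walking every CPU/cgroup under the global rstat lock. */
static void flush_root_hierarchy(void)
{
        /* ... aggregate per-CPU deltas for the whole hierarchy ... */
}

static void maybe_flush_stats(void)
{
        /*
         * Someone else is flushing: skip and return immediately rather
         * than serialize on the global lock. This is also where the
         * accuracy problem comes from: the caller returns before the
         * subtree it cares about has actually been flushed.
         */
        if (atomic_flag_test_and_set(&flush_ongoing))
                return;

        flush_root_hierarchy();         /* always flush from the root */
        atomic_flag_clear(&flush_ongoing);
}

int main(void)
{
        /* A stats reader (or an in-kernel path such as reclaim): */
        maybe_flush_stats();
        /* A periodic worker would call flush_root_hierarchy() every 2s. */
        return 0;
}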
> >
> > This approach has the following problems:
> > - Occasionally a userspace read of the stats of a non-root cgroup will
> >   be too expensive as it has to flush the entire hierarchy [1].
>
> This is a real-world workload exhibiting the issue, which is good.
>
> > - Sometimes stats accuracy is compromised if there is an ongoing
> >   flush, and we skip and return before the subtree of interest is
> >   actually flushed. This is more visible when reading stats from
> >   userspace, but can also affect in-kernel flushers.
>
> Please provide similar data/justification for the above. In addition:
>
> 1. How much delay/staleness in the stats have you observed on real
> world workloads?

I am not really sure. We don't have a wide deployment of kernels with
rstat yet. These are problems observed in testing and/or concerns
expressed by our userspace team.

I am trying to solve this now because any problems that result from
this staleness will be very hard to debug and link back to stale
stats.

>
> 2. What is acceptable staleness in the stats for your use-case?

Again, unfortunately I am not sure, but right now it can be O(seconds),
which is not acceptable as we have workloads querying the stats every
1s (and sometimes more frequently).

>
> 3. What is your use-case?

A few use cases we have that may be affected by this:
- System overhead: calculations using memory.usage and some stats from
memory.stat. If one of them is fresh and the other one isn't, we have
an inconsistent view of the system.
- Userspace OOM killing: we use some stats in memory.stat to gauge the
amount of memory that will be freed by killing a task, since
memory.usage sometimes includes shared resources that wouldn't be
freed anyway.
- Proactive reclaim: we read memory.stat in a proactive reclaim
feedback loop; stale stats may cause us to mistakenly conclude that
reclaim is ineffective and stop prematurely (see the sketch after
this list).
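
Of these, the proactive reclaim loop is the easiest to sketch. Below
is a hypothetical userspace C fragment, not our actual agent: the
cgroup path and the "4M" request size are made up, and it assumes a
cgroup v2 kernel with the memory.reclaim interface (5.19+). The point
is only that the feedback loop keys off a counter delta read from
memory.stat, so a stale memory.stat can make reclaim look ineffective:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Parse one counter (e.g. "pgsteal") out of a memory.stat file. */
static long long read_stat(const char *path, const char *key)
{
        char line[256];
        long long val = -1;
        FILE *f = fopen(path, "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, key, strlen(key)) &&
                    line[strlen(key)] == ' ')
                        val = atoll(line + strlen(key) + 1);
        }
        fclose(f);
        return val;
}

int main(void)
{
        /* Hypothetical cgroup path; adjust for the real workload. */
        const char *stat = "/sys/fs/cgroup/workload/memory.stat";
        const char *reclaim = "/sys/fs/cgroup/workload/memory.reclaim";
        long long before, after;
        int fd;

        before = read_stat(stat, "pgsteal");

        /* Ask the kernel to proactively reclaim 4M from this cgroup. */
        fd = open(reclaim, O_WRONLY);
        if (fd < 0)
                return 1;
        if (write(fd, "4M", 2) < 0)
                perror("memory.reclaim");
        close(fd);

        after = read_stat(stat, "pgsteal");

        /*
         * If memory.stat is stale, 'after - before' can read as 0 even
         * though reclaim made progress, and the loop wrongly concludes
         * reclaim is ineffective and stops early.
         */
        printf("pgsteal delta: %lld\n", after - before);
        return 0;
}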

>
> 4. Does your use-case care about staleness of all the stats in
> memory.stat or some specific stats?

We have multiple use cases that can be affected by this, so I don't
think it is limited to specific stats. I am also not aware of all the
use cases that could be affected.

>
> 5. If it is only some specific stats in memory.stat, does it make
> sense to decouple them from rstat and just pay the price up front to
> maintain them accurately?
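
For what it's worth, question 5 boils down to the following trade-off,
shown here as a hypothetical userspace C sketch (the real memcg code
is far more involved): paying one atomic add per ancestor on every
update keeps the counter exact at all levels, so reads never need a
flush, at the cost of a more expensive update path:

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct cgroup {
        struct cgroup *parent;
        atomic_llong exact_stat;    /* always accurate, never flushed */
};

/* Update-side cost: one atomic add per ancestor, on every event. */
static void stat_add(struct cgroup *cg, long long delta)
{
        for (; cg; cg = cg->parent)
                atomic_fetch_add(&cg->exact_stat, delta);
}

/* Read side is now free: no hierarchy flush needed. */
static long long stat_read(struct cgroup *cg)
{
        return atomic_load(&cg->exact_stat);
}

int main(void)
{
        struct cgroup root = { .parent = NULL };
        struct cgroup child = { .parent = &root };

        stat_add(&child, 42);
        printf("child=%lld root=%lld\n",
               stat_read(&child), stat_read(&root));
        return 0;
}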
>
> Most importantly please please please be concise in your responses.

I try; sometimes I am not sure how much detail is needed. Sorry about that :)

>
> I know I am going back on some of the previous agreements but this
> whole locking back and forth has made me question the original
> motivation.

That's okay. Taking a step back, having flushing be nondeterministic
in this way is a time bomb, in my opinion. Note that it also affects
in-kernel flushers like reclaim or dirty isolation, which I suspect
will be more affected by staleness. No one has complained yet AFAICT,
but the worst part is that if/when a problem happens, we won't be
able to easily tell what went wrong.
