linux-kernel - Re: [mm-unstable v4 5/5] mm: memcg: restore subtree stats flushing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAAPL-u-Futq5biNhQKTVi15vzihZxoan-dVORPqpov1saJ99=Q@mail.gmail.com>
Date:   Mon, 4 Dec 2023 15:46:31 -0800
From:   Wei Xu <weixugc@...gle.com>
To:     Shakeel Butt <shakeelb@...gle.com>
Cc:     Yosry Ahmed <yosryahmed@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Muchun Song <muchun.song@...ux.dev>,
        Ivan Babrou <ivan@...udflare.com>, Tejun Heo <tj@...nel.org>,
        Michal Koutný <mkoutny@...e.com>,
        Waiman Long <longman@...hat.com>, kernel-team@...udflare.com,
        Greg Thelen <gthelen@...gle.com>,
        Domenico Cerasuolo <cerasuolodomenico@...il.com>,
        linux-mm@...ck.org, cgroups@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [mm-unstable v4 5/5] mm: memcg: restore subtree stats flushing

On Mon, Dec 4, 2023 at 3:31 PM Shakeel Butt <shakeelb@...gle.com> wrote:
>
> On Mon, Dec 4, 2023 at 1:38 PM Yosry Ahmed <yosryahmed@...gle.com> wrote:
> >
> > On Mon, Dec 4, 2023 at 12:12 PM Yosry Ahmed <yosryahmed@...gle.com> wrote:
> > >
> > > On Sat, Dec 2, 2023 at 12:31 AM Shakeel Butt <shakeelb@...gle.com> wrote:
> > > >
> > > > On Wed, Nov 29, 2023 at 03:21:53AM +0000, Yosry Ahmed wrote:
> > > > [...]
> > > > > +void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> > > > >  {
> > > > > -     if (memcg_should_flush_stats(root_mem_cgroup))
> > > > > -             do_flush_stats();
> > > > > +     static DEFINE_MUTEX(memcg_stats_flush_mutex);
> > > > > +
> > > > > +     if (mem_cgroup_disabled())
> > > > > +             return;
> > > > > +
> > > > > +     if (!memcg)
> > > > > +             memcg = root_mem_cgroup;
> > > > > +
> > > > > +     if (memcg_should_flush_stats(memcg)) {
> > > > > +             mutex_lock(&memcg_stats_flush_mutex);
> > > >
> > > > What's the point of this mutex now? What is it providing? I understand
> > > > we can not try_lock here due to targeted flushing. Why not just let the
> > > > global rstat serialize the flushes? Actually this mutex can cause
> > > > latency hiccups as the mutex owner can get resched during flush and then
> > > > no one can flush for a potentially long time.
> > >
> > > I was hoping this was clear from the commit message and code comments,
> > > but apparently I was wrong, sorry. Let me give more context.
> > >
> > > In previous versions and/or series, the mutex was only used with
> > > flushes from userspace to guard in-kernel flushers against high
> > > contention from userspace. Later on, I kept the mutex for all memcg
> > > flushers for the following reasons:
> > >
> > > (a) Allow waiters to sleep:
> > > Unlike other flushers, the memcg flushing path can see a lot of
> > > concurrency. The mutex avoids having a lot of CPUs spinning (e.g.
> > > concurrent reclaimers) by allowing waiters to sleep.
> > >
> > > (b) Check the threshold under lock but before calling cgroup_rstat_flush():
> > > The calls to cgroup_rstat_flush() are not very cheap even if there's
> > > nothing to flush, as we still need to iterate all CPUs. If flushers
> > > contend directly on the rstat lock, overlapping flushes will
> > > unnecessarily do the percpu iteration once they hold the lock. With
> > > the mutex, they will check the threshold again once they hold the
> > > mutex.
> > >
> > > (c) Protect non-memcg flushers from contention from memcg flushers.
> > > This is not as strong of an argument as protecting in-kernel flushers
> > > from userspace flushers.
> > >
> > > There has been discussions before about changing the rstat lock itself
> > > to be a mutex, which would resolve (a), but there are concerns about
> > > priority inversions if a low priority task holds the mutex and gets
> > > preempted, as well as the amount of time the rstat lock holder keeps
> > > the lock for:
> > > https://lore.kernel.org/lkml/ZO48h7c9qwQxEPPA@slm.duckdns.org/
> > >
> > > I agree about possible hiccups due to the inner lock being dropped
> > > while the mutex is held. Running a synthetic test with high
> > > concurrency between reclaimers (in-kernel flushers) and stats readers
> > > show no material performance difference with or without the mutex.
> > > Maybe things cancel out, or don't really matter in practice.
> > >
> > > I would prefer to keep the current code as I think (a) and (b) could
> > > cause problems in the future, and the current form of the code (with
> > > the mutex) has already seen mileage with production workloads.
> >
> > Correction: The priority inversion is possible on the memcg side due
> > to the mutex in this patch. Also, for point (a), the spinners will
> > eventually sleep once they hold the lock and hit the first CPU
> > boundary -- because of the lock dropping and cond_resched(). So
> > eventually, all spinners should be able to sleep, although it will be
> > a while until they do. With the mutex, they all sleep from the
> > beginning. Point (b) still holds though.
> >
> > I am slightly inclined to keep the mutex but I can send a small fixlet
> > to remove it if others think otherwise.
> >
> > Shakeel, Wei, any preferences?
>
> My preference is to avoid the issue we know we see in production alot
> i.e. priority inversion.
>
> In future if you see issues with spinning then you can come up with
> the lockless flush mechanism at that time.

Given that the synthetic high concurrency test doesn't show material
performance difference between the mutex and non-mutex versions, I
agree that the mutex can be taken out from this patch set (one less
global mutex to worry about).