lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJD7tkaFHgc3eN1K1wYsQFWMLu4+Frf9DJ-5HOja2nC20Es9Dw@mail.gmail.com>
Date:   Sat, 12 Aug 2023 04:04:32 -0700
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Shakeel Butt <shakeelb@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Muchun Song <muchun.song@...ux.dev>, cgroups@...r.kernel.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads

On Sat, Aug 12, 2023 at 1:35 AM Michal Hocko <mhocko@...e.com> wrote:
>
> On Fri 11-08-23 19:48:14, Shakeel Butt wrote:
> > On Fri, Aug 11, 2023 at 7:36 PM Yosry Ahmed <yosryahmed@...gle.com> wrote:
> > >
> > > On Fri, Aug 11, 2023 at 7:29 PM Shakeel Butt <shakeelb@...gle.com> wrote:
> > > >
> > > > On Fri, Aug 11, 2023 at 7:12 PM Yosry Ahmed <yosryahmed@...gle.com> wrote:
> > > > >
> > > > [...]
> > > > >
> > > > > I am worried that writing to a stat for flushing then reading will
> > > > > increase the staleness window which we are trying to reduce here.
> > > > > Would it be acceptable to add a separate interface to explicitly read
> > > > > flushed stats without having to write first? If the distinction
> > > > > disappears in the future we can just short-circuit both interfaces.
> > > >
> > > > What is the acceptable staleness time window for your case? It is hard
> > > > to imagine that a write+read will always be worse than just a read.
> > > > Even the proposed patch can have an unintended and larger than
> > > > expected staleness window due to some processing on
> > > > return-to-userspace or some scheduling delay.
> > >
> > > Maybe I am worrying too much, we can just go for writing to
> > > memory.stat for explicit stats refresh.
> > >
> > > Do we still want to go with the mutex approach Michal suggested for
> > > do_flush_stats() to support either waiting for ongoing flushes
> > > (mutex_lock) or skipping (mutex_trylock)?
> >
> > I would say keep that as a separate patch.
>
> Separate patches would be better but please make the mutex conversion
> first. We really do not want to have any busy waiting depending on a
> sleep exported to the userspace. That is just no-go.

+tj@...nel.org

That makes sense.

Taking a step back though, and considering there have been other
complaints about unified flushing causing expensive reads from
memory.stat [1], I am wondering if we should tackle the fundamental
problem.

We have a single global rstat lock for flushing, which protects the
global per-cgroup counters as far as I understand. A single lock means
a lot of contention, which is why we implemented unified flushing on
the memcg side in the first place, where we only let one flusher
operate and everyone else skip, but that flusher needs to flush the
entire tree.

This can be unnecessarily expensive (see [1]), and to avoid how
expensive it is we sacrifice accuracy (what this patch is about). I am
exploring breaking down that lock into per-cgroup locks, where a
flusher acquires locks in a top down fashion. This allows for some
concurrency in flushing, and makes unified flushing unnecessary. If we
retire unified flushing we fix both accuracy and expensive reads at
the same time, while not sacrificing performance for concurrent
in-kernel flushers.

What do you think? I am prototyping something now and running some
tests, it seems promising and simple-ish (unless I am missing a big
correctness issue).

[1] https://lore.kernel.org/lkml/CABWYdi3YNwtPDwwJWmCO-ER50iP7CfbXkCep5TKb-9QzY-a40A@mail.gmail.com/

>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ