Message-ID: <CAJD7tkbZi16w4mYngVK8qA84FMijmHvwzMjHfrJiCsV=WjixOA@mail.gmail.com>
Date:   Tue, 1 Aug 2023 10:29:39 -0700
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <muchun.song@...ux.dev>,
        Andrew Morton <akpm@...ux-foundation.org>,
        linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v3] mm: memcg: use rstat for non-hierarchical stats

On Tue, Aug 1, 2023 at 9:39 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> On Tue, Aug 1, 2023 at 7:30 AM Michal Hocko <mhocko@...e.com> wrote:
> >
> > On Wed 26-07-23 15:32:23, Yosry Ahmed wrote:
> > > Currently, memcg uses rstat to maintain aggregated hierarchical stats.
> > > Counters are maintained for hierarchical stats at each memcg. Rstat
> > > tracks which cgroups have updates on which cpus to keep those counters
> > > fresh on the read-side.
> > >
> > > Non-hierarchical stats are currently not covered by rstat. Their
> > > per-cpu counters are summed up on every read, which is expensive.
> > > The original implementation did the same. At some point before rstat,
> > > non-hierarchical aggregated counters were introduced by
> > > commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
> > > memory.stat reporting"). However, those counters were updated on the
> > > performance critical write-side, which caused regressions, so they were
> > > later removed by commit 815744d75152 ("mm: memcontrol: don't batch
> > > updates of local VM stats and events"). See [1] for more detailed
> > > history.
> > >
> > > Kernel versions in between a983b5ebee57 & 815744d75152 (a year and a
> > > half) enjoyed cheap reads of non-hierarchical stats, specifically on
> > > cgroup v1. When moving to more recent kernels, a performance regression
> > > for reading non-hierarchical stats is observed.
> > >
> > > Now that we have rstat, we know exactly which percpu counters have
> > > updates for each stat. We can maintain non-hierarchical counters again,
> > > making reads much more efficient, without affecting the performance
> > > critical write-side. Hence, add non-hierarchical (i.e. local) counters
> > > for the stats, and extend rstat flushing to keep those up-to-date.
> > >
> > > A caveat is that we now need a stats flush before reading
> > > local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
> > > or memcg_events_local(), where we previously only needed a flush to
> > > read hierarchical stats. Most contexts reading non-hierarchical stats
> > > are already doing a flush; add a flush to the only missing context in
> > > count_shadow_nodes().
> > >
> > > With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
> > > machine with 256 cpus on cgroup v1:
> > >  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
> > >  # time cat /sys/fs/cgroup/memory/cg*/memory.stat > /dev/null
> > >
> > > Before:
> > >  real  0m0.125s
> > >  user  0m0.005s
> > >  sys   0m0.120s
> > >
> > > After:
> > >  real  0m0.032s
> > >  user  0m0.005s
> > >  sys   0m0.027s
> >
> > Have you measured any potential regression for cgroup v2 which collects
> > all this data without ever using it (AFAICS)?
>
> I did not. I did not expect noticeable regressions given that all the
> extra work is done during flushing, which should mostly be done by the
> asynchronous worker, but can also happen in the stats reading context.
> Let me run the same script on cgroup v2 just in case and report back.

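To illustrate why the extra work lands in the flush path rather than on
the write side, here is a toy Python model of the idea. It is only a
sketch under stated assumptions: the class Memcg and the names
mod_state, flush, updated_cpus, local and hierarchical are hypothetical
and are not the kernel's data structures or API, and the real rstat
flush is driven from a subtree root rather than called per cgroup.

```python
from collections import defaultdict


class Memcg:
    """Toy model of one memcg. Hypothetical, not kernel code."""

    def __init__(self, name, parent=None, ncpus=4):
        self.name = name
        self.parent = parent
        # Pending per-cpu deltas that have not been flushed yet.
        self.percpu = [defaultdict(int) for _ in range(ncpus)]
        # rstat-like tracking: which cpus hold unflushed deltas.
        self.updated_cpus = set()
        # Aggregated counters, only touched during a flush.
        self.local = defaultdict(int)         # this memcg only
        self.hierarchical = defaultdict(int)  # this memcg + descendants

    def mod_state(self, cpu, stat, delta):
        # Write side: touch only the per-cpu counter and the tracking set.
        self.percpu[cpu][stat] += delta
        self.updated_cpus.add(cpu)

    def flush(self):
        # Flush side: visit only cpus known to have updates.
        for cpu in self.updated_cpus:
            for stat, delta in self.percpu[cpu].items():
                self.local[stat] += delta
                node = self
                while node is not None:
                    node.hierarchical[stat] += delta
                    node = node.parent
            self.percpu[cpu].clear()
        self.updated_cpus.clear()


root = Memcg("root")
child = Memcg("child", parent=root)
child.mod_state(0, "file", 3)
child.mod_state(2, "file", 2)
root.mod_state(1, "file", 1)
child.flush()
root.flush()
print(child.local["file"], root.local["file"], root.hierarchical["file"])
```

In this model a reader of the local counter no longer sums every
per-cpu slot; it only needs a flush to have run first, which matches
the caveat described in the commit log.
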
A few runs on mm-unstable with this patch:

# time cat /sys/fs/cgroup/cg*/memory.stat > /dev/null
real 0m0.020s
user 0m0.005s
sys 0m0.015s

# time cat /sys/fs/cgroup/cg*/memory.stat > /dev/null
real 0m0.017s
user 0m0.005s
sys 0m0.012s

# time cat /sys/fs/cgroup/cg*/memory.stat > /dev/null
real 0m0.016s
user 0m0.004s
sys 0m0.012s

A few runs on mm-unstable with the patch reverted:

# time cat /sys/fs/cgroup/cg*/memory.stat > /dev/null
real 0m0.020s
user 0m0.005s
sys 0m0.015s

# time cat /sys/fs/cgroup/cg*/memory.stat > /dev/null
real 0m0.016s
user 0m0.004s
sys 0m0.012s

# time cat /sys/fs/cgroup/cg*/memory.stat > /dev/null
real 0m0.017s
user 0m0.005s
sys 0m0.012s

It looks like there are no regressions on cgroup v2 when reading the
stats. Please let me know if you want me to send a new version with
the cgroup v2 results as well in the commit log -- or I can just send
a new commit log. Whatever is easier for Andrew.
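
As a quick check on the numbers above, the "real" times can be compared
directly. This is a minimal sketch in Python; the variable and function
names are made up for illustration, and three runs per configuration is
too few to claim more than "no visible difference".

```python
from statistics import mean

# "real" times in seconds, copied from the cgroup v2 runs above.
with_patch = [0.020, 0.017, 0.016]
reverted = [0.020, 0.016, 0.017]

# "real" times from the cgroup v1 measurement in the commit log.
v1_before = 0.125
v1_after = 0.032


def summarize(samples):
    """Return mean, min and max of a list of timings."""
    return {"mean": mean(samples), "min": min(samples), "max": max(samples)}


def speedup(before, after):
    """Ratio of the old time to the new time."""
    return before / after


patch_summary = summarize(with_patch)
revert_summary = summarize(reverted)
v2_delta = patch_summary["mean"] - revert_summary["mean"]
v1_ratio = speedup(v1_before, v1_after)

print("v2 with patch:", patch_summary)
print("v2 reverted:  ", revert_summary)
print("v2 mean delta: %.6f s" % v2_delta)
print("v1 speedup:    %.2fx" % v1_ratio)
```

The two cgroup v2 sets contain the same three values, so their means
match, while the cgroup v1 ratio of the quoted "real" times comes out
a little under 4x.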

>
> > --
> > Michal Hocko
> > SUSE Labs
