Message-ID: <20230725140435.GB1146582@cmpxchg.org>
Date: Tue, 25 Jul 2023 10:04:35 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Shakeel Butt <shakeelb@...gle.com>,
Muchun Song <muchun.song@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH] mm: memcg: use rstat for non-hierarchical stats
On Wed, Jul 19, 2023 at 05:46:13PM +0000, Yosry Ahmed wrote:
> Currently, memcg uses rstat to maintain hierarchical stats. The rstat
> framework keeps track of which cgroups have updates on which cpus.
>
> Since memcg moved to rstat, non-hierarchical stats are no longer
> readily available as counters. Instead, the percpu counters for a given
> stat need to be summed to get the non-hierarchical stat value. This
> causes a performance regression when reading non-hierarchical stats on
> kernels where memcg moved to using rstat. This is especially visible
> when reading memory.stat on cgroup v1. There are also some code paths
> internal to the kernel that read such non-hierarchical stats.
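For concreteness, the local read path being described is a per-CPU
summation along these lines (a simplified sketch of the pre-patch
pattern, not the exact kernel code):

	/* Sketch: local reads sum the raw per-cpu counters. */
	static unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
						    int idx)
	{
		long x = 0;
		int cpu;

		/* O(nr_cpus) work per stat item, on every read. */
		for_each_possible_cpu(cpu)
			x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);

		return x < 0 ? 0 : x;	/* per-cpu drift can go negative */
	}
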
It's actually not an rstat regression. It's always been this costly.
Quick history:
We used to maintain *all* stats in per-cpu counters at the local
level. memory.stat reads would have to iterate and aggregate the
entire subtree every time. This was obviously very costly, so we added
batched upward propagation during stat updates to simplify reads:
commit 42a300353577ccc17ecc627b8570a89fa1678bec
Author: Johannes Weiner <hannes@...xchg.org>
Date: Tue May 14 15:47:12 2019 -0700
mm: memcontrol: fix recursive statistics correctness & scalabilty
However, that caused a regression in the stat write path, as the
upward propagation would bottleneck on the cachelines in the shared
parents. The fix for *that* re-introduced the per-cpu loops in the
local stat reads:
commit 815744d75152078cde5391fc1e3c2d4424323fb6
Author: Johannes Weiner <hannes@...xchg.org>
Date: Thu Jun 13 15:55:46 2019 -0700
mm: memcontrol: don't batch updates of local VM stats and events
So I wouldn't say it's a regression from rstat. Except for that short
period between the two commits above, the read side for local stats
was always expensive.
rstat promises a shot at finally fixing it, with less risk to the
write path.
> It is inefficient to iterate and sum counters across all CPUs when the
> rstat framework knows exactly when a percpu counter has an update.
> Instead, maintain cpu-aggregated non-hierarchical counters for each
> stat. During an rstat flush, keep those updated as well. When reading
> non-hierarchical stats, we no longer need to iterate CPUs; we just need
> to read the maintained counters, similar to hierarchical stats.
>
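The flush-side idea, as I read it, is roughly this (a hedged sketch;
the field and function names here are illustrative, not necessarily
the ones in the patch):

	/*
	 * Sketch: while folding per-cpu deltas into the hierarchical
	 * counters during an rstat flush, also accumulate each CPU's
	 * delta into a per-memcg "local" counter, so a local read
	 * becomes a single load instead of a per-CPU loop.
	 */
	static void memcg_flush_percpu(struct mem_cgroup *memcg,
				       struct memcg_vmstats_percpu *statc)
	{
		int i;

		for (i = 0; i < MEMCG_NR_STAT; i++) {
			long v = READ_ONCE(statc->state[i]);
			long delta = v - statc->state_prev[i];

			if (!delta)
				continue;
			statc->state_prev[i] = v;

			/* non-hierarchical: this memcg's own updates */
			memcg->vmstats->state_local[i] += delta;
			/* hierarchical: propagated up the tree by rstat */
			memcg->vmstats->state[i] += delta;
		}
	}
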
> A caveat is that we now need a stats flush before reading
> local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
> or memcg_events_local(), where we previously only needed a flush to
> read hierarchical stats. Most contexts reading non-hierarchical stats
> are already doing a flush; add a flush to the only missing context,
> count_shadow_nodes().
>
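In other words, the read-side contract for the _local() accessors
becomes "flush first, then read the aggregated counter". A hedged
illustration (the helper name is made up, and the flush entry point
used in the patch may be a ratelimited variant):

	/* Illustrative only: local reads now follow a flush. */
	static unsigned long read_local_stat(struct mem_cgroup *memcg, int idx)
	{
		mem_cgroup_flush_stats();	/* fold per-cpu deltas */
		return memcg_page_state_local(memcg, idx);
	}
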
> With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
> machine with 256 cpus on cgroup v1:
> # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
> # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
> Before:
> real 0m0.125s
> user 0m0.005s
> sys 0m0.120s
>
> After:
> real 0m0.032s
> user 0m0.005s
> sys 0m0.027s
>
> Signed-off-by: Yosry Ahmed <yosryahmed@...gle.com>
Acked-by: Johannes Weiner <hannes@...xchg.org>
But I want to be clear: this isn't a regression fix. It's a new
performance optimization for the deprecated cgroup1 code. And it comes
at the cost of higher memory footprint for both cgroup1 AND cgroup2.
If this causes a regression, we should revert it again. But let's try.