linux-kernel - Re: [PATCH v3] mm: memcg: use rstat for non-hierarchical stats

Message-ID: <CAJD7tkb17x=qwoO37uxyYXLEUVp15BQKR+Xfh7Sg9Hx-wTQ_=w@mail.gmail.com>
Date:   Wed, 2 Aug 2023 15:02:55 -0700
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <muchun.song@...ux.dev>,
        Andrew Morton <akpm@...ux-foundation.org>,
        linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v3] mm: memcg: use rstat for non-hierarchical stats

On Wed, Aug 2, 2023 at 1:11 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> On Wed, Aug 2, 2023 at 12:40 AM Michal Hocko <mhocko@...e.com> wrote:
> >
> > On Tue 01-08-23 10:29:39, Yosry Ahmed wrote:
> > > On Tue, Aug 1, 2023 at 9:39 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
> > [...]
> > > > > Have you measured any potential regression for cgroup v2 which collects
> > > > > all this data without ever using it (AFAICS)?
> > > >
> > > > I did not. I did not expect noticeable regressions given that all the
> > > > extra work is done during flushing, which should mostly be done by the
> > > > asynchronous worker, but can also happen in the stats reading context.
> > > > Let me run the same script on cgroup v2 just in case and report back.
> > >
> > > A few runs on mm-unstable with this patch:
> > >
> > > # time cat /sys/fs/cgroup/cg*/memory.stat > /dev/null
> >
> > Is this really representative test to make? I would have expected the
> > overhead would be mostly in mem_cgroup_css_rstat_flush (if it is visible
> > at all of course). This would be more likely visible in all cpus busy
> > situation (you can try heavy parallel kernel build from tmpfs for
> > example).
>
>
> I see. You are more worried about asynchronous flushing eating cpu
> time rather than the synchronous flushing being slower. In fact, my
> test is actually not representative at all because probably most of
> the cgroups either do not have updates or the asynchronous flusher got
> to them first.
>
> Let me try a workload that is more parallel & cpu intensive and report
> back. I am thinking of parallel reclaim/refault loops since both
> reclaim and refault paths invoke stat updates and stat flushing.
>

I am back with more data.

I wrote a small reclaim/refault stress test that creates (NR_CPUS * 2)
cgroups and assigns them limits. Each cgroup runs a worker process that
continuously allocates tmpfs memory equal to four times the limit (to
invoke reclaim) and then reads back the entire file (to invoke
refaults). All workers run in parallel, and zram is used as the swap
backend. Both the reclaim and refault paths flush stats conditionally.
I ran this on a machine with 112 cpus, once on mm-unstable and once on
mm-unstable with this patch reverted. The script is attached.

(1) A few runs without this patch:

# time ./stress_reclaim_refault.sh
real 0m9.949s
user 0m0.496s
sys 14m44.974s

# time ./stress_reclaim_refault.sh
real 0m10.049s
user 0m0.486s
sys 14m55.791s

# time ./stress_reclaim_refault.sh
real 0m9.984s
user 0m0.481s
sys 14m53.841s

(2) A few runs with this patch:

# time ./stress_reclaim_refault.sh
real 0m9.885s
user 0m0.486s
sys 14m48.753s

# time ./stress_reclaim_refault.sh
real 0m9.903s
user 0m0.495s
sys 14m48.339s

# time ./stress_reclaim_refault.sh
real 0m9.861s
user 0m0.507s
sys 14m49.317s

I do not see any regressions from this patch. There is actually a very
slight improvement. If I had to guess, it may be because we avoid the
percpu loop in count_shadow_nodes() when calling
lruvec_page_state_local(), but I could not confirm this using perf, so
the difference is probably in the noise.
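
To put a number on "very slight", here is a minimal sketch that averages
the real and sys times from the six runs above and reports the relative
difference. The helper names (to_seconds, summarize) are made up for
this illustration and are not part of the attached script. With only
three runs per configuration this says nothing about statistical
significance.

```python
import re
from statistics import mean


def to_seconds(stamp: str) -> float:
    """Convert a `time` value such as '14m44.974s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Values copied verbatim from the runs quoted in this mail.
without_patch = {
    "real": ["0m9.949s", "0m10.049s", "0m9.984s"],
    "sys": ["14m44.974s", "14m55.791s", "14m53.841s"],
}
with_patch = {
    "real": ["0m9.885s", "0m9.903s", "0m9.861s"],
    "sys": ["14m48.753s", "14m48.339s", "14m49.317s"],
}


def summarize(metric: str) -> tuple[float, float, float]:
    """Return (mean without patch, mean with patch, percent reduction)."""
    before = mean(to_seconds(v) for v in without_patch[metric])
    after = mean(to_seconds(v) for v in with_patch[metric])
    return before, after, (before - after) / before * 100.0


if __name__ == "__main__":
    for metric in ("real", "sys"):
        before, after, pct = summarize(metric)
        print(f"{metric}: {before:.3f}s -> {after:.3f}s ({pct:.2f}% lower)")
```

On these numbers the mean sys time drops by about 0.3% and the mean
real time by about 1.1% with the patch applied.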

Let me know if the testing is satisfactory for you. I can send an
updated commit log accordingly with a summary of this conversation.

> > --
> > Michal Hocko
> > SUSE Labs

[Attachment: "stress_reclaim_refault.sh", type "text/x-sh", 1070 bytes]