Date: Mon, 5 Feb 2024 20:01:40 -0800
From: "T.J. Mercier" <tjmercier@...gle.com>
To: Michal Hocko <mhocko@...e.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Shakeel Butt <shakeelb@...gle.com>, Muchun Song <muchun.song@...ux.dev>, 
	Andrew Morton <akpm@...ux-foundation.org>, Efly Young <yangyifei03@...ishou.com>, 
	android-mm@...gle.com, yuzhao@...gle.com, mkoutny@...e.com, 
	Yosry Ahmed <yosryahmed@...gle.com>, cgroups@...r.kernel.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] mm: memcg: Use larger batches for proactive reclaim

On Mon, Feb 5, 2024 at 1:16 PM Michal Hocko <mhocko@...e.com> wrote:
>
> On Mon 05-02-24 12:47:47, T.J. Mercier wrote:
> > On Mon, Feb 5, 2024 at 12:36 PM Michal Hocko <mhocko@...e.com> wrote:
> [...]
> > > Think of something like
> > > timeout $TIMEOUT echo $TARGET > $MEMCG_PATH/memory.reclaim
> > > where timeout acts as a stop gap if the reclaim cannot finish in
> > > TIMEOUT.
> >
> > Yeah I get the desired behavior, but using sc->nr_reclaimed to achieve
> > it is what's bothering me.
>
> I am not really happy about this subtlety. If we have a better way then
> let's do it. Better in its own patch, though.
>
> > It's already wired up that way though, so if you want to make this
> > change now then I can try to test for the difference using really
> > large reclaim targets.
>
> Yes, please. If you want it as a separate patch then no objection from me
> of course. If you do not like the nr_to_reclaim bailout then maybe we can
> go with a simple break out flag in scan_control.
>
> Thanks!

It's a bit difficult to test under the too_many_isolated check, so I
moved the fatal_signal_pending check outside and tried with that.
Performing full reclaim on the /uid_0 cgroup with a 250ms delay before
SIGKILL, the sc->nr_to_reclaim bailout gave 16ms better latency on
average than SWAP_CLUSTER_MAX across 20 runs, ignoring one ~1s outlier
in the SWAP_CLUSTER_MAX runs.
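
For reference, a minimal sketch of this kind of test (the cgroup path
and reclaim target here are assumptions, not the exact values used):

  MEMCG=/sys/fs/cgroup/uid_0

  # Ask for far more reclaim than can complete, in the background.
  sh -c "echo 1T > $MEMCG/memory.reclaim" &
  pid=$!

  sleep 0.25        # let reclaim run for ~250ms
  kill -KILL $pid   # then deliver SIGKILL to the writer

  wait $pid
  echo $?           # 137 (128 + SIGKILL) with either bailout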

The return values from memory_reclaim are different, since with
sc->nr_to_reclaim we "succeed" and never reach the signal_pending check
that returns -EINTR. I don't think it matters, though, since the return
code is 137 (SIGKILL) in both cases.

Besides that ~1s outlier, the latency numbers with SWAP_CLUSTER_MAX
were in general noisier: 13% RSD, versus 2% RSD with sc->nr_to_reclaim.
I'm guessing that's a function of nr_to_scan occasionally being much
less than SWAP_CLUSTER_MAX, causing nr[lru] to drain slowly. But it could
also have simply been scheduled out more often at the cond_resched in
shrink_lruvec, and that would help explain the 1s outlier. I don't
have enough debug info on the outlier to say much more.

With sc->nr_to_reclaim, the largest sc->nr_reclaimed value I saw was
about 2^53 for a sc->nr_to_reclaim of 2^51, but for large memcg
hierarchies I think it's possible to get more than that. There were
only 15 cgroups under /uid_0. This is the only thing that gives me
pause, since we could touch more than 2k cgroups in
shrink_node_memcgs, each one adding up to 4 * 2^51, potentially
overflowing sc->nr_reclaimed. Looks testable, but I didn't get to it.
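
For a rough sense of scale, treating more than 2k cgroups each adding
the full 4 * 2^51 as the worst case: 2048 * 2^53 = 2^64, one past
ULONG_MAX on a 64-bit build, so the unsigned long accumulation would
wrap.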
