linux-kernel - Re: [PATCH] Revert "mm:vmscan: fix inaccurate reclaim during proactive reclaim"

Message-ID: <CABdmKX2K4MMe9rsKfWi9RxUS5G1RkLVzuUkPnovt5O2hqVmbWA@mail.gmail.com>
Date: Tue, 23 Jan 2024 05:58:05 -0800
From: "T.J. Mercier" <tjmercier@...gle.com>
To: Michal Hocko <mhocko@...e.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Shakeel Butt <shakeelb@...gle.com>, Muchun Song <muchun.song@...ux.dev>, 
	Andrew Morton <akpm@...ux-foundation.org>, android-mm@...gle.com, yuzhao@...gle.com, 
	yangyifei03@...ishou.com, cgroups@...r.kernel.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Revert "mm:vmscan: fix inaccurate reclaim during
 proactive reclaim"

On Tue, Jan 23, 2024 at 1:33 AM Michal Hocko <mhocko@...e.com> wrote:
>
> On Sun 21-01-24 21:44:12, T.J. Mercier wrote:
> > This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
> >
> > Proactive reclaim on the root cgroup is 10x slower after this patch when
> > MGLRU is enabled, and completion times for proactive reclaim on much
> > smaller non-root cgroups take ~30% longer (with or without MGLRU).
>
> What is the reclaim target in these pro-active reclaim requests?

Two targets:
1) /sys/fs/cgroup/memory.reclaim
2) /sys/fs/cgroup/uid_0/memory.reclaim (a bunch of Android system services)

Note that lru_gen_shrink_node is used for 1, but shrink_node_memcgs is
used for 2.

The 10x comes from the rate of reclaim (~70k pages/sec vs ~6.6k
pages/sec) for 1. After this revert the root reclaim took only about
10 seconds. Before the revert it was still running after about 3 minutes,
using a core at 100% the whole time, and I was too impatient to wait
longer to record times for comparison.

The 30% comes from the average of a few runs for 2:
Before revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'
restarting adbd as root
    0m09.69s real     0m00.00s user     0m09.19s system

After revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'
    0m07.31s real     0m00.00s user     0m06.44s system

It's actually a bigger difference for smaller reclaim amounts:
Before revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "3G" > /sys/fs/cgroup/uid_0/memory.reclaim'
    0m12.04s real     0m00.00s user     0m11.48s system

After revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "3G" > /sys/fs/cgroup/uid_0/memory.reclaim'
    0m06.65s real     0m00.00s user     0m05.91s system
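
For reference, the ratios behind the "10x" and "~30%" figures can be rechecked from the numbers quoted above. This is only a sketch of the arithmetic: the timings are the single sample runs shown here (the ~30% is an average over a few runs), and the helper name percent_longer is made up for illustration.

```python
# Figures copied from the measurements quoted in this message.
RATE_AFTER_REVERT = 70_000   # pages/sec, root cgroup, approximate
RATE_BEFORE_REVERT = 6_600   # pages/sec, root cgroup, approximate


def percent_longer(slow: float, fast: float) -> float:
    """Return how much longer `slow` takes than `fast`, in percent."""
    return (slow - fast) / fast * 100.0


rate_ratio = RATE_AFTER_REVERT / RATE_BEFORE_REVERT

# "real" times in seconds for uid_0: (before revert, after revert).
empty_write = (9.69, 7.31)    # echo "" > memory.reclaim
three_gig = (12.04, 6.65)     # echo "3G" > memory.reclaim

print(f"root reclaim rate ratio: {rate_ratio:.1f}x")
print(f'echo "":   {percent_longer(*empty_write):.0f}% longer before the revert')
print(f'echo "3G": {percent_longer(*three_gig):.0f}% longer before the revert')
```

With these inputs the rate ratio is about 10.6x, and the two sample runs come out at roughly 33% and 81% longer before the revert.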

> > With
> > root reclaim before the patch, I observe average reclaim rates of
> > ~70k pages/sec before try_to_free_mem_cgroup_pages starts to fail and
> > the nr_retries counter starts to decrement, eventually ending the
> > proactive reclaim attempt.
>
> Do I understand correctly that the reclaim target is over estimated and
> you expect that the reclaim process breaks out early

Yes. When I specify a large amount to reclaim, I expect memory_reclaim
to fail at some point, once it becomes difficult/impossible to reclaim
more pages. The ask here is, "please reclaim as much as possible from
this cgroup, but don't take all day". But it takes minutes to get
there on the root cgroup, working SWAP_CLUSTER_MAX pages at a time.
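
A toy model may help show why an over-estimated request takes so long to fail when each call is capped at SWAP_CLUSTER_MAX pages. This is a simplified user-space sketch, not the kernel code: the function proactive_reclaim, its parameters, the retry budget of 16, and the "reclaimer" that frees exactly what it is asked for are all assumptions made for illustration.

```python
SWAP_CLUSTER_MAX = 32  # batch size mentioned in this thread


def proactive_reclaim(reclaimable, requested, batch_cap=None, max_retries=16):
    """Toy model of a proactive reclaim loop; returns (pages_reclaimed, calls).

    reclaimable: pages the hypothetical reclaimer can still free
    requested:   pages asked for via memory.reclaim
    batch_cap:   None = ask for everything outstanding on each call,
                 otherwise cap each call at this many pages
    max_retries: assumed retry budget (illustrative value)
    """
    reclaimed = 0
    calls = 0
    retries = max_retries
    while reclaimed < requested:
        want = requested - reclaimed
        if batch_cap is not None:
            want = min(want, batch_cap)
        # Stand-in for try_to_free_mem_cgroup_pages(): frees exactly what
        # it is asked for until nothing reclaimable is left.
        got = min(want, reclaimable)
        reclaimable -= got
        calls += 1
        if got == 0:
            if retries == 0:
                break  # give up: the request was an over-estimate
            retries -= 1
        reclaimed += got
    return reclaimed, calls


# Ask for far more than can be reclaimed (hypothetical numbers).
print(proactive_reclaim(100_000, 1_000_000))
print(proactive_reclaim(100_000, 1_000_000, batch_cap=SWAP_CLUSTER_MAX))
```

In this model both variants free the same 100,000 pages, but the capped loop needs thousands of calls before the first zero-progress call, so the retry budget is not touched until the pool is fully drained.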

> > After the patch the reclaim rate is
> > consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> > scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> > proactive reclaim doesn't complete after several minutes because
> > try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> > tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
>
> I do not understand this part. How does a smaller reclaim target manages
> to have reclaimed > 0 while larger one doesn't?

They are both able to make progress. The main difference is that a
single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
after it reclaims nr_to_reclaim, and before it touches all memcgs. So
a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
pages with MGLRU. Without MGLRU the memcg walk is not aborted
immediately after nr_to_reclaim is reached, so a single call to
try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
even when sc->nr_to_reclaim is 32. (I.e., MGLRU overreclaims less.)
https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@google.com/
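
The difference between the two walks can be sketched with a toy model. This is not how the kernel implements either path: the function walk_memcgs, the per-memcg batch of 32 pages, and the 200 memcgs with 1000 reclaimable pages each are hypothetical values chosen only to show how skipping the early abort leads to overreclaim.

```python
def walk_memcgs(reclaimable_per_memcg, nr_to_reclaim, per_memcg_batch,
                abort_on_target):
    """Toy model of one reclaim pass over a list of memcgs.

    Each memcg gives up at most per_memcg_batch pages. If abort_on_target
    is true, the walk stops as soon as nr_to_reclaim has been met (the
    MGLRU-like behaviour described above); otherwise every memcg is visited.
    Returns the total number of pages reclaimed in the pass.
    """
    total = 0
    for available in reclaimable_per_memcg:
        total += min(available, per_memcg_batch)
        if abort_on_target and total >= nr_to_reclaim:
            break
    return total


memcgs = [1000] * 200  # hypothetical: 200 memcgs, 1000 reclaimable pages each

print(walk_memcgs(memcgs, 32, 32, abort_on_target=True))
print(walk_memcgs(memcgs, 32, 32, abort_on_target=False))
```

With these made-up inputs the aborting walk stops at 32 pages, while the full walk reclaims 6400 pages for the same nr_to_reclaim of 32.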