linux-kernel - Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg

Message-ID: <CAJD7tkaXL_vMsgYET9yjYQW5pM2c60fD_7r_z4vkMPcqferS8A@mail.gmail.com>
Date: Wed, 30 Oct 2024 14:01:11 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Barry Song <21cnbao@...il.com>, akpm@...ux-foundation.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>, 
	Kanchana P Sridhar <kanchana.p.sridhar@...el.com>, David Hildenbrand <david@...hat.com>, 
	Baolin Wang <baolin.wang@...ux.alibaba.com>, Chris Li <chrisl@...nel.org>, 
	"Huang, Ying" <ying.huang@...el.com>, Kairui Song <kasong@...cent.com>, 
	Ryan Roberts <ryan.roberts@....com>, Johannes Weiner <hannes@...xchg.org>, 
	Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Muchun Song <muchun.song@...ux.dev>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing
 for nearly full memcg

On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@...il.com> wrote:
>
>
>
> On 30/10/2024 19:51, Yosry Ahmed wrote:
> > [..]
> >>> My second point about the mitigation is as follows: For a system (or
> >>> memcg) under severe memory pressure, especially one without hardware TLB
> >>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
> >>> a larger granularity, some internal fragmentation is unavoidable, regardless
> >>> of optimization. Could the mitigation code help in automatically tuning
> >>> this fragmentation?
> >>>
> >>
> >> I agree with the point that enabling mTHP always is not the right thing to do
> >> on all platforms. I also think it might be the case that enabling mTHP
> >> might be a good thing for some workloads, but enabling mTHP swapin along with
> >> it might not.
> >>
> >> As you said, when you have apps switching between foreground and background
> >> on Android, it probably makes sense to have large folio swapping, as you
> >> want to bring in all the pages of the background app as quickly as possible.
> >> You also get all the TLB optimizations and smaller LRU overhead after
> >> you have brought in all the pages.
> >> The Linux kernel build test doesn't really get to benefit from the TLB optimization
> >> and smaller LRU overhead, as the pages are probably very short-lived. So I
> >> think it doesn't show the benefit of large folio swapin properly, and
> >> large folio swapin should probably be disabled for this kind of workload,
> >> even though mTHP should be enabled.
> >>
> >> I am not sure that the approach we are trying in this patch is the right way:
> >> - This patch makes it a memcg issue, but you could have memcg disabled, and
> >> then the mitigation being tried here won't apply.
> >
> > Is the problem reproducible without memcg? I imagine only if the
> > entire system is under memory pressure. I guess we would want the same
> > "mitigation" either way.
> >
> What would be a good open source benchmark/workload to test without limiting memory
> in memcg?
> For the kernel build test, I can only get zswap activity to happen if I build
> in a cgroup and limit memory.max.

You mean a benchmark that puts the entire system under memory
pressure? I am not sure, it ultimately depends on the size of memory
you have, among other factors.

What if you run the kernel build test in a VM? Then you can limit its
size like a memcg, although you'd probably need to leave more room
because the entire guest OS will also be subject to the same limit.

>
> I can just run large folio zswapin in production and see, but that will take me a few
> days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
> then maybe it's not really an issue? I believe Barry doesn't see an issue on Android
> phones (but please correct me if I am wrong), and if there isn't an issue in Meta
> production either, it's a good data point for servers as well. And maybe
> kernel build in a 4G memcg is not a good test.

If there is a regression in the kernel build, this means some
workloads may be affected, even if Meta's prod isn't. I understand
that the benchmark is not very representative of real world workloads,
but in this instance I think the thrashing problem surfaced by the
benchmark is real.

>
> >> - Instead of this being a large folio swapin issue, is it more of a readahead
> >> issue? If we use zswap (without the large folio swapin series) and change the window
> >> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
> >> when cgroup memory is limited, as readahead would probably cause swap thrashing as
> >> well.
> >
> > I think large folio swapin would make the problem worse anyway. I am
> > also not sure if the readahead window adjusts on memory pressure or
> > not.
> >
> The readahead window doesn't look at memory pressure. So maybe the same thing is being
> seen here as there would be in swapin_readahead?

Maybe readahead is not as aggressive in general as large folio
swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
of the window is the smaller of page_cluster (2 or 3) and
SWAP_RA_ORDER_CEILING (5).
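
To make that bound concrete, here is a rough userspace sketch of the
arithmetic. It is not the kernel code: max_ra_window_pages() is a
hypothetical helper, and the ceiling of 5 is simply the value quoted
above, so treat both as assumptions.

```c
/* Value quoted above for SWAP_RA_ORDER_CEILING; an assumption in this sketch. */
#define SWAP_RA_ORDER_CEILING 5

/*
 * Hypothetical helper (not a kernel function): upper bound on the VMA
 * readahead window, in pages, for a given page_cluster value. The order
 * is the smaller of page_cluster and the ceiling; the window is 2^order.
 */
unsigned long max_ra_window_pages(unsigned int page_cluster)
{
	unsigned int order = page_cluster < SWAP_RA_ORDER_CEILING ?
			     page_cluster : SWAP_RA_ORDER_CEILING;

	return 1UL << order;
}
```

With page_cluster at 3 this gives at most 8 pages (32 KiB with 4k
pages), and with page_cluster at 2 it gives 4 pages, so the ceiling of
5 only matters if page_cluster is raised above it.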

Also, readahead will swap in 4k folios AFAICT, so it doesn't need a
contiguous allocation like large folio swapin does. So that could be
another reason why readahead may not reproduce the problem.

> Maybe if we check kernel build test
> performance in 4G memcg with below diff, it might get better?

I think you can use the page_cluster tunable to do this at runtime.
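
For reference, a minimal sketch of flipping the tunable from a small C
program. The helpers write_page_cluster() and read_page_cluster() are
hypothetical names; the path is where the vm.page-cluster sysctl is
normally exposed, and writing it needs root. My assumption is that a
value of 0 limits the window to a single page, which should be roughly
what the diff below forces.

```c
#include <stdio.h>

/* Usual procfs location of the vm.page-cluster sysctl. */
#define PAGE_CLUSTER_PATH "/proc/sys/vm/page-cluster"

/* Hypothetical helper: write an integer to a sysctl-style file.
 * Returns 0 on success, -1 on failure (e.g. no permission). */
int write_page_cluster(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Hypothetical helper: read the current integer value back.
 * Returns 0 on success, -1 on failure. */
int read_page_cluster(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling write_page_cluster(PAGE_CLUSTER_PATH, 0) before the build and
restoring the old value afterwards would avoid rebuilding the kernel
just to test this.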

>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4669f29cf555..9e196e1e6885 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>         pgoff_t ilx;
>         bool page_allocated;
>
> -       win = swap_vma_ra_win(vmf, &start, &end);
> +       win = 1;
>         if (win == 1)
>                 goto skip;
>