Message-ID: <CAJD7tkZ60ROeHek92jgO0z7LsEfgPbfXN9naUC5j7QjRQxpoKw@mail.gmail.com>
Date: Wed, 30 Oct 2024 12:51:58 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Barry Song <21cnbao@...il.com>, akpm@...ux-foundation.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>, 
	Kanchana P Sridhar <kanchana.p.sridhar@...el.com>, David Hildenbrand <david@...hat.com>, 
	Baolin Wang <baolin.wang@...ux.alibaba.com>, Chris Li <chrisl@...nel.org>, 
	"Huang, Ying" <ying.huang@...el.com>, Kairui Song <kasong@...cent.com>, 
	Ryan Roberts <ryan.roberts@....com>, Johannes Weiner <hannes@...xchg.org>, 
	Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Muchun Song <muchun.song@...ux.dev>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing
 for nearly full memcg

[..]
> > My second point about the mitigation is as follows: For a system (or
> > memcg) under severe memory pressure, especially one without hardware TLB
> > optimization, is enabling mTHP always the right choice? Since mTHP operates at
> > a larger granularity, some internal fragmentation is unavoidable, regardless
> > of optimization. Could the mitigation code help in automatically tuning
> > this fragmentation?
> >
>
> I agree with the point that always enabling mTHP is not the right thing to do
> on all platforms. It might also be the case that enabling mTHP
> is a good thing for some workloads, while enabling mTHP swapin along with
> it is not.
>
> As you said, when you have apps switching between foreground and background
> in Android, it probably makes sense to have large folio swapping, as you
> want to bring in all the pages from the background app as quickly as possible,
> and you also get all the TLB optimizations and smaller LRU overhead after
> you have brought in all the pages.
> The Linux kernel build test doesn't really get to benefit from the TLB
> optimization and smaller LRU overhead, as the pages are probably very
> short-lived. So I think it doesn't show the benefit of large folio swapin
> properly, and large folio swapin should probably be disabled for this kind
> of workload, even though mTHP should be enabled.
>
> I am not sure that the approach we are trying in this patch is the right way:
> - This patch makes it a memcg issue, but you could have memcg disabled and
> then the mitigation being tried here won't apply.

Is the problem reproducible without memcg? I imagine only if the
entire system is under memory pressure. I guess we would want the same
"mitigation" either way.

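For concreteness, the mitigation being discussed is roughly along the lines
of the sketch below (illustrative only, not the actual RFC patch; the helper
name and threshold are assumptions, and mem_cgroup_margin() is currently a
static helper inside mm/memcontrol.c):

/*
 * Sketch: choose the swapin folio order based on how much headroom the
 * memcg has left, falling back to order-0 when close to the limit.
 */
static int swapin_pick_order(struct mem_cgroup *memcg, int desired_order)
{
	unsigned long margin_pages;

	if (!memcg || !desired_order)
		return desired_order;

	/* Headroom (in pages) before the memcg hits its limit. */
	margin_pages = mem_cgroup_margin(memcg);

	/* Not enough headroom for the whole large folio: fall back. */
	if (margin_pages < (1UL << desired_order))
		return 0;

	return desired_order;
}
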
> - Instead of this being a large folio swapin issue, is it more of a readahead
> issue? If we use zswap (without the large folio swapin series) and change the
> window to 1 in swap_vma_readahead, we might see an improvement in Linux kernel
> build time when cgroup memory is limited, as readahead would probably cause
> swap thrashing as well.

I think large folio swapin would make the problem worse anyway. I am
also not sure if the readahead window adjusts on memory pressure or
not.
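
For reference, the VMA readahead window is bounded by the vm.page-cluster
sysctl, so the "window of 1" experiment roughly corresponds to capping it
like this (simplified from swap_ra_info() in mm/swap_state.c; the exact
code differs between kernel versions):

/* vm.page-cluster caps the window; page-cluster == 0 means max_win == 1. */
static unsigned int swap_ra_max_win(void)
{
	return 1U << min_t(unsigned int, READ_ONCE(page_cluster),
			   SWAP_RA_ORDER_CEILING);
}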

> - Instead of looking at the cgroup margin, maybe we should try to look at
> the rate of change of workingset_restore_anon? This might be a lot more
> complicated to do, but it is probably the right metric to determine swap
> thrashing. It also means that this could be used in both the synchronous
> swapcache skipping path and the swapin_readahead path.
> (Thanks Johannes for suggesting this.)
>
> With the large folio swapin, I do see a large improvement when considering only
> swapin performance and latency, in the same way as you saw in zram.
> Maybe the right short-term approach is to have
> /sys/kernel/mm/transparent_hugepage/swapin
> and have that disabled by default to avoid regressions.
> If the workload owner sees a benefit, they can enable it.
> I can add this when sending the next version of large folio zswapin, if that
> makes sense?
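>
> A rough sketch of the hypothetical knob (modeled on the existing THP sysfs
> attributes in mm/huge_memory.c; untested, and the name and placement are just
> a suggestion, not an existing interface):
>
> static bool thp_swapin_enabled __read_mostly;
>
> static ssize_t swapin_show(struct kobject *kobj,
>                            struct kobj_attribute *attr, char *buf)
> {
>         return sysfs_emit(buf, "%d\n", READ_ONCE(thp_swapin_enabled));
> }
>
> static ssize_t swapin_store(struct kobject *kobj,
>                             struct kobj_attribute *attr,
>                             const char *buf, size_t count)
> {
>         bool val;
>         int err = kstrtobool(buf, &val);
>
>         if (err)
>                 return err;
>         WRITE_ONCE(thp_swapin_enabled, val);
>         return count;
> }
>
> static struct kobj_attribute swapin_attr = __ATTR_RW(swapin);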

I would honestly prefer we avoid this if possible. It's always easy to
just put features behind knobs, and then users have the toil of
figuring out if/when they can use them, or they just give up. We should
find a way to avoid the thrashing due to hitting the memcg limit (or
being under global memory pressure); it seems like something the kernel
should be able to do on its own.

> Longer term, I can try to have a look at whether we can do something with
> workingset_restore_anon to improve things.

I am not a big fan of this, mainly because reading a stat from the
kernel puts us in a situation where we have to choose between:
- Doing a memcg stats flush in the kernel, which is something we are
trying to move away from due to various problems we have been running
into.
- Using potentially stale stats (up to 2s), which may be fine but is
suboptimal at best. We may have blips of thrashing due to stale stats
not showing the refaults.
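
To make that tradeoff concrete, consuming the stat in the swapin path would
look roughly like the sketch below (names approximate; memcg_page_state(),
mem_cgroup_flush_stats(), and their signatures vary between kernel versions):

/*
 * Sketch: estimate recent anon workingset refaults for a memcg.  Either
 * flush first (expensive) or accept values that may be up to ~2s stale.
 */
static unsigned long anon_restore_delta(struct mem_cgroup *memcg,
					unsigned long *prev)
{
	unsigned long now, delta;

	mem_cgroup_flush_stats(memcg);	/* or skip and tolerate staleness */

	now = memcg_page_state(memcg, WORKINGSET_RESTORE_ANON);
	delta = now - *prev;
	*prev = now;
	return delta;
}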
