linux-kernel - Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg

Message-ID: <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
Date: Wed, 30 Oct 2024 20:25:27 +0000
From: Usama Arif <usamaarif642@...il.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Barry Song <21cnbao@...il.com>, akpm@...ux-foundation.org,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org,
 Barry Song <v-songbaohua@...o.com>,
 Kanchana P Sridhar <kanchana.p.sridhar@...el.com>,
 David Hildenbrand <david@...hat.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>, Chris Li <chrisl@...nel.org>,
 "Huang, Ying" <ying.huang@...el.com>, Kairui Song <kasong@...cent.com>,
 Ryan Roberts <ryan.roberts@....com>, Johannes Weiner <hannes@...xchg.org>,
 Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>,
 Shakeel Butt <shakeel.butt@...ux.dev>, Muchun Song <muchun.song@...ux.dev>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing
 for nearly full memcg



On 30/10/2024 19:51, Yosry Ahmed wrote:
> [..]
>>> My second point about the mitigation is as follows: For a system (or
>>> memcg) under severe memory pressure, especially one without hardware TLB
>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>> of optimization. Could the mitigation code help in automatically tuning
>>> this fragmentation?
>>>
>>
>> I agree with the point that enabling mTHP always is not the right thing to do
>> on all platforms. I also think it might be the case that enabling mTHP
>> might be a good thing for some workloads, but enabling mTHP swapin along with
>> it might not.
>>
>> As you said, when you have apps switching between foreground and background
>> in Android, it probably makes sense to have large folio swapping, as you
>> want to bring in all the pages from the background app as quickly as possible.
>> You also get all the TLB optimizations and smaller LRU overhead after
>> you have brought in all the pages.
>> The Linux kernel build test doesn't really get to benefit from the TLB optimization
>> and smaller LRU overhead, as the pages are probably very short lived. So I
>> think it doesn't show the benefit of large folio swapin properly, and
>> large folio swapin should probably be disabled for this kind of workload,
>> even though mTHP should be enabled.
>>
>> I am not sure that the approach we are trying in this patch is the right way:
>> - This patch makes it a memcg issue, but you could have memcg disabled, and
>> then the mitigation being tried here won't apply.
> 
> Is the problem reproducible without memcg? I imagine only if the
> entire system is under memory pressure. I guess we would want the same
> "mitigation" either way.
> 
What would be a good open source benchmark/workload to test without limiting memory
in memcg?
For the kernel build test, I can only get zswap activity to happen if I build
in a cgroup and limit memory.max.
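
For anyone wanting to reproduce the setup, the limit can be applied by writing
a byte count to the cgroup's memory.max file. A minimal sketch, assuming cgroup v2
and an already-created cgroup directory; set_memory_max() is a hypothetical
helper name, not something from the kernel tree:

```c
#include <stdio.h>

/*
 * Hypothetical helper: write limit_bytes to <cgroup_dir>/memory.max.
 * Assumes cgroup_dir is an existing cgroup v2 directory.
 * Returns 0 on success, -1 on any error.
 */
static int set_memory_max(const char *cgroup_dir, unsigned long long limit_bytes)
{
	char path[4096];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/memory.max", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path did not fit in the buffer */

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fprintf(f, "%llu\n", limit_bytes) < 0) {
		fclose(f);
		return -1;
	}

	/* fclose() flushes the write, so its result matters here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

A 4G limit would be set_memory_max(dir, 4ULL << 30), with the build then started
from a shell that has been moved into that cgroup.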

I can just run large folio zswapin in production and see, but that will take me a few
days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
then maybe it's not really an issue? I believe Barry doesn't see an issue on Android
phones (but please correct me if I am wrong), and if there isn't an issue in Meta
production either, that's a good data point for servers too. And maybe
kernel build in a 4G memcg is not a good test.

>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> issue? If we use zswap (without the large folio swapin series) and change the window
>> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
>> well.
> 
> I think large folio swapin would make the problem worse anyway. I am
> also not sure if the readahead window adjusts on memory pressure or
> not.
> 
The readahead window doesn't look at memory pressure. So maybe the same thing is being
seen here as there would be in swapin_readahead? Maybe if we check kernel build test
performance in a 4G memcg with the diff below, it might get better?

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4669f29cf555..9e196e1e6885 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
        pgoff_t ilx;
        bool page_allocated;
 
-       win = swap_vma_ra_win(vmf, &start, &end);
+       win = 1;
        if (win == 1)
                goto skip;

>> - Instead of looking at cgroup margin, maybe we should try and look at
>> the rate of change of workingset_restore_anon? This might be a lot more complicated
>> to do, but probably is the right metric to determine swap thrashing. It also means
>> that this could be used in both the synchronous swapcache skipping path and
>> swapin_readahead path.
>> (Thanks Johannes for suggesting this)
>>
>> With the large folio swapin, I do see the large improvement when considering only
>> swapin performance and latency in the same way as you saw in zram.
>> Maybe the right short term approach is to have
>> /sys/kernel/mm/transparent_hugepage/swapin
>> and have that disabled by default to avoid regression.
>> If the workload owner sees a benefit, they can enable it.
>> I can add this when sending the next version of large folio zswapin if that makes
>> sense?
> 
> I would honestly prefer we avoid this if possible. It's always easy to
> just put features behind knobs, and then users have the toil of
> figuring out if/when they can use it, or just give up. We should find
> a way to avoid the thrashing due to hitting the memcg limit (or being
> under global memory pressure), it seems like something the kernel
> should be able to do on its own.
> 
>> Longer term I can try and have a look at if we can do something with
>> workingset_restore_anon to improve things.
> 
> I am not a big fan of this, mainly because reading a stat from the
> kernel puts us in a situation where we have to choose between:
> - Doing a memcg stats flush in the kernel, which is something we are
> trying to move away from due to various problems we have been running
> into.
> - Using potentially stale stats (up to 2s), which may be fine but is
> suboptimal at best. We may have blips of thrashing due to stale stats
> not showing the refaults.
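
To make the "rate of change of workingset_restore_anon" idea concrete, here is a
rough user-space sketch of the arithmetic only: pull the counter out of two
memory.stat-style samples and divide the delta by the sampling interval. It says
nothing about how the kernel would do this internally, and it has the same
staleness caveat raised above. stat_value() and restore_rate() are hypothetical
names used for illustration:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Look up `key` in a memory.stat-style buffer ("name value\n" per line).
 * Returns the value, or -1 if the key is not present.
 */
static long long stat_value(const char *buf, const char *key)
{
	size_t klen = strlen(key);
	const char *p = buf;

	while (p && *p) {
		/* p always points at the start of a line here. */
		if (strncmp(p, key, klen) == 0 && p[klen] == ' ')
			return strtoll(p + klen + 1, NULL, 10);
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/*
 * Restores per second between two samples taken interval_sec apart.
 * Returns 0 for invalid input (missing sample, counter went backwards,
 * or a non-positive interval).
 */
static double restore_rate(long long prev, long long cur, double interval_sec)
{
	if (prev < 0 || cur < prev || interval_sec <= 0.0)
		return 0.0;
	return (double)(cur - prev) / interval_sec;
}
```

Feeding it the contents of memory.stat read at two points in time gives a
restores-per-second figure; what threshold would count as thrashing is an open
question and not something this sketch tries to answer.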