Message-ID: <d88867ba-32e6-4b28-961e-756389d345ca@gmail.com>
Date: Wed, 30 Oct 2024 20:41:12 +0000
From: Usama Arif <usamaarif642@...il.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>,
Kanchana P Sridhar <kanchana.p.sridhar@...el.com>,
David Hildenbrand <david@...hat.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, Chris Li <chrisl@...nel.org>,
Yosry Ahmed <yosryahmed@...gle.com>, "Huang, Ying" <ying.huang@...el.com>,
Kairui Song <kasong@...cent.com>, Ryan Roberts <ryan.roberts@....com>,
Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Shakeel Butt <shakeel.butt@...ux.dev>, Muchun Song <muchun.song@...ux.dev>
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing
for nearly full memcg
On 30/10/2024 20:27, Barry Song wrote:
> On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@...il.com> wrote:
>>
>>
>>
>> On 28/10/2024 22:03, Barry Song wrote:
>>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@...il.com> wrote:
>>>>
>>>>
>>>>
>>>> On 27/10/2024 01:14, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@...o.com>
>>>>>
>>>>> Always using mTHP in a memcg, even when it is at full capacity, might
>>>>> not be the best option. Consider a system that uses only small
>>>>> folios: after each reclamation, a process has a buffer of at least
>>>>> SWAP_CLUSTER_MAX pages before it can trigger the next reclamation.
>>>>> However, large folios can quickly fill this space, rapidly bringing
>>>>> the memcg back to full capacity, even though portions of those large
>>>>> folios may not be needed or used by the process any time soon.
>>>>>
>>>>> Usama and Kanchana identified a regression when building the kernel in
>>>>> a memcg with memory.max set to a small value while enabling large
>>>>> folio swap-in support on zswap[1].
>>>>>
>>>>> The issue arises from an edge case where the memory cgroup remains
>>>>> nearly full most of the time. Consequently, bringing in mTHP can
>>>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
>>>>> swap-in then recreates the overflow, resulting in a repetitive cycle.
>>>>>
>>>>> We need a mechanism to stop the cup from overflowing continuously.
>>>>> One potential solution is to slow the filling process when we identify
>>>>> that the cup is nearly full.
>>>>>
>>>>> Usama reported an improvement when we mitigate mTHP swap-in as the
>>>>> memcg approaches full capacity[2]:
>>>>>
>>>>> int mem_cgroup_swapin_charge_folio(...)
>>>>> {
>>>>>         ...
>>>>>         if (folio_test_large(folio) &&
>>>>>             mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
>>>>>                 ret = -ENOMEM;
>>>>>         else
>>>>>                 ret = charge_memcg(folio, memcg, gfp);
>>>>>         ...
>>>>> }
>>>>>
>>>>> AMD 16K+32K THP=always
>>>>> metric      mm-unstable  mm-unstable +        mm-unstable + large folio
>>>>>                          large folio          zswapin + no swap
>>>>>                          zswapin series       thrashing fix
>>>>> real        1m23.038s    1m23.050s            1m22.704s
>>>>> user        53m57.210s   53m53.437s           53m52.577s
>>>>> sys         7m24.592s    7m48.843s            7m22.519s
>>>>> zswpin      612070       999244               815934
>>>>> zswpout     2226403      2347979              2054980
>>>>> pgfault     20667366     20481728             20478690
>>>>> pgmajfault  385887       269117               309702
>>>>>
>>>>> AMD 16K+32K+64K THP=always
>>>>> metric      mm-unstable  mm-unstable +        mm-unstable + large folio
>>>>>                          large folio          zswapin + no swap
>>>>>                          zswapin series       thrashing fix
>>>>> real        1m22.975s    1m23.266s            1m22.549s
>>>>> user        53m51.302s   53m51.069s           53m46.471s
>>>>> sys         7m40.168s    7m57.104s            7m25.012s
>>>>> zswpin      676492       1258573              1225703
>>>>> zswpout     2449839      2714767              2899178
>>>>> pgfault     17540746     17296555             17234663
>>>>> pgmajfault  429629       307495               287859
>>>>>
>>>>> I wonder if we can extend the mitigation to do_anonymous_page() as
>>>>> well. Lacking hardware with TLB coalescing or CONT-PTE (such as AMD
>>>>> or ARM machines), I conducted a quick test on my Intel i9 workstation
>>>>> with 10 cores and 2 threads per core. I enabled a 12 GiB zRAM device
>>>>> while running kernel builds in a memcg with memory.max set to 1 GiB.
>>>>>
>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>>>
>>>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>>> CROSS_COMPILE=aarch64-linux-gnu- Image -j10 1>/dev/null 2>/dev/null
>>>>>
>>>>>              disable-mTHP-swapin  mm-unstable  with-this-patch
>>>>> Real:        6m54.595s            7m4.832s     6m45.811s
>>>>> User:        66m42.795s           66m59.984s   67m21.150s
>>>>> Sys:         12m7.092s            15m18.153s   12m52.644s
>>>>> pswpin:      4262327              11723248     5918690
>>>>> pswpout:     14883774             19574347     14026942
>>>>> 64k-swpout:  624447               889384       480039
>>>>> 32k-swpout:  115473               242288       73874
>>>>> 16k-swpout:  158203               294672       109142
>>>>> 64k-swpin:   0                    495869       159061
>>>>> 32k-swpin:   0                    219977       56158
>>>>> 16k-swpin:   0                    223501       81445
>>>>>
>>>>
>>>
>>> Hi Usama,
>>>
>>>> Hmm, both the user and sys times are worse with the patch compared to
>>>> disable-mTHP-swapin. I wonder if the real time is an anomaly; if you
>>>> repeat the experiment, might the real time be worse as well?
>>>
>>> Well, I've improved my script to include a loop:
>>>
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>
>>> for ((i=1; i<=100; i++))
>>> do
>>>         echo "Executing round $i"
>>>         make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>>>         echo 3 > /proc/sys/vm/drop_caches
>>>         time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>                 CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
>>>         cat /proc/vmstat | grep pswp
>>>         echo -n 64k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
>>>         echo -n 32k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
>>>         echo -n 16k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
>>>         echo -n 64k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
>>>         echo -n 32k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
>>>         echo -n 16k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
>>> done
>>>
>>> I've noticed that the user/sys/real times on my i9 machine fluctuate
>>> constantly, with results like:
>>> real 6m52.087s
>>> user 67m12.463s
>>> sys 13m8.281s
>>> ...
>>>
>>> real 7m42.937s
>>> user 66m55.250s
>>> sys 12m56.330s
>>> ...
>>>
>>> real 6m49.374s
>>> user 66m37.040s
>>> sys 12m44.542s
>>> ...
>>>
>>> real 6m54.205s
>>> user 65m49.732s
>>> sys 11m33.078s
>>> ...
>>>
>>> likely due to unstable temperatures and I/O latency. As a result, my
>>> data doesn't seem reference-worthy.
>>>
>>
>> So I had suggested retrying the experiment to see how reproducible it is,
>> but had not done that myself!
>> Thanks for sharing this. I tried many times on the AMD server and I see
>> varying numbers as well.
>>
>> AMD 16K THP always, cgroup = 4G, large folio zswapin patches
>> real 1m28.351s
>> user 54m14.476s
>> sys 8m46.596s
>> zswpin 811693
>> zswpout 2137310
>> pgfault 27344671
>> pgmajfault 290510
>> ..
>> real 1m24.557s
>> user 53m56.815s
>> sys 8m10.200s
>> zswpin 571532
>> zswpout 1645063
>> pgfault 26989075
>> pgmajfault 205177
>> ..
>> real 1m26.083s
>> user 54m5.303s
>> sys 9m55.247s
>> zswpin 1176292
>> zswpout 2910825
>> pgfault 27286835
>> pgmajfault 419746
>>
>>
>> The sys time especially can vary by a large amount. I think you see the same.
>>
>>
>>> As phone engineers, we never use phones to run kernel builds, and I'm
>>> also quite certain that phones won't provide stable and reliable data
>>> for this type of workload. Without access to a Linux server to conduct
>>> the test, I really need your help.
>>>
>>> I used to work on optimizing the ARM server scheduler and memory
>>> management, and I really miss that machine I had until three years ago :-)
>>>
>>>>
>>>>> I need Usama's assistance to identify a suitable patch, as I lack
>>>>> access to hardware such as AMD machines and ARM servers with TLB
>>>>> optimization.
>>>>>
>>>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
>>>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
>>>>>
>>>>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
>>>>> Cc: Usama Arif <usamaarif642@...il.com>
>>>>> Cc: David Hildenbrand <david@...hat.com>
>>>>> Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
>>>>> Cc: Chris Li <chrisl@...nel.org>
>>>>> Cc: Yosry Ahmed <yosryahmed@...gle.com>
>>>>> Cc: "Huang, Ying" <ying.huang@...el.com>
>>>>> Cc: Kairui Song <kasong@...cent.com>
>>>>> Cc: Ryan Roberts <ryan.roberts@....com>
>>>>> Cc: Johannes Weiner <hannes@...xchg.org>
>>>>> Cc: Michal Hocko <mhocko@...nel.org>
>>>>> Cc: Roman Gushchin <roman.gushchin@...ux.dev>
>>>>> Cc: Shakeel Butt <shakeel.butt@...ux.dev>
>>>>> Cc: Muchun Song <muchun.song@...ux.dev>
>>>>> Signed-off-by: Barry Song <v-songbaohua@...o.com>
>>>>> ---
>>>>> include/linux/memcontrol.h | 9 ++++++++
>>>>> mm/memcontrol.c | 45 ++++++++++++++++++++++++++++++++++++++
>>>>> mm/memory.c | 17 ++++++++++++++
>>>>> 3 files changed, 71 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>>>> index 524006313b0d..8bcc8f4af39f 100644
>>>>> --- a/include/linux/memcontrol.h
>>>>> +++ b/include/linux/memcontrol.h
>>>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>>>>> int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>>> long nr_pages);
>>>>>
>>>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>>>> + swp_entry_t *entry);
>>>>> +
>>>>> int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>>>>> gfp_t gfp, swp_entry_t entry);
>>>>>
>>>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
>>>>> return 0;
>>>>> }
>>>>>
>>>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>>>> + swp_entry_t *entry)
>>>>> +{
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>>>>> struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>>>>> {
>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>> index 17af08367c68..f3d92b93ea6d 100644
>>>>> --- a/mm/memcontrol.c
>>>>> +++ b/mm/memcontrol.c
>>>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>>> return 0;
>>>>> }
>>>>>
>>>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
>>>>> +{
>>>>> + for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>>>> + if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
>>>>
>>>> There might be three issues with this approach:
>>>>
>>>> It's a very big margin. Let's say you have ARM64_64K_PAGES and have
>>>> 256K THP set to always. As HPAGE_PMD_SIZE is 512M with 64K pages, you
>>>> are basically saying you need 512M of free memory to swap in just 256K?
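>>>>
>>>> (Working through the arithmetic: with 64K base pages, a PMD maps
>>>> 64K/8 = 8192 entries * 64K = 512M, so HPAGE_PMD_NR == 8192; the check
>>>> would demand an 8192-page margin to swap in a 4-page, i.e. 256K, folio.)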
>>>
>>> Right, sorry for the noisy code. I was just thinking about 4KB pages
>>> and wondering if we could simplify the code.
>>>
>>>>
>>>> It's an uneven margin for different folio sizes.
>>>> For a 16K folio swap-in, you are checking if there is margin for 128
>>>> folios, but for a 1M folio swap-in, you are checking if there is
>>>> margin for just 2 folios.
>>>>
>>>> Maybe it would be better to make this dependent on some factor of folio_nr_pages?
>>>
>>> Agreed. This is similar to what we discussed regarding your zswap
>>> mTHP swap-in series:
>>>
>>> int mem_cgroup_swapin_charge_folio(...)
>>> {
>>>         ...
>>>         if (folio_test_large(folio) &&
>>>             mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
>>>                                            folio_nr_pages(folio)))
>>>                 ret = -ENOMEM;
>>>         else
>>>                 ret = charge_memcg(folio, memcg, gfp);
>>>         ...
>>> }
>>>
>>> As someone focused on phones, my challenge is the absence of stable
>>> platforms to benchmark this type of workload. If possible, Usama, I
>>> would greatly appreciate it if you could take the lead on the patch.
>>>
>>>>
>>>> As Johannes pointed out, the charging code already does the margin
>>>> check. So for 4K, it just checks whether 4K is available, but for 16K
>>>> it checks whether a lot more than 16K is available. Maybe there should
>>>> be a similar policy for all? I guess this is similar to my 2nd point,
>>>> but it considers 4K folios as well.
>>>
>>> I don't think the charging code performs a margin check. It simply
>>> tries to charge the specified nr_pages (whether 1 or more). If
>>> nr_pages are available, the charge proceeds; otherwise, if the GFP
>>> flags allow blocking, it triggers memory reclamation to reclaim
>>> max(SWAP_CLUSTER_MAX, nr_pages) base pages.
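>>>
>>> Something like this (a rough, simplified sketch of the
>>> try_charge_memcg() flow as I read it; retries and error paths omitted):
>>>
>>>         if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
>>>                 return 0;       /* nr_pages fit; no extra headroom is checked */
>>>
>>>         if (!gfpflags_allow_blocking(gfp_mask))
>>>                 return -ENOMEM; /* cannot block, so no reclamation */
>>>
>>>         /* reclaim targets max(SWAP_CLUSTER_MAX, nr_pages) base pages */
>>>         try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
>>>                                      gfp_mask, reclaim_options);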
>>>
>>
>> So if you have defrag not set to always, it will not trigger reclamation.
>> I think that is the bigger use case, i.e. defrag=madvise,defer,etc. is
>> probably used much more than always.
>>
>> In the current code, in that case, try_charge_memcg will return -ENOMEM
>> all the way up to mem_cgroup_swapin_charge_folio, and alloc_swap_folio
>> will then try the next order. So even though it might not be calling the
>> mem_cgroup_margin function, it is kind of doing the same thing?
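>>
>> Roughly (a simplified sketch of the fallback loop in alloc_swap_folio(),
>> from my reading of mm/memory.c):
>>
>>         /* try orders from highest to lowest; a failed charge falls through */
>>         order = highest_order(orders);
>>         while (orders) {
>>                 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>                 folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>                 if (folio) {
>>                         if (!mem_cgroup_swapin_charge_folio(folio,
>>                                         vma->vm_mm, gfp, entry))
>>                                 return folio;   /* charge fit within the limit */
>>                         folio_put(folio);       /* -ENOMEM: drop, try a smaller order */
>>                 }
>>                 order = next_order(&orders, order);
>>         }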
>>
>>> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available
>>> (with 4KiB pages and SWAP_CLUSTER_MAX == 32, that is only 128KiB of
>>> headroom), a large folio with nr_pages == SWAP_CLUSTER_MAX will
>>> successfully charge, immediately filling the memcg.
>>>
>>> Shortly after, smaller folios, typically with blockable GFP, will
>>> quickly trigger additional reclamation. While nr_pages - 1 subpages of
>>> the large folio may not be immediately needed, they still occupy enough
>>> space to fill the memcg to capacity.
>>>
>>> My second point about the mitigation is as follows: For a system (or
>>> memcg) under severe memory pressure, especially one without hardware TLB
>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>> of optimization. Could the mitigation code help in automatically tuning
>>> this fragmentation?
>>>
>>
>> I agree that enabling mTHP always is not the right thing to do on all
>> platforms. It might also be the case that enabling mTHP is a good thing
>> for some workloads, but enabling mTHP swapin along with it is not.
>>
>> As you said, when you have apps switching between foreground and
>> background in Android, it probably makes sense to have large folio
>> swapin, as you want to bring in all the pages from the background app as
>> quickly as possible, and you also get all the TLB optimizations and the
>> smaller LRU overhead once you have brought in all the pages.
>> The Linux kernel build test doesn't really benefit from the TLB
>> optimization and smaller LRU overhead, as the pages are probably very
>> short-lived. So I think it doesn't show the benefit of large folio
>> swapin properly, and large folio swapin should probably be disabled for
>> this kind of workload, even though mTHP should be enabled.
>
> I'm not entirely sure if this applies to platforms without TLB
> optimization, especially in the absence of swap. In a memory-limited
> cgroup without swap, would mTHP still cause significant thrashing of
> file-backed folios? When a large swap file is present, the inability to
> swap in mTHP seems to act as a workaround for fragmentation, allowing
> fragmented pages of the original mTHP from do_anonymous_page() to
> remain in swap.
>
>>
>> I am not sure that the approach we are trying in this patch is the right way:
>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> then the mitigation being tried here won't apply.
>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> issue? If we use zswap (without the large folio swapin series) and change the
>> window to 1 in swap_vma_readahead (see the sketch after this list), we might
>> see an improvement in Linux kernel build time when cgroup memory is limited,
>> as readahead would probably cause swap thrashing as well.
>> - Instead of looking at the cgroup margin, maybe we should try to look at
>> the rate of change of workingset_restore_anon? This might be a lot more
>> complicated to do, but it is probably the right metric to determine swap
>> thrashing. It also means that this could be used in both the synchronous
>> swapcache skipping path and the swapin_readahead path.
>> (Thanks Johannes for suggesting this)
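>>
>> For the readahead experiment, a hypothetical, untested tweak in
>> swap_vma_readahead() (mm/swap_state.c) could look like:
>>
>>         swap_ra_info(vmf, &ra_info);
>>         ra_info.win = 1;        /* force the window to 1: no readahead */
>>         if (ra_info.win == 1)
>>                 goto skip;      /* fault in only the single page */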
>>
>> With large folio swapin, I do see the large improvement when considering
>> only swapin performance and latency, in the same way as you saw with zram.
>> Maybe the right short-term approach is to have
>> /sys/kernel/mm/transparent_hugepage/swapin
>> and have that disabled by default to avoid regressions.
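>>
>> i.e. something like this (a hypothetical interface mirroring the
>> existing THP sysfs knobs; not in any tree):
>>
>> # disabled by default, so current behaviour is unchanged
>> $ cat /sys/kernel/mm/transparent_hugepage/swapin
>> always [never]
>> # workloads that benefit can opt in
>> $ echo always > /sys/kernel/mm/transparent_hugepage/swapin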
>
> A crucial component is still missing: managing the compression and
> decompression of multiple pages as a larger block. This could
> significantly reduce system time and potentially resolve the kernel
> build issue within a small memory cgroup, even with swap thrashing.
>
> I’ll send an update ASAP so you can rebase for zswap.
Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
That won't benefit zswap, right?
I actually had a few questions about it. Mainly that the benefit comes if the
page fault happens on page 0 of the large folio. But if the page fault happens
on any other page, let's say page 1 of a 64K folio, then it will decompress the
entire 64K chunk and just copy page 1? (The memcpy in
zram_bvec_read_multi_pages_partial.) Could that cause a regression, as you have
to decompress a large chunk just to get one 4K page?
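
If I'm reading zram_bvec_read_multi_pages_partial correctly, roughly
(pseudocode, simplified):

        /* fault on page 1 of a 64K folio; only one 4K page is wanted */
        decompress(src, buf, SZ_64K);   /* whole block every time */
        memcpy(dst, buf + (idx << PAGE_SHIFT), PAGE_SIZE);
        /* the other 15 pages' worth of decompression work is discarded */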
If we assume a uniform distribution of page faults, maybe it could make things worse?
I probably should ask all of this in that thread.
>
>> If the workload owner sees a benefit, they can enable it.
>> I can add this when sending the next version of large folio zswapin, if
>> that makes sense?
>> Longer term, I can have a look at whether we can do something with
>> workingset_restore_anon to improve things.
>>
>> Thanks,
>> Usama
>
> Thanks
> Barry