Message-ID: <CAGsJ_4yHCpY7MX+9aBwkHvP1e93jzGrCzskJk1Xh1zuFw_d-bg@mail.gmail.com>
Date: Thu, 24 Oct 2024 09:36:38 +1300
From: Barry Song <21cnbao@...il.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Yosry Ahmed <yosryahmed@...gle.com>, senozhatsky@...omium.org, minchan@...nel.org,
hanchuanhua@...o.com, v-songbaohua@...o.com, akpm@...ux-foundation.org,
linux-mm@...ck.org, hannes@...xchg.org, david@...hat.com, willy@...radead.org,
kanchana.p.sridhar@...el.com, nphamcs@...il.com, chengming.zhou@...ux.dev,
ryan.roberts@....com, ying.huang@...el.com, riel@...riel.com,
shakeel.butt@...ux.dev, kernel-team@...a.com, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org
Subject: Re: [RFC 0/4] mm: zswap: add support for zswapin of large folios
On Thu, Oct 24, 2024 at 8:47 AM Usama Arif <usamaarif642@...il.com> wrote:
>
>
>
> On 23/10/2024 19:52, Barry Song wrote:
> > On Thu, Oct 24, 2024 at 7:31 AM Usama Arif <usamaarif642@...il.com> wrote:
> >>
> >>
> >>
> >> On 23/10/2024 19:02, Yosry Ahmed wrote:
> >>> [..]
> >>>>>> I suspect the regression occurs because you're running an edge case
> >>>>>> where the memory cgroup stays nearly full most of the time (this isn't
> >>>>>> an inherent issue with large folio swap-in). As a result, swapping in
> >>>>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The
> >>>>>> next swap-in then recreates the overflow, leading to a repeating
> >>>>>> cycle.
> >>>>>>
> >>>>>
> >>>>> Yes, agreed! Looking at the swap counters, I think this is what is going
> >>>>> on as well.
> >>>>>
> >>>>>> We need a way to stop the cup from repeatedly filling to the brim and
> >>>>>> overflowing. While not a definitive fix, the following change might help
> >>>>>> improve the situation:
> >>>>>>
> >>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>>>> index 17af08367c68..f2fa0eeb2d9a 100644
> >>>>>> --- a/mm/memcontrol.c
> >>>>>> +++ b/mm/memcontrol.c
> >>>>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >>>>>>          memcg = get_mem_cgroup_from_mm(mm);
> >>>>>>          rcu_read_unlock();
> >>>>>>
> >>>>>> -        ret = charge_memcg(folio, memcg, gfp);
> >>>>>> +        if (folio_test_large(folio) && mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH)
> >>>>>> +                ret = -ENOMEM;
> >>>>>> +        else
> >>>>>> +                ret = charge_memcg(folio, memcg, gfp);
> >>>>>>
> >>>>>>          css_put(&memcg->css);
> >>>>>>          return ret;
> >>>>>>  }
> >>>>>>
> >>>>>
> >>>>> The diff makes sense to me. Let me test later today and get back to you.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>>> Please confirm whether it makes the kernel build with the memcg limitation
> >>>>>> faster. If so, let's work together to figure out an official patch :-)
> >>>>>> The above code doesn't consider the parent memcg's overflow, so it's not
> >>>>>> an ideal fix.
> >>>>>>
> >>>>
> >>>> Thanks Barry, I think this fixes the regression, and even gives an improvement!
> >>>> I think the below might be better to do:
> >>>>
> >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>> index c098fd7f5c5e..0a1ec55cc079 100644
> >>>> --- a/mm/memcontrol.c
> >>>> +++ b/mm/memcontrol.c
> >>>> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >>>> memcg = get_mem_cgroup_from_mm(mm);
> >>>> rcu_read_unlock();
> >>>>
> >>>> - ret = charge_memcg(folio, memcg, gfp);
> >>>> + if (folio_test_large(folio) &&
> >>>> + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> >>>> + ret = -ENOMEM;
> >>>> + else
> >>>> + ret = charge_memcg(folio, memcg, gfp);
> >>>>
> >>>> css_put(&memcg->css);
> >>>> return ret;
> >>>>
> >>>>
> >>>> AMD 16K+32K THP=always
> >>>> kernels: (1) mm-unstable
> >>>>          (2) mm-unstable + large folio zswapin series
> >>>>          (3) mm-unstable + large folio zswapin + no swap thrashing fix
> >>>> metric             (1)           (2)           (3)
> >>>> real         1m23.038s     1m23.050s     1m22.704s
> >>>> user        53m57.210s    53m53.437s    53m52.577s
> >>>> sys          7m24.592s     7m48.843s     7m22.519s
> >>>> zswpin          612070        999244        815934
> >>>> zswpout        2226403       2347979       2054980
> >>>> pgfault       20667366      20481728      20478690
> >>>> pgmajfault      385887        269117        309702
> >>>>
> >>>> AMD 16K+32K+64K THP=always
> >>>> kernels: (1) mm-unstable
> >>>>          (2) mm-unstable + large folio zswapin series
> >>>>          (3) mm-unstable + large folio zswapin + no swap thrashing fix
> >>>> metric             (1)           (2)           (3)
> >>>> real         1m22.975s     1m23.266s     1m22.549s
> >>>> user        53m51.302s    53m51.069s    53m46.471s
> >>>> sys          7m40.168s     7m57.104s     7m25.012s
> >>>> zswpin          676492       1258573       1225703
> >>>> zswpout        2449839       2714767       2899178
> >>>> pgfault       17540746      17296555      17234663
> >>>> pgmajfault      429629        307495        287859
> >>>>
> >>>
> >>> Thanks Usama and Barry for looking into this. It seems like this would
> >>> fix a regression with large folio swapin regardless of zswap. Can the
> >>> same result be reproduced on zram without this series?
> >>
> >>
> >> Yes, it's a regression in large folio swapin support regardless of zswap/zram.
> >>
> >> We need to do 3 tests: one with (probably) the below diff to remove large folio
> >> support, one with current upstream, and one with upstream + the swap thrashing fix.
> >>
> >> We only use zswap and don't have a zram setup (and I am a bit lazy to create one :)).
> >> Any zram volunteers to try this?
> >
> > Hi Usama,
> >
> > I tried a quick experiment:
> >
> > echo 1 > /sys/module/zswap/parameters/enabled
> > echo 0 > /sys/module/zswap/parameters/enabled
> >
> > This was to test the zRAM scenario. Enabling zswap even
> > once disables mTHP swap-in. :)
> >
> > I noticed a similar regression with zRAM alone, but the change resolved
> > the issue and even sped up the kernel build compared to the setup without
> > mTHP swap-in.
>
> Thanks for trying, this is amazing!
> >
> > However, I’m still working on a proper patch to address this. The current
> > approach:
> >
> > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))
> >
> > isn’t sufficient, as it doesn’t cover cases where group A contains group B, and
> > we’re operating within group B. The problem occurs not at the boundary of
> > group B but at the boundary of group A.
>
> I am not sure I completely followed this. As MEMCG_CHARGE_BATCH=64, if we are
> trying to swap in a 16kB folio, we basically check if at least 64/4 = 16 such
> folios could be charged to the cgroup, which is reasonable. If we try to swap in
> a 1M folio, we just check if we can charge at least 1 folio. Are you saying that
> checking just 1 folio is not enough in this case and can still cause thrashing,
> i.e. we should check more?
My understanding is that cgroups are hierarchical. Even if we don’t hit the
memory limit of the folio’s direct memcg, we could still reach the limit of
one of its parent memcgs. Imagine a structure like:

/sys/fs/cgroup/a/b/c/d

If we’re compiling the kernel in d, there’s a chance that while d isn’t at its
limit, its parents (c, b, or a) could be. Currently, the check only applies to d.
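
Something along these lines is what I have in mind, as a rough and untested
sketch only (mem_cgroup_swapin_margin_ok() is a made-up helper name, and the
exact heuristic is still open):

/*
 * Untested sketch: walk up the hierarchy so the margin check also covers
 * ancestor memcgs, e.g. a/b/c/d where d has room but c, b or a is nearly
 * full.
 */
static bool mem_cgroup_swapin_margin_ok(struct mem_cgroup *memcg,
					unsigned int nr_pages)
{
	struct mem_cgroup *iter;

	for (iter = memcg; iter && !mem_cgroup_is_root(iter);
	     iter = parent_mem_cgroup(iter)) {
		if (mem_cgroup_margin(iter) <
		    max_t(unsigned long, MEMCG_CHARGE_BATCH, nr_pages))
			return false;
	}

	return true;
}

mem_cgroup_swapin_charge_folio() could then check
"folio_test_large(folio) && !mem_cgroup_swapin_margin_ok(memcg,
folio_nr_pages(folio))" instead of looking only at the direct memcg,
but again, this is just a sketch, not the final patch.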
>
> If we want to maintain consistency for all folios, another option is
> mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH * folio_nr_pages(folio)
> but I think this is too extreme; we would be checking if 64M can be charged to
> the cgroup just to swap in 1M.
>
> >
> > I believe there’s still room for improvement. For example, if a 64KB charge
> > attempt fails, there’s no need to waste time trying 32KB or 16KB. We can
> > directly fall back to 4KB, as 32KB and 16KB will also fail based on our
> > margin detection logic.
> >
>
> Yes that makes sense. Would something like below work to fix that:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c098fd7f5c5e..0a1ec55cc079 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> memcg = get_mem_cgroup_from_mm(mm);
> rcu_read_unlock();
>
> - ret = charge_memcg(folio, memcg, gfp);
> + if (folio_test_large(folio) &&
> + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> + ret = -ENOMEM;
> + else
> + ret = charge_memcg(folio, memcg, gfp);
>
> css_put(&memcg->css);
> return ret;
> diff --git a/mm/memory.c b/mm/memory.c
> index fecdd044bc0b..b6ce6605dc63 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4123,6 +4123,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> pte_t *pte;
> gfp_t gfp;
> int order;
> + int ret;
>
> /*
> * If uffd is active for the vma we need per-page fault fidelity to
> @@ -4170,9 +4171,13 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> folio = vma_alloc_folio(gfp, order, vma, addr, true);
> if (folio) {
> - if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> - gfp, entry))
> + ret = mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry);
> + if (!ret) {
> return folio;
> + } else if (ret == -ENOMEM) {
> + folio_put(folio);
> + goto fallback;
> + }
> folio_put(folio);
> }
> order = next_order(&orders, order);
>
Yes, does it make your kernel build even faster?
Thanks
Barry