Message-ID: <CAMgjq7AeVkmOiVFa4-iP4nbyX3LHft_5wfvV_raj_N+twdzMKQ@mail.gmail.com>
Date: Sat, 17 Aug 2024 00:50:00 +0800
From: Kairui Song <ryncsn@...il.com>
To: Barry Song <21cnbao@...il.com>
Cc: wangkefeng.wang@...wei.com, akpm@...ux-foundation.org,
baolin.wang@...ux.alibaba.com, chrisl@...nel.org, david@...hat.com,
hanchuanhua@...o.com, hannes@...xchg.org, hch@...radead.org, hughd@...gle.com,
kaleshsingh@...gle.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
mhocko@...e.com, minchan@...nel.org, nphamcs@...il.com, ryan.roberts@....com,
senozhatsky@...omium.org, shakeel.butt@...ux.dev, shy828301@...il.com,
surenb@...gle.com, v-songbaohua@...o.com, willy@...radead.org,
xiang@...nel.org, ying.huang@...el.com, yosryahmed@...gle.com
Subject: Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
On Fri, Aug 16, 2024 at 7:06 AM Barry Song <21cnbao@...il.com> wrote:
>
> On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang <wangkefeng.wang@...wei.com> wrote:
> >
> >
> >
> > On 2024/8/15 17:47, Kairui Song wrote:
> > > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@...il.com> wrote:
> > >>
> > >> From: Chuanhua Han <hanchuanhua@...o.com>
> > >
> > > Hi Chuanhua,
> > >
> > >>
> > ...
> >
> > >> +
> > >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >> +{
> > >> + struct vm_area_struct *vma = vmf->vma;
> > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >> + unsigned long orders;
> > >> + struct folio *folio;
> > >> + unsigned long addr;
> > >> + swp_entry_t entry;
> > >> + spinlock_t *ptl;
> > >> + pte_t *pte;
> > >> + gfp_t gfp;
> > >> + int order;
> > >> +
> > >> + /*
> > >> + * If uffd is active for the vma we need per-page fault fidelity to
> > >> + * maintain the uffd semantics.
> > >> + */
> > >> + if (unlikely(userfaultfd_armed(vma)))
> > >> + goto fallback;
> > >> +
> > >> + /*
> > >> + * A large swapped out folio could be partially or fully in zswap. We
> > >> + * lack handling for such cases, so fallback to swapping in order-0
> > >> + * folio.
> > >> + */
> > >> + if (!zswap_never_enabled())
> > >> + goto fallback;
> > >> +
> > >> + entry = pte_to_swp_entry(vmf->orig_pte);
> > >> + /*
> > >> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > >> + * and suitable for swapping THP.
> > >> + */
> > >> + orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > >> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> > >> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > >> + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
> > >> +
> > >> + if (!orders)
> > >> + goto fallback;
> > >> +
> > >> + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
> > >> + if (unlikely(!pte))
> > >> + goto fallback;
> > >> +
> > >> + /*
> > >> + * For do_swap_page, find the highest order where the aligned range is
> > >> + * completely swap entries with contiguous swap offsets.
> > >> + */
> > >> + order = highest_order(orders);
> > >> + while (orders) {
> > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> > >> + break;
> > >> + order = next_order(&orders, order);
> > >> + }
> > >> +
> > >> + pte_unmap_unlock(pte, ptl);
> > >> +
> > >> + /* Try allocating the highest of the remaining orders. */
> > >> + gfp = vma_thp_gfp_mask(vma);
> > >> + while (orders) {
> > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> > >> + if (folio)
> > >> + return folio;
> > >> + order = next_order(&orders, order);
> > >> + }
> > >> +
> > >> +fallback:
> > >> +#endif
> > >> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> > >> +}
> > >> +
> > >> +
> > >> /*
> > >> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > >> * but allow concurrent faults), and pte mapped but not yet locked.
> > >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >> if (!folio) {
> > >> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > >> __swap_count(entry) == 1) {
> > >> - /*
> > >> - * Prevent parallel swapin from proceeding with
> > >> - * the cache flag. Otherwise, another thread may
> > >> - * finish swapin first, free the entry, and swapout
> > >> - * reusing the same entry. It's undetectable as
> > >> - * pte_same() returns true due to entry reuse.
> > >> - */
> > >> - if (swapcache_prepare(entry, 1)) {
> > >> - /* Relax a bit to prevent rapid repeated page faults */
> > >> - schedule_timeout_uninterruptible(1);
> > >> - goto out;
> > >> - }
> > >> - need_clear_cache = true;
> > >> -
> > >> /* skip swapcache */
> > >> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > >> - vma, vmf->address, false);
> > >> + folio = alloc_swap_folio(vmf);
> > >> page = &folio->page;
> > >> if (folio) {
> > >> __folio_set_locked(folio);
> > >> __folio_set_swapbacked(folio);
> > >>
> > >> + nr_pages = folio_nr_pages(folio);
> > >> + if (folio_test_large(folio))
> > >> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > >> + /*
> > >> + * Prevent parallel swapin from proceeding with
> > >> + * the cache flag. Otherwise, another thread may
> > >> + * finish swapin first, free the entry, and swapout
> > >> + * reusing the same entry. It's undetectable as
> > >> + * pte_same() returns true due to entry reuse.
> > >> + */
> > >> + if (swapcache_prepare(entry, nr_pages)) {
> > >> + /* Relax a bit to prevent rapid repeated page faults */
> > >> + schedule_timeout_uninterruptible(1);
> > >> + goto out_page;
> > >> + }
> > >> + need_clear_cache = true;
> > >> +
> > >> if (mem_cgroup_swapin_charge_folio(folio,
> > >> vma->vm_mm, GFP_KERNEL,
> > >> entry)) {
> > >> ret = VM_FAULT_OOM;
> > >> goto out_page;
> > >> }
> > >
> > > After your patch, with a kernel build test, I'm seeing the kernel log
> > > spammed like this:
> > > [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> > > [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > ............
> > >
> > > And heavy performance loss with workloads limited by memcg, mTHP enabled.
> > >
> > > After some debugging, the problematic part is the
> > > mem_cgroup_swapin_charge_folio call above.
> > > When under pressure, the cgroup charge fails easily for mTHP. A single
> > > 64k swapin requires much more aggressive reclaim to succeed.
> > >
> > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the log spam is gone
> > > and mTHP swapin should have a much higher success rate.
> > > But this might not be the right way.
> > >
> > > For this particular issue, maybe you can change the charge order: try
> > > charging first; if it succeeds, use mTHP, and if it fails, fall back to 4k?
> >
> > This is what we did in alloc_anon_folio(), see 085ff35e7636
> > ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"),
> > 1) fallback earlier
> > 2) using same GFP flags for allocation and charge
> >
> > but it seems the swapin charge path is a little more complicated
>
> Kefeng, thanks! I guess we can continue using the same approach and
> it's not too complicated.
>
> Kairui, sorry for the trouble and thanks for the report! Could you
> check if the solution below resolves the issue? On phones, we don't
> encounter the scenarios you’re facing.
>
> From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001
> From: Barry Song <v-songbaohua@...o.com>
> Date: Fri, 16 Aug 2024 10:51:48 +1200
> Subject: [PATCH] mm: fallback to next_order if charging mTHP fails
>
> When memcg approaches its limit, charging mTHP becomes difficult.
> At this point, when the charge fails, we fall back to the next order
> to avoid repeatedly retrying larger orders.
>
> Reported-by: Kairui Song <ryncsn@...il.com>
> Signed-off-by: Barry Song <v-songbaohua@...o.com>
> ---
> mm/memory.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ed3603aaf31..6cba28ef91e7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> folio = vma_alloc_folio(gfp, order, vma, addr, true);
> - if (folio)
> - return folio;
> + if (folio) {
> + if (!mem_cgroup_swapin_charge_folio(folio,
> + vma->vm_mm, gfp, entry))
> + return folio;
> + folio_put(folio);
> + }
> order = next_order(&orders, order);
> }
>
> @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
> need_clear_cache = true;
>
> - if (mem_cgroup_swapin_charge_folio(folio,
> + if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
> vma->vm_mm, GFP_KERNEL,
> entry)) {
> ret = VM_FAULT_OOM;
> --
> 2.34.1
>
Hi Barry,

After applying the fix, the log spam is gone. Thanks for the fix.
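
For anyone skimming the thread later: the key change is moving the memcg
charge into the per-order allocation loop, so a failed mTHP charge simply
steps down to the next enabled order instead of bubbling up as
VM_FAULT_OOM. A rough sketch of the resulting pattern (paraphrasing the
diff above and the alloc_anon_folio() approach from 085ff35e7636, not the
exact upstream code; the function name here is made up for illustration):

static struct folio *alloc_swap_folio_sketch(struct vm_fault *vmf,
					     unsigned long orders,
					     swp_entry_t entry)
{
	struct vm_area_struct *vma = vmf->vma;
	gfp_t gfp = vma_thp_gfp_mask(vma);
	struct folio *folio;
	unsigned long addr;
	int order;

	/* Try the largest suitable order first, stepping down on failure. */
	order = highest_order(orders);
	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		folio = vma_alloc_folio(gfp, order, vma, addr, true);
		if (folio) {
			/*
			 * Charge immediately; if the memcg charge fails,
			 * drop this folio and retry with a smaller order
			 * rather than failing the whole swapin.
			 */
			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
							    gfp, entry))
				return folio;
			folio_put(folio);
		}
		order = next_order(&orders, order);
	}

	/* Order-0 fallback; the caller keeps its existing GFP_KERNEL charge. */
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
}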
>
> Thanks
> Barry
>