linux-kernel - Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMgjq7AeVkmOiVFa4-iP4nbyX3LHft_5wfvV_raj_N+twdzMKQ@mail.gmail.com>
Date: Sat, 17 Aug 2024 00:50:00 +0800
From: Kairui Song <ryncsn@...il.com>
To: Barry Song <21cnbao@...il.com>
Cc: wangkefeng.wang@...wei.com, akpm@...ux-foundation.org, 
	baolin.wang@...ux.alibaba.com, chrisl@...nel.org, david@...hat.com, 
	hanchuanhua@...o.com, hannes@...xchg.org, hch@...radead.org, hughd@...gle.com, 
	kaleshsingh@...gle.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	mhocko@...e.com, minchan@...nel.org, nphamcs@...il.com, ryan.roberts@....com, 
	senozhatsky@...omium.org, shakeel.butt@...ux.dev, shy828301@...il.com, 
	surenb@...gle.com, v-songbaohua@...o.com, willy@...radead.org, 
	xiang@...nel.org, ying.huang@...el.com, yosryahmed@...gle.com
Subject: Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices

On Fri, Aug 16, 2024 at 7:06 AM Barry Song <21cnbao@...il.com> wrote:
>
> On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang <wangkefeng.wang@...wei.com> wrote:
> >
> >
> >
> > On 2024/8/15 17:47, Kairui Song wrote:
> > > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@...il.com> wrote:
> > >>
> > >> From: Chuanhua Han <hanchuanhua@...o.com>
> > >
> > > Hi Chuanhua,
> > >
> > >>
> > ...
> >
> > >> +
> > >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >> +{
> > >> +       struct vm_area_struct *vma = vmf->vma;
> > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >> +       unsigned long orders;
> > >> +       struct folio *folio;
> > >> +       unsigned long addr;
> > >> +       swp_entry_t entry;
> > >> +       spinlock_t *ptl;
> > >> +       pte_t *pte;
> > >> +       gfp_t gfp;
> > >> +       int order;
> > >> +
> > >> +       /*
> > >> +        * If uffd is active for the vma we need per-page fault fidelity to
> > >> +        * maintain the uffd semantics.
> > >> +        */
> > >> +       if (unlikely(userfaultfd_armed(vma)))
> > >> +               goto fallback;
> > >> +
> > >> +       /*
> > >> +        * A large swapped out folio could be partially or fully in zswap. We
> > >> +        * lack handling for such cases, so fallback to swapping in order-0
> > >> +        * folio.
> > >> +        */
> > >> +       if (!zswap_never_enabled())
> > >> +               goto fallback;
> > >> +
> > >> +       entry = pte_to_swp_entry(vmf->orig_pte);
> > >> +       /*
> > >> +        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > >> +        * and suitable for swapping THP.
> > >> +        */
> > >> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > >> +                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> > >> +       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > >> +       orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
> > >> +
> > >> +       if (!orders)
> > >> +               goto fallback;
> > >> +
> > >> +       pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
> > >> +       if (unlikely(!pte))
> > >> +               goto fallback;
> > >> +
> > >> +       /*
> > >> +        * For do_swap_page, find the highest order where the aligned range is
> > >> +        * completely swap entries with contiguous swap offsets.
> > >> +        */
> > >> +       order = highest_order(orders);
> > >> +       while (orders) {
> > >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> +               if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> > >> +                       break;
> > >> +               order = next_order(&orders, order);
> > >> +       }
> > >> +
> > >> +       pte_unmap_unlock(pte, ptl);
> > >> +
> > >> +       /* Try allocating the highest of the remaining orders. */
> > >> +       gfp = vma_thp_gfp_mask(vma);
> > >> +       while (orders) {
> > >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > >> +               folio = vma_alloc_folio(gfp, order, vma, addr, true);
> > >> +               if (folio)
> > >> +                       return folio;
> > >> +               order = next_order(&orders, order);
> > >> +       }
> > >> +
> > >> +fallback:
> > >> +#endif
> > >> +       return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> > >> +}
> > >> +
> > >> +
> > >>   /*
> > >>    * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > >>    * but allow concurrent faults), and pte mapped but not yet locked.
> > >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >>          if (!folio) {
> > >>                  if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > >>                      __swap_count(entry) == 1) {
> > >> -                       /*
> > >> -                        * Prevent parallel swapin from proceeding with
> > >> -                        * the cache flag. Otherwise, another thread may
> > >> -                        * finish swapin first, free the entry, and swapout
> > >> -                        * reusing the same entry. It's undetectable as
> > >> -                        * pte_same() returns true due to entry reuse.
> > >> -                        */
> > >> -                       if (swapcache_prepare(entry, 1)) {
> > >> -                               /* Relax a bit to prevent rapid repeated page faults */
> > >> -                               schedule_timeout_uninterruptible(1);
> > >> -                               goto out;
> > >> -                       }
> > >> -                       need_clear_cache = true;
> > >> -
> > >>                          /* skip swapcache */
> > >> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > >> -                                               vma, vmf->address, false);
> > >> +                       folio = alloc_swap_folio(vmf);
> > >>                          page = &folio->page;
> > >>                          if (folio) {
> > >>                                  __folio_set_locked(folio);
> > >>                                  __folio_set_swapbacked(folio);
> > >>
> > >> +                               nr_pages = folio_nr_pages(folio);
> > >> +                               if (folio_test_large(folio))
> > >> +                                       entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > >> +                               /*
> > >> +                                * Prevent parallel swapin from proceeding with
> > >> +                                * the cache flag. Otherwise, another thread may
> > >> +                                * finish swapin first, free the entry, and swapout
> > >> +                                * reusing the same entry. It's undetectable as
> > >> +                                * pte_same() returns true due to entry reuse.
> > >> +                                */
> > >> +                               if (swapcache_prepare(entry, nr_pages)) {
> > >> +                                       /* Relax a bit to prevent rapid repeated page faults */
> > >> +                                       schedule_timeout_uninterruptible(1);
> > >> +                                       goto out_page;
> > >> +                               }
> > >> +                               need_clear_cache = true;
> > >> +
> > >>                                  if (mem_cgroup_swapin_charge_folio(folio,
> > >>                                                          vma->vm_mm, GFP_KERNEL,
> > >>                                                          entry)) {
> > >>                                          ret = VM_FAULT_OOM;
> > >>                                          goto out_page;
> > >>                                  }
> > >
> > > After your patch, with build kernel test, I'm seeing kernel log
> > > spamming like this:
> > > [  101.048594] pagefault_out_of_memory: 95 callbacks suppressed
> > > [  101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > [  101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
> > > ............
> > >
> > > And heavy performance loss with workloads limited by memcg, mTHP enabled.
> > >
> > > After some debugging, the problematic part is the
> > > mem_cgroup_swapin_charge_folio call above.
> > > When under pressure, cgroup charge fails easily for mTHP. One 64k
> > > swapin will require a much more aggressive reclaim to success.
> > >
> > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is
> > > gone and mTHP swapin should have a much higher swapin success rate.
> > > But this might not be the right way.
> > >
> > > For this particular issue, maybe you can change the charge order, try
> > > charging first, if successful, use mTHP. if failed, fallback to 4k?
> >
> > This is what we did in alloc_anon_folio(), see 085ff35e7636
> > ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"),
> > 1) fallback earlier
> > 2) using same GFP flags for allocation and charge
> >
> > but it seems that there is a little complicated for swapin charge
>
> Kefeng, thanks! I guess we can continue using the same approach and
> it's not too complicated.
>
> Kairui, sorry for the trouble and thanks for the report! could you
> check if the solution below resolves the issue? On phones, we don't
> encounter the scenarios you’re facing.
>
> From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001
> From: Barry Song <v-songbaohua@...o.com>
> Date: Fri, 16 Aug 2024 10:51:48 +1200
> Subject: [PATCH] mm: fallback to next_order if charing mTHP fails
>
> When memcg approaches its limit, charging mTHP becomes difficult.
> At this point, when the charge fails, we fallback to the next order
> to avoid repeatedly retrying larger orders.
>
> Reported-by: Kairui Song <ryncsn@...il.com>
> Signed-off-by: Barry Song <v-songbaohua@...o.com>
> ---
>  mm/memory.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ed3603aaf31..6cba28ef91e7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>         while (orders) {
>                 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>                 folio = vma_alloc_folio(gfp, order, vma, addr, true);
> -               if (folio)
> -                       return folio;
> +               if (folio) {
> +                       if (!mem_cgroup_swapin_charge_folio(folio,
> +                                       vma->vm_mm, gfp, entry))
> +                               return folio;
> +                       folio_put(folio);
> +               }
>                 order = next_order(&orders, order);
>         }
>
> @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                 }
>                                 need_clear_cache = true;
>
> -                               if (mem_cgroup_swapin_charge_folio(folio,
> +                               if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
>                                                         vma->vm_mm, GFP_KERNEL,
>                                                         entry)) {
>                                         ret = VM_FAULT_OOM;
> --
> 2.34.1
>

Hi Barry

After the fix the spamming log is gone, thanks for the fix.

>
> Thanks
> Barry
>