linux-kernel - Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGsJ_4ynfvjXsr6QFBA_7Gzk3PaO1pk+6ErKZaNCt4H+nuwiJw@mail.gmail.com>
Date: Tue, 18 Jun 2024 09:55:24 +0800
From: Barry Song <21cnbao@...il.com>
To: yangge1116 <yangge1116@....com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, baolin.wang@...ux.alibaba.com, 
	liuzixing@...on.cn
Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating
 non-CMA THP-sized page

On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@....com> wrote:
>
>
>
> 在 2024/6/17 下午8:47, yangge1116 写道:
> >
> >
> > 在 2024/6/17 下午6:26, Barry Song 写道:
> >> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@....com> wrote:
> >>>
> >>> From: yangge <yangge1116@....com>
> >>>
> >>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>> THP-sized allocations") no longer differentiates the migration type
> >>> of pages in THP-sized PCP list, it's possible to get a CMA page from
> >>> the list, in some cases, it's not acceptable, for example, allocating
> >>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
> >>>
> >>> The patch forbids allocating non-CMA THP-sized page from THP-sized
> >>> PCP list to avoid the issue above.
> >>
> >> Could you please describe the impact on users in the commit log?
> >
> > If a large number of CMA memory are configured in the system (for
> > example, the CMA memory accounts for 50% of the system memory), starting
> > virtual machine with device passthrough will get stuck.
> >
> > During starting virtual machine, it will call pin_user_pages_remote(...,
> > FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
> > pin_user_pages_remote() will migrate the page from CMA area to non-CMA
> > area because of FOLL_LONGTERM flag. If non-movable allocation requests
> > return CMA memory, pin_user_pages_remote() will enter endless loops.
> >
> > backtrace:
> > pin_user_pages_remote
> > ----__gup_longterm_locked //cause endless loops in this function
> > --------__get_user_pages_locked
> > --------check_and_migrate_movable_pages //always check fail and continue
> > to migrate
> > ------------migrate_longterm_unpinnable_pages
> > ----------------alloc_migration_target // non-movable allocation
> >
> >> Is it possible that some CMA memory might be used by non-movable
> >> allocation requests?
> >
> > Yes.
> >
> >
> >> If so, will CMA somehow become unable to migrate, causing cma_alloc()
> >> to fail?
> >
> >
> > No, it will cause endless loops in __gup_longterm_locked(). If
> > non-movable allocation requests return CMA memory,
> > migrate_longterm_unpinnable_pages() will migrate a CMA page to another
> > CMA page, which is useless and cause endless loops in
> > __gup_longterm_locked().

This is only one perspective. We also need to consider the impact on
CMA itself. For example,
when CMA is borrowed by THP, and we need to reclaim it through
cma_alloc() or dma_alloc_coherent(),
we must move those pages out to ensure CMA's users can retrieve that
contiguous memory.

Currently, CMA's memory is occupied by non-movable pages, meaning we
can't relocate them.
As a result, cma_alloc() is more likely to fail.

> >
> > backtrace:
> > pin_user_pages_remote
> > ----__gup_longterm_locked //cause endless loops in this function
> > --------__get_user_pages_locked
> > --------check_and_migrate_movable_pages //always check fail and continue
> > to migrate
> > ------------migrate_longterm_unpinnable_pages
> >
> >
> >
> >
> >
> >>>
> >>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>> THP-sized allocations")
> >>> Signed-off-by: yangge <yangge1116@....com>
> >>> ---
> >>>   mm/page_alloc.c | 10 ++++++++++
> >>>   1 file changed, 10 insertions(+)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 2e22ce5..0bdf471 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
> >>> *preferred_zone,
> >>>          WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >>>
> >>>          if (likely(pcp_allowed_order(order))) {
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
> >>> ALLOC_CMA ||
> >>> +                                               order !=
> >>> HPAGE_PMD_ORDER) {
> >>> +                       page = rmqueue_pcplist(preferred_zone, zone,
> >>> order,
> >>> +                                               migratetype,
> >>> alloc_flags);
> >>> +                       if (likely(page))
> >>> +                               goto out;
> >>> +               }
> >>
> >> This seems not ideal, because non-CMA THP gets no chance to use PCP.
> >> But it
> >> still seems better than causing the failure of CMA allocation.
> >>
> >> Is there a possible approach to avoiding adding CMA THP into pcp from
> >> the first
> >> beginning? Otherwise, we might need a separate PCP for CMA.
> >>
>
> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
> adding CMA THP into pcp may incur a slight performance penalty.
>

But the majority of movable pages aren't CMA, right? Do we have an estimate for
adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which
the original intention for THP was to avoid by having only one PCP[1]?

[1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/


> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
> refer to it.
>
>
> >>> +#else
> >>>                  page = rmqueue_pcplist(preferred_zone, zone, order,
> >>>                                         migratetype, alloc_flags);
> >>>                  if (likely(page))
> >>>                          goto out;
> >>> +#endif
> >>>          }
> >>>
> >>>          page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> >>> --
> >>> 2.7.4
> >>
> >> Thanks
> >> Barry
> >>
>
>