Date: Wed, 19 Jun 2024 13:34:15 +0800
From: Ge Yang <yangge1116@....com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, baolin.wang@...ux.alibaba.com,
 liuzixing@...on.cn
Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating
 non-CMA THP-sized page



On 2024/6/18 15:51, yangge1116 wrote:
> 
> 
> On 2024/6/18 2:58 PM, Barry Song wrote:
>> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@....com> wrote:
>>>
>>>
>>>
>>> On 2024/6/18 12:10 PM, Barry Song wrote:
>>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@....com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@....com> 
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@....com> wrote:
>>>>>>>>>>
>>>>>>>>>> From: yangge <yangge1116@....com>
>>>>>>>>>>
>>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>>>>>>>>>> list for THP-sized allocations") no longer differentiates the
>>>>>>>>>> migration type of pages in the THP-sized PCP list, it's
>>>>>>>>>> possible to get a CMA page from the list. In some cases this is
>>>>>>>>>> not acceptable; for example, allocating a non-CMA page with the
>>>>>>>>>> PF_MEMALLOC_PIN flag set can return a CMA page.
>>>>>>>>>>
>>>>>>>>>> This patch forbids allocating a non-CMA THP-sized page from the
>>>>>>>>>> THP-sized PCP list to avoid the issue above.
>>>>>>>>>
>>>>>>>>> Could you please describe the impact on users in the commit log?
>>>>>>>>
>>>>>>>> If a large amount of CMA memory is configured in the system (for
>>>>>>>> example, CMA memory accounts for 50% of system memory), starting a
>>>>>>>> virtual machine with device passthrough will get stuck.
>>>>>>>>
>>>>>>>> While starting the virtual machine, pin_user_pages_remote(...,
>>>>>>>> FOLL_LONGTERM, ...) is called to pin memory. If a page is in a CMA
>>>>>>>> area, pin_user_pages_remote() will migrate it from the CMA area to
>>>>>>>> a non-CMA area because of the FOLL_LONGTERM flag. If the non-movable
>>>>>>>> allocation request for the migration target returns CMA memory,
>>>>>>>> pin_user_pages_remote() will enter an endless loop.
>>>>>>>>
>>>>>>>> backtrace:
>>>>>>>> pin_user_pages_remote
>>>>>>>> ----__gup_longterm_locked // endless loop happens in this function
>>>>>>>> --------__get_user_pages_locked
>>>>>>>> --------check_and_migrate_movable_pages // check always fails, so migration is retried
>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>> ----------------alloc_migration_target // non-movable allocation
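To make the livelock easier to see, below is a minimal user-space
simulation (illustrative names only, not the kernel functions
themselves). As long as the allocation backing the migration target can
return a page from the same CMA area, the movability check fails again
on the next pass:

#include <stdbool.h>
#include <stdio.h>

struct page { bool in_cma; };

/* Stand-in for the allocator behind alloc_migration_target(): the bug
 * is that it can hand back a CMA page from the shared THP PCP list. */
static struct page alloc_migration_target_sim(void)
{
	struct page p = { .in_cma = true };	/* buggy case: CMA again */
	return p;
}

int main(void)
{
	struct page p = { .in_cma = true };	/* longterm pin hit a CMA page */
	int passes = 0;

	/* Models the retry loop in __gup_longterm_locked(): keep migrating
	 * until check_and_migrate_movable_pages() is satisfied. */
	while (p.in_cma) {
		p = alloc_migration_target_sim();	/* CMA -> CMA: useless */
		if (++passes == 5) {
			printf("still looping after %d passes\n", passes);
			break;	/* the kernel has no such bail-out */
		}
	}
	return 0;
}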
>>>>>>>>
>>>>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>>>>> allocation requests?
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>
>>>>>>>>> If so, will CMA somehow become unable to migrate, causing 
>>>>>>>>> cma_alloc()
>>>>>>>>> to fail?
>>>>>>>>
>>>>>>>>
>>>>>>>> No, it will cause an endless loop in __gup_longterm_locked(). If
>>>>>>>> a non-movable allocation request returns CMA memory,
>>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to
>>>>>>>> another CMA page, which is useless and causes the endless loop in
>>>>>>>> __gup_longterm_locked().
>>>>>>
>>>>>> This is only one perspective. We also need to consider the impact
>>>>>> on CMA itself. For example, when CMA is borrowed by THP and we need
>>>>>> to reclaim it through cma_alloc() or dma_alloc_coherent(), we must
>>>>>> move those pages out to ensure CMA's users can retrieve that
>>>>>> contiguous memory.
>>>>>>
>>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning
>>>>>> we can't relocate them. As a result, cma_alloc() is more likely to
>>>>>> fail.
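To illustrate that point, here is a rough user-space model (illustrative
only, not the kernel's cma_alloc()): a single unmovable page inside the
requested range is enough to block a contiguous CMA allocation at every
offset that covers it.

#include <stdbool.h>
#include <stdio.h>

#define CMA_PAGES 8

/* One non-movable page (index 3) has leaked into the CMA area. */
static bool movable[CMA_PAGES] = {
	true, true, true, false, true, true, true, true
};

/* Can a contiguous run of 'count' pages starting at 'start' be freed? */
static bool try_alloc_range(int start, int count)
{
	for (int i = start; i < start + count; i++)
		if (!movable[i])
			return false;	/* cannot migrate: range unusable */
	return true;			/* all pages migratable: range reclaimed */
}

int main(void)
{
	/* A request for 6 contiguous pages fails at every possible offset. */
	for (int start = 0; start + 6 <= CMA_PAGES; start++)
		printf("range [%d,%d): %s\n", start, start + 6,
		       try_alloc_range(start, 6) ? "ok" : "blocked");
	return 0;
}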
>>>>>>
>>>>>>>>
>>>>>>>> backtrace:
>>>>>>>> pin_user_pages_remote
>>>>>>>> ----__gup_longterm_locked // endless loop happens in this function
>>>>>>>> --------__get_user_pages_locked
>>>>>>>> --------check_and_migrate_movable_pages // check always fails, so migration is retried
>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>>> THP-sized allocations")
>>>>>>>>>> Signed-off-by: yangge <yangge1116@....com>
>>>>>>>>>> ---
>>>>>>>>>>      mm/page_alloc.c | 10 ++++++++++
>>>>>>>>>>      1 file changed, 10 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>>>>>>>>>             WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>>>>
>>>>>>>>>>             if (likely(pcp_allowed_order(order))) {
>>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>>>>>>> +                                               order != HPAGE_PMD_ORDER) {
>>>>>>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>> +                                               migratetype, alloc_flags);
>>>>>>>>>> +                       if (likely(page))
>>>>>>>>>> +                               goto out;
>>>>>>>>>> +               }
>>>>>>>>>
>>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use
>>>>>>>>> the PCP. But it still seems better than causing CMA allocation
>>>>>>>>> failures.
>>>>>>>>>
>>>>>>>>> Is there a possible approach that avoids adding CMA THP to the
>>>>>>>>> PCP in the first place? Otherwise, we might need a separate PCP
>>>>>>>>> list for CMA.
>>>>>>>>>
>>>>>>>
>>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, so
>>>>>>> avoiding adding CMA THP to the PCP may incur a slight performance
>>>>>>> penalty.
>>>>>>>
>>>>>>
>>>>>> But the majority of movable pages aren't CMA, right?
>>>>>>
>>>>>> Do we have an estimate for adding back a CMA THP PCP? Will
>>>>>> per_cpu_pages introduce a new cacheline, which the original
>>>>>> intention for THP was to avoid by having only one PCP[1]?
>>>>>>
>>>>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>>>>
>>>>>
>>>>> The size of struct per_cpu_pages is 256 bytes in the current code,
>>>>> which contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>>>>> list for THP-sized allocations").
>>>>> crash> struct per_cpu_pages
>>>>> struct per_cpu_pages {
>>>>>        spinlock_t lock;
>>>>>        int count;
>>>>>        int high;
>>>>>        int high_min;
>>>>>        int high_max;
>>>>>        int batch;
>>>>>        u8 flags;
>>>>>        u8 alloc_factor;
>>>>>        u8 expire;
>>>>>        short free_count;
>>>>>        struct list_head lists[13];
>>>>> }
>>>>> SIZE: 256
>>>>>
>>>>> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one
>>>>> PCP list for THP-sized allocations"), the size of struct
>>>>> per_cpu_pages is 272 bytes.
>>>>> crash> struct per_cpu_pages
>>>>> struct per_cpu_pages {
>>>>>        spinlock_t lock;
>>>>>        int count;
>>>>>        int high;
>>>>>        int high_min;
>>>>>        int high_max;
>>>>>        int batch;
>>>>>        u8 flags;
>>>>>        u8 alloc_factor;
>>>>>        u8 expire;
>>>>>        short free_count;
>>>>>        struct list_head lists[15];
>>>>> }
>>>>> SIZE: 272
>>>>>
>>>>> It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>>>>> for THP-sized allocations") saves one cacheline.
>>>>
>>>> The proposal is not to revert the patch but to add one CMA PCP list,
>>>> so it would be "struct list_head lists[14]". In this case, is the
>>>> size still 256?
>>>>
>>>
>>> Yes, the size is still 256. If we add one PCP list, we will have 2 PCP
>>> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other
>>> PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that
>>> right?
>>
>> I am not quite sure about MIGRATE_RECLAIMABLE, as CMA is only used by
>> movable allocations. So it might be:
>> MOVABLE and NON-MOVABLE.
> 
> One PCP list used by UNMOVABLE pages and the other used by MOVABLE
> pages seems feasible. The UNMOVABLE PCP list would contain
> MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE pages, and the MOVABLE PCP
> list would contain MIGRATE_MOVABLE pages.
> 

Is the following modification feasible?

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define NR_PCP_THP 2
+#define PCP_THP_MOVABLE 0
+#define PCP_THP_UNMOVABLE 1
  #else
  #define NR_PCP_THP 0
  #endif

  static inline unsigned int order_to_pindex(int migratetype, int order)
  {
+       int pcp_type = migratetype;
+
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
         if (order > PAGE_ALLOC_COSTLY_ORDER) {
                 VM_BUG_ON(order != HPAGE_PMD_ORDER);
-               return NR_LOWORDER_PCP_LISTS;
+
+               if (migratetype != MIGRATE_MOVABLE)
+                       pcp_type = PCP_THP_UNMOVABLE;
+               else
+                       pcp_type = PCP_THP_MOVABLE;
+
+               return NR_LOWORDER_PCP_LISTS + pcp_type;
         }
  #else
         VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
  #endif

-       return (MIGRATE_PCPTYPES * order) + migratetype;
+       return (MIGRATE_PCPTYPES * order) + pcp_type;
  }



@@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex)
         int order = pindex / MIGRATE_PCPTYPES;

  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       if (pindex == NR_LOWORDER_PCP_LISTS)
+       if (order > PAGE_ALLOC_COSTLY_ORDER)
                 order = HPAGE_PMD_ORDER;
  #else
         VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
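For reference, here is a standalone sketch of the resulting pindex math
only (not the kernel code itself; HPAGE_PMD_ORDER is assumed to be 9, as
on x86_64 with 4K pages). Movable THP maps to pindex 12 and all other
THP to pindex 13, while low-order pindexes are unchanged:

#include <stdio.h>

#define MIGRATE_UNMOVABLE       0
#define MIGRATE_MOVABLE         1
#define MIGRATE_RECLAIMABLE     2
#define MIGRATE_PCPTYPES        3
#define PAGE_ALLOC_COSTLY_ORDER 3
#define HPAGE_PMD_ORDER         9	/* assumed: x86_64, 4K pages */
#define NR_LOWORDER_PCP_LISTS   (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define PCP_THP_MOVABLE         0
#define PCP_THP_UNMOVABLE       1

static int order_to_pindex(int migratetype, int order)
{
	if (order > PAGE_ALLOC_COSTLY_ORDER) {
		/* THP: movable gets list 12, everything else list 13. */
		int pcp_type = (migratetype == MIGRATE_MOVABLE) ?
				PCP_THP_MOVABLE : PCP_THP_UNMOVABLE;
		return NR_LOWORDER_PCP_LISTS + pcp_type;
	}
	return MIGRATE_PCPTYPES * order + migratetype;	/* unchanged */
}

int main(void)
{
	printf("order 0 movable -> pindex %d\n",
	       order_to_pindex(MIGRATE_MOVABLE, 0));
	printf("THP movable     -> pindex %d\n",
	       order_to_pindex(MIGRATE_MOVABLE, HPAGE_PMD_ORDER));
	printf("THP unmovable   -> pindex %d\n",
	       order_to_pindex(MIGRATE_UNMOVABLE, HPAGE_PMD_ORDER));
	printf("THP reclaimable -> pindex %d\n",
	       order_to_pindex(MIGRATE_RECLAIMABLE, HPAGE_PMD_ORDER));
	return 0;
}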



>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> Commit 1d91df85f399 takes a similar filtering approach, which I
>>>>>>> mainly referred to.
>>>>>>>
>>>>>>>
>>>>>>>>>> +#else
>>>>>>>>>>                     page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>>                                            migratetype, alloc_flags);
>>>>>>>>>>                     if (likely(page))
>>>>>>>>>>                             goto out;
>>>>>>>>>> +#endif
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>             page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>>>>> -- 
>>>>>>>>>> 2.7.4
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>

