Date: Tue, 18 Jun 2024 15:51:59 +0800
From: yangge1116 <yangge1116@....com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, baolin.wang@...ux.alibaba.com,
 liuzixing@...on.cn
Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating
 non-CMA THP-sized page



On 2024/6/18 2:58 PM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@....com> wrote:
>>
>>
>>
>> On 2024/6/18 12:10 PM, Barry Song wrote:
>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@....com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@....com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@....com> wrote:
>>>>>>>>>
>>>>>>>>> From: yangge <yangge1116@....com>
>>>>>>>>>
>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>>>>>> of pages in the THP-sized PCP list, it is possible to get a CMA page
>>>>>>>>> from the list. In some cases this is not acceptable; for example,
>>>>>>>>> allocating a non-CMA page with the PF_MEMALLOC_PIN flag may return a
>>>>>>>>> CMA page.
>>>>>>>>>
>>>>>>>>> This patch forbids allocating a non-CMA THP-sized page from the
>>>>>>>>> THP-sized PCP list to avoid the issue above.
>>>>>>>>
>>>>>>>> Could you please describe the impact on users in the commit log?
>>>>>>>
>>>>>>> If a large amount of CMA memory is configured in the system (for
>>>>>>> example, CMA memory accounts for 50% of the system memory), starting a
>>>>>>> virtual machine with device passthrough will get stuck.
>>>>>>>
>>>>>>> While starting the virtual machine, it calls pin_user_pages_remote(...,
>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in a CMA area,
>>>>>>> pin_user_pages_remote() will migrate it from the CMA area to a non-CMA
>>>>>>> area because of the FOLL_LONGTERM flag. If a non-movable allocation
>>>>>>> request returns CMA memory, pin_user_pages_remote() enters an endless loop.
>>>>>>>
>>>>>>> backtrace:
>>>>>>> pin_user_pages_remote
>>>>>>> ----__gup_longterm_locked //causes an endless loop in this function
>>>>>>> --------__get_user_pages_locked
>>>>>>> --------check_and_migrate_movable_pages //always fails the check and continues to migrate
>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>>>
>>>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>>>> allocation requests?
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>
>>>>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
>>>>>>>> to fail?
>>>>>>>
>>>>>>>
>>>>>>> No, it will cause an endless loop in __gup_longterm_locked(). If a
>>>>>>> non-movable allocation request returns CMA memory,
>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
>>>>>>> CMA page, which is useless and causes the endless loop in
>>>>>>> __gup_longterm_locked().
>>>>>
>>>>> This is only one perspective. We also need to consider the impact on
>>>>> CMA itself. For example, when CMA is borrowed by THP and we need to
>>>>> reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
>>>>> those pages out to ensure CMA's users can retrieve that contiguous
>>>>> memory.
>>>>>
>>>>> If CMA's memory is occupied by non-movable pages, we can't relocate
>>>>> them. As a result, cma_alloc() is more likely to fail.
>>>>>
>>>>>>>
>>>>>>> backtrace:
>>>>>>> pin_user_pages_remote
>>>>>>> ----__gup_longterm_locked //causes an endless loop in this function
>>>>>>> --------__get_user_pages_locked
>>>>>>> --------check_and_migrate_movable_pages //always fails the check and continues to migrate
>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
>>>>>>>>> Signed-off-by: yangge <yangge1116@....com>
>>>>>>>>> ---
>>>>>>>>>      mm/page_alloc.c | 10 ++++++++++
>>>>>>>>>      1 file changed, 10 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>>>>>>>>             WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>>>
>>>>>>>>>             if (likely(pcp_allowed_order(order))) {
>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>>>>>> +                                               order != HPAGE_PMD_ORDER) {
>>>>>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>> +                                               migratetype, alloc_flags);
>>>>>>>>> +                       if (likely(page))
>>>>>>>>> +                               goto out;
>>>>>>>>> +               }
>>>>>>>>
>>>>>>>> This doesn't seem ideal, because non-CMA THP gets no chance to use the
>>>>>>>> PCP. But it still seems better than causing CMA allocation failures.
>>>>>>>>
>>>>>>>> Is there a possible approach to avoid adding CMA THP to the PCP in the
>>>>>>>> first place? Otherwise, we might need a separate PCP list for CMA.
>>>>>>>>
>>>>>>
>>>>>> The vast majority of THP-sized allocations are __GFP_MOVABLE, so avoiding
>>>>>> adding CMA THP to the PCP may incur a slight performance penalty.
>>>>>>
>>>>>
>>>>> But the majority of movable pages aren't CMA, right?
>>>>
>>>>> Do we have an estimate of the cost of adding back a CMA THP PCP list? Will
>>>>> per_cpu_pages grow by a new cacheline, which the original intention of
>>>>> having only one PCP list for THP was to avoid [1]?
>>>>>
>>>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>>>
>>>>
>>>> The size of struct per_cpu_pages is 256 bytes in the current code, which
>>>> contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>> THP-sized allocations"):
>>>> crash> struct per_cpu_pages
>>>> struct per_cpu_pages {
>>>>        spinlock_t lock;
>>>>        int count;
>>>>        int high;
>>>>        int high_min;
>>>>        int high_max;
>>>>        int batch;
>>>>        u8 flags;
>>>>        u8 alloc_factor;
>>>>        u8 expire;
>>>>        short free_count;
>>>>        struct list_head lists[13];
>>>> }
>>>> SIZE: 256
>>>>
>>>> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>>>> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes:
>>>> crash> struct per_cpu_pages
>>>> struct per_cpu_pages {
>>>>        spinlock_t lock;
>>>>        int count;
>>>>        int high;
>>>>        int high_min;
>>>>        int high_max;
>>>>        int batch;
>>>>        u8 flags;
>>>>        u8 alloc_factor;
>>>>        u8 expire;
>>>>        short free_count;
>>>>        struct list_head lists[15];
>>>> }
>>>> SIZE: 272
>>>>
>>>> It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>> THP-sized allocations") saves one cacheline.
>>>
>>> The proposal is not to revert the patch but to add one CMA PCP list,
>>> so it would be "struct list_head lists[14]"; in this case, is the size still
>>> 256?
>>>
>>
>> Yes, the size is still 256. If we add one PCP list, we will have 2 PCP
>> lists for THP: one used by MIGRATE_UNMOVABLE, and the other used by
>> MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?
> 
> I am not quite sure about MIGRATE_RECLAIMABLE, as we want
> CMA to be used only by movable allocations.
> So it might be:
> MOVABLE and NON-MOVABLE.

Having one PCP list for UNMOVABLE pages and the other for MOVABLE pages 
seems feasible. The UNMOVABLE PCP list would contain MIGRATE_UNMOVABLE and 
MIGRATE_RECLAIMABLE pages, and the MOVABLE PCP list would contain 
MIGRATE_MOVABLE pages.

> 
>>
>>>
>>>>
>>>>>
>>>>>> Commit 1d91df85f399 takes a similar filtering approach, and I mainly
>>>>>> referred to it.
>>>>>>
>>>>>>
>>>>>>>>> +#else
>>>>>>>>>                     page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>                                            migratetype, alloc_flags);
>>>>>>>>>                     if (likely(page))
>>>>>>>>>                             goto out;
>>>>>>>>> +#endif
>>>>>>>>>             }
>>>>>>>>>
>>>>>>>>>             page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>>>> --
>>>>>>>>> 2.7.4
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Barry
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>

