[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b1b108d5-0008-4681-97cc-253992e18c3b@126.com>
Date: Wed, 19 Jun 2024 13:34:15 +0800
From: Ge Yang <yangge1116@....com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, baolin.wang@...ux.alibaba.com,
liuzixing@...on.cn
Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating
non-CMA THP-sized page
在 2024/6/18 15:51, yangge1116 写道:
>
>
> 在 2024/6/18 下午2:58, Barry Song 写道:
>> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@....com> wrote:
>>>
>>>
>>>
>>> 在 2024/6/18 下午12:10, Barry Song 写道:
>>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@....com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> 在 2024/6/18 上午9:55, Barry Song 写道:
>>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@....com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 在 2024/6/17 下午8:47, yangge1116 写道:
>>>>>>>>
>>>>>>>>
>>>>>>>> 在 2024/6/17 下午6:26, Barry Song 写道:
>>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@....com> wrote:
>>>>>>>>>>
>>>>>>>>>> From: yangge <yangge1116@....com>
>>>>>>>>>>
>>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>>>>>>>>>> list for
>>>>>>>>>> THP-sized allocations") no longer differentiates the migration
>>>>>>>>>> type
>>>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA
>>>>>>>>>> page from
>>>>>>>>>> the list, in some cases, it's not acceptable, for example,
>>>>>>>>>> allocating
>>>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>>>>>>>
>>>>>>>>>> The patch forbids allocating non-CMA THP-sized page from
>>>>>>>>>> THP-sized
>>>>>>>>>> PCP list to avoid the issue above.
>>>>>>>>>
>>>>>>>>> Could you please describe the impact on users in the commit log?
>>>>>>>>
>>>>>>>> If a large number of CMA memory are configured in the system (for
>>>>>>>> example, the CMA memory accounts for 50% of the system memory),
>>>>>>>> starting
>>>>>>>> virtual machine with device passthrough will get stuck.
>>>>>>>>
>>>>>>>> During starting virtual machine, it will call
>>>>>>>> pin_user_pages_remote(...,
>>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
>>>>>>>> pin_user_pages_remote() will migrate the page from CMA area to
>>>>>>>> non-CMA
>>>>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation
>>>>>>>> requests
>>>>>>>> return CMA memory, pin_user_pages_remote() will enter endless
>>>>>>>> loops.
>>>>>>>>
>>>>>>>> backtrace:
>>>>>>>> pin_user_pages_remote
>>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>>>>> --------__get_user_pages_locked
>>>>>>>> --------check_and_migrate_movable_pages //always check fail and
>>>>>>>> continue
>>>>>>>> to migrate
>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>>>>
>>>>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>>>>> allocation requests?
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>
>>>>>>>>> If so, will CMA somehow become unable to migrate, causing
>>>>>>>>> cma_alloc()
>>>>>>>>> to fail?
>>>>>>>>
>>>>>>>>
>>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If
>>>>>>>> non-movable allocation requests return CMA memory,
>>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to
>>>>>>>> another
>>>>>>>> CMA page, which is useless and cause endless loops in
>>>>>>>> __gup_longterm_locked().
>>>>>>
>>>>>> This is only one perspective. We also need to consider the impact on
>>>>>> CMA itself. For example,
>>>>>> when CMA is borrowed by THP, and we need to reclaim it through
>>>>>> cma_alloc() or dma_alloc_coherent(),
>>>>>> we must move those pages out to ensure CMA's users can retrieve that
>>>>>> contiguous memory.
>>>>>>
>>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we
>>>>>> can't relocate them.
>>>>>> As a result, cma_alloc() is more likely to fail.
>>>>>>
>>>>>>>>
>>>>>>>> backtrace:
>>>>>>>> pin_user_pages_remote
>>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>>>>> --------__get_user_pages_locked
>>>>>>>> --------check_and_migrate_movable_pages //always check fail and
>>>>>>>> continue
>>>>>>>> to migrate
>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>>> THP-sized allocations")
>>>>>>>>>> Signed-off-by: yangge <yangge1116@....com>
>>>>>>>>>> ---
>>>>>>>>>> mm/page_alloc.c | 10 ++++++++++
>>>>>>>>>> 1 file changed, 10 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
>>>>>>>>>> *preferred_zone,
>>>>>>>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order
>>>>>>>>>> > 1));
>>>>>>>>>>
>>>>>>>>>> if (likely(pcp_allowed_order(order))) {
>>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>>>> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
>>>>>>>>>> ALLOC_CMA ||
>>>>>>>>>> + order !=
>>>>>>>>>> HPAGE_PMD_ORDER) {
>>>>>>>>>> + page = rmqueue_pcplist(preferred_zone,
>>>>>>>>>> zone,
>>>>>>>>>> order,
>>>>>>>>>> + migratetype,
>>>>>>>>>> alloc_flags);
>>>>>>>>>> + if (likely(page))
>>>>>>>>>> + goto out;
>>>>>>>>>> + }
>>>>>>>>>
>>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use
>>>>>>>>> PCP.
>>>>>>>>> But it
>>>>>>>>> still seems better than causing the failure of CMA allocation.
>>>>>>>>>
>>>>>>>>> Is there a possible approach to avoiding adding CMA THP into
>>>>>>>>> pcp from
>>>>>>>>> the first
>>>>>>>>> beginning? Otherwise, we might need a separate PCP for CMA.
>>>>>>>>>
>>>>>>>
>>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
>>>>>>> adding CMA THP into pcp may incur a slight performance penalty.
>>>>>>>
>>>>>>
>>>>>> But the majority of movable pages aren't CMA, right?
>>>>>
>>>>>> Do we have an estimate for
>>>>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new
>>>>>> cacheline, which
>>>>>> the original intention for THP was to avoid by having only one
>>>>>> PCP[1]?
>>>>>>
>>>>>> [1]
>>>>>> https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>>>>
>>>>>
>>>>> The size of struct per_cpu_pages is 256 bytes in current code
>>>>> containing
>>>>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized
>>>>> allocations").
>>>>> crash> struct per_cpu_pages
>>>>> struct per_cpu_pages {
>>>>> spinlock_t lock;
>>>>> int count;
>>>>> int high;
>>>>> int high_min;
>>>>> int high_max;
>>>>> int batch;
>>>>> u8 flags;
>>>>> u8 alloc_factor;
>>>>> u8 expire;
>>>>> short free_count;
>>>>> struct list_head lists[13];
>>>>> }
>>>>> SIZE: 256
>>>>>
>>>>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>>>>> list
>>>>> for THP-sized allocations"), the size of struct per_cpu_pages is
>>>>> 272 bytes.
>>>>> crash> struct per_cpu_pages
>>>>> struct per_cpu_pages {
>>>>> spinlock_t lock;
>>>>> int count;
>>>>> int high;
>>>>> int high_min;
>>>>> int high_max;
>>>>> int batch;
>>>>> u8 flags;
>>>>> u8 alloc_factor;
>>>>> u8 expire;
>>>>> short free_count;
>>>>> struct list_head lists[15];
>>>>> }
>>>>> SIZE: 272
>>>>>
>>>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations") decrease one cacheline.
>>>>
>>>> the proposal is not reverting the patch but adding one CMA pcp.
>>>> so it is "struct list_head lists[14]"; in this case, the size is still
>>>> 256?
>>>>
>>>
>>> Yes, the size is still 256. If add one PCP list, we will have 2 PCP
>>> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other
>>> PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that
>>> right?
>>
>> i am not quite sure about MIGRATE_RECLAIMABLE as we want to
>> CMA is only used by movable.
>> So it might be:
>> MOVABLE and NON-MOVABLE.
>
> One PCP list is used by UNMOVABLE pages, and the other PCP list is used
> by MOVABLE pages, seems it is feasible. UNMOVABLE PCP list contains
> MIGRATE_UNMOVABLE pages and MIGRATE_RECLAIMABLE pages, and MOVABLE PCP
> list contains MIGRATE_MOVABLE pages.
>
Is the following modification feasiable?
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define NR_PCP_THP 2
+#define PCP_THP_MOVABLE 0
+#define PCP_THP_UNMOVABLE 1
#else
#define NR_PCP_THP 0
#endif
static inline unsigned int order_to_pindex(int migratetype, int order)
{
+ int pcp_type = migratetype;
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (order > PAGE_ALLOC_COSTLY_ORDER) {
VM_BUG_ON(order != HPAGE_PMD_ORDER);
- return NR_LOWORDER_PCP_LISTS;
+
+ if (migratetype != MIGRATE_MOVABLE)
+ pcp_type = PCP_THP_UNMOVABLE;
+ else
+ pcp_type = PCP_THP_MOVABLE;
+
+ return NR_LOWORDER_PCP_LISTS + pcp_type;
}
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
#endif
- return (MIGRATE_PCPTYPES * order) + migratetype;
+ return (MIGRATE_PCPTYPES * order) + pcp_type;
}
@@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex)
int order = pindex / MIGRATE_PCPTYPES;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (pindex == NR_LOWORDER_PCP_LISTS)
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
order = HPAGE_PMD_ORDER;
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
>>>>>>> refer to it.
>>>>>>>
>>>>>>>
>>>>>>>>>> +#else
>>>>>>>>>> page = rmqueue_pcplist(preferred_zone,
>>>>>>>>>> zone, order,
>>>>>>>>>> migratetype,
>>>>>>>>>> alloc_flags);
>>>>>>>>>> if (likely(page))
>>>>>>>>>> goto out;
>>>>>>>>>> +#endif
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> page = rmqueue_buddy(preferred_zone, zone, order,
>>>>>>>>>> alloc_flags,
>>>>>>>>>> --
>>>>>>>>>> 2.7.4
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>
Powered by blists - more mailing lists