Message-ID: <4b3c4ecf-2a3f-4868-8b6a-5c1b1920ca79@126.com>
Date: Thu, 16 Jan 2025 09:33:21 +0800
From: Ge Yang <yangge1116@....com>
To: Vlastimil Babka <vbabka@...e.cz>, akpm@...ux-foundation.org
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, 21cnbao@...il.com,
david@...hat.com, baolin.wang@...ux.alibaba.com, hannes@...xchg.org,
liuzixing@...on.cn
Subject: Re: [PATCH] mm: compaction: use the actual allocation context to
determine the watermarks for costly order during async memory compaction

On 2025/1/15 17:56, Vlastimil Babka wrote:
> On 1/15/25 09:31, yangge1116@....com wrote:
>> From: yangge <yangge1116@....com>
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> Long-term GUP cannot allocate memory from the CMA area, so a maximum of
>> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>> sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to consistently return true.
>>
>> For costly allocations, if the __compaction_suitable() function always
>> returns true, it causes the __alloc_pages_slowpath() function to fail
>> to exit at the appropriate point. This prevents timely fallback to
>> allocating memory on other nodes, ultimately resulting in excessively
>> long virtual machine startup times.
>>
>> Call trace:
>> __alloc_pages_slowpath
>>     if (compact_result == COMPACT_SKIPPED ||
>>             compact_result == COMPACT_DEFERRED)
>>         goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> We could use the real unmovable allocation context to have
>> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't
>> pass the order-0 check anymore once the non-CMA part is exhausted. There
>> is some risk that in some different scenario the compaction could in
>> fact migrate pages from the exhausted non-CMA part of the zone to the
>> CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY
>> allocations should be affected in the immediate "goto nopage" when
>> compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
>> anyway and won't fail without trying to compact-migrate the non-CMA
>> pageblocks into CMA pageblocks first, so it should be fine.
>>
>> After this fix, it only takes a few tens of seconds to start a 32GB
>> virtual machine with device passthrough functionality.
>
> So did you verify it works?
After multiple tests, it has been confirmed to work properly. Thank you.

> I just realized there might still be cases it
> won't help. There might be enough free order-0 pages in the non-CMA
> pageblocks (so the additional check will not stop us) but fragmented and
> impossible to compact due to unmovable pages. Then we won't avoid your
> issue, right?
>
In my case, the pinned pages are mostly Transparent Huge Pages (THP), so
it is not common to find free order-0 pages in non-CMA pageblocks that
are fragmented and impossible to compact due to the presence of
unmovable pages. This patch resolves my issue.

>> Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
>> Signed-off-by: yangge <yangge1116@....com>
>
> In case it really helps reliably:
>
> Acked-by: Vlastimil Babka <vbabka@...e.cz>
>
> Some nits below:
>
>> ---
>> mm/compaction.c | 31 +++++++++++++++++++++++++++----
>> 1 file changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..9032bb6 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>> */
>> static enum compact_result
>> compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>> - int highest_zoneidx, unsigned int alloc_flags)
>> + int highest_zoneidx, unsigned int alloc_flags,
>> + bool async)
>> {
>> unsigned long watermark;
>>
>> @@ -2499,6 +2500,25 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>> alloc_flags))
>> return COMPACT_SUCCESS;
>>
>> + /*
>> + * For costly orders, during the async memory compaction process, use the
>> + * actual allocation context to determine the watermarks. There's some risk
>> + * that in some different scenario the compaction could in fact migrate
>> + * pages from the exhausted non-CMA part of the zone to the CMA part and
>> + * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
>> + * should be affected in the immediate "goto nopage" when compaction is
>> + * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
>> + * fail without trying to compact-migrate the non-CMA pageblocks into CMA
>> + * pageblocks first, so it should be fine.
>
> I think it explains more why not to do this than why to do this. How
> about:
>
> For unmovable allocations (without ALLOC_CMA), check if there is enough free
> memory in the non-CMA pageblocks. Otherwise compaction could form the
> high-order page in CMA pageblocks, which would not help the allocation to
> succeed. However, limit the check to costly order async compaction (such as
> opportunistic THP attempts) because there is the possibility that compaction
> would migrate pages from non-CMA to CMA pageblock.
>
>> + */
>> + if (order > PAGE_ALLOC_COSTLY_ORDER && async) {
>
> We could also check for !(alloc_flags & ALLOC_CMA) here to avoid the
> watermark check in the normal THP allocation case (not from pinned gup),
> because then it just repeats the watermark check that was done above.
>
>> + watermark = low_wmark_pages(zone) + compact_gap(order);
>> + if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
>> + alloc_flags & ALLOC_CMA,
>
> And then here we can just pass 0.
>
>> + zone_page_state(zone, NR_FREE_PAGES)))
>> + return COMPACT_SKIPPED;
>> + }
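
If I understand both suggestions correctly (skip the extra check when
ALLOC_CMA is set, and then pass 0 as alloc_flags to __zone_watermark_ok()),
the check could end up roughly like the untested sketch below, using the
comment wording proposed above:

	/*
	 * For unmovable allocations (without ALLOC_CMA), check if there is
	 * enough free memory in the non-CMA pageblocks. Otherwise compaction
	 * could form the high-order page in CMA pageblocks, which would not
	 * help the allocation to succeed. However, limit the check to costly
	 * order async compaction (such as opportunistic THP attempts) because
	 * there is the possibility that compaction would migrate pages from
	 * non-CMA to CMA pageblocks.
	 */
	if (order > PAGE_ALLOC_COSTLY_ORDER && async &&
	    !(alloc_flags & ALLOC_CMA)) {
		watermark = low_wmark_pages(zone) + compact_gap(order);
		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
					 0, zone_page_state(zone, NR_FREE_PAGES)))
			return COMPACT_SKIPPED;
	}
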
>> +
>> if (!compaction_suitable(zone, order, highest_zoneidx))
>> return COMPACT_SKIPPED;
>>
>> @@ -2534,7 +2554,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>> if (!is_via_compact_memory(cc->order)) {
>> ret = compaction_suit_allocation_order(cc->zone, cc->order,
>> cc->highest_zoneidx,
>> - cc->alloc_flags);
>> + cc->alloc_flags,
>> + cc->mode == MIGRATE_ASYNC);
>> if (ret != COMPACT_CONTINUE)
>> return ret;
>> }
>> @@ -3037,7 +3058,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>>
>> ret = compaction_suit_allocation_order(zone,
>> pgdat->kcompactd_max_order,
>> - highest_zoneidx, ALLOC_WMARK_MIN);
>> + highest_zoneidx, ALLOC_WMARK_MIN,
>> + 0);
>
> It's bool, so false instead of 0.
>
>> if (ret == COMPACT_CONTINUE)
>> return true;
>> }
>> @@ -3078,7 +3100,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>> continue;
>>
>> ret = compaction_suit_allocation_order(zone,
>> - cc.order, zoneid, ALLOC_WMARK_MIN);
>> + cc.order, zoneid, ALLOC_WMARK_MIN,
>> + cc.mode == MIGRATE_ASYNC);
>
> We could also just pass false here as kcompactd uses MIGRATE_SYNC_LIGHT and
> has no real alloc_context.
>
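If we take that suggestion, the kcompactd_do_work() call site would
presumably just become (untested sketch):

	ret = compaction_suit_allocation_order(zone,
			cc.order, zoneid, ALLOC_WMARK_MIN,
			false);
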
>> if (ret != COMPACT_CONTINUE)
>> continue;
>>