Message-ID: <4ad51644-92de-47ca-af2a-bcb1866059d2@126.com>
Date: Tue, 21 Jan 2025 18:01:42 +0800
From: Ge Yang <yangge1116@....com>
To: akpm@...ux-foundation.org
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, 21cnbao@...il.com,
 david@...hat.com, baolin.wang@...ux.alibaba.com, hannes@...xchg.org,
 vbabka@...e.cz, liuzixing@...on.cn
Subject: Re: [PATCH V2] mm: compaction: use the actual allocation context to
 determine the watermarks for costly order during async memory compaction

This patch has been revised based on Vlastimil's suggestions. Please 
continue to review it. Thank you.

On 2025/1/16 9:33, yangge1116@....com wrote:
> From: yangge <yangge1116@....com>
> 
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
> 
> Long-term GUP cannot allocate memory from the CMA area, so a maximum of
> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to consistently return true.
> 
> For costly allocations, if the __compaction_suitable() function always
> returns true, it causes the __alloc_pages_slowpath() function to fail
> to exit at the appropriate point. This prevents timely fallback to
> allocating memory on other nodes, ultimately resulting in excessively
> long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>      if (compact_result == COMPACT_SKIPPED ||
>          compact_result == COMPACT_DEFERRED)
>          goto nopage; // should exit __alloc_pages_slowpath() from here
> 
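The early exit in the call trace above can be sketched in user space as follows. This is only a minimal model under stated assumptions: the `model_*` and `MODEL_*` names are hypothetical stand-ins and not kernel symbols, and the condition is reduced to the three inputs the commit message discusses (allocation order, __GFP_NORETRY, and the compaction result).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's enum compact_result values. */
enum model_compact_result {
	MODEL_COMPACT_SKIPPED,
	MODEL_COMPACT_DEFERRED,
	MODEL_COMPACT_CONTINUE,
};

/* Assumption: mirrors PAGE_ALLOC_COSTLY_ORDER, which is 3 in the kernel. */
#define MODEL_COSTLY_ORDER 3

/*
 * Returns true when the modelled slow path should give up on the current
 * attempt right away (the "goto nopage" in the trace). Only costly-order
 * requests made with a no-retry policy take this early exit.
 */
static bool model_should_bail_out(unsigned int order, bool noretry,
				  enum model_compact_result result)
{
	bool costly = order > MODEL_COSTLY_ORDER;

	if (!costly || !noretry)
		return false;

	return result == MODEL_COMPACT_SKIPPED ||
	       result == MODEL_COMPACT_DEFERRED;
}
```

In this model, an order-9 no-retry request bails out only when compaction reports skipped or deferred. If the suitability check keeps passing, the result is never "skipped" and the early exit is never reached, which is the behaviour described above.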
> We could use the real unmovable allocation context to have
> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't
> pass the order-0 check anymore once the non-CMA part is exhausted. There
> is some risk that in some different scenario the compaction could in
> fact migrate pages from the exhausted non-CMA part of the zone to the
> CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY
> allocations should be affected in the immediate "goto nopage" when
> compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
> anyway and won't fail without trying to compact-migrate the non-CMA
> pageblocks into CMA pageblocks first, so it should be fine.
> 
> After this fix, it only takes a few tens of seconds to start a 32GB
> virtual machine with device passthrough functionality.
> 
> Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
> Signed-off-by: yangge <yangge1116@....com>
> Acked-by: Vlastimil Babka <vbabka@...e.cz>
> ---
> 
> V2:
> - update code and commit message as suggested by Vlastimil
> 
>   mm/compaction.c | 29 +++++++++++++++++++++++++----
>   1 file changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..3de7b67 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>    */
>   static enum compact_result
>   compaction_suit_allocation_order(struct zone *zone, unsigned int order,
> -				 int highest_zoneidx, unsigned int alloc_flags)
> +				 int highest_zoneidx, unsigned int alloc_flags,
> +				 bool async)
>   {
>   	unsigned long watermark;
>   
> @@ -2499,6 +2500,23 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>   			      alloc_flags))
>   		return COMPACT_SUCCESS;
>   
> +	/*
> +	 * For unmovable allocations (without ALLOC_CMA), check if there is enough
> +	 * free memory in the non-CMA pageblocks. Otherwise compaction could form
> +	 * the high-order page in CMA pageblocks, which would not help the
> +	 * allocation to succeed. However, limit the check to costly order async
> +	 * compaction (such as opportunistic THP attempts) because there is the
> +	 * possibility that compaction would migrate pages from non-CMA to CMA
> +	 * pageblock.
> +	 */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER && async &&
> +	    !(alloc_flags & ALLOC_CMA)) {
> +		watermark = low_wmark_pages(zone) + compact_gap(order);
> +		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
> +					   0, zone_page_state(zone, NR_FREE_PAGES)))
> +			return COMPACT_SKIPPED;
> +	}
> +
>   	if (!compaction_suitable(zone, order, highest_zoneidx))
>   		return COMPACT_SKIPPED;
>   
> @@ -2534,7 +2552,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>   	if (!is_via_compact_memory(cc->order)) {
>   		ret = compaction_suit_allocation_order(cc->zone, cc->order,
>   						       cc->highest_zoneidx,
> -						       cc->alloc_flags);
> +						       cc->alloc_flags,
> +						       cc->mode == MIGRATE_ASYNC);
>   		if (ret != COMPACT_CONTINUE)
>   			return ret;
>   	}
> @@ -3037,7 +3056,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>   
>   		ret = compaction_suit_allocation_order(zone,
>   				pgdat->kcompactd_max_order,
> -				highest_zoneidx, ALLOC_WMARK_MIN);
> +				highest_zoneidx, ALLOC_WMARK_MIN,
> +				false);
>   		if (ret == COMPACT_CONTINUE)
>   			return true;
>   	}
> @@ -3078,7 +3098,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>   			continue;
>   
>   		ret = compaction_suit_allocation_order(zone,
> -				cc.order, zoneid, ALLOC_WMARK_MIN);
> +				cc.order, zoneid, ALLOC_WMARK_MIN,
> +				false);
>   		if (ret != COMPACT_CONTINUE)
>   			continue;
>   
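For readers following along, the effect of the new check can be modelled in user space roughly as below. This is a simplified sketch, not the kernel implementation: `struct zone_model`, the `model_*` helpers and `MODEL_ALLOC_CMA` are hypothetical names, lowmem reserves and high-atomic reserves are ignored, and the page counts in the usage note are illustrative assumptions based on 4KiB pages.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_ALLOC_CMA    0x1u	/* hypothetical flag bit, not the kernel value */
#define MODEL_COSTLY_ORDER 3	/* assumption: mirrors PAGE_ALLOC_COSTLY_ORDER */

/* Hypothetical, heavily reduced view of a zone's counters (in pages). */
struct zone_model {
	unsigned long free_pages;	/* all free pages, CMA included */
	unsigned long free_cma_pages;	/* free pages inside CMA pageblocks */
	unsigned long low_wmark;	/* low watermark */
};

/* Same formula as the kernel's compact_gap(): twice the request size. */
static unsigned long model_compact_gap(unsigned int order)
{
	return 2UL << order;
}

/* Order-0 watermark check; free CMA pages count only with MODEL_ALLOC_CMA. */
static bool model_order0_ok(const struct zone_model *z, unsigned long mark,
			    unsigned int alloc_flags)
{
	unsigned long usable = z->free_pages;

	assert(z->free_cma_pages <= z->free_pages);
	if (!(alloc_flags & MODEL_ALLOC_CMA))
		usable -= z->free_cma_pages;

	return usable > mark;
}

/* Models the added branch: skip costly-order async compaction for non-CMA users. */
static bool model_skip_compaction(const struct zone_model *z,
				  unsigned int order,
				  unsigned int alloc_flags, bool async)
{
	if (order <= MODEL_COSTLY_ORDER || !async ||
	    (alloc_flags & MODEL_ALLOC_CMA))
		return false;

	/* Flags of 0 here, as in the patch, so free CMA pages are subtracted. */
	return !model_order0_ok(z, z->low_wmark + model_compact_gap(order), 0);
}
```

With 16GB of free CMA (4194304 pages), about 1000 free non-CMA pages and an assumed low watermark of 16384 pages, an order-9 async request without MODEL_ALLOC_CMA is skipped in this model. The same request with MODEL_ALLOC_CMA, or in sync mode, is not.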
