Message-ID: <d09cbac5-d3fd-4212-a138-8ab6496c8f4f@suse.cz>
Date: Wed, 15 Jan 2025 10:56:56 +0100
From: Vlastimil Babka <vbabka@...e.cz>
To: yangge1116@....com, akpm@...ux-foundation.org
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, 21cnbao@...il.com,
david@...hat.com, baolin.wang@...ux.alibaba.com, hannes@...xchg.org,
liuzixing@...on.cn
Subject: Re: [PATCH] mm: compaction: use the actual allocation context to
determine the watermarks for costly order during async memory compaction
On 1/15/25 09:31, yangge1116@....com wrote:
> From: yangge <yangge1116@....com>
>
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
>
> Long-term GUP cannot allocate memory from the CMA area, so a maximum of
> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to consistently return true.
>
> For costly allocations, if the __compaction_suitable() function always
> returns true, it causes the __alloc_pages_slowpath() function to fail
> to exit at the appropriate point. This prevents timely fallback to
> allocating memory on other nodes, ultimately resulting in excessively
> long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
>
> We could use the real unmovable allocation context to have
> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't
> pass the order-0 check anymore once the non-CMA part is exhausted. There
> is some risk that in some different scenario the compaction could in
> fact migrate pages from the exhausted non-CMA part of the zone to the
> CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY
> allocations should be affected in the immediate "goto nopage" when
> compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
> anyway and won't fail without trying to compact-migrate the non-CMA
> pageblocks into CMA pageblocks first, so it should be fine.
>
> After this fix, it only takes a few tens of seconds to start a 32GB
> virtual machine with device passthrough functionality.

So did you verify it works? I just realized there might still be cases where it
won't help. There might be enough free order-0 pages in the non-CMA pageblocks
(so the additional check will not stop us) that are nevertheless fragmented and
impossible to compact due to unmovable pages. Then we won't avoid your issue,
right?

> Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
> Signed-off-by: yangge <yangge1116@....com>

In case it really helps reliably:

Acked-by: Vlastimil Babka <vbabka@...e.cz>

Some nits below:

> ---
> mm/compaction.c | 31 +++++++++++++++++++++++++++----
> 1 file changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..9032bb6 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> */
> static enum compact_result
> compaction_suit_allocation_order(struct zone *zone, unsigned int order,
> - int highest_zoneidx, unsigned int alloc_flags)
> + int highest_zoneidx, unsigned int alloc_flags,
> + bool async)
> {
> unsigned long watermark;
>
> @@ -2499,6 +2500,25 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
> alloc_flags))
> return COMPACT_SUCCESS;
>
> + /*
> + * For costly orders, during the async memory compaction process, use the
> + * actual allocation context to determine the watermarks. There's some risk
> + * that in some different scenario the compaction could in fact migrate
> + * pages from the exhausted non-CMA part of the zone to the CMA part and
> + * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
> + * should be affected in the immediate "goto nopage" when compaction is
> + * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
> + * fail without trying to compact-migrate the non-CMA pageblocks into CMA
> + * pageblocks first, so it should be fine.

I think it's explaining too much about why not to do this rather than why to do
it. How about:

For unmovable allocations (without ALLOC_CMA), check if there is enough free
memory in the non-CMA pageblocks. Otherwise compaction could form the
high-order page in CMA pageblocks, which would not help the allocation to
succeed. However, limit the check to costly order async compaction (such as
opportunistic THP attempts) because there is the possibility that compaction
would migrate pages from non-CMA to CMA pageblock.

> + */
> + if (order > PAGE_ALLOC_COSTLY_ORDER && async) {

We could also check for !(alloc_flags & ALLOC_CMA) here to avoid the
watermark check in the normal THP allocation case (not from pinned gup),
because then it just repeats the watermark check that was done above.

> + watermark = low_wmark_pages(zone) + compact_gap(order);
> + if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
> + alloc_flags & ALLOC_CMA,

And then here we can just pass 0.

> + zone_page_state(zone, NR_FREE_PAGES)))
> + return COMPACT_SKIPPED;
> + }
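
Taken together, the two nits above would make this block look roughly like the
following (completely untested, just to illustrate):

        if (order > PAGE_ALLOC_COSTLY_ORDER && async &&
            !(alloc_flags & ALLOC_CMA)) {
                watermark = low_wmark_pages(zone) + compact_gap(order);
                if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
                                         0, zone_page_state(zone, NR_FREE_PAGES)))
                        return COMPACT_SKIPPED;
        }
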
> +
> if (!compaction_suitable(zone, order, highest_zoneidx))
> return COMPACT_SKIPPED;
>
> @@ -2534,7 +2554,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
> if (!is_via_compact_memory(cc->order)) {
> ret = compaction_suit_allocation_order(cc->zone, cc->order,
> cc->highest_zoneidx,
> - cc->alloc_flags);
> + cc->alloc_flags,
> + cc->mode == MIGRATE_ASYNC);
> if (ret != COMPACT_CONTINUE)
> return ret;
> }
> @@ -3037,7 +3058,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>
> ret = compaction_suit_allocation_order(zone,
> pgdat->kcompactd_max_order,
> - highest_zoneidx, ALLOC_WMARK_MIN);
> + highest_zoneidx, ALLOC_WMARK_MIN,
> + 0);

It's bool, so false instead of 0.

> if (ret == COMPACT_CONTINUE)
> return true;
> }
> @@ -3078,7 +3100,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
> continue;
>
> ret = compaction_suit_allocation_order(zone,
> - cc.order, zoneid, ALLOC_WMARK_MIN);
> + cc.order, zoneid, ALLOC_WMARK_MIN,
> + cc.mode == MIGRATE_ASYNC);

We could also just pass false here as kcompactd uses MIGRATE_SYNC_LIGHT and
has no real alloc_context.
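I.e. simply (keeping the rest of the hunk as is):

        ret = compaction_suit_allocation_order(zone,
                                        cc.order, zoneid, ALLOC_WMARK_MIN,
                                        false);
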
> if (ret != COMPACT_CONTINUE)
> continue;
>