Message-ID: <4b3c4ecf-2a3f-4868-8b6a-5c1b1920ca79@126.com>
Date: Thu, 16 Jan 2025 09:33:21 +0800
From: Ge Yang <yangge1116@....com>
To: Vlastimil Babka <vbabka@...e.cz>, akpm@...ux-foundation.org
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, 21cnbao@...il.com,
david@...hat.com, baolin.wang@...ux.alibaba.com, hannes@...xchg.org,
liuzixing@...on.cn
Subject: Re: [PATCH] mm: compaction: use the actual allocation context to
determine the watermarks for costly order during async memory compaction

On 2025/1/15 17:56, Vlastimil Babka wrote:
> On 1/15/25 09:31, yangge1116@....com wrote:
>> From: yangge <yangge1116@....com>
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> Long-term GUP cannot allocate memory from the CMA area, so a maximum of
>> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>> sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to consistently return true.
>>
>> For costly allocations, if the __compaction_suitable() function always
>> returns true, it causes the __alloc_pages_slowpath() function to fail
>> to exit at the appropriate point. This prevents timely fallback to
>> allocating memory on other nodes, ultimately resulting in excessively
>> long virtual machine startup times.
>>
>> Call trace:
>> __alloc_pages_slowpath
>>     if (compact_result == COMPACT_SKIPPED ||
>>             compact_result == COMPACT_DEFERRED)
>>         goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> We could use the real unmovable allocation context to have
>> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't
>> pass the order-0 check anymore once the non-CMA part is exhausted. There
>> is some risk that in some different scenario the compaction could in
>> fact migrate pages from the exhausted non-CMA part of the zone to the
>> CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY
>> allocations should be affected in the immediate "goto nopage" when
>> compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
>> anyway and won't fail without trying to compact-migrate the non-CMA
>> pageblocks into CMA pageblocks first, so it should be fine.
>>
>> After this fix, it only takes a few tens of seconds to start a 32GB
>> virtual machine with device passthrough functionality.
>
> So did you verify it works?
After multiple tests, it has been confirmed to work properly. Thank you.

> I just realized there might still be cases it
> won't help. There might be enough free order-0 pages in the non-CMA
> pageblocks (so the additional check will not stop us) but fragmented and
> impossible to compact due to unmovable pages. Then we won't avoid your
> issue, right?
>
In my case, the pinned pages are mostly Transparent Huge Pages (THP), so
it is not common to find free order-0 pages in non-CMA pageblocks that
are fragmented and impossible to compact due to the presence of
unmovable pages. This patch resolves my issue.

>> Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
>> Signed-off-by: yangge <yangge1116@....com>
>
> In case it really helps reliably:
>
> Acked-by: Vlastimil Babka <vbabka@...e.cz>
>
> Some nits below:
>
>> ---
>> mm/compaction.c | 31 +++++++++++++++++++++++++++----
>> 1 file changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..9032bb6 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>> */
>> static enum compact_result
>> compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>> - int highest_zoneidx, unsigned int alloc_flags)
>> + int highest_zoneidx, unsigned int alloc_flags,
>> + bool async)
>> {
>> unsigned long watermark;
>>
>> @@ -2499,6 +2500,25 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>> alloc_flags))
>> return COMPACT_SUCCESS;
>>
>> + /*
>> + * For costly orders, during the async memory compaction process, use the
>> + * actual allocation context to determine the watermarks. There's some risk
>> + * that in some different scenario the compaction could in fact migrate
>> + * pages from the exhausted non-CMA part of the zone to the CMA part and
>> + * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
>> + * should be affected in the immediate "goto nopage" when compaction is
>> + * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
>> + * fail without trying to compact-migrate the non-CMA pageblocks into CMA
>> + * pageblocks first, so it should be fine.
>
> I think it explains more why not to do this than why to do this. How
> about:
>
> For unmovable allocations (without ALLOC_CMA), check if there is enough free
> memory in the non-CMA pageblocks. Otherwise compaction could form the
> high-order page in CMA pageblocks, which would not help the allocation to
> succeed. However, limit the check to costly order async compaction (such as
> opportunistic THP attempts) because there is the possibility that compaction
> would migrate pages from non-CMA to CMA pageblock.
>
>> + */
>> + if (order > PAGE_ALLOC_COSTLY_ORDER && async) {
>
> We could also check for !(alloc_flags & ALLOC_CMA) here to avoid the
> watermark check in the normal THP allocation case (not from pinned gup),
> because then it just repeats the watermark check that was done above.
>
>> + watermark = low_wmark_pages(zone) + compact_gap(order);
>> + if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
>> + alloc_flags & ALLOC_CMA,
>
> And then here we can just pass 0.
>
>> + zone_page_state(zone, NR_FREE_PAGES)))
>> + return COMPACT_SKIPPED;
>> + }
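
If I understand both suggestions correctly (skip the extra check when
ALLOC_CMA is set, and then pass 0 as alloc_flags to __zone_watermark_ok()),
the check could end up roughly like the untested sketch below, using the
comment wording proposed above:

	/*
	 * For unmovable allocations (without ALLOC_CMA), check if there is
	 * enough free memory in the non-CMA pageblocks. Otherwise compaction
	 * could form the high-order page in CMA pageblocks, which would not
	 * help the allocation to succeed. However, limit the check to costly
	 * order async compaction (such as opportunistic THP attempts) because
	 * there is the possibility that compaction would migrate pages from
	 * non-CMA to CMA pageblocks.
	 */
	if (order > PAGE_ALLOC_COSTLY_ORDER && async &&
	    !(alloc_flags & ALLOC_CMA)) {
		watermark = low_wmark_pages(zone) + compact_gap(order);
		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
					 0, zone_page_state(zone, NR_FREE_PAGES)))
			return COMPACT_SKIPPED;
	}
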
>> +
>> if (!compaction_suitable(zone, order, highest_zoneidx))
>> return COMPACT_SKIPPED;
>>
>> @@ -2534,7 +2554,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>> if (!is_via_compact_memory(cc->order)) {
>> ret = compaction_suit_allocation_order(cc->zone, cc->order,
>> cc->highest_zoneidx,
>> - cc->alloc_flags);
>> + cc->alloc_flags,
>> + cc->mode == MIGRATE_ASYNC);
>> if (ret != COMPACT_CONTINUE)
>> return ret;
>> }
>> @@ -3037,7 +3058,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>>
>> ret = compaction_suit_allocation_order(zone,
>> pgdat->kcompactd_max_order,
>> - highest_zoneidx, ALLOC_WMARK_MIN);
>> + highest_zoneidx, ALLOC_WMARK_MIN,
>> + 0);
>
> It's bool, so false instead of 0.
>
>> if (ret == COMPACT_CONTINUE)
>> return true;
>> }
>> @@ -3078,7 +3100,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>> continue;
>>
>> ret = compaction_suit_allocation_order(zone,
>> - cc.order, zoneid, ALLOC_WMARK_MIN);
>> + cc.order, zoneid, ALLOC_WMARK_MIN,
>> + cc.mode == MIGRATE_ASYNC);
>
> We could also just pass false here as kcompactd uses MIGRATE_SYNC_LIGHT and
> has no real alloc_context.
>
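If we take that suggestion, the kcompactd_do_work() call site would
presumably just become (untested sketch):

	ret = compaction_suit_allocation_order(zone,
			cc.order, zoneid, ALLOC_WMARK_MIN,
			false);
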
>> if (ret != COMPACT_CONTINUE)
>> continue;
>>