Message-ID: <E0353116-F4D7-4039-B89F-35B7F7C9C2B6@nvidia.com>
Date: Wed, 20 Nov 2024 11:46:31 -0500
From: Zi Yan <ziy@...dia.com>
To: David Hildenbrand <david@...hat.com>
Cc: Yu Zhao <yuzhao@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>,
Muchun Song <muchun.song@...ux.dev>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP

On 20 Nov 2024, at 11:13, David Hildenbrand wrote:
> On 20.11.24 16:55, Zi Yan wrote:
>> On 19 Nov 2024, at 11:52, David Hildenbrand wrote:
>>
>>> On 19.11.24 17:49, David Hildenbrand wrote:
>>>> On 19.11.24 17:41, Zi Yan wrote:
>>>>> On 19 Nov 2024, at 11:31, David Hildenbrand wrote:
>>>>>
>>>>>> On 19.11.24 17:12, Zi Yan wrote:
>>>>>>> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>>>>>>>
>>>>>>>>> +/* Split a multi-block free page into its individual pageblocks. */
>>>>>>>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>>>>>>>> +			      unsigned long pfn, int order, fpi_t fpi)
>>>>>>>>> +{
>>>>>>>>> +	unsigned long end = pfn + (1 << order);
>>>>>>>>> +
>>>>>>>>> +	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>>>>>>>> +	/* Caller removed page from freelist, buddy info cleared! */
>>>>>>>>> +	VM_WARN_ON_ONCE(PageBuddy(page));
>>>>>>>>> +
>>>>>>>>> +	if (order > pageblock_order)
>>>>>>>>> +		order = pageblock_order;
>>>>>>>>> +
>>>>>>>>> +	while (pfn != end) {
>>>>>>>>> +		int mt = get_pfnblock_migratetype(page, pfn);
>>>>>>>>> +
>>>>>>>>> +		__free_one_page(page, pfn, zone, order, mt, fpi);
>>>>>>>>> +		pfn += 1 << order;
>>>>>>>>> +		page = pfn_to_page(pfn);
>>>>>>>>> +	}
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> stumbling over this while digging through the code ....
>>>>>>>>
>>>>>>>>> +
>>>>>>>>>  static void free_one_page(struct zone *zone, struct page *page,
>>>>>>>>>  			  unsigned long pfn, unsigned int order,
>>>>>>>>>  			  fpi_t fpi_flags)
>>>>>>>>>  {
>>>>>>>>>  	unsigned long flags;
>>>>>>>>> -	int migratetype;
>>>>>>>>>
>>>>>>>>>  	spin_lock_irqsave(&zone->lock, flags);
>>>>>>>>> -	migratetype = get_pfnblock_migratetype(page, pfn);
>>>>>>>>> -	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>>>>>>>
>>>>>>>> This change is rather undesired:
>>>>>>>>
>>>>>>>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>>>>>>>
>>>>>>> Do you have a concrete example? PMD THP on x86_64 is pageblock_order,
>>>>>>> and we do not have PMD-level mTHP yet. Any other possible source?
>>>>>>>
>>>>>>>>
>>>>>>>> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>>>>>>>>
>>>>>>>> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
>>>>>>>
>>>>>>> Probably split_large_buddy() can check the migratetypes of the to-be-freed
>>>>>>> page, if order > pageblock_order. If all migratetypes are the same, the page
>>>>>>> can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
>>>>>>
>>>>>> Thinking about this: why do we care about the migratetype?
>>>>>>
>>>>>> We only have to fall back to pageblocks if some pageblock is "MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all of them are. Otherwise, we can just ignore the migratetype (or rather overwrite it).
>>>>>
>>>>> There are VM_WARN_ONCEs around *_free_list() operations to make sure
>>>>> page migratetype matches the migratetype of the list it is on. Ignoring
>>>>> migratetype would trigger these WARNs. Overwriting it would work but
>>>>> free page migratetype accounting needs to be taken care of.
>>>>>
>>>>> An implicit reason is that __free_one_page() does not support >MAX_PAGE_ORDER
>>>>> and gigantic hugetlb folios are freed via free_one_page(), where
>>>>> split_large_buddy() is used to work around the limitation.
>>>>
>>>> Yes, I saw that change. But that can easily be identified via
>>>> unlikely(order > MAX_PAGE_ORDER) to handle that rare case in a special way.
>>>>
>>>>
>>>>> For the two memory init cases you mentioned in the other email, maybe a new
>>>>> fpi flag to make free_one_page() use __free_one_page() for them,
>>>>> since migratetypes should be the same across MAX_PAGE_ORDER range there, right?
>>>>
>>>> In the context of alloc_frozen_range()/free_frozen_range() I want to
>>>> free MAX_PAGE_ORDER chunks when possible, and not have some odd logic in
>>>> the freeing path undo some of that effort.
>>>
>>> Adding a pointer to that discussion:
>>>
>>> https://lkml.kernel.org/r/ZzZdnuZBf-xgiwD2@casper.infradead.org
>>
>> Thanks.
>>
>> So you are thinking about something like this:
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b6958333054d..3d3341dc1ad1 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1254,7 +1254,12 @@ static void free_one_page(struct zone *zone, struct page *page,
>>  	unsigned long flags;
>>
>>  	spin_lock_irqsave(&zone->lock, flags);
>> -	split_large_buddy(zone, page, pfn, order, fpi_flags);
>> +	if (unlikely(order > MAX_PAGE_ORDER))
>> +		split_large_buddy(zone, page, pfn, order, fpi_flags);
>> +	else {
>> +		int migratetype = get_pfnblock_migratetype(page, pfn);
>> +		__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>> +	}
>>  	spin_unlock_irqrestore(&zone->lock, flags);
>>
>>  	__count_vm_events(PGFREE, 1 << order);
>>
>
> Something in that direction, but likely something like:
>
> if (likely(order <= pageblock_order)) {
> 	int migratetype = get_pfnblock_migratetype(page, pfn);
> 	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
> } else if (order <= MAX_PAGE_ORDER) {
> 	/* We might have to split when some pageblocks differ. */
> 	maybe_split_large_buddy(zone, page, pfn, order, fpi_flags);
> } else {
> 	/* The buddy works in max MAX_PAGE_ORDER chunks. */
> 	for /* each MAX_PAGE_ORDER chunk */ {
> 		maybe_split_large_buddy(zone, page, pfn, MAX_PAGE_ORDER, fpi_flags);
> 		pfn += MAX_ORDER_NR_PAGES;
> 		page = ...
> 	}
> }
>
> Whereby maybe_split_large_buddy would check if it has to do anything with the
> pageblocks, or whether it can just call __free_one_page(order). It would only
> accept order <= MAX_PAGE_ORDER.
>
> In the future with free_frozen_pages(), the last else case would vanish.
>
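
Just to check that I follow, maybe_split_large_buddy() could look roughly
like the sketch below (illustration only, nothing settled; the name comes
from your outline, and the scan is the "read all pageblock migratetypes"
you mention further down):

/* Sketch: free as one chunk if all pageblocks agree, otherwise split. */
static void maybe_split_large_buddy(struct zone *zone, struct page *page,
				    unsigned long pfn, int order, fpi_t fpi)
{
	int mt = get_pfnblock_migratetype(page, pfn);
	unsigned long end = pfn + (1 << order);
	unsigned long p;

	/* A chunk within a single pageblock cannot straddle migratetypes. */
	if (order > pageblock_order) {
		for (p = pfn + pageblock_nr_pages; p != end;
		     p += pageblock_nr_pages) {
			if (get_pfnblock_migratetype(pfn_to_page(p), p) != mt) {
				/* Pageblocks differ: pageblock-sized frees. */
				split_large_buddy(zone, page, pfn, order, fpi);
				return;
			}
		}
	}
	__free_one_page(page, pfn, zone, order, mt, fpi);
}

The order <= pageblock_order fast path then simply falls through to the
single __free_one_page() call.
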
>>
>> Is it possible to have a MAX_PAGE_ORDER hugetlb folio? If no, we are good.
>
> Hm, I wouldn't rule it out. To be future proof, we likely should add a
> way to handle it. Shouldn't be too hard or expensive: simply read all
> pageblock migratetypes.
>
> I wonder if there are configs (no page isolation?) where we don't have to
> check anything.

CONFIG_CONTIG_ALLOC? CONFIG_MEMORY_ISOLATION seems too much.
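
For illustration, the kind of gate I have in mind (the helper name is made
up, and whether CONFIG_CONTIG_ALLOC vs. CONFIG_MEMORY_ISOLATION is right is
exactly the open question):

/* Can pageblock migratetypes matter at free time on this config? */
static inline bool free_pageblocks_may_differ(void)
{
	return IS_ENABLED(CONFIG_CONTIG_ALLOC);
}
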
>
>> If yes, alloc_contig_range() can change the migratetype of one half of that
>> folio, and during the migration phase that folio will be freed via
>> __free_one_page(), causing a migratetype mismatch.
>
> Can you remind me how that happens?

Suppose there is a MAX_PAGE_ORDER hugetlb folio spanning two pageblocks, and
alloc_contig_range() wants a range that overlaps only the first pageblock of
that folio. All pageblocks in the range are first marked as isolated by
start_isolate_page_range(), then __alloc_contig_migrate_range() isolates and
migrates the hugetlb folio. During migration, the original folio is freed,
and since it is MAX_PAGE_ORDER, based on my code above it would be freed via
a single __free_one_page() even though its two pageblocks now have different
migratetypes.
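
To illustrate the layout (the second pageblock keeps whatever migratetype
it had before, e.g. MIGRATE_MOVABLE):

      pageblock 0             pageblock 1
  [ MIGRATE_ISOLATE   ][ e.g. MIGRATE_MOVABLE ]
   \_______ one MAX_PAGE_ORDER hugetlb folio _______/

Freeing that folio as a single order-MAX_PAGE_ORDER buddy would have to
pick one migratetype for memory spanning two different ones, tripping the
VM_WARN_ONCEs around the freelist operations mentioned earlier.
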
But with your maybe_split_large_buddy(), it would work.
>
> Likely, we isolate a single pageblock, and shortly after we free a larger
> page that covers multiple pageblocks? So this could affect other
> alloc_contig_range() users as well, I assume?

Best Regards,
Yan, Zi