Message-ID: <393367d3-eb69-4255-a7aa-dcbcbe7a403e@redhat.com>
Date: Wed, 20 Nov 2024 17:13:19 +0100
From: David Hildenbrand <david@...hat.com>
To: Zi Yan <ziy@...dia.com>
Cc: Yu Zhao <yuzhao@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>,
 Muchun Song <muchun.song@...ux.dev>,
 "Matthew Wilcox (Oracle)" <willy@...radead.org>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP

On 20.11.24 16:55, Zi Yan wrote:
> On 19 Nov 2024, at 11:52, David Hildenbrand wrote:
> 
>> On 19.11.24 17:49, David Hildenbrand wrote:
>>> On 19.11.24 17:41, Zi Yan wrote:
>>>> On 19 Nov 2024, at 11:31, David Hildenbrand wrote:
>>>>
>>>>> On 19.11.24 17:12, Zi Yan wrote:
>>>>>> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>>>>>>
>>>>>>>> +/* Split a multi-block free page into its individual pageblocks. */
>>>>>>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>>>>>>> +			      unsigned long pfn, int order, fpi_t fpi)
>>>>>>>> +{
>>>>>>>> +	unsigned long end = pfn + (1 << order);
>>>>>>>> +
>>>>>>>> +	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>>>>>>> +	/* Caller removed page from freelist, buddy info cleared! */
>>>>>>>> +	VM_WARN_ON_ONCE(PageBuddy(page));
>>>>>>>> +
>>>>>>>> +	if (order > pageblock_order)
>>>>>>>> +		order = pageblock_order;
>>>>>>>> +
>>>>>>>> +	while (pfn != end) {
>>>>>>>> +		int mt = get_pfnblock_migratetype(page, pfn);
>>>>>>>> +
>>>>>>>> +		__free_one_page(page, pfn, zone, order, mt, fpi);
>>>>>>>> +		pfn += 1 << order;
>>>>>>>> +		page = pfn_to_page(pfn);
>>>>>>>> +	}
>>>>>>>> +}
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> stumbling over this while digging through the code ....
>>>>>>>
>>>>>>>> +
>>>>>>>>       static void free_one_page(struct zone *zone, struct page *page,
>>>>>>>>       			  unsigned long pfn, unsigned int order,
>>>>>>>>       			  fpi_t fpi_flags)
>>>>>>>>       {
>>>>>>>>       	unsigned long flags;
>>>>>>>> -	int migratetype;
>>>>>>>>        	spin_lock_irqsave(&zone->lock, flags);
>>>>>>>> -	migratetype = get_pfnblock_migratetype(page, pfn);
>>>>>>>> -	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>>>>>>
>>>>>>> This change is rather undesired:
>>>>>>>
>>>>>>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>>>>>>
>>>>>> Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
>>>>>> We do not have PMD level mTHP yet. Any other possible source?
>>>>>>
>>>>>>>
>>>>>>> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>>>>>>>
>>>>>>> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
>>>>>>
>>>>>> Probably split_large_buddy() can check the migratetypes of the to-be-freed
>>>>>> page, if order > pageblock_order. If all migratetypes are the same, the page
>>>>>> can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
>>>>>
>>>>> Thinking about this: why do we care about the migratetype?
>>>>>
>>>>> We only have to fall back to pageblocks if any pageblock is "MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all. Otherwise, we can just ignore the migratetype (or rather overwrite it).
>>>>
>>>> There are VM_WARN_ONCEs around *_free_list() operations to make sure
>>>> page migratetype matches the migratetype of the list it is on. Ignoring
>>>> migratetype would trigger these WARNs. Overwriting it would work but
>>>> free page migratetype accounting needs to be taken care of.
>>>>
>>>> An implicit reason is that __free_one_page() does not support >MAX_PAGE_ORDER
>>>> and gigantic hugetlb folios are freed via free_one_page(), where
>>>> split_large_buddy() is used to work around the limitation.
>>>
>>> Yes, I saw that change. But that case can be easily identified by
>>> unlikely(order > MAX_PAGE_ORDER) and handled in a special way.
>>>
>>>> For the two memory init cases you mentioned in the other email, maybe a new
>>>> fpi flag to make free_one_page() use __free_one_page() for them,
>>>> since migratetypes should be the same across MAX_PAGE_ORDER range there, right?
>>>
>>> In the context of alloc_frozen_range()/free_frozen_range() I want to
>>> free MAX_PAGE_ORDER chunks when possible, and not have some odd logic in
>>> the freeing path undo some of that effort.
>>
>> Adding a pointer to that discussion:
>>
>> https://lkml.kernel.org/r/ZzZdnuZBf-xgiwD2@casper.infradead.org
> 
> Thanks.
> 
> So you are thinking about something like this:
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b6958333054d..3d3341dc1ad1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1254,7 +1254,12 @@ static void free_one_page(struct zone *zone, struct page *page,
>   	unsigned long flags;
> 
>   	spin_lock_irqsave(&zone->lock, flags);
> -	split_large_buddy(zone, page, pfn, order, fpi_flags);
> +	if (unlikely(order > MAX_PAGE_ORDER))
> +		split_large_buddy(zone, page, pfn, order, fpi_flags);
> +	else {
> +		int migratetype = get_pfnblock_migratetype(page, pfn);
> +		__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
> +	}
>   	spin_unlock_irqrestore(&zone->lock, flags);
> 
>   	__count_vm_events(PGFREE, 1 << order);
> 

Something in that direction, but likely something like:

if (likely(order <= pageblock_order)) {
	int migratetype = get_pfnblock_migratetype(page, pfn);
	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
} else if (order <= MAX_PAGE_ORDER) {
	/* We might have to split when some pageblocks differ. */
	maybe_split_large_buddy(zone, page, pfn, order, fpi_flags);
} else {
	unsigned long end = pfn + (1UL << order);

	/* The buddy works in max MAX_PAGE_ORDER chunks. */
	while (pfn != end) {	/* for each MAX_PAGE_ORDER chunk */
		maybe_split_large_buddy(zone, page, pfn, MAX_PAGE_ORDER, fpi_flags);
		pfn += MAX_ORDER_NR_PAGES;
		page = pfn_to_page(pfn);
	}
}

Whereby maybe_split_large_buddy() would check whether it has to do anything with
the pageblocks, or whether it can just call __free_one_page(order). It would only
accept order <= MAX_PAGE_ORDER.
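
Just to illustrate the direction (untested sketch; the helper and the
per-pageblock scan are made up here, nothing like this exists yet):

/* Sketch only. Assumes pfn is aligned to 1 << order and order <= MAX_PAGE_ORDER. */
static void maybe_split_large_buddy(struct zone *zone, struct page *page,
				    unsigned long pfn, int order, fpi_t fpi)
{
	int mt = get_pfnblock_migratetype(page, pfn);
	unsigned long cur;

	if (order > pageblock_order) {
		/* Do all covered pageblocks share the same migratetype? */
		for (cur = pfn + pageblock_nr_pages;
		     cur < pfn + (1UL << order);
		     cur += pageblock_nr_pages) {
			if (get_pfnblock_migratetype(pfn_to_page(cur), cur) != mt) {
				/* Mixed pageblocks: fall back to splitting. */
				split_large_buddy(zone, page, pfn, order, fpi);
				return;
			}
		}
	}

	/* Uniform migratetype: free the whole chunk in one go. */
	__free_one_page(page, pfn, zone, order, mt, fpi);
}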

In the future with free_frozen_pages(), the last else case would vanish.

> 
> Is it possible to have a MAX_PAGE_ORDER hugetlb folio? If no, we are good.

Hm, I wouldn't rule it out. To be future proof, we likely should add a
way to handle it. Shouldn't be too hard or expensive: simply read all
pageblock migratetypes.

I wonder if there are configs (no page isolation?) where we don't have to
check anything.
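
Maybe something like this (hypothetical helper, untested; the idea being that
only MIGRATE_ISOLATE / MIGRATE_CMA pageblocks must be preserved, so without
memory isolation and CMA we could simply overwrite the migratetype and skip
the scan entirely):

static inline bool free_page_needs_pageblock_check(void)
{
	/* Only isolated / CMA pageblocks must not be merged over. */
	return IS_ENABLED(CONFIG_MEMORY_ISOLATION) || IS_ENABLED(CONFIG_CMA);
}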

> If yes, alloc_contig_range() can change the migratetype of one half of that folio,
> and during the migration phase that folio will be freed via __free_one_page(),
> causing a migratetype mismatch.

Can you remind me how that happens?

Likely, we isolate a single pageblock, and shortly after we free a larger
page that covers multiple pageblocks? So this could affect other
alloc_contig_range() users as well, I assume?

Thanks!

-- 
Cheers,

David / dhildenb

