Message-ID: <16391A0B-4E3C-4F3E-8904-9034DC87DC52@nvidia.com>
Date: Wed, 20 Nov 2024 10:55:09 -0500
From: Zi Yan <ziy@...dia.com>
To: David Hildenbrand <david@...hat.com>
Cc: Yu Zhao <yuzhao@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>,
Muchun Song <muchun.song@...ux.dev>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH mm-unstable v2 1/3] mm/contig_alloc: support __GFP_COMP
On 19 Nov 2024, at 11:52, David Hildenbrand wrote:
> On 19.11.24 17:49, David Hildenbrand wrote:
>> On 19.11.24 17:41, Zi Yan wrote:
>>> On 19 Nov 2024, at 11:31, David Hildenbrand wrote:
>>>
>>>> On 19.11.24 17:12, Zi Yan wrote:
>>>>> On 19 Nov 2024, at 10:29, David Hildenbrand wrote:
>>>>>
>>>>>>> +/* Split a multi-block free page into its individual pageblocks. */
>>>>>>> +static void split_large_buddy(struct zone *zone, struct page *page,
>>>>>>> + unsigned long pfn, int order, fpi_t fpi)
>>>>>>> +{
>>>>>>> + unsigned long end = pfn + (1 << order);
>>>>>>> +
>>>>>>> + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>>>>>> + /* Caller removed page from freelist, buddy info cleared! */
>>>>>>> + VM_WARN_ON_ONCE(PageBuddy(page));
>>>>>>> +
>>>>>>> + if (order > pageblock_order)
>>>>>>> + order = pageblock_order;
>>>>>>> +
>>>>>>> + while (pfn != end) {
>>>>>>> + int mt = get_pfnblock_migratetype(page, pfn);
>>>>>>> +
>>>>>>> + __free_one_page(page, pfn, zone, order, mt, fpi);
>>>>>>> + pfn += 1 << order;
>>>>>>> + page = pfn_to_page(pfn);
>>>>>>> + }
>>>>>>> +}
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> stumbling over this while digging through the code ....
>>>>>>
>>>>>>> +
>>>>>>> static void free_one_page(struct zone *zone, struct page *page,
>>>>>>> unsigned long pfn, unsigned int order,
>>>>>>> fpi_t fpi_flags)
>>>>>>> {
>>>>>>> unsigned long flags;
>>>>>>> - int migratetype;
>>>>>>> spin_lock_irqsave(&zone->lock, flags);
>>>>>>> - migratetype = get_pfnblock_migratetype(page, pfn);
>>>>>>> - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
>>>>>>
>>>>>> This change is rather undesired:
>>>>>>
>>>>>> via __free_pages_core()->__free_pages_ok() we can easily end up here with order=MAX_PAGE_ORDER.
>>>>>
>>>>> Do you have a concrete example? PMD THP on x86_64 is pageblock_order.
>>>>> We do not have PMD level mTHP yet. Any other possible source?
>>>>>
>>>>>>
>>>>>> What your new code will do is split this perfectly reasonable MAX_PAGE_ORDER chunk via split_large_buddy() into pageblock-sized chunks, and let the buddy merging logic undo our unnecessary splitting.
>>>>>>
>>>>>> Is there a way to avoid this and just process the whole MAX_PAGE_ORDER chunk like we used to?
>>>>>
>>>>> Probably split_large_buddy() can check the migratetypes of the to-be-freed
>>>>> page, if order > pageblock_order. If all migratetypes are the same, the page
>>>>> can be freed at MAX_PAGE_ORDER, otherwise pageblock_order.
>>>>
>>>> Thinking about this: why do we care about the migratetype?
>>>>
>>>> We only have to fallback to pageblocks if any pageblock is "MIGRATE_ISOLATE" (and maybe MIGRATE_CMA), but not all. Otherwise, we can just ignore the migratetype (or rather overwrite it)
>>>
>>> There are VM_WARN_ONCEs around *_free_list() operations to make sure
>>> page migratetype matches the migratetype of the list it is on. Ignoring
>>> migratetype would trigger these WARNs. Overwriting it would work but
>>> free page migratetype accounting needs to be taken care of.
>>>
>>> An implicit reason is that __free_one_page() does not support >MAX_PAGE_ORDER
>>> and gigantic hugetlb folios are freed via free_one_page(), where
>>> split_large_buddy() is used to work around the limitation.
>>
>> Yes, I saw that change. But that rare case can easily be identified by
>> unlikely(order > MAX_PAGE_ORDER) and handled in a special way.
>>
>>> For the two memory init cases you mentioned in the other email, maybe a new
>>> fpi flag to make free_one_page() use __free_one_page() for them, since
>>> migratetypes should be the same across the MAX_PAGE_ORDER range there, right?
>>
>> In the context of alloc_frozen_range()/free_frozen_range() I want to
>> free MAX_PAGE_ORDER chunks when possible, and not have some odd logic in
>> the freeing path undo some of that effort.
>
> Adding a pointer to that discussion:
>
> https://lkml.kernel.org/r/ZzZdnuZBf-xgiwD2@casper.infradead.org
Thanks.
So you are thinking about something like this:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6958333054d..3d3341dc1ad1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1254,7 +1254,12 @@ static void free_one_page(struct zone *zone, struct page *page,
 	unsigned long flags;
 
 	spin_lock_irqsave(&zone->lock, flags);
-	split_large_buddy(zone, page, pfn, order, fpi_flags);
+	if (unlikely(order > MAX_PAGE_ORDER)) {
+		split_large_buddy(zone, page, pfn, order, fpi_flags);
+	} else {
+		int migratetype = get_pfnblock_migratetype(page, pfn);
+		__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
+	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	__count_vm_events(PGFREE, 1 << order);
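This would keep the common order <= MAX_PAGE_ORDER free as a single
__free_one_page() call and only pay the pageblock-split cost for the rare
>MAX_PAGE_ORDER (gigantic hugetlb) case.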
Is it possible to have a MAX_PAGE_ORDER hugetlb folio? If not, we are good.
If yes, alloc_contig_range() could change the migratetype of one half of that
folio, and during the migration phase that folio would be freed via
__free_one_page(), causing a migratetype mismatch.
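If that scenario is possible, one way out would be the migratetype check I
suggested earlier in the thread. A minimal sketch, assuming a hypothetical
helper (pfnblock_migratetypes_match() is a made-up name, not in the tree):

/*
 * Sketch only: return true if every pageblock in the order-sized range
 * shares the migratetype of the first block, so the whole range could be
 * freed as a single buddy without a migratetype mismatch.
 */
static bool pfnblock_migratetypes_match(struct page *page, unsigned long pfn,
					unsigned int order)
{
	unsigned long end = pfn + (1UL << order);
	int mt = get_pfnblock_migratetype(page, pfn);

	for (pfn += pageblock_nr_pages; pfn < end; pfn += pageblock_nr_pages) {
		if (get_pfnblock_migratetype(pfn_to_page(pfn), pfn) != mt)
			return false;
	}
	return true;
}

free_one_page() would then call __free_one_page() directly only when this
returns true, and fall back to split_large_buddy() otherwise.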
Best Regards,
Yan, Zi