linux-kernel - Re: [PATCH v1 1/2] mm/page_alloc: conditionally split > pageblock_order pages in free_one_page() and move_freepages_block

Message-ID: <b99aa897-2c09-4944-9558-16b45ebbbd94@redhat.com>
Date: Mon, 9 Dec 2024 23:10:01 +0100
From: David Hildenbrand <david@...hat.com>
To: Zi Yan <ziy@...dia.com>, Vlastimil Babka <vbabka@...e.cz>
Cc: linux-kernel@...r.kernel.org, Johannes Weiner <hannes@...xchg.org>,
 linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
 Yu Zhao <yuzhao@...gle.com>
Subject: Re: [PATCH v1 1/2] mm/page_alloc: conditionally split >
 pageblock_order pages in free_one_page() and move_freepages_block_isolate()

On 09.12.24 22:42, Zi Yan wrote:
> On 9 Dec 2024, at 16:35, David Hildenbrand wrote:
> 
>> On 09.12.24 20:23, Zi Yan wrote:
>>> On 9 Dec 2024, at 14:01, Vlastimil Babka wrote:
>>>
>>>> On 12/6/24 10:59, David Hildenbrand wrote:
>>>>> Let's special-case for the common scenarios that:
>>>>>
>>>>> (a) We are freeing pages <= pageblock_order
>>>>> (b) We are freeing a page <= MAX_PAGE_ORDER and all pageblocks match
>>>>>       (especially, no mixture of isolated and non-isolated pageblocks)
>>>>
>>>> Well in many of those cases we could also just adjust the pageblocks... But
>>>> perhaps they indeed shouldn't differ in the first place, unless there's an
>>>> isolation attempt.
>>>>
>>>>> When we encounter a > MAX_PAGE_ORDER page, it can only come from
>>>>> alloc_contig_range(), and we can process MAX_PAGE_ORDER chunks.
>>>>>
>>>>> When we encounter a >pageblock_order <= MAX_PAGE_ORDER page,
>>>>> check whether all pageblocks match, and if so (common case), don't
>>>>> split them up just for the buddy to merge them back.
>>>>>
>>>>> This makes sure that when we free MAX_PAGE_ORDER chunks to the buddy,
>>>>> for example during system startups, memory onlining, or when isolating
>>>>> consecutive pageblocks via alloc_contig_range()/memory offlining, that
>>>>> we don't unnecessarily split up what we'll immediately merge again,
>>>>> because the migratetypes match.
>>>>>
>>>>> Rename split_large_buddy() to __free_one_page_maybe_split(), to make it
>>>>> clearer what's happening, and handle in it only natural buddy orders,
>>>>> not the alloc_contig_range(__GFP_COMP) special case: handle that in
>>>>> free_one_page() only.
>>>>>
>>>>> Signed-off-by: David Hildenbrand <david@...hat.com>
>>>>
>>>> Acked-by: Vlastimil Babka <vbabka@...e.cz>
>>>>
>>>> Hm but noticed something:
>>>>
>>>>> +static void __free_one_page_maybe_split(struct zone *zone, struct page *page,
>>>>> +		unsigned long pfn, int order, fpi_t fpi_flags)
>>>>> +{
>>>>> +	const unsigned long end_pfn = pfn + (1 << order);
>>>>> +	int mt = get_pfnblock_migratetype(page, pfn);
>>>>> +
>>>>> +	VM_WARN_ON_ONCE(order > MAX_PAGE_ORDER);
>>>>>    	VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
>>>>>    	/* Caller removed page from freelist, buddy info cleared! */
>>>>>    	VM_WARN_ON_ONCE(PageBuddy(page));
>>>>>
>>>>> -	if (order > pageblock_order)
>>>>> -		order = pageblock_order;
>>>>> -
>>>>> -	while (pfn != end) {
>>>>> -		int mt = get_pfnblock_migratetype(page, pfn);
>>>>> +	/*
>>>>> +	 * With CONFIG_MEMORY_ISOLATION, we might be freeing MAX_ORDER_NR_PAGES
>>>>> +	 * pages that cover pageblocks with different migratetypes; for example
>>>>> +	 * only some migratetypes might be MIGRATE_ISOLATE. In that (unlikely)
>>>>> +	 * case, fallback to freeing individual pageblocks so they get put
>>>>> +	 * onto the right lists.
>>>>> +	 */
>>>>> +	if (!IS_ENABLED(CONFIG_MEMORY_ISOLATION) ||
>>>>> +	    likely(order <= pageblock_order) ||
>>>>> +	    pfnblock_migratetype_equal(pfn + pageblock_nr_pages, end_pfn, mt)) {
>>>>> +		__free_one_page(page, pfn, zone, order, mt, fpi_flags);
>>>>> +		return;
>>>>> +	}
>>>>>
>>>>> -		__free_one_page(page, pfn, zone, order, mt, fpi);
>>>>> -		pfn += 1 << order;
>>>>> +	while (pfn != end_pfn) {
>>>>> +		mt = get_pfnblock_migratetype(page, pfn);
>>>>> +		__free_one_page(page, pfn, zone, pageblock_order, mt, fpi_flags);
>>>>> +		pfn += pageblock_nr_pages;
>>>>>    		page = pfn_to_page(pfn);
>>>>
>>>> This predates your patch, but seems potentially dangerous to attempt
>>>> pfn_to_page(end_pfn) with SPARSEMEM and no vmemmap and the end_pfn perhaps
>>>> being just outside of the valid range? Should we change that?
>>>>
>>>> But seems this code was initially introduced as part of Johannes'
>>>> migratetype hygiene series.
>>>
>>> It starts as split_free_page() from commit b2c9e2fbba32 ("mm: make
>>> alloc_contig_range work at pageblock granularity"), but harmless since
>>> it is only used to split a buddy page. Then commit fd919a85cd55 ("mm:
>>> page_isolation: prepare for hygienic freelists") refactored it, which
>>> should be fine, since it is still used for the same purpose in page
>>> isolation. Then commit e98337d11bbd ("mm/contig_alloc: support __GFP_COMP")
>>> used it for gigantic hugetlb.
>>>
>>> For SPARSEMEM && !SPARSEMEM_VMEMMAP, PFNs are contiguous, vmemmap might not
>>> be. The code above using pfn in the loop might be fine. And since order
>>> is provided, unless the caller is providing a falsely large order, pfn
>>> should be valid. Or am I missing anything?
>>
>> I think the question is, what happens when we call pfn_to_page() on a PFN that falls into a memory section that is either offline, doesn't have a memmap, or does not exist.
>>
>> With CONFIG_SPARSEMEM, we do a
>>
>> struct mem_section *__sec = __pfn_to_section(__pfn);
>> __section_mem_map_addr(__sec) + __pfn;
>>
>> __pfn_to_section() can return NULL, in which case __section_mem_map_addr() would dereference NULL.
>>
>> I assume it could happen in corner cases, if we'd exceed NR_SECTION_ROOTS. (IOW, large memory, and we free a page that is at the very end of physical memory).
>>
>> Likely, we should do the pfn_to_page() before the __free_one_page() call.
> 
> Got it. Both you and Vlastimil gave the same corner case issue.
> I agree that doing pfn_to_page() before the __free_one_page() could get rid of
> the concern.

Thank you both for the review. I'll resend a v2 tomorrow, including a
patch to fix that up first.
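
To illustrate the reordering discussed above, here is a toy userspace
sketch. It is not the v2 patch and not kernel code: toy_pfn_to_page(),
toy_free_block(), the counters, and the tiny sizes are all made up for
illustration. The stand-in lookup returns NULL and bumps a counter where
the real pfn_to_page() could dereference a NULL section. The first loop
keeps the current ordering (lookup after advancing pfn), the second does
the lookup at the top of each iteration, so end_pfn is never looked up.

```c
#include <assert.h>
#include <stddef.h>

/* Toy values, chosen only to keep the example small (assumptions). */
#define PAGEBLOCK_ORDER		2
#define PAGEBLOCK_NR_PAGES	(1UL << PAGEBLOCK_ORDER)
#define NR_VALID_PFNS		16UL	/* PFNs >= this have no memmap */

struct page { int migratetype; };

static struct page memmap[NR_VALID_PFNS];
static unsigned long lookups_past_end;	/* lookups on PFNs without memmap */
static unsigned long blocks_freed;

/* Hypothetical stand-in for pfn_to_page(): counts instead of crashing. */
static struct page *toy_pfn_to_page(unsigned long pfn)
{
	if (pfn >= NR_VALID_PFNS) {
		lookups_past_end++;
		return NULL;
	}
	return &memmap[pfn];
}

/* Hypothetical stand-in for __free_one_page() on a single pageblock. */
static void toy_free_block(struct page *page, unsigned long pfn)
{
	assert(page == &memmap[pfn]);
	blocks_freed++;
}

/* Current ordering: the last iteration looks up end_pfn itself. */
static void free_lookup_after(unsigned long pfn, int order)
{
	const unsigned long end_pfn = pfn + (1UL << order);
	struct page *page = toy_pfn_to_page(pfn);

	while (pfn != end_pfn) {
		toy_free_block(page, pfn);
		pfn += PAGEBLOCK_NR_PAGES;
		page = toy_pfn_to_page(pfn);
	}
}

/* Reordered: look up the page first, so only PFNs below end_pfn are used. */
static void free_lookup_before(unsigned long pfn, int order)
{
	const unsigned long end_pfn = pfn + (1UL << order);

	while (pfn != end_pfn) {
		struct page *page = toy_pfn_to_page(pfn);

		toy_free_block(page, pfn);
		pfn += PAGEBLOCK_NR_PAGES;
	}
}
```

Freeing an order-3 range that starts at PFN 8 (so end_pfn equals
NR_VALID_PFNS) frees two toy pageblocks in both variants, but only the
first variant performs a lookup past the last valid PFN.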

-- 
Cheers,

David / dhildenb