Open Source and information security mailing list archives
Message-ID: <762CA634-053A-41DD-8ED7-895374640858@nvidia.com>
Date:   Wed, 20 Sep 2023 13:23:18 -0400
From:   Zi Yan <ziy@...dia.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Vlastimil Babka <vbabka@...e.cz>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Kefeng Wang <wangkefeng.wang@...wei.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 20 Sep 2023, at 12:04, Johannes Weiner wrote:

> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>> On 9/20/23 03:38, Zi Yan wrote:
>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>
>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>
>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>  		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>
>>>>>>>  		/* Do not cross zone boundaries */
>>>>>>> 	+#if 0
>>>>>>>  		if (!zone_spans_pfn(zone, start))
>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>> 	+#else
>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>> 	+		start = pfn;
>>>>>>> 	+#endif
>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>> 	 		return false;
>>>>>>> 	I can still trigger warnings.
>>>>>>
>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>
>>>>>
>>>>> Just to be really clear,
>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>   path WITHOUT your change.
>>>>>
>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>
>>>>> I went back and reran focusing on the specific migrate type.
>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>
>>>>> I could be wrong, but I do not think your patch changes things.
>>>>
>>>> Got it. Thanks for the clarification.
>>>>>
>>>>>>>
>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>> script.
>>>>>>>
>>>>>>> Zi asked about my config, so it is attached.
>>>>>>
>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>> trying. Thanks.
>>>>>>
>>>>>
>>>>> Perhaps try running both scripts in parallel?
>>>>
>>>> Yes. It seems to do the trick.
>>>>
>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>
>>>> I am able to reproduce it with the script below:
>>>>
>>>> while true; do
>>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>  wait
>>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>> done
>>>>
>>>> I will look into the issue.
>>
>> Nice!
>>
>> I managed to reproduce it ONCE, triggering it not even a second after
>> starting the script. But I can't seem to do it twice, even after
>> several reboots and letting it run for minutes.
>
> I managed to reproduce it reliably by cutting the nr_hugepages
> parameters respectively in half.
>
> The one that triggers for me is always MIGRATE_ISOLATE. With some
> printk-tracing, the scenario seems to be this:
>
> #0                                                   #1
> start_isolate_page_range()
>   isolate_single_pageblock()
>     set_migratetype_isolate(tail)
>       lock zone->lock
>       move_freepages_block(tail) // nop
>       set_pageblock_migratetype(tail)
>       unlock zone->lock
>                                                      del_page_from_freelist(head)
>                                                      expand(head, head_mt)
>                                                        WARN(head_mt != tail_mt)
>     start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>     for (pfn = start_pfn, pfn < end_pfn)
>       if (PageBuddy())
>         split_free_page(head)
>
> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
> lock. The move_freepages_block() does nothing because the PageBuddy()
> is set on the pageblock to the left. Once we drop the lock, the buddy
> gets allocated and the expand() puts things on the wrong list. The
> splitting code that handles MAX_ORDER blocks runs *after* the tail
> type is set and the lock has been dropped, so it's too late.

Yes, I can confirm this is the issue. But the current behavior is intentional:
it enables allocating a contiguous range at pageblock granularity instead of
MAX_ORDER granularity. With your changes below, that no longer works: if there
is an unmovable page in
[ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
the allocation fails, whereas it would succeed in the current implementation.

I think a proper fix would be to make move_freepages_block() split the
MAX_ORDER page and put the split pages in the right migratetype free lists.

I am working on that.

>
> I think this would work fine if we always set MIGRATE_ISOLATE in a
> linear fashion, with start and end aligned to MAX_ORDER. Then we also
> wouldn't have to split things.
>
> There are two reasons this doesn't happen today:
>
> 1. The isolation range is rounded to pageblocks, not MAX_ORDER. In
>    this test case they always seem aligned, but it's not
>    guaranteed. However,
>
> 2. start_isolate_page_range() explicitly breaks ordering by doing the
>    last block in the range before the center. It's that last block
>    that triggers the race with __rmqueue_smallest -> expand() for me.
>
> With the below patch I can no longer reproduce the issue:
>
> ---
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b5c7a9d21257..b7c8730bf0e2 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -538,8 +538,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  	unsigned long pfn;
>  	struct page *page;
>  	/* isolation is done at page block granularity */
> -	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
> -	unsigned long isolate_end = pageblock_align(end_pfn);
> +	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
> +	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
>  	int ret;
>  	bool skip_isolation = false;
>
> @@ -549,17 +549,6 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  	if (ret)
>  		return ret;
>
> -	if (isolate_start == isolate_end - pageblock_nr_pages)
> -		skip_isolation = true;
> -
> -	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
> -	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
> -			skip_isolation, migratetype);
> -	if (ret) {
> -		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
> -		return ret;
> -	}
> -
>  	/* skip isolated pageblocks at the beginning and end */
>  	for (pfn = isolate_start + pageblock_nr_pages;
>  	     pfn < isolate_end - pageblock_nr_pages;
> @@ -568,12 +557,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  		if (page && set_migratetype_isolate(page, migratetype, flags,
>  					start_pfn, end_pfn)) {
>  			undo_isolate_page_range(isolate_start, pfn, migratetype);
> -			unset_migratetype_isolate(
> -				pfn_to_page(isolate_end - pageblock_nr_pages),
> -				migratetype);
>  			return -EBUSY;
>  		}
>  	}
> +
> +	if (isolate_start == isolate_end - pageblock_nr_pages)
> +		skip_isolation = true;
> +
> +	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
> +	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
> +			skip_isolation, migratetype);
> +	if (ret) {
> +		undo_isolate_page_range(isolate_start, pfn, migratetype);
> +		return ret;
> +	}
> +
>  	return 0;
>  }
>
> @@ -591,8 +589,8 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  {
>  	unsigned long pfn;
>  	struct page *page;
> -	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
> -	unsigned long isolate_end = pageblock_align(end_pfn);
> +	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
> +	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
>
>  	for (pfn = isolate_start;
>  	     pfn < isolate_end;


--
Best Regards,
Yan, Zi
