lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6DD1F426-A87D-47B7-B27F-043B399CBEDA@nvidia.com>
Date:   Wed, 27 Sep 2023 23:22:37 -0400
From:   Zi Yan <ziy@...dia.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Kefeng Wang <wangkefeng.wang@...wei.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 26 Sep 2023, at 14:19, David Hildenbrand wrote:

> On 21.09.23 16:47, Zi Yan wrote:
>> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
>>
>>> On 21.09.23 04:31, Zi Yan wrote:
>>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>>
>>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>>
>>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>>
>>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>>
>>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>>    		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>>
>>>>>>>>>>>>    		/* Do not cross zone boundaries */
>>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>>    		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>>> 	+#else
>>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>>> 	+#endif
>>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>>> 	 		return false;
>>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>>
>>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Just to be really clear,
>>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>>     path WITHOUT your change.
>>>>>>>>>>
>>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>>
>>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>>
>>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>>
>>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>>> script.
>>>>>>>>>>>>
>>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>>
>>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>>> trying. Thanks.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>>
>>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>>
>>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>>
>>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>>
>>>>>>>>> while true; do
>>>>>>>>>    echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>>    echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>>    wait
>>>>>>>>>    echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>>    echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>>> done
>>>>>>>>>
>>>>>>>>> I will look into the issue.
>>>>>>>
>>>>>>> Nice!
>>>>>>>
>>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>>> several reboots and letting it run for minutes.
>>>>>>
>>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>>> parameters respectively in half.
>>>>>>
>>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>>> printk-tracing, the scenario seems to be this:
>>>>>>
>>>>>> #0                                                   #1
>>>>>> start_isolate_page_range()
>>>>>>     isolate_single_pageblock()
>>>>>>       set_migratetype_isolate(tail)
>>>>>>         lock zone->lock
>>>>>>         move_freepages_block(tail) // nop
>>>>>>         set_pageblock_migratetype(tail)
>>>>>>         unlock zone->lock
>>>>>>                                                        del_page_from_freelist(head)
>>>>>>                                                        expand(head, head_mt)
>>>>>>                                                          WARN(head_mt != tail_mt)
>>>>>>       start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>>       for (pfn = start_pfn, pfn < end_pfn)
>>>>>>         if (PageBuddy())
>>>>>>           split_free_page(head)
>>>>>>
>>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>>> type is set and the lock has been dropped, so it's too late.
>>>>>
>>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>>> granularity. With your changes below, it no longer works, because if there
>>>>> is an unmovable page in
>>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>>> the allocation fails but it would succeed in current implementation.
>>>>>
>>>>> I think a proper fix would be to make move_freepages_block() split the
>>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>>
>>>>> I am working on that.
>>>>
>>>> After spending half a day on this, I think it is much harder than I thought
>>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>>>
>>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>>> free list yet.
>>>>
>>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>>> the free pages.
>>>>
>>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>>> in-use pages. But it is not the case when alloc_contig_range() work on
>>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>>> will need to be split to get their subpages into the right free lists.
>>>>
>>>> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
>>>> the code just isolate one pageblock, migrate the page, and let split_free_page()
>>>> to correct the free list later. But to strictly enforce freelist migratetype
>>>> hygiene, extra work is needed at free page path to split the free page into
>>>> the right freelists.
>>>>
>>>> I need more time to think about how to get alloc_contig_range() properly.
>>>> Help is needed for the bullet point 4.
>>>
>>>
>>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
>>
>> I have that in my backlog since you asked and have been delaying it. ;) Hopefully
>
> It's complicated and I wish I would have had more time to review it
> back then ... or now to clean it up later.
>
> Unfortunately, nobody else did have the time to review it back then ... maybe we can
> do better next time. David doesn't scale.
>
> Doing page migration from inside start_isolate_page_range()->isolate_single_pageblock()
> really is sub-optimal (and mostly code duplication from alloc_contig_range).

I felt the same when I wrote the code. But I thought it was the only way out.

>
>> I can do it after I fix this. That change might or might not help only if we make
>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>> overwrite existing migratetype, the code might not need to split a page and move
>> it to MIGRATE_ISOLATE freelist?
>
> Did someone test how memory offlining plays along with that? (I can try myself
> within the next 1-2 weeks)
>
> There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges,
> though.
>
> ret = start_isolate_page_range(start_pfn, end_pfn,
> 			       MIGRATE_MOVABLE,
> 			       MEMORY_OFFLINE | REPORT_FAILURE,
> 			       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);

Since a full MAX_ORDER range is passed, no free page split will happen.

>
>>
>> The fundamental issue in alloc_contig_range() is that to work at
>> pageblock level, a page (>pageblock_order) can have one part is isolated and
>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>> now checks first pageblock migratetype, so such a page needs to be removed
>> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
>> finally put back to multiple free lists. This needs to be done at isolation stage
>> before free pages are removed from their free lists (the stage after isolation).
>
> One idea was to always isolate larger chunks, and handle movability checks/split/etc
> at a later stage. Once isolation would be decoupled from the actual/original migratetype,
> the could have been easier to handle (especially some corner cases I had in mind back then).

I think it is a good idea. When I coded alloc_contig_range() up, I tried to
accommodate existing set_migratetype_isolate(), which calls has_unmovable_pages().
If these two are decoupled, set_migrateype_isolate() can work on MAX_ORDER-aligned
ranges and has_unmovable_pages() can still work on pageblock-aligned ranges.
Let me give this a try.

>
>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>> in their original migratetype and check migratetype before allocating a page,
>> that might help. But that might add extra work (e.g., splitting a partially
>> isolated free page before allocation) in the really hot code path, which is not
>> desirable.
>
> With MIGRATE_ISOLATE being a separate flag, one idea was to have not a single
> separate isolate list, but one per "proper migratetype". But again, just some random
> thoughts I had back then, I never had sufficient time to think it all through.

Got it. I will think about it.

One question on separate MIGRATE_ISOLATE:

the implementation I have in mind is that MIGRATE_ISOLATE will need a dedicated flag
bit instead of being one of migratetype. But now there are 5 migratetypes +
MIGRATE_ISOLATE and PB_migratetype_bits is 3, so an extra migratetype_bit is needed.
But current migratetype implementation is a word-based operation, requiring
NR_PAGEBLOCK_BITS to be divisor of BITS_PER_LONG. This means NR_PAGEBLOCK_BITS
needs to be increased from 4 to 8 to meet the requirement, wasting a lot of space.
An alternative is to have a separate array for MIGRATE_ISOLATE, which requires
additional changes. Let me know if you have a better idea. Thanks.



--
Best Regards,
Yan, Zi

Download attachment "signature.asc" of type "application/pgp-signature" (855 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ