Message-ID: <20230919184731.GC112714@monkey>
Date: Tue, 19 Sep 2023 11:47:31 -0700
From: Mike Kravetz <mike.kravetz@...cle.com>
To: Johannes Weiner <hannes@...xchg.org>, Zi Yan <ziy@...dia.com>
Cc: Vlastimil Babka <vbabka@...e.cz>,
Andrew Morton <akpm@...ux-foundation.org>,
Mel Gorman <mgorman@...hsingularity.net>,
Miaohe Lin <linmiaohe@...wei.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
On 09/19/23 02:49, Johannes Weiner wrote:
> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
> > On 09/18/23 10:52, Johannes Weiner wrote:
> > > On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > > > On 9/16/23 21:57, Mike Kravetz wrote:
> > > > > On 09/15/23 10:16, Johannes Weiner wrote:
> > > > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > > > >
> > > > > With the patch below applied, a slightly different workload triggers the
> > > > > following warnings. It seems related, and appears to go away when
> > > > > reverting the series.
> > > > >
> > > > > [ 331.595382] ------------[ cut here ]------------
> > > > > [ 331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > > > [ 331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> > > >
> > > > Initially I thought this demonstrates the possible race I was suggesting in
> > > > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is CMA and we
> > > > are trying to get a MOVABLE page from a CMA pageblock, which is something
> > > > that's normally done and the pageblock stays CMA. So yeah, if the warnings
> > > > are to stay, they need to handle this case. Maybe the same can happen with
> > > > HIGHATOMIC blocks?
>
> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
> show any CMA pages.
>
> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
> and HIGHATOMIC.
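>
> To spell out the numbering (a sketch of the enum in
> include/linux/mmzone.h, assuming CONFIG_CMA and CONFIG_MEMORY_ISOLATION
> are both enabled, as they must be here for 5 to mean ISOLATE):
>
> 	enum migratetype {
> 		MIGRATE_UNMOVABLE,	/* 0 */
> 		MIGRATE_MOVABLE,	/* 1 */
> 		MIGRATE_RECLAIMABLE,	/* 2 */
> 		MIGRATE_PCPTYPES,	/* 3, number of types on the pcp lists */
> 		MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,	/* also 3 */
> 		MIGRATE_CMA,		/* 4, under CONFIG_CMA */
> 		MIGRATE_ISOLATE,	/* 5, under CONFIG_MEMORY_ISOLATION */
> 		MIGRATE_TYPES
> 	};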
>
> > > This means we have an order-10 page where one half is MOVABLE and the
> > > other is CMA.
>
> This means the scenario is different:
>
> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
> that the first pageblock is indeed MOVABLE. During the expand, the
> second pageblock turns out to be of type MIGRATE_ISOLATE.
>
> The page allocator wouldn't have merged those types. It triggers a bit
> too fast to be a race condition.
>
> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
> while the head is on the list, and then stranded there.
>
> Could this be an issue in the page_isolation code? Maybe a range
> rounding error?
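>
> For reference, the pageblock helpers round like this
> (include/linux/pageblock-flags.h):
>
> 	#define pageblock_start_pfn(pfn)	ALIGN_DOWN((pfn), pageblock_nr_pages)
> 	#define pageblock_end_pfn(pfn)		ALIGN((pfn) + 1, pageblock_nr_pages)
>
> So a range that is pageblock-aligned but not MAX_ORDER-aligned could
> still have a boundary land in the middle of a free order-10 buddy;
> speculation on my part, just to illustrate the kind of rounding issue
> I mean.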
>
> Zi Yan, does this ring a bell for you?
>
> I don't quite see how my patches could have caused this. But AFAICS we
> also didn't have warnings for this scenario so it could be an old bug.
>
> > > Mike, could you describe the workload that is triggering this?
> >
> > This 'slightly different workload' is actually a slightly different
> > environment. Sorry for misspeaking! The slight difference is that this
> > environment does not use the 'alloc hugetlb gigantic pages from CMA'
> > (hugetlb_cma) feature that triggered the previous issue.
> >
> > This is still on a 16G VM. Kernel command line here is:
> > "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
> > root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
> > console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
> > hugetlb_free_vmemmap=on"
> >
> > The workload is just running this script:
> > while true; do
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> >
> > >
> > > Does this reproduce instantly and reliably?
> > >
> >
> > It is not 'instant' but will reproduce fairly reliably within a minute
> > or so.
> >
> > Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
> > to end up calling alloc_contig_pages -> alloc_contig_range. Those pages
> > will eventually be freed via __free_pages(folio, 9).
>
> No luck reproducing this yet, but I have a question. In that crash
> stack trace, the expand() is called via this:
>
> [ 331.645847] get_page_from_freelist+0x3ed/0x1040
> [ 331.646837] ? prepare_alloc_pages.constprop.0+0x197/0x1b0
> [ 331.647977] __alloc_pages+0xec/0x240
> [ 331.648783] alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
> [ 331.649912] __alloc_fresh_hugetlb_folio+0x157/0x230
> [ 331.650938] alloc_pool_huge_folio+0xad/0x110
> [ 331.651909] set_max_huge_pages+0x17d/0x390
>
> I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
> alloc_fresh_hugetlb_folio(), which has this:
>
> 	if (hstate_is_gigantic(h))
> 		folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
> 	else
> 		folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
> 				nid, nmask, node_alloc_noretry);
>
> where gigantic is defined as the order exceeding MAX_ORDER, which
> should be the case for 1G pages on x86.
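>
> For completeness, that check is essentially (include/linux/hugetlb.h):
>
> 	static inline bool hstate_is_gigantic(struct hstate *h)
> 	{
> 		return huge_page_order(h) > MAX_ORDER;
> 	}
>
> With MAX_ORDER == 10, a 1G hstate (order 18) is gigantic while a 2M
> hstate (order 9) is not, so 2M requests go to the buddy allocator.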
>
> So the crashing stack must be from a 2M allocation, no? I'm confused
> how that could happen with the above test case.
Sorry for causing the confusion!
When I originally saw the warnings pop up, I was running the above script
as well as another that only allocated order-9 hugetlb pages:
while true; do
 echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
 echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
done
The warnings were actually triggered by allocations in this second script.
However, when reporting the warnings I wanted to include the simplest way
to recreate them, and I noticed that the second script running in parallel
was not required. Again, sorry for the confusion! Here is a warning
triggered via the alloc_contig_range path while running only the one
script.
[ 107.275821] ------------[ cut here ]------------
[ 107.277001] page type is 0, passed migratetype is 1 (nr=512)
[ 107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
[ 107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[ 107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
[ 107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[ 107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
[ 107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
[ 107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
[ 107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
[ 107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
[ 107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
[ 107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
[ 107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
[ 107.311839] FS: 00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
[ 107.314695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
[ 107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 107.321575] Call Trace:
[ 107.322314] <TASK>
[ 107.323002] ? del_page_from_free_list+0x137/0x170
[ 107.324380] ? __warn+0x7d/0x130
[ 107.325341] ? del_page_from_free_list+0x137/0x170
[ 107.326627] ? report_bug+0x18d/0x1c0
[ 107.327632] ? prb_read_valid+0x17/0x20
[ 107.328711] ? handle_bug+0x41/0x70
[ 107.329685] ? exc_invalid_op+0x13/0x60
[ 107.330787] ? asm_exc_invalid_op+0x16/0x20
[ 107.331937] ? del_page_from_free_list+0x137/0x170
[ 107.333189] __free_one_page+0x2ab/0x6f0
[ 107.334375] free_pcppages_bulk+0x169/0x210
[ 107.335575] drain_pages_zone+0x3f/0x50
[ 107.336691] __drain_all_pages+0xe2/0x1e0
[ 107.337843] alloc_contig_range+0x143/0x280
[ 107.339026] alloc_contig_pages+0x210/0x270
[ 107.340200] alloc_fresh_hugetlb_folio+0xa6/0x270
[ 107.341529] alloc_pool_huge_page+0x7d/0x100
[ 107.342745] set_max_huge_pages+0x162/0x340
[ 107.345059] nr_hugepages_store_common+0x91/0xf0
[ 107.346329] kernfs_fop_write_iter+0x108/0x1f0
[ 107.347547] vfs_write+0x207/0x400
[ 107.348543] ksys_write+0x63/0xe0
[ 107.349511] do_syscall_64+0x37/0x90
[ 107.350543] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 107.351940] RIP: 0033:0x7fabb8daee87
[ 107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
[ 107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
[ 107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
[ 107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
[ 107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
[ 107.365968] </TASK>
[ 107.366534] ---[ end trace 0000000000000000 ]---
[ 121.542474] ------------[ cut here ]------------
Perhaps that is another useful piece of information: the warning can be
triggered via both allocation paths.
To be perfectly clear, here is what I did today:

- Built next-20230919. It does not contain your series.
  I could not recreate the issue.
- Added your series and the patch to remove
  VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk.
  I could recreate the issue while running only the one script.
  The warning above is from that run.
- Added this suggested patch from Zi:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1400e674ab86..77a4aea31a7f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 	end = pageblock_end_pfn(pfn) - 1;
 
 	/* Do not cross zone boundaries */
+#if 0
 	if (!zone_spans_pfn(zone, start))
 		start = zone->zone_start_pfn;
+#else
+	if (!zone_spans_pfn(zone, start))
+		start = pfn;
+#endif
 	if (!zone_spans_pfn(zone, end))
 		return false;
I can still trigger warnings.
One idea about recreating the issue is that it may have to do with the
size of my VM (16G) relative to the requested allocation size (4G).
However, I tried to really stress the allocations by increasing the
number of hugetlb pages requested, and that did not help. I also noticed
that I only seem to get two warnings and then they stop, even if I
continue to run the script.
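
One guess about why they stop, assuming the checks in your series are
VM_WARN_ONCE() style: each call site would fire only once per boot.
That is, something of this shape (a sketch of what I think the check
looks like, not the actual patch; check_freelist_mt is a made-up name):

	static inline void check_freelist_mt(struct page *page,
					     int migratetype,
					     unsigned long nr)
	{
		/* migratetype recorded in the pageblock flags */
		int mt = get_pageblock_migratetype(page);

		/* warn (once per call site) if it disagrees with the
		 * freelist type the caller passed */
		VM_WARN_ONCE(mt != migratetype,
			     "page type is %d, passed migratetype is %d (nr=%lu)\n",
			     mt, migratetype, nr);
	}

If so, there would be exactly one warning from expand() and one from
del_page_from_free_list() no matter how long the scripts run.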
Zi asked about my config, so it is attached.
--
Mike Kravetz