Message-ID: <CAF8kJuMFxM25aCgqreMDaYziQ-SuZxfq9uK0CwwquQdsH-iGQw@mail.gmail.com>
Date: Mon, 17 Jun 2024 16:47:46 -0700
From: Chris Li <chrisl@...nel.org>
To: Hugh Dickins <hughd@...gle.com>
Cc: Barry Song <21cnbao@...il.com>, akpm@...ux-foundation.org, baohua@...nel.org,
kaleshsingh@...gle.com, kasong@...cent.com, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, ryan.roberts@....com, ying.huang@...el.com
Subject: Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
On Mon, Jun 17, 2024 at 4:00 PM Hugh Dickins <hughd@...gle.com> wrote:
>
> On Mon, 17 Jun 2024, Chris Li wrote:
> > On Sat, Jun 15, 2024 at 1:47 AM Barry Song <21cnbao@...il.com> wrote:
> > > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@...ux-foundation.org> wrote:
> > > > On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@...nel.org> wrote:
> > > >
> > > > > > I'm having trouble understanding the overall impact of this on users.
> > > > > > We fail the mTHP swap allocation and fall back, but things continue to
> > > > > > operate OK?
> > > > >
> > > > > Continue to operate OK in the sense that the mTHP will have to split
> > > > > into 4K pages before the swap out, aka the fall back. The swap out and
> > > > > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> > > > > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> > > > > happen. That is the effect of the fallback. But mTHP swap out and swap
> > > > > in is relatively new, it is not really a regression.
> > > >
> > > > Sure, but it's pretty bad to merge a new feature only to have it
> > > > ineffective after a few hours use.
> ....
> > > >
> > > $ /home/barry/develop/linux/mthp_swpout_test -s
> > >
> > > [ 1013.535798] ------------[ cut here ]------------
> > > [ 1013.538886] expecting order 4 got 0
> >
> > This warning means there is a bug in this series somewhere I need to hunt down.
> > The V1 has the same warning but I haven't heard it get triggered in
> > V1, it is something new in V2.
> >
> > Andrew, please consider removing the series from mm-unstable until I
> > resolve this warning assert.
>
> Agreed: I was glad to see it go into mm-unstable last week, that made
> it easier to include in testing (or harder to avoid!), but my conclusion
> is that it's not ready yet (and certainly not suitable for 6.10 hotfix).
>
> I too saw this "expecting order 4 got 0" once-warning every boot (from
> ordinary page reclaim rather than from madvise_pageout shown below),
> shortly after starting my tmpfs swapping load. But I never saw any bad
> effect immediately after it: actual crashes came a few minutes later.
>
> (And I'm not seeing the warning at all now, with the change I made: that
> doesn't tell us much, since what I have leaves out 2/2 entirely; but it
> does suggest that it's more important to follow up the crashes, and
> maybe when they are satisfactorily fixed, the warning will be fixed too.)
>
> Most crashes have been on that VM_BUG_ON(ci - si->cluster_info != idx)
> in alloc_cluster(). And when I poked around, it was usually (always?)
> the case that si->free_clusters was empty, so list_first_entry() not
> good at all. A few other crashes were GPFs, but I didn't pay much
> attention to them, thinking the alloc_cluster() one best to pursue.
>
> I reverted both patches from mm-everything, and had no problem.
> I added back 1/2 expecting it to be harmless ("no real function
> change in this patch"), but was surprised to get those same
> "expecting order 4 got 0" warnings and VM_BUG_ONs and GPFs:
> so have spent most time trying to get 1/2 working.
>
> This patch on top of 1/2, restoring when cluster_is_free(ci) can
> be seen to change, appears to have eliminated all those problems:
>
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -418,6 +418,7 @@ static struct swap_cluster_info *alloc_c
>
> VM_BUG_ON(ci - si->cluster_info != idx);
> list_del(&ci->list);
> + ci->state = CLUSTER_STATE_PER_CPU;
> ci->count = 0;
> return ci;
> }
> @@ -543,10 +544,6 @@ new_cluster:
> if (tmp == SWAP_NEXT_INVALID) {
> if (!list_empty(&si->free_clusters)) {
> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> - list_del(&ci->list);
> - spin_lock(&ci->lock);
> - ci->state = CLUSTER_STATE_PER_CPU;
> - spin_unlock(&ci->lock);
> tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> } else if (!list_empty(&si->discard_clusters)) {
> /*
>
Thanks for the nice bug report. That is my bad.

Both you and Ying have pointed out the critical bug here: the cluster is
removed from the free list inside try_ssd(), and when a conflict()
failure is then followed by alloc_cluster(), the allocation path can
remove the same cluster from the list a second time. That is the path I
hadn't considered well.

Attempting the allocation in try_ssd(), while a possible conflict can
still force the dance through alloc_cluster(), makes things very
complicated. In try_ssd(), since we already hold the cluster lock, can
we just perform the actual allocation with the lock held? With the
cluster lock protection there should be no conflict, right?
Chris
> Delighted to have made progress after many attempts, I went to apply 2/2
> on top, but found that it builds upon those scan_swap_map_try_ssd_cluster()
> changes I've undone. I gave up at that point and hand back to you, Chris,
> hoping that you will understand scan_swap_map_ssd_cluster_conflict() etc
> much better than I ever shall!
>
> Clarifications on my load: all swapping to SSD, but discard not enabled;
> /sys/kernel/mm/transparent_hugepage/ enabled always, shmem_enabled force,
> hugepages-64kB/enabled always, hugepages-64kB/shmem_enabled always;
> swapoff between iterations, did not appear relevant to problems; x86_64.
>
> Hugh
>
> >
> > > [ 1013.540622] WARNING: CPU: 3 PID: 104 at mm/swapfile.c:600 scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.544460] Modules linked in:
> > > [ 1013.545411] CPU: 3 PID: 104 Comm: mthp_swpout_tes Not tainted 6.10.0-rc3-ga12328d9fb85-dirty #285
> > > [ 1013.545990] Hardware name: linux,dummy-virt (DT)
> > > [ 1013.546585] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> > > [ 1013.547136] pc : scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.547768] lr : scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.548263] sp : ffff8000863e32e0
> > > [ 1013.548723] x29: ffff8000863e32e0 x28: 0000000000000670 x27: 0000000000000660
> > > [ 1013.549626] x26: 0000000000000010 x25: ffff0000c1692108 x24: ffff0000c27c4800
> > > [ 1013.550470] x23: 2e8ba2e8ba2e8ba3 x22: fffffdffbf7df2c0 x21: ffff0000c27c48b0
> > > [ 1013.551285] x20: ffff800083a946d0 x19: 0000000000000004 x18: ffffffffffffffff
> > > [ 1013.552263] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800084b13568
> > > [ 1013.553292] x14: ffffffffffffffff x13: ffff800084b13566 x12: 6e69746365707865
> > > [ 1013.554423] x11: fffffffffffe0000 x10: ffff800083b18b68 x9 : ffff80008014c874
> > > [ 1013.555231] x8 : 00000000ffffefff x7 : ffff800083b16318 x6 : 0000000000002850
> > > [ 1013.555965] x5 : 40000000fffff1ae x4 : 0000000000000fff x3 : 0000000000000000
> > > [ 1013.556779] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c24a1bc0
> > > [ 1013.557627] Call trace:
> > > [ 1013.557960] scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.558498] get_swap_pages+0x23c/0xc20
> > > [ 1013.558899] folio_alloc_swap+0x5c/0x248
> > > [ 1013.559544] add_to_swap+0x40/0xf0
> > > [ 1013.559904] shrink_folio_list+0x6dc/0xf20
> > > [ 1013.560289] reclaim_folio_list+0x8c/0x168
> > > [ 1013.560710] reclaim_pages+0xfc/0x178
> > > [ 1013.561079] madvise_cold_or_pageout_pte_range+0x8d8/0xf28
> > > [ 1013.561524] walk_pgd_range+0x390/0x808
> > > [ 1013.561920] __walk_page_range+0x1e0/0x1f0
> > > [ 1013.562370] walk_page_range+0x1f0/0x2c8
> > > [ 1013.562888] madvise_pageout+0xf8/0x280
> > > [ 1013.563388] madvise_vma_behavior+0x314/0xa20
> > > [ 1013.563982] madvise_walk_vmas+0xc0/0x128
> > > [ 1013.564386] do_madvise.part.0+0x110/0x558
> > > [ 1013.564792] __arm64_sys_madvise+0x68/0x88
> > > [ 1013.565333] invoke_syscall+0x50/0x128
> > > [ 1013.565737] el0_svc_common.constprop.0+0x48/0xf8
> > > [ 1013.566285] do_el0_svc+0x28/0x40
> > > [ 1013.566667] el0_svc+0x50/0x150
> > > [ 1013.567094] el0t_64_sync_handler+0x13c/0x158
> > > [ 1013.567501] el0t_64_sync+0x1a4/0x1a8
> > > [ 1013.568058] irq event stamp: 0
> > > [ 1013.568661] hardirqs last enabled at (0): [<0000000000000000>] 0x0
> > > [ 1013.569560] hardirqs last disabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > > [ 1013.570167] softirqs last enabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > > [ 1013.570846] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > > [ 1013.571330] ---[ end trace 0000000000000000 ]---