Message-ID: <CAKEwX=NSGbjG-bhjje4ga2n4xNFBdiFTZV8TRz+qSc_cvmxUJg@mail.gmail.com>
Date: Tue, 5 Aug 2025 10:08:16 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Kairui Song <kasong@...cent.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Kemeng Shi <shikemeng@...weicloud.com>, Chris Li <chrisl@...nel.org>,
Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>,
"Huang, Ying" <ying.huang@...ux.alibaba.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] mm, swap: don't scan every fragment cluster
On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@...il.com> wrote:
>
> From: Kairui Song <kasong@...cent.com>
>
> Fragment clusters were already mostly failing high-order allocations.
> The reason we still scan them is that a swap slot may get freed without
> releasing the swap cache, so a swap map entry can end up in a
> HAS_CACHE-only state, and its cluster won't be moved back to the
> non-full or free cluster list.
>
> Usually this only happens for !SWP_SYNCHRONOUS_IO devices when the swap
> device usage is low (!vm_swap_full()), since swap will try to lazily
> free the swap cache.
>
> It's unlikely to cause any real issue. Fragmentation only matters when
> the device is getting full, and by that time, swap will already be
> releasing the swap cache aggressively. Swap cache reclaim also happens
> when the allocator scans a cluster, so scanning one fragment cluster
> should be enough to reclaim these pinned slots.
>
> Besides, only high-order allocations require iterating over a cluster
> list; order 0 allocations succeed on the first attempt. And a high-order
> allocation failure isn't a serious problem.
>
> So iterating over fragment clusters brings only a trivial benefit, but
> it slows down mTHP allocation by a lot when the fragment cluster list is
> long. It's better to drop this fragment cluster iteration design.
> Scanning only one fragment cluster is good enough in case any cluster is
> stuck in the fragment list; this ensures order 0 allocations never fail,
> and large allocations still have an acceptable success rate.
>
> Tested on a 48c96t system, building the Linux kernel (defconfig) with
> make -j48 on top of tmpfs, using 10G ZRAM as swap, with a 768M cgroup
> memory limit, 4K folios only:
>
> Before: sys time: 4407.28s
> After: sys time: 4425.22s
>
> Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:
>
> Before: sys time: 10230.22s 64kB/swpout: 1793044 64kB/swpout_fallback: 17653
> After: sys time: 5527.90s 64kB/swpout: 1789358 64kB/swpout_fallback: 17813
>
> Change to 8G ZRAM:
>
> Before: sys time: 21929.17s 64kB/swpout: 1634681 64kB/swpout_fallback: 173056
> After: sys time: 6121.01s 64kB/swpout: 1638155 64kB/swpout_fallback: 189562
>
> Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:
>
> Before: sys time: 7368.41s 64kB/swpout: 1787599 64kB/swpout_fallback: 0
> After: sys time: 7338.27s 64kB/swpout: 1783106 64kB/swpout_fallback: 0
>
> Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:
>
> Before: sys time: 28139.60s 64kB/swpout: 1645421 64kB/swpout_fallback: 148408
> After: sys time: 8941.90s 64kB/swpout: 1592973 64kB/swpout_fallback: 265010
>
> Performance is much better, and the high-order allocation failure rate
> is only very slightly higher or unchanged.
>
> Signed-off-by: Kairui Song <kasong@...cent.com>
LGTM. I've been learning a lot about cluster-based allocation design
from your code for my vswap prototype. Thanks, Kairui!
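
For anyone else following the thread, here is a rough userspace sketch of
how I read the behavioral change (all names below are made up for
illustration; this is not the actual mm/swapfile.c code):

#include <stdbool.h>

struct toy_cluster {
	struct toy_cluster *next;	/* next cluster on the fragment list */
	bool has_free_range;		/* can satisfy this high-order request */
};

struct toy_swap_info {
	struct toy_cluster *frag_list;	/* fragment clusters for one order */
};

/*
 * Scanning a cluster can also reclaim HAS_CACHE-only slots as a side
 * effect; that part is modeled away here.
 */
static bool toy_scan_cluster(struct toy_cluster *ci)
{
	return ci->has_free_range;
}

/* Before: walk every fragment cluster until one attempt succeeds. */
static bool toy_alloc_high_order_before(struct toy_swap_info *si)
{
	struct toy_cluster *ci;

	for (ci = si->frag_list; ci; ci = ci->next)
		if (toy_scan_cluster(ci))
			return true;
	return false;
}

/*
 * After: scan only one fragment cluster. That is still enough to reclaim
 * slots pinned by a stuck cluster; on failure the caller simply falls
 * back to a smaller order.
 */
static bool toy_alloc_high_order_after(struct toy_swap_info *si)
{
	return si->frag_list && toy_scan_cluster(si->frag_list);
}

In other words, order 0 succeeds on its first attempt anyway, and a
high-order attempt now does a bounded amount of work on the fragment list
instead of walking the whole thing, which lines up with the sys time
numbers above.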
FWIW:
Acked-by: Nhat Pham <nphamcs@...il.com>