Message-ID: <CAKEwX=NSGbjG-bhjje4ga2n4xNFBdiFTZV8TRz+qSc_cvmxUJg@mail.gmail.com>
Date: Tue, 5 Aug 2025 10:08:16 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Kairui Song <kasong@...cent.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Kemeng Shi <shikemeng@...weicloud.com>, Chris Li <chrisl@...nel.org>,
Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>,
"Huang, Ying" <ying.huang@...ux.alibaba.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] mm, swap: don't scan every fragment cluster
On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@...il.com> wrote:
>
> From: Kairui Song <kasong@...cent.com>
>
> Fragment clusters were already mostly failing high-order allocations.
> The reason we still scan them is that a swap slot may get freed without
> releasing the swap cache, so a swap map entry can end up in a
> HAS_CACHE-only state, and its cluster won't be moved back to the
> non-full or free cluster list.
>
> Usually this only happens for !SWP_SYNCHRONOUS_IO devices when the swap
> device usage is low (!vm_swap_full()), since swap will try to lazily
> free the swap cache.
>
> It's unlikely to cause any real issue. Fragmentation only matters when
> the device is getting full, and by that time, swap will already be
> releasing the swap cache aggressively. Swap cache reclaim also happens
> when the allocator scans a cluster, so scanning one fragment cluster
> should be enough to reclaim these pinned slots.
>
> Besides, only high-order allocations require iterating over a cluster
> list; order 0 allocations succeed on the first attempt. And a high-order
> allocation failure isn't a serious problem.
>
> So iterating over fragment clusters brings only a trivial benefit, but
> it slows down mTHP allocation by a lot when the fragment cluster list is
> long. It's better to drop this fragment cluster iteration design.
> Scanning only one fragment cluster is good enough in case any cluster is
> stuck in the fragment list; this ensures order 0 allocations never fail,
> and large allocations still have an acceptable success rate.
>
> Tested on a 48c96t system, building the Linux kernel (defconfig) with
> make -j48 on top of tmpfs, using 10G ZRAM as swap, with a 768M cgroup
> memory limit, 4K folios only:
>
> Before: sys time: 4407.28s
> After: sys time: 4425.22s
>
> Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:
>
> Before: sys time: 10230.22s 64kB/swpout: 1793044 64kB/swpout_fallback: 17653
> After: sys time: 5527.90s 64kB/swpout: 1789358 64kB/swpout_fallback: 17813
>
> Change to 8G ZRAM:
>
> Before: sys time: 21929.17s 64kB/swpout: 1634681 64kB/swpout_fallback: 173056
> After: sys time: 6121.01s 64kB/swpout: 1638155 64kB/swpout_fallback: 189562
>
> Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:
>
> Before: sys time: 7368.41s 64kB/swpout: 1787599 64kB/swpout_fallback: 0
> After: sys time: 7338.27s 64kB/swpout: 1783106 64kB/swpout_fallback: 0
>
> Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:
>
> Before: sys time: 28139.60s 64kB/swpout: 1645421 64kB/swpout_fallback: 148408
> After: sys time: 8941.90s 64kB/swpout: 1592973 64kB/swpout_fallback: 265010
>
> Performance is much better, and the high-order allocation failure rate
> is only very slightly higher or unchanged.
>
> Signed-off-by: Kairui Song <kasong@...cent.com>
LGTM. I've been learning a lot about cluster-based allocation design
from your code for my vswap prototype. Thanks, Kairui!
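
For anyone else following the thread, here is a rough userspace sketch of
how I read the behavioral change (all names below are made up for
illustration; this is not the actual mm/swapfile.c code):

#include <stdbool.h>

struct toy_cluster {
	struct toy_cluster *next;	/* next cluster on the fragment list */
	bool has_free_range;		/* can satisfy this high-order request */
};

struct toy_swap_info {
	struct toy_cluster *frag_list;	/* fragment clusters for one order */
};

/*
 * Scanning a cluster can also reclaim HAS_CACHE-only slots as a side
 * effect; that part is modeled away here.
 */
static bool toy_scan_cluster(struct toy_cluster *ci)
{
	return ci->has_free_range;
}

/* Before: walk every fragment cluster until one attempt succeeds. */
static bool toy_alloc_high_order_before(struct toy_swap_info *si)
{
	struct toy_cluster *ci;

	for (ci = si->frag_list; ci; ci = ci->next)
		if (toy_scan_cluster(ci))
			return true;
	return false;
}

/*
 * After: scan only one fragment cluster. That is still enough to reclaim
 * slots pinned by a stuck cluster; on failure the caller simply falls
 * back to a smaller order.
 */
static bool toy_alloc_high_order_after(struct toy_swap_info *si)
{
	return si->frag_list && toy_scan_cluster(si->frag_list);
}

In other words, order 0 succeeds on its first attempt anyway, and a
high-order attempt now does a bounded amount of work on the fragment list
instead of walking the whole thing, which lines up with the sys time
numbers above.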
FWIW:
Acked-by: Nhat Pham <nphamcs@...il.com>