Message-ID: <CAMgjq7Bu3PmDzrEptRUifzZqkQX1cZQ-2DQjjmvQiknJvtGGPQ@mail.gmail.com>
Date: Wed, 6 Aug 2025 11:02:43 +0800
From: Kairui Song <ryncsn@...il.com>
To: Chris Li <chrisl@...nel.org>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Kemeng Shi <shikemeng@...weicloud.com>, Nhat Pham <nphamcs@...il.com>,
Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>,
"Huang, Ying" <ying.huang@...ux.alibaba.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] mm, swap: don't scan every fragment cluster
Chris Li <chrisl@...nel.org> wrote on Wed, Aug 6, 2025 at 07:30:
>
> Looks good to me, with minor nitpicks on the commit messages and comments.
>
> Let me know whether you will send a refreshed version.
I'll send a V2 to improve the series. I don't think any code change is
needed; only the change log needs to be improved.
> Nit: I suggest the patch title use positive terms, something along the lines of:
> "Only scan one cluster in fragment list"
> "Don't scan" seems to describe what the patch does not do rather than
> what the patch does.
Good idea.
>
> On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@...il.com> wrote:
> >
> > From: Kairui Song <kasong@...cent.com>
> >
> > Fragment clusters were mostly failing high order allocation already.
> > The reason we scan them now is that a swap slot may get freed without
> > releasing the swap cache, so a swap map entry will end up in a
> > HAS_CACHE-only state, and the cluster won't be moved back to the
> > non-full or free cluster list.
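
(For illustration, roughly what such a "pinned" cluster looks like. This
is a hypothetical helper, not code from the patch; it only assumes the
existing swap_map, SWAP_HAS_CACHE and SWAPFILE_CLUSTER definitions.)

/*
 * Hypothetical helper, illustration only: return true if no slot in
 * the cluster holds a real swap count, i.e. every used slot is pinned
 * solely by the swap cache (SWAP_HAS_CACHE). Such a cluster can't
 * serve allocations directly, yet it isn't free either, so it lingers
 * on the frag list until the swap cache is reclaimed.
 */
static bool cluster_pinned_by_swap_cache(struct swap_info_struct *si,
					 unsigned long start)
{
	unsigned long off;

	for (off = start; off < start + SWAPFILE_CLUSTER; off++) {
		unsigned char count = si->swap_map[off];

		if (count && count != SWAP_HAS_CACHE)
			return false;	/* a real swap count is still held */
	}
	return true;	/* only the swap cache pins these slots */
}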
> >
> > Usually this only happens for !SWP_SYNCHRONOUS_IO devices when the swap
>
> Nit: Please clarify what "this" refers to here. I assume it means scanning the fragment lists.
> From the context it could almost be read as "the map entry will end up in HAS_CACHE".
Yes.
>
>
> > device usage is low (!vm_swap_full()) since swap will try to lazy free
> > the swap cache.
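
(Side note: a rough sketch of the lazy-free policy referred to above,
simplified from the swapin fault path; this is not the exact code, the
real check also involves memcg and mapping state:)

	/*
	 * After swapin: SWP_SYNCHRONOUS_IO devices bypass the swap
	 * cache entirely, so nothing is left pinned. For other devices
	 * the folio stays in the swap cache, and the slot is only
	 * freed eagerly once swap is getting full; otherwise the
	 * HAS_CACHE-only slot lingers until reclaimed later.
	 */
	if (folio_test_swapcache(folio) && vm_swap_full())
		folio_free_swap(folio);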
> >
> > It's unlikely to cause any real issue. Fragmentation is only an issue
> > when the device is getting full, and by that time, swap will already
> > be releasing the swap cache aggressively. Swap cache reclaim also
> > happens when the allocator scans a cluster. Scanning one fragment
> > cluster should be good enough to reclaim these pinned slots.
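
(The reclaim-during-scan part, roughly: when the allocator scans a
cluster and finds a slot that is HAS_CACHE-only, it can drop the swap
cache folio and turn the slot back into a free one, e.g. via
__try_to_reclaim_swap() in mm/swapfile.c. Simplified sketch:)

	/* a HAS_CACHE-only slot can be reclaimed on the spot */
	if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE)
		__try_to_reclaim_swap(si, offset, TTRS_ANYWAY);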
> >
> > Besides, only high order allocation requires iterating over a
> > cluster list; order 0 allocation will succeed on the first attempt.
> > And high order allocation failure isn't a serious problem.
> >
> > So the gain from iterating the fragment clusters is trivial, but the
> > iteration slows down mTHP allocation by a lot when the fragment
> > cluster list is long. It's better to drop this fragment cluster
> > iteration design. Scanning only one fragment cluster is good enough
> > in case any cluster is stuck in the fragment list; this ensures order
> > 0 allocation never fails, and large allocations still have an
> > acceptable success rate.
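
(The swapfile.c hunk isn't quoted below, but conceptually the new
allocation path boils down to something like the following; the helper
names here are placeholders for illustration, not the real functions:)

	/*
	 * Take at most one cluster off the frag list and scan it,
	 * instead of iterating the whole (possibly very long) list.
	 */
	ci = frag_list_isolate_first(si, order);	/* placeholder name */
	if (ci) {
		found = scan_cluster_for_alloc(si, ci, order, usage);	/* placeholder name */
		if (found)
			goto done;
	}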
> >
> > Test on a 48c96t system: build the Linux kernel with make -j48,
> > defconfig, on top of tmpfs, using 10G ZRAM as swap, with a 768M
> > cgroup memory limit, 4K folios only:
> >
> > Before: sys time: 4407.28s
> > After: sys time: 4425.22s
> >
> > Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:
> >
> > Before: sys time: 10230.22s 64kB/swpout: 1793044 64kB/swpout_fallback: 17653
> > After: sys time: 5527.90s 64kB/swpout: 1789358 64kB/swpout_fallback: 17813
> >
> > Change to 8G ZRAM:
> >
> > Before: sys time: 21929.17s 64kB/swpout: 1634681 64kB/swpout_fallback: 173056
> > After: sys time: 6121.01s 64kB/swpout: 1638155 64kB/swpout_fallback: 189562
> >
> > Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:
> >
> > Before: sys time: 7368.41s 64kB/swpout: 1787599 swpout_fallback: 0
> > After: sys time: 7338.27s 64kB/swpout: 1783106 swpout_fallback: 0
> >
> > Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:
> >
> > Before: sys time: 28139.60s 64kB/swpout: 1645421 swpout_fallback: 148408
> > After: sys time: 8941.90s 64kB/swpout: 1592973 swpout_fallback: 265010
> >
> > The performance is a lot better, and the large order allocation
> > failure rate is only very slightly higher or unchanged.
> >
> > Signed-off-by: Kairui Song <kasong@...cent.com>
> > ---
> > include/linux/swap.h | 1 -
> > mm/swapfile.c | 30 ++++++++----------------------
> > 2 files changed, 8 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2fe6ed2cc3fd..a060d102e0d1 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -310,7 +310,6 @@ struct swap_info_struct {
> > /* list of cluster that contains at least one free slot */
> > struct list_head frag_clusters[SWAP_NR_ORDERS];
> > /* list of cluster that are fragmented or contented */
> > - atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
>
> Nit: please add a note in the commit log explaining why the
> frag_cluster_nr counter is removed.
> I feel this change can be split out from the main change of this
> patch. The main performance improvement comes from only scanning one
> fragment cluster rather than the full list, right? Deleting the
> counter helps, but by a much smaller amount.
Right, I can split this into two patches. Removing the counter has
basically no measurable performance effect; it's just no longer used
after this change.
>
> Chris
>