Message-ID: <CACePvbWn1pAJg8xGfJFrWxpkJDakJA0GfgJc0W80EUYS6YhBAg@mail.gmail.com>
Date: Mon, 20 Oct 2025 23:48:20 -0700
From: Chris Li <chrisl@...nel.org>
To: Kairui Song <ryncsn@...il.com>
Cc: YoungJun Park <youngjun.park@....com>, linux-mm@...ck.org, 
	Andrew Morton <akpm@...ux-foundation.org>, Kemeng Shi <shikemeng@...weicloud.com>, 
	Nhat Pham <nphamcs@...il.com>, Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>, 
	Baolin Wang <baolin.wang@...ux.alibaba.com>, David Hildenbrand <david@...hat.com>, 
	"Matthew Wilcox (Oracle)" <willy@...radead.org>, Ying Huang <ying.huang@...ux.alibaba.com>, 
	linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation

On Wed, Oct 15, 2025 at 9:46 AM Kairui Song <ryncsn@...il.com> wrote:
>
> On Wed, Oct 15, 2025 at 2:24 PM Kairui Song <ryncsn@...il.com> wrote:
> >
> > On Wed, Oct 15, 2025 at 12:00 PM Chris Li <chrisl@...nel.org> wrote:
> > >
> > > On Tue, Oct 14, 2025 at 2:27 PM Chris Li <chrisl@...nel.org> wrote:
> > > >
> > > > On Sun, Oct 12, 2025 at 9:49 AM Kairui Song <ryncsn@...il.com> wrote:
> > > > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > > > index cb2392ed8e0e..0d1924f6f495 100644
> > > > > > > --- a/mm/swapfile.c
> > > > > > > +++ b/mm/swapfile.c
> > > > > > > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > > > > > >                         goto done;
> > > > > > >         }
> > > > > > >
> > > > > > > -       /*
> > > > > > > -        * We don't have free cluster but have some clusters in discarding,
> > > > > > > -        * do discard now and reclaim them.
> > > > > > > -        */
> > > > > > > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > > > > > -               goto new_cluster;
> > > > > >
> > > > > > Assuming you follow my suggestion:
> > > > > > change this to some function that detects whether there is a pending
> > > > > > discard on this device, and return to the caller indicating that this
> > > > > > device needs a discard. Add an output argument to report the device
> > > > > > that needs the discard ("discard") if needed.
> > > > >
> > > > > The problem I just realized is that if we just bail out here, we are
> > > > > forbidding order 0 from stealing whenever there is any discarding
> > > > > cluster; we just return here to let the caller handle the discard
> > > > > outside the lock.
> > > >
> > > > Oh, yes, there might be a bit of a change in behavior. However, I can't
> > > > see how it is such a bad thing if we wait for the pending discard to
> > > > complete before stealing and fragmenting the existing folio list. We
> > > > will have fewer fragments compared to the original result. Again, my
> > > > point is not that we always keep 100% of the old behavior; then there
> > > > would be no room for improvement.
> > > >
> > > > My point is: are we doing the best we can in that situation,
> > > > regardless of how unlikely it is?
> > > >
> > > > >
> > > > > It may discard the cluster just fine, then retry from free clusters.
> > > > > Then everything is fine; that's the easy part.
> > > >
> > > > Ack.
> > > >
> > > > > But it might also fail, and interestingly, in the failure case we need
> > > >
> > > > Can you spell out the failure case you have in mind? Do you mean the
> > > > discard did happen but another thread stole the recently discarded,
> > > > now-free cluster?
> > > >
> > > > Anyway, in such a case the swap allocator should continue, find out
> > > > that we have nothing to discard now, and continue to the "steal from
> > > > other order > 0 lists" step.
> > > >
> > > > > to try again as well. It might fail due to a race with another discard,
> > > > > in which case order 0 stealing is still feasible. Or it might fail in
> > > > > get_swap_device_info (we have to release the device to return here),
> > > > > in which case we should go back to the plist and try other devices.
> > > >
> > > > When stealing from the other order > 0 lists fails, we should try
> > > > another device in the plist.
> > > >
> > > > >
> > > > > This is doable but seems kind of fragile; we'll have something like
> > > > > this in the folio_alloc_swap function:
> > > > >
> > > > > local_lock(&percpu_swap_cluster.lock);
> > > > > if (!swap_alloc_fast(&entry, order))
> > > > >     swap_alloc_slow(&entry, order, &discard_si);
> > > > > local_unlock(&percpu_swap_cluster.lock);
> > > > >
> > > > > +if (discard_si) {
> > > >
> > > > I feel the discard logic should be inside swap_alloc_slow().
> > > > There is a plist_for_each_entry_safe() there; inside that loop we can
> > > > do the discard and retry.
> > > > If I previously suggested changing it here, sorry, I have changed my
> > > > mind after reasoning about the code a bit more.
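Roughly what I had in mind there, as a sketch only (a fragment, not actual code; the plist head and member names are approximate and the locking is omitted):

	/*
	 * Sketch only, not actual code: inside swap_alloc_slow()'s plist walk,
	 * if allocation failed and the device has pending discards, do the
	 * discard and retry that same device before moving to the next one.
	 * The plist head / member names here are approximate.
	 */
	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_lists[node]) {
		if (!get_swap_device_info(si))
			continue;
		offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
		if (!offset && (si->flags & SWP_PAGE_DISCARD) &&
		    swap_do_scheduled_discard(si))
			/* the discard freed clusters; retry this device once */
			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
		put_swap_device(si);
		if (offset)
			break;
	}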
> > >
> > > Actually, now that I have given it a bit more thought, one thing I
> > > realized is that you might need to hold the percpu_swap_cluster lock
> > > the whole time during allocation. That might force you to release the
> > > lock and do the discard in the current position.
> > >
> > > If that is the case, then just making the small change in your patch
> > > to wait for the pending discard before trying the fragmentation list
> > > might be good enough.
> > >
> > > Chris
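To illustrate the "discard in the current position" idea at the folio_alloc_swap() level, a rough sketch only (a fragment; the discard_si output argument and the retry label are hypothetical, not from the patch):

	/*
	 * Hypothetical sketch, not the actual patch: drop the percpu lock,
	 * perform the scheduled discard for the device reported by
	 * swap_alloc_slow(), then retry the whole allocation.
	 */
again:
	local_lock(&percpu_swap_cluster.lock);
	if (!swap_alloc_fast(&entry, order))
		swap_alloc_slow(&entry, order, &discard_si);
	local_unlock(&percpu_swap_cluster.lock);

	if (!entry.val && discard_si) {
		/* discard outside the lock; a lost race just means we retry */
		if (get_swap_device_info(discard_si)) {
			swap_do_scheduled_discard(discard_si);
			put_swap_device(discard_si);
		}
		discard_si = NULL;
		goto again;
	}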
> > >
> >
> > Thanks, I was composing a reply to this and just saw your new comment.
> > I agree with this.
>
> Hmm, it turns out modifying V1 to handle non-order-0 allocation
> failure also has some minor issues. Every mTHP swap allocation failure
> would have slightly higher overhead due to the discard check. V1 is
> fine since it only checks discard for order 0, and order 0 allocation
> failure is uncommon and usually means OOM already.
>
> I'm not saying V1 is the final solution, but I think maybe we can just
> keep V1 as it is? That's easier for a stable backport too, and this is

I am fine with that, assuming you adjust the presentation to push V1 as
a hotfix. I can ack your newer patch with the adjusted presentation.

> doing far better than what it was like before. The sync discard was added
> in 2013, and the percpu cluster added later that same year never treated
> it carefully. And discard during allocation has been kind of broken for a
> while after the recent swap allocator rework.
>
> To optimize it further in a clean way, we have to reverse the
> allocator's handling order of the plist and the fast / slow path. The
> current order is local_lock -> fast -> slow (plist).

I like that. I think that is the eventual way to go. I want to see how
it integrates with the swap.tiers patches. If you let me pick, I would
go straight with this one for 6.19.

>
> We can walk the plist first, then do the fast / slow path: plist (or
> maybe something faster than plist that still handles the priority) ->
> local_lock -> fast -> slow (bonus: this is more friendly to RT kernels
> too, I think). That way we don't need to rewalk the plist after
> releasing the local_lock, which saves a lot of trouble. I remember I
> discussed this with Youngjun some time ago on the mailing list; I know
> things have changed a lot, but some ideas still seem similar. I think
> his series moves the percpu cluster into each device (si); we can also
> move the local_lock there, which is just what I'm talking about here.

Ack. We will need to see both patches to figure out how to integrate
them together. Right now we have two moving parts. All the more reason
to get the eventual patch sooner.
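If I read the proposal right, the per-device flow would roughly become something like the sketch below (illustration only; alloc_fast()/alloc_slow() and the per-si lock field are placeholders, not real functions, and the plist names are approximate):

	/*
	 * Sketch of the proposed ordering only; helper and field names are
	 * placeholders. Walk the plist by priority first, then take the
	 * per-device local lock and do the fast / slow allocation.
	 */
	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_lists[node]) {
		if (!get_swap_device_info(si))
			continue;
		local_lock(&si->percpu_cluster_lock);	/* hypothetical per-si lock */
		found = alloc_fast(si, &entry, order) ||	/* placeholder helpers */
			alloc_slow(si, &entry, order);
		local_unlock(&si->percpu_cluster_lock);
		put_swap_device(si);
		if (found)
			break;
	}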

Chris
