Message-ID: <CAMgjq7DT13YwgFvGHPAn7ibb5E6XQDRf0P1XoFjz24ekF11C4g@mail.gmail.com>
Date: Tue, 21 Oct 2025 16:44:35 +0800
From: Kairui Song <ryncsn@...il.com>
To: Chris Li <chrisl@...nel.org>
Cc: YoungJun Park <youngjun.park@....com>, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>, Kemeng Shi <shikemeng@...weicloud.com>,
Nhat Pham <nphamcs@...il.com>, Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, David Hildenbrand <david@...hat.com>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>, Ying Huang <ying.huang@...ux.alibaba.com>,
linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
On Tue, Oct 21, 2025 at 3:05 PM Chris Li <chrisl@...nel.org> wrote:
>
> On Wed, Oct 15, 2025 at 9:46 AM Kairui Song <ryncsn@...il.com> wrote:
> >
> > On Wed, Oct 15, 2025 at 2:24 PM Kairui Song <ryncsn@...il.com> wrote:
> > >
> > > On Wed, Oct 15, 2025 at 12:00 PM Chris Li <chrisl@...nel.org> wrote:
> > > >
> > > > On Tue, Oct 14, 2025 at 2:27 PM Chris Li <chrisl@...nel.org> wrote:
> > > > >
> > > > > On Sun, Oct 12, 2025 at 9:49 AM Kairui Song <ryncsn@...il.com> wrote:
> > > > > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > > > > index cb2392ed8e0e..0d1924f6f495 100644
> > > > > > > > --- a/mm/swapfile.c
> > > > > > > > +++ b/mm/swapfile.c
> > > > > > > > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > > > > > > > goto done;
> > > > > > > > }
> > > > > > > >
> > > > > > > > - /*
> > > > > > > > - * We don't have free cluster but have some clusters in discarding,
> > > > > > > > - * do discard now and reclaim them.
> > > > > > > > - */
> > > > > > > > - if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > > > > > > - goto new_cluster;
> > > > > > >
> > > > > > > Assume you follow my suggestion.
> > > > > > > Change this to some function that detects whether there is a pending
> > > > > > > discard on this device. Return to the caller, indicating that this
> > > > > > > device has a pending discard and needs one issued.
> > > > > > > Add an output argument to report the device to discard, if needed.
> > > > > >
> > > > > > The problem I just realized is that, if we just bail out here, we are
> > > > > > forbidding order 0 from stealing if there is any discarding cluster.
> > > > > > We just return here to let the caller handle the discard outside
> > > > > > the lock.
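For illustration, the bail-out inside cluster_alloc_swap_entry() could
look roughly like the sketch below; swap_has_pending_discard() and the
discard_si output argument are made-up names for the idea above, not
actual kernel code:

        /*
         * Hypothetical: instead of doing the discard synchronously
         * here, only report that this device has clusters pending
         * discard, and let the caller issue the discard and retry
         * outside the lock.
         */
        if ((si->flags & SWP_PAGE_DISCARD) && swap_has_pending_discard(si)) {
                *discard_si = si;
                goto done;
        }
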
> > > > >
> > > > > Oh, yes, there might be a bit of a change in behavior. However, I
> > > > > don't see it as such a bad thing if we wait for the pending discard
> > > > > to complete before stealing and fragmenting the existing folio list.
> > > > > We will have fewer fragments compared to the original result. Again,
> > > > > my point is not that we always keep 100% of the old behavior;
> > > > > otherwise there is no room for improvement.
> > > > >
> > > > > My point is: are we doing the best we can in that situation,
> > > > > regardless of how unlikely it is?
> > > > >
> > > > > >
> > > > > > It may discard the cluster just fine, then retry from the free
> > > > > > clusters. Then everything is fine; that's the easy part.
> > > > >
> > > > > Ack.
> > > > >
> > > > > > But it might also fail, and interestingly, in the failure case we need
> > > > >
> > > > > Can you spell out the failure case you have in mind? Do you mean the
> > > > > discard did happen, but another thread stole the "recently discarded,
> > > > > now free" cluster?
> > > > >
> > > > > Anyway, in such a case, the swap allocator should continue, find out
> > > > > we don't have anything to discard now, and move on to the "steal from
> > > > > other order > 0 lists" step.
> > > > >
> > > > > > to try again as well. It might fail due to a race with another
> > > > > > discard; in that case an order 0 steal is still feasible. Or it may
> > > > > > fail in get_swap_device_info (we have to release the device to
> > > > > > return here); in that case we should go back to the plist and try
> > > > > > other devices.
> > > > >
> > > > > When stealing from the other order > 0 lists fails, we should try
> > > > > another device in the plist.
> > > > >
> > > > > >
> > > > > > This is doable but seems kind of fragile; we'll have something like
> > > > > > this in the folio_alloc_swap function:
> > > > > >
> > > > > > local_lock(&percpu_swap_cluster.lock);
> > > > > > if (!swap_alloc_fast(&entry, order))
> > > > > >         swap_alloc_slow(&entry, order, &discard_si);
> > > > > > local_unlock(&percpu_swap_cluster.lock);
> > > > > >
> > > > > > +if (discard_si) {
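
For reference, a rough sketch of how that block could continue; the
discard_si output argument, the "again" label, and the exact use of
get_swap_device_info() / put_swap_device() here are assumptions for
illustration, not actual code:

        if (discard_si) {
                /*
                 * A cluster on this device is still being discarded:
                 * finish the pending discard outside the local lock,
                 * then retry the whole fast / slow path.  If
                 * get_swap_device_info() fails (device going away) or
                 * another thread raced with the discard, the retry
                 * simply falls back to the plist and other devices.
                 */
                if (get_swap_device_info(discard_si)) {
                        swap_do_scheduled_discard(discard_si);
                        put_swap_device(discard_si);
                }
                discard_si = NULL;
                goto again;     /* back to local_lock + fast / slow */
        }
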
> > > > >
> > > > > I feel the discard logic should be inside swap_alloc_slow(). There is
> > > > > a plist_for_each_entry_safe() there; we can do the discard and retry
> > > > > inside that loop.
> > > > > If I previously suggested changing it here, sorry, I have changed my
> > > > > mind after reasoning about the code a bit more.
> > > >
> > > > Actually, now that I have given it a bit more thought, one thing I
> > > > realized is that you might need to hold the percpu_swap_cluster lock
> > > > the whole time during allocation. That might force you to release the
> > > > lock and do the discard in the current position.
> > > >
> > > > If that is the case, then just making the small change in your patch
> > > > to allow waiting for the discard to complete before trying the
> > > > fragmentation list might be good enough.
> > > >
> > > > Chris
> > > >
> > >
> > > Thanks, I was composing a reply to this and just saw your new comment.
> > > I agree with this.
> >
> > Hmm, it turns out modifying V1 to handle non-order-0 allocation
> > failure also has some minor issues. Every mTHP swap allocation failure
> > will have a slightly higher overhead due to the discard check. V1 is
> > fine since it only checks discard for order 0, and order 0 allocation
> > failure is uncommon and usually means OOM already.
> >
> > I'm not saying V1 is the final solution, but I think maybe we can just
> > keep V1 as it is? That's easier for a stable backport too, and this is
>
> I am fine with that, though you'll need to adjust the presentation to
> push V1 as a hotfix. I can ack your newer patch with the adjusted
> presentation.

Thanks, I'll update it then.

> > doing far better than before. The sync discard was added in 2013, and
> > the percpu cluster added later that same year never treated it
> > carefully. And the discard during allocation has been kind of broken
> > for a while after the recent swap allocator rework.
> >
> > To optimize it further in a clean way, we have to reverse the
> > allocator's handling order of the plist and the fast / slow path. The
> > current order is: local_lock -> fast -> slow (plist).
>
> I like that. I think that is the eventual way to go. I want to see how
> it integrates with the swap.tiers patches. If it were up to me, I would
> go straight with this one for 6.19.
>
> >
> > We can walk the plist first, then do the fast / slow path: plist (or
> > maybe something faster than a plist that still handles the priority)
> > -> local_lock -> fast -> slow (bonus: I think this is also more
> > friendly to RT kernels). That way we don't need to rewalk the plist
> > after releasing the local_lock, which saves a lot of trouble. I
> > remember discussing this with Youngjun some time ago on the mailing
> > list; I know things have changed a lot, but some of the ideas still
> > seem similar. I think his series moves the percpu cluster into each
> > device (si); we can also move the local_lock there, which is just
> > what I'm talking about here.
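
For what it's worth, a very rough pseudocode sketch of that reordering;
all names here (swap_local_lock()/swap_local_unlock(), the avail list
members, and the si-taking swap_alloc_fast()/swap_alloc_slow()
signatures) are assumptions based on the discussion, not existing
interfaces:

        /* current:  local_lock -> fast path -> slow path (plist walk) */
        /* proposed: plist walk (priority) -> per-si lock -> fast -> slow */
        plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
                if (!get_swap_device_info(si))
                        continue;
                swap_local_lock(si);    /* made-up per-si local_lock */
                found = swap_alloc_fast(si, &entry, order) ||
                        swap_alloc_slow(si, &entry, order);
                swap_local_unlock(si);
                put_swap_device(si);
                if (found)
                        break;
        }
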
>
> Ack. We will need to see both patches to figure out how to integrate
> them together. Right now we have two moving parts. All the more reason
> to get the eventual patch sooner.

BTW, I found there are some minor cleanups needed, mostly trivial; I'll
include them in the next update, I think.