[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMgjq7BJcxGzrnr+EeO6_ZC7dAn0_WmWn8DX8gSPfyYiY4S3Ug@mail.gmail.com>
Date: Wed, 15 Oct 2025 14:24:01 +0800
From: Kairui Song <ryncsn@...il.com>
To: Chris Li <chrisl@...nel.org>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Kemeng Shi <shikemeng@...weicloud.com>, Nhat Pham <nphamcs@...il.com>,
Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, David Hildenbrand <david@...hat.com>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>, Ying Huang <ying.huang@...ux.alibaba.com>,
linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
On Wed, Oct 15, 2025 at 12:00 PM Chris Li <chrisl@...nel.org> wrote:
>
> On Tue, Oct 14, 2025 at 2:27 PM Chris Li <chrisl@...nel.org> wrote:
> >
> > On Sun, Oct 12, 2025 at 9:49 AM Kairui Song <ryncsn@...il.com> wrote:
> > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > index cb2392ed8e0e..0d1924f6f495 100644
> > > > > --- a/mm/swapfile.c
> > > > > +++ b/mm/swapfile.c
> > > > > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > > > > goto done;
> > > > > }
> > > > >
> > > > > - /*
> > > > > - * We don't have free cluster but have some clusters in discarding,
> > > > > - * do discard now and reclaim them.
> > > > > - */
> > > > > - if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > > > - goto new_cluster;
> > > >
> > > > Assume you follow my suggestion.
> > > > Change this to some function to detect if there is a pending discard
> > > > on this device. Return to the caller indicating that you need a
> > > > discard for this device that has a pending discard.
> > > > Add an output argument to indicate the discard device "discard" if needed.
> > >
> > > The problem I just realized is that, if we just bail out here, we are
> > > forbidding order 0 to steal if there is any discarding cluster. We
> > > just return here to let the caller handle the discard outside
> > > the lock.
> >
> > Oh, yes, there might be a bit of change in behavior. However I can't
> > see it is such a bad thing if we wait for the pending discard to
> > complete before stealing and fragmenting the existing folio list. We
> > will have less fragments compared to the original result. Again, my
> > point is not that we always keep 100% the old behavior, then there is
> > no room for improvement.
> >
> > My point is that, are we doing the best we can in that situation,
> > regardless how unlikely it is.
> >
> > >
> > > It may just discard the cluster just fine, then retry from free clusters.
> > > Then everything is fine, that's the easy part.
> >
> > Ack.
> >
> > > But it might also fail, and interestingly, in the failure case we need
> >
> > Can you spell out the failure case you have in mind? Do you mean the
> > discard did happen but another thread stole "the recently discarded
> > then became free cluster"?
> >
> > Anyway, in such a case, the swap allocator should continue and find
> > out we don't have things to discard now, it will continue to the
> > "steal from other order > 0 list".
> >
> > > to try again as well. It might fail with a race with another discard,
> > > in that case order 0 steal is still feasible. Or it fail with
> > > get_swap_device_info (we have to release the device to return here),
> > > in that case we should go back to the plist and try other devices.
> >
> > When stealing from the other order >0 list failed, we should try
> > another device in the plist.
> >
> > >
> > > This is doable but seems kind of fragile, we'll have something like
> > > this in the folio_alloc_swap function:
> > >
> > > local_lock(&percpu_swap_cluster.lock);
> > > if (!swap_alloc_fast(&entry, order))
> > > swap_alloc_slow(&entry, order, &discard_si);
> > > local_unlock(&percpu_swap_cluster.lock);
> > >
> > > +if (discard_si) {
> >
> > I feel the discard logic should be inside the swap_alloc_slow().
> > There is a plist_for_each_entry_safe(), inside that loop to do the
> > discard and retry().
> > If I previously suggested it change in here, sorry I have changed my
> > mind after reasoning the code a bit more.
>
> Actually now I have given it a bit more thought, one thing I realized
> is that you might need to hold the percpu_swap_cluster lock all the
> time during allocation. That might force you to do the release lock
> and discard in the current position.
>
> If that is the case, then just making the small change in your patch
> to allow hold waiting to discard before trying the fragmentation list
> might be good enough.
>
> Chris
>
Thanks, I was composing a reply on this and just saw your new comment.
I agree with this.
Powered by blists - more mailing lists