Message-ID: <CAMgjq7CELW_s5ok-2NHSFzK3SQKQKHB3VRLGnFaGGxe5c-eCvA@mail.gmail.com>
Date: Fri, 24 Oct 2025 12:00:27 +0800
From: Kairui Song <ryncsn@...il.com>
To: YoungJun Park <youngjun.park@....com>
Cc: Chris Li <chrisl@...nel.org>, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>, Kemeng Shi <shikemeng@...weicloud.com>,
Nhat Pham <nphamcs@...il.com>, Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, David Hildenbrand <david@...hat.com>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>, Ying Huang <ying.huang@...ux.alibaba.com>,
linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
On Tue, Oct 21, 2025 at 3:34 PM YoungJun Park <youngjun.park@....com> wrote:
>
> > > Thanks, I was composing a reply on this and just saw your new comment.
> > > I agree with this.
> >
> > Hmm, it turns out modifying V1 to handle non-order 0 allocation
> > failure also has some minor issues. Every mTHP SWAP allocation failure
> > will have slightly higher overhead due to the discard check. V1 is
> > fine since it only checks discard for order 0, and order 0 alloc
> > failure is uncommon and usually means OOM already.
>
> Looking at the originally proposed patch:
>
> + spin_lock(&swap_avail_lock);
> + plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
> + spin_unlock(&swap_avail_lock);
> + if (get_swap_device_info(si)) {
> + if (si->flags & SWP_PAGE_DISCARD)
> + ret = swap_do_scheduled_discard(si);
> + put_swap_device(si);
> + }
> + if (ret)
> + break;
>
> If ret is true and we break,
> wouldn’t that cause spin_unlock() to run without the lock being held?
Thanks for catching this! Right, I need to return directly instead of
breaking there. I've fixed that; see the sketch after the quoted hunk
below.
>
> + spin_lock(&swap_avail_lock);
> + }
> + spin_unlock(&swap_avail_lock); <- unlocked without holding the lock.
> +
> + return ret;
> +}
>
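For reference, the fixed version looks roughly like this (untested
sketch; the wrapper function name is made up for illustration, the loop
body is the hunk above with a direct return replacing the break):

static bool swap_sync_discard(void)
{
	bool ret = false;
	int nid = numa_node_id();
	struct swap_info_struct *si, *next;

	spin_lock(&swap_avail_lock);
	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
				  avail_lists[nid]) {
		/* Drop the list lock before touching the device. */
		spin_unlock(&swap_avail_lock);
		if (get_swap_device_info(si)) {
			if (si->flags & SWP_PAGE_DISCARD)
				ret = swap_do_scheduled_discard(si);
			put_swap_device(si);
		}
		/*
		 * swap_avail_lock is not held here, so return
		 * directly; a break would fall through to an
		 * unbalanced spin_unlock() below.
		 */
		if (ret)
			return true;
		spin_lock(&swap_avail_lock);
	}
	spin_unlock(&swap_avail_lock);

	return false;
}

Returning while the lock is dropped keeps the lock state balanced on
every exit path.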
> > I'm not saying V1 is the final solution, but I think maybe we can just
> > keep V1 as it is? That's easier for a stable backport too, and it
> > already does far better than before. The sync discard was added in
> > 2013, and the percpu cluster added later that same year never handled
> > it carefully. And discard during allocation has been somewhat broken
> > since the recent swap allocator rework.
> >
> > To optimize it further in a clean way, we have to reverse the
> > allocator's handling order of the plist and the fast / slow path. The
> > current order is local_lock -> fast -> slow (plist).
> > We can walk the plist first, then do the fast / slow path: plist (or
> > maybe something faster than plist that still handles priority) ->
> > local_lock -> fast -> slow (bonus: this is more friendly to RT
> > kernels).
>
> I think the idea is good, but when approaching it that way,
> I am curious about rotation handling.
>
> In the current code, rotation is always done when traversing the plist in the slow path.
> If we traverse the plist first, how should rotation be handled?
That's a very good question; things always get tricky when it comes to
the details...
> 1. Do a naive rotation at plist traversal time.
> (But then the fast path might allocate from an si we didn’t select.)
> 2. Rotate when allocating in the slow path.
> (But after releasing swap_avail_lock, we might access an si that wasn’t rotated.)
>
> Both cases could break rotation behavior — what do you think?
I think cluster-level rotation is better: it prevents things from
getting too fragmented and spreads the workload between devices in a
helpful way, but that's just my guess.
We can change the rotation behavior if testing shows some other
strategy is better.
Maybe we'll need something with a better design, like an alloc counter
for rotation. And if we walk the plist before the fast path, we may
need to optimize the plist lock too...
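Just to illustrate the counter idea, something very rough like this
(completely hypothetical: the alloc_count field, the batch constant,
and the helper do not exist; only plist_requeue() is the real API the
slow path already uses for rotation):

#define SWAP_ROTATE_BATCH	64	/* made-up threshold */

static void swap_alloc_count_rotate(struct swap_info_struct *si, int nid)
{
	/* Count successful allocations on this device. */
	if (atomic_inc_return(&si->alloc_count) < SWAP_ROTATE_BATCH)
		return;

	atomic_set(&si->alloc_count, 0);
	spin_lock(&swap_avail_lock);
	/* Requeue behind same-priority devices to spread the load. */
	plist_requeue(&si->avail_lists[nid], &swap_avail_heads[nid]);
	spin_unlock(&swap_avail_lock);
}

That way rotation no longer has to happen on every plist walk, and the
plist lock is only taken once per batch.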