Message-ID: <CAMgjq7CELW_s5ok-2NHSFzK3SQKQKHB3VRLGnFaGGxe5c-eCvA@mail.gmail.com>
Date: Fri, 24 Oct 2025 12:00:27 +0800
From: Kairui Song <ryncsn@...il.com>
To: YoungJun Park <youngjun.park@....com>
Cc: Chris Li <chrisl@...nel.org>, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>, Kemeng Shi <shikemeng@...weicloud.com>,
Nhat Pham <nphamcs@...il.com>, Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, David Hildenbrand <david@...hat.com>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>, Ying Huang <ying.huang@...ux.alibaba.com>,
linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
On Tue, Oct 21, 2025 at 3:34 PM YoungJun Park <youngjun.park@....com> wrote:
>
> > > Thanks, I was composing a reply on this and just saw your new comment.
> > > I agree with this.
> >
> > Hmm, it turns out modifying V1 to handle non-order 0 allocation
> > failure also has some minor issues. Every mTHP SWAP allocation failure
> > will have slightly higher overhead due to the discard check. V1 is
> > fine since it only checks discard for order 0, and order 0 alloc
> > failure is uncommon and usually means OOM already.
>
> Looking at the originally proposed patch:
>
> + spin_lock(&swap_avail_lock);
> + plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
> + spin_unlock(&swap_avail_lock);
> + if (get_swap_device_info(si)) {
> + if (si->flags & SWP_PAGE_DISCARD)
> + ret = swap_do_scheduled_discard(si);
> + put_swap_device(si);
> + }
> + if (ret)
> + break;
>
> If ret is true and we break,
> wouldn’t that cause spin_unlock() to run without the lock being held?
Thanks for catching this! Right, I need to return directly instead of
breaking there. I've fixed that; see the sketch after the quoted hunk
below.
>
> + spin_lock(&swap_avail_lock);
> + }
> + spin_unlock(&swap_avail_lock); <- unlocked without holding the lock.
> +
> + return ret;
> +}
>
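For reference, the fixed version looks roughly like this (untested
sketch; the wrapper function name is made up for illustration, the loop
body is the hunk above with a direct return replacing the break):

static bool swap_sync_discard(void)
{
	bool ret = false;
	int nid = numa_node_id();
	struct swap_info_struct *si, *next;

	spin_lock(&swap_avail_lock);
	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
				  avail_lists[nid]) {
		/* Drop the list lock before touching the device. */
		spin_unlock(&swap_avail_lock);
		if (get_swap_device_info(si)) {
			if (si->flags & SWP_PAGE_DISCARD)
				ret = swap_do_scheduled_discard(si);
			put_swap_device(si);
		}
		/*
		 * swap_avail_lock is not held here, so return
		 * directly; a break would fall through to an
		 * unbalanced spin_unlock() below.
		 */
		if (ret)
			return true;
		spin_lock(&swap_avail_lock);
	}
	spin_unlock(&swap_avail_lock);

	return false;
}

Returning while the lock is dropped keeps the lock state balanced on
every exit path.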
> > I'm not saying V1 is the final solution, but I think maybe we can just
> > keep V1 as it is? That's easier for a stable backport too, and it
> > already does far better than before. The sync discard was added in
> > 2013, and the percpu cluster added later that same year never handled
> > it carefully. And discard during allocation has been somewhat broken
> > since the recent swap allocator rework.
> >
> > To optimize it further in a clean way, we have to reverse the
> > allocator's handling order of the plist and the fast / slow path. The
> > current order is local_lock -> fast -> slow (plist).
> > We can walk the plist first, then do the fast / slow path: plist (or
> > maybe something faster than plist that still handles priority) ->
> > local_lock -> fast -> slow (bonus: this is more friendly to RT
> > kernels).
>
> I think the idea is good, but when approaching it that way,
> I am curious about rotation handling.
>
> In the current code, rotation is always done when traversing the plist in the slow path.
> If we traverse the plist first, how should rotation be handled?
That's a very good question; things always get tricky when it comes to
the details...
> 1. Do a naive rotation at plist traversal time.
> (But then the fast path might allocate from an si we didn’t select.)
> 2. Rotate when allocating in the slow path.
> (But after releasing swap_avail_lock, we might access an si that wasn’t rotated.)
>
> Both cases could break rotation behavior — what do you think?
I think cluster-level rotation is better: it prevents things from
getting too fragmented and spreads the workload between devices in a
helpful way, but that's just my guess.
We can change the rotation behavior if testing shows some other
strategy is better.
Maybe we'll need something with a better design, like an alloc counter
for rotation. And if we walk the plist before the fast path, we may
need to optimize the plist lock too...
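Just to illustrate the counter idea, something very rough like this
(completely hypothetical: the alloc_count field, the batch constant,
and the helper do not exist; only plist_requeue() is the real API the
slow path already uses for rotation):

#define SWAP_ROTATE_BATCH	64	/* made-up threshold */

static void swap_alloc_count_rotate(struct swap_info_struct *si, int nid)
{
	/* Count successful allocations on this device. */
	if (atomic_inc_return(&si->alloc_count) < SWAP_ROTATE_BATCH)
		return;

	atomic_set(&si->alloc_count, 0);
	spin_lock(&swap_avail_lock);
	/* Requeue behind same-priority devices to spread the load. */
	plist_requeue(&si->avail_lists[nid], &swap_avail_heads[nid]);
	spin_unlock(&swap_avail_lock);
}

That way rotation no longer has to happen on every plist walk, and the
plist lock is only taken once per batch.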