Message-ID: <aPc3lmbJEVTXoV6h@yjaykim-PowerEdge-T330>
Date: Tue, 21 Oct 2025 16:34:46 +0900
From: YoungJun Park <youngjun.park@....com>
To: Kairui Song <ryncsn@...il.com>
Cc: Chris Li <chrisl@...nel.org>, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>,
Kemeng Shi <shikemeng@...weicloud.com>,
Nhat Pham <nphamcs@...il.com>, Baoquan He <bhe@...hat.com>,
Barry Song <baohua@...nel.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
David Hildenbrand <david@...hat.com>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
Ying Huang <ying.huang@...ux.alibaba.com>,
linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during
allocation
> > Thanks, I was composing a reply on this and just saw your new comment.
> > I agree with this.
>
> Hmm, it turns out modifying V1 to handle non-order-0 allocation
> failure also has some minor issues. Every mTHP swap allocation failure
> will have a slightly higher overhead due to the discard check. V1 is
> fine since it only checks discard for order 0, and order-0 alloc
> failure is uncommon and usually means OOM already.
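To restate that V1 policy as code, a hypothetical one-liner (the
function name is mine, not from the patch):

/* Hypothetical distillation of the V1 behavior described above: only
 * an order-0 allocation failure justifies the synchronous discard
 * check; mTHP (order > 0) fallback is common and should stay cheap. */
static bool swap_should_try_discard(int order)
{
	return order == 0;
}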
Looking at the originally proposed patch:
+	spin_lock(&swap_avail_lock);
+	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
+		spin_unlock(&swap_avail_lock);
+		if (get_swap_device_info(si)) {
+			if (si->flags & SWP_PAGE_DISCARD)
+				ret = swap_do_scheduled_discard(si);
+			put_swap_device(si);
+		}
+		if (ret)
+			break;

If ret is true and we break here, wouldn't that cause the final
spin_unlock() below to run without the lock being held?

+		spin_lock(&swap_avail_lock);
+	}
+	spin_unlock(&swap_avail_lock); <- unlocked without holding the lock.
+
+	return ret;
+}
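One way to keep the lock balanced, as a minimal sketch (the function
name and surrounding declarations are assumptions of mine; the helpers
are the ones from the quoted patch): return directly once ret is set,
since the lock has already been dropped at that point.

/* Hypothetical rework of the loop above, sketched only to show the
 * lock balance; not a tested implementation. */
static bool swap_sync_discard(int nid)
{
	struct swap_info_struct *si, *next;
	bool ret = false;

	spin_lock(&swap_avail_lock);
	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
				  avail_lists[nid]) {
		spin_unlock(&swap_avail_lock);
		if (get_swap_device_info(si)) {
			if (si->flags & SWP_PAGE_DISCARD)
				ret = swap_do_scheduled_discard(si);
			put_swap_device(si);
		}
		if (ret)
			return ret;	/* lock already dropped, no unlock */
		spin_lock(&swap_avail_lock);
	}
	spin_unlock(&swap_avail_lock);

	return ret;
}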
> I'm not saying V1 is the final solution, but I think maybe we can just
> keep V1 as it is? That's easier for a stable backport too, and it is
> doing far better than what we had before. The sync discard was added
> in 2013, and the percpu cluster added later that same year never
> treated it carefully. And the discard during allocation has been kind
> of broken for a while after the recent swap allocator rework.
>
> To optimize it further in a clean way, we have to reverse the
> allocator's handling order of the plist and fast / slow path. Current
> order is local_lock -> fast -> slow (plist).
> We can walk the plist first, then do the fast / slow path: plist (or
> maybe something faster than plist but handles the priority) ->
> local_lock -> fast -> slow (bonus: this is more friendly to RT kernels
I think the idea is good, but when approaching it that way,
I am curious about how rotation would be handled.
In the current code, rotation is always done while traversing the plist
in the slow path. If we traverse the plist first, how should rotation
be handled?
1. Do a naive rotation at plist traversal time.
   (But then the fast path might allocate from an si we didn't select.)
2. Rotate when allocating in the slow path (sketched below).
   (But in the window after swap_avail_lock is released, we might
   allocate from an si that hasn't been rotated yet.)
Both cases could break the rotation behavior. What do you think?
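To make option 2 concrete, a rough sketch; plist_requeue() and
plist_for_each_entry_safe() are the existing plist primitives, while
cluster_alloc_swap_entry() stands in for whatever the slow path would
use and is only an assumption here:

	/* Fragment: si, next, found, nid, order declared by the caller. */
	spin_lock(&swap_avail_lock);
	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
				  avail_lists[nid]) {
		spin_unlock(&swap_avail_lock);
		found = cluster_alloc_swap_entry(si, order);	/* assumed helper */
		spin_lock(&swap_avail_lock);
		if (found) {
			/*
			 * Rotate only the device we allocated from. While
			 * swap_avail_lock was dropped, another CPU may have
			 * allocated from an si that has not been rotated
			 * yet: exactly the race described above.
			 */
			plist_requeue(&si->avail_lists[nid],
				      &swap_avail_heads[nid]);
			break;
		}
	}
	spin_unlock(&swap_avail_lock);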
> too I think). That way we don't need to rewalk the plist after
> releasing the local_lock and save a lot of trouble. I remember I
> discussed with Youngjun on this sometime ago in the mail list, I know
Recapping your earlier idea: cache only the swap device per cgroup in
percpu, and keep the cluster inside the swap device. Applied to swap
tiers, that becomes: cache only the si per tier in percpu, and keep the
cluster in the swap device. This seems to fit well with your previous
suggestion. However, since we have shifted from per-cgroup swap
priority to swap tiers and will re-submit the RFC for swap tiers, we
will need to revisit that discussion.
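To pin the shape down, a bare sketch; every name below is hypothetical,
just to illustrate "si per tier in percpu, cluster in the device":

#define NR_SWAP_TIERS	4	/* hypothetical tier count */

/* Percpu side: only a per-tier device hint, no cluster state. */
struct percpu_swap_tier_cache {
	struct swap_info_struct *si[NR_SWAP_TIERS];
};

/* The cluster cache stays inside the swap device itself, e.g.: */
struct swap_info_struct {
	/* ... */
	struct percpu_cluster __percpu *percpu_cluster;
};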
Youngjun Park