Message-ID: <20250129105920.7a4bffa1@fangorn>
Date: Wed, 29 Jan 2025 10:59:20 -0500
From: Rik van Riel <riel@...riel.com>
To: Qi Zheng <zhengqi.arch@...edance.com>
Cc: Peter Zijlstra <peterz@...radead.org>, David Hildenbrand
 <david@...hat.com>, kernel test robot <oliver.sang@...el.com>,
 oe-lkp@...ts.linux.dev, lkp@...el.com, linux-kernel@...r.kernel.org, Andrew
 Morton <akpm@...ux-foundation.org>, Dave Hansen
 <dave.hansen@...ux.intel.com>, Andy Lutomirski <luto@...nel.org>, Catalin
 Marinas <catalin.marinas@....com>, David Rientjes <rientjes@...gle.com>,
 Hugh Dickins <hughd@...gle.com>, Jann Horn <jannh@...gle.com>, Lorenzo
 Stoakes <lorenzo.stoakes@...cle.com>, Matthew Wilcox <willy@...radead.org>,
 Mel Gorman <mgorman@...e.de>, Muchun Song <muchun.song@...ux.dev>, Peter Xu
 <peterx@...hat.com>, Will Deacon <will@...nel.org>, Zach O'Keefe
 <zokeefe@...gle.com>, Dan Carpenter <dan.carpenter@...aro.org>, "Paul E.
 McKenney" <paulmck@...nel.org>, Frederic Weisbecker <frederic@...nel.org>,
 Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec
 63.0% regression

On Wed, 29 Jan 2025 16:14:01 +0800
Qi Zheng <zhengqi.arch@...edance.com> wrote:

>
> It seems that the pcp lock is held when doing tlb_remove_table_rcu(), so
> trylock fails, then bypassing PCP and calling free_one_page() directly,
> which leads to the hot spot of zone lock.

Below is a tentative fix for the issue. It is kind of a big hammer,
and maybe the RCU people have a better idea on how to solve this
problem, but it may be worth giving this a try to see if it helps
with the regression you identified.

---8<---

From 2b0302f821d1fc94c968ac533dcc62b9ffe00c38 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@...riel.com>
Date: Wed, 29 Jan 2025 10:51:51 -0500
Subject: [PATCH 2/2] mm,rcu: prevent RCU callbacks from running with pcp lock
 held

Enabling MMU_GATHER_RCU_TABLE_FREE can create contention on the
zone->lock.  This turns out to be because in some configurations
RCU callbacks are called when IRQs are re-enabled inside
rmqueue_bulk, while the CPU is still holding the per-cpu pages lock.

That results in the RCU callbacks being unable to grab the
PCP lock, and taking the slow path with the zone->lock for
each item freed.

Speed things up by blocking RCU callbacks while holding the
PCP lock.

Signed-off-by: Rik van Riel <riel@...riel.com>
Reported-by: Qi Zheng <zhengqi.arch@...edance.com>
---
 mm/page_alloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e469c7ef9a4..b3c4002ab0ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3036,6 +3036,13 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		return NULL;
 	}

+	/*
+	 * Prevent RCU callbacks from being run from the spin_lock_irqrestore
+	 * inside rmqueue_bulk, while the pcp lock is held; that would result
+	 * in each RCU free taking the zone->lock, which can be very slow.
+	 */
+	rcu_read_lock();
+
 	/*
 	 * On allocation, reduce the number of pages that are batch freed.
 	 * See nr_pcp_free() where free_factor is increased for subsequent
@@ -3046,6 +3053,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
 	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
+	rcu_read_unlock();
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone, 1);
-- 
2.47.1
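
For anyone who wants to see the fallback pattern outside the kernel,
here is a minimal user-space sketch of what the commit message
describes: a free path that tries a fast lock first and drops to a
shared lock when the trylock fails. Everything in it is an assumption
made for illustration. fast_lock stands in for the pcp lock, slow_lock
for zone->lock, and free_item() and count_slow_frees() are hypothetical
helpers, not kernel functions. It only models the lock logic, not RCU
or IRQ behavior.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins: fast_lock ~ pcp lock, slow_lock ~ zone->lock. */
static atomic_flag fast_lock = ATOMIC_FLAG_INIT;
static atomic_flag slow_lock = ATOMIC_FLAG_INIT;

static bool try_lock(atomic_flag *l)
{
	/* test_and_set returns the old value; false means we took the lock. */
	return !atomic_flag_test_and_set_explicit(l, memory_order_acquire);
}

static void unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Free one item. Returns true if the slow (shared lock) path was taken. */
static bool free_item(void)
{
	if (try_lock(&fast_lock)) {
		/* Fast path: the item would go onto the per-CPU list here. */
		unlock(&fast_lock);
		return false;
	}

	/* Trylock failed, so bypass the fast path and take the shared lock. */
	while (!try_lock(&slow_lock))
		;
	unlock(&slow_lock);
	return true;
}

/*
 * Run n simulated callbacks, optionally while the caller already holds
 * fast_lock, and return how many of them ended up on the slow path.
 */
static int count_slow_frees(bool fast_lock_held, int n)
{
	int slow = 0;

	assert(n >= 0);

	if (fast_lock_held)
		while (!try_lock(&fast_lock))
			;

	for (int i = 0; i < n; i++)
		if (free_item())
			slow++;

	if (fast_lock_held)
		unlock(&fast_lock);

	return slow;
}
```

With the fast lock held around the callbacks, count_slow_frees(true, 8)
sends all 8 frees down the slow path; with it released,
count_slow_frees(false, 8) sends none. That is the same shape as the
regression: each free that runs while the fast lock is held pays for
the shared lock individually.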