Message-ID: <wuuz7itgcjb7vu466k6nwxfjiy4ytx7ip3yvauqucwlpqqibri@bpxnpevzermg>
Date: Wed, 20 Aug 2025 13:58:15 +0100
From: Kiryl Shutsemau <kirill@...temov.name>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Joshua Hahn <joshua.hahnjy@...il.com>, 
	Johannes Weiner <hannes@...xchg.org>, Chris Mason <clm@...com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Vlastimil Babka <vbabka@...e.cz>, 
	Suren Baghdasaryan <surenb@...le.com>, Michal Hocko <mhocko@...e.com>, 
	Brendan Jackman <jackmanb@...gle.com>, Zi Yan <ziy@...dia.com>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in
 batch freeing

On Tue, Aug 19, 2025 at 10:15:39AM -0700, Shakeel Butt wrote:
> On Tue, Aug 19, 2025 at 10:15:13AM +0100, Kiryl Shutsemau wrote:
> > On Mon, Aug 18, 2025 at 11:58:03AM -0700, Joshua Hahn wrote:
> > > While testing workloads with high sustained memory pressure on large
> > > machines (1TB memory, 316 CPUs), we saw an unexpectedly high number of
> > > softlockups. Further investigation showed that the zone lock in
> > > free_pcppages_bulk was being held for a long time, sometimes across the
> > > freeing of 2k+ pages.
> > > 
> > > Instead of holding the lock for the entirety of the freeing, check
> > > whether the zone lock is contended every pcp->batch pages. If there is
> > > contention, relinquish the lock so that other processors have a chance
> > > to grab the lock and perform critical work.
> > 
> > Hm. It doesn't necessarily have to be contention on the lock; it may
> > just be that you are holding the lock for too long, so the CPU is not
> > available to the scheduler.
> > 
> > > In our fleet, we have seen that this batched freeing with lock
> > > relinquishing has led to significantly lower rates of softlockups,
> > > while incurring only small regressions (small relative to both the
> > > workload and the run-to-run variation).
> > > 
> > > The following are a few synthetic benchmarks:
> > > 
> > > Test 1: Small machine (30G RAM, 36 CPUs)
> > > 
> > > stress-ng --vm 30 --vm-bytes 1G -M -t 100
> > > +----------------------+---------------+-----------+
> > > |        Metric        | Variation (%) | Delta (%) |
> > > +----------------------+---------------+-----------+
> > > | bogo ops             |        0.0076 |   -0.0183 |
> > > | bogo ops/s (real)    |        0.0064 |   -0.0207 |
> > > | bogo ops/s (usr+sys) |        0.3151 |   +0.4141 |
> > > +----------------------+---------------+-----------+
> > > 
> > > stress-ng --vm 20 --vm-bytes 3G -M -t 100
> > > +----------------------+---------------+-----------+
> > > |        Metric        | Variation (%) | Delta (%) |
> > > +----------------------+---------------+-----------+
> > > | bogo ops             |        0.0295 |   -0.0078 |
> > > | bogo ops/s (real)    |        0.0267 |   -0.0177 |
> > > | bogo ops/s (usr+sys) |        1.7079 |   -0.0096 |
> > > +----------------------+---------------+-----------+
> > > 
> > > Test 2: Big machine (250G RAM, 176 CPUs)
> > > 
> > > stress-ng --vm 50 --vm-bytes 5G -M -t 100
> > > +----------------------+---------------+-----------+
> > > |        Metric        | Variation (%) | Delta (%) |
> > > +----------------------+---------------+-----------+
> > > | bogo ops             |        0.0362 |   -0.0187 |
> > > | bogo ops/s (real)    |        0.0391 |   -0.0220 |
> > > | bogo ops/s (usr+sys) |        2.9603 |   +1.3758 |
> > > +----------------------+---------------+-----------+
> > > 
> > > stress-ng --vm 10 --vm-bytes 30G -M -t 100
> > > +----------------------+---------------+-----------+
> > > |        Metric        | Variation (%) | Delta (%) |
> > > +----------------------+---------------+-----------+
> > > | bogo ops             |        2.3130 |   -0.0754 |
> > > | bogo ops/s (real)    |        3.3069 |   -0.8579 |
> > > | bogo ops/s (usr+sys) |        4.0369 |   -1.1985 |
> > > +----------------------+---------------+-----------+
> > > 
> > > Suggested-by: Chris Mason <clm@...com>
> > > Co-developed-by: Johannes Weiner <hannes@...xchg.org>
> > > Signed-off-by: Joshua Hahn <joshua.hahnjy@...il.com>
> > > 
> > > ---
> > >  mm/page_alloc.c | 15 ++++++++++++++-
> > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index a8a84c3b5fe5..bd7a8da3e159 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1238,6 +1238,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > >  	 * below while (list_empty(list)) loop.
> > >  	 */
> > >  	count = min(pcp->count, count);
> > > +	if (!count)
> > > +		return;
> > >  
> > >  	/* Ensure requested pindex is drained first. */
> > >  	pindex = pindex - 1;
> > > @@ -1247,6 +1249,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > >  	while (count > 0) {
> > >  		struct list_head *list;
> > >  		int nr_pages;
> > > +		int batch = min(count, pcp->batch);
> > >  
> > >  		/* Remove pages from lists in a round-robin fashion. */
> > >  		do {
> > > @@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > >  
> > >  			/* must delete to avoid corrupting pcp list */
> > >  			list_del(&page->pcp_list);
> > > +			batch -= nr_pages;
> > >  			count -= nr_pages;
> > >  			pcp->count -= nr_pages;
> > >  
> > >  			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
> > >  			trace_mm_page_pcpu_drain(page, order, mt);
> > > -		} while (count > 0 && !list_empty(list));
> > > +		} while (batch > 0 && !list_empty(list));
> > > +
> > > +		/*
> > > +		 * Prevent starving the lock for other users; every pcp->batch
> > > +		 * pages freed, relinquish the zone lock if it is contended.
> > > +		 */
> > > +		if (count && spin_is_contended(&zone->lock)) {
> > 
> > I would rather drop the count thing and do something like this:
> > 
> > 		if (need_resched() || spin_needbreak(&zone->lock)) {
> > 			spin_unlock_irqrestore(&zone->lock, flags);
> > 			cond_resched();
> 
> Can this function be called from non-sleepable context?

No, it cannot.

And looking at the locking context -- the caller holds pcp->lock -- it
looks like my proposal with need_resched()/cond_resched() doesn't work.
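
(For context, a rough sketch of the call path, with helper names from
memory rather than a specific tree -- the exact names vary by kernel
version:)

	free_unref_page()
	  pcp_spin_trylock(pcp);		/* non-sleepable from here on */
	    free_unref_page_commit()
	      free_pcppages_bulk(zone, ...);	/* also takes zone->lock */
	  pcp_spin_unlock(pcp);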

We need to either push for a wider rework and make cond_resched() happen
higher up the stack, or ignore it and have cpu_relax() called on the
lock drop.
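
For illustration only, the cpu_relax() variant could look something like
this inside the batch loop of free_pcppages_bulk() (a sketch on top of
the diff above, not a tested patch):

	/*
	 * Sketch: the caller holds pcp->lock, so we cannot sleep here.
	 * Instead of cond_resched(), briefly drop zone->lock so that
	 * waiters can get in, and pause before reacquiring it.
	 */
	if (count && spin_needbreak(&zone->lock)) {
		spin_unlock_irqrestore(&zone->lock, flags);
		cpu_relax();
		spin_lock_irqsave(&zone->lock, flags);
	}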

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
