Message-ID: <k6fpx5adh45t4jrxgiccq7acubwcgmi746crggxi6e4oihtvpt@thks5zrn53n3>
Date: Tue, 19 Aug 2025 10:15:13 +0100
From: Kiryl Shutsemau <kirill@...temov.name>
To: Joshua Hahn <joshua.hahnjy@...il.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Chris Mason <clm@...com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Vlastimil Babka <vbabka@...e.cz>, 
	Suren Baghdasaryan <surenb@...le.com>, Michal Hocko <mhocko@...e.com>, 
	Brendan Jackman <jackmanb@...gle.com>, Zi Yan <ziy@...dia.com>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in
 batch freeing

On Mon, Aug 18, 2025 at 11:58:03AM -0700, Joshua Hahn wrote:
> While testing workloads with high sustained memory pressure on large machines
> (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> Further investigation showed that the zone lock in free_pcppages_bulk was
> being held for long stretches, in some cases across the freeing of 2k+ pages.
> 
> Instead of holding the lock for the entirety of the freeing, check to see if
> the zone lock is contended every pcp->batch pages. If there is contention,
> relinquish the lock so that other processors have a chance to grab the lock
> and perform critical work.

Hm. It doesn't necessarily have to be contention on the lock; it can simply
be that you hold the lock for so long that the CPU is not available to the
scheduler, which is what the softlockup watchdog actually detects.

> In our fleet, we have seen that this batched freeing with lock relinquishing
> has led to significantly lower rates of softlockups, while incurring
> regressions that are small relative to both the workload and its run-to-run
> variation.
> 
> The following are a few synthetic benchmarks:
> 
> Test 1: Small machine (30G RAM, 36 CPUs)
> 
> stress-ng --vm 30 --vm-bytes 1G -M -t 100
> +----------------------+---------------+-----------+
> |        Metric        | Variation (%) | Delta (%) |
> +----------------------+---------------+-----------+
> | bogo ops             |        0.0076 |   -0.0183 |
> | bogo ops/s (real)    |        0.0064 |   -0.0207 |
> | bogo ops/s (usr+sys) |        0.3151 |   +0.4141 |
> +----------------------+---------------+-----------+
> 
> stress-ng --vm 20 --vm-bytes 3G -M -t 100
> +----------------------+---------------+-----------+
> |        Metric        | Variation (%) | Delta (%) |
> +----------------------+---------------+-----------+
> | bogo ops             |        0.0295 |   -0.0078 |
> | bogo ops/s (real)    |        0.0267 |   -0.0177 |
> | bogo ops/s (usr+sys) |        1.7079 |   -0.0096 |
> +----------------------+---------------+-----------+
> 
> Test 2: Big machine (250G RAM, 176 CPUs)
> 
> stress-ng --vm 50 --vm-bytes 5G -M -t 100
> +----------------------+---------------+-----------+
> |        Metric        | Variation (%) | Delta (%) |
> +----------------------+---------------+-----------+
> | bogo ops             |        0.0362 |   -0.0187 |
> | bogo ops/s (real)    |        0.0391 |   -0.0220 |
> | bogo ops/s (usr+sys) |        2.9603 |   +1.3758 |
> +----------------------+---------------+-----------+
> 
> stress-ng --vm 10 --vm-bytes 30G -M -t 100
> +----------------------+---------------+-----------+
> |        Metric        | Variation (%) | Delta (%) |
> +----------------------+---------------+-----------+
> | bogo ops             |        2.3130 |   -0.0754 |
> | bogo ops/s (real)    |        3.3069 |   -0.8579 |
> | bogo ops/s (usr+sys) |        4.0369 |   -1.1985 |
> +----------------------+---------------+-----------+
> 
> Suggested-by: Chris Mason <clm@...com>
> Co-developed-by: Johannes Weiner <hannes@...xchg.org>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@...il.com>
> 
> ---
>  mm/page_alloc.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a8a84c3b5fe5..bd7a8da3e159 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1238,6 +1238,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	 * below while (list_empty(list)) loop.
>  	 */
>  	count = min(pcp->count, count);
> +	if (!count)
> +		return;
>  
>  	/* Ensure requested pindex is drained first. */
>  	pindex = pindex - 1;
> @@ -1247,6 +1249,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	while (count > 0) {
>  		struct list_head *list;
>  		int nr_pages;
> +		int batch = min(count, pcp->batch);
>  
>  		/* Remove pages from lists in a round-robin fashion. */
>  		do {
> @@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  
>  			/* must delete to avoid corrupting pcp list */
>  			list_del(&page->pcp_list);
> +			batch -= nr_pages;
>  			count -= nr_pages;
>  			pcp->count -= nr_pages;
>  
>  			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
>  			trace_mm_page_pcpu_drain(page, order, mt);
> -		} while (count > 0 && !list_empty(list));
> +		} while (batch > 0 && !list_empty(list));
> +
> +		/*
> +		 * Prevent starving the lock for other users; every pcp->batch
> +		 * pages freed, relinquish the zone lock if it is contended.
> +		 */
> +		if (count && spin_is_contended(&zone->lock)) {

I would rather drop the count thing and do something like this:

		if (need_resched() || spin_needbreak(&zone->lock)) {
			spin_unlock_irqrestore(&zone->lock, flags);
			cond_resched();
			spin_lock_irqsave(&zone->lock, flags);
		}
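
For illustration, a minimal userspace analogue of this lock-break pattern: a
pthread mutex stands in for the zone lock, sched_yield() for cond_resched(),
and the rest (free_bulk(), BATCH) is made up for the sketch:

	#include <pthread.h>
	#include <sched.h>
	#include <stdlib.h>

	#define BATCH 64			/* stand-in for pcp->batch */

	static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Free 'count' blocks under list_lock, dropping the lock between
	 * batches so that other lockers (and the scheduler) can get in. */
	static void free_bulk(void **blocks, int count)
	{
		int i = 0;

		pthread_mutex_lock(&list_lock);
		while (i < count) {
			int batch = count - i < BATCH ? count - i : BATCH;

			while (batch-- > 0)
				free(blocks[i++]);

			/* Analogue of the need_resched()/spin_needbreak()
			 * check above: briefly drop the lock and yield. */
			if (i < count) {
				pthread_mutex_unlock(&list_lock);
				sched_yield();
				pthread_mutex_lock(&list_lock);
			}
		}
		pthread_mutex_unlock(&list_lock);
	}

Note that spin_needbreak() evaluates to 0 on !CONFIG_PREEMPTION builds, so
the need_resched() half of the check is what matters there.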

> +			spin_unlock_irqrestore(&zone->lock, flags);
> +			spin_lock_irqsave(&zone->lock, flags);
> +		}
>  	}
>  
>  	spin_unlock_irqrestore(&zone->lock, flags);
> 
> base-commit: 137a6423b60fe0785aada403679d3b086bb83062
> -- 
> 2.47.3

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
