linux-kernel - [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation


Message-ID: <20250325171915.14384-1-nikhil.dhama@amd.com>
Date: Tue, 25 Mar 2025 22:49:15 +0530
From: Nikhil Dhama <nikhil.dhama@....com>
To: <akpm@...ux-foundation.org>, <ying.huang@...ux.alibaba.com>
CC: Nikhil Dhama <nikhil.dhama@....com>, Ying Huang
	<huang.ying.caritas@...il.com>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, Bharata B Rao <bharata@....com>, Raghavendra
	<raghavendra.kodsarathimmappa@....com>
Subject: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation

In the old pcp design, pcp->free_factor was incremented in nr_pcp_free(),
on the free_pcppages_bulk() path. So free_factor increased by 1 only
when we tried to reduce the size of the pcp list or flush for high
order. free_high was triggered only when order > 0, order <=
PAGE_ALLOC_COSTLY_ORDER and free_factor > 0.

free_factor was scaled down by a factor of 2 on every successful
allocation.

For iperf3, I noticed that with the older design in kernel v6.6, the pcp
list was drained mostly when pcp->count > high (more often when count
went above 530), and most of the time free_factor was 0, triggering very
few high order flushes.

In the current design, free_factor has been replaced by free_count, which
keeps track of the number of pages freed contiguously. With this design,
for iperf3, the pcp list is flushed more frequently because the free_high
heuristic is triggered more often.

In the current design, free_count is incremented on every deallocation,
irrespective of whether the pcp list was reduced or not. free_high is
triggered when free_count goes above batch (which is 63) and there are
two consecutive page frees without any allocation (subject to the cache
slice optimisation).
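
As a rough illustration of the condition described above, here is a
userspace sketch modelled on the free_high check in
free_unref_page_commit(). struct pcp_model and model_free_high() are
hypothetical names used only for this example, and the flag values are
stand-ins for illustration, not taken from the kernel headers.

```c
#include <stdbool.h>

/* Stand-in values for this model only. */
#define PAGE_ALLOC_COSTLY_ORDER   3
#define PCPF_PREV_FREE_HIGH_ORDER (1u << 0)
#define PCPF_FREE_HIGH_BATCH      (1u << 1)

/* Hypothetical stand-in for the struct per_cpu_pages fields used here. */
struct pcp_model {
	int count;          /* pages currently held on the pcp lists */
	int free_count;     /* pages freed without an intervening allocation */
	unsigned int flags; /* PCPF_* bits */
};

/* Returns true when the modelled heuristic would request a high order flush. */
static bool model_free_high(const struct pcp_model *pcp, int order, int batch)
{
	/* Only orders 1..PAGE_ALLOC_COSTLY_ORDER are considered. */
	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
		return false;

	return pcp->free_count >= batch &&
	       (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
	       (!(pcp->flags & PCPF_FREE_HIGH_BATCH) || pcp->count >= batch);
}
```

With batch = 63, the model reports a flush once free_count reaches 63,
the previous free was also high order, and (when the cache slice flag is
set) count has reached batch as well.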

With this design, I observed that the high order pcp list is drained as
soon as both count and free_count go above 63.

Due to this more aggressive high order flushing, applications doing
contiguous high order allocations have to go to the global list more
frequently.

On a 2-node AMD machine with 384 vCPUs on each node, connected via
Mellanox ConnectX-7, I am seeing a ~30% performance reduction if we scale
the number of iperf3 client/server pairs from 32 to 64.

So, although the new design reduced the time to detect high order
flushes, for applications which allocate high order pages more
frequently it may be flushing the high order list prematurely.
This motivates tuning how early or late we should flush high order
lists.

For the free_high heuristic, I tried scaling batch and tuning the scale
factor, which delays the free_high flushes.

                   score    # free_high
---------------    -----    -----------
v6.6 (base)          100              4
v6.12 (batch*1)       69            170
batch*2               69            150
batch*4               74            101
batch*5              100             53
batch*6              100             36
batch*8              100              3

Scaling batch for the free_high heuristic by a factor of 5 or above
restores the performance, as it reduces the number of high order flushes.
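
To make the scaling concrete, the sketch below works out the free_count
threshold for a given scale factor and how many back-to-back frees of
one order it takes to reach it. free_high_threshold() and
frees_to_reach() are hypothetical helpers for illustration only. The
sketch assumes that free_count advances by 1 << order pages per free and
that no allocation intervenes.

```c
/* Threshold that free_count must reach before free_high can trigger. */
static int free_high_threshold(int batch, int scale)
{
	return batch * scale;
}

/*
 * Number of consecutive frees of the given order needed to reach the
 * threshold, assuming each free adds (1 << order) pages to free_count.
 */
static int frees_to_reach(int threshold, int order)
{
	int pages_per_free = 1 << order;

	/* Round up: a partial step still requires one more free. */
	return (threshold + pages_per_free - 1) / pages_per_free;
}
```

With batch = 63, a scale of 1 gives a threshold of 63 pages and a scale
of 5 gives 315 pages. Under the stated assumption, for order-3 frees
that is 8 versus 40 consecutive frees before the heuristic can fire.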

On a 2-node AMD server with 384 vCPUs per node, scores for iperf3 and
other benchmarks with patch v2 are as follows:

                     iperf3    lmbench3            netperf         kbuild
                              (AF_UNIX)      (SCTP_STREAM_MANY)
                    -------   ---------      -----------------     ------
v6.6 (base)            100          100                  100          100
v6.12                   69          113                 98.5         98.8
v6.12 with patch       100        112.5                100.1         99.6 

For network workloads, clients and server are running on different
machines connected via a Mellanox ConnectX-7 NIC.

Number of free_high:

                     iperf3    lmbench3            netperf         kbuild
                              (AF_UNIX)      (SCTP_STREAM_MANY)
                    -------   ---------      -----------------     ------
v6.6 (base)              5          12                    6             2
v6.12                  170          11                   92             2
v6.12 with patch        58          11                   34             2

Signed-off-by: Nikhil Dhama <nikhil.dhama@....com>
Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: Ying Huang <huang.ying.caritas@...il.com>
Cc: linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org
Cc: Bharata B Rao <bharata@....com>
Cc: Raghavendra <raghavendra.kodsarathimmappa@....com>

---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6958333054d..326d5fbae353 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2617,7 +2617,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * stops will be drained from vmstat refresh context.
 	 */
 	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
-		free_high = (pcp->free_count >= batch &&
+		free_high = (pcp->free_count >= (batch*5) &&
 			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
 			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
 			      pcp->count >= READ_ONCE(batch)));
-- 
2.25.1