linux-kernel - Re: [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Zm0SWZKcRrngCUUW@casper.infradead.org>
Date: Sat, 15 Jun 2024 05:02:33 +0100
From: Matthew Wilcox <willy@...radead.org>
To: Shivank Garg <shivankg@....com>
Cc: akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, bharata@....com,
	raghavendra.kodsarathimmappa@....com, Michael.Day@....com,
	dmaengine@...r.kernel.org, vkoul@...nel.org
Subject: Re: [RFC PATCH 0/5] Enhancements to Page Migration with Batch
 Offloading via DMA

On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:
> We conducted experiments to measure folio copy overheads for page
> migration from a remote node to a local NUMA node, modeling page
> promotions for different workload sizes (4KB, 2MB, 256MB and 1GB).
> 
> Setup Information: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT
> Enabled), 1 NUMA node connected to each socket.
> Linux Kernel 6.8.0, DVFS set to Performance, and cpuinfo_cur_freq: 2 GHz.
> THP, compaction, numa_balancing are disabled to reduce interfernce.
> 
> migrate_pages() { <- t1
> 	..
> 	<- t2
> 	folio_copy()
> 	<- t3 
> 	..
> } <- t4
> 
> overheads Fraction, F= (t3-t2)/(t4-t1)
> Measurement: Mean ± SD is measured in cpu_cycles/page
> Generic Kernel
> 4KB::   migrate_pages:17799.00±4278.25  folio_copy:794±232.87  F:0.0478±0.0199
> 2MB::   migrate_pages:3478.42±94.93  folio_copy:493.84±28.21  F:0.1418±0.0050
> 256MB:: migrate_pages:3668.56±158.47  folio_copy:815.40±171.76  F:0.2206±0.0371
> 1GB::   migrate_pages:3769.98±55.79  folio_copy:804.68±60.07  F:0.2132±0.0134
> 
> Results with patched kernel:
> 1. Offload disabled - folios batch-move using CPU
> 4KB::   migrate_pages:14941.60±2556.53  folio_copy:799.60±211.66  F:0.0554±0.0190
> 2MB::   migrate_pages:3448.44±83.74  folio_copy:533.34±37.81  F:0.1545±0.0085
> 256MB:: migrate_pages:3723.56±132.93  folio_copy:907.64±132.63  F:0.2427±0.0270
> 1GB::   migrate_pages:3788.20±46.65  folio_copy:888.46±49.50  F:0.2344±0.0107
> 
> 2. Offload enabled - folios batch-move using DMAengine
> 4KB::   migrate_pages:46739.80±4827.15  folio_copy:32222.40±3543.42  F:0.6904±0.0423
> 2MB::   migrate_pages:13798.10±205.33  folio_copy:10971.60±202.50  F:0.7951±0.0033
> 256MB:: migrate_pages:13217.20±163.99  folio_copy:10431.20±167.25  F:0.7891±0.0029
> 1GB::   migrate_pages:13309.70±113.93  folio_copy:10410.00±117.77  F:0.7821±0.0023

You haven't measured the important thing though -- what's the cost _to
userspace_?  When the CPU does the copy, the data is now cache-hot in
that CPU's cache.  When the DMA engine does the copy, it's not cache-hot
in any CPU.

Now, this may not be a big problem.  I don't think we do anything to
ensure that the CPU that is going to access the folio in userspace is
the one which does the copy.

But your methodology is wrong.