Message-ID: <Z5Mr8WQGEZZjp9Uu@casper.infradead.org>
Date: Fri, 24 Jan 2025 05:58:09 +0000
From: Matthew Wilcox <willy@...radead.org>
To: Joshua Hahn <joshua.hahnjy@...il.com>
Cc: gourry@...rry.net, hyeonggon.yoo@...com, ying.huang@...ux.alibaba.com,
	rafael@...nel.org, lenb@...nel.org, gregkh@...uxfoundation.org,
	akpm@...ux-foundation.org, honggyu.kim@...com, rakie.kim@...com,
	dan.j.williams@...el.com, Jonathan.Cameron@...wei.com,
	dave.jiang@...el.com, horen.chuang@...ux.dev, hannes@...xchg.org,
	linux-kernel@...r.kernel.org, linux-acpi@...r.kernel.org,
	linux-mm@...ck.org, kernel-team@...a.com
Subject: Re: [PATCH v3] Weighted interleave auto-tuning

On Wed, Jan 15, 2025 at 10:58:54AM -0800, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across NUMA nodes according to
> user-set ratios.
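
For anyone following along, the user-space side of that looks roughly
like this; a minimal sketch, assuming a kernel with
MPOL_WEIGHTED_INTERLEAVE (v6.9+, from Gregory Price's series [1]) and
hypothetical node numbers:

	#include <numaif.h>	/* set_mempolicy(), MPOL_* */

	int use_weighted_interleave(void)
	{
		/* Interleave across node 0 (DRAM) and node 2 (CXL,
		 * hypothetical).  The per-node weights themselves are
		 * set separately via sysfs, e.g.
		 * /sys/kernel/mm/mempolicy/weighted_interleave/node0
		 */
		unsigned long nodemask = (1UL << 0) | (1UL << 2);

		return set_mempolicy(MPOL_WEIGHTED_INTERLEAVE,
				     &nodemask, 64);
	}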

I still don't get it.  You always want memory to be on the local node or
the fabric gets horribly congested and slows you right down.  But you're
not really talking about NUMA, are you?  You're talking about CXL.

And CXL is terrible for bandwidth.  I just ran the numbers.

On a current Intel top-end CPU, we're looking at 8x DDR5-4800 DIMMs,
each with a bandwidth of 38.4GB/s, for a total of 307.2GB/s; call
it 300GB/s.

For each CXL lane, you take a lane of PCIe gen5 away.  So that's
notionally 32Gbit/s, or 4GB/s per lane.  But CXL is crap, and you'll be
lucky to get 3 cachelines (192 bytes) of payload per 256-byte packet,
dropping you down to 3GB/s per lane.
You're not going to use all 80 lanes for CXL (presumably these CPUs are
going to want to do I/O somehow), so maybe allocate 20 of them to CXL.
That's 60GB/s, or a 20% improvement in bandwidth.  On top of that,
it's slow, with a minimum 10ns latency penalty just from CXL
encode/decode overhead.
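
To sanity-check that arithmetic (a back-of-envelope sketch; every
figure is an assumption from this mail, not a measurement):

	#include <stdio.h>

	int main(void)
	{
		double dimm = 38.4;		/* DDR5-4800: 4800 MT/s x 8 bytes */
		double dram = 8 * dimm;		/* 307.2 GB/s total */

		double lane = 4.0;		/* PCIe gen5: 32Gbit/s per lane */
		double eff = (3 * 64) / 256.0;	/* 3 cachelines per 256-byte packet */
		double cxl = 20 * lane * eff;	/* 20 lanes: 60 GB/s */

		printf("DRAM %.1fGB/s, CXL %.1fGB/s, gain %.0f%%\n",
		       dram, cxl, 100.0 * cxl / dram);
		return 0;
	}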

Putting page cache on CXL seems like nonsense to me.  I can see it
making sense to swap to CXL, or to allocate anonymous memory for
low-priority tasks on it.  But I just can't see the point of putting
page cache on CXL.
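
And for those cases you don't need interleaving at all, just explicit
placement, something like this sketch (mbind() is the real syscall; the
idea that node 2 is the CXL node is an assumption):

	#include <numaif.h>	/* mbind(), MPOL_BIND */
	#include <sys/mman.h>

	/* Pin a low-priority task's anonymous buffer to the CXL node. */
	void *alloc_on_cxl(size_t len)
	{
		unsigned long cxl_node = 1UL << 2;	/* node 2: hypothetical CXL */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return NULL;
		mbind(p, len, MPOL_BIND, &cxl_node, 64, 0);
		return p;
	}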
