Message-ID: <Z5O2QATuhvRnygcx@gourry-fedora-PF4VCD3F>
Date: Fri, 24 Jan 2025 10:48:16 -0500
From: Gregory Price <gourry@...rry.net>
To: Matthew Wilcox <willy@...radead.org>
Cc: Joshua Hahn <joshua.hahnjy@...il.com>, hyeonggon.yoo@...com,
ying.huang@...ux.alibaba.com, rafael@...nel.org, lenb@...nel.org,
gregkh@...uxfoundation.org, akpm@...ux-foundation.org,
honggyu.kim@...com, rakie.kim@...com, dan.j.williams@...el.com,
Jonathan.Cameron@...wei.com, dave.jiang@...el.com,
horen.chuang@...ux.dev, hannes@...xchg.org,
linux-kernel@...r.kernel.org, linux-acpi@...r.kernel.org,
linux-mm@...ck.org, kernel-team@...a.com
Subject: Re: [PATCH v3] Weighted interleave auto-tuning
On Fri, Jan 24, 2025 at 05:58:09AM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:58:54AM -0800, Joshua Hahn wrote:
> > On machines with multiple memory nodes, interleaving page allocations
> > across nodes allows for better utilization of each node's bandwidth.
> > Previous work by Gregory Price [1] introduced weighted interleave, which
> > allowed for pages to be allocated across NUMA nodes according to
> > user-set ratios.
>
> I still don't get it. You always want memory to be on the local node or
> the fabric gets horribly congested and slows you right down. But you're
> not really talking about NUMA, are you? You're talking about CXL.
>
> And CXL is terrible for bandwidth. I just ran the numbers.
>
> On a current Intel top-end CPU, we're looking at 8x DDR5-4800 DIMMs,
> each with a bandwidth of 38.4GB/s for a total of 300GB/s.
>
> For each CXL lane, you take a lane of PCIe gen5 away. So that's
> notionally 32Gbit/s, or 4GB/s per lane. But CXL is crap, and you'll be
> lucky to get 3 cachelines per 256 byte packet, dropping you down to 3GB/s.
> You're not going to use all 80 lanes for CXL (presumably these CPUs are
> going to want to do I/O somehow), so maybe allocate 20 of them to CXL.
> That's 60GB/s, or a 20% improvement in bandwidth. On top of that,
> it's slow, with a minimum of 10ns latency penalty just from the CXL
> encode/decode penalty.
>
From the original - the performance tests show considerable opportunity
in the scenarios where DRAM bandwidth is pressured, as you can either:

1) Lower the DRAM bandwidth pressure by offloading some cachelines to
   CXL - reducing latency on DRAM and reducing average latency overall.
   The latency cost of the CXL lines gets amortized over all the DRAM
   fetches that no longer hit stalls.

2) Under full-pressure scenarios (DRAM and CXL both saturated), the
   additional lanes / buffers provide more concurrent fetches - i.e.
   you're just doing more work (and avoiding going to storage).
   This is the weaker of the two scenarios.
   (A rough sketch of bandwidth-proportional weights follows below.)

No one is proposing we switch the default policy to weighted interleave.
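
For a sense of scale: the weights end up being small integers roughly
proportional to each node's bandwidth. Here is a toy sketch of that
reduction - my own illustration, not the exact algorithm in the series
(which derives bandwidth from HMAT/CDAT), and using your ~300GB/s DRAM
/ ~60GB/s CXL figures purely as assumed inputs:

/* Toy illustration only - not the patch's code.  Reduce per-node
 * bandwidths to small interleave weights by dividing by their GCD.
 */
#include <stdio.h>

static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	/* assumed figures: ~300GB/s aggregate DRAM, ~60GB/s CXL */
	unsigned int bw[] = { 300, 60 };
	unsigned int g = gcd(bw[0], bw[1]);

	/* 5:1 - 5 of every 6 pages land on DRAM, 1 on CXL, so both
	 * pools approach saturation at roughly the same time.
	 */
	printf("DRAM weight = %u, CXL weight = %u\n",
	       bw[0] / g, bw[1] / g);
	return 0;
}
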
= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench : +19% over DRAM. +47% over default interleave.
=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <ravis.opensrc@...ron.com>
Hardware: Single-socket, multiple CXL memory expanders.
Workload: W2
Data Signature: 2:1 read:write
DRAM only bandwidth (GBps): 298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only: 1.38x
Gain over default interleave: 2.64x
Workload: W5
Data Signature: 1:1 read:write
DRAM only bandwidth (GBps): 273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only: 1.4x
Gain over default interleave: 2.26x
=====================================================================
Performance test - Stream
From - Gregory Price <gregory.price@...verge.com>
Hardware: Single socket, single CXL expander
Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting    : -6% to +4% (workload dependent)
mbind weights : +2.5% to +4% (consistently better than DRAM)
=====================================================================
Performance tests - XSBench
>From - Hyeongtak Ji <hyeongtak.ji@...com>
Hardware: Single socket, single CXL memory expander
NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads: 56
Lookups: 170,000,000
Summary: +19% over DRAM. +47% over default interleave.
> Putting page cache in the CXL seems like nonsense to me. I can see it
> making sense to swap to CXL, or allocating anonymous memory for tasks
> with low priority on it. But I just can't see the point of putting
> pagecache on CXL.
No one said anything about page cache - but it depends.
If you can keep your entire working set in memory (DRAM plus CXL), as
opposed to swapping to disk - you win. "Swapping to CXL" incurs a bunch
of page faults, which sounds like a loss.
However - the stream test from the original proposal agrees with you
that just making everything interleaved (code, pagecache, etc.) is at
best a wash:

Global weighting    : -6% to +4% (workload dependent)

But targeting specific regions can provide a modest bump:

mbind weights       : +2.5% to +4% (consistently better than DRAM)
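
For completeness, here's a rough sketch of the "mbind weights" case -
weighting only the hot region rather than the whole task. This is my
own example, not code from the series: node numbers and sizes are made
up, and MPOL_WEIGHTED_INTERLEAVE needs a 6.9+ kernel and may be absent
from older numaif.h headers:

/* Sketch: weighted-interleave only a specific hot buffer via mbind(),
 * leaving everything else (code, pagecache, etc.) on local DRAM.
 * Assumes node 0 = DRAM, node 2 = CXL, and that per-node weights were
 * already configured (e.g. via the sysfs weighted_interleave knobs).
 */
#include <numaif.h>     /* mbind(), set_mempolicy(); link with -lnuma */
#include <stdlib.h>
#include <stdio.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6  /* uapi value; absent in old headers */
#endif

int main(void)
{
	size_t len = 1UL << 30;                           /* 1GB hot array */
	unsigned long nodemask = (1UL << 0) | (1UL << 2); /* nodes 0 and 2 */
	void *buf;

	if (posix_memalign(&buf, 4096, len))
		return 1;

	/* mbind weights: only this range interleaves across DRAM+CXL */
	if (mbind(buf, len, MPOL_WEIGHTED_INTERLEAVE,
		  &nodemask, sizeof(nodemask) * 8, 0))
		perror("mbind");

	/* Global weighting would instead be:
	 *   set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
	 *                 sizeof(nodemask) * 8);
	 * which affects all future allocations by the task.
	 */

	/* ... touch buf ... */
	free(buf);
	return 0;
}
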
~Gregory