Message-ID: <20250509023032.235-1-rakie.kim@sk.com>
Date: Fri, 9 May 2025 11:30:26 +0900
From: Rakie Kim <rakie.kim@...com>
To: Gregory Price <gourry@...rry.net>
Cc: joshua.hahnjy@...il.com,
akpm@...ux-foundation.org,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
linux-cxl@...r.kernel.org,
dan.j.williams@...el.com,
ying.huang@...ux.alibaba.com,
kernel_team@...ynix.com,
honggyu.kim@...com,
yunjeong.mun@...com,
Rakie Kim <rakie.kim@...com>
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave

On Thu, 8 May 2025 11:12:35 -0400 Gregory Price <gourry@...rry.net> wrote:
> On Thu, May 08, 2025 at 03:30:36PM +0900, Rakie Kim wrote:
> > On Wed, 7 May 2025 12:38:18 -0400 Gregory Price <gourry@...rry.net> wrote:
> >
> > The proposed design is completely optional and isolated: it retains the
> > existing flat weight model as-is and activates the source-aware behavior only
> > when 'multi' mode is enabled. The complexity is scoped entirely to users who
> > opt into this mode.
> >
>
> I get what you're going for, just expressing my experience around this
> issue specifically.
Thank you very much for your response. Your prior experience and insights
have been extremely helpful in refining how I think about this problem.
>
> The lack of enthusiasm for solving the cross-socket case, and thus
> reduction from a 2D array to a 1D array, was because reasoning about
> interleave w/ cross-socket interconnects is not really feasible with
> the NUMA abstraction. Cross-socket interconnects are "Invisible" but
> have real performance implications. Unless we have a way to:
>
> 1) Represent the topology, AND
> 2) A way to get performance about that topology
>
> It's not useful. So NUMA is an incomplete (if not wrong) tool for this.
Your comment gave me an opportunity to reconsider the purpose of the
feature I originally proposed. In fact, I had two different scenarios
in mind when outlining this direction.

Scenario 1: Adapt weighting based on the task's execution node
A task prefers only the DRAM and locally attached CXL memory of the
socket on which it is running, in order to avoid cross-socket access and
optimize bandwidth.
- A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
- A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
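
To make Scenario 1 concrete, here is a minimal sketch of how the weight
lookup could become source-aware. This is illustrative only:
'iw_table_multi', 'wi_multi_mode_enabled', and the helper name are all
hypothetical; only numa_node_id() and the existing flat iw_table
correspond to code that exists today.

/*
 * Sketch: pick a weight for @target_nid based on the node the task is
 * currently executing on (Scenario 1), falling back to the flat model
 * when 'multi' mode is off.
 */
static u8 wi_weight_for(int target_nid)
{
	int src_nid = numa_node_id();	/* node of the executing CPU */

	if (!wi_multi_mode_enabled)	/* hypothetical mode flag */
		return iw_table[target_nid];	/* existing flat model */

	/*
	 * Two-socket example from above:
	 *   iw_table_multi[0]: DRAM0 -> 3, CXL0 -> 1 (tasks on node0)
	 *   iw_table_multi[1]: DRAM1 -> 3, CXL1 -> 1 (tasks on node1)
	 * Targets with weight 0 would simply be skipped.
	 */
	return iw_table_multi[src_nid][target_nid];
}
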
Scenario 2: Reflect relative memory access performance
The system adjusts weights based on expected bandwidth differences for
remote accesses. This relies on having access to interconnect performance
data, which NUMA currently does not expose.
As you rightly pointed out, Scenario 2 depends on being able to measure
or model the cost of cross-socket access, which is not available in the
current abstraction. I now realize that this case is less immediately
actionable and needs further research before it can be pursued.
However, Scenario 1 does not depend on such information. Rather, it is
a locality-preserving optimization that restricts a task's memory
accesses to its own socket's DRAM and CXL nodes. I believe this use case
is implementable today and worth considering independently of
interconnect performance awareness.
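
For illustration, userspace configuration might then look like the
sketch below. The per-source 'multi/<src>/<dst>' sysfs layout is purely
an assumption on my part (today only the flat
/sys/kernel/mm/mempolicy/weighted_interleave/nodeN files exist), and the
CXL node numbers (node2/node3) are made up for the example.

#include <stdio.h>

/* Write a single weight to a (hypothetical) sysfs attribute. */
static void set_weight(const char *path, int weight)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fprintf(f, "%d\n", weight);
		fclose(f);
	}
}

int main(void)
{
	const char *base = "/sys/kernel/mm/mempolicy/weighted_interleave";
	char path[256];

	/* Tasks on node0: local DRAM0 (node0, w=3), local CXL0 (node2, w=1) */
	snprintf(path, sizeof(path), "%s/multi/node0/node0", base);
	set_weight(path, 3);
	snprintf(path, sizeof(path), "%s/multi/node0/node2", base);
	set_weight(path, 1);

	/* Tasks on node1: local DRAM1 (node1, w=3), local CXL1 (node3, w=1) */
	snprintf(path, sizeof(path), "%s/multi/node1/node1", base);
	set_weight(path, 3);
	snprintf(path, sizeof(path), "%s/multi/node1/node3", base);
	set_weight(path, 1);

	return 0;
}
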
>
> Additionally - reacting to task migration is not a real issue. If
> you're deploying an allocation strategy, you probably don't want your
> task migrating away from the place where you just spent a bunch of time
> allocating based on some existing strategy. So the solution is: don't
> migrate, and if you do - don't use cross-socket interleave.
That's a fair point. I also agree that handling migration is not critical
at this stage, and I'm not actively focusing on that aspect in this
proposal.
>
> Maybe if we solve the first half of this we can take a look at the task
> migration piece again, but I wouldn't try to solve for migration.
>
> At the same time we were discussing this, we were also discussing how to
> do external task-mempolicy modifications - which seemed significantly
> more useful, but ultimately more complex and without sufficient
> interested parties / users.
I'd like to learn more about that thread. If you happen to have a pointer
to that discussion, it would be really helpful.
>
> ~Gregory
>
Thanks again for sharing your insights. I will follow up with a refined
proposal based on the localized socket-based routing model (Scenario 1)
and, for now, will continue to study the parts that depend on topology
performance measurement.
Rakie