Message-ID: <87frjfx6u4.fsf@DESKTOP-5N7EMDA>
Date: Fri, 14 Mar 2025 18:08:35 +0800
From: "Huang, Ying" <ying.huang@...ux.alibaba.com>
To: Joshua Hahn <joshua.hahnjy@...il.com>
Cc: lsf-pc@...ts.linux-foundation.org,  linux-mm@...ck.org,
  linux-kernel@...r.kernel.org,  gourry@...rry.net,  hyeonggon.yoo@...com,
  honggyu.kim@...com,  kernel-team@...a.com
Subject: Re: [LSF/MM/BPF TOPIC] Weighted interleave auto-tuning

Joshua Hahn <joshua.hahnjy@...il.com> writes:

> On Thu,  9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@...il.com> wrote:
>
>> Hello everyone, I hope everyone has had a great start to 2025!
>> 
>> Recently, I have been working on a patch series [1] with
>> Gregory Price <gourry@...rry.net> that provides new default interleave
>> weights, along with dynamic re-weighting on hotplug events and a series
>> of UAPIs that allow users to configure how they want the defaults to behave.
>> 
>> In introducing these new defaults, discussions have opened up in the
>> community regarding how best to create a UAPI that can provide
>> coherent and transparent interactions for the user. In particular, consider
>> this scenario: when a hotplug event happens and a node comes online
>> with new bandwidth information (and therefore changing the bandwidth
>> distributions across the system), should user-set weights be overwritten
>> to reflect the new distributions? If so, how can we justify overwriting
>> user-set values in a sysfs interface? If not, how will users manually
>> adjust the node weights to the optimal weight?
>> 
>> I would like to revisit some of the design choices made for this patch,
>> including how the defaults were derived, and open the conversation to
>> hear what the community believes is a reasonable way to allow users to
>> tune weighted interleave weights. More broadly, I hope to gather
>> community insight on how people use weighted interleave, and do my best
>> to reflect those workflows in the patch.
>
> Weighted interleave has since moved on to v7 [1], and a v8 is currently being
> drafted. Through feedback from reviewers, we have landed on a coherent UAPI
> that gives users two options: auto mode, which leaves all weight calculation
> decisions to the system, and manual mode, which leaves weighted interleave
> the same as it is without the patch.
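
For illustration, a minimal user-space sketch of flipping between the
two modes might look like the snippet below.  The weighted_interleave
sysfs directory comes from the series itself, but the "auto" file name
here is an assumption drawn from the description above, not confirmed
ABI.

#include <stdio.h>

int main(void)
{
	/* Assumed knob; adjust to whatever name the final ABI uses. */
	const char *knob =
		"/sys/kernel/mm/mempolicy/weighted_interleave/auto";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* "1": auto mode, the kernel recomputes weights itself.
	 * "0": manual mode, weights behave as before the patch. */
	fputs("1", f);
	return fclose(f) ? 1 : 0;
}
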
>
> Given that the patch's functionality is now mostly settled and that the
> questions I hoped to raise during this slot were answered via patch feedback,
> I would like to ask another question during the talk:
>
> Should the system dynamically change what metrics it uses to weight the nodes,
> based on what bottlenecks the system is currently facing?
>
> In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic
> to determine what a node's weight should be. However, what if the system is
> not bottlenecked by bandwidth, but by latency? A system could also be
> bottlenecked by read bandwidth, but not by write bandwidth.
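
As a concrete illustration of that heuristic, the sketch below computes
per-node weights proportional to min(read_bw, write_bw) and reduces them
by their GCD to keep the interleave ratios small.  The bandwidth numbers
are made up, and the GCD reduction is only one plausible normalization;
the patch itself defines the real derivation.

#include <stdio.h>

static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	/* Hypothetical nodes: { read_bw, write_bw } in GB/s. */
	unsigned int bw[][2] = { { 200, 150 }, { 60, 50 }, { 30, 25 } };
	unsigned int n = sizeof(bw) / sizeof(bw[0]);
	unsigned int w[3], g = 0, i;

	for (i = 0; i < n; i++) {
		w[i] = bw[i][0] < bw[i][1] ? bw[i][0] : bw[i][1];
		g = gcd(g, w[i]);
	}
	for (i = 0; i < n; i++)
		printf("node%u: weight %u\n", i, w[i] / g);
	return 0;
}

For these example numbers the weights come out 6:2:1, i.e. pages are
interleaved across the three nodes in that ratio.
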
>
> Consider a scenario where a system has many memory nodes with varying
> latencies and bandwidths. When the system is not bottlenecked by bandwidth,
> it might prefer to allocate memory from nodes with lower latency. Once the
> system starts feeling pressured by bandwidth, the weights for high bandwidth
> (but also high latency) nodes would slowly increase to alleviate pressure
> from the system. Once the system is back in a manageable state, weights for
> low latency nodes would start increasing again. Users would not have to be
> aware of any of this -- they would just see the system take control of the
> weight changes as the system's needs continue to change.
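
One way to picture this is as an interpolation between a
latency-optimized and a bandwidth-optimized weight per node, steered by
a bandwidth-pressure signal.  Everything in the sketch below (the weight
pairs, the pressure scale, the blend function) is hypothetical; it only
illustrates the shape of the idea.

#include <stdio.h>

struct node {
	unsigned int w_latency;		/* weight when latency-bound   */
	unsigned int w_bandwidth;	/* weight when bandwidth-bound */
};

/* pressure in [0.0, 1.0]: 0 = no bandwidth pressure, 1 = saturated. */
static unsigned int blend(const struct node *n, double pressure)
{
	double w = (1.0 - pressure) * n->w_latency +
		   pressure * n->w_bandwidth;

	return w < 1.0 ? 1 : (unsigned int)(w + 0.5);
}

int main(void)
{
	/* node0: low-latency DRAM; node1: high-bandwidth, higher-latency. */
	struct node nodes[] = { { 8, 4 }, { 1, 3 } };
	double p;

	for (p = 0.0; p <= 1.0; p += 0.5)
		printf("pressure %.1f: node0=%u node1=%u\n",
		       p, blend(&nodes[0], p), blend(&nodes[1], p));
	return 0;
}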

IIUC, this assumes that the capacity of every kind of memory is large
enough.  However, this may not be true in some cases.  So, another
possibility, for a system with DRAM and CXL memory nodes, is:

- While there is free space on the DRAM node and its bandwidth isn't
  saturated, memory is allocated on the DRAM node.

- When there is no free space on the DRAM node but its bandwidth isn't
  saturated, cold pages are migrated to the CXL memory nodes, while hot
  pages are migrated to the DRAM node.

- Once the bandwidth of the DRAM node is saturated, hot pages are
  migrated to the CXL memory nodes as well.
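
A rough sketch of these three regimes as a single placement decision,
with purely illustrative inputs and names:

#include <stdio.h>

enum placement { ALLOC_DRAM, DEMOTE_COLD_TO_CXL, PLACE_HOT_ON_CXL };

static enum placement decide(int dram_has_free, int dram_bw_saturated)
{
	if (dram_bw_saturated)
		return PLACE_HOT_ON_CXL;	/* third regime above  */
	if (dram_has_free)
		return ALLOC_DRAM;		/* first regime above  */
	return DEMOTE_COLD_TO_CXL;		/* second regime above */
}

int main(void)
{
	printf("%d %d %d\n",
	       decide(1, 0),	/* free DRAM, unsaturated: allocate on DRAM */
	       decide(0, 0),	/* DRAM full, unsaturated: demote cold pages */
	       decide(0, 1));	/* DRAM saturated: place hot pages on CXL   */
	return 0;
}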

In general, I think that real situations are complex, which makes it
hard to implement a good policy in the kernel.  So, I suspect that it's
better to start with experiments in user space.

> This proposal also has some concerns that need to be addressed:
> - How reactive should the system be, and how aggressively should it tune the
>   weights? We don't want the system to overreact to short spikes in pressure.
> - Does dynamic weight adjustment lead to pages being "misplaced"? Should
>   those "misplaced" pages be migrated? (probably not)
> - Does this need to be in the kernel? A userspace daemon that monitors kernel
>   metrics has the ability to make the changes (via the nodeN interfaces).
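
On that last point, a minimal sketch of such a daemon: poll a pressure
metric, smooth it so short spikes are ignored (which also speaks to the
reactivity concern above), and write per-node weights through the nodeN
sysfs files.  The pressure source and the pressure-to-weight mapping are
placeholders, not a recommendation.

#include <stdio.h>
#include <unistd.h>

/* Placeholder: e.g. parse /proc/pressure/memory or PMU counters. */
static double read_pressure(void)
{
	return 0.0;
}

static void set_weight(int nid, unsigned int weight)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", nid);
	f = fopen(path, "w");
	if (!f)
		return;
	fprintf(f, "%u", weight);
	fclose(f);
}

int main(void)
{
	double ewma = 0.0;
	const double alpha = 0.2;	/* smoothing damps short spikes */

	for (;;) {
		ewma = alpha * read_pressure() + (1.0 - alpha) * ewma;
		/* Hypothetical mapping from smoothed pressure to weights. */
		set_weight(0, ewma < 0.5 ? 8 : 4);
		set_weight(1, ewma < 0.5 ? 1 : 3);
		sleep(1);
	}
}
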
>
> Thoughts & comments are appreciated! Thank you, and have a great day!
> Joshua
>
> [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@gmail.com/
>
> Sent using hkml (https://github.com/sjp38/hackermail)

---
Best Regards,
Huang, Ying
