linux-kernel - Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b049e0d8-a9d2-456f-aa97-148f4a6d8071@sk.com>
Date: Tue, 4 Mar 2025 21:53:13 +0900
From: Honggyu Kim <honggyu.kim@...com>
To: Joshua Hahn <joshua.hahnjy@...il.com>
Cc: kernel_team@...ynix.com, gourry@...rry.net, harry.yoo@...cle.com,
 ying.huang@...ux.alibaba.com, gregkh@...uxfoundation.org, rakie.kim@...com,
 akpm@...ux-foundation.org, rafael@...nel.org, lenb@...nel.org,
 dan.j.williams@...el.com, Jonathan.Cameron@...wei.com, dave.jiang@...el.com,
 horen.chuang@...ux.dev, hannes@...xchg.org, linux-kernel@...r.kernel.org,
 linux-acpi@...r.kernel.org, linux-mm@...ck.org, kernel-team@...a.com,
 yunjeong.mun@...com
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for
 memoryless nodes

Hi Joshua,

On 3/4/2025 6:56 AM, Joshua Hahn wrote:
> On Thu, 27 Feb 2025 12:20:03 +0900 Honggyu Kim <honggyu.kim@...com> wrote:
> 
> Hi Honggyu, thank you for taking time to review my patch, as always!

My pleasure!

> I thought I had sent this, but it seems like it was left in my draft
> without being sent.
> 
> I will follow Gregory's advice and we will drop the patch from this series,
> and send the first patch only (with Yunjeong's changes). Thanks again!

It'd be great if you could add her with the following.
Co-developed-by: Yunjeong Mun <yunjeong.mun@...com>

> 
>>
>> On 2/27/2025 11:32 AM, Honggyu Kim wrote:
>>> Hi Joshua,
>>>
>>> On 2/27/2025 6:35 AM, Joshua Hahn wrote:
>>>> We should never try to allocate memory from a memoryless node. Creating a
>>>> sysfs knob to control its weighted interleave weight does not make sense,
>>>> and can be unsafe.
>>>>
>>>> Only create weighted interleave weight knobs for nodes with memory.
>>>>
>>>> Signed-off-by: Joshua Hahn <joshua.hahnjy@...il.com>
>>>> ---
>>>>    mm/mempolicy.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>>> index 4cc04ff8f12c..50cbb7c047fa 100644
>>>> --- a/mm/mempolicy.c
>>>> +++ b/mm/mempolicy.c
>>>> @@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct
>>>> kobject *root_kobj)
>>>>            return err;
>>>>        }
>>>> -    for_each_node_state(nid, N_POSSIBLE) {
>>>
>>> Actually, we're aware of this issue and currently trying to fix this.
>>> In our system, we've attached 4ch of CXL memory for each socket as
>>> follows.
>>>
>>>           node0             node1
>>>         +-------+   UPI   +-------+
>>>         | CPU 0 |-+-----+-| CPU 1 |
>>>         +-------+         +-------+
>>>         | DRAM0 |         | DRAM1 |
>>>         +---+---+         +---+---+
>>>             |                 |
>>>         +---+---+         +---+---+
>>>         | CXL 0 |         | CXL 4 |
>>>         +---+---+         +---+---+
>>>         | CXL 1 |         | CXL 5 |
>>>         +---+---+         +---+---+
>>>         | CXL 2 |         | CXL 6 |
>>>         +---+---+         +---+---+
>>>         | CXL 3 |         | CXL 7 |
>>>         +---+---+         +---+---+
>>>           node2             node3
>>>
>>> The 4ch of CXL memory are detected as a single NUMA node in each socket,
>>> but it shows as follows with the current N_POSSIBLE loop.
>>>
>>> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
>>> node0 node1 node2 node3 node4 node5
>>> node6 node7 node8 node9 node10 node11

FYI, we used to set node2 and node3 only for weights for CXL memory here
and ignored node{4-11}.  That sounds silly but it worked.

> 
> I see. For my education, would you mind explaining how the numbering works
> here? I am not very familiar with this setup, and not sure how you would
> figure out what node is which, just by looking at the numbering.

Regarding the numbering, I'm not 100% sure, but I guess there could be a
logical NUMA node that combines 4ch of CXL memory and 4 nodes for CXL 
memory so in total 5 nodes per socket.

I don't have much knowledge on this but maybe this is related to PXM
(Proximity Domain).

> 
>>>> +    for_each_node_state(nid, N_MEMORY) {
>>
>> Thinking it again, we can leave it as a separate patch but add our patch
>> on top of it.
> 
> That sounds good to me.
> 
>> The only concern I have is having only N_MEMORY patch hides weight
>> setting knobs for CXL memory and it makes there is no way to set weight
>> values to CXL memory in my system.
> 
> You can use weighted interleave auto-tuning : -)

Not possible because using N_MEMORY doesn't provide "node" knobs for CXL
memory at all as follows.

$ ls /sys/kernel/mm/mempolicy/weighted_interleave/
node0 node1

We need node2 and node3 for CXL memory here.

> In all seriousness, this makes sense. It seems pretty problematic that
> the knobs aren't created for the CXL channels,

Yeah, it's even worse than the current status.

> and I'm not sure that hiding> it is the correct approach here (it was not my intent, either).

It isn't your problem but we shouldn't hide those nodes until it is
correctly fixed with hot plugging event handler.

> 
>> IMHO, this and our patch is better to be submitted together.
> 
> That sounds good. We can hold off on this patch then, and just consider
> the first patch of this series. Thank you for letting me know!

The N_POSSIBLE and N_MEMORY stuffs should had been fixed earlier than
this work. I will take a few days if we can submit it together.

> 
> Thank you for always reviewing my patches. Have a great day!
> Joshua

Thanks for your work and have a great day you too!

Kind regards,
Honggyu

> 
> Sent using hkml (https://github.com/sjp38/hackermail)
>