Message-ID: <f64819e2-8dc6-4907-b8bf-faec66eecd0e@sk.com>
Date: Thu, 6 Mar 2025 21:39:26 +0900
From: Honggyu Kim <honggyu.kim@...com>
To: Gregory Price <gourry@...rry.net>
Cc: kernel_team@...ynix.com, Joshua Hahn <joshua.hahnjy@...il.com>,
 harry.yoo@...cle.com, ying.huang@...ux.alibaba.com,
 gregkh@...uxfoundation.org, rakie.kim@...com, akpm@...ux-foundation.org,
 rafael@...nel.org, lenb@...nel.org, dan.j.williams@...el.com,
 Jonathan.Cameron@...wei.com, dave.jiang@...el.com, horen.chuang@...ux.dev,
 hannes@...xchg.org, linux-kernel@...r.kernel.org,
 linux-acpi@...r.kernel.org, linux-mm@...ck.org, kernel-team@...a.com,
 yunjeong.mun@...com
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for
 memoryless nodes

Hi Gregory,

On 3/5/2025 1:29 AM, Gregory Price wrote:
> On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
>> Actually, we're aware of this issue and currently trying to fix this.
>> In our system, we've attached 4ch of CXL memory for each socket as
>> follows.
>>
>>          node0             node1
>>        +-------+   UPI   +-------+
>>        | CPU 0 |-+-----+-| CPU 1 |
>>        +-------+         +-------+
>>        | DRAM0 |         | DRAM1 |
>>        +---+---+         +---+---+
>>            |                 |
>>        +---+---+         +---+---+
>>        | CXL 0 |         | CXL 4 |
>>        +---+---+         +---+---+
>>        | CXL 1 |         | CXL 5 |
>>        +---+---+         +---+---+
>>        | CXL 2 |         | CXL 6 |
>>        +---+---+         +---+---+
>>        | CXL 3 |         | CXL 7 |
>>        +---+---+         +---+---+
>>          node2             node3
>>
>> The 4ch of CXL memory are detected as a single NUMA node in each socket,
>> but it shows as follows with the current N_POSSIBLE loop.
>>
>> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
>> node0 node1 node2 node3 node4 node5
>> node6 node7 node8 node9 node10 node11
> 
> This is insufficient information for me to assess the correctness of the
> configuration. Can you please show the contents of your CEDT/CFMWS and
> SRAT/Memory Affinity structures?
> 
> mkdir acpi_data && cd acpi_data
> acpidump -b
> iasl -d *
> cat cedt.dsl  <- find all CFMWS entries
> cat srat.dsl  <- find all Memory Affinity entries

I'm not able to provide all the details as srat.dsl has too much info.

   $ wc -l srat.dsl
   25229 srat.dsl

Instead, I can show you that there are 4 different proximity domains
with "Enabled : 1", using the following filtered output from srat.dsl.

   $ grep -E "Proximity Domain :|Enabled : " srat.dsl | cut -c 31- | sed 'N;s/\n//' | sort | uniq
          Enabled : 0       Enabled : 0
   Proximity Domain : 00000000       Enabled : 0
   Proximity Domain : 00000000       Enabled : 1
   Proximity Domain : 00000001       Enabled : 1
   Proximity Domain : 00000006       Enabled : 1
   Proximity Domain : 00000007       Enabled : 1

We don't actually have to use those complicated commands to check this
as dmesg clearly prints the SRAT and node numbers as follows.

   [    0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
   [    0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
   [    0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
   [    0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
   [    0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
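
As a rough sketch, the node-to-PXM mapping can be pulled out of the
kernel log in one step. The helper name srat_node_map is hypothetical,
and the pattern assumes the "ACPI: SRAT: Node N PXM M" message format
shown above.

```shell
# Hypothetical helper: read kernel log text on stdin and print each
# unique "nodeN pxmM" pair found in "ACPI: SRAT: Node N PXM M" lines.
srat_node_map() {
    sed -n 's/.*ACPI: SRAT: Node \([0-9][0-9]*\) PXM \([0-9][0-9]*\) .*/node\1 pxm\2/p' |
        sort -u
}

# Usage on a live system (reading the kernel log may require root):
#   dmesg | srat_node_map
```

Lines that do not match the SRAT message format are dropped, so the
output only lists nodes that have a Memory Affinity range reported.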

The memoryless nodes are printed as follows, right after those
"ACPI: SRAT: Node N PXM M" messages.

   [    0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
   [    0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
   [    0.010992] Initmem setup node 2 as memoryless
   [    0.011055] Initmem setup node 3 as memoryless
   [    0.011115] Initmem setup node 4 as memoryless
   [    0.011177] Initmem setup node 5 as memoryless
   [    0.011238] Initmem setup node 6 as memoryless
   [    0.011299] Initmem setup node 7 as memoryless
   [    0.011361] Initmem setup node 8 as memoryless
   [    0.011422] Initmem setup node 9 as memoryless
   [    0.011484] Initmem setup node 10 as memoryless
   [    0.011544] Initmem setup node 11 as memoryless

This is why the current N_POSSIBLE loop creates sysfs knobs for all
12 nodes.
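
A minimal sketch of how the gap between possible nodes and nodes with
memory could be checked from userspace, assuming the standard
/sys/devices/system/node/possible and has_memory files. The function
names are hypothetical, and has_memory only reflects the state at the
time it is read, so hotplug nodes show up there once their memory has
been onlined.

```shell
# Expand a kernel node list such as "0-3,6" into one node ID per line.
expand_nodelist() {
    printf '%s\n' "$1" | tr ',' '\n' |
        awk -F- 'NF { lo = $1 + 0; hi = (NF > 1 ? $2 : $1) + 0
                      for (i = lo; i <= hi; i++) print i }'
}

# Print every node ID that is in the first list but not in the second,
# i.e. possible nodes that currently have no memory.
memoryless_nodes() {
    possible=$(expand_nodelist "$1")
    has_memory=$(expand_nodelist "$2")
    for n in $possible; do
        printf '%s\n' "$has_memory" | grep -qx "$n" || printf '%s\n' "$n"
    done
}

# Usage on a live system:
#   memoryless_nodes "$(cat /sys/devices/system/node/possible)" \
#                    "$(cat /sys/devices/system/node/has_memory)"
```

With illustrative inputs "0-11" and "0-3", this lists nodes 4 to 11,
which are the entries that a loop over N_POSSIBLE would still expose.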

> 
> Basically I need to know:
> 1) Is each CXL device on a dedicated Host Bridge?
> 2) Is inter-host-bridge interleaving configured?
> 3) Is intra-host-bridge interleaving configured?
> 4) Do SRAT entries exist for all nodes?

Are there some simple commands I can use to get that info?

> 5) Why are there 12 nodes but only 10 sources? Are there additional
>     devices left out of your diagram? Are there 2 CFMWS and 8 Memory
>     Affinity records - resulting in 10 nodes? This is strange.

My blind guess is that there could be a logical node that combines the
4ch of CXL memory, so there are 5 nodes per socket.  Adding 2 nodes for
local CPU/DRAM makes 12 nodes in total.
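
Spelled out as arithmetic, the guess looks like this. It is only a
sketch of the guess itself; the per-socket "combined" node is an
assumption, not something confirmed from the ACPI tables.

```shell
sockets=2
cxl_channels_per_socket=4
combined_nodes_per_socket=1   # assumption: one node combining the 4 channels
dram_nodes=2                  # local CPU/DRAM nodes, one per socket

total=$(( sockets * (cxl_channels_per_socket + combined_nodes_per_socket) + dram_nodes ))
echo "guessed node count: $total"
```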

> 
> By default, Linux creates a node for each proximity domain ("PXM")
> detected in the SRAT Memory Affinity tables. If SRAT entries for a
> memory region described in a CFMWS are absent, it will also create a
> node for that CFMWS.
> 
> Your reported configuration and results lead me to believe you have
> a combination of CFMWS/SRAT configurations that are unexpected.
> 
> ~Gregory

Not sure about this part, but our approach with hotplug_memory_notifier()
resolves this problem.  Rakie will submit an initial working patchset
soonish.

Thanks,
Honggyu
