Message-ID: <f64819e2-8dc6-4907-b8bf-faec66eecd0e@sk.com>
Date: Thu, 6 Mar 2025 21:39:26 +0900
From: Honggyu Kim <honggyu.kim@...com>
To: Gregory Price <gourry@...rry.net>
Cc: kernel_team@...ynix.com, Joshua Hahn <joshua.hahnjy@...il.com>,
harry.yoo@...cle.com, ying.huang@...ux.alibaba.com,
gregkh@...uxfoundation.org, rakie.kim@...com, akpm@...ux-foundation.org,
rafael@...nel.org, lenb@...nel.org, dan.j.williams@...el.com,
Jonathan.Cameron@...wei.com, dave.jiang@...el.com, horen.chuang@...ux.dev,
hannes@...xchg.org, linux-kernel@...r.kernel.org,
linux-acpi@...r.kernel.org, linux-mm@...ck.org, kernel-team@...a.com,
yunjeong.mun@...com
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for
memoryless nodes
Hi Gregory,
On 3/5/2025 1:29 AM, Gregory Price wrote:
> On Thu, Feb 27, 2025 at 11:32:26AM +0900, Honggyu Kim wrote:
>> Actually, we're aware of this issue and currently trying to fix this.
>> In our system, we've attached 4ch of CXL memory for each socket as
>> follows.
>>
>> node0 node1
>> +-------+ UPI +-------+
>> | CPU 0 |-+-----+-| CPU 1 |
>> +-------+ +-------+
>> | DRAM0 | | DRAM1 |
>> +---+---+ +---+---+
>> | |
>> +---+---+ +---+---+
>> | CXL 0 | | CXL 4 |
>> +---+---+ +---+---+
>> | CXL 1 | | CXL 5 |
>> +---+---+ +---+---+
>> | CXL 2 | | CXL 6 |
>> +---+---+ +---+---+
>> | CXL 3 | | CXL 7 |
>> +---+---+ +---+---+
>> node2 node3
>>
>> The 4ch of CXL memory are detected as a single NUMA node in each socket,
>> but it shows as follows with the current N_POSSIBLE loop.
>>
>> $ ls /sys/kernel/mm/mempolicy/weighted_interleave/
>> node0 node1 node2 node3 node4 node5
>> node6 node7 node8 node9 node10 node11
>
> This is insufficient information for me to assess the correctness of the
> configuration. Can you please show the contents of your CEDT/CFMWS and
> SRAT/Memory Affinity structures?
>
> mkdir acpi_data && cd acpi_data
> acpidump -b
> iasl -d *
> cat cedt.dsl <- find all CFMWS entries
> cat srat.dsl <- find all Memory Affinity entries
I'm not able to provide all the details as srat.dsl has too much info.
$ wc -l srat.dsl
25229 srat.dsl
Instead, I can show you that there are 4 different proximity domains
with "Enabled : 1" in the following filtered output from srat.dsl.

$ grep -E "Proximity Domain :|Enabled : " srat.dsl | cut -c 31- | sed 'N;s/\n//' | sort | uniq
Enabled : 0 Enabled : 0
Proximity Domain : 00000000 Enabled : 0
Proximity Domain : 00000000 Enabled : 1
Proximity Domain : 00000001 Enabled : 1
Proximity Domain : 00000006 Enabled : 1
Proximity Domain : 00000007 Enabled : 1
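Since the sed-based pairing above came out a bit mangled, the same filter can be written as a small awk sketch. This runs on inlined sample text that mimics the srat.dsl field layout shown above, not on the real dump:

```shell
# Pair each "Proximity Domain" value with the "Enabled" flag that follows it
# and print only the enabled ones. Sample lines mimic iasl's srat.dsl output.
srat_sample='Proximity Domain : 00000000
Enabled : 1
Proximity Domain : 00000001
Enabled : 1
Proximity Domain : 00000006
Enabled : 0'

printf '%s\n' "$srat_sample" |
awk -F' : ' '
    /Proximity Domain/ { pd = $2 }
    /Enabled/          { if ($2 + 0 == 1) print "enabled PXM " pd }'
# prints:
#   enabled PXM 00000000
#   enabled PXM 00000001
```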
We don't actually have to use those complicated commands to check this,
as dmesg clearly prints the SRAT entries and node numbers as follows.
[ 0.009915] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[ 0.009917] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x207fffffff]
[ 0.009919] ACPI: SRAT: Node 1 PXM 1 [mem 0x60f80000000-0x64f7fffffff]
[ 0.009924] ACPI: SRAT: Node 2 PXM 6 [mem 0x2080000000-0x807fffffff] hotplug
[ 0.009925] ACPI: SRAT: Node 3 PXM 7 [mem 0x64f80000000-0x6cf7fffffff] hotplug
The memoryless nodes are printed as follows after those "ACPI: SRAT:
Node N PXM M" messages.
[ 0.010927] Initmem setup node 0 [mem 0x0000000000001000-0x000000207effffff]
[ 0.010930] Initmem setup node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
[ 0.010992] Initmem setup node 2 as memoryless
[ 0.011055] Initmem setup node 3 as memoryless
[ 0.011115] Initmem setup node 4 as memoryless
[ 0.011177] Initmem setup node 5 as memoryless
[ 0.011238] Initmem setup node 6 as memoryless
[ 0.011299] Initmem setup node 7 as memoryless
[ 0.011361] Initmem setup node 8 as memoryless
[ 0.011422] Initmem setup node 9 as memoryless
[ 0.011484] Initmem setup node 10 as memoryless
[ 0.011544] Initmem setup node 11 as memoryless
This is related to why the 12 node knobs are provided in sysfs with the
current N_POSSIBLE loop.
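For what it's worth, the mismatch can be tallied mechanically from the Initmem lines above; a throwaway sketch over that quoted log text (inlined here, not read from a live system):

```shell
# Count total vs. memoryless nodes from the "Initmem setup" lines quoted
# above; with the current N_POSSIBLE loop each of the 12 gets a sysfs knob.
initmem_log='node 0 [mem 0x0000000000001000-0x000000207effffff]
node 1 [mem 0x0000060f80000000-0x0000064f7fffffff]
node 2 as memoryless
node 3 as memoryless
node 4 as memoryless
node 5 as memoryless
node 6 as memoryless
node 7 as memoryless
node 8 as memoryless
node 9 as memoryless
node 10 as memoryless
node 11 as memoryless'

total=$(printf '%s\n' "$initmem_log" | wc -l)
memless=$(printf '%s\n' "$initmem_log" | grep -c 'as memoryless')
echo "$total possible nodes, $memless memoryless"
# prints: 12 possible nodes, 10 memoryless
```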
>
> Basically I need to know:
> 1) Is each CXL device on a dedicated Host Bridge?
> 2) Is inter-host-bridge interleaving configured?
> 3) Is intra-host-bridge interleaving configured?
> 4) Do SRAT entries exist for all nodes?
Are there some simple commands I can use to get that info?
> 5) Why are there 12 nodes but only 10 sources? Are there additional
> devices left out of your diagram? Are there 2 CFMWS and 8 Memory
> Affinity records - resulting in 10 nodes? This is strange.
My blind guess is that there could be a logical node that combines the
4ch of CXL memory, so there are 5 nodes per socket. Adding 2 nodes for
local CPU/DRAM makes 12 nodes in total.
>
> By default, Linux creates a node for each proximity domain ("PXM")
> detected in the SRAT Memory Affinity tables. If SRAT entries for a
> memory region described in a CFMWS is absent, it will also create a
> node for that CFMWS.
>
> Your reported configuration and results lead me to believe you have
> a combination of CFMWS/SRAT configurations that are unexpected.
>
> ~Gregory
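On the CFMWS-vs-SRAT counting idea: once iasl has decoded the tables, the windows can be counted straight from the text dumps. A sketch over sample lines (I'm assuming iasl labels CFMWS subtables with the wording below; adjust the pattern to whatever your cedt.dsl actually prints):

```shell
# Count CFMWS windows in a decoded CEDT; sample text stands in for cedt.dsl.
cedt_sample='Subtable Type : 01 [CXL Fixed Memory Window Structure]
Window base address : 0000002080000000
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Window base address : 0000064F80000000'

printf '%s\n' "$cedt_sample" | grep -c 'CXL Fixed Memory Window'
# prints: 2
```

Comparing that count against the number of enabled SRAT Memory Affinity domains should show whether every window has a matching SRAT entry.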
Not sure about this part, but our approach with hotplug_memory_notifier()
resolves this problem. Rakie will submit an initial working patchset
soon.
Thanks,
Honggyu