Message-ID: <4d49ad5a-878a-44a1-a4d0-459e56924581@redhat.com>
Date: Tue, 30 Sep 2025 09:57:20 -0400
From: Waiman Long <llong@...hat.com>
To: Cai Xinchen <caixinchen1@...wei.com>, llong@...hat.com, tj@...nel.org,
hannes@...xchg.org, mkoutny@...e.com
Cc: cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
lujialin4@...wei.com
Subject: Re: [PATCH -next RFC 0/2] cpuset: Add cpuset.mems.spread_page to
cgroup v2
On 9/30/25 5:35 AM, Cai Xinchen wrote:
> I encountered a scenario where a machine with 1.5TB of memory,
> while running the Spark TPCDS 3TB dataset test, showed a significant
> concentration of page cache usage on one of the NUMA nodes.
> I discovered that the DataNode process had requested a large amount
> of page cache. Most of that page cache was concentrated in one NUMA
> node, ultimately exhausting the memory of that node. At that point,
> all other processes on that NUMA node had to allocate memory from
> other NUMA nodes, or even across sockets. This eventually degraded
> the end-to-end performance of the Spark test.
>
> I do not want to restart the Spark DataNode service during business
> operations. This issue can be resolved by migrating the DataNode into
> a cpuset, dropping the page cache, and setting cpuset.memory_spread_page
> so that it requests memory evenly across nodes. The core business
> threads can still allocate local NUMA memory. After enabling
> cpuset.memory_spread_page, performance in the tpcds-99 test improved
> by 2%.
>
> The key point is that evenly distributing the page cache of the
> DataNode process (rather than the current skewed NUMA distribution)
> does not significantly affect end-to-end performance, whereas
> allocating the memory of core business processes, such as the
> Executors, on the same NUMA node does have a noticeable impact on
> end-to-end performance.
>
> However, I found that cgroup v2 does not provide this interface. I
> believe this interface still holds value in addressing issues caused
> by the uneven distribution of page cache allocations among process
> groups.
>
> Thus I add cpuset.mems.spread_page to the cpuset v2 interface.
>
> Cai Xinchen (2):
> cpuset: Move cpuset1_update_spread_flag to cpuset
> cpuset: Add spread_page interface to cpuset v2
>
> kernel/cgroup/cpuset-internal.h | 6 ++--
> kernel/cgroup/cpuset-v1.c | 25 +----------------
> kernel/cgroup/cpuset.c | 49 ++++++++++++++++++++++++++++++++-
> 3 files changed, 51 insertions(+), 29 deletions(-)
>
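To make the v1 workflow described in the quoted message concrete
(migrate the DataNode into a cpuset, enable cpuset.memory_spread_page,
drop the clean page cache), here is a minimal, hypothetical C sketch.
The /sys/fs/cgroup/cpuset mount point and the "datanode" cpuset name
are assumptions; the child cpuset must already have cpuset.cpus and
cpuset.mems populated, and the program needs root:

#include <stdio.h>
#include <stdlib.h>

/* Write a short string to a control file and report any error. */
static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    int ret = 0;

    if (!f) {
        perror(path);
        return -1;
    }
    if (fprintf(f, "%s", val) < 0)
        ret = -1;
    if (fclose(f) == EOF)
        ret = -1;
    if (ret)
        perror(path);
    return ret;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <datanode-pid>\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* Move all threads of the DataNode process into the cpuset. */
    if (write_str("/sys/fs/cgroup/cpuset/datanode/cgroup.procs", argv[1]))
        return EXIT_FAILURE;

    /* Enable page cache spreading for every task in the cpuset. */
    if (write_str("/sys/fs/cgroup/cpuset/datanode/cpuset.memory_spread_page", "1"))
        return EXIT_FAILURE;

    /* Drop the clean page cache so it is re-faulted spread across nodes. */
    if (write_str("/proc/sys/vm/drop_caches", "1"))
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}

The same steps can of course be done from a shell by echoing into the
same control files.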
The spread_page flag is only used in filemap_alloc_folio_noprof() of
mm/filemap.c. When it is set, the code attempts to spread the
folio/page allocations across different nodes. As noted by Michal, it
is more or less equivalent to setting an MPOL_INTERLEAVE memory policy
with set_mempolicy(2) using the node mask of cpuset.mems.
set_mempolicy(2) offers finer, per-task granularity instead of applying
to all the tasks within a cpuset. Of course, it requires changing the
application rather than writing to an external cgroup control file.
cpuset.memory_spread_page is a legacy interface; we need good
justification if we want to enable it in cgroup v2.
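
For illustration only, a minimal sketch of that alternative, assuming a
hypothetical two-node mask and the <numaif.h> wrapper from libnuma
(build with -lnuma):

#include <numaif.h>     /* set_mempolicy(), MPOL_INTERLEAVE */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /*
     * Hypothetical mask: interleave across nodes 0 and 1. In practice
     * the mask should mirror the node list in the task's cpuset.mems.
     */
    unsigned long nodemask = 0x3;

    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                      sizeof(nodemask) * 8) != 0) {
        perror("set_mempolicy");
        return EXIT_FAILURE;
    }

    /*
     * Allocations made in this task's context from now on, including
     * the page cache folios it faults in, are interleaved across the
     * selected nodes.
     */
    return EXIT_SUCCESS;
}

Note that the policy is per thread and is only inherited by threads
created afterwards, so an already running multi-threaded service would
have to set it in every worker, which is exactly the application change
mentioned above.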
Cheers,
Longman