Message-ID: <4d49ad5a-878a-44a1-a4d0-459e56924581@redhat.com>
Date: Tue, 30 Sep 2025 09:57:20 -0400
From: Waiman Long <llong@...hat.com>
To: Cai Xinchen <caixinchen1@...wei.com>, llong@...hat.com, tj@...nel.org,
hannes@...xchg.org, mkoutny@...e.com
Cc: cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
lujialin4@...wei.com
Subject: Re: [PATCH -next RFC 0/2] cpuset: Add cpuset.mems.spread_page to
cgroup v2
On 9/30/25 5:35 AM, Cai Xinchen wrote:
> I encountered a scenario where a machine with 1.5TB of memory,
> while running the Spark TPCDS 3TB dataset test, showed a significant
> concentration of page cache usage on one of the NUMA nodes.
> I discovered that the DataNode process had requested a large amount
> of page cache. Most of that page cache was concentrated in one NUMA
> node, ultimately exhausting the memory of that node. At that point,
> all other processes on that NUMA node had to allocate memory from
> other NUMA nodes, or even across sockets. This eventually degraded
> the end-to-end performance of the Spark test.
>
> I do not want to restart the Spark DataNode service during business
> operations. This issue can be resolved by migrating the DataNode into
> a cpuset, dropping the page cache, and setting cpuset.memory_spread_page
> so that it requests memory evenly across nodes. The core business
> threads can still allocate local NUMA memory. After enabling
> cpuset.memory_spread_page, performance in the tpcds-99 test improved
> by 2%.
>
> The key point is that evenly distributing the page cache of the
> DataNode process (rather than the current skewed NUMA distribution)
> does not significantly affect end-to-end performance, whereas
> allocating the memory of core business processes, such as the
> Executors, on the same NUMA node does have a noticeable impact on
> end-to-end performance.
>
> However, I found that cgroup v2 does not provide this interface. I
> believe this interface still holds value in addressing issues caused
> by the uneven distribution of page cache allocations among process
> groups.
>
> Thus I add cpuset.mems.spread_page to the cpuset v2 interface.
>
> Cai Xinchen (2):
> cpuset: Move cpuset1_update_spread_flag to cpuset
> cpuset: Add spread_page interface to cpuset v2
>
> kernel/cgroup/cpuset-internal.h | 6 ++--
> kernel/cgroup/cpuset-v1.c | 25 +----------------
> kernel/cgroup/cpuset.c | 49 ++++++++++++++++++++++++++++++++-
> 3 files changed, 51 insertions(+), 29 deletions(-)
>
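To make the v1 workflow described in the quoted message concrete
(migrate the DataNode into a cpuset, enable cpuset.memory_spread_page,
drop the clean page cache), here is a minimal, hypothetical C sketch.
The /sys/fs/cgroup/cpuset mount point and the "datanode" cpuset name
are assumptions; the child cpuset must already have cpuset.cpus and
cpuset.mems populated, and the program needs root:

#include <stdio.h>
#include <stdlib.h>

/* Write a short string to a control file and report any error. */
static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    int ret = 0;

    if (!f) {
        perror(path);
        return -1;
    }
    if (fprintf(f, "%s", val) < 0)
        ret = -1;
    if (fclose(f) == EOF)
        ret = -1;
    if (ret)
        perror(path);
    return ret;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <datanode-pid>\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* Move all threads of the DataNode process into the cpuset. */
    if (write_str("/sys/fs/cgroup/cpuset/datanode/cgroup.procs", argv[1]))
        return EXIT_FAILURE;

    /* Enable page cache spreading for every task in the cpuset. */
    if (write_str("/sys/fs/cgroup/cpuset/datanode/cpuset.memory_spread_page", "1"))
        return EXIT_FAILURE;

    /* Drop the clean page cache so it is re-faulted spread across nodes. */
    if (write_str("/proc/sys/vm/drop_caches", "1"))
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}

The same steps can of course be done from a shell by echoing into the
same control files.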
The spread_page flag is only used in filemap_alloc_folio_noprof() of
mm/filemap.c. When it is set, the code attempts to spread the
folio/page allocations across different nodes. As noted by Michal, it
is more or less equivalent to setting an MPOL_INTERLEAVE memory policy
with set_mempolicy(2) using the node mask of cpuset.mems.
set_mempolicy(2) offers finer, per-task granularity instead of applying
to all the tasks within a cpuset. Of course, it requires changing the
application rather than writing to an external cgroup control file.
cpuset.memory_spread_page is a legacy interface; we need good
justification if we want to enable it in cgroup v2.
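
For illustration only, a minimal sketch of that alternative, assuming a
hypothetical two-node mask and the <numaif.h> wrapper from libnuma
(build with -lnuma):

#include <numaif.h>     /* set_mempolicy(), MPOL_INTERLEAVE */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /*
     * Hypothetical mask: interleave across nodes 0 and 1. In practice
     * the mask should mirror the node list in the task's cpuset.mems.
     */
    unsigned long nodemask = 0x3;

    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                      sizeof(nodemask) * 8) != 0) {
        perror("set_mempolicy");
        return EXIT_FAILURE;
    }

    /*
     * Allocations made in this task's context from now on, including
     * the page cache folios it faults in, are interleaved across the
     * selected nodes.
     */
    return EXIT_SUCCESS;
}

Note that the policy is per thread and is only inherited by threads
created afterwards, so an already running multi-threaded service would
have to set it in every worker, which is exactly the application change
mentioned above.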
Cheers,
Longman