Message-ID: <20250930093552.2842885-1-caixinchen1@huawei.com>
Date: Tue, 30 Sep 2025 09:35:50 +0000
From: Cai Xinchen <caixinchen1@...wei.com>
To: <llong@...hat.com>, <tj@...nel.org>, <hannes@...xchg.org>,
<mkoutny@...e.com>
CC: <cgroups@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<lujialin4@...wei.com>, <caixinchen1@...wei.com>
Subject: [PATCH -next RFC 0/2] cpuset: Add cpuset.mems.spread_page to cgroup v2
While testing the Spark TPCDS 3TB dataset on a machine with 1.5TB of
memory, I observed a significant concentration of page cache usage on
one of the NUMA nodes. The DataNode process had requested a large
amount of page cache, and most of it was concentrated on a single NUMA
node, which eventually exhausted the memory of that node. From that
point on, all other processes on that node had to allocate memory from
other NUMA nodes, or even from other sockets. This eventually degraded
the end-to-end performance of the Spark test.
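One way to observe this kind of imbalance is to compare the FilePages
counter of each NUMA node. The following is only a minimal sketch, not
the method used for the measurements above. It assumes the per-node
files /sys/devices/system/node/nodeN/meminfo are available; the helper
names file_pages_kb and file_pages_per_node are made up for this
illustration.

```python
import re
from pathlib import Path


def file_pages_kb(node_meminfo_text):
    """Return the FilePages value (in kB) from one node's meminfo text."""
    match = re.search(
        r"^Node\s+\d+\s+FilePages:\s+(\d+)\s+kB",
        node_meminfo_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no FilePages line found")
    return int(match.group(1))


def file_pages_per_node(node_root="/sys/devices/system/node"):
    """Map NUMA node id -> page cache size in kB (hypothetical helper)."""
    result = {}
    for meminfo in Path(node_root).glob("node[0-9]*/meminfo"):
        # The directory is named "node<N>"; strip the "node" prefix.
        node_id = int(meminfo.parent.name[len("node"):])
        result[node_id] = file_pages_kb(meminfo.read_text())
    return result
```

A node whose value is far larger than the others, with little free
memory left on it, would match the situation described here.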
I do not want to restart the Spark DataNode service during business
operations. The issue can be resolved by migrating the DataNode into
a cpuset, dropping the cache, and setting cpuset.memory_spread_page so
that its page cache is allocated evenly across nodes. The core business
threads can still allocate local NUMA memory. With
cpuset.memory_spread_page set, performance in the tpcds-99 test
improved by 2%.
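The three steps above can be sketched roughly as follows for a cgroup
v1 cpuset. This is an illustration under assumptions, not the exact
procedure that was run: it assumes the cpuset directory already exists
with cpuset.cpus and cpuset.mems configured, that the caller is root,
and that writing 1 to /proc/sys/vm/drop_caches (page cache only) is
sufficient. The function name spread_page_cache is made up for this
example.

```python
from pathlib import Path


def spread_page_cache(cpuset_dir, pid,
                      drop_caches="/proc/sys/vm/drop_caches"):
    """Migrate a process into a v1 cpuset, drop cache, enable spreading."""
    cpuset = Path(cpuset_dir)

    # 1. Move the whole process (all of its threads) into the cpuset.
    (cpuset / "cgroup.procs").write_text(f"{pid}\n")

    # 2. Drop the clean page cache so it gets repopulated afterwards.
    Path(drop_caches).write_text("1\n")

    # 3. Spread future page cache allocations over the cpuset's nodes.
    (cpuset / "cpuset.memory_spread_page").write_text("1\n")
```

It could be called as spread_page_cache("/sys/fs/cgroup/cpuset/datanode",
12345), where both the cpuset path and the PID are placeholders.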
The key point is that spreading the DataNode's page cache evenly
across NUMA nodes (rather than keeping the current distribution) does
not significantly affect end-to-end performance. In contrast, keeping
the allocations of core business processes, such as Executors, on the
same NUMA node does have a noticeable impact on end-to-end performance.
cgroup v2, however, does not provide this interface. I believe it
still holds value for addressing issues caused by uneven distribution
of page cache allocations among process groups. This series therefore
adds cpuset.mems.spread_page to the cpuset v2 interface.
Cai Xinchen (2):
  cpuset: Move cpuset1_update_spread_flag to cpuset
  cpuset: Add spread_page interface to cpuset v2

 kernel/cgroup/cpuset-internal.h |  6 ++--
 kernel/cgroup/cpuset-v1.c       | 25 +----------------
 kernel/cgroup/cpuset.c          | 49 ++++++++++++++++++++++++++++++++-
 3 files changed, 51 insertions(+), 29 deletions(-)

--
2.34.1