Message-ID: <CACSyD1MhCaAzycSUSQfirLaLp22mcabVr3jfaRbJqFRkX2VoFw@mail.gmail.com>
Date: Thu, 19 Jun 2025 11:49:58 +0800
From: Zhongkun He <hezhongkun.hzk@...edance.com>
To: Michal Koutný <mkoutny@...e.com>
Cc: Tejun Heo <tj@...nel.org>, Waiman Long <llong@...hat.com>, cgroups@...r.kernel.org, 
	linux-kernel@...r.kernel.org, muchun.song@...ux.dev
Subject: Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems
 setting option

On Wed, Jun 18, 2025 at 5:05 PM Michal Koutný <mkoutny@...e.com> wrote:
>
> On Wed, Jun 18, 2025 at 10:46:02AM +0800, Zhongkun He <hezhongkun.hzk@...edance.com> wrote:
> > It is unnecessary to adjust memory affinity periodically from userspace,
> > as it is a costly operation.
>
> It'd always be costly when there's lots of data to migrate.
>
> > Instead, we need to shrink cpuset.mems to explicitly specify the NUMA
> > node from which newly allocated pages should come, and then migrate the
> > existing pages gradually from userspace, or let NUMA balancing move them.
>
> IIUC, the issue is that there's no set_mempolicy(2) for 3rd party
> threads (it only operates on current) OR that the migration path should
> be optimized to avoid those latencies -- do you know what the
> contention point is?

Hi Michal,

In our scenario, when we shrink the allowed cpuset.mems, for example
from nodes 1-3 to just nodes 2-3, there may still be a large number of
pages residing on node 1. Currently, modifying cpuset.mems triggers
synchronous memory migration, which results in prolonged and
unacceptable service downtime under cgroup v2. This behavior has become
a major blocker for our adoption of cgroup v2.
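
For context, the kind of userspace pass we have in mind is sketched
below. It is only an illustration of the idea, not our actual tooling:
the node numbers and the target pid handling are assumptions for the
example, and there is no throttling here. It uses migrate_pages(2),
which (unlike set_mempolicy(2)) can act on another task:

/*
 * Minimal sketch: move an existing task's pages off node 1 onto nodes
 * 2-3 from userspace at our own pace, instead of relying on the
 * migration done by the cpuset.mems write. Moving pages of an
 * unrelated task needs CAP_SYS_NICE.
 * Build with: gcc -o offnode1 offnode1.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	unsigned long old_nodes = 1UL << 1;                 /* node 1    */
	unsigned long new_nodes = (1UL << 2) | (1UL << 3);  /* nodes 2-3 */
	unsigned long maxnode = 8 * sizeof(unsigned long);

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	/* Returns the number of pages that could not be moved, or -1. */
	if (migrate_pages(atoi(argv[1]), maxnode,
			  &old_nodes, &new_nodes) < 0) {
		perror("migrate_pages");
		return 1;
	}

	return 0;
}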

Tejun suggested adding an interface to control the migration rate, and
I plan to try that later. However, we believe that the
cpuset.memory_migrate interface in cgroup v1 is also sufficient for our
use case and is easier to work with. :)
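
To make the v1 flow concrete (the mount point and the "mygroup" name
below are assumptions for illustration, not a real setup): with
cpuset.memory_migrate left at its default of 0, shrinking cpuset.mems
only redirects new allocations, and the pages already on node 1 can be
moved later at our own pace, e.g. with the migrate_pages() sketch above.

#include <stdio.h>

/* Minimal sketch of the cgroup v1 sequence we rely on today. */
static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* memory_migrate is 0 by default, so the mems write below does
	 * not migrate existing pages and returns quickly. */
	write_str("/sys/fs/cgroup/cpuset/mygroup/cpuset.memory_migrate", "0");

	/* Shrink the allowed nodes; only new allocations are affected. */
	write_str("/sys/fs/cgroup/cpuset/mygroup/cpuset.mems", "2-3");

	return 0;
}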

Thanks,
Zhongkun

>
> Thanks,
> Michal
