Message-ID: <71c45a3d-747a-4e47-8c65-2b982656ab3a@redhat.com>
Date: Fri, 9 May 2025 10:08:20 -0400
From: Waiman Long <llong@...hat.com>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: Tejun Heo <tj@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
 Michal Koutný <mkoutny@...e.com>, cgroups@...r.kernel.org,
 linux-kernel@...r.kernel.org, Xi Wang <xii@...gle.com>
Subject: Re: [PATCH v2] cgroup/cpuset: Extend kthread_is_per_cpu() check to
 all PF_NO_SETAFFINITY tasks

On 5/9/25 9:18 AM, Frederic Weisbecker wrote:
> On Thu, May 08, 2025 at 03:24:13PM -0400, Waiman Long wrote:
>> Commit ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask()
>> on top_cpuset") enabled us to pull CPUs dedicated to child partitions
>> from tasks in top_cpuset by ignoring per cpu kthreads. However, there
>> can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY
>> flag set to indicate that we shouldn't mess with their CPU affinity.
>> For other kthreads, their affinity will be changed to skip CPUs dedicated
>> to child partitions whether it is an isolating or a scheduling one.
>>
>> As all the per cpu kthreads have PF_NO_SETAFFINITY set, the
>> PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads.
>> Fix this issue by dropping the kthread_is_per_cpu() check and checking
>> the PF_NO_SETAFFINITY flag instead.
>>
>> Fixes: ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset")
>> Signed-off-by: Waiman Long <longman@...hat.com>
>> ---
>>   kernel/cgroup/cpuset.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index d0143b3dce47..967603300ee3 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1130,9 +1130,11 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
>>   
>>   		if (top_cs) {
>>   			/*
>> -			 * Percpu kthreads in top_cpuset are ignored
>> +			 * PF_NO_SETAFFINITY tasks are ignored.
>> +			 * All per cpu kthreads should have PF_NO_SETAFFINITY
>> +			 * flag set, see kthread_set_per_cpu().
>>   			 */
>> -			if (kthread_is_per_cpu(task))
>> +			if (task->flags & PF_NO_SETAFFINITY)
>>   				continue;
>>   			cpumask_andnot(new_cpus, possible_mask, subpartitions_cpus);
> Acked-by: Frederic Weisbecker <frederic@...nel.org>
>
> But this makes me realize I overlooked that when I introduced the unbound kthreads
> centralized affinity.
>
> cpuset_update_tasks_cpumask() seems to blindly affine to subpartitions_cpus
> while unbound kthreads might have their preferences (per-nodes or random cpumasks).
>
> So I need to make that pass through kthread API.
AFAIU, the kthread_bind_mask() and kthread_bind() functions will
set PF_NO_SETAFFINITY.
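
To make the flag relationship concrete, here is a small userspace model, not kernel code. The PF_* values mirror include/linux/sched.h as far as I know; struct model_task, model_kthread_bind_mask() and model_update_tasks_cpumask() are hypothetical names made up for this sketch, and CPU masks are plain 64-bit bitmaps rather than struct cpumask.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag values mirrored from include/linux/sched.h (assumed, illustration only). */
#define PF_KTHREAD        0x00200000u
#define PF_NO_SETAFFINITY 0x04000000u

/* Hypothetical stand-in for struct task_struct. */
struct model_task {
	unsigned int flags;
	uint64_t cpus_mask;	/* bit n set => task may run on CPU n */
};

/* Models binding a kthread: pin its mask and mark it as not re-affinable. */
static void model_kthread_bind_mask(struct model_task *t, uint64_t mask)
{
	t->cpus_mask = mask;
	t->flags |= PF_NO_SETAFFINITY;
}

/*
 * Models the top_cpuset branch of the patched loop: tasks with
 * PF_NO_SETAFFINITY are skipped, every other task is moved off the
 * CPUs handed out to child partitions.
 * Returns the number of tasks whose mask was rewritten.
 */
static size_t model_update_tasks_cpumask(struct model_task *tasks, size_t n,
					 uint64_t possible_mask,
					 uint64_t subpartitions_cpus)
{
	size_t updated = 0;

	for (size_t i = 0; i < n; i++) {
		if (tasks[i].flags & PF_NO_SETAFFINITY)
			continue;
		tasks[i].cpus_mask = possible_mask & ~subpartitions_cpus;
		updated++;
	}
	return updated;
}
```

With 8 possible CPUs and CPUs 2-3 given to a child partition, a kthread bound to CPUs 2-3 through the model bind helper keeps its mask, while an unbound kthread and a user task both end up on CPUs 0-1 and 4-7.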
>
> It seems that subpartitions_cpus doesn't contain nohz_full= CPUs.
> But it excludes isolcpus=. And it's usually sane to assume that
> nohz_full= CPUs are isolated.
Most users that want isolated CPUs will set both isolcpus and nohz_full 
to the same set of CPUs. I do see that RH OpenShift can set nohz_full 
for a collection of CPUs that may be dynamically isolated later on via 
cpuset partition.
>
> I think I can just rename update_unbound_workqueue_cpumask()
> to update_unbound_kthreads_cpumask() and then handle unbound
> kthreads from there along with workqueues. And then completely
> ignore kthreads from cpuset_update_tasks_cpumask().

I guess we can do that. Right now, update_unbound_workqueue_cpumask() is
only called to exclude isolated CPUs, whereas
cpuset_update_tasks_cpumask() will update affinity for both isolated
and scheduling partitions. I agree that there is code duplication here.
To suit Xi Wang's use case, we may have to add a sysctl parameter, for
instance, to decide if we have to update unbound kthreads in the
scheduling partition case.
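
As a sketch of that idea only, the decision could look roughly like the following. Nothing here exists in the kernel: the enum, the function name and the knob are all hypothetical, and the real interface (sysctl or otherwise) is still open.

```c
#include <stdbool.h>

/* Hypothetical partition kinds, named for this sketch only. */
enum model_partition_type {
	MODEL_PARTITION_ISOLATED,	/* isolating partition */
	MODEL_PARTITION_SCHEDULING,	/* scheduling partition */
};

/*
 * Decide whether unbound kthreads should be moved off a partition's CPUs.
 * Isolated partitions always exclude them; scheduling partitions do so
 * only when the (hypothetical) knob is enabled.
 */
static bool model_should_update_unbound_kthreads(enum model_partition_type type,
						 bool sched_partition_knob)
{
	if (type == MODEL_PARTITION_ISOLATED)
		return true;
	return sched_partition_knob;
}
```

With the knob off, only the isolated case would touch unbound kthreads, which matches what update_unbound_workqueue_cpumask() does for workqueues today.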

Cheers,
Longman

> Let me think about it (but feel free to apply the current patch meanwhile).
>
> Thanks.
>