Message-ID: <9fdad98e-9042-4781-9d73-19f00266711b@redhat.com>
Date: Thu, 8 May 2025 20:30:01 -0400
From: Waiman Long <llong@...hat.com>
To: Xi Wang <xii@...gle.com>, Waiman Long <llong@...hat.com>
Cc: Frederic Weisbecker <frederic@...nel.org>, Tejun Heo <tj@...nel.org>,
 linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
 Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
 Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 David Rientjes <rientjes@...gle.com>, Mel Gorman <mgorman@...e.de>,
 Valentin Schneider <vschneid@...hat.com>,
 Johannes Weiner <hannes@...xchg.org>, Michal Koutný
 <mkoutny@...e.com>, Vlastimil Babka <vbabka@...e.cz>,
 Dan Carpenter <dan.carpenter@...aro.org>, Chen Yu <yu.c.chen@...el.com>,
 Kees Cook <kees@...nel.org>, Yu-Chun Lin <eleanor15x@...il.com>,
 Thomas Gleixner <tglx@...utronix.de>, Mickaël Salaün
 <mic@...ikod.net>, jiangshanlai@...il.com
Subject: Re: [RFC/PATCH] sched: Support moving kthreads into cpuset cgroups

On 5/8/25 6:39 PM, Xi Wang wrote:
> On Thu, May 8, 2025 at 12:35 PM Waiman Long <llong@...hat.com> wrote:
>> On 5/8/25 1:51 PM, Xi Wang wrote:
>>> I think our problem spaces are different. Perhaps your problems are closer to
>>> hard real-time systems, while ours are about improving the latency of existing
>>> systems while maintaining efficiency (maximum supported cpu utilization).
>>>
>>> For hard real-time systems we sometimes throw cores at the problem and run no
>>> more than one thread per cpu. But if we want efficiency we have to share cpus
>>> via scheduling policies. Disconnecting cpus from the scheduler with isolcpus
>>> results in losing too much of the machine's capacity. CPU scheduling is needed
>>> for both kernel and userspace threads.
>>>
>>> For our use case we need to move kernel threads away from certain vcpu threads,
>>> but other vcpu threads can share cpus with kernel threads. The ratio changes
>>> from time to time. Permanently putting aside a few cpus results in a reduction
>>> in machine capacity.
>>>
>>> The PF_NO_SETAFFINITY case is already handled by the patch. These threads will
>>> run in the root cgroup with their affinities unchanged, just like before.
>>>
>>> The original justification for the cpuset feature is quoted below, and the
>>> reasons are still applicable:
>>>
>>> "The management of large computer systems, with many processors (CPUs), complex
>>> memory cache hierarchies and multiple Memory Nodes having non-uniform access
>>> times (NUMA) presents additional challenges for the efficient scheduling and
>>> memory placement of processes."
>>>
>>> "But larger systems, which benefit more from careful processor and memory
>>> placement to reduce memory access times and contention.."
>>>
>>> "These subsets, or “soft partitions” must be able to be dynamically adjusted, as
>>> the job mix changes, without impacting other concurrently executing jobs."
>>>
>>> https://docs.kernel.org/admin-guide/cgroup-v1/cpusets.html
>>>
>>> -Xi
>>>
>> If you create a cpuset root partition, we are pushing some kthreads
>> away from the CPUs dedicated to the newly created partition, which has its
>> own scheduling domain separate from the cgroup root. I do realize that
>> the current way of excluding only per-cpu kthreads isn't quite right, so
>> I sent out a new patch to extend it to all the PF_NO_SETAFFINITY kthreads.
>>
>> So instead of putting kthreads into the dedicated cpuset, we keep them
>> in the root cgroup and create a separate cpuset partition to run the
>> workload without interference from the background kthreads. Will that
>> functionality suit your current need?
>>
>> Cheers,
>> Longman
>>
> It's likely a major improvement over a fixed partition but maybe still not fully
> flexible. I am not familiar with cpuset partitions but I wonder if the following
> case can be supported:
>
> Starting from
> 16 cpus
> Root has cpu 0-3, 8-15
> Job A has cpu 4-7 exclusive
> Kernel threads cannot run on cpu 4-7, which is good.
There will still be some kernel threads with the PF_NO_SETAFFINITY flag set.
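
To make the quoted layout concrete, it could be set up roughly like this with a
cgroup v2 cpuset partition (a sketch only; it assumes cgroup v2 is mounted at
/sys/fs/cgroup and the "jobA" name is illustrative):

```shell
# Enable the cpuset controller for children of the root cgroup.
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control

# Create a cgroup for job A and give it cpus 4-7.
mkdir /sys/fs/cgroup/jobA
echo "4-7" > /sys/fs/cgroup/jobA/cpuset.cpus

# Turn it into a partition root: cpus 4-7 become exclusive to jobA and
# drop out of the root's effective cpuset, which pushes movable kthreads
# off them. PF_NO_SETAFFINITY kthreads are the exception -- they stay
# wherever the kernel pinned them.
echo "root" > /sys/fs/cgroup/jobA/cpuset.cpus.partition

# Verify the partition actually became valid (should print "root",
# not "root invalid").
cat /sys/fs/cgroup/jobA/cpuset.cpus.partition
```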

>
> Now add a best-effort Job C, which runs under SCHED_IDLE and rarely enters kernel
> mode. Since we expect C to be easily preempted, we allow it to share cpus with A
> and with kernel threads to maximize throughput. Is there a layout that supports
> the requirements below?
>
> Job C threads on cpu 0-15

A task/thread can only be in one cpuset, so a single thread cannot span 
all the CPUs. However, if there are multiple threads within the process, 
some of them can be moved to a different cpuset once the cgroup is made 
threaded. With the proper thread placement, you can have a job whose 
threads collectively span all the CPUs.
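
Sketched with cgroup v2 threaded subtrees, that can look like the following
(the paths and the "jobC" name are illustrative, and the process must already
live inside the jobC subtree before its thread IDs can be written to
cgroup.threads):

```shell
# Two sibling cgroups under a common parent for the job; cpuset is a
# threaded controller in cgroup v2, so it can be split per thread.
mkdir -p /sys/fs/cgroup/jobC/wide /sys/fs/cgroup/jobC/narrow
echo "+cpuset" > /sys/fs/cgroup/jobC/cgroup.subtree_control

# Mark the children as threaded so individual threads, not whole
# processes, can be placed in them.
echo threaded > /sys/fs/cgroup/jobC/wide/cgroup.type
echo threaded > /sys/fs/cgroup/jobC/narrow/cgroup.type

# Different cpu sets per thread group.
echo "0-15" > /sys/fs/cgroup/jobC/wide/cpuset.cpus
echo "4-7"  > /sys/fs/cgroup/jobC/narrow/cpuset.cpus

# Move one thread of the job (by TID) into a threaded child.
echo "$TID" > /sys/fs/cgroup/jobC/wide/cgroup.threads
```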

Cheers,
Longman

> Job A threads on cpu 4-7
> No kernel threads on cpu 4-7
>
> -Xi
>

