linux-kernel - Re: [PATCH] lib/cpumask: Boot option to disable tasks distribution within cpumask

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZjE3C9UgeZR02Jyy@yury-ThinkPad>
Date: Tue, 30 Apr 2024 11:23:07 -0700
From: Yury Norov <yury.norov@...il.com>
To: Ankit Jain <ankit-aj.jain@...adcom.com>
Cc: linux@...musvillemoes.dk, akpm@...ux-foundation.org,
	linux-kernel@...r.kernel.org, juri.lelli@...hat.com,
	pauld@...hat.com, ajay.kaher@...adcom.com,
	alexey.makhalov@...adcom.com, vasavi.sirnapalli@...adcom.com,
	Paul Turner <pjt@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Daniel Bristot de Oliveira <bristot@...hat.com>,
	Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] lib/cpumask: Boot option to disable tasks distribution
 within cpumask

On Tue, Apr 30, 2024 at 02:34:31PM +0530, Ankit Jain wrote:
> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> introduced the logic to distribute the tasks within cpumask upon initial
> wakeup.

So let's add the authors in CC list?

> For Telco RAN deployments, isolcpus are a necessity to cater to
> the requirement of low latency applications. These isolcpus are generally
> tickless so that high priority SCHED_FIFO tasks can execute without any
> OS jitter. Since load balancing is disabled on isocpus, any task
> which gets placed on these CPUs can not be migrated on its own.
> For RT applications to execute on isolcpus, a guaranteed kubernetes pod
> with all isolcpus becomes the requirement and these RT applications are
> affine to execute on a specific isolcpu within the kubernetes pod.
> However, there may be some non-RT tasks which could also schedule in the
> same kubernetes pod without being affine to any specific CPU(inherits the
> pod cpuset affinity).

OK... It looks like adding scheduler maintainers is also a necessity to
cater here...

> With multiple spawning and running containers inside
> the pod, container runtime spawns several non-RT initializing tasks
> ("runc init") inside the pod and due to above mentioned commits, these
> non-RT tasks may get placed on any isolcpus and may starve if it happens
> to wakeup on the same CPU as SCHED_FIFO task because RT throttling is also
> disabled in telco setup. Thus, RAN deployment fails and eventually leads
> to system hangs.

Not that I'm familiar to your setup, but this sounds like a userspace
configuration problems. Can you try to move your non-RT tasks into a
cgroup attached to non-RT CPUs, or something like that? 

> With the introduction of kernel cmdline param 'sched_pick_firstcpu',
> there is an option provided for such usecases to disable the distribution
> of tasks within the cpumask logic and use the previous 'pick first cpu'
> approach for initial placement of tasks. Because many telco vendors
> configure the system in such a way that the first cpu within a cpuset
> of pod doesn't run any SCHED_FIFO or High priority tasks.
> 
> Co-developed-by: Alexey Makhalov <alexey.makhalov@...adcom.com>
> Signed-off-by: Alexey Makhalov <alexey.makhalov@...adcom.com>
> Signed-off-by: Ankit Jain <ankit-aj.jain@...adcom.com>
> ---
>  lib/cpumask.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index e77ee9d46f71..3dea87d5ec1f 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>  }
>  EXPORT_SYMBOL(cpumask_local_spread);
>  
> +/*
> + * Task distribution within the cpumask feature disabled?
> + */
> +static bool cpumask_pick_firstcpu __read_mostly;
> +
> +/*
> + * Disable Tasks distribution within the cpumask feature
> + */
> +static int __init cpumask_pick_firstcpu_setup(char *str)
> +{
> +	cpumask_pick_firstcpu = 1;
> +	pr_info("cpumask: Tasks distribution within cpumask is disabled.");
> +	return 1;
> +}
> +
> +__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
> +
>  static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
>  
>  /**
> @@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
>  {
>  	unsigned int next, prev;
>  
> +	/*
> +	 * Don't distribute, if tasks distribution
> +	 * within cpumask feature is disabled
> +	 */
> +	if (cpumask_pick_firstcpu)
> +		return cpumask_any_and(src1p, src2p);

No, this is a wrong way.

To begin with, this parameter shouldn't control a single random
function. At least, the other cpumask_*_distribute() should be
consistent to the policy.

But in general... I don't think we should do things like that at all.
Cpumask API is a simple and plain wrapper around bitmaps. If you want
to modify a behavior of the scheduler, you could do that at scheduler
level, not in a random helper function.

Consider 2 cases:
 - Someone unrelated to scheduler would use the same helper and will
   be affected by this parameter inadvertently.
 - Scheduler will switch to using another function to distribute CPUs,
   and your setups will suddenly get broken again. This time deeply in
   production.

Thanks,
Yury

>  	/* NOTE: our first selection will skip 0. */
>  	prev = __this_cpu_read(distribute_cpu_mask_prev);
>  
> -- 
> 2.23.1