[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240501133608.GB39737@lorien.usersys.redhat.com>
Date: Wed, 1 May 2024 09:36:08 -0400
From: Phil Auld <pauld@...hat.com>
To: Yury Norov <yury.norov@...il.com>
Cc: Ankit Jain <ankit-aj.jain@...adcom.com>, linux@...musvillemoes.dk,
akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
juri.lelli@...hat.com, ajay.kaher@...adcom.com,
alexey.makhalov@...adcom.com, vasavi.sirnapalli@...adcom.com,
Paul Turner <pjt@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] lib/cpumask: Boot option to disable tasks distribution
within cpumask
Hi Yuri,
On Tue, Apr 30, 2024 at 11:23:07AM -0700 Yury Norov wrote:
> On Tue, Apr 30, 2024 at 02:34:31PM +0530, Ankit Jain wrote:
> > commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> > and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> > introduced the logic to distribute the tasks within cpumask upon initial
> > wakeup.
>
> So let's add the authors in CC list?
>
> > For Telco RAN deployments, isolcpus are a necessity to cater to
> > the requirement of low latency applications. These isolcpus are generally
> > tickless so that high priority SCHED_FIFO tasks can execute without any
> > OS jitter. Since load balancing is disabled on isocpus, any task
> > which gets placed on these CPUs can not be migrated on its own.
> > For RT applications to execute on isolcpus, a guaranteed kubernetes pod
> > with all isolcpus becomes the requirement and these RT applications are
> > affine to execute on a specific isolcpu within the kubernetes pod.
> > However, there may be some non-RT tasks which could also schedule in the
> > same kubernetes pod without being affine to any specific CPU(inherits the
> > pod cpuset affinity).
>
> OK... It looks like adding scheduler maintainers is also a necessity to
> cater here...
>
> > With multiple spawning and running containers inside
> > the pod, container runtime spawns several non-RT initializing tasks
> > ("runc init") inside the pod and due to above mentioned commits, these
> > non-RT tasks may get placed on any isolcpus and may starve if it happens
> > to wakeup on the same CPU as SCHED_FIFO task because RT throttling is also
> > disabled in telco setup. Thus, RAN deployment fails and eventually leads
> > to system hangs.
>
> Not that I'm familiar to your setup, but this sounds like a userspace
> configuration problems. Can you try to move your non-RT tasks into a
> cgroup attached to non-RT CPUs, or something like that?
>
It's not really. In a container environment just logging in to the
container could end up with the exec'd task landing on one of
the polling or latency sensitive cores.
In a telco deployment the applications will run containers with
isolated(pinned) cpus with load balacning disabled. These
containers typically use one of these cpus for its "housekeeping"
with the remainder used for the latency sensitive workloads.
Also, this is a change in kernel behavior which is breaking
userspace.
We are also hitting this and are interested in a way to get the
old behavior back for some workloads.
> > With the introduction of kernel cmdline param 'sched_pick_firstcpu',
> > there is an option provided for such usecases to disable the distribution
> > of tasks within the cpumask logic and use the previous 'pick first cpu'
> > approach for initial placement of tasks. Because many telco vendors
> > configure the system in such a way that the first cpu within a cpuset
> > of pod doesn't run any SCHED_FIFO or High priority tasks.
> >
> > Co-developed-by: Alexey Makhalov <alexey.makhalov@...adcom.com>
> > Signed-off-by: Alexey Makhalov <alexey.makhalov@...adcom.com>
> > Signed-off-by: Ankit Jain <ankit-aj.jain@...adcom.com>
> > ---
> > lib/cpumask.c | 24 ++++++++++++++++++++++++
> > 1 file changed, 24 insertions(+)
> >
> > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > index e77ee9d46f71..3dea87d5ec1f 100644
> > --- a/lib/cpumask.c
> > +++ b/lib/cpumask.c
> > @@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> > }
> > EXPORT_SYMBOL(cpumask_local_spread);
> >
> > +/*
> > + * Task distribution within the cpumask feature disabled?
> > + */
> > +static bool cpumask_pick_firstcpu __read_mostly;
> > +
> > +/*
> > + * Disable Tasks distribution within the cpumask feature
> > + */
> > +static int __init cpumask_pick_firstcpu_setup(char *str)
> > +{
> > + cpumask_pick_firstcpu = 1;
> > + pr_info("cpumask: Tasks distribution within cpumask is disabled.");
> > + return 1;
> > +}
> > +
> > +__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
> > +
> > static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
> >
> > /**
> > @@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
> > {
> > unsigned int next, prev;
> >
> > + /*
> > + * Don't distribute, if tasks distribution
> > + * within cpumask feature is disabled
> > + */
> > + if (cpumask_pick_firstcpu)
> > + return cpumask_any_and(src1p, src2p);
>
> No, this is a wrong way.
>
> To begin with, this parameter shouldn't control a single random
> function. At least, the other cpumask_*_distribute() should be
> consistent to the policy.
>
> But in general... I don't think we should do things like that at all.
> Cpumask API is a simple and plain wrapper around bitmaps. If you want
> to modify a behavior of the scheduler, you could do that at scheduler
> level, not in a random helper function.
>
> Consider 2 cases:
> - Someone unrelated to scheduler would use the same helper and will
> be affected by this parameter inadvertently.
> - Scheduler will switch to using another function to distribute CPUs,
> and your setups will suddenly get broken again. This time deeply in
> production.
>
Yeah, I think I agree with this part. At the scheduler level, where this
is called, makes more sense.
Note, this is "deeply in production" now...
Cheers,
Phil
> Thanks,
> Yury
>
> > /* NOTE: our first selection will skip 0. */
> > prev = __this_cpu_read(distribute_cpu_mask_prev);
> >
> > --
> > 2.23.1
>
--
Powered by blists - more mailing lists