linux-kernel - Re: [PATCH] rcu/nocb: Add an option to ON/OFF an offloading from RT context

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YnY33nq5jl6FLFOu@pc638.lan>
Date:   Sat, 7 May 2022 11:11:58 +0200
From:   Uladzislau Rezki <urezki@...il.com>
To:     "Paul E. McKenney" <paulmck@...nel.org>,
        Alison Chaiken <achaiken@...ora.tech>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc:     Uladzislau Rezki <urezki@...il.com>,
        LKML <linux-kernel@...r.kernel.org>, RCU <rcu@...r.kernel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Neeraj Upadhyay <neeraj.iitr10@...il.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        Oleksiy Avramchenko <oleksiy.avramchenko@...y.com>,
        bigeasy@...utronix.de
Subject: Re: [PATCH] rcu/nocb: Add an option to ON/OFF an offloading from RT
 context

> On Fri, May 06, 2022 at 06:22:26PM +0200, Uladzislau Rezki wrote:
> > > On Thu, May 05, 2022 at 12:16:41PM +0200, Uladzislau Rezki (Sony) wrote:
> > > > Introduce a RCU_NOCB_CPU_CB_BOOST kernel option. So a user can
> > > > decide if an offloading has to be done in a high-prio context or
> > > > not. Please note an option depends on RCU_NOCB_CPU and RCU_BOOST
> > > > parameters and by default it is off.
> > > > 
> > > > This patch splits the boosting preempted RCU readers and those
> > > > kthreads which directly responsible for driving expedited grace
> > > > periods forward with enabling/disabling the offloading from/to
> > > > SCHED_FIFO/SCHED_OTHER contexts.
> > > > 
> > > > The main reason of such split is, for example on Android there
> > > > are some workloads which require fast expedited grace period to
> > > > be done whereas offloading in RT context can lead to starvation
> > > > and hogging a CPU for a long time what is not acceptable for
> > > > latency sensitive environment. For instance:
> > > > 
> > > > <snip>
> > > > <...>-60 [006] d..1 2979.028717: rcu_batch_start: rcu_preempt CBs=34619 bl=270
> > > > <snip>
> > > > 
> > > > invoking 34 619 callbacks will take time thus making other CFS
> > > > tasks waiting in run-queue to be starved due to such behaviour.
> > > > 
> > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@...il.com>
> > > 
> > > All good points!
> > > 
> > > Some questions and comments below.
> > > 
> > > Adding Sebastian on CC for his perspective.
> > > 
> > > 						Thanx, Paul
> > > 
> > > > ---
> > > >  kernel/rcu/Kconfig     | 14 ++++++++++++++
> > > >  kernel/rcu/tree.c      |  5 ++++-
> > > >  kernel/rcu/tree_nocb.h |  3 ++-
> > > >  3 files changed, 20 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > > index 27aab870ae4c..074630b94902 100644
> > > > --- a/kernel/rcu/Kconfig
> > > > +++ b/kernel/rcu/Kconfig
> > > > @@ -275,6 +275,20 @@ config RCU_NOCB_CPU_DEFAULT_ALL
> > > >  	  Say Y here if you want offload all CPUs by default on boot.
> > > >  	  Say N here if you are unsure.
> > > >  
> > > > +config RCU_NOCB_CPU_CB_BOOST
> > > > +	bool "Perform offloading from real-time kthread"
> > > > +	depends on RCU_NOCB_CPU && RCU_BOOST
> > > > +	default n
> > > 
> > > I understand that you need this to default to "n" on your systems.
> > > However, other groups already using callback offloading should not see
> > > a sudden change.  I don't see an Android-specific defconfig file, but
> > > perhaps something in drivers/android/Kconfig?
> > > 
We saw a sudden change when the priority was lifted up for rcuop kthreads. 
I would like to know the reason. As for Android, i would like to avoid
it to be Android specific. It is better just to enable boosting by
default for nocb kthreads.

> > > One easy way to make this work would be to invert the sense of this
> > > Kconfig option ("RCU_NOCB_CB_NO_BOOST"?), continue having it default to
> > > "n", but then select it somewhere in drivers/android/Kconfig.  But I
> > > would not be surprised if there is a better way.
In that situation probably we should just enable it by default.

> > It was done deliberately, i mean off by default. Because the user has to
> > think before enabling it for its workloads. It is not a big issue for
> > kthreads which drive a grace period forward, because their context runtime
> > i find pretty short. Whereas an offloading callback kthread can stuck
> > for a long time depending on workloads.
> > 
> > Also, i put it that way because initially those kthreads were staying
> > as SCHED_NORMAL even though the RCU_BOOST was set in kernel config.
> > 
> > <snip>
> > commit c8b16a65267e35ecc5621dbc81cbe7e5b0992fce
> > Author: Alison Chaiken <achaiken@...ora.tech>
> > Date:   Tue Jan 11 15:32:52 2022 -0800
> > 
> >     rcu: Elevate priority of offloaded callback threads
> >     
> >     When CONFIG_PREEMPT_RT=y, the rcutree.kthread_prio command-line
> >     parameter signals initialization code to boost the priority of rcuc
> >     callbacks to the designated value.  With the additional
> >     CONFIG_RCU_NOCB_CPU=y configuration and an additional rcu_nocbs
> >     command-line parameter, the callbacks on the listed cores are
> >     offloaded to new rcuop kthreads that are not pinned to the cores whose
> >     post-grace-period work is performed.  While the rcuop kthreads perform
> >     the same function as the rcuc kthreads they offload, the kthread_prio
> >     parameter only boosts the priority of the rcuc kthreads.  Fix this
> >     inconsistency by elevating rcuop kthreads to the same priority as the rcuc
> >     kthreads.
> >     
> >     Signed-off-by: Alison Chaiken <achaiken@...ora.tech>
> >     Signed-off-by: Paul E. McKenney <paulmck@...nel.org>
> > <snip>
> > 
> > I have a doubt that it is needed for CONFIG_PREEMPT_RT=y. The reason i mentioned
> > above it is a source of extra latency. That is why i have made it inactive by default.
> > 
> > Any thoughts?
> 
> My first thought is that Alison does real RT work.  Let's please therefore
> avoid assuming that she doesn't know what she is doing.  ;-)
> 
I read a commit message that is what i know about the patch.

>
> One thing that she knows is that RT workloads usually run the most
> latency-sensitive parts of their application at far higher priority
> than they do the rcuo[ps] kthreads.  This means that they do not have
> the same issues with these kthreads that you see.
> 
We can make it ON by default for CONFIG_PREEMPT_RT=y kernels. But i do
not want to guess. The correct way is just to ask if Sebastian and Alison
would like to have it by default on.

I have added them into "To".

> > > > +	help
> > > > +	  Use this option to offload callbacks from the SCHED_FIFO context
> > > > +	  to make the process faster. As a side effect of this approach is
> > > > +	  a latency especially for the SCHED_OTHER tasks which will not be
> > > > +	  able to preempt an offloading kthread. That latency depends on a
> > > > +	  number of callbacks to be invoked.
> > > > +
> > > > +	  Say Y here if you want to set RT priority for offloading kthreads.
> > > > +	  Say N here if you are unsure.
> > > > +
> > > >  config TASKS_TRACE_RCU_READ_MB
> > > >  	bool "Tasks Trace RCU readers use memory barriers in user and idle"
> > > >  	depends on RCU_EXPERT && TASKS_TRACE_RCU
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 9dc4c4e82db6..d769a15bc0e3 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -154,7 +154,10 @@ static void sync_sched_exp_online_cleanup(int cpu);
> > > >  static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
> > > >  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
> > > >  
> > > > -/* rcuc/rcub/rcuop kthread realtime priority */
> > > > +/*
> > > > + * rcuc/rcub/rcuop kthread realtime priority. The former
> > > > + * depends on if CONFIG_RCU_NOCB_CPU_CB_BOOST is set.
> > > 
> > > Aren't the rcuo[ps] kthreads controlled by the RCU_NOCB_CPU_CB_BOOST
> > > Kconfig option?  (As opposed to the "former", which is "rcuc".)
> > > 
> > The CONFIG_RCU_NOCB_CPU_CB_BOOST controls only the last what is
> > the rcuo CB kthread or "rcuo%c/%d" name. Sorry it is not "former"
> > it is the last in the rcuc/rcub/rcuop sequence. It was a typo :)
> 
> I do know that feeling!  Absolutely not a problem, please just fix it
> in the next version.
> 
Will do :)

> > > > + */
> > > >  static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
> > > >  module_param(kthread_prio, int, 0444);
> > > >  
> > > > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> > > > index 60cc92cc6655..a2823be9b1d0 100644
> > > > --- a/kernel/rcu/tree_nocb.h
> > > > +++ b/kernel/rcu/tree_nocb.h
> > > > @@ -1315,8 +1315,9 @@ static void rcu_spawn_cpu_nocb_kthread(int cpu)
> > > >  	if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo CB kthread, OOM is now expected behavior\n", __func__))
> > > >  		goto end;
> > > >  
> > > > -	if (kthread_prio)
> > > > +	if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_CB_BOOST))
> > > 
> > > Don't we need both non-zero kthread_prio and the proper setting of the
> > > new Kconfig option before we run it at SCHED_FIFO?
> > > 
> > > Yes, we could rely on sched_setscheduler_nocheck() erroring out in
> > > that case, but that sounds like an accident waiting to happen.
> > > 
> > As far as i see it is odd, because the "kthread_prio" is verified so
> > there is a sanity check to check if the value is correct for SCHED_FIFO
> > case and does some adjustment if not. There is sanitize_kthread_prio()
> > that does all trick.
> 
> Agreed, and like I said, we could rely on sched_setscheduler_nocheck()
> erroring out in that case.  But people do sometimes turn error cases
> into some other functionality.  Keeping the check of kthread_prio makes
> it clear to people reading the code what our intent is and also avoids
> strange breakage should someone find a use for SCHED_FIFO priority zero.
> 
Hm... I can place it back, though it is useless, IMHO.

--
Uladzislau Rezki