Message-ID: <CAC_TJvf_ubOPUVpNYK9UUWXn5J5hpiTkOt2++kYTGbF9bM_pYw@mail.gmail.com>
Date:   Fri, 8 Apr 2022 10:53:53 -0700
From:   Kalesh Singh <kaleshsingh@...gle.com>
To:     paulmck@...nel.org
Cc:     Joel Fernandes <joel@...lfernandes.org>,
        Suren Baghdasaryan <surenb@...gle.com>,
        kernel-team <kernel-team@...roid.com>, Tejun Heo <tj@...nel.org>,
        Tim Murray <timmurray@...gle.com>, Wei Wang <wvw@...gle.com>,
        Kyle Lin <kylelin@...gle.com>,
        Chunwei Lu <chunweilu@...gle.com>,
        Lulu Wang <luluw@...gle.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Neeraj Upadhyay <quic_neeraju@...cinc.com>,
        Josh Triplett <josh@...htriplett.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        rcu <rcu@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 8, 2022 at 10:39 AM Paul E. McKenney <paulmck@...nel.org> wrote:
>
> On Fri, Apr 08, 2022 at 01:14:35PM -0400, Joel Fernandes wrote:
> > On Fri, Apr 8, 2022 at 11:34 AM Paul E. McKenney <paulmck@...nel.org> wrote:
> > >
> > > On Fri, Apr 08, 2022 at 10:41:26AM -0400, Joel Fernandes wrote:
> > > > On Fri, Apr 8, 2022 at 10:34 AM Paul E. McKenney <paulmck@...nel.org> wrote:
> > > > >
> > > > > On Fri, Apr 08, 2022 at 06:42:42AM -0400, Joel Fernandes wrote:
> > > > > > On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <kaleshsingh@...gle.com> wrote:
> > > > > > >
> > > > > > [...]
> > > > > > > @@ -334,15 +334,13 @@ static bool exp_funnel_lock(unsigned long s)
> > > > > > >   * Select the CPUs within the specified rcu_node that the upcoming
> > > > > > >   * expedited grace period needs to wait for.
> > > > > > >   */
> > > > > > > -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > > > > +static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> > > > > > >  {
> > > > > > >         int cpu;
> > > > > > >         unsigned long flags;
> > > > > > >         unsigned long mask_ofl_test;
> > > > > > >         unsigned long mask_ofl_ipi;
> > > > > > >         int ret;
> > > > > > > -       struct rcu_exp_work *rewp =
> > > > > > > -               container_of(wp, struct rcu_exp_work, rew_work);
> > > > > > >         struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);
> > > > > > >
> > > > > > >         raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > > > > > > @@ -417,13 +415,119 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > > > >                 rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
> > > > > > >  }
> > > > > > >
> > > > > > > +static void rcu_exp_sel_wait_wake(unsigned long s);
> > > > > > > +
> > > > > > > +#ifdef CONFIG_RCU_EXP_KTHREAD
> > > > > >
> > > > > > Just my 2c:
> > > > > >
> > > > > > Honestly, I am not sure if the benefits of duplicating the code to use
> > > > > > normal workqueues outweighs the drawbacks (namely code complexity,
> > > > > > code duplication - which can in turn cause more bugs and maintenance
> > > > > > headaches down the line). The code is harder to read, and adding more
> > > > > > 30-character function names does not help.
> > > > > >
> > > > > > For something as important as expedited GPs, I can't imagine a
> > > > > > scenario where an RT kthread worker would cause "issues". If it does
> > > > > > cause issues, that's what the -rc cycles and the stable releases are
> > > > > > for. I prefer to trust the process than take a one-foot-in-the-door
> > > > > > approach.
> > > > > >
> > > > > > So please, can we just keep it simple?
> > > > >
> > > > > Yes and no.
> > > > >
> > > > > This is a bug fix, but only for those systems that are expecting real-time
> > > > > response from synchronize_rcu_expedited().  As far as I know, this is only
> > > > > Android.  The rest of the systems are just fine with the current behavior.
> > > >
> > > > As far as you know, but are you sure?
> > >
> > > None of us are sure.  We are balancing risks and potential benefits.
> >
> > Right.
> >
> > > > > In addition, this bug fix introduces significant risks, especially in
> > > > > terms of performance for throughput-oriented workloads.
> > > >
> > > > Could you explain what the risk is? That's the part I did not follow.
> > > > How can giving synchronize_rcu_expedited() work higher priority
> > > > introduce throughput issues?
> > >
> > > Status quo has synchronize_rcu_expedited() workqueues running as
> > > SCHED_OTHER.
> >
> > Yeah, so if we go by this, you are saying RCU_BOOST likely does not
> > work correctly with the status quo, right? I do not see anything in
> > the commit message indicating that this is an Android-only issue;
> > let me know what I am missing.
>
> Android is in the process of imposing a new requirement, which I have been
> letting you get away with calling a bug.  Mostly because I don't care that
> much what it is called.  Either way, the code base needs to accommodate
> Android's needs with acceptably low risk of breakage elsewhere.
>
> You are right that RCU_BOOST has run additional kthreads at real-time
> priority, and has done so for quite some time.  But RCU_BOOST has not
> (repeat, NOT) implied real-time response from synchronize_rcu_expedited(),
> and the ONLY request thus far has come from Android.
>
> Face it, the fact that synchronize_rcu_expedited() can achieve
> tens-of-milliseconds "worst-case" latencies is quite
> amazing.  Let's not spoil things by visiting the changes on
> workloads that neither need nor care about "worst-case" latencies
> from synchronize_rcu_expedited().
>
> The "worst case" with the double quotes is because a single long-running
> CPU-bound reader will force increased latencies.  Then again, that is
> exactly why Uladzislau created this commit:
>
> 0b74356e7b95 ("rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUT")
>
> Please note that this commit's short timeouts depend on ANDROID.  If you
> are serious about also having this work in ChromeOS, please send a patch
> on top of Uladzislau's that also covers ChromeOS.
>
> But please be very careful what you wish for...
>
> Right now, you get expedited RCU CPU stall warnings at 21 seconds
> in mainline kernels.  With this commit, that becomes 20 milliseconds,
> but only for ANDROID.
>
> I could easily imagine that the Android guys would welcome help
> fixing long-running RCU readers, but I find it even easier to
> imagine that some of the more vocal ChromeOS users might stridently
> object to being involuntarily recruited to provide such help.  ;-)

Thanks for the discussion everyone.

We didn't fully switch to kthread workers to avoid changing the
behavior for users that don't need these low-latency expedited GPs.
Another (and perhaps more important) reason is that kthread_worker
offers reduced concurrency compared to workqueues, which Paul
reported can pose issues on systems with a large number of CPUs.

The Kconfig depends on RCU_EXPERT. Android enables RCU_EXPERT in the
common kernel defconfigs to allow configuring performance
optimizations. Changing the dependency to CONFIG_ANDROID would also
mean less testing, so I prefer to keep RCU_EXPERT.
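
For context, the Kconfig entry under discussion is roughly of this
shape (a reconstructed sketch; the exact default expression and help
text may differ from the posted patch):

```kconfig
config RCU_EXP_KTHREAD
	bool "Perform RCU expedited work in a real-time kthread"
	depends on RCU_BOOST && RCU_EXPERT
	help
	  Use this option to further reduce the latencies of expedited
	  grace periods by running the expedited GP work items in a
	  real-time kthread_worker instead of a workqueue.
```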

If PREEMPT_RT systems don't disable expedited grace periods, we can
remove the default-off for those. Steve?

Thanks,
Kalesh

>
> > > The users affected by this will instead have these running
> > > as SCHED_FIFO.  That changes scheduling.  For users not explicitly
> > > needing low-latency synchronize_rcu_expedited(), this change is very
> > > unlikely to be for the better.  And historically, unnecessarily running
> > > portions of RCU at real-time priorities has been a change for the worse.
> > > As in greatly increased context-switch rates and consequently degraded
> > > performance.  Please note that this is not a theoretical statement:  Real
> > > users have really been burned by too much SCHED_FIFO in RCU kthreads in
> > > the past.
> >
> > Android also has suffered from too much SCHED_FIFO in the past. I
> > remember the display thread called 'surfaceflinger' had to be dropped
> > from RT privilege at one point.
>
> "With great power..."  ;-)
>
>                                                         Thanx, Paul
>
> > > > > So yes, let's do this bug fix (with appropriate adjustment), but let's
> > > > > also avoid exposing the non-Android workloads to risks from the inevitable
> > > > > unintended consequences.  ;-)
> > > >
> > > > I would argue the risk is also adding code complexity and more bugs
> > > > without clear rationale for why it is being done. There's always risk
> > > > with any change, but that's what the -rc cycles and stable kernels
> > > > help catch. I think we should not add more code complexity for a
> > > > theoretical concern.
> > > >
> > > > There's also another possible risk: a hidden problem that the
> > > > non-Android folks perhaps haven't noticed or been able to debug.
> > > > I would rather just do the right thing.
> > > >
> > > > Just my 2c,
> > >
> > > Sorry, but my answer is still "no".
> > >
> > > Your suggested change risks visiting unacceptable performance
> > > degradation on a very large number of innocent users for whom current
> > > synchronize_rcu_expedited() latency is plenty good enough.
> >
> > I believe the process will catch any such risk, but it is your call! ;-)
> >
> > Thanks,
> >
> > - Joel
>
