linux-kernel - Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220408143444.GC4285@paulmck-ThinkPad-P17-Gen-1>
Date:   Fri, 8 Apr 2022 07:34:44 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Joel Fernandes <joel@...lfernandes.org>
Cc:     Kalesh Singh <kaleshsingh@...gle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        kernel-team <kernel-team@...roid.com>, Tejun Heo <tj@...nel.org>,
        Tim Murray <timmurray@...gle.com>, Wei Wang <wvw@...gle.com>,
        Kyle Lin <kylelin@...gle.com>,
        Chunwei Lu <chunweilu@...gle.com>,
        Lulu Wang <luluw@...gle.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Neeraj Upadhyay <quic_neeraju@...cinc.com>,
        Josh Triplett <josh@...htriplett.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        rcu <rcu@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT
 kthread_worker

On Fri, Apr 08, 2022 at 06:42:42AM -0400, Joel Fernandes wrote:
> On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <kaleshsingh@...gle.com> wrote:
> >
> [...]
> > @@ -334,15 +334,13 @@ static bool exp_funnel_lock(unsigned long s)
> >   * Select the CPUs within the specified rcu_node that the upcoming
> >   * expedited grace period needs to wait for.
> >   */
> > -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > +static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> >  {
> >         int cpu;
> >         unsigned long flags;
> >         unsigned long mask_ofl_test;
> >         unsigned long mask_ofl_ipi;
> >         int ret;
> > -       struct rcu_exp_work *rewp =
> > -               container_of(wp, struct rcu_exp_work, rew_work);
> >         struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);
> >
> >         raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > @@ -417,13 +415,119 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> >                 rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
> >  }
> >
> > +static void rcu_exp_sel_wait_wake(unsigned long s);
> > +
> > +#ifdef CONFIG_RCU_EXP_KTHREAD
> 
> Just my 2c:
> 
> Honestly, I am not sure if the benefits of duplicating the code to use
> normal workqueues outweighs the drawbacks (namely code complexity,
> code duplication - which can in turn cause more bugs and maintenance
> headaches down the line). The code is harder to read and adding more
> 30 character function names does not help.
> 
> For something as important as expedited GPs, I can't imagine a
> scenario where an RT kthread worker would cause "issues". If it does
> cause issues, that's what the -rc cycles and the stable releases are
> for. I prefer to trust the process than take a one-foot-in-the-door
> approach.
> 
> So please, can we just keep it simple?

Yes and no.

This is a bug fix, but only for those systems that are expecting real-time
response from synchronize_rcu_expedited().  As far as I know, this is only
Android.  The rest of the systems are just fine with the current behavior.

In addition, this bug fix introduces significant risks, especially in
terms of performance for throughput-oriented workloads.

So yes, let's do this bug fix (with appropriate adjustment), but let's
also avoid exposing the non-Android workloads to risks from the inevitable
unintended consequences.  ;-)

							Thanx, Paul

> Thanks,
> 
>  - Joel
> 
> 
> > +static void sync_rcu_exp_select_node_cpus(struct kthread_work *wp)
> > +{
> > +       struct rcu_exp_work *rewp =
> > +               container_of(wp, struct rcu_exp_work, rew_work);
> > +
> > +       __sync_rcu_exp_select_node_cpus(rewp);
> > +}
> > +
> > +static inline bool rcu_gp_par_worker_started(void)
> > +{
> > +       return !!READ_ONCE(rcu_exp_par_gp_kworker);
> > +}
> > +
> > +static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
> > +{
> > +       kthread_init_work(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > +       /*
> > +        * Use rcu_exp_par_gp_kworker, because flushing a work item from
> > +        * another work item on the same kthread worker can result in
> > +        * deadlock.
> > +        */
> > +       kthread_queue_work(rcu_exp_par_gp_kworker, &rnp->rew.rew_work);
> > +}
> > +
> > +static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
> > +{
> > +       kthread_flush_work(&rnp->rew.rew_work);
> > +}
> > +
> > +/*
> > + * Work-queue handler to drive an expedited grace period forward.
> > + */
> > +static void wait_rcu_exp_gp(struct kthread_work *wp)
> > +{
> > +       struct rcu_exp_work *rewp;
> > +
> > +       rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > +       rcu_exp_sel_wait_wake(rewp->rew_s);
> > +}
> > +
> > +static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
> > +{
> > +       kthread_init_work(&rew->rew_work, wait_rcu_exp_gp);
> > +       kthread_queue_work(rcu_exp_gp_kworker, &rew->rew_work);
> > +}
> > +
> > +static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
> > +{
> > +}
> > +#else /* !CONFIG_RCU_EXP_KTHREAD */
> > +static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > +{
> > +       struct rcu_exp_work *rewp =
> > +               container_of(wp, struct rcu_exp_work, rew_work);
> > +
> > +       __sync_rcu_exp_select_node_cpus(rewp);
> > +}
> > +
> > +static inline bool rcu_gp_par_worker_started(void)
> > +{
> > +       return !!READ_ONCE(rcu_par_gp_wq);
> > +}
> > +
> > +static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
> > +{
> > +       int cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
> > +
> > +       INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > +       /* If all offline, queue the work on an unbound CPU. */
> > +       if (unlikely(cpu > rnp->grphi - rnp->grplo))
> > +               cpu = WORK_CPU_UNBOUND;
> > +       else
> > +               cpu += rnp->grplo;
> > +       queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> > +}
> > +
> > +static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
> > +{
> > +       flush_work(&rnp->rew.rew_work);
> > +}
> > +
> > +/*
> > + * Work-queue handler to drive an expedited grace period forward.
> > + */
> > +static void wait_rcu_exp_gp(struct work_struct *wp)
> > +{
> > +       struct rcu_exp_work *rewp;
> > +
> > +       rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > +       rcu_exp_sel_wait_wake(rewp->rew_s);
> > +}
> > +
> > +static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
> > +{
> > +       INIT_WORK_ONSTACK(&rew->rew_work, wait_rcu_exp_gp);
> > +       queue_work(rcu_gp_wq, &rew->rew_work);
> > +}
> > +
> > +static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
> > +{
> > +       destroy_work_on_stack(&rew->rew_work);
> > +}
> > +#endif /* CONFIG_RCU_EXP_KTHREAD */
> > +
> >  /*
> >   * Select the nodes that the upcoming expedited grace period needs
> >   * to wait for.
> >   */
> >  static void sync_rcu_exp_select_cpus(void)
> >  {
> > -       int cpu;
> >         struct rcu_node *rnp;
> >
> >         trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset"));
> > @@ -435,28 +539,21 @@ static void sync_rcu_exp_select_cpus(void)
> >                 rnp->exp_need_flush = false;
> >                 if (!READ_ONCE(rnp->expmask))
> >                         continue; /* Avoid early boot non-existent wq. */
> > -               if (!READ_ONCE(rcu_par_gp_wq) ||
> > +               if (!rcu_gp_par_worker_started() ||
> >                     rcu_scheduler_active != RCU_SCHEDULER_RUNNING ||
> >                     rcu_is_last_leaf_node(rnp)) {
> > -                       /* No workqueues yet or last leaf, do direct call. */
> > +                       /* No worker started yet or last leaf, do direct call. */
> >                         sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work);
> >                         continue;
> >                 }
> > -               INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > -               cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
> > -               /* If all offline, queue the work on an unbound CPU. */
> > -               if (unlikely(cpu > rnp->grphi - rnp->grplo))
> > -                       cpu = WORK_CPU_UNBOUND;
> > -               else
> > -                       cpu += rnp->grplo;
> > -               queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> > +               sync_rcu_exp_select_cpus_queue_work(rnp);
> >                 rnp->exp_need_flush = true;
> >         }
> >
> > -       /* Wait for workqueue jobs (if any) to complete. */
> > +       /* Wait for jobs (if any) to complete. */
> >         rcu_for_each_leaf_node(rnp)
> >                 if (rnp->exp_need_flush)
> > -                       flush_work(&rnp->rew.rew_work);
> > +                       sync_rcu_exp_select_cpus_flush_work(rnp);
> >  }
> >
> >  /*
> > @@ -622,17 +719,6 @@ static void rcu_exp_sel_wait_wake(unsigned long s)
> >         rcu_exp_wait_wake(s);
> >  }
> >
> > -/*
> > - * Work-queue handler to drive an expedited grace period forward.
> > - */
> > -static void wait_rcu_exp_gp(struct work_struct *wp)
> > -{
> > -       struct rcu_exp_work *rewp;
> > -
> > -       rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > -       rcu_exp_sel_wait_wake(rewp->rew_s);
> > -}
> > -
> >  #ifdef CONFIG_PREEMPT_RCU
> >
> >  /*
> > @@ -848,20 +934,19 @@ void synchronize_rcu_expedited(void)
> >         } else {
> >                 /* Marshall arguments & schedule the expedited grace period. */
> >                 rew.rew_s = s;
> > -               INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> > -               queue_work(rcu_gp_wq, &rew.rew_work);
> > +               synchronize_rcu_expedited_queue_work(&rew);
> >         }
> >
> >         /* Wait for expedited grace period to complete. */
> >         rnp = rcu_get_root();
> >         wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> >                    sync_exp_work_done(s));
> > -       smp_mb(); /* Workqueue actions happen before return. */
> > +       smp_mb(); /* Work actions happen before return. */
> >
> >         /* Let the next expedited grace period start. */
> >         mutex_unlock(&rcu_state.exp_mutex);
> >
> >         if (likely(!boottime))
> > -               destroy_work_on_stack(&rew.rew_work);
> > +               synchronize_rcu_expedited_destroy_work(&rew);
> >  }
> >  EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
> >
> > base-commit: 42e7a03d3badebd4e70aea5362d6914dfc7c220b
> > --
> > 2.35.1.1178.g4f1659d476-goog
> >