linux-kernel - Re: [PATCH 2/3] rcu: Defer RCU kthreads wakeup when CPU is dying

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALm+0cVv4cnbDPi=9oCYE_5s+DfuzQcB1fz=M1T8Hyp9D9sbXw@mail.gmail.com>
Date: Wed, 20 Dec 2023 16:24:35 +0800
From: Z qiang <qiang.zhang1211@...il.com>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Boqun Feng <boqun.feng@...il.com>, 
	Joel Fernandes <joel@...lfernandes.org>, Neeraj Upadhyay <neeraj.upadhyay@....com>, 
	Uladzislau Rezki <urezki@...il.com>, rcu <rcu@...r.kernel.org>, 
	"Paul E . McKenney" <paulmck@...nel.org>, Thomas Gleixner <tglx@...utronix.de>, 
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH 2/3] rcu: Defer RCU kthreads wakeup when CPU is dying

>
> When the CPU goes idle for the last time during the CPU down hotplug
> process, RCU reports a final quiescent state for the current CPU. If
> this quiescent state propagates up to the top, some tasks may then be
> woken up to complete the grace period: the main grace period kthread
> and/or the expedited main workqueue (or kworker).
>
> If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
> arm the RT bandwith timer to the local offline CPU. Since this happens
> after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
> timer gets ignored. Therefore if the RCU kthreads are waiting for RT
> bandwidth to be available, they may never be actually scheduled.
>

In the rcutree_report_cpu_dead(), the rcuog kthreads may also be wakeup in
do_nocb_deferred_wakeup(), if the rcuog kthreads is rt-fifo and wakeup happen,
the rt_period_active is set 1 and enqueue hrtimer to offline CPU in
do_start_rt_bandwidth(),
after that, we invoke swake_up_one_online() send ipi to online CPU, due to the
rt_period_active is 1, the rt-bandwith hrtimer will not enqueue to online CPU.
any thoughts?

Thanks
Zqiang


>
> This triggers TREE03 rcutorture hangs:
>
>          rcu: INFO: rcu_preempt self-detected stall on CPU
>          rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
>          rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
>          rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
>          rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
>          rcu: RCU grace-period kthread stack dump:
>          task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
>          Call Trace:
>           <TASK>
>           __schedule+0x2eb/0xa80
>           schedule+0x1f/0x90
>           schedule_timeout+0x163/0x270
>           ? __pfx_process_timeout+0x10/0x10
>           rcu_gp_fqs_loop+0x37c/0x5b0
>           ? __pfx_rcu_gp_kthread+0x10/0x10
>           rcu_gp_kthread+0x17c/0x200
>           kthread+0xde/0x110
>           ? __pfx_kthread+0x10/0x10
>           ret_from_fork+0x2b/0x40
>           ? __pfx_kthread+0x10/0x10
>           ret_from_fork_asm+0x1b/0x30
>           </TASK>
>
> The situation can't be solved with just unpinning the timer. The hrtimer
> infrastructure and the nohz heuristics involved in finding the best
> remote target for an unpinned timer would then also need to handle
> enqueues from an offline CPU in the most horrendous way.
>
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.
>
> Reported-by: Paul E. McKenney <paulmck@...nel.org>
> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
> ---
>  kernel/rcu/tree.c     | 34 +++++++++++++++++++++++++++++++++-
>  kernel/rcu/tree_exp.h |  3 +--
>  2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3ac3c846105f..157f3ca2a9b5 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
>         return needmore;
>  }
>
> +static void swake_up_one_online_ipi(void *arg)
> +{
> +       struct swait_queue_head *wqh = arg;
> +
> +       swake_up_one(wqh);
> +}
> +
> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> +       int cpu = get_cpu();
> +
> +       /*
> +        * If called from rcutree_report_cpu_starting(), wake up
> +        * is dangerous that late in the CPU-down hotplug process. The
> +        * scheduler might queue an ignored hrtimer. Defer the wake up
> +        * to an online CPU instead.
> +        */
> +       if (unlikely(cpu_is_offline(cpu))) {
> +               int target;
> +
> +               target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
> +                                        cpu_online_mask);
> +
> +               smp_call_function_single(target, swake_up_one_online_ipi,
> +                                        wqh, 0);
> +               put_cpu();
> +       } else {
> +               put_cpu();
> +               swake_up_one(wqh);
> +       }
> +}
> +
>  /*
>   * Awaken the grace-period kthread.  Don't do a self-awaken (unless in an
>   * interrupt or softirq handler, in which case we just might immediately
> @@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
>                 return;
>         WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
>         WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
> -       swake_up_one(&rcu_state.gp_wq);
> +       swake_up_one_online(&rcu_state.gp_wq);
>  }
>
>  /*
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6d7cea5d591f..2ac440bc7e10 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
>         return ret;
>  }
>
> -
>  /*
>   * Report the exit from RCU read-side critical section for the last task
>   * that queued itself during or before the current expedited preemptible-RCU
> @@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
>                         raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>                         if (wake) {
>                                 smp_mb(); /* EGP done before wake_up(). */
> -                               swake_up_one(&rcu_state.expedited_wq);
> +                               swake_up_one_online(&rcu_state.expedited_wq);
>                         }
>                         break;
>                 }
> --
> 2.42.1
>