Message-ID: <46fb3937-3a1d-7fb1-9493-bf11b6ef8dad@codeaurora.org>
Date: Mon, 16 Nov 2020 21:35:09 +0530
From: Neeraj Upadhyay <neeraju@...eaurora.org>
To: paulmck@...nel.org
Cc: josh@...htriplett.org, rostedt@...dmis.org,
mathieu.desnoyers@...icios.com, jiangshanlai@...il.com,
joel@...lfernandes.org, rcu@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] rcu: Check and report missed fqs timer wakeup on RCU stall
Hi Paul,
On 11/12/2020 1:01 AM, Paul E. McKenney wrote:
> On Wed, Nov 11, 2020 at 07:37:37PM +0530, Neeraj Upadhyay wrote:
>> For a new grace period request, the RCU GP kthread transitions
>> through the following states (a condensed sketch of the whole
>> sequence follows the list):
>>
>> a. [RCU_GP_WAIT_GPS] -> [RCU_GP_DONE_GPS]
>>
>> The initial state, in which the GP kthread waits for a new GP
>> start request, is RCU_GP_WAIT_GPS. On a new GP request (made by
>> callers, for any new callbacks, by setting the RCU_GP_FLAG_INIT
>> bit in gp_flags and waking up the GP kthread), the GP kthread
>> enters RCU_GP_DONE_GPS.
>>
>> b. [RCU_GP_DONE_GPS] -> [RCU_GP_ONOFF]
>>
>> Grace period initialization starts in rcu_gp_init(), which
>> records the start of the new GP in rcu_state.gp_seq and enters
>> the RCU_GP_ONOFF state.
>>
>> c. [RCU_GP_ONOFF] -> [RCU_GP_INIT]
>>
>> In the RCU_GP_ONOFF state, for each leaf rnp node, the GP kthread
>> applies the buffered online/offline information of its CPUs,
>> from ->qsmaskinitnext to ->qsmaskinit, which is essentially the
>> mask of CPUs that need to report a quiescent state for the new
>> GP to complete.
>>
>> Also in this state, an outgoing rnp's (all of whose CPUs are now
>> offline and which has no tasks blocked inside an RCU read-side
>> critical section) or an incoming rnp's (whose first CPU has just
>> come online) participation in the new and subsequent GPs is
>> advertised by propagating its qsmaskinit information up the
>> tree.
>>
>> d. [RCU_GP_INIT] -> [RCU_GP_WAIT_FQS]
>>
>> In the RCU_GP_INIT state, the current GP's qs mask information
>> and the new GP sequence number are propagated down the tree,
>> into all rnp nodes, in breadth-first order.
>>
>> On completion of GP initialization, the GP kthread enters an fqs
>> loop, which does the following to get quiescent states reported
>> for CPUs that have not yet reported one (that is, CPUs still set
>> in ->qsmask):
>>
>> i. For CPUs that have entered idle since the first iteration,
>> report the quiescent state up the tree.
>>
>> ii. Based on how long the current grace period has been running,
>> either set the appropriate flags, so that the scheduling-clock
>> interrupt on that CPU does the qs reporting, or force a
>> reschedule on that CPU.
>>
>> On each iteration, it transitions through the following states.
>> The fqs loop terminates on GP completion, when all CPUs have
>> reported their quiescent states and all RCU readers blocking the
>> current grace period have left their RCU read-side critical
>> sections.
>>
>> e. [RCU_GP_WAIT_FQS] -> [RCU_GP_DOING_FQS]
>>
>> The GP kthread uses a timed wait (jiffies_till_first_fqs for the
>> first iteration and jiffies_till_next_fqs for subsequent
>> iterations) before doing the work of forcing quiescent states on
>> all CPUs. It wakes up from this state either on timeout, or when
>> the RCU_GP_FLAG_FQS or RCU_GP_FLAG_OVLD flag is set, that is, on
>> callback-overload conditions or when the last CPU reports its
>> quiescent state and all RCU readers blocking the current GP have
>> left their RCU read-side critical sections.
>>
>> f. [RCU_GP_CLEANUP] -> [RCU_GP_CLEANED]
>>
>> Mark the end of the grace period in ->gp_seq on all rnp nodes,
>> in breadth-first order, and finally in rcu_state.
>>
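>> A condensed sketch of the above sequence (simplified pseudo-C,
>> not the actual kernel code; see rcu_gp_kthread() and
>> rcu_gp_fqs_loop() for the real implementation; gp_completed below
>> is just a placeholder for the "all QSes reported" condition):
>>
>>     for (;;) {
>>         /* a. Wait for a new GP request (RCU_GP_FLAG_INIT in ->gp_flags). */
>>         gp_state = RCU_GP_WAIT_GPS;
>>         gp_state = RCU_GP_DONE_GPS;
>>
>>         /* b.-d. rcu_gp_init(): record the new GP in rcu_state.gp_seq. */
>>         gp_state = RCU_GP_ONOFF;  /* Apply ->qsmaskinitnext to ->qsmaskinit. */
>>         gp_state = RCU_GP_INIT;   /* Push gp_seq and qs masks down the tree. */
>>
>>         /* e. rcu_gp_fqs_loop(): iterate until all QSes are reported. */
>>         do {
>>             gp_state = RCU_GP_WAIT_FQS;  /* Timed wait on a queued timer. */
>>             gp_state = RCU_GP_DOING_FQS; /* Force quiescent states. */
>>         } while (!gp_completed);
>>
>>         /* f. rcu_gp_cleanup(): mark GP end in ->gp_seq, then in rcu_state. */
>>         gp_state = RCU_GP_CLEANUP;
>>         gp_state = RCU_GP_CLEANED;
>>     }
>>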
>> For cases where timers are not served (due to issues in timer
>> configuration or in the timer framework, or because softirqs are
>> not handled during a storm of interrupts) on the CPU where the GP
>> kthread queued its timer (for the timed wait done in
>> RCU_GP_WAIT_FQS), it is possible that the GP kthread never wakes
>> up. Report this from the stall warnings if the GP kthread is in
>> the RCU_GP_WAIT_FQS state, the timeout has elapsed, and the
>> kthread has not been woken.
>>
>> Signed-off-by: Neeraj Upadhyay <neeraju@...eaurora.org>
>> ---
>> kernel/rcu/tree.c | 25 +++++++++++++++----------
>> kernel/rcu/tree_stall.h | 33 +++++++++++++++++++++++++++++++++
>> 2 files changed, 48 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index 413831b..804e543 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -1767,7 +1767,7 @@ static bool rcu_gp_init(void)
>> * go offline later. Please also refer to "Hotplug CPU" section
>> * of RCU's Requirements documentation.
>> */
>> - rcu_state.gp_state = RCU_GP_ONOFF;
>> + WRITE_ONCE(rcu_state.gp_state, RCU_GP_ONOFF);
>> rcu_for_each_leaf_node(rnp) {
>> smp_mb(); // Pair with barriers used when updating ->ofl_seq to odd values.
>> firstseq = READ_ONCE(rnp->ofl_seq);
>> @@ -1833,7 +1833,7 @@ static bool rcu_gp_init(void)
>> * The grace period cannot complete until the initialization
>> * process finishes, because this kthread handles both.
>> */
>> - rcu_state.gp_state = RCU_GP_INIT;
>> + WRITE_ONCE(rcu_state.gp_state, RCU_GP_INIT);
>> rcu_for_each_node_breadth_first(rnp) {
>> rcu_gp_slow(gp_init_delay);
>> raw_spin_lock_irqsave_rcu_node(rnp, flags);
>> @@ -1932,17 +1932,22 @@ static void rcu_gp_fqs_loop(void)
>> ret = 0;
>> for (;;) {
>> if (!ret) {
>> - rcu_state.jiffies_force_qs = jiffies + j;
>> + WRITE_ONCE(rcu_state.jiffies_force_qs, jiffies + j);
>> + /*
>> + * jiffies_force_qs before RCU_GP_WAIT_FQS state
>> + * update; required for stall checks.
>> + */
>> + smp_wmb();
>> WRITE_ONCE(rcu_state.jiffies_kick_kthreads,
>> jiffies + (j ? 3 * j : 2));
>> }
>> trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq,
>> TPS("fqswait"));
>> - rcu_state.gp_state = RCU_GP_WAIT_FQS;
>> + WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_FQS);
>> ret = swait_event_idle_timeout_exclusive(
>> rcu_state.gp_wq, rcu_gp_fqs_check_wake(&gf), j);
>> rcu_gp_torture_wait();
>> - rcu_state.gp_state = RCU_GP_DOING_FQS;
>> + WRITE_ONCE(rcu_state.gp_state, RCU_GP_DOING_FQS);
>> /* Locking provides needed memory barriers. */
>> /* If grace period done, leave loop. */
>> if (!READ_ONCE(rnp->qsmask) &&
>> @@ -2056,7 +2061,7 @@ static void rcu_gp_cleanup(void)
>> trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("end"));
>> rcu_seq_end(&rcu_state.gp_seq);
>> ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
>> - rcu_state.gp_state = RCU_GP_IDLE;
>> + WRITE_ONCE(rcu_state.gp_state, RCU_GP_IDLE);
>> /* Check for GP requests since above loop. */
>> rdp = this_cpu_ptr(&rcu_data);
>> if (!needgp && ULONG_CMP_LT(rnp->gp_seq, rnp->gp_seq_needed)) {
>> @@ -2095,12 +2100,12 @@ static int __noreturn rcu_gp_kthread(void *unused)
>> for (;;) {
>> trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq,
>> TPS("reqwait"));
>> - rcu_state.gp_state = RCU_GP_WAIT_GPS;
>> + WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_GPS);
>> swait_event_idle_exclusive(rcu_state.gp_wq,
>> READ_ONCE(rcu_state.gp_flags) &
>> RCU_GP_FLAG_INIT);
>> rcu_gp_torture_wait();
>> - rcu_state.gp_state = RCU_GP_DONE_GPS;
>> + WRITE_ONCE(rcu_state.gp_state, RCU_GP_DONE_GPS);
>> /* Locking provides needed memory barrier. */
>> if (rcu_gp_init())
>> break;
>> @@ -2115,9 +2120,9 @@ static int __noreturn rcu_gp_kthread(void *unused)
>> rcu_gp_fqs_loop();
>>
>> /* Handle grace-period end. */
>> - rcu_state.gp_state = RCU_GP_CLEANUP;
>> + WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANUP);
>> rcu_gp_cleanup();
>> - rcu_state.gp_state = RCU_GP_CLEANED;
>> + WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANED);
>> }
>> }
>>
>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
>> index 70d48c5..4ff7990 100644
>> --- a/kernel/rcu/tree_stall.h
>> +++ b/kernel/rcu/tree_stall.h
>> @@ -468,6 +468,37 @@ static void rcu_check_gp_kthread_starvation(void)
>> }
>> }
>>
>> +/* Complain about missing wakeups from expired fqs wait timer */
>> +static void rcu_check_gp_kthread_expired_fqs_timer(void)
>> +{
>> + struct task_struct *gpk = rcu_state.gp_kthread;
>> + short gp_state;
>> + unsigned long jiffies_fqs;
>> + int cpu;
>> +
>> + gp_state = READ_ONCE(rcu_state.gp_state);
>> + /*
>> + * Order reads of .gp_state and .jiffies_force_qs.
>> + * Matching smp_wmb() is present in rcu_gp_fqs_loop().
>> + */
>> + smp_rmb();
>> + jiffies_fqs = READ_ONCE(rcu_state.jiffies_force_qs);
>
> Or you could use smp_load_acquire() instead of the smp_rmb()/READ_ONCE()
> combination. (Other parts of the stall-warning code might also be able
> to do this, given that they predate smp_load_acquire().)
>
Done for this part; will check other parts for cleanup.
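For reference, the reader side then looks something like the below
(just a sketch; the smp_wmb() added in rcu_gp_fqs_loop() stays as the
matching barrier):

	/* In rcu_check_gp_kthread_expired_fqs_timer(): */
	gp_state = smp_load_acquire(&rcu_state.gp_state); /* Pairs with smp_wmb(). */
	jiffies_fqs = READ_ONCE(rcu_state.jiffies_force_qs);
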
>> +
>> + if (gp_state == RCU_GP_WAIT_FQS &&
>> + time_before(jiffies + RCU_STALL_MIGHT_MIN, jiffies_fqs) &&
>> + gpk && !READ_ONCE(gpk->on_rq)) {
>> + cpu = task_cpu(gpk);
>> + pr_err("%s kthread timer wakeup didn't happen for %ld jiffies! g%ld f%#x %s(%d) ->state=%#lx\n",
>> + rcu_state.name, (jiffies - jiffies_fqs),
>> + (long)rcu_seq_current(&rcu_state.gp_seq),
>> + data_race(rcu_state.gp_flags),
>> + gp_state_getname(RCU_GP_WAIT_FQS), RCU_GP_WAIT_FQS,
>> + gpk->state);
>> + pr_err("\tPossible timer handling issue on cpu=%d timer-softirq=%u\n",
>> + cpu, kstat_softirqs_cpu(TIMER_SOFTIRQ, cpu));
>> + }
>
> Otherwise looks plausible.
>
>> +}
>> +
>> static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
>> {
>> int cpu;
>> @@ -530,6 +561,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
>> jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
>>
>> rcu_check_gp_kthread_starvation();
>> + rcu_check_gp_kthread_expired_fqs_timer();
>>
>> panic_on_rcu_stall();
>>
>> @@ -565,6 +597,7 @@ static void print_cpu_stall(unsigned long gps)
>> (long)rcu_seq_current(&rcu_state.gp_seq), totqlen);
>>
>> rcu_check_gp_kthread_starvation();
>> + rcu_check_gp_kthread_expired_fqs_timer();
>
> I don't have an opinion about the ordering of this output within the
> RCU CPU stall warning message, but I do encourage you to try out a
> few different options.
>
I think all stall warnings which describe the GP kthread state - like
starvation and timer wakeup - can be kept close to each other?
> Here is one possible way to test this code:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs TREE09 --bootargs "rcutorture.stall_cpu=25 rcutorture.stall_cpu_irqsoff=1" --trust-make
>
> As an alternative to the above "--bootargs" parameter, the
> rcutorture.stall_gp_kthread module parameter should also be interesting.
> If I remember correctly, it should result in a stall without causing
> your timer code above to print anything.
>
> See recent paulmck.livejournal.com posts for additional information
> on rcutorture and running it.
>
> Thanx, Paul
>
Thanks! Wasn't able to get the warning with the below:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs TREE09 --bootargs "rcutorture.stall_cpu=25 rcutorture.stall_cpu_irqsoff=1" --trust-make
Possibly, in the single-CPU case, the stall checks didn't happen while
interrupts were disabled?
Was able to reproduce with
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs TREE01 --bootargs "rcutorture.stall_gp_kthread=50" --trust-make
Below is the log output; I moved the warning above the starvation logs,
so that it's visible before the NMI backtrace printed from the
starvation logs:
[ 32.780341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 32.782062] (detected by 4, t=21003 jiffies, g=981, q=36134)
[ 32.782065] rcu: All QSes seen, last rcu_preempt kthread activity 20958 (4294699485-4294678527), jiffies_till_next_fqs=3, root ->qsmask 0x0
[ 32.786652] rcu: rcu_preempt kthread timer wakeup didn't happen for 20957 jiffies! g981 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 32.788868] rcu: Possible timer handling issue on cpu=4 timer-softirq=11142
[ 32.789946] rcu: rcu_preempt kthread starved for 20960 jiffies! g981 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=4 fqs: 4294678530/4294699487
[ 32.791866] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 32.793258] rcu: RCU grace-period kthread stack dump:
[ 32.794046] task:rcu_preempt state:I stack:14624 pid: 10 ppid: 2 flags:0x00004000
[ 32.795326] Call Trace:
[ 32.795718] __schedule+0x269/0x6b0
[ 32.796262] schedule+0x41/0xe0
[ 32.796704] schedule_timeout+0x19c/0x2e0
Thanks
Neeraj
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation