Message-ID: <SN6PR2101MB1807BDF049D7155201A8178DBFFA1@SN6PR2101MB1807.namprd21.prod.outlook.com>
Date: Wed, 25 Nov 2020 04:56:33 +0000
From: Dexuan Cui <decui@...rosoft.com>
To: "Paul E. McKenney" <paulmck@...nel.org>,
"boqun.feng@...il.com" <boqun.feng@...il.com>,
Ingo Molnar <mingo@...hat.com>,
"rcu@...r.kernel.org" <rcu@...r.kernel.org>,
vkuznets <vkuznets@...hat.com>
CC: Michael Kelley <mikelley@...rosoft.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: kdump always hangs in rcu_barrier() -> wait_for_completion()
Hi,
I hit a kdump hang in a Linux VM running on a particular Hyper-V host.
Please see the attached log: the kdump kernel always hangs, even if I
configure only 1 virtual CPU for the VM.
I first hit the issue in RHEL 8.3's 4.18.x kernel, but later I found that
the latest upstream v5.10-rc5 also has the same issue (at least the
symptom is exactly the same), so I dug into v5.10-rc5 and found that the
kdump kernel always hangs in kernel_init() -> mark_readonly() ->
rcu_barrier() -> wait_for_completion(&rcu_state.barrier_completion).
Let's take the 1-vCPU case as an example (refer to the attached log): in
the code below, rcu_segcblist_n_cbs() somehow returns a nonzero value, so
the call smp_call_function_single(cpu, rcu_barrier_func, (void *)cpu, 1)
increases the counter by 1 (from 2 to 3). As a result, the counter is
still 1 after the atomic_sub_and_test(), and complete() is not called.
static void rcu_barrier_func(void *cpu_in)
{
...
	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
		atomic_inc(&rcu_state.barrier_cpu_count);
	} else {
...
}

void rcu_barrier(void)
{
...
	atomic_set(&rcu_state.barrier_cpu_count, 2);
...
	for_each_possible_cpu(cpu) {
		rdp = per_cpu_ptr(&rcu_data, cpu);
...
		if (rcu_segcblist_n_cbs(&rdp->cblist) && cpu_online(cpu)) {
...
			smp_call_function_single(cpu, rcu_barrier_func, (void *)cpu, 1);
...
		}
	}
...
	if (atomic_sub_and_test(2, &rcu_state.barrier_cpu_count))
		complete(&rcu_state.barrier_completion);
...
	wait_for_completion(&rcu_state.barrier_completion);
Sorry for my ignorance of RCU -- I'm not sure why rcu_segcblist_n_cbs()
returns 1 here. In the normal kernel, it returns 0, so the normal kernel
does not hang.
Note: in the case of the kdump kernel, if I remove the kernel parameter
console=ttyS0 OR if I build the kernel with CONFIG_HZ=250, the issue no
longer reproduces. Currently my kernel uses CONFIG_HZ=1000 and I use
console=ttyS0, so I'm able to reproduce the issue every time.
Note: the same kernel binary does not reproduce the issue when the VM
runs on another Hyper-V host.
It looks like there is some kind of race condition?
Looking forward to your insights!
I'm happy to test any patch or enable more tracing, if necessary. Thanks!
Thanks,
-- Dexuan
Download attachment "bad-hz-1000.log" of type "application/octet-stream" (78840 bytes)