Message-ID: <1d007125-9e1a-8018-d6b4-8838ecc1a873@huawei.com>
Date: Thu, 17 Nov 2022 10:03:17 +0800
From: "Leizhen (ThunderTown)" <thunder.leizhen@...wei.com>
To: Frederic Weisbecker <frederic@...nel.org>
CC: "Paul E . McKenney" <paulmck@...nel.org>,
Neeraj Upadhyay <quic_neeraju@...cinc.com>,
Josh Triplett <josh@...htriplett.org>,
"Steven Rostedt" <rostedt@...dmis.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Lai Jiangshan <jiangshanlai@...il.com>,
Joel Fernandes <joel@...lfernandes.org>, <rcu@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, Robert Elliott <elliott@....com>
Subject: Re: [PATCH v7 5/6] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information
On 2022/11/17 6:55, Frederic Weisbecker wrote:
> On Fri, Nov 11, 2022 at 09:07:08PM +0800, Zhen Lei wrote:
>> +1. A CPU looping with interrupts disabled.::
>> +
>> + rcu:           hardirqs   softirqs   csw/system
>> + rcu: number:          0          0            0
>> + rcu: cputime:         0          0            0   ==> 2500(ms)
>> +
>> + Because interrupts have been disabled throughout the measurement
>> + interval, there are no interrupts and no context switches.
>> + Furthermore, because CPU time consumption was measured using interrupt
>> + handlers, the system CPU consumption is misleadingly measured as zero.
>> + This scenario will normally also have "(0 ticks this GP)" printed on
>> + this CPU's summary line.
>> +
>> +2. A CPU looping with bottom halves disabled.
>> +
>> + This is similar to the previous example, but with a non-zero number
>> + of hard interrupts and non-zero CPU time consumed by them, along with
>> + non-zero CPU time consumed by in-kernel execution.::
>> +
>> + rcu:           hardirqs   softirqs   csw/system
>> + rcu: number:        624          0            0
>> + rcu: cputime:        49          0         2446   ==> 2500(ms)
>> +
>> + The fact that there are zero softirqs gives a hint that these were
>> + disabled, perhaps via local_bh_disable(). It is of course possible
>> + that there were no softirqs, perhaps because all events that would
>> + result in softirq execution are confined to other CPUs. In this case,
>> + the diagnosis should continue as shown in the next example.
>> +
>> +3. A CPU looping with preemption disabled.
>> +
>> + Here, only the number of context switches is zero.::
>> +
>> + rcu:           hardirqs   softirqs   csw/system
>> + rcu: number:        624         45            0
>> + rcu: cputime:        69          1         2425   ==> 2500(ms)
>> +
>> + This situation hints that the stalled CPU was looping with preemption
>> + disabled.
>> +
>> +4. No looping, but massive hard and soft interrupts.::
>> +
>> + rcu:           hardirqs   softirqs   csw/system
>> + rcu: number:         xx         xx            0
>> + rcu: cputime:        xx         xx            0   ==> 2500(ms)
>> +
>> + Here, the number and CPU time of hard interrupts are both non-zero,
>> + but the number of context switches and the in-kernel CPU time consumed
>> + are zero. The number and CPU time of soft interrupts will usually be
>> + non-zero, but could be zero, for example, if the CPU was spinning
>> + within a single hard interrupt handler.
>> +
>> + If this type of RCU CPU stall warning can be reproduced, you can
>> + narrow it down by looking at /proc/interrupts or by writing code to
>> + trace each interrupt, for example, by referring to show_interrupts().
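The four patterns above can be read mechanically. The following is only a rough sketch, not part of the patch: a hypothetical Python helper (the names parse_rows and classify are made up here) that parses the "number:" and "cputime:" rows and returns the hint the documentation associates with them. It assumes the row layout shown in the examples, and its results are hints, not a diagnosis.

```python
import re

# Matches "rcu: number: 624 45 0" and "rcu: cputime: 69 1 2425 ==> 2500(ms)".
# The header row ("rcu: hardirqs softirqs csw/system") does not match.
_ROW = re.compile(r"rcu:\s*(number|cputime):\s*(\d+)\s+(\d+)\s+(\d+)")


def parse_rows(text):
    """Return {'number': (hard, soft, csw), 'cputime': (hard, soft, system)}."""
    rows = {}
    for m in _ROW.finditer(text):
        rows[m.group(1)] = tuple(int(v) for v in m.group(2, 3, 4))
    return rows


def classify(text):
    """Map one CPU's stall counters to the scenario they hint at."""
    rows = parse_rows(text)
    n_hard, n_soft, n_csw = rows["number"]
    t_hard, t_soft, t_sys = rows["cputime"]

    if n_csw != 0:
        return "no match: context switches occurred"
    if n_hard == 0 and n_soft == 0:
        # Example 1: nothing was counted at all.
        return "interrupts disabled"
    if n_hard > 0 and t_hard > 0 and t_sys == 0:
        # Example 4: interrupt activity but no in-kernel (system) time.
        return "massive hard/soft interrupts"
    if n_soft == 0:
        # Example 2: may also mean no softirqs were raised on this CPU.
        return "bottom halves disabled"
    # Example 3: only the context-switch count is zero.
    return "preemption disabled"
```

Fed the rows from example 2, classify() returns "bottom halves disabled"; fed the rows from example 3, it returns "preemption disabled".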
>
> One last question: usually all of this information can be deduced just by
> looking at the stack trace that comes with an RCU stall report. So in
> which kind of situation is the stack trace not enough?
An interrupt storm, as in example 4 above.
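
For that case, the /proc/interrupts approach mentioned in example 4 could look roughly like this. It is a minimal sketch, not something from the patch: the helper names are hypothetical, and it assumes the usual /proc/interrupts layout (a header row of CPUn columns, then one "label: counts... description" row per interrupt source). It takes two snapshots and ranks the sources by how much their count grew on one CPU.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into (cpu_names, {label: (counts, desc)})."""
    lines = text.strip("\n").splitlines()
    cpu_names = lines[0].split()          # e.g. ["CPU0", "CPU1"]
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Rows such as ERR/MIS carry fewer counters than there are CPUs.
        for field in fields[:len(cpu_names)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        desc = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, desc)
    return cpu_names, table


def busiest_irqs(before, after, cpu, top=5):
    """Return up to `top` (delta, label, desc) tuples for CPU number `cpu`."""
    _, old = parse_interrupts(before)
    cpu_names, new = parse_interrupts(after)
    col = cpu_names.index("CPU%d" % cpu)  # column index, not CPU number
    deltas = []
    for label, (counts, desc) in new.items():
        if col >= len(counts):
            continue                      # no per-CPU counter on this row
        prev = old.get(label, ([], ""))[0]
        base = prev[col] if col < len(prev) else 0
        deltas.append((counts[col] - base, label, desc))
    deltas.sort(key=lambda d: d[0], reverse=True)
    return deltas[:top]


def sample_proc_interrupts(cpu, interval=1.0, top=5):
    """Take two snapshots `interval` seconds apart on a live Linux system."""
    with open("/proc/interrupts") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/interrupts") as f:
        after = f.read()
    return busiest_irqs(before, after, cpu, top)
```

Calling sample_proc_interrupts() with the number of the stalled CPU while the storm is ongoing should put the offending source at the top of the list. Snapshots can only be taken from a CPU that is still able to run the script.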
>
> Thanks.
> .
>
--
Regards,
Zhen Lei