linux-kernel - Re: [PATCH v7 5/6] doc: Document CONFIG_RCU_CPU_STALL

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20221116225507.GA839220@lothringen>
Date:   Wed, 16 Nov 2022 23:55:07 +0100
From:   Frederic Weisbecker <frederic@...nel.org>
To:     Zhen Lei <thunder.leizhen@...wei.com>
Cc:     "Paul E . McKenney" <paulmck@...nel.org>,
        Neeraj Upadhyay <quic_neeraju@...cinc.com>,
        Josh Triplett <josh@...htriplett.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Joel Fernandes <joel@...lfernandes.org>, rcu@...r.kernel.org,
        linux-kernel@...r.kernel.org, Robert Elliott <elliott@....com>
Subject: Re: [PATCH v7 5/6] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y
 stall information

On Fri, Nov 11, 2022 at 09:07:08PM +0800, Zhen Lei wrote:
> +1. A CPU looping with interrupts disabled.::
> +
> +   rcu:          hardirqs   softirqs   csw/system
> +   rcu:  number:        0          0            0
> +65;6003;1c   rcu: cputime:        0          0            0   ==> 2500(ms)
> +
> +   Because interrupts have been disabled throughout the measurement
> +   interval, there are no interrupts and no context switches.
> +   Furthermore, because CPU time consumption was measured using interrupt
> +   handlers, the system CPU consumption is misleadingly measured as zero.
> +   This scenario will normally also have "(0 ticks this GP)" printed on
> +   this CPU's summary line.
> +
> +2. A CPU looping with bottom halves disabled.
> +
> +   This is similar to the previous example, but with non-zero number of
> +   and CPU time consumed by hard interrupts, along with non-zero CPU
> +   time consumed by in-kernel execution.::
> +
> +   rcu:          hardirqs   softirqs   csw/system
> +   rcu:  number:      624          0            0
> +   rcu: cputime:       49          0         2446   ==> 2500(ms)
> +
> +   The fact that there are zero softirqs gives a hint that these were
> +   disabled, perhaps via local_bh_disable().  It is of course possible
> +   that there were no softirqs, perhaps because all events that would
> +   result in softirq execution are confined to other CPUs.  In this case,
> +   the diagnosis should continue as shown in the next example.
> +
> +3. A CPU looping with preemption disabled.
> +
> +   Here, only the number of context switches is zero.::
> +
> +   rcu:          hardirqs   softirqs   csw/system
> +   rcu:  number:      624         45            0
> +   rcu: cputime:       69          1         2425   ==> 2500(ms)
> +
> +   This situation hints that the stalled CPU was looping with preemption
> +   disabled.
> +
> +4. No looping, but massive hard and soft interrupts.::
> +
> +   rcu:          hardirqs   softirqs   csw/system
> +   rcu:  number:       xx         xx            0
> +   rcu: cputime:       xx         xx            0   ==> 2500(ms)
> +
> +   Here, the number and CPU time of hard interrupts are all non-zero,
> +   but the number of context switches and the in-kernel CPU time consumed
> +   are zero. The number and cputime of soft interrupts will usually be
> +   non-zero, but could be zero, for example, if the CPU was spinning
> +   within a single hard interrupt handler.
> +
> +   If this type of RCU CPU stall warning can be reproduced, you can
> +   narrow it down by looking at /proc/interrupts or by writing code to
> +   trace each interrupt, for example, by referring to show_interrupts().

One last question I have. Usually all these informations can be deduced by
just looking at the stacktrace that comes along an RCU stall report. So on
which kind of situation the stacktrace is not enough?

Thanks.