Message-ID: <20221020231353.GC5600@paulmck-ThinkPad-P17-Gen-1>
Date:   Thu, 20 Oct 2022 16:13:53 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Zhen Lei <thunder.leizhen@...wei.com>
Cc:     Frederic Weisbecker <frederic@...nel.org>,
        Neeraj Upadhyay <quic_neeraju@...cinc.com>,
        Josh Triplett <josh@...htriplett.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Joel Fernandes <joel@...lfernandes.org>, rcu@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/3] rcu: Add RCU stall diagnosis information

On Mon, Oct 17, 2022 at 06:01:05PM +0800, Zhen Lei wrote:
> In some extreme cases, such as an I/O pressure test, CPU usage may reach
> 100%, causing an RCU stall. In this case, the printed information about
> current is not useful. Displaying the number and usage of hard interrupts,
> soft interrupts, and context switches generated within half of
> the CPU stall timeout can help us make a general judgment. In other
> cases, we can preliminarily determine whether an infinite loop occurs
> while local_irq, local_bh, or preemption is disabled.
> 
> Zhen Lei (3):
>   sched: Add helper kstat_cpu_softirqs_sum()
>   sched: Add helper nr_context_switches_cpu()
>   rcu: Add RCU stall diagnosis information
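
For anyone following along, the idea in the cover letter above can be
modeled in userspace as a snapshot-and-delta over per-CPU counters.
This is only a rough sketch under stated assumptions, not the code from
the patches: the struct, the enum, and both functions are hypothetical
names invented for illustration, and the "zero activity" heuristic is a
first guess rather than a definitive diagnosis.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU event counters (illustrative, not from the patches). */
struct cpu_counters {
	uint64_t hardirqs;
	uint64_t softirqs;
	uint64_t ctx_switches;
};

/* Hypothetical first-guess readings of the activity since the snapshot. */
enum stall_hint {
	HINT_IRQS_MAYBE_DISABLED,	/* no hard interrupts were counted */
	HINT_BH_MAYBE_DISABLED,		/* hardirqs seen, but no softirqs */
	HINT_PREEMPT_MAYBE_DISABLED,	/* irqs and softirqs seen, no switches */
	HINT_CPU_BUSY,			/* everything still happening */
};

/*
 * Activity between a snapshot taken at half of the stall timeout and
 * the counters read when the stall is reported.
 */
static struct cpu_counters counters_delta(const struct cpu_counters *snap,
					   const struct cpu_counters *now)
{
	struct cpu_counters d;

	/* The counters are assumed to be monotonically increasing. */
	assert(now->hardirqs >= snap->hardirqs);
	assert(now->softirqs >= snap->softirqs);
	assert(now->ctx_switches >= snap->ctx_switches);

	d.hardirqs = now->hardirqs - snap->hardirqs;
	d.softirqs = now->softirqs - snap->softirqs;
	d.ctx_switches = now->ctx_switches - snap->ctx_switches;
	return d;
}

/* Map a delta to a preliminary hint, checking the outermost condition first. */
static enum stall_hint stall_hint_from_delta(const struct cpu_counters *d)
{
	if (d->hardirqs == 0)
		return HINT_IRQS_MAYBE_DISABLED;
	if (d->softirqs == 0)
		return HINT_BH_MAYBE_DISABLED;
	if (d->ctx_switches == 0)
		return HINT_PREEMPT_MAYBE_DISABLED;
	return HINT_CPU_BUSY;
}
```

In this model, a delta with no hard interrupts at all hints at a loop
with interrupts disabled, while nonzero counts everywhere point toward
plain heavy CPU usage, which is the case the questions below are about.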

Interesting approach, thank you!

I have pulled this in for testing and review, having rescued it from my
spam folder.

Some questions that might come up include:  (1) Can the addition of
things like cond_resched() make RCU happier with the I/O pressure test?
(2) Should there be a way to turn this off for environments with slow
consoles?  (3) If this information shows heavy CPU usage, what debug
and fix approach should be used?

For an example of #1, if a CPU is flooded with softirq activity, one
might hope that the call to rcu_softirq_qs() would prevent the RCU CPU
stall warning, at least for kernels built with CONFIG_PREEMPT_RT=n.
Similarly, if there are huge numbers of context switches, one might hope
that the rcu_note_context_switch() would report a quiescent state sooner
rather than later.

Thoughts?

							Thanx, Paul

>  include/linux/kernel_stat.h | 12 +++++++++++
>  kernel/rcu/tree.h           | 11 ++++++++++
>  kernel/rcu/tree_stall.h     | 40 +++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c         |  5 +++++
>  4 files changed, 68 insertions(+)
> 
> -- 
> 2.25.1
> 
