linux-kernel - Re: RCU stall when using function

Message-Id: <20170816163228.GZ7017@linux.vnet.ibm.com>
Date:   Wed, 16 Aug 2017 09:32:28 -0700
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     Daniel Lezcano <daniel.lezcano@...aro.org>,
        Pratyush Anand <panand@...hat.com>,
        김동현 <austinkernel.kim@...il.com>,
        john.stultz@...aro.org, linux-kernel@...r.kernel.org
Subject: Re: RCU stall when using function_graph

On Wed, Aug 16, 2017 at 10:04:21AM -0400, Steven Rostedt wrote:
> On Wed, 16 Aug 2017 10:42:15 +0200
> Daniel Lezcano <daniel.lezcano@...aro.org> wrote:
> 
> > Hi Steven,
> > 
> > 
> > On 15/08/2017 15:29, Steven Rostedt wrote:
> > > 
> > > [ I'm back from vacation! ]  
> > 
> > Did you get the tapes? :)
> 
> Yes, but nothing in them would cause the reputation of the POTUS to
> become any worse than it already is.
> 
> > 
> > > On Wed, 9 Aug 2017 17:51:33 +0200
> > > Daniel Lezcano <daniel.lezcano@...aro.org> wrote:
> > >   
> > >> Well, maybe the instruction pointer thing is not a good idea.
> > >>
> > >> I learnt from this experience that an overloaded kernel with a lot of
> > >> interrupts can hang the console and trigger an RCU stall.
> > >>
> > >> However, someone else can face the same situation. Even if he reads the
> > >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> > >>
> > >> A message saying that the grace period can't complete because we are too
> > >> busy processing interrupts would have helped but I understand it is not
> > >> easy to implement.  
> > > 
> > > What if the stall code triggered an irqwork first? The irqwork would
> > > trigger as soon as interrupts were enabled again (or at the next tick,
> > > depending on the arch), and then it would know that RCU stalled due to
> > > an irq storm if the irqwork is being hit.  
> > 
> > Is that condition enough to tell that the CPU is overutilized by
> > interrupt handling?
> > 
> > And I'm wondering if it wouldn't make sense to have this detection in
> > the irq code. With or without the RCU stall warning kernel option set,
> > the irq framework will be warning about this situation. If the RCU stall
> > option is set, that will issue a second message. It will be easy to make
> > the connection between the first message and the second one, no?
> 
> The thing is, the RCU code keeps track of the state of progress, I
> don't believe the interrupt code does. It just worries about handling
> interrupts. I'm not excited about adding infrastructure to the
> interrupt code to do accounting of IRQ storms.
> 
> On the other hand, the RCU code already does this. If it notices a
> stall, it can trigger an irq_work and wait a little more. If the
> irq_work doesn't fire, then it can do the normal RCU stall message. But
> if the irq_work does fire, and the RCU progress still hasn't moved
> forward, then it would be able to say this is due to an IRQ storm and
> produce a better error message.

Let me see if I understand you...  About halfway to the stall limit,
RCU triggers an irq_work (on each CPU that has not yet passed through
a quiescent state, IPIing them in turn?), and if the irq_work has
not completed by the end of the stall limit, RCU adds that to its
stall-warning message.
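
To make that reading concrete, here is a minimal userspace model of it,
not kernel code. Everything in it is an assumption made for illustration:
the CpuState fields, the two function names, and the wording of the report
lines are all hypothetical, and the real stall-warning code is structured
quite differently.

```python
from dataclasses import dataclass


@dataclass
class CpuState:
    """Hypothetical per-CPU bookkeeping for one grace period."""
    cpu: int
    passed_quiescent_state: bool      # CPU already reported a quiescent state
    irq_work_queued: bool = False     # probe queued at the halfway point
    irq_work_completed: bool = False  # probe handler actually ran


def queue_probes_at_halfway(cpus):
    """About halfway to the stall limit: queue a probe on every CPU
    that has not yet passed through a quiescent state."""
    for c in cpus:
        if not c.passed_quiescent_state:
            c.irq_work_queued = True


def stall_report(cpus):
    """At the stall limit: one line per CPU still blocking the grace
    period, annotated when its probe never completed."""
    lines = []
    for c in cpus:
        if c.passed_quiescent_state:
            continue
        line = f"CPU {c.cpu}: stalled"
        if c.irq_work_queued and not c.irq_work_completed:
            line += " (irq_work did not complete)"
        lines.append(line)
    return lines


# Example: CPU 0 is fine, CPU 1 ran its probe, CPU 2 never did.
cpus = [CpuState(0, True), CpuState(1, False), CpuState(2, False)]
queue_probes_at_halfway(cpus)
cpus[1].irq_work_completed = True
for line in stall_report(cpus):
    print(line)
```

The model only records whether the probe completed; what that should imply
about the cause of the stall is left open here.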

Or am I missing something here?

							Thanx, Paul