Message-Id: <20190723134717.GT14271@linux.ibm.com>
Date:   Tue, 23 Jul 2019 06:47:17 -0700
From:   "Paul E. McKenney" <paulmck@...ux.ibm.com>
To:     Byungchul Park <byungchul.park@....com>
Cc:     Joel Fernandes <joel@...lfernandes.org>,
        Byungchul Park <max.byungchul.park@...il.com>,
        rcu <rcu@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
        kernel-team@....com
Subject: Re: [PATCH] rcu: Make jiffies_till_sched_qs writable

On Tue, Jul 23, 2019 at 08:05:21PM +0900, Byungchul Park wrote:
> On Fri, Jul 19, 2019 at 04:33:56PM -0400, Joel Fernandes wrote:
> > On Fri, Jul 19, 2019 at 3:57 PM Paul E. McKenney <paulmck@...ux.ibm.com> wrote:
> > >
> > > On Fri, Jul 19, 2019 at 06:57:58PM +0900, Byungchul Park wrote:
> > > > On Fri, Jul 19, 2019 at 4:43 PM Paul E. McKenney <paulmck@...ux.ibm.com> wrote:
> > > > >
> > > > > On Thu, Jul 18, 2019 at 08:52:52PM -0400, Joel Fernandes wrote:
> > > > > > On Thu, Jul 18, 2019 at 8:40 PM Byungchul Park <byungchul.park@....com> wrote:
> > > > > > [snip]
> > > > > > > > - There is a bug in the CPU stopper machinery itself preventing it
> > > > > > > > from scheduling the stopper on Y, even though Y is not holding up
> > > > > > > > the grace period.
> > > > > > >
> > > > > > > Or any thread on Y is busy with preemption/irq disabled, preventing
> > > > > > > the stopper from being scheduled on Y.
> > > > > > >
> > > > > > > Or something is stuck in ttwu() trying to wake up the stopper on Y,
> > > > > > > due to scheduler locks such as pi_lock or rq->lock.
> > > > > > >
> > > > > > > I think what you mentioned can happen easily.
> > > > > > >
> > > > > > > Basically we would need information about preemption/irq-disabled
> > > > > > > sections on Y and the scheduler's current activity on every CPU at
> > > > > > > that time.
> > > > > >
> > > > > > I think all that's needed is an NMI backtrace on all CPUs. On ARM
> > > > > > we don't have NMI solutions, and only IPI- or interrupt-based
> > > > > > backtraces work, which should at least catch the preempt-disable
> > > > > > and softirq-disable cases.
> > > > >
> > > > > True, though people with systems having hundreds of CPUs might not
> > > > > thank you for forcing an NMI backtrace on each of them.  Is it possible
> > > > > to NMI only the ones that are holding up the CPU stopper?
> > > >
> > > > What a good idea! I think it's possible!
> > > >
> > > > But we need to think about the case where NMI doesn't work, when the
> > > > holdup is caused by IRQs being disabled.
> > > >
> > > > Though the weekend is just around the corner, I will keep thinking
> > > > about it over the weekend!
> > >
> > > Very good!
> > 
> > I too will think more about it ;-) Agreed with the point about the
> > 100s-of-CPUs use case.
> > 
> > Thanks, have a great weekend,
> 
> BTW, if there's any long code section with irq/preemption disabled, then
> the problem is not only about RCU stalls. We could also use the latency
> tracer or something similar to detect that bad situation.
> 
> So in this case, sending ipi/nmi to the CPUs where the stoppers cannot
> be scheduled does not give us additional meaningful information.
> 
> I think Paul started thinking about this to solve some real problem. I
> seriously love to help RCU and it's my pleasure to dig deep into this
> kind of RCU stuff, but I have yet to pin down exactly what the problem
> is. Sorry.
> 
> Could you share the real issue? You don't have to reproduce it; just
> sharing the issue that inspired this is enough. Then I might be able to
> work out the 'how' with Joel! :-) It's our pleasure!
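
As an aside, here is a minimal sketch of the "NMI only the CPUs that are
holding up the stopper" idea discussed above.  It leans on the existing
trigger_cpumask_backtrace() interface; cpu_is_holding_up_stopper() is a
purely hypothetical predicate standing in for however the stopper/RCU
machinery would identify the holdout CPUs, so treat this as an
illustration rather than a patch:

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/nmi.h>

static void backtrace_stopper_holdouts(void)
{
	cpumask_var_t holdouts;
	int cpu;

	if (!zalloc_cpumask_var(&holdouts, GFP_KERNEL))
		return;

	/* Collect only the CPUs suspected of blocking the stopper. */
	for_each_online_cpu(cpu)
		if (cpu_is_holding_up_stopper(cpu))	/* hypothetical */
			cpumask_set_cpu(cpu, holdouts);

	/*
	 * Backtrace only those CPUs via NMI (or IPI on architectures
	 * without a true NMI); this simply returns false if the
	 * architecture provides no backtrace support at all.
	 */
	if (!cpumask_empty(holdouts))
		trigger_cpumask_backtrace(holdouts);

	free_cpumask_var(holdouts);
}

Whether the resulting traces are useful still depends on the holdout CPU
being able to take the NMI (or, on NMI-less platforms, the IPI), which is
exactly the IRQ-disabled caveat raised above.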

It is unfortunately quite intermittent.  I was hoping to find a way
to make it happen more often.  Part of the underlying problem appears
to be lock contention, in that reducing contention made it even more
intermittent.  Which is good in general, but not for exercising the
CPU-stopper issue.

But perhaps your hardware will make this happen more readily than does
mine.  The repeat-by is simple, namely run TREE04 on branch "dev" on an
eight-CPU system.  It appears that the number of CPUs used by the test
should match the number available on the system that you are running on,
though perhaps affinity could allow mismatches.

So why not try it and see what happens?

							Thanx, Paul
