linux-kernel - Re: question about RCU dynticks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150504193923.GX5381@linux.vnet.ibm.com>
Date:	Mon, 4 May 2015 12:39:23 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Rik van Riel <riel@...hat.com>
Cc:	Paolo Bonzini <pbonzini@...hat.com>,
	Ingo Molnar <mingo@...nel.org>,
	Andy Lutomirski <luto@...capital.net>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	X86 ML <x86@...nel.org>, williams@...hat.com,
	Andrew Lutomirski <luto@...nel.org>, fweisbec@...hat.com,
	Peter Zijlstra <peterz@...radead.org>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: question about RCU dynticks_nesting

On Mon, May 04, 2015 at 03:00:44PM -0400, Rik van Riel wrote:
> On 05/04/2015 11:59 AM, Rik van Riel wrote:
> 
> > However, currently the RCU code seems to use a much more
> > complex counting scheme, with a different increment for
> > kernel/task use, and irq use.
> > 
> > This counter seems to be modeled on the task preempt_counter,
> > where we do care about whether we are in task context, irq
> > context, or softirq context.
> > 
> > On the other hand, the RCU code only seems to care about
> > whether or not a CPU is in an extended quiescent state,
> > or is potentially in an RCU critical section.
> > 
> > Paul, what is the reason for RCU using a complex counter,
> > instead of a simple increment for each potential kernel/RCU
> > entry, like rcu_read_lock() does with CONFIG_PREEMPT_RCU
> > enabled?
> 
> Looking at the code for a while more, I have not found
> any reason why the rcu dynticks counter is so complex.

For the nesting counter, please see my earlier email.

> The rdtp->dynticks atomic seems to be used as a serial
> number. Odd means the cpu is in an rcu quiescent state,
> even means it is not.

Yep.

> This test is used to verify whether or not a CPU is
> in rcu quiescent state. Presumably the atomic_add_return
> is used to add a memory barrier.
> 
> 	atomic_add_return(0, &rdtp->dynticks) & 0x1)

Yep.  It is sampled remotely, hence the need for full memory barriers.
It doesn't help to sample the counter if the sampling gets reordered
with the surrounding code.  Ditto for the increments.

By the end of the year, and hopefully much sooner, I expect to have
testing infrastructure capable of detecting ordering bugs in this code.
At which point, I can start experimenting with alternative code sequences.

But full ordering is still required, and cache misses can happen.

> > In fact, would we be able to simply use tsk->rcu_read_lock_nesting
> > as an indicator of whether or not we should bother waiting on that
> > task or CPU when doing synchronize_rcu?
> 
> We seem to have two variants of __rcu_read_lock().
> 
> One increments current->rcu_read_lock_nesting, the other
> calls preempt_disable().

Yep.  The first is preemptible RCU, the second classic RCU.

> In case of the non-preemptible RCU, we could easily also
> increase current->rcu_read_lock_nesting at the same time
> we increase the preempt counter, and use that as the
> indicator to test whether the cpu is in an extended
> rcu quiescent state. That way there would be no extra
> overhead at syscall entry or exit at all. The trick
> would be getting the preempt count and the rcu read
> lock nesting count in the same cache line for each task.

But in non-preemptible RCU, we have PREEMPT=n, so there is no preempt
counter in production kernels.  Even if there was, we have to sample this
on other CPUs, so the overhead of preempt_disable() and preempt_enable()
would be where kernel entry/exit is, so I expect that this would be a
net loss in overall performance.

> In case of the preemptible RCU scheme, we would have to
> examine the per-task state (under the runqueue lock)
> to get the current task info of all CPUs, and in
> addition wait for the blkd_tasks list to empty out
> when doing a synchronize_rcu().
> 
> That does not appear to require special per-cpu
> counters; examining the per-cpu rdp and the lists
> inside it, with the rnp->lock held if doing any
> list manipulation, looks like it would be enough.
> 
> However, the current code is a lot more complicated
> than that. Am I overlooking something obvious, Paul?
> Maybe something non-obvious? :)

Ummm...  The need to maintain memory ordering when sampling task
state from remote CPUs?

Or am I completely confused about what you are suggesting?

That said, are you chasing a real system-visible performance issue
that you tracked to RCU's dyntick-idle system?

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/