Date:	Mon, 18 Aug 2008 07:04:04 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Manfred Spraul <manfred@...orfullife.com>
Cc:	linux-kernel@...r.kernel.org, mingo@...e.hu,
	akpm@...ux-foundation.org, oleg@...sign.ru, dipankar@...ibm.com,
	rostedt@...dmis.org, dvhltc@...ibm.com, niv@...ibm.com
Subject: Re: [PATCH tip/core/rcu] classic RCU locking and memory-barrier
	cleanups

On Mon, Aug 18, 2008 at 11:13:45AM +0200, Manfred Spraul wrote:
> Paul E. McKenney wrote:
>>
>>> Right now, I'm trying to understand the current code first - and some
>>> of it doesn't make much sense.
>>>
>>> There are three per-cpu lists:
>>> ->nxt
>>> ->cur
>>> ->done.
>>>
>>> Obviously, there must be a quiescent state between cur and done.
>>> But why does the code require a quiescent state between nxt and cur?
>>> I think that's superfluous. The only thing that is required is that all
>>> cpus have moved their callbacks from nxt to cur. That doesn't need a
>>> quiescent state; the operation could be done in a hard interrupt as well.
>>>     
>>
>> The deal is that we have to put incoming callbacks somewhere while
>> the batch in ->cur waits for an RCU grace period.  That somewhere is
>> ->nxt.  So to be painfully pedantic, the callbacks in ->nxt are not
>> waiting for an RCU grace period.  Instead, they are waiting for the
>> callbacks in ->cur to get out of the way.
>>
>>   
> Ok, thanks.
> If I understand the new code in tip/rcu correctly, you have rewritten that 
> block anyway.
>
> I'll try to implement my proposal - on paper, it looks far simpler than the 
> current code.

Indeed.  The current code has accreted a number of fixes that have greatly
complicated it.
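
For concreteness, a rough C sketch of the three-list pipeline described
above (illustrative only: the field names follow the discussion, but the
real rcuclassic code keeps head/tail pointer pairs and appends to ->done
rather than overwriting it):

struct rcu_head {
        struct rcu_head *next;
        void (*func)(struct rcu_head *head);
};

struct rcu_data_sketch {
        struct rcu_head *nxt;   /* incoming callbacks, parked while the   */
                                /* ->cur batch waits for its grace period */
        struct rcu_head *cur;   /* callbacks waiting on the current grace */
                                /* period                                 */
        struct rcu_head *done;  /* grace period complete; safe to invoke  */
};

/*
 * When the grace period covering ->cur ends, each list moves down one
 * stage.  Emptying ->nxt into ->cur needs no quiescent state of its
 * own; those callbacks were only waiting for ->cur to drain.
 */
static void advance_callbacks_sketch(struct rcu_data_sketch *rdp)
{
        rdp->done = rdp->cur;
        rdp->cur = rdp->nxt;
        rdp->nxt = NULL;
}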

> On the one hand, a state machine that keeps track of a global state:
> - collect the callbacks in a nxt list.
> - wait for a quiescent state
> - destroy the callbacks in the nxt list.
> (actually, there will be 5 states, 2 additional for "start the next rcu 
> cycle immediately")
>
> On the other hand, a cpu bitmap that keeps track of the cpus that have 
> completed the work that must be done after a state change.
> The last cpu advances the global state.
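
For concreteness, one way this proposal could be wired up (the state names,
the helper, and the locking below are guesses from the description above,
not Manfred's actual code):

enum rcu_global_state {
        RCU_STATE_COLLECT,      /* gather callbacks into the nxt lists   */
        RCU_STATE_WAIT_QS,      /* wait for each cpu's quiescent state   */
        RCU_STATE_DESTROY,      /* invoke the callbacks in the nxt lists */
        /* two further states would start the next cycle immediately */
        RCU_NR_STATES
};

static enum rcu_global_state rcu_state;
static cpumask_t rcu_cpus_pending;      /* cpus yet to finish this state */
static DEFINE_SPINLOCK(rcu_state_lock);

/*
 * Each cpu calls this once it has completed the work required by the
 * current state; the last cpu to report advances the global state.
 */
static void rcu_cpu_report_sketch(int cpu)
{
        spin_lock(&rcu_state_lock);
        cpu_clear(cpu, rcu_cpus_pending);
        if (cpus_empty(rcu_cpus_pending)) {
                rcu_state = (rcu_state + 1) % RCU_NR_STATES;
                rcu_cpus_pending = cpu_online_map;
        }
        spin_unlock(&rcu_state_lock);
}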

FWIW, attached is what I have coded up thus far along these lines.
Currently testing in user space, hence some of the strange code and
comments flagged by "@@@" and "&&&&".  It automatically shapes the
hierarchy, reverting to no hierarchy for machines with few CPUs, and
using up to two additional levels for machines with many CPUs.  It would
not be hard to add additional levels, but given that the two additional
levels on a 64-bit machine can support 256K CPUs, this should suffice
for almost all cases.
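
The arithmetic behind that limit: with one bit per child in an unsigned
long, the fanout on a 64-bit machine is 64, so each extra level multiplies
the supported CPU count by 64 (64 flat, 64*64 = 4096 with one extra level,
64*64*64 = 262144, i.e. 256K, with two).  A toy helper, not the actual
shaping code from the attachment, shows how the level count falls out:

#define FANOUT_SKETCH 64        /* one bit per child in a 64-bit long */

static int rcu_levels_sketch(long ncpus)
{
        int levels = 1;
        long capacity = FANOUT_SKETCH;

        while (capacity < ncpus) {
                capacity *= FANOUT_SKETCH;
                levels++;
        }
        return levels;  /* 1 up to 64 cpus, 2 up to 4096, 3 up to 256K */
}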

> The state machine could be seq_lock protected, the cpu bitmap could be 
> either hierarchical or flat, or just a nop for uniprocessor.

I chose hierarchical for cache locality.  I used simple spinlocks
instead of seq_lock on the grounds that the hierarchy decreases lock
contention by orders of magnitude.  I do still access some values
outside of locks, but this is greatly reduced compared to the current
code base.
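
Roughly how the hierarchy keeps contention down (a sketch of the idea, not
the attached rcuclassic.c): each node has its own lock plus a bitmap of
children that still owe a quiescent state, so only the last child to report
at a given level ever touches the next lock up, cutting contention by the
fanout at every level.

struct rcu_node_sketch {
        spinlock_t lock;
        unsigned long qsmask;           /* children yet to report a QS   */
        unsigned long grpmask;          /* this node's bit in the parent */
        struct rcu_node_sketch *parent;
};

static void report_qs_sketch(struct rcu_node_sketch *rnp, unsigned long bit)
{
        while (rnp != NULL) {
                spin_lock(&rnp->lock);
                rnp->qsmask &= ~bit;
                if (rnp->qsmask != 0) {
                        /* Siblings still pending; nothing to do above. */
                        spin_unlock(&rnp->lock);
                        return;
                }
                bit = rnp->grpmask;
                spin_unlock(&rnp->lock);
                rnp = rnp->parent;
        }
        /* Every cpu has reported: the grace period can complete. */
}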

> Do you have any statistics about rcu_check_callbacks? On my single-cpu 
> system, around 2/3 of the calls are from "normal" context, i.e. 
> rcu_qsctr_inc() is called.

This depends critically on the workload.  Workloads such as HPC that
spend almost all of their time in user-mode execution would invoke
rcu_qsctr_inc() almost all of the time, as would workloads with
significant idle time.  On the other hand, it is possible to arrange
things so that almost all execution is in the kernel and there are almost
no context switches, in which case rcu_qsctr_inc() would be called only
after a force_quiescent_state().

But most workloads should see rcu_qsctr_inc() called most of the time.
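
For reference, the check being measured lives in the scheduling-clock
path; kernel/rcuclassic.c in this era does roughly the following (lightly
simplified):

void rcu_check_callbacks(int cpu, int user)
{
        if (user ||
            (idle_cpu(cpu) && !in_softirq() &&
             hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
                /*
                 * Interrupted from user mode or from the idle loop:
                 * the cpu is in a quiescent state, hence the
                 * rcu_qsctr_inc() calls dominating the single-cpu
                 * numbers above.
                 */
                rcu_qsctr_inc(cpu);
                rcu_bh_qsctr_inc(cpu);
        } else if (!in_softirq())
                rcu_bh_qsctr_inc(cpu);
        raise_softirq(RCU_SOFTIRQ);
}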

							Thanx, Paul

View attachment "rcuclassic.h" of type "text/x-chdr" (3537 bytes)

View attachment "rcuclassic.c" of type "text/x-csrc" (28084 bytes)
