Date:	Sat, 23 Aug 2008 10:16:17 +0200
From:	Manfred Spraul <manfred@...orfullife.com>
To:	paulmck@...ux.vnet.ibm.com
CC:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [RFC, PATCH] state machine based rcu

Paul E. McKenney wrote:
>> -#if defined(CONFIG_PREEMPT_RCU) && defined(CONFIG_NO_HZ)
>> +#ifdef CONFIG_NO_HZ
>>  extern void rcu_irq_enter(void);
>>  extern void rcu_irq_exit(void);
>>  #else
>>  # define rcu_irq_enter() do { } while (0)
>>  # define rcu_irq_exit() do { } while (0)
>> -#endif /* CONFIG_PREEMPT_RCU */
>> +#endif /* CONFIG_NO_HZ */
>>     
>
> Good approach!  Will steal it.  ;-)
>
>   
I've attached an updated patch [now without the initial "From" line;
either Thunderbird or Dovecot can't handle that. Sorry for the noise
caused by posting everything three times].

Btw, does STP still exist? I'd like to do some testing on real SMP 
hardware. http://stp.testing.osdl.org/ appears to be dead.

>>  /*
>>   * It is safe to do non-atomic ops on ->hardirq_context,
>> diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h
>> index 1658995..811969f 100644
>> --- a/include/linux/rcuclassic.h
>> +++ b/include/linux/rcuclassic.h
>> @@ -28,6 +28,8 @@
>>   * For detailed explanation of Read-Copy Update mechanism see -
>>   * 		Documentation/RCU
>>   *
>> + * Rewrite based on a global state machine
>> + * (C) Manfred Spraul <manfred@...orfullife.com>, 2008
>>   */
>>
>>  #ifndef __LINUX_RCUCLASSIC_H
>> @@ -39,88 +41,97 @@
>>  #include <linux/percpu.h>
>>  #include <linux/cpumask.h>
>>  #include <linux/seqlock.h>
>> +#include <linux/rcucpumask.h>
>>
>> +/*
>> + * global state machine:
>> + * - each cpu regularly checks the global state and compares it with its own local state.
>> + * - if the two states do not match, then the cpu does the required work and afterwards
>> + *   - updates its local state
>> + *   - clears its bit in the cpu bitmask.
>> + * The state machine is protected by the following protocol:
>> + * the state can only change when all cpus have completed the current stage, thus
>> + * random changes cannot happen.
>> + * The only exception is the change from RCU_STATE_DESTROY to RCU_STATE_DESTROY_AND_COLLECT,
>> + * but this change doesn't matter, because RCU_STATE_DESTROY is a subset of
>> + * RCU_STATE_DESTROY_AND_COLLECT.
>> + *
>> + * The state is stored in the rcu_cpumask structure.
>> + */
>>     
>
> Interesting approach!  My main concern would be that this might extend
> grace periods (which has come up with preemptable RCU).  Or do you
> have some clever way of overlapping the required processing for the
> various states?
>
>   
No, no overlapping at all. But it shouldn't be slower than mainline:
Mainline has two grace periods between call_rcu() and the rcu callback.
My approach means one call and one grace period.

Your code might be a bit faster: if I understand it correctly, 
call_rcu() reads rdp->batch and includes everything in the next grace 
period.

> How do you handle the uncertainty as to when a given state begins?
> Here is an example sequence of events that I would be worried about:
>
> o	CPU 0 notices the end of a grace period, so updates the state.
>   
Global state is now DESTROY_AND_COLLECT.
> o	CPU 1 notices the new grace period while in a quiescent state.
> 	It checks into the RCU state machine.
>   
DESTROY_AND_COLLECT is now done for cpu 1. Btw, this operation does not 
require a quiescent state.
> o	CPU 1 starts a long-running RCU read-side critical section.
>
> o	CPU 2 deletes one of the elements that CPU 1 is referencing,
> 	and registers an RCU callback to free it after a grace period.
>
>   
Ok - here is the call_rcu(); the element is now in rcs->new.
> o	CPU 2 notices that a new grace period has commenced.
>
>   
CPU 2 notices DESTROY_AND_COLLECT and moves all elements from rcs->new 
to rcs->old.
> o	The remaining CPUs (other than CPU 1, which already passed
> 	through a quiescent state) pass through a quiescent state, ending
> 	the grace period.  CPU 1 remains in its RCU read-side critical
> 	section.
>   
Someone notices that DESTROY_AND_COLLECT is completed and moves the 
global state to GRACE.
> o	The RCU grace period ends, permitting CPU 2 to free the element
> 	that it removed -- but which CPU 1 is still referencing.
>   
No - that's impossible. The grace period only starts when the global 
state is set to GRACE, and all cpus must pass a quiescent state while 
in GRACE. What is still missing at this point:
- all cpus must pass a quiescent state.
- the last cpu moves the global state to DESTROY.
- cpu 2 notices that the global state is DESTROY. It moves the elements 
from rcs->old to rcs->dead and the softirq will destroy them.
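
To make the sequence concrete, the per-cpu check-in is roughly the
following (an untested, simplified sketch; the struct layout and the
helper names rcu_cpumask_clear_cpu_and_test()/rcu_advance_state() are
only illustrative, the real code is in the attached patch):

struct rcu_cpu_state {
	int state;		/* last global state this cpu completed */
	struct rcu_head *new;	/* queued by call_rcu(), not yet a batch */
	struct rcu_head *old;	/* the batch waiting for the grace period */
	struct rcu_head *dead;	/* survived a grace period, to be destroyed */
};

static void rcu_check_state(struct rcu_cpu_state *rcs, int global)
{
	if (rcs->state == global)
		return;			/* stage already completed here */

	switch (global) {
	case RCU_STATE_DESTROY_AND_COLLECT:
		/* DESTROY part: the old batch survived the grace period
		 * (assume the softirq already emptied ->dead). */
		rcs->dead = rcs->old;
		/* COLLECT part: close the batch for the next grace period. */
		rcs->old = rcs->new;
		rcs->new = NULL;
		break;
	case RCU_STATE_GRACE:
		/* Nothing to move: we only get here from a quiescent
		 * state, which is exactly what GRACE waits for. */
		break;
	case RCU_STATE_DESTROY:
		rcs->dead = rcs->old;
		rcs->old = NULL;
		break;
	}
	rcs->state = global;

	/* The last cpu to clear its bit advances the global state. */
	if (rcu_cpumask_clear_cpu_and_test(&rcu_global_state, smp_processor_id()))
		rcu_advance_state();
}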

Oh - I forgot to list one point in the patch summary:
I've merged the dead-pointer lists for the _bh and the normal variant; 
rcu_do_batch() operates on one unified list.
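
I.e. the softirq handler boils down to something like this (same
disclaimer: a simplified sketch reusing the illustrative struct from
above, not the code from the patch):

static void rcu_do_batch(struct rcu_cpu_state *rcs)
{
	struct rcu_head *list = rcs->dead;	/* _bh and normal, merged */

	rcs->dead = NULL;
	while (list) {
		struct rcu_head *next = list->next;

		list->func(list);		/* typically frees the object */
		list = next;
	}
}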

>   Jiangshan recently unified this into another stage of
> queuing, which seems to work very well -- and much more straightforwardly.
>   
My approach is similar: first, all cpus collect the pointers; then the 
grace period starts; when all cpus have passed it, the pointers are 
destroyed. New call_rcu() calls that arrive during the grace period are 
only queued, as in the sketch below.
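
So call_rcu() itself only queues; nothing it adds can become part of
the already running grace period. Roughly (again an illustrative
sketch, assuming a DEFINE_PER_CPU(struct rcu_cpu_state, rcu_cpu_state)):

void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *))
{
	unsigned long flags;
	struct rcu_cpu_state *rcs;

	head->func = func;
	local_irq_save(flags);
	rcs = &__get_cpu_var(rcu_cpu_state);
	head->next = rcs->new;	/* only picked up at the next COLLECT */
	rcs->new = head;
	local_irq_restore(flags);
}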

>> +/*
>> + * FIXME:
>> + * This is wrong:
>> + * NMIs are not handled.
>> + */
>>  #define call_rcu_sched(head, func) call_rcu(head, func)
>>     
>
> The approach preemptable RCU uses to interact with dynticks should
> handle this.  You mentioned using atomic operations previously, which
> might simplify the code (Steve and I were concerned that use of atomic
> ops in the interrupt path would get an automatic NACK, but it is quite
> possible that we were being too paranoid).
>
>   
I think it was a NACK on sparc, because sparc implemented atomic_t with 
a spinlock. I assume it's ok today.
If it's not ok, then I would have to find another solution. I'll wait 
for complaints.

>> +
>> +#ifndef __LINUX_RCUCPUMASK_H
>> +#define __LINUX_RCUCPUMASK_H
>> +
>> +#include <linux/spinlock.h>
>> +#include <linux/cpumask.h>
>> +
>> +#define RCUCPUMASK_CPULIMIT	512
>>     
>
> People are apparently looking at 4096 CPUs these days, FWIW.  I don't
> see any architectural limit in your code, so just FYI.
>
>   
The #define has a bad name: above that limit I would use a hierarchy 
instead of the flat rcu_cpumask. The hierarchy is not yet implemented.
>> +#if (NR_CPUS > RCUCPUMASK_CPULIMIT)
>> +
>> +Bla Bla Bla
>> +
>>     
Here the miracle occurs: "bla bla bla" is replaced by a rcu_cpumask 
structure with (probably) an array of atomic_t's instead of the simple 
"int cpus_open".

>> +/*
>> + * rcu_cpumode:
>> + * -1:
>> + * "normal" rcu behavior: the scheduler and the timer interrupt
>> + * check for grace periods; read side critical sections are permitted
>> + * everywhere.
>> + *
>> + * 0:
>> + * This cpu is sitting in the idle thread, with disabled hz timer.
>> + *
>> + * > 0:
>> + * The cpu is in an interrupt that interrupted a nohz idle thread.
>> + */
>>     
>
> This could be made to work, but the advantage of preemptable RCU's
> upcounter approach is the ability to count momentarily dropping into
> dyntick idle mode as a quiescent state -- even if we don't happen to
> look at that CPU while it is actually residing in dyntick idle mode.
>
>   
My code does the same thing: when the mode is "0", the cpu is ignored 
by the state machine and assumed to be outside any read-side critical 
sections. When switching from "1" to "0", the outstanding work for the 
current state is performed.
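
For reference, the mode switching is essentially the following
(simplified sketch; rcu_do_outstanding_work() is a made-up name, and as
discussed above the real updates must be NMI-safe, i.e. atomic):

DEFINE_PER_CPU(int, rcu_cpumode);

void rcu_irq_enter(void)
{
	int *mode = &__get_cpu_var(rcu_cpumode);

	if (*mode != -1)
		(*mode)++;	/* irq (or NMI) hit a nohz idle cpu */
}

void rcu_irq_exit(void)
{
	int *mode = &__get_cpu_var(rcu_cpumode);

	if (*mode != -1) {
		(*mode)--;
		if (*mode == 0)
			rcu_do_outstanding_work();	/* made-up name */
	}
}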

Thanks for the detailed review!

Attached is an updated patch; NO_HZ and NMI are now implemented.

--
    Manfred

[Attachment: "0001-kernel-rcustate.c-state-machine-based-rcu-implement.patch", text/plain, 53479 bytes]
