linux-kernel - Re: [PATCH 1/2] rcu: Comment on the extraneous delta test on rcu_seq_done

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <EF52A95D-C68B-4CB7-ADF3-5278FA79DCA5@nvidia.com>
Date: Sun, 23 Mar 2025 22:57:39 +0000
From: Joel Fernandes <joelagnelf@...dia.com>
To: Frederic Weisbecker <frederic@...nel.org>
CC: "Paul E. McKenney" <paulmck@...nel.org>, LKML
	<linux-kernel@...r.kernel.org>, Boqun Feng <boqun.feng@...il.com>, Neeraj
 Upadhyay <neeraj.upadhyay@....com>, Uladzislau Rezki <urezki@...il.com>,
	Zqiang <qiang.zhang1211@...il.com>, rcu <rcu@...r.kernel.org>
Subject: Re: [PATCH 1/2] rcu: Comment on the extraneous delta test on
 rcu_seq_done_exact()



> On Mar 23, 2025, at 11:05 PM, Frederic Weisbecker <frederic@...nel.org> wrote:
> 
> Le Sat, Mar 22, 2025 at 03:40:53PM +0100, Joel Fernandes a écrit :
>> 
>> 
>>> On 3/22/2025 3:20 PM, Joel Fernandes wrote:
>>> 
>>> On 3/22/2025 11:25 AM, Frederic Weisbecker wrote:
>>>> Le Sat, Mar 22, 2025 at 03:06:08AM +0100, Joel Fernandes a écrit :
>>>>> Insomnia kicked in, so 3 am reply here (Zurich local time) ;-):
>>>>> 
>>>>> On 3/20/2025 3:15 PM, Frederic Weisbecker wrote:
>>>>>> Le Wed, Mar 19, 2025 at 03:38:31PM -0400, Joel Fernandes a écrit :
>>>>>>> On Tue, Mar 18, 2025 at 11:37:38AM -0700, Paul E. McKenney wrote:
>>>>>>>> On Tue, Mar 18, 2025 at 02:56:18PM +0100, Frederic Weisbecker wrote:
>>>>>>>>> The numbers used in rcu_seq_done_exact() lack some explanation behind
>>>>>>>>> their magic. Especially after the commit:
>>>>>>>>> 
>>>>>>>>>    85aad7cc4178 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")
>>>>>>>>> 
>>>>>>>>> which reported a subtle issue where a new GP sequence snapshot was taken
>>>>>>>>> on the root node state while a grace period had already been started and
>>>>>>>>> reflected on the global state sequence but not yet on the root node
>>>>>>>>> sequence, making a polling user waiting on a wrong already started grace
>>>>>>>>> period that would ignore freshly online CPUs.
>>>>>>>>> 
>>>>>>>>> The fix involved taking the snaphot on the global state sequence and
>>>>>>>>> waiting on the root node sequence. And since a grace period is first
>>>>>>>>> started on the global state and only afterward reflected on the root
>>>>>>>>> node, a snapshot taken on the global state sequence might be two full
>>>>>>>>> grace periods ahead of the root node as in the following example:
>>>>>>>>> 
>>>>>>>>> rnp->gp_seq = rcu_state.gp_seq = 0
>>>>>>>>> 
>>>>>>>>>    CPU 0                                           CPU 1
>>>>>>>>>    -----                                           -----
>>>>>>>>>    // rcu_state.gp_seq = 1
>>>>>>>>>    rcu_seq_start(&rcu_state.gp_seq)
>>>>>>>>>                                                    // snap = 8
>>>>>>>>>                                                    snap = rcu_seq_snap(&rcu_state.gp_seq)
>>>>>>>>>                                                    // Two full GP differences
>>>>>>>>>                                                    rcu_seq_done_exact(&rnp->gp_seq, snap)
>>>>>>>>>    // rnp->gp_seq = 1
>>>>>>>>>    WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);
>>>>>>>>> 
>>>>>>>>> Add a comment about those expectations and to clarify the magic within
>>>>>>>>> the relevant function.
>>>>>>>>> 
>>>>>>>>> Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
>>>>>>>> Reviewed-by: Paul E. McKenney <paulmck@...nel.org>
>>>>>>>> 
>>>>>>>> But it would of course be good to get reviews from the others.
>>>>>>> I actually don't agree that the magic in the rcu_seq_done_exact() function about the
>>>>>>> ~2 GPs is related to the lag between rcu_state.gp_seq and root rnp->gp_seq,
>>>>>>> because the small lag can just as well survive with the rcu_seq_done()
>>>>>>> function in the above sequence right?
>>>>>>> 
>>>>>>> The rcu_seq_done_exact() function on the other hand is more about not being
>>>>>>> stuck in the ULONG_MAX/2 guard band, but to actually get to that, you need a
>>>>>>> wrap around to happen and the delta between "rnp->gp_seq" and "snap" to be at
>>>>>>> least ULONG_MAX/2 AFAIU.
>>>>>>> 
>>>>>>> So the only time this magic will matter is if you have a huge delta between
>>>>>>> what is being compared, not just 2 GPs.
>>>>>> You're right, and perhaps I should have made it more specific that my comment
>>>>>> only explains the magic "3" number here, in that if it were "2" instead, there
>>>>>> could be accidents with 2 full GPs difference (which is possible) spuriously
>>>>>> accounted as a wrap around.
>>>>> Ahh, so I guess I get it now and we are both right. The complete picture is - We
>>>>> are trying to handle the case of "very large wrap" around but as a part of that,
>>>>> we don't want to create false-positives for this "snap" case.
>>>>> 
>>>>> A "snap" can be atmost  (2 * RCU_SEQ_STATE_MASK + 1) away from a gp_seq.
>>>>> 
>>>>> That's within "2 GPs" worth of counts (about 8 counts)
>>>>> 
>>>>> Taking some numbers:
>>>>> 
>>>>> cur_s    s    delta (s - cur_s)
>>>>> 0    4    4
>>>>> 1    8    7
>>>>> 2    8    6
>>>>> 3    8    5
>>>>> 4    8    4
>>>>> 5    12    7
>>>>> 
>>>>> The maximum delta of a snap from actual gp_seq can be (2 * RCU_SEQ_STATE_MASK +
>>>>> 1) which in this case is 7.
>>>>> 
>>>>> So we adjust the comparison by adding the  ULONG_CMP_LT(cur_s, s - (2 *
>>>>> RCU_SEQ_STATE_MASK + 1)). i.e.
>>>> 3, right?
>>> Just to be absolutely sure, are you talking about the value of RCU_SEQ_STATE_MASK ?
>>> 
>>> That is indeed 3 (RCU_SEQ_STATE_MASK).
>>> 
>>> But if we're talking about number of GPs, my understanding is a count of 4 is
>>> one GP worth. Per the above table, the delta between gp_seq and is snap is
>>> always a count of 7 (hence less than 2 GPs).
>>> 
>>> Agreed?
>>> 
>>> If RCU_SEQ_STATE_MASK was 0x1 instead of 0x11, that is a single bit (or a count
>>> of 2 instead of 4, for a GP), then the table would be:
>>> 
>>> cur_s    s (snap)    delta (s - cur_s)
>>> 0    2        2
>>> 1    4        3
>>> 2    4        2
>>> 3    6        3
>>> 4    6        2
>>> 5    8        3
>>> 
>>> So delta is always <= 3,  Or more generally: <= (RCU_SEQ_STATE_MASK * 2) + 1
>> 
>> Oh man, I am wondering if we are on to a bug here:
>> 
>> From your example:
>> 
>>    CPU 0                                           CPU 1
>>    -----                                           -----
>>    // rcu_state.gp_seq = 1
>>    rcu_seq_start(&rcu_state.gp_seq)
>>                                      // snap = 8
>>                                      snap = rcu_seq_snap(&rcu_state.gp_seq)
>>                                      // Two full GP
>>                                      rcu_seq_done_exact(&rnp->gp_seq, snap)
>> 
>> 
>> Here, the
>> ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
>> 
>> Will be
>> ULONG_CMP_LT(0, 8 - (2 * RCU_SEQ_STATE_MASK + 1));
>> 
>> = ULONG_CMP_LT(0, 8 - 7)
>> 
>> = TRUE.
>> 
>> Which means rcu_seq_done_exact() will return a false positive saying the GP has
>> completed even though it has not.
>> 
>> I think rcu_seq_done_exact() is off by one and should be doing:
>> 
>> ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 2));
>> 
>> ?
> 
> But it's ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1) now since:
> 
>    85aad7cc4178 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")
> 
> That's 10 so we are good.
> 
> However that magic value is arbitrary and doesn't mean much. It should be
> like you said. Or rather for clarity:
> 
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index 7acf1f36dd6c..e53f0b687a83 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -57,6 +57,10 @@
> /* Low-order bit definition for polled grace-period APIs. */
> #define RCU_GET_STATE_COMPLETED    0x1
> 
> +/* A complete grace period count */
> +#define RCU_SEQ_GP (RCU_SEQ_STATE_MASK + 1)
> +
> +
> extern int sysctl_sched_rt_runtime;
> 
> /*
> @@ -169,7 +173,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> {
>    unsigned long cur_s = READ_ONCE(*sp);
> 
> -    return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> +    return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_GP));
> }

Ah, my kernel did not have the change at the time I commented, sorry. I agree that this is a much better and meaningful expression than the existing one. I shall create a patch with it and send it out with my other series on forcing the wrap for testing (along with the modification for the new comments you added).

Thanks!


> 
> /*
> 
>