linux-kernel - Re: [PATCH 1/2] rcu: Comment on the extraneous delta test on rcu_seq_done

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <42297734-44af-42f9-8544-f3b25e35d3ef@nvidia.com>
Date: Sat, 22 Mar 2025 15:40:53 +0100
From: Joel Fernandes <joelagnelf@...dia.com>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
 LKML <linux-kernel@...r.kernel.org>, Boqun Feng <boqun.feng@...il.com>,
 Neeraj Upadhyay <neeraj.upadhyay@....com>,
 Uladzislau Rezki <urezki@...il.com>, Zqiang <qiang.zhang1211@...il.com>,
 rcu <rcu@...r.kernel.org>
Subject: Re: [PATCH 1/2] rcu: Comment on the extraneous delta test on
 rcu_seq_done_exact()



On 3/22/2025 3:20 PM, Joel Fernandes wrote:
> 
> On 3/22/2025 11:25 AM, Frederic Weisbecker wrote:
>> Le Sat, Mar 22, 2025 at 03:06:08AM +0100, Joel Fernandes a écrit :
>>> Insomnia kicked in, so 3 am reply here (Zurich local time) ;-):
>>>
>>> On 3/20/2025 3:15 PM, Frederic Weisbecker wrote:
>>>> Le Wed, Mar 19, 2025 at 03:38:31PM -0400, Joel Fernandes a écrit :
>>>>> On Tue, Mar 18, 2025 at 11:37:38AM -0700, Paul E. McKenney wrote:
>>>>>> On Tue, Mar 18, 2025 at 02:56:18PM +0100, Frederic Weisbecker wrote:
>>>>>>> The numbers used in rcu_seq_done_exact() lack some explanation behind
>>>>>>> their magic. Especially after the commit:
>>>>>>>
>>>>>>>     85aad7cc4178 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")
>>>>>>>
>>>>>>> which reported a subtle issue where a new GP sequence snapshot was taken
>>>>>>> on the root node state while a grace period had already been started and
>>>>>>> reflected on the global state sequence but not yet on the root node
>>>>>>> sequence, making a polling user waiting on a wrong already started grace
>>>>>>> period that would ignore freshly online CPUs.
>>>>>>>
>>>>>>> The fix involved taking the snaphot on the global state sequence and
>>>>>>> waiting on the root node sequence. And since a grace period is first
>>>>>>> started on the global state and only afterward reflected on the root
>>>>>>> node, a snapshot taken on the global state sequence might be two full
>>>>>>> grace periods ahead of the root node as in the following example:
>>>>>>>
>>>>>>> rnp->gp_seq = rcu_state.gp_seq = 0
>>>>>>>
>>>>>>>     CPU 0                                           CPU 1
>>>>>>>     -----                                           -----
>>>>>>>     // rcu_state.gp_seq = 1
>>>>>>>     rcu_seq_start(&rcu_state.gp_seq)
>>>>>>>                                                     // snap = 8
>>>>>>>                                                     snap = rcu_seq_snap(&rcu_state.gp_seq)
>>>>>>>                                                     // Two full GP differences
>>>>>>>                                                     rcu_seq_done_exact(&rnp->gp_seq, snap)
>>>>>>>     // rnp->gp_seq = 1
>>>>>>>     WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);
>>>>>>>
>>>>>>> Add a comment about those expectations and to clarify the magic within
>>>>>>> the relevant function.
>>>>>>>
>>>>>>> Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
>>>>>> Reviewed-by: Paul E. McKenney <paulmck@...nel.org>
>>>>>>
>>>>>> But it would of course be good to get reviews from the others.
>>>>> I actually don't agree that the magic in the rcu_seq_done_exact() function about the
>>>>> ~2 GPs is related to the lag between rcu_state.gp_seq and root rnp->gp_seq,
>>>>> because the small lag can just as well survive with the rcu_seq_done()
>>>>> function in the above sequence right?
>>>>>
>>>>> The rcu_seq_done_exact() function on the other hand is more about not being
>>>>> stuck in the ULONG_MAX/2 guard band, but to actually get to that, you need a
>>>>> wrap around to happen and the delta between "rnp->gp_seq" and "snap" to be at
>>>>> least ULONG_MAX/2 AFAIU.
>>>>>
>>>>> So the only time this magic will matter is if you have a huge delta between
>>>>> what is being compared, not just 2 GPs.
>>>> You're right, and perhaps I should have made it more specific that my comment
>>>> only explains the magic "3" number here, in that if it were "2" instead, there
>>>> could be accidents with 2 full GPs difference (which is possible) spuriously
>>>> accounted as a wrap around.
>>> Ahh, so I guess I get it now and we are both right. The complete picture is - We
>>> are trying to handle the case of "very large wrap" around but as a part of that,
>>> we don't want to create false-positives for this "snap" case.
>>>
>>> A "snap" can be atmost  (2 * RCU_SEQ_STATE_MASK + 1) away from a gp_seq.
>>>
>>> That's within "2 GPs" worth of counts (about 8 counts)
>>>
>>> Taking some numbers:
>>>
>>> cur_s	s	delta (s - cur_s)
>>> 0	4	4
>>> 1	8	7
>>> 2	8	6
>>> 3	8	5
>>> 4	8	4
>>> 5	12	7
>>>
>>> The maximum delta of a snap from actual gp_seq can be (2 * RCU_SEQ_STATE_MASK +
>>> 1) which in this case is 7.
>>>
>>> So we adjust the comparison by adding the  ULONG_CMP_LT(cur_s, s - (2 *
>>> RCU_SEQ_STATE_MASK + 1)). i.e.
>> 3, right?
> Just to be absolutely sure, are you talking about the value of RCU_SEQ_STATE_MASK ?
> 
> That is indeed 3 (RCU_SEQ_STATE_MASK).
> 
> But if we're talking about number of GPs, my understanding is a count of 4 is
> one GP worth. Per the above table, the delta between gp_seq and is snap is
> always a count of 7 (hence less than 2 GPs).
> 
> Agreed?
> 
> If RCU_SEQ_STATE_MASK was 0x1 instead of 0x11, that is a single bit (or a count
> of 2 instead of 4, for a GP), then the table would be:
> 
>  cur_s	s (snap)	delta (s - cur_s)
>  0	2		2
>  1	4		3
>  2	4		2
>  3	6		3
>  4	6		2
>  5	8		3
> 
> So delta is always <= 3,  Or more generally: <= (RCU_SEQ_STATE_MASK * 2) + 1

Oh man, I am wondering if we are on to a bug here:

>From your example:

    CPU 0                                           CPU 1
    -----                                           -----
    // rcu_state.gp_seq = 1
    rcu_seq_start(&rcu_state.gp_seq)
                                      // snap = 8
                                      snap = rcu_seq_snap(&rcu_state.gp_seq)
                                      // Two full GP
                                      rcu_seq_done_exact(&rnp->gp_seq, snap)


Here, the
ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));

Will be
ULONG_CMP_LT(0, 8 - (2 * RCU_SEQ_STATE_MASK + 1));

 = ULONG_CMP_LT(0, 8 - 7)

 = TRUE.

Which means rcu_seq_done_exact() will return a false positive saying the GP has
completed even though it has not.

I think rcu_seq_done_exact() is off by one and should be doing:

ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 2));

?

Oh and I see from an old email that Frederic did ask about this: "Should it be
ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?"

thanks,

 - Joel