linux-kernel - Re: [RFC 0/2] srcu: Remove pre-flip memory barrier

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEXW_YTyZaE4ULvm-HygFN2BGm-jayHTbpnYbrJFoo_GOsYKQg@mail.gmail.com>
Date:   Sun, 18 Dec 2022 19:24:23 -0500
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:     linux-kernel@...r.kernel.org,
        Josh Triplett <josh@...htriplett.org>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        "Paul E. McKenney" <paulmck@...nel.org>, rcu@...r.kernel.org,
        Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [RFC 0/2] srcu: Remove pre-flip memory barrier

On Sun, Dec 18, 2022 at 7:04 PM Joel Fernandes <joel@...lfernandes.org> wrote:
>
> Hi Mathieu,
>
> On Sun, Dec 18, 2022 at 6:38 PM Mathieu Desnoyers
> <mathieu.desnoyers@...icios.com> wrote:
> >
> > On 2022-12-18 16:30, Joel Fernandes wrote:
> > > Hi Mathieu,
> > >
> > > On Sun, Dec 18, 2022 at 3:56 PM Mathieu Desnoyers
> > > <mathieu.desnoyers@...icios.com> wrote:
> > >>
> > >> On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
> > >>> Hello, I believe the pre-flip memory barrier is not required. The only reason I
> > >>> can say to remove it, other than the possibility that it is unnecessary, is to
> > >>> not have extra code that does not help. However, since we are issuing a fully
> > >>> memory-barrier after the flip, I cannot say that it hurts to do it anyway.
> > >>>
> > >>> For this reason, please consider these patches as "informational", than a
> > >>> "please merge". :-) Though, feel free to consider merging if you agree!
> > >>>
> > >>> All SRCU scenarios pass with these, with 6 hours of testing.
> > >>
> > >> Hi Joel,
> > >>
> > >> Please have a look at the comments in my side-rcu implementation [1, 2].
> > >> It is similar to what SRCU does (per-cpu counter based grace period
> > >> tracking), but implemented for userspace. The comments explain why this
> > >> works without the memory barrier you identify as useless in SRCU.
> > >>
> > >> Following my implementation of side-rcu, I reviewed the SRCU comments
> > >> and identified that the barrier "/* E */" appears to be useless. I even
> > >> discussed this privately with Paul E. McKenney.
> > >>
> > >> My implementation and comments go further though, and skip the period
> > >> "flip" entirely if the first pass observes that all readers (in both
> > >> periods) are quiescent.
> > >
> > > Actually in SRCU, the first pass scans only 1 index, then does the
> > > flip, and the second pass scans the second index. Without doing a
> > > flip, an index cannot be scanned for forward progress reasons because
> > > it is still "active". So I am curious how you can skip flip and still
> > > scan both indexes? I will dig more into your implementation to learn more.
> >
> > If we look at SRCU read-side:
> >
> > int __srcu_read_lock(struct srcu_struct *ssp)
> > {
> >          int idx;
> >
> >          idx = READ_ONCE(ssp->srcu_idx) & 0x1;
> >          this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
> >          smp_mb(); /* B */  /* Avoid leaking the critical section. */
> >          return idx;
> > }
> >
> > If the thread is preempted for a long period of time between load of
> > ssp->srcu_idx and increment of srcu_lock_count[idx], this means this
> > thread can appear as a "new reader" for the idx period at any arbitrary
> > time in the future, independently of which period is the current one
> > within a future grace period.
> >
> > As a result, the grace period algorithm needs to inherently support the
> > fact that a "new reader" can appear in any of the two periods,
> > independently of the current period state.
> >
> > As a result, this means that while within period "0", we _need_ to allow
> > newly coming readers to appear as we scan period "0".
>
> Sure, it already does handle it but that is I believe it is a corner
> case, not the norm.
>
> > As a result, we can simply scan both periods 0/1 for reader quiescence,
> > even while new readers appear within those periods.
>
> I think this is a bit dangerous. Yes there is the preemption thing you
> mentioned above, but that is bounded since you can only have a fixed
> number of tasks that underwent that preemption, and it is quite rare
> in the sense, each reader should get preempted just after sampling idx
> but not incrementing lock count.
>
> However, if we scan while new readers appear (outside of the above
> preemption problem), we can have counter wrap causing a false match
> much quicker.
> The scan loop is:
> check_readers(idx) {
>    count_all_unlocks(idx);
>    smp_mb();
>    count_all_locks(idx);
>    bool done = (locks == unlocks)
>    if (done) {
>          // readers are done, end scan for this idx.
>    } else {
>          // try again later
>    }
> }
>
> So if check_readers() got preempted just after the smp_mb(), then you
> can have lots of tasks enter and exit the read-side critical section
> and increment the locks count. Eventually locks == unlocks will
> happen, and it is screwed. Sure this is also theoretical, but yeah
> that issue can be made "worse" by scanning active readers
> deliberately, especially when such readers can also nest arbitrarily.
>
> > As a result, flipping between periods 0/1 is just relevant for forward
> > progress, not for correctness.
>
> Sure, agreed, forward progress.

Adding to the last statement "But also correctness as described above".

thanks,

 - Joel