linux-kernel - Re: [PATCH v5 06/18] rcu: Introduce call_rcu


Message-ID: <8812ea75-ef14-0d5d-19d8-bda70394b41a@joelfernandes.org>
Date:   Tue, 6 Sep 2022 22:56:01 -0400
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Frederic Weisbecker <frederic@...nel.org>
Cc:     rcu@...r.kernel.org, linux-kernel@...r.kernel.org,
        rushikesh.s.kadam@...el.com, urezki@...il.com,
        neeraj.iitr10@...il.com, paulmck@...nel.org, rostedt@...dmis.org,
        vineeth@...byteword.org, boqun.feng@...il.com
Subject: Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API
 implementation



On 9/6/2022 3:11 PM, Frederic Weisbecker wrote:
> On Tue, Sep 06, 2022 at 12:43:52PM -0400, Joel Fernandes wrote:
>> On 9/6/2022 12:38 PM, Joel Fernandes wrote:
>> Ah, now I know why I got confused. I *used* to flush the bypass list whenever
>> !lazy CBs showed up. Paul suggested this is overkill. In this old overkill
>> method, I was missing a wakeup, which was likely causing the boot regression.
>> Forcing a wakeup fixed that. Now, in v5, I no longer do the flush on a !lazy
>> rate-limit.
>>
>> I am sorry for the confusion. Either way, in my defense, this is just an extra
>> bit of code that I have to delete. This code is hard. I have mostly relied on
>> test-driven development, but now, thanks to this review, I am learning the
>> code more and more...
> 
> Yeah, this code is hard.
> 
> Especially as it's possible to flush from both sides and queue the timer
> from both sides. And both sides read the bypass/lazy counter locklessly.
> But only call_rcu_*() can queue/increase the bypass size whereas only
> nocb_gp_wait() can cancel the timer. Phew!
> 

Haha, indeed ;-)
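To make that ownership split concrete, here is a minimal user-space C11 sketch. It is a hypothetical model, not the kernel code: `struct bypass_model`, `model_enqueue()` and `model_gp_flush()` are invented names standing in for call_rcu_*() and nocb_gp_wait(). The timer is reduced to a flag, the kernel's locking is left out, and sequentially consistent atomics are assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one offloaded CPU's bypass; not kernel code. */
struct bypass_model {
	atomic_long len;          /* bypass size, read locklessly by both sides */
	atomic_bool timer_armed;  /* stand-in for the deferred-wakeup timer */
};

static void model_init(struct bypass_model *b)
{
	atomic_init(&b->len, 0);
	atomic_init(&b->timer_armed, false);
}

/* Enqueue side (stands in for call_rcu_*()): the only side that grows
 * the bypass. It may arm the timer but never cancels it. */
static void model_enqueue(struct bypass_model *b)
{
	atomic_fetch_add(&b->len, 1);
	atomic_store(&b->timer_armed, true);
}

/* Flush side (stands in for nocb_gp_wait()): the only side that cancels
 * the timer. It cancels first and takes the whole bypass second, so any
 * CB that misses this exchange was added after it, and its enqueuer
 * re-arms the timer after our cancel. The worst case is a spurious
 * timer on an empty bypass, never a stranded CB. */
static long model_gp_flush(struct bypass_model *b)
{
	atomic_store(&b->timer_armed, false);
	return atomic_exchange(&b->len, 0);
}
```

In this model, swapping the two statements in model_gp_flush() reopens a lost-wakeup window: the exchange runs, an enqueue adds and arms, and the late cancel then leaves the new CB with no timer.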

> Among the many possible dances between rcu_nocb_try_bypass()
> and nocb_gp_wait(), I haven't found a way yet for the timer to be
> set to LAZY when it should be BYPASS (or some other kind of accident
> such as an ignored callback).
> In the worst case we may arm an earlier timer than necessary
> (RCU_NOCB_WAKE_BYPASS instead of RCU_NOCB_WAKE_LAZY for example).
> 
> Famous last words...

Agreed.
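The "worst case" argument above can be phrased as an invariant: the armed timer must fire no later than the queued CBs require. A minimal sketch follows, assuming that LAZY is only correct when every CB in the bypass is lazy. `enum wake_kind`, `required_kind()` and `choice_is_safe()` are hypothetical names standing in for RCU_NOCB_WAKE_BYPASS/RCU_NOCB_WAKE_LAZY handling, and the delays are illustrative, not the kernel's values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for RCU_NOCB_WAKE_BYPASS / RCU_NOCB_WAKE_LAZY.
 * The delays below are illustrative, not the kernel's values. */
enum wake_kind { WAKE_BYPASS, WAKE_LAZY };

static long delay_ms(enum wake_kind k)
{
	return k == WAKE_LAZY ? 10000 : 2;   /* illustrative only */
}

/* The long LAZY delay is only correct if every queued CB is lazy. */
static enum wake_kind required_kind(long len, long lazy_len)
{
	return (len > 0 && lazy_len == len) ? WAKE_LAZY : WAKE_BYPASS;
}

/* A choice is safe if it fires no later than required. Arming BYPASS
 * for an all-lazy bypass only wakes early (wasted power); arming LAZY
 * while a non-lazy CB is queued delays it (a real bug). */
static bool choice_is_safe(enum wake_kind armed, long len, long lazy_len)
{
	if (len == 0)
		return true;   /* nothing queued: any timer is harmless */
	return delay_ms(armed) <= delay_ms(required_kind(len, lazy_len));
}
```

Under this model, "BYPASS instead of LAZY" passes the check and only costs power, while "LAZY instead of BYPASS" fails it, which is the accident being hunted for.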

On the issue of regressions with non-lazy things being treated as lazy, I was
thinking of adding a bounded-time check to:

[PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation.

There, if a non-lazy CB takes an abnormally long time to be invoked (say,
because it was subject to a race condition), it would splat. This can be done
because I am tracking the queue time in the rcu_head in that patch.
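A rough user-space sketch of such a check, assuming the queue timestamp and a lazy flag are available at invocation time. `struct traced_cb`, `NONLAZY_LIMIT_MS` and `cb_latency_splat()` are hypothetical names, the limit is an arbitrary placeholder, and printing to stderr stands in for a kernel WARN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an rcu_head carrying the queue time that the
 * per-CB tracing patch records; field names are illustrative. */
struct traced_cb {
	long queued_ms;   /* timestamp taken when the CB was queued */
	bool lazy;        /* lazy CBs are allowed to wait much longer */
};

/* Hypothetical bound for non-lazy CBs; a real threshold would need tuning. */
#define NONLAZY_LIMIT_MS 1000L

/* Called at invocation time. Returns true (and "splats" to stderr, a
 * user-space stand-in for a kernel WARN) if a non-lazy CB waited longer
 * than the bound, e.g. because a race left it being treated as lazy. */
static bool cb_latency_splat(const struct traced_cb *cb, long now_ms)
{
	long waited = now_ms - cb->queued_ms;

	if (cb->lazy || waited <= NONLAZY_LIMIT_MS)
		return false;
	fprintf(stderr, "non-lazy CB waited %ld ms (limit %ld ms)\n",
		waited, NONLAZY_LIMIT_MS);
	return true;
}
```

Lazy CBs are exempt by design here, since their long wait is the intended behavior; only non-lazy CBs that overshoot the bound are reported.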

On another note, boot-time regressions show up pretty quickly (at least on
ChromeOS) when non-lazy things become lazy, and so far the latest code has
fortunately been pretty well behaved.

Thanks,

 - Joel