Message-ID: <19388b60-5e71-463d-ba68-9064d0caa224@paulmck-laptop>
Date: Fri, 24 Jan 2025 07:58:20 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: rcu@...r.kernel.org, linux-kernel@...r.kernel.org, kernel-team@...a.com,
rostedt@...dmis.org
Subject: Re: [PATCH RFC v2 rcu] Fix get_state_synchronize_rcu_full() GP-start
detection
On Fri, Jan 24, 2025 at 03:49:24PM +0100, Frederic Weisbecker wrote:
> Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit :
> > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
> > functions use the root rcu_node structure's ->gp_seq field to detect
> > the beginnings and ends of grace periods, respectively. This choice is
> > necessary for the poll_state_synchronize_rcu_full() function because
> > (give or take counter wrap) the following sequence is guaranteed not
> > to trigger:
> >
> > get_state_synchronize_rcu_full(&rgos);
> > synchronize_rcu();
> > WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));
> >
> > The RCU callbacks that awaken synchronize_rcu() instances are
> > guaranteed not to be invoked before the root rcu_node structure's
> > ->gp_seq field is updated to indicate the end of the grace period.
> > However, these callbacks might start being invoked immediately
> > thereafter, in particular, before rcu_state.gp_seq has been updated.
> > Therefore, poll_state_synchronize_rcu_full() must refer to the
> > root rcu_node structure's ->gp_seq field. Because this field is
> > updated under this structure's ->lock, any code following a call to
> > poll_state_synchronize_rcu_full() will be fully ordered after the
> > full grace-period computation, as is required by RCU's memory-ordering
> > semantics.
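
For concreteness, the intended usage pattern looks roughly like the
following.  This is a hypothetical caller, not code from this patch:
"struct foo", foo_free_cb(), and foo_retire() are made-up names, but the
get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
calls are the real full-sized polled-grace-period API.

	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct foo {				/* Hypothetical type. */
		struct rcu_head rh;
		int data;
	};

	static void foo_free_cb(struct rcu_head *rhp)
	{
		kfree(container_of(rhp, struct foo, rh));
	}

	static void foo_retire(struct foo *old)	/* Hypothetical caller. */
	{
		struct rcu_gp_oldstate rgos;

		/* First unlink "old" from its RCU-protected structure, then: */
		get_state_synchronize_rcu_full(&rgos);

		/* ...possibly much later... */
		if (poll_state_synchronize_rcu_full(&rgos))
			kfree(old);			 /* Full grace period has elapsed. */
		else
			call_rcu(&old->rh, foo_free_cb); /* Not yet, so defer the free.    */
	}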
> >
> > By symmetry, the get_state_synchronize_rcu_full() function should also
> > use this same root rcu_node structure's ->gp_seq field. But it turns out
> > that symmetry is profoundly (though extremely infrequently) destructive
> > in this case. To see this, consider the following sequence of events:
> >
> > 1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq
> > accordingly.
> >
> > 2. As its first step of grace-period initialization, CPU 0 examines
> > the current CPU hotplug state and decides that it need not wait
> > for CPU 1, which is currently offline.
> >
> > 3. CPU 1 comes online, and updates its state. This does not affect
> > the current grace period, but rather the one after that.
> > After all, CPU 1 was offline when the current grace period
> > started, so all pre-existing RCU readers on CPU 1 must have
> > completed or been preempted before it last went offline.
> > The current grace period therefore has nothing it needs to wait
> > for on CPU 1.
> >
> > 4. CPU 1 switches to an rcutorture kthread which is running
> > rcutorture's rcu_torture_reader() function, which starts a new
> > RCU reader.
> >
> > 5. CPU 2 is running rcutorture's rcu_torture_writer() function
> > and collects a new polled grace-period "cookie" using
> > get_state_synchronize_rcu_full(). Because the newly started
> > grace period has not completed initialization, the root rcu_node
> > structure's ->gp_seq field has not yet been updated to indicate
> > that this new grace period has already started.
> >
> > This cookie is therefore set up for the end of the current grace
> > period (rather than the end of the following grace period).
> >
> > 6. CPU 0 finishes grace-period initialization.
> >
> > 7. If CPU 1’s rcutorture reader is preempted, it will be added to
> > the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not
> > set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer
> > will not be updated. Thus, this grace period will not wait on
> > it. Which is only fair, given that the CPU did not come online
> > until after the grace period officially started.
> >
> > 8. CPUs 0 and 2 then detect the new grace period and report a
> > quiescent state to the RCU core.
> >
> > 9. Because CPU 1 was offline at the start of the current grace
> > period, CPUs 0 and 2 are the only CPUs that this grace period
> > needs to wait on. So the grace period ends and post-grace-period
> > cleanup starts. In particular, the root rcu_node structure's
> > ->gp_seq field is updated to indicate that this grace period
> > has now ended.
> >
> > 10. CPU 2 continues running rcu_torture_writer() and sees that,
> > from the viewpoint of the root rcu_node structure consulted by
> > the poll_state_synchronize_rcu_full() function, the grace period
> > has ended. It therefore updates state accordingly.
> >
> > 11. CPU 1 is still running the same RCU reader, which notices this
> > update and thus complains about the too-short grace period.
>
> I think I get the race but I must confess I'm not very familiar with how this
> all materializes on CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader().
>
> Basically this could trigger on CPU 1 by just doing the following, right?
>
> rcu_read_lock()
> get_state_synchronize_rcu_full(&rgos);
> WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos))
> rcu_read_unlock()

CPU 1 would do rcu_read_lock()/checks/rcu_read_unlock() as the
reader, and CPU 2 would do get_state_synchronize_rcu_full(), later
poll_state_synchronize_rcu_full(), which would (erroneously) indicate
a completed grace period, so it would update the state, triggering CPU
1's checks.
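
Roughly, and with made-up names rather than the actual rcutorture code
("struct obj", gp, OBJ_FREED, writer_side(), and reader_side() are all
illustrative), the interaction has this shape:

	struct obj {
		int state;			/* OBJ_LIVE or OBJ_FREED. */
	};
	enum { OBJ_LIVE, OBJ_FREED };
	static struct obj __rcu *gp;		/* Illustrative RCU-protected pointer. */

	/* CPU 2, standing in for rcu_torture_writer(): */
	static void writer_side(struct obj *old)
	{
		struct rcu_gp_oldstate rgos;

		rcu_assign_pointer(gp, NULL);			/* Unpublish the object.          */
		get_state_synchronize_rcu_full(&rgos);		/* Step 5: collect the cookie.    */
		/* ...later... */
		if (poll_state_synchronize_rcu_full(&rgos))	/* Step 10: looks (wrongly) done. */
			WRITE_ONCE(old->state, OBJ_FREED);	/* Poison/reuse the object.       */
	}

	/* CPU 1, standing in for rcu_torture_reader(), running concurrently: */
	static void reader_side(void)
	{
		struct obj *p;

		rcu_read_lock();				/* Step 4: reader starts.          */
		p = rcu_dereference(gp);			/* Fetched before the unpublish... */
		if (p)
			WARN_ON_ONCE(READ_ONCE(p->state) == OBJ_FREED);	/* Step 11: complains. */
		rcu_read_unlock();
	}
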
> > The fix is for the get_state_synchronize_rcu_full() function to use
> > rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq
> > field. With this change in place, if step 5's cookie indicates that the
> > grace period has not yet started, then any prior code executed by CPU 2
> > must have happened before CPU 1 came online. This will in turn prevent
> > CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait,
> > thus preventing CPU 1 from being subjected to a too-short grace period.
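
For reference, the change amounts to roughly the following in
get_state_synchronize_rcu_full().  This is paraphrased from
kernel/rcu/tree.c, not a verbatim diff:

	void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
	{
		/* Order prior accesses before the ->gp_seq snapshot. */
		smp_mb();
		/* Old: rcu_seq_snap(&rcu_get_root()->gp_seq).  New: */
		rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
		rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
	}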
> >
> > This commit therefore makes this change. Note that there is no change to
> > the poll_state_synchronize_rcu_full() function, which as noted above,
> > must continue to use the root rcu_node structure's ->gp_seq field.
> > This is of course an asymmetry between these two functions, but is an
> > asymmetry that is absolutely required for correct operation. It is a
> > common human tendency to greatly value symmetry, and sometimes symmetry
> > is a wonderful thing. Other times, symmetry results in poor performance.
> > But in this case, symmetry is just plain wrong.
> >
> > Nevertheless, the asymmetry does require an additional adjustment.
> > It is possible for get_state_synchronize_rcu_full() to see a given
> > grace period as having started, but for an immediately following
> > poll_state_synchronize_rcu_full() to see it as having not yet started.
> > Given the current rcu_seq_done_exact() implementation, this will
> > result in a false-positive indication that the grace period is done
> > from poll_state_synchronize_rcu_full(). This is dealt with by making
> > rcu_seq_done_exact() reach back three grace periods rather than just
> > two of them.
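
Concretely, the adjustment widens the rcu_seq_done_exact() guard band
from two to three grace-period steps, roughly as follows (paraphrasing
kernel/rcu/rcu.h, not a verbatim diff):

	static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
	{
		unsigned long cur_s = READ_ONCE(*sp);

		/* Old guard band: s - (2 * RCU_SEQ_STATE_MASK + 1). */
		return ULONG_CMP_GE(cur_s, s) ||
		       ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
	}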
> >
> > Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for
> > get_completed*() and poll_state*()"), it is not clear that it is worth
> > backporting this commit. First, it took me many weeks to convince
> > rcutorture to reproduce this more frequently than once per year. Second,
> > this cannot be reproduced at all without frequent CPU-hotplug operations,
> > as in waiting all of 50 milliseconds from the end of the previous
> > operation until starting the next one. Third, the TREE03.boot settings
> > cause multi-millisecond delays during RCU grace-period initialization,
> > which greatly increase the probability of the above sequence of events.
> > (Don't do this in production workloads!) Fourth, extremely heavy use of
> > get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full()
> > is required to reproduce this, and as of v6.12, only kfree_rcu() uses it,
> > and even then not particularly heavily.
>
> I'm wondering, what prevents us from removing rcu_state.gp_seq and relying only on
> the root node for the global state?

One scenario comes to mind immediately. There may be others.

Suppose we were running with default configuration on a system with
"only" eight CPUs. Then there is only the one rcu_node structure,
which is both root and leaf. Without rcu_state.gp_seq, there
would be no way to communicate the beginning of the grace period to
get_state_synchronize_rcu_full() without also allowing quiescent states
to be reported. There would thus be no time in which to check for newly
onlined/offlined CPUs.
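
The ordering that matters here, heavily simplified from rcu_gp_init()
in kernel/rcu/tree.c (not verbatim), is roughly:

	struct rcu_node *rnp;

	rcu_seq_start(&rcu_state.gp_seq);		/* GP officially starts; steps 1-2 look here. */

	rcu_for_each_leaf_node(rnp) {			/* Fold in recent CPU-hotplug transitions.    */
		/* ...apply ->qsmaskinitnext to ->qsmaskinit under rnp->lock... */
	}

	rcu_for_each_node_breadth_first(rnp) {		/* Only now does the rcu_node tree hear of it. */
		raw_spin_lock_irq_rcu_node(rnp);
		rnp->qsmask = rnp->qsmaskinit;		   /* Enables quiescent-state reporting...     */
		WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); /* ...and updates the root's ->gp_seq.      */
		raw_spin_unlock_irq_rcu_node(rnp);
	}

With a single combined root/leaf rcu_node structure, the ->gp_seq update
and the enabling of quiescent-state reporting happen under the same lock
acquisition, which is why an rcu_state-level ->gp_seq is needed to mark
the start of the grace period before the hotplug scan.
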
Thanx, Paul