Message-ID: <f6900de7-bfab-47da-b29d-138c75c172fd@paulmck-laptop>
Date: Thu, 3 Jul 2025 16:53:45 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Tejun Heo <tj@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>,
	JP Kobryn <inwardvessel@...il.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Ying Huang <huang.ying.caritas@...il.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Alexei Starovoitov <ast@...nel.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Michal Koutný <mkoutny@...e.com>,
	bpf@...r.kernel.org, linux-mm@...ck.org, cgroups@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Meta kernel team <kernel-team@...a.com>
Subject: Re: [PATCH 2/2] cgroup: explain the race between updater and flusher

On Thu, Jul 03, 2025 at 03:46:07PM -0700, Shakeel Butt wrote:
> On Thu, Jul 03, 2025 at 03:29:16PM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 03, 2025 at 01:00:12PM -0700, Shakeel Butt wrote:
> > > Currently the rstat updater and the flusher can race: the updater may
> > > skip adding the css to the lockless list, while the flusher does not
> > > see the updates made by that skipped updater. This race is benign, as
> > > a subsequent flush will pick up those stats, and no current rstat user
> > > is sensitive to it. However, a future user might want a stricter
> > > guarantee, so add appropriate comments and data_race() tags to ease
> > > the job of such users.
> > > 
> > > Signed-off-by: Shakeel Butt <shakeel.butt@...ux.dev>
> > > ---
> > >  kernel/cgroup/rstat.c | 32 +++++++++++++++++++++++++++++---
> > >  1 file changed, 29 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > > index c8a48cf83878..b98c03b1af25 100644
> > > --- a/kernel/cgroup/rstat.c
> > > +++ b/kernel/cgroup/rstat.c
> > > @@ -60,6 +60,12 @@ static inline struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
> > >   * Atomically inserts the css in the ss's llist for the given cpu. This is
> > >   * reentrant safe i.e. safe against softirq, hardirq and nmi. The ss's llist
> > >   * will be processed at the flush time to create the update tree.
> > > + *
> > > + * NOTE: if the user needs the guarantee that the updater either adds
> > > + * itself to the lockless list or the concurrent flusher flushes its
> > > + * updated stats, a memory barrier is needed before the call to
> > > + * css_rstat_updated(), i.e., a barrier after updating the per-cpu
> > > + * stats and before calling css_rstat_updated().
> > >   */
> > >  __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
> > >  {
> > > @@ -86,8 +92,13 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
> > >  		return;
> > >  
> > >  	rstatc = css_rstat_cpu(css, cpu);
> > > -	/* If already on list return. */
> > > -	if (llist_on_list(&rstatc->lnode))
> > > +	/*
> > > +	 * If already on the list, return. This check is racy, and an
> > > +	 * smp_mb() is needed to pair with the smp_mb() in
> > > +	 * css_process_update_tree() if the updated stats must be
> > > +	 * visible to a concurrent flusher.
> > > +	 */
> > > +	if (data_race(llist_on_list(&rstatc->lnode)))
> > 
> > OK, I will bite...
> > 
> > Why is this needed given the READ_ONCE() that the earlier patch added to
> > llist_on_list()?
> > 
> > >  		return;
> > >  
> > >  	/*
> > > @@ -145,9 +156,24 @@ static void css_process_update_tree(struct cgroup_subsys *ss, int cpu)
> > >  	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
> > >  	struct llist_node *lnode;
> > >  
> > > -	while ((lnode = llist_del_first_init(lhead))) {
> > > +	while ((lnode = data_race(llist_del_first_init(lhead)))) {
> > 
> > And for this one, why not make init_llist_node(), which is invoked from
> > llist_del_first_init(), do a WRITE_ONCE()?
> > 
> 
> Let me answer this one first. The previous patch actually made
> init_llist_node() do WRITE_ONCE().
> 
> So the actual question is why do we need
> data_race([READ|WRITE]_ONCE()) instead of just [READ|WRITE]_ONCE()?

You should *almost* always use [READ|WRITE]_ONCE() instead of data_race().
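
For example, given a hypothetical need_flush flag and do_flush() helper
(made up for illustration, not from this patch), marking both sides of
an intentional lockless pairing is the usual idiom:

	/* Hypothetical flag, for illustration only. */
	static int need_flush;

	/* Updater. */
	WRITE_ONCE(need_flush, 1);	/* single store, no tearing */

	/* Lockless reader. */
	if (READ_ONCE(need_flush))	/* single load, no tearing */
		do_flush();

KCSAN treats this marked pair as an intended race, while an unmarked
access to need_flush racing with these can still be reported.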

> Actually I had a similar question myself and found the following
> comment in include/linux/compiler.h:
> 
> /**
>  * data_race - mark an expression as containing intentional data races
>  *
>  * This data_race() macro is useful for situations in which data races
>  * should be forgiven.  One example is diagnostic code that accesses
>  * shared variables but is not a part of the core synchronization design.
>  * For example, if accesses to a given variable are protected by a lock,
>  * except for diagnostic code, then the accesses under the lock should
>  * be plain C-language accesses and those in the diagnostic code should
>  * use data_race().  This way, KCSAN will complain if buggy lockless
>  * accesses to that variable are introduced, even if the buggy accesses
>  * are protected by READ_ONCE() or WRITE_ONCE().
>  *
>  * This macro *does not* affect normal code generation, but is a hint
>  * to tooling that data races here are to be ignored.  If the access must
>  * be atomic *and* KCSAN should ignore the access, use both data_race()
>  * and READ_ONCE(), for example, data_race(READ_ONCE(x)).
>  */
> 
> IIUC, I need to protect the llist_node against tearing as well as tell
> KCSAN to ignore the race on this access, so I should use both. Though
> KCSAN treats [READ|WRITE]_ONCE() similarly to data_race(), which makes
> the combination look redundant, I want to convey both intents: we need
> protection against tearing, and KCSAN should ignore the race. Using
> both conveys that. Let me know if you think otherwise.
> 
> thanks a lot for taking a look.

The thing to remember is that data_race() does not affect the
generated code (except of course when running KCSAN), and thus does
absolutely nothing to prevent load/store tearing.  You need things like
[READ|WRITE]_ONCE() to prevent tearing.
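
As a concrete sketch (hypothetical counter named total, made up for
illustration):

	/* Hypothetical counter, for illustration only. */
	static unsigned long total;

	/*
	 * data_race() only asks KCSAN to forgive the race; the compiler
	 * may still tear, refetch, or fuse this plain load.
	 */
	unsigned long a = data_race(total);

	/*
	 * READ_ONCE() does constrain the compiler: a single load, with
	 * no tearing for a naturally aligned machine-word access.
	 */
	unsigned long b = READ_ONCE(total);

	/*
	 * If the access must not tear *and* KCSAN should stay quiet,
	 * combine the two, as the compiler.h comment above suggests.
	 */
	unsigned long c = data_race(READ_ONCE(total));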

So if it does not affect the generated code, what is the point of
data_race()?

One answer to this question is for diagnostics where you want KCSAN
to check the main algorithm, but you don't want KCSAN to be confused
by the diagnostic accesses.  For example, you might use something like
ASSERT_EXCLUSIVE_ACCESS() as in __list_splice_init_rcu(), and not want
your diagnostic accesses to result in false-positive KCSAN reports
due to interactions with ASSERT_EXCLUSIVE_ACCESS() on some particular
memory location.  And if you were to use READ_ONCE() to access that same
memory location in your diagnostics, KCSAN would complain if they ran
concurrently with that ASSERT_EXCLUSIVE_ACCESS().  So you would instead
use data_race() to suppress such complaints.
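
As a minimal sketch of that diagnostic pattern (hypothetical foo_lock
and foo_count, not from any actual tree):

	static DEFINE_SPINLOCK(foo_lock);
	static unsigned long foo_count;

	void foo_inc(void)
	{
		spin_lock(&foo_lock);
		/* All updates hold foo_lock, so assert exclusive access. */
		ASSERT_EXCLUSIVE_ACCESS(foo_count);
		foo_count++;		/* plain access: part of the core design */
		spin_unlock(&foo_lock);
	}

	/* Diagnostic-only reader, e.g. for a debugfs dump. */
	unsigned long foo_snapshot(void)
	{
		/*
		 * This read is not part of the synchronization design, so
		 * hide it from KCSAN; a READ_ONCE() here would instead
		 * trip the ASSERT_EXCLUSIVE_ACCESS() above when the two
		 * run concurrently.
		 */
		return data_race(foo_count);
	}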

Does that make sense?

						Thanx, Paul
