Message-ID: <e2f4f8d372612cd61689b91562e73677599d08de.camel@redhat.com>
Date: Wed, 16 Jul 2025 10:20:39 +0200
From: Gabriele Monaco <gmonaco@...hat.com>
To: Nam Cao <namcao@...utronix.de>
Cc: linux-kernel@...r.kernel.org, Steven Rostedt <rostedt@...dmis.org>,
linux-trace-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>, Peter
Zijlstra <peterz@...radead.org>, Tomas Glozar <tglozar@...hat.com>, Juri
Lelli <jlelli@...hat.com>, Clark Williams <williams@...hat.com>, John
Kacur <jkacur@...hat.com>
Subject: Re: [PATCH v3 11/17] rv: Retry when da monitor detects race
conditions
On Tue, 2025-07-15 at 17:23 +0200, Nam Cao wrote:
> On Tue, Jul 15, 2025 at 09:14:28AM +0200, Gabriele Monaco wrote:
> > static inline bool						\
> > da_event_##name(struct da_monitor *da_mon, enum events_##name event) \
> > {								\
> > -	type curr_state = da_monitor_curr_state_##name(da_mon);	\
> > -	type next_state = model_get_next_state_##name(curr_state, event); \
> > -								\
> > -	if (next_state != INVALID_STATE) {			\
> > -		da_monitor_set_state_##name(da_mon, next_state); \
> > -								\
> > -		trace_event_##name(model_get_state_name_##name(curr_state), \
> > -				   model_get_event_name_##name(event), \
> > -				   model_get_state_name_##name(next_state), \
> > -				   model_is_final_state_##name(next_state)); \
> > -								\
> > -		return true;					\
> > +	enum states_##name curr_state, next_state;		\
> > +								\
> > +	curr_state = READ_ONCE(da_mon->curr_state);		\
> > +	for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++) { \
> > +		next_state = model_get_next_state_##name(curr_state, event); \
> > +		if (next_state == INVALID_STATE)		\
> > +			goto out_react;				\
> > +		if (likely(try_cmpxchg(&da_mon->curr_state, &curr_state, next_state))) \
> > +			goto out_success;			\
> > 	}							\
> > +	/* Special invalid transition if we run out of retries. */ \
> > +	curr_state = INVALID_STATE;				\
> > 								\
> > +out_react:							\
> > 	cond_react_##name(curr_state, event);			\
> > 								\
> > 	trace_error_##name(model_get_state_name_##name(curr_state), \
> > 			   model_get_event_name_##name(event));	\
>
> If I understand correctly, if after 3 tries we still fail to change
> the state, we will invoke the reactor and trace_error? Doesn't that
> cause a false positive? It is not a violation of the model, just a
> race making us fail to change the state.
>
Yes, that's correct.
My rationale was that, at that point, the monitor is likely no longer
in sync, so silently ignoring the situation is not really an option.
In this case, the reaction includes an invalid current state (because
we genuinely don't know what the current state is), and tools may be
able to interpret that. I know you wouldn't be able to do that in
LTL...
By the way, LTL uses multiple statuses, so this lockless approach may
not really work there.
I don't see this situation happening often: I only ever observed 2
events able to race; 4 happening at the same time is wild, but of
course it cannot be excluded in principle for any possible monitor.
Still, I have the feeling that a monitor where this can happen is not
well designed, and RV should point that out.
Do you have ideas of potential monitors where more than 3 events can
race?
Perhaps a full-blown reaction is a bit aggressive in this situation,
as the /fault/ may not necessarily be in the monitor.
We could think of a special tracepoint or just printing a message.
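For illustration, here is a minimal userspace sketch of the retry idea
(entirely hypothetical names and a toy two-state model; C11
atomic_compare_exchange_strong stands in for try_cmpxchg(), which
likewise refreshes the expected value on failure):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_DA_RETRY_RACING_EVENTS 3
#define INVALID_STATE (-1)

static _Atomic int monitor_state = 0;

/* Toy model (made up): event 1 flips 0->1, event 0 flips 1->0. */
static int model_next_state(int state, int event)
{
	if (state == 0 && event == 1)
		return 1;
	if (state == 1 && event == 0)
		return 0;
	return INVALID_STATE;
}

static bool da_event(int event)
{
	int curr = atomic_load(&monitor_state);

	for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++) {
		int next = model_next_state(curr, event);

		if (next == INVALID_STATE)
			goto out_react;
		/* Like try_cmpxchg(): on failure, curr is refreshed with
		 * the value concurrently stored, so we retry from it. */
		if (atomic_compare_exchange_strong(&monitor_state, &curr, next))
			return true;
	}
	/* Retries exhausted: report with an invalid current state. */
	curr = INVALID_STATE;
out_react:
	printf("reacting: state=%d event=%d\n", curr, event);
	return false;
}
```

A tool consuming the reaction can then distinguish a genuine model
violation (valid curr state) from retry exhaustion (INVALID_STATE).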
> Same below.
>
> Also, I wouldn't use goto unless necessary. Perhaps it is better to
> put the code at "out_react:" and "out_success:" into the loop. But
> that's just my personal preference, up to you.
That could be done if we did something entirely different when the
retries run out, instead of defaulting to out_react.
I liked avoiding excessive indentation with those gotos as well, but
yeah, it may not be strictly necessary.
I'll give this some deeper thought.
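Just to make the goto-free shape concrete, a sketch (hypothetical
names, toy two-state model, C11 atomics standing in for try_cmpxchg())
where the success and reaction paths move into the loop and falling
out of it means the retries were exhausted:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_DA_RETRY_RACING_EVENTS 3
#define INVALID_STATE (-1)

static _Atomic int monitor_state = 0;

/* Toy model (made up): event 1 flips 0->1, event 0 flips 1->0. */
static int model_next_state(int state, int event)
{
	if (state == 0 && event == 1)
		return 1;
	if (state == 1 && event == 0)
		return 0;
	return INVALID_STATE;
}

static void react(int state, int event)
{
	printf("reacting: state=%d event=%d\n", state, event);
}

static bool da_event(int event)
{
	int curr = atomic_load(&monitor_state);

	for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++) {
		int next = model_next_state(curr, event);

		if (next == INVALID_STATE) {
			react(curr, event);	/* genuine model violation */
			return false;
		}
		if (atomic_compare_exchange_strong(&monitor_state, &curr, next))
			return true;		/* transition committed */
		/* CAS failure refreshed curr; retry with the fresh state. */
	}
	/* Retries exhausted: a race rather than a violation, so this
	 * spot could get a dedicated tracepoint or printk() instead
	 * of the regular reaction. */
	react(INVALID_STATE, event);
	return false;
}
```

The exhaustion case then becomes a separate, clearly labeled path
rather than a fall-through into out_react.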
Thanks,
Gabriele