lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z6tutxs2UE7vjjJ6@bogus>
Date: Tue, 11 Feb 2025 15:37:27 +0000
From: Sudeep Holla <sudeep.holla@....com>
To: jack21 <jackhuang021@...il.com>
Cc: Cristian Marussi <cristian.marussi@....com>,
	Dan Carpenter <dan.carpenter@...aro.org>,
	<arm-scmi@...r.kernel.org>, <linux-arm-kernel@...ts.infradead.org>,
	<linux-kernel@...r.kernel.org>,
	Huangjie <huangjie1663@...tium.com.cn>
Subject: Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout

On Thu, Jan 23, 2025 at 07:48:53PM +0000, Cristian Marussi wrote:
> On Thu, Jan 23, 2025 at 01:38:30PM +0300, Dan Carpenter wrote:
> > s/dirvers/drivers/
> > 
> > On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> > > From: Huangjie <huangjie1663@...tium.com.cn>
> > > 
> > > spin_until_cond() not really hold a spin lock, possible timeout may occur
> > > in preemption kernel when preempted after spin_until_cond().
> > > 
> > > We check status again when reach timeout is reached to prevent incorrect
> > > jugement of timeout.
> > > 
> 
> Hi,
> 
> probably another not so short email of mine :P ... 
> 
> > 
> > I suspect the real issue is that we exit the spin loop when
> > try_wait_for_completion(&xfer->done) is true.  Probably we should add
> > that as a Fixes tag?:
> > 
> 
> The Kernel SCMI stack, acting as an SCMI agent have to cope with the
> possible (even though rare) scenario of receiving Out of Order messages
> when dealing with async commands...
> 
> ...IOW, what to do if, after having issued an AsycnCmd, the Delayed response
> is received BEFORE the corresponding immediate reply to the initial request:
> such a wicked (and rare) situation could be the result of a misbehaving
> platform server OR simply due to parallellization of activies on the A2P
> and P2A on some transports, i.e. platform sent reply and delayed_response
> in the correct order but the transport delivered those OoO, being
> transmitted on 2 discinct physical  channels...hard but not impossible.
> 
> Kernel side we address this OoO scenario by assuming that, if a valid
> Delayed Response to a in-flight Async-cmd is received on teh P2A chan,
> before the immediate A2P reply, the transaction itself is good and we
> can progress by just logging and ignoring/swallowing the missing
> immediate-reply, that will probably arrive later, and just carry on
> processing the Delayed Response.
> 
> In order to do that, we maintain, in fact, a per-message state-machine,
> and inside scmi_msg_response_validate(), when we detect the OoO condition,
> we cause the wait-for-immediate-reply to terminate by issuing forcibly a
> complete(xfer->done)....
> 
> ...this works straight away for the non-polled IRQ transactions since
> causes the wait_for_completion() to terminate cleanly, BUT in order to
> cut-short also the busy-wait in the polling case we need that additional
> try_wait_for_completion()....so as not to spin forever for a message
> that we dont care anymore...
> [ note that any late arriving immediate-reply will be in teh future
>  discarded at this point]
> 
> For this reason, I think that this patch, while correctly checking the
> poll_done() condition when the spinloop terminates for the reason
> explained in the commit-message, it should also check if the loop was not
> instead forcibly terminated by the OoO scenario...if not, we will end up
> considering the polling to have timeout, while instead it was forcibly
> terminated by the OoO state machine complete() and just want to ignore
> such missing message...
> 
> So what about instead of checking (untested):
> 
> +	if (!completion_done(&xfer->done) &&
> +	    !info->desc->ops->poll_done(cinfo, xfer))

Were you able to check this ? I am waiting to hear back from you on this.

-- 
Regards,
Sudeep

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ