linux-kernel - Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Z5KdJa0QOIYIIRv4@pluto>
Date: Thu, 23 Jan 2025 19:48:53 +0000
From: Cristian Marussi <cristian.marussi@....com>
To: Dan Carpenter <dan.carpenter@...aro.org>
Cc: jack21 <jackhuang021@...il.com>, sudeep.holla@....com,
	cristian.marussi@....com, arm-scmi@...r.kernel.org,
	linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
	Huangjie <huangjie1663@...tium.com.cn>
Subject: Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout

On Thu, Jan 23, 2025 at 01:38:30PM +0300, Dan Carpenter wrote:
> s/dirvers/drivers/
> 
> On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> > From: Huangjie <huangjie1663@...tium.com.cn>
> > 
> > spin_until_cond() not really hold a spin lock, possible timeout may occur
> > in preemption kernel when preempted after spin_until_cond().
> > 
> > We check status again when reach timeout is reached to prevent incorrect
> > jugement of timeout.
> > 

Hi,

probably another not so short email of mine :P ... 

> 
> I suspect the real issue is that we exit the spin loop when
> try_wait_for_completion(&xfer->done) is true.  Probably we should add
> that as a Fixes tag?:
> 

The Kernel SCMI stack, acting as an SCMI agent have to cope with the
possible (even though rare) scenario of receiving Out of Order messages
when dealing with async commands...

...IOW, what to do if, after having issued an AsycnCmd, the Delayed response
is received BEFORE the corresponding immediate reply to the initial request:
such a wicked (and rare) situation could be the result of a misbehaving
platform server OR simply due to parallellization of activies on the A2P
and P2A on some transports, i.e. platform sent reply and delayed_response
in the correct order but the transport delivered those OoO, being
transmitted on 2 discinct physical  channels...hard but not impossible.

Kernel side we address this OoO scenario by assuming that, if a valid
Delayed Response to a in-flight Async-cmd is received on teh P2A chan,
before the immediate A2P reply, the transaction itself is good and we
can progress by just logging and ignoring/swallowing the missing
immediate-reply, that will probably arrive later, and just carry on
processing the Delayed Response.

In order to do that, we maintain, in fact, a per-message state-machine,
and inside scmi_msg_response_validate(), when we detect the OoO condition,
we cause the wait-for-immediate-reply to terminate by issuing forcibly a
complete(xfer->done)....

...this works straight away for the non-polled IRQ transactions since
causes the wait_for_completion() to terminate cleanly, BUT in order to
cut-short also the busy-wait in the polling case we need that additional
try_wait_for_completion()....so as not to spin forever for a message
that we dont care anymore...
[ note that any late arriving immediate-reply will be in teh future
 discarded at this point]

For this reason, I think that this patch, while correctly checking the
poll_done() condition when the spinloop terminates for the reason
explained in the commit-message, it should also check if the loop was not
instead forcibly terminated by the OoO scenario...if not, we will end up
considering the polling to have timeout, while instead it was forcibly
terminated by the OoO state machine complete() and just want to ignore
such missing message...

So what about instead of checking (untested):

+	if (!completion_done(&xfer->done) &&
+	    !info->desc->ops->poll_done(cinfo, xfer))

... so that we determine that the polling has timed out only when:

- the immediate-reply has NOT been received [!poll_done()]

 AND

- we were NOT forcibly make to ignore the missing reply in an OoO scenario
    [!completion_done]

NOTE THAT xfer->done on the polling path is ONLY completed in case of OoO.

> Fixes: ed7c04c1fea3 ("firmware: arm_scmi: Handle concurrent and out-of-order messages")
> 
> Btw, the scmi_xfer_done_no_timeout() has a confusing name, because it
> does timeout.  Was the "_no" an accident?
> 

Agreed. Not sure of the history of the bad naming...

Thans,
Cristian