[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z6tutxs2UE7vjjJ6@bogus>
Date: Tue, 11 Feb 2025 15:37:27 +0000
From: Sudeep Holla <sudeep.holla@....com>
To: jack21 <jackhuang021@...il.com>
Cc: Cristian Marussi <cristian.marussi@....com>,
Dan Carpenter <dan.carpenter@...aro.org>,
<arm-scmi@...r.kernel.org>, <linux-arm-kernel@...ts.infradead.org>,
<linux-kernel@...r.kernel.org>,
Huangjie <huangjie1663@...tium.com.cn>
Subject: Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout
On Thu, Jan 23, 2025 at 07:48:53PM +0000, Cristian Marussi wrote:
> On Thu, Jan 23, 2025 at 01:38:30PM +0300, Dan Carpenter wrote:
> > s/dirvers/drivers/
> >
> > On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> > > From: Huangjie <huangjie1663@...tium.com.cn>
> > >
> > > spin_until_cond() not really hold a spin lock, possible timeout may occur
> > > in preemption kernel when preempted after spin_until_cond().
> > >
> > > We check status again when reach timeout is reached to prevent incorrect
> > > jugement of timeout.
> > >
>
> Hi,
>
> probably another not so short email of mine :P ...
>
> >
> > I suspect the real issue is that we exit the spin loop when
> > try_wait_for_completion(&xfer->done) is true. Probably we should add
> > that as a Fixes tag?:
> >
>
> The Kernel SCMI stack, acting as an SCMI agent have to cope with the
> possible (even though rare) scenario of receiving Out of Order messages
> when dealing with async commands...
>
> ...IOW, what to do if, after having issued an AsycnCmd, the Delayed response
> is received BEFORE the corresponding immediate reply to the initial request:
> such a wicked (and rare) situation could be the result of a misbehaving
> platform server OR simply due to parallellization of activies on the A2P
> and P2A on some transports, i.e. platform sent reply and delayed_response
> in the correct order but the transport delivered those OoO, being
> transmitted on 2 discinct physical channels...hard but not impossible.
>
> Kernel side we address this OoO scenario by assuming that, if a valid
> Delayed Response to a in-flight Async-cmd is received on teh P2A chan,
> before the immediate A2P reply, the transaction itself is good and we
> can progress by just logging and ignoring/swallowing the missing
> immediate-reply, that will probably arrive later, and just carry on
> processing the Delayed Response.
>
> In order to do that, we maintain, in fact, a per-message state-machine,
> and inside scmi_msg_response_validate(), when we detect the OoO condition,
> we cause the wait-for-immediate-reply to terminate by issuing forcibly a
> complete(xfer->done)....
>
> ...this works straight away for the non-polled IRQ transactions since
> causes the wait_for_completion() to terminate cleanly, BUT in order to
> cut-short also the busy-wait in the polling case we need that additional
> try_wait_for_completion()....so as not to spin forever for a message
> that we dont care anymore...
> [ note that any late arriving immediate-reply will be in teh future
> discarded at this point]
>
> For this reason, I think that this patch, while correctly checking the
> poll_done() condition when the spinloop terminates for the reason
> explained in the commit-message, it should also check if the loop was not
> instead forcibly terminated by the OoO scenario...if not, we will end up
> considering the polling to have timeout, while instead it was forcibly
> terminated by the OoO state machine complete() and just want to ignore
> such missing message...
>
> So what about instead of checking (untested):
>
> + if (!completion_done(&xfer->done) &&
> + !info->desc->ops->poll_done(cinfo, xfer))
Were you able to check this ? I am waiting to hear back from you on this.
--
Regards,
Sudeep
Powered by blists - more mailing lists