[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABb+yY1T03YLwiFvBykxsAHQ9Kpu=r1nRTuaP3Emf5dP=Upm0g@mail.gmail.com>
Date: Wed, 10 Jun 2020 10:21:19 -0500
From: Jassi Brar <jassisinghbrar@...il.com>
To: Sudeep Holla <sudeep.holla@....com>
Cc: linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Rob Herring <robh@...nel.org>,
Frank Rowand <frowand.list@...il.com>,
Bjorn Andersson <bjorn.andersson@...aro.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
"arnd@...db.de" <arnd@...db.de>,
Jassi Brar <jaswinder.singh@...aro.org>
Subject: Re: [PATCH] firmware: arm_scmi: fix timeout value for send_message
On Wed, Jun 10, 2020 at 3:23 AM Sudeep Holla <sudeep.holla@....com> wrote:
>
> On Sun, Jun 07, 2020 at 02:30:23PM -0500, jassisinghbrar@...il.com wrote:
> > From: Jassi Brar <jaswinder.singh@...aro.org>
> >
> > Currently scmi_do_xfer() submits a message to mailbox api and waits
> > for an apparently very short time. This works if there are not many
> > messages in the queue already. However, if many clients share a
> > channel and/or each client submits many messages in a row, the
>
> The recommendation in such scenarios is to use multiple channel.
>
If SCMI is to be accepted as a standard (which I hope), it has to
support most kinds of controllers, but currently the implementation is
too myopic. It is only a matter of time, when someone sees value in
reusing firmware implementation (scmi) but does not have a MHU like
controller.
> > timeout value becomes too short and returns error even if the mailbox
> > is working fine according to the load. The timeout occurs when the
> > message is still in the api/queue awaiting its turn to ride the bus.
> >
> > Fix this by increasing the timeout value enough (500ms?) so that it
> > fails only if there is an actual problem in the transmission (like a
> > lockup or crash).
> >
> > [If we want to capture a situation when the remote didn't
> > respond within expected latency, then the timeout should not
> > start here, but from tx_prepare callback ... just before the
> > message physically gets on the channel]
> >
>
> The bottle neck may not be in the remote. It may be mailbox serialising
> the requests even when it can parallelise.
>
Your logs show (in your test case), using 1 physical channel shows
better transfer (those that complete) rates than virtual channels.
The transfers that fail are purely because of this short timeout.
> >
> > if (xfer->hdr.poll_completion) {
> > - ktime_t stop = ktime_add_ns(ktime_get(), SCMI_MAX_POLL_TO_NS);
> > + ktime_t stop = ktime_add_ns(ktime_get(), 500 * 1000 * NSEC_PER_USEC);
> >
>
> This is unacceptable delay for schedutil fast_switch. So no for this one.
>
Increasing timeout does not increase latency.
Also scmi_xfer() can not know if it was reached from the fast_switch path.
If a platform has many users over a channel such that it can not
guarantee low enough latency, then it must not set the
fast_switch_possible flag, which is optional for this reason.
> > @@ -313,7 +313,7 @@ int scmi_do_xfer(const struct scmi_handle *handle, struct scmi_xfer *xfer)
> > ret = -ETIMEDOUT;
> > } else {
> > /* And we wait for the response. */
> > - timeout = msecs_to_jiffies(info->desc->max_rx_timeout_ms);
> > + timeout = msecs_to_jiffies(500);
>
> In general, this hides issues in the remote.
>
If you want to uncover remote issues, start the timeout in
tx_prepare() because that is when the message is physically sent to
the remote.
> We are trying to move towards
> tops 1ms for a request and with MBOX_QUEUE at 20, I see 20ms is more that
> big enough. We have it set to 30ms now. 500ms is way too large and not
> required IMO.
>
Again, increasing timeout does not slow the system down. It is to
support more variety of platform setups.
Cheers!
Powered by blists - more mailing lists