linux-kernel - Re: [PATCH v2] firmware: arm_scmi: Avoid to call mbox_client_txdone on txdone

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c6a24e4c9b97b8a47a822fb4fcbc4b955ac5fbc5.camel@mediatek.com>
Date: Thu, 7 Mar 2024 12:31:02 +0000
From: Flash Liu (劉炳傳) <Flash.Liu@...iatek.com>
To: "cristian.marussi@....com" <cristian.marussi@....com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mediatek@...ts.infradead.org" <linux-mediatek@...ts.infradead.org>,
	wsd_upstream <wsd_upstream@...iatek.com>,
	Cylen Yao (姚立三) <cylen.yao@...iatek.com>,
	"sudeep.holla@....com" <sudeep.holla@....com>,
	"linux-arm-kernel@...ts.infradead.org"
	<linux-arm-kernel@...ts.infradead.org>, "matthias.bgg@...il.com"
	<matthias.bgg@...il.com>, "angelogioacchino.delregno@...labora.com"
	<angelogioacchino.delregno@...labora.com>
Subject: Re: [PATCH v2] firmware: arm_scmi: Avoid to call mbox_client_txdone
 on txdone_irq mode

Hi Cristian,

Thanks for reply. :)

The uploaded patch is just to avoid getting error logs, since
mbox_client_txdone() will check the mode.

About your mentioned scenario of TxACK irq mode, not sure, but do you
mean that mbox driver may not "copy out" the REPLY mesg-X at shmem,
before it calls to mbox_chan_txdone()?
Could it become safe if we copy the REPLY mesg to a buffer and then
issue the tx_tick?


In addition, considering the following scenario:

mesg-X (thread-X: low priority)
mesg-Y (thread-Y: high priority)
mesg-Z (thread-Z: high priority)

1. scmi: thread-X calls mbox_send_message(X) to send mesg-X, and goes
to wait_completion()

2. mbox: mesg-X is sent

3. scmi: ANOTHER mesg-Y tx is attempted via mbox_send_message(Y),
thread-Y goest to wait_completion()
3'. scmi: ANOTHER mesg-Z tx is attempted via mbox_send_message(Z),
thread-Z goest to wait_completion()

4. mbox: mesg-Y is queued
4'. mbox: mesg-Z is queued

5. mbox irq received, ISR notify complete to thread-X, however system
is busy on other high priority threads,
so, thread-X doesn't back soon

6. mesg-Y, mesg-Z get timed-out.

Could the timeout situtaion of mesg-Y and Z be reduced, if tx_tick in
ISR?

... explains something might for TxACK irq mode... maybe you have other
observation or suggestions about this priority scenario.

Thanks again,
Pin-Chuan

On Wed, 2024-03-06 at 16:08 +0000, Cristian Marussi wrote:
> On Wed, Mar 06, 2024 at 09:49:16AM +0000, Cristian Marussi wrote:
> > On Wed, Mar 06, 2024 at 06:13:48AM +0000, Flash Liu (劉炳傳) wrote:
> > > Hi Cristian,
> > > 
> > > Kindly ping :)
> > > 
> > 
> > Hi Pin-Chuan,
> > 
> > sorry for the delay...I have NOT forgot :D, indeed I was just
> > testing
> > yesterday with some mailbox IP of ours equipped with a TxAck IRQ
> > and I
> > would have some question for you because I've seen some anomalies
> > while
> > using this: does your solution work reliably in your setup ALSO
> > when
> > multiple SCMI transactions are attempted ? (like cpufreq issuing
> > cmds
> > while polling a sensor from some other thread)
> > 
> > ...I'll dig deeper today in this matter, but my current suspect is
> > that
> > using the mailbox TXAck IRQ to let the controller tick the internal
> > mailbox
> > queues does not play well with SCMI, since the SCMI TX channel is
> > really the
> > SCMI "a2p" bidirectional channel and it is associated with just
> > only one shmem
> > area used to hold both the egressing CMD and to receive the
> > incoming REPLY
> > from the platform: so if you let the controller tick the queues as
> > soon as you
> > received the TXAck you are telling the mailbox subsystem to queue
> > another message
> > on the same area while you are not really done, since only the
> > client know
> > when it's done with the whole SCMI transaction and the reply has
> > been fetched.
> > 
> > Indeed, for these reasons we have the BUSY/FREE bit in the shmem to
> > protect it
> > from pending new requests until the previous one has completed, but
> > when the
> > waited-for reply comes in, the platform would have cleared the BUSY
> > bit and
> > let the new queued message overwrite the pending reply prematurely,
> > and one
> > message is lost...
> > 
> > ...but as said I want to delve deeper into this, as of now just
> > suppositions
> > and maybe I am just missing something more that has to be
> > configured
> > properly...
> > 
> > Thanks,
> > Cristian
> > 
> 
> Hi again :D,
> 
> so articulating more on my supposition that TxAck-capable mailboxes
> and
> SCMI do not play well together (and would not be worth either
> really...)
> 
> Consider the following scenario.
> 
> 1. scmi: mbox_send_message(X) is called from SCMI stack to send mesg-
> X
>          on the a2p channel (a command)
> 
> 2. mbox: mesg-X is
>   2a. queued by mbox subsystem [add_to_rbuf(X)]
>   2b. submitted for transmission [msg_submit(X)]
>   2c. prepared by SCMI clk->tx_prepare
>   2d. finally sent via mhu driver .send_data
>   2e. mesg-X is now an active_req for mbox and in-flight for SCMI
> 
> 3. scmi: ANOTHER mesg-Y tx is attempted via mbox_send_message(Y)
> 
> 4. mbox: mesg-Y is
>   4a. queued by mbox_subsys [add_to_rbuf()]
>   4b. NOT submitted since there is already an active_req=mesg-X
> pending
> 
> Any further SCMI mesg TX attempt will behave similarly:
> queued/not_submitted
> till at some in the future someone calls the txdone routines, which
> in turn
> calls tx_tick()...this SOMEONE can be the client, like it is now, or
> the
> controller if it is configured to use the TxAck IRQ (as per-your-
> patch)...
> ...so lets see what happen in your TxAck enabled case.
> 
> 5. TxAck IRQ received, transmission of mesg-X has been successfull
>   (NOTE that SCMI at this point is still waiting for a reply to mesg-
> X..)
> 
>   5a. controller calls mbox_chan_txdone()
>   5b. mbox in turn calls tx_tick
>   5c. active_req is cleared and next queued mesg-Y is submitted
>   5d. mesg-Y transmission gets anyway stuck on cl->tx_prepare since
>       we check the SCMI shmem BUSY bit and busy-loop there till it
>       clears: this clearing can happen ONLY after the mesg-X reply
>       has come through, since it is the platform SCMI server that
>       clears it having delivered the reply in the shmem.
> 
> 6. platform SCMI server replies to mesg-X finally:
>   6a. platform writes reply in shmem
>   6b. platform clears BUSY bit
> 
>   -- note SCMI stack is still waiting for a reply at this point...
>      so waiting for an IRQ OR by simply spinning on that same BUSY
> bit
>      if polling mode was requested for the transaction....
> 
>      ...lets assume you are in IRQ mode:
> 
> 7. mesg-Y sender which was spinning on BUSY bit (blocked on
> tx_prepare)
>    is immediately cleared to send and so tx_prepare can proceed
> further
>    and completely overwrite the just received mesg-X, which is now
> LOST
> 
> ..in case you were polling I guess you will have anyway some
> corruption
> due to the race between the polling-mesg-X-receiver retrieving the
> reply
> and the same tx_prepare codeflow as above...
> 
> Indeed, the spec says that you should protect the channel till the
> reply
> has been retrieved from the SCMI (even after the BUSY bit is
> cleared), and
> in other transport we DO have some form of locks, BUT here in
> mailboxes
> there is not since it is NOT needed IF you stick to the non-TxAck
> original
> behaviour, since the tx_tick, as it is now, will be run by the SCMI
> stack
> ONLY after it has waited for mesg-X and retrieved the mesg-X-reply
> payload
> ...not before.
> 
> Instead, if you enable the TxAck mode you are basically letting the
> controller
> itself issue the tx_tick(), which means "previous TX is done, please
> proceed
> with the next", BUT the current TX is really NOT done at all as
> intended
> by the client (SCMI), since the reply is missing and the only entity
> which
> can have the whole picture about when a transaction is completed (or
> timed-out)
> is the SCMI client.
> 
> As said, I think the fundamental clash is between what the mailbox
> subsystem considers a TXDONE event (and related actions) and what
> instead is considered a completed transaction on the SCMI a2p
> channel:
> i.e. CMD_sent + REPLY_retrieved.
> 
> At the end, anyway, would it be worth in any way to leverage such
> TxAck
> capabilities (somehow) of a mailbox in the SCMI world ?
> 
> I mean, even if we make this work, what is supposed to happen better
> and faster
> when using a TxAck instead of a TX_polling mode like it is now ?
> 
> ...because the SCMI stack cannot really do anything with this
> information in
> this case, given that there is just one single a2p_shem area for
> sending command
> and receiving replies...it has just to wait anyway even after the
> TxAck...
> 
> ..maybe it is a bit more power-favourable to sleep_wait for the TxAck
> IRQ
> instead of polling the MHU regs ?... other than this the TxAck means
> nothing
> really in the context of the SCMI world, since you cannot safely
> queue anything
> more till the previous exchange has fully completed...
> 
> ...in other non-SCMI scenarios that I experimented with, it really
> makes a
> difference havig the TxAck since it avoids all the internal
> polling/requeueing
> dance in the mailbox subsystem, but in this case I think is all made
> useless by
> the way SCMI/SMT based transport works...i.e. using a single shmem
> area for
> the a2p channel..
> 
> ...not really a short and sweet email... :P ... any feedback or
> further
> ideas are welcome anyway...especially if you can prove that all of
> the
> above is somehow wrong, or that there is a good reason to leverage
> the
> TxAck capable mailboxes  :D
> 
> Thanks,
> Cristian