[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <487bbd00-d763-0a86-9984-1dfd957fbb38@axis.com>
Date: Wed, 20 Apr 2022 10:28:00 +0200
From: Bjorn Ardo <bjorn.ardo@...s.com>
To: Jassi Brar <jassisinghbrar@...il.com>
CC: kernel <kernel@...s.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a
lock
Hi,
On 4/19/22 16:10, Jassi Brar wrote:
> I don't have access to your client driver, but if it submits another
> message from rx_callback() that is the problem.
>
> Please have a look at how other platforms do, for example
> imx_dsp_rproc_rx_tx_callback()
>
> /**
> * mbox_chan_received_data - A way for controller driver to push data
> * received from remote to the upper layer.
> * @chan: Pointer to the mailbox channel on which RX happened.
> * @mssg: Client specific message typecasted as void *
> *
> * After startup and before shutdown any data received on the chan
> * is passed on to the API via atomic mbox_chan_received_data().
> * The controller should ACK the RX only after this call returns.
> */
> void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)
>
> If not this, please share your client code as well.
>
> thanks.
In rx_callback() we call tasklet_hi_schedule() to schedule a tasklet and
this tasklet calls mbox_send_message(). The mailbox is setup with
.tx_block = false.
I do not quite understand how calling mbox_send_message() from another
contex (like a work thread as in imx_dsp_rproc_rx_tx_callback()) will
resolve the race, could you explain this? Or do you mean that it should
not be called from any interrupt context at all? Looking at the
documentation of mbox_send_message() it states:
* This function could be called from atomic context as it simply
* queues the data and returns a token against the request.
If I look at the code in msg_submit() the part that calls
hrtimer_active() and hrtimer_start() is completely without a lock. Also
the code in txdone_hrtimer() that calls hrtimer_forward_now() is without
a lock. So I cannot see anything preventing these two functions to be
called concurrently on two different processors (as they do in my
trace). And looking at the hr_timer code then hrtimer_active() will
return true if the hrtimer is currently executing txdone_hrtimer().
The way I see the race is if the timer has called txdone_hrtimer() and
it has passed the part with the for-loop looking at the individual
channels when the msg_submit() runs. Then txdone_hrtimer() has decided
to not restart the timer, but hrtimer_active() will still return true
for a while (until txdone_hrtimer() return completely and the hrtimer
code can change its status). In this time there is nothing that prevents
msg_submit() from running, adding a new message that the timer should
monitor. But if it manages to call hrtimer_active() in the time period
before the hrtimer code updates it then the current code will never
start the timer.
Also the poll_hrt is shared between all channels in the mailbox, so it
does not have to be the same mailbox channel that hrtimer is currently
waiting for that is calling msg_submit(). So not calling
mbox_send_message() from rx_callback() will only alter timing for that
channel. There could still be a race between two different mailbox channels.
This is my understanding of the current code, please tell me if I have
missed or misunderstood any part of it?
Our current solution are using 4 different mailbox channels
asynchronously. The code is part of a larger system, but I can put down
some time and try and extract the relevant parts if you still think this
is a client issue? But with my current understanding of the code, the
race between msg_submit() and txdone_hrtimer() is quite clear, and with
my proposed patch that removes this race we have be able to run for very
long time without any problems (that is several days). Without the fix
we get the race after 5-10 min.
Best Regards,
Björn
Powered by blists - more mailing lists