linux-kernel - Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <487bbd00-d763-0a86-9984-1dfd957fbb38@axis.com>
Date:   Wed, 20 Apr 2022 10:28:00 +0200
From:   Bjorn Ardo <bjorn.ardo@...s.com>
To:     Jassi Brar <jassisinghbrar@...il.com>
CC:     kernel <kernel@...s.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a
 lock

Hi,

On 4/19/22 16:10, Jassi Brar wrote:
> I don't have access to your client driver, but if it submits another
> message from rx_callback() that is the problem.
>
> Please have a look at how other platforms do, for example
> imx_dsp_rproc_rx_tx_callback()
>
> /**
>   * mbox_chan_received_data - A way for controller driver to push data
>   *              received from remote to the upper layer.
>   * @chan: Pointer to the mailbox channel on which RX happened.
>   * @mssg: Client specific message typecasted as void *
>   *
>   * After startup and before shutdown any data received on the chan
>   * is passed on to the API via atomic mbox_chan_received_data().
>   * The controller should ACK the RX only after this call returns.
>   */
> void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)
>
> If not this, please share your client code as well.
>
> thanks.

In rx_callback() we call tasklet_hi_schedule() to schedule a tasklet and 
this tasklet calls mbox_send_message(). The mailbox is setup with 
.tx_block = false.

I do not quite understand how calling mbox_send_message() from another 
contex (like a work thread as in imx_dsp_rproc_rx_tx_callback()) will 
resolve the race, could you explain this? Or do you mean that it should 
not be called from any interrupt context at all? Looking at the 
documentation of mbox_send_message() it states:

  * This function could be called from atomic context as it simply
  * queues the data and returns a token against the request.

If I look at the code in msg_submit() the part that calls 
hrtimer_active() and hrtimer_start() is completely without a lock. Also 
the code in txdone_hrtimer() that calls hrtimer_forward_now() is without 
a lock. So I cannot see anything preventing these two functions to be 
called concurrently on two different processors (as they do in my 
trace). And looking at the hr_timer code then hrtimer_active() will 
return true if the hrtimer is currently executing txdone_hrtimer().

The way I see the race is if the timer has called txdone_hrtimer() and 
it has passed the part with the for-loop looking at the individual 
channels when the msg_submit() runs. Then txdone_hrtimer() has decided 
to not restart the timer, but hrtimer_active() will still return true 
for a while (until txdone_hrtimer() return completely and the hrtimer 
code can change its status). In this time there is nothing that prevents 
msg_submit() from running, adding a new message that the timer should 
monitor. But if it manages to call hrtimer_active() in the time period 
before the hrtimer code updates it then the current code will never 
start the timer.

Also the poll_hrt is shared between all channels in the mailbox, so it 
does not have to be the same mailbox channel that hrtimer is currently 
waiting for that is calling msg_submit(). So not calling 
mbox_send_message() from rx_callback() will only alter timing for that 
channel. There could still be a race between two different mailbox channels.

This is my understanding of the current code, please tell me if I have 
missed or misunderstood any part of it?

Our current solution are using 4 different mailbox channels 
asynchronously. The code is part of a larger system, but I can put down 
some time and try and extract the relevant parts if you still think this 
is a client issue? But with my current understanding of the code, the 
race between msg_submit() and txdone_hrtimer() is quite clear, and with 
my proposed patch that removes this race we have be able to run for very 
long time without any problems (that is several days). Without the fix 
we get the race after 5-10 min.

Best Regards,

Björn