lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 19 Apr 2022 14:15:46 +0200
From:   Bjorn Ardo <bjorn.ardo@...s.com>
To:     Jassi Brar <jassisinghbrar@...il.com>
CC:     kernel <kernel@...s.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a
 lock

Hi,


I can confirm that this is an actual issue found on our system, not just 
a theoretical case.


If I add the following trace-code to the original code:


diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 3e7d4b20ab34..8e9e82e5f4b1 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -57,6 +57,7 @@ static void msg_submit(struct mbox_chan *chan)
         void *data;
         int err = -EBUSY;

+       trace_printk("Entering msg_submit\n");
         spin_lock_irqsave(&chan->lock, flags);

         if (!chan->msg_count || chan->active_req)
@@ -85,9 +86,14 @@ static void msg_submit(struct mbox_chan *chan)
         /* kick start the timer immediately to avoid delays */
         if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
                 /* but only if not already active */
-               if (!hrtimer_active(&chan->mbox->poll_hrt))
+               if (!hrtimer_active(&chan->mbox->poll_hrt)) {
+                       trace_printk("Starting the hr timer from 
submit\n");
hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
+               } else {
+                       trace_printk("Not starting the hr timer from 
submit since it is active\n");
+               }
         }
+       trace_printk("Leaving msg_submit\n");
  }

  static void tx_tick(struct mbox_chan *chan, int r)
@@ -121,6 +127,7 @@ static enum hrtimer_restart txdone_hrtimer(struct 
hrtimer *hrtimer)
         bool txdone, resched = false;
         int i;

+       trace_printk("Entering txdone_hrtimer\n");
         for (i = 0; i < mbox->num_chans; i++) {
                 struct mbox_chan *chan = &mbox->chans[i];

@@ -134,8 +141,10 @@ static enum hrtimer_restart txdone_hrtimer(struct 
hrtimer *hrtimer)

         if (resched) {
                 hrtimer_forward_now(hrtimer, 
ms_to_ktime(mbox->txpoll_period));
+               trace_printk("Leaving txdone_hrtimer with restart\n");
                 return HRTIMER_RESTART;
         }
+       trace_printk("Leaving txdone_hrtimer without restart\n");
         return HRTIMER_NORESTART;
  }

Then I get the following trace output (I have cropped a small portion 
around where the error appears):


        vhost-475-480     [000] d..1.   217.440325: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d..1.   217.440332: msg_submit: Starting 
the hr timer from submit
        vhost-475-480     [000] d..1.   217.440336: msg_submit: Leaving 
msg_submit
        vhost-475-480     [000] d.h1.   217.440342: txdone_hrtimer: 
Entering txdone_hrtimer
        vhost-475-480     [000] d.h1.   217.440349: txdone_hrtimer: 
Leaving txdone_hrtimer without restart
        vhost-475-480     [000] d..1.   217.440597: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d..1.   217.440599: msg_submit: Starting 
the hr timer from submit
        vhost-475-480     [000] d..1.   217.440602: msg_submit: Leaving 
msg_submit
           <idle>-0       [001] ..s1.   217.440604: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d.h1.   217.440606: txdone_hrtimer: 
Entering txdone_hrtimer
        vhost-475-480     [000] d.h1.   217.440608: txdone_hrtimer: 
Leaving txdone_hrtimer without restart
           <idle>-0       [001] ..s1.   217.440609: msg_submit: Not 
starting the hr timer from submit since it is active
           <idle>-0       [001] ..s1.   217.440610: msg_submit: Leaving 
msg_submit


If I break down the log above we first have one case that works as 
intended. That is a message being written and a timer started and the 
message have been read when the timer hits:

        vhost-475-480     [000] d..1.   217.440325: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d..1.   217.440332: msg_submit: Starting 
the hr timer from submit
        vhost-475-480     [000] d..1.   217.440336: msg_submit: Leaving 
msg_submit
        vhost-475-480     [000] d.h1.   217.440342: txdone_hrtimer: 
Entering txdone_hrtimer
        vhost-475-480     [000] d.h1.   217.440349: txdone_hrtimer: 
Leaving txdone_hrtimer without restart


After this we write a new message and a new timer is started:

        vhost-475-480     [000] d..1.   217.440597: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d..1.   217.440599: msg_submit: Starting 
the hr timer from submit
        vhost-475-480     [000] d..1.   217.440602: msg_submit: Leaving 
msg_submit


However here comes the race. Now a new message is being written at the 
same time as the hr-timer is handling the first reply (on different CPU's):

           <idle>-0       [001] ..s1.   217.440604: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d.h1.   217.440606: txdone_hrtimer: 
Entering txdone_hrtimer

And the hr-timer decides not to restart itself:

        vhost-475-480     [000] d.h1.   217.440608: txdone_hrtimer: 
Leaving txdone_hrtimer without restart

And also the new message written decides to not start the timer (even 
thou it requires one), since it has the status of being active during 
the time that the hr-timer callback is running:

           <idle>-0       [001] ..s1.   217.440609: msg_submit: Not 
starting the hr timer from submit since it is active
           <idle>-0       [001] ..s1.   217.440610: msg_submit: Leaving 
msg_submit


So now we end up in a state where we have a message waiting for a reply 
but no timer started that can provide this reply.


Our own mailbox code is attached (it is not upstreamed yet, but it is on 
its way).


Best Regards

Björn

On 4/14/22 16:30, Jassi Brar wrote:
> On Thu, Mar 31, 2022 at 2:01 AM Björn Ardö <bjorn.ardo@...s.com> wrote:
>> This reverts commit c7dacf5b0f32957b24ef29df1207dc2cd8307743,
>> "mailbox: avoid timer start from callback"
>>
>> The previous commit was reverted since it lead to a race that
>> caused the hrtimer to not be started at all. The check for
>> hrtimer_active() in msg_submit() will return true if the
>> callback function txdone_hrtimer() is currently running. This
>> function could return HRTIMER_NORESTART and then the timer
>> will not be restarted, and also msg_submit() will not start
>> the timer. This will lead to a message actually being submitted
>> but no timer will start to check for its compleation.
>>
>> The original fix that added checking hrtimer_active() was added to
>> avoid a warning with hrtimer_forward. Looking in the kernel
>> another solution to avoid this warning is to check hrtimer_is_queued()
>> before calling hrtimer_forward_now() instead. This however requires a
>> lock so the timer is not started by msg_submit() inbetween this check
>> and the hrtimer_forward() call.
>>
> This is a very dense api used by many use-cases, I am not confident
> making any changes without confirming its a real issue with the common
> code. Please share your client code and traces, that will help me get
> a clearer picture.
>
> Thanks.
View attachment "mailbox-artpec8.c" of type "text/x-csrc" (7507 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ