[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c4aa28b2-5550-ed39-3869-556b451ff54b@axis.com>
Date: Tue, 19 Apr 2022 14:15:46 +0200
From: Bjorn Ardo <bjorn.ardo@...s.com>
To: Jassi Brar <jassisinghbrar@...il.com>
CC: kernel <kernel@...s.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a
lock
Hi,
I can confirm that this is an actual issue found on our system, not just
a theoretical case.
If I add the following trace-code to the original code:
diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 3e7d4b20ab34..8e9e82e5f4b1 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -57,6 +57,7 @@ static void msg_submit(struct mbox_chan *chan)
void *data;
int err = -EBUSY;
+ trace_printk("Entering msg_submit\n");
spin_lock_irqsave(&chan->lock, flags);
if (!chan->msg_count || chan->active_req)
@@ -85,9 +86,14 @@ static void msg_submit(struct mbox_chan *chan)
/* kick start the timer immediately to avoid delays */
if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
/* but only if not already active */
- if (!hrtimer_active(&chan->mbox->poll_hrt))
+ if (!hrtimer_active(&chan->mbox->poll_hrt)) {
+ trace_printk("Starting the hr timer from
submit\n");
hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
+ } else {
+ trace_printk("Not starting the hr timer from
submit since it is active\n");
+ }
}
+ trace_printk("Leaving msg_submit\n");
}
static void tx_tick(struct mbox_chan *chan, int r)
@@ -121,6 +127,7 @@ static enum hrtimer_restart txdone_hrtimer(struct
hrtimer *hrtimer)
bool txdone, resched = false;
int i;
+ trace_printk("Entering txdone_hrtimer\n");
for (i = 0; i < mbox->num_chans; i++) {
struct mbox_chan *chan = &mbox->chans[i];
@@ -134,8 +141,10 @@ static enum hrtimer_restart txdone_hrtimer(struct
hrtimer *hrtimer)
if (resched) {
hrtimer_forward_now(hrtimer,
ms_to_ktime(mbox->txpoll_period));
+ trace_printk("Leaving txdone_hrtimer with restart\n");
return HRTIMER_RESTART;
}
+ trace_printk("Leaving txdone_hrtimer without restart\n");
return HRTIMER_NORESTART;
}
Then I get the following trace output (I have cropped a small portion
around where the error appears):
vhost-475-480 [000] d..1. 217.440325: msg_submit: Entering
msg_submit
vhost-475-480 [000] d..1. 217.440332: msg_submit: Starting
the hr timer from submit
vhost-475-480 [000] d..1. 217.440336: msg_submit: Leaving
msg_submit
vhost-475-480 [000] d.h1. 217.440342: txdone_hrtimer:
Entering txdone_hrtimer
vhost-475-480 [000] d.h1. 217.440349: txdone_hrtimer:
Leaving txdone_hrtimer without restart
vhost-475-480 [000] d..1. 217.440597: msg_submit: Entering
msg_submit
vhost-475-480 [000] d..1. 217.440599: msg_submit: Starting
the hr timer from submit
vhost-475-480 [000] d..1. 217.440602: msg_submit: Leaving
msg_submit
<idle>-0 [001] ..s1. 217.440604: msg_submit: Entering
msg_submit
vhost-475-480 [000] d.h1. 217.440606: txdone_hrtimer:
Entering txdone_hrtimer
vhost-475-480 [000] d.h1. 217.440608: txdone_hrtimer:
Leaving txdone_hrtimer without restart
<idle>-0 [001] ..s1. 217.440609: msg_submit: Not
starting the hr timer from submit since it is active
<idle>-0 [001] ..s1. 217.440610: msg_submit: Leaving
msg_submit
If I break down the log above we first have one case that works as
intended. That is a message being written and a timer started and the
message have been read when the timer hits:
vhost-475-480 [000] d..1. 217.440325: msg_submit: Entering
msg_submit
vhost-475-480 [000] d..1. 217.440332: msg_submit: Starting
the hr timer from submit
vhost-475-480 [000] d..1. 217.440336: msg_submit: Leaving
msg_submit
vhost-475-480 [000] d.h1. 217.440342: txdone_hrtimer:
Entering txdone_hrtimer
vhost-475-480 [000] d.h1. 217.440349: txdone_hrtimer:
Leaving txdone_hrtimer without restart
After this we write a new message and a new timer is started:
vhost-475-480 [000] d..1. 217.440597: msg_submit: Entering
msg_submit
vhost-475-480 [000] d..1. 217.440599: msg_submit: Starting
the hr timer from submit
vhost-475-480 [000] d..1. 217.440602: msg_submit: Leaving
msg_submit
However here comes the race. Now a new message is being written at the
same time as the hr-timer is handling the first reply (on different CPU's):
<idle>-0 [001] ..s1. 217.440604: msg_submit: Entering
msg_submit
vhost-475-480 [000] d.h1. 217.440606: txdone_hrtimer:
Entering txdone_hrtimer
And the hr-timer decides not to restart itself:
vhost-475-480 [000] d.h1. 217.440608: txdone_hrtimer:
Leaving txdone_hrtimer without restart
And also the new message written decides to not start the timer (even
thou it requires one), since it has the status of being active during
the time that the hr-timer callback is running:
<idle>-0 [001] ..s1. 217.440609: msg_submit: Not
starting the hr timer from submit since it is active
<idle>-0 [001] ..s1. 217.440610: msg_submit: Leaving
msg_submit
So now we end up in a state where we have a message waiting for a reply
but no timer started that can provide this reply.
Our own mailbox code is attached (it is not upstreamed yet, but it is on
its way).
Best Regards
Björn
On 4/14/22 16:30, Jassi Brar wrote:
> On Thu, Mar 31, 2022 at 2:01 AM Björn Ardö <bjorn.ardo@...s.com> wrote:
>> This reverts commit c7dacf5b0f32957b24ef29df1207dc2cd8307743,
>> "mailbox: avoid timer start from callback"
>>
>> The previous commit was reverted since it lead to a race that
>> caused the hrtimer to not be started at all. The check for
>> hrtimer_active() in msg_submit() will return true if the
>> callback function txdone_hrtimer() is currently running. This
>> function could return HRTIMER_NORESTART and then the timer
>> will not be restarted, and also msg_submit() will not start
>> the timer. This will lead to a message actually being submitted
>> but no timer will start to check for its compleation.
>>
>> The original fix that added checking hrtimer_active() was added to
>> avoid a warning with hrtimer_forward. Looking in the kernel
>> another solution to avoid this warning is to check hrtimer_is_queued()
>> before calling hrtimer_forward_now() instead. This however requires a
>> lock so the timer is not started by msg_submit() inbetween this check
>> and the hrtimer_forward() call.
>>
> This is a very dense api used by many use-cases, I am not confident
> making any changes without confirming its a real issue with the common
> code. Please share your client code and traces, that will help me get
> a clearer picture.
>
> Thanks.
View attachment "mailbox-artpec8.c" of type "text/x-csrc" (7507 bytes)
Powered by blists - more mailing lists