[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a1b12c17-5d65-ce29-3d4f-e09de4fdcf3f@linaro.org>
Date: Mon, 1 Feb 2021 08:35:03 -0600
From: Alex Elder <elder@...aro.org>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: David Miller <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, elder@...nel.org,
evgreen@...omium.org, bjorn.andersson@...aro.org,
cpratapa@...eaurora.org,
Subash Abhinov Kasiviswanathan <subashab@...eaurora.org>,
Network Development <netdev@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
On 1/31/21 7:36 PM, Willem de Bruijn wrote:
> On Sun, Jan 31, 2021 at 10:32 AM Alex Elder <elder@...aro.org> wrote:
>>
>> On 1/31/21 8:52 AM, Willem de Bruijn wrote:
>>> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@...aro.org> wrote:
>>>>
>>>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
>>>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@...aro.org> wrote:
>>>>>>
>>>>>> The channel stop and suspend paths both call __gsi_channel_stop(),
>>>>>> which quiesces channel activity, disables NAPI, and (on other than
>>>>>> SDM845) stops the channel. Similarly, the start and resume paths
>>>>>> share __gsi_channel_start(), which starts the channel and re-enables
>>>>>> NAPI again.
>>>>>>
>>>>>> Disabling NAPI should be done when stopping a channel, but this
>>>>>> should *not* be done when suspending. It's not necessary in the
>>>>>> suspend path anyway, because the stopped channel (or suspended
>>>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
>>>>>> and gsi_channel_trans_quiesce() won't return until there are no
>>>>>> more transactions to process in the NAPI polling loop.
>>>>>
>>>>> But why is it incorrect to do so?
>>>>
>>>> Maybe it's not; I also thought it was fine before, but...
. . .
>> The "hang" occurs on an RX endpoint, and in particular it
>> occurs on an endpoint that we *know* will be receiving a
>> packet as part of the suspend process (when clearing the
>> hardware pipeline). I can go into that further but won't'
>> unless asked.
>>
>>>> A stopped channel won't interrupt,
>>>> so we don't bother disabling the completion interrupt,
>>>> with no interrupts, NAPI won't be scheduled, so there's
>>>> no need to disable NAPI either.
>>>
>>> That sounds plausible. But it doesn't explain why napi_disable "should
>>> *not* be done when suspending" as the commit states.
>>>
>>> Arguably, leaving that won't have much effect either way, and is in
>>> line with other drivers.
>>
>> Understood and agreed. In fact, if the hang occurrs in
>> napi_disable() when waiting for NAPI_STATE_SCHED to clear,
>> it would occur in napi_synchronize() as well.
>
> Agreed.
>
> So you have an environment to test a patch in, it might be worthwhile
> to test essentially the same logic reordering as in this patch set,
> but while still disabling napi.
What is the purpose of this test? Just to guarantee
that the NAPI hang goes away? Because you agree that
the napi_schedule() call would *also* hang if that
problem exists, right?
Anyway, what you're suggesting is to simply test with
this last patch removed. I can do that but I really
don't expect it to change anything. I will start that
test later today when I'm turning my attention to
something else for a while.
> The disappearing race may be due to another change rather than
> napi_disable vs napi_synchronize. A smaller, more targeted patch could
> also be a net (instead of net-next) candidate.
I am certain it is.
I can tell you that we have seen a hang (after I think 2500+
suspend/resume cycles) with the IPA code that is currently
upstream.
But with this latest series of 9, there is no hang after
10,000+ cycles. That gives me a bisect window, but I really
don't want to go through a full bisect of even those 9,
because it's 4 tests, each of which takes days to complete.
Looking at the 9 patches, I think this one is the most
likely culprit:
net: ipa: disable IEOB interrupt after channel stop
I think the race involves the I/O completion handler
interacting with NAPI in an unwanted way, but I have
not come up with the exact sequence that would lead
to getting stuck in napi_disable().
Here are some possible events that could occur on an
RX channel in *some* order, prior to that patch. And
in the order I show there's at least a problem of a
receive not being processed immediately.
. . . (suspend initiated)
replenish_stop()
quiesce()
IRQ fires (receive ready)
napi_disable()
napi_schedule() (ignored)
irq_disable()
IRQ condition; pending
channel_stop()
. . . (resume triggered)
channel_start()
irq_enable()
pending IRQ fires
napi_schedule() (ignored)
napi_enable()
. . . (suspend initiated)
>> At this point
>> it's more about the whole set of rework here, and keeping
>> NAPI enabled during suspend seems a little cleaner.
>
> I'm not sure. I haven't looked if there is a common behavior across
> devices. That might be informative. igb, for one, releases all
> resources.
I tried to do a survey of that too and did not see a
consistent pattern. I didn't look *that* hard because
doing so would be more involved than I wanted to get.
So in summary:
- I'm putting together version 2 of this series now
- Testing this past week seems to show that this series
makes the hang in napi_disable() (or synchronize)
go away.
- I think the most likely patch in this series that
fixes the problem is the IRQ ordering one I mention
above, but right now I can't cite a specific sequence
of events that would prove it.
- I will begin some long testing later today without
this last patch applied
--> But I think testing without the IRQ ordering
patch would be more promising, and I'd like
to hear your opinion on that
Thanks again for your input and help on this.
-Alex
. . .
Powered by blists - more mailing lists