lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Sun, 31 Jan 2021 07:11:54 -0600
From:   Alex Elder <elder@...aro.org>
To:     Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc:     David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>, elder@...nel.org,
        evgreen@...omium.org, bjorn.andersson@...aro.org,
        cpratapa@...eaurora.org,
        Subash Abhinov Kasiviswanathan <subashab@...eaurora.org>,
        Network Development <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend

On 1/30/21 10:29 PM, Alex Elder wrote:
> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@...aro.org> wrote:
>>>
>>> The channel stop and suspend paths both call __gsi_channel_stop(),
>>> which quiesces channel activity, disables NAPI, and (on other than
>>> SDM845) stops the channel.  Similarly, the start and resume paths
>>> share __gsi_channel_start(), which starts the channel and re-enables
>>> NAPI again.
>>>
>>> Disabling NAPI should be done when stopping a channel, but this
>>> should *not* be done when suspending.  It's not necessary in the
>>> suspend path anyway, because the stopped channel (or suspended
>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
>>> and gsi_channel_trans_quiesce() won't return until there are no
>>> more transactions to process in the NAPI polling loop.
>>
>> But why is it incorrect to do so?
> 
> Maybe it's not; I also thought it was fine before, but...
> 
> Someone at Qualcomm asked me why I thought NAPI needed
> to be disabled on suspend.  My response was basically
> that it was a lightweight operation, and it shouldn't
> really be a problem to do so.
> 
> Then, when I posted two patches last month, Jakub's
> response told me he didn't understand why I was doing
> what I was doing, and I stepped back to reconsider
> the details of what was happening at suspend time.
> 
> https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

I should have mentioned that *this* response from Jakub
to a question I had also led to my conclusion that NAPI
should not be disabled on suspend--at least for IPA.
  https://lore.kernel.org/netdev/20210105122328.3e5569a4@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

The channel is *not* reset on suspend for IPA.

					-Alex

> Four things were happening to suspend a channel:
> quiesce activity; disable interrupt; disable NAPI;
> and stop the channel.  It occurred to me that a
> stopped channel would not generate interrupts, so if
> the channel was stopped earlier there would be no need
> to disable the interrupt.  Similarly there would be
> (essentially) no need to disable NAPI once a channel
> was stopped.
> 
> Underlying all of this is that I started chasing a
> hang that was occurring on suspend over a month ago.
> It was hard to reproduce (hundreds or thousands of
> suspend/resume cycles without hitting it), and one
> of the few times I actually hit the problem it was
> stuck in napi_disable(), apparently waiting for
> NAPI_STATE_SCHED to get cleared by napi_complete().
> 
> My best guess about how this could occur was if there
> were a race of some kind between the interrupt handler
> (scheduling NAPI) and the poll function (completing
> it).  I found a number of problems while looking
> at this, and in the past few weeks I've posted some
> fixes to improve things.  Still, even with some of
> these fixes in place we have seen a hang (but now
> even more rarely).
> 
> So this grand rework of suspending/stopping channels
> is an attempt to resolve this hang on suspend.
> 
> The channel is now stopped early, and once stopped,
> everything that completed prior to the channel being
> stopped is polled before considering the suspend
> function done.  A stopped channel won't interrupt,
> so we don't bother disabling the completion interrupt,
> with no interrupts, NAPI won't be scheduled, so there's
> no need to disable NAPI either.
> 
> The net result is simpler, and seems logical, and
> should preclude any possible race between the interrupt
> handler and poll function.  I'm trying to solve the
> hang problem analytically, because it takes *so* long
> to reproduce.
> 
> I'm open to other suggestions.
> 
>                     -Alex
> 
>>  From a quick look, virtio-net disables on both remove and freeze, for instance.
>>
>>> Instead, enable NAPI in gsi_channel_start(), when the completion
>>> interrupt is first enabled.  Disable it again in gsi_channel_stop(),
>>> when finally disabling the interrupt.
>>>
>>> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
>>> NAPI polling is done before moving on.
>>>
>>> Signed-off-by: Alex Elder <elder@...aro.org>
>>> ---
>> =
>>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
>>>          struct gsi_channel *channel = &gsi->channel[channel_id];
>>>          int ret;
>>>
>>> -       /* Enable the completion interrupt */
>>> +       /* Enable NAPI and the completion interrupt */
>>> +       napi_enable(&channel->napi);
>>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
>>>
>>>          ret = __gsi_channel_start(channel, true);
>>> -       if (ret)
>>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
>>> +       if (!ret)
>>> +               return 0;
>>> +
>>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
>>> +       napi_disable(&channel->napi);
>>>
>>>          return ret;
>>>   }
>>
>> subjective, but easier to parse when the normal control flow is linear
>> and the error path takes a branch (or goto, if reused).
>>
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ