lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Sat, 30 Jan 2021 22:29:19 -0600
From:   Alex Elder <elder@...aro.org>
To:     Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc:     David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>, elder@...nel.org,
        evgreen@...omium.org, bjorn.andersson@...aro.org,
        cpratapa@...eaurora.org,
        Subash Abhinov Kasiviswanathan <subashab@...eaurora.org>,
        Network Development <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend

On 1/30/21 9:25 AM, Willem de Bruijn wrote:
> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@...aro.org> wrote:
>>
>> The channel stop and suspend paths both call __gsi_channel_stop(),
>> which quiesces channel activity, disables NAPI, and (on other than
>> SDM845) stops the channel.  Similarly, the start and resume paths
>> share __gsi_channel_start(), which starts the channel and re-enables
>> NAPI again.
>>
>> Disabling NAPI should be done when stopping a channel, but this
>> should *not* be done when suspending.  It's not necessary in the
>> suspend path anyway, because the stopped channel (or suspended
>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
>> and gsi_channel_trans_quiesce() won't return until there are no
>> more transactions to process in the NAPI polling loop.
> 
> But why is it incorrect to do so?

Maybe it's not; I also thought it was fine before, but...

Someone at Qualcomm asked me why I thought NAPI needed
to be disabled on suspend.  My response was basically
that it was a lightweight operation, and it shouldn't
really be a problem to do so.

Then, when I posted two patches last month, Jakub's
response told me he didn't understand why I was doing
what I was doing, and I stepped back to reconsider
the details of what was happening at suspend time.
 
https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

Four things were happening to suspend a channel:
quiesce activity; disable interrupt; disable NAPI;
and stop the channel.  It occurred to me that a
stopped channel would not generate interrupts, so if
the channel was stopped earlier there would be no need
to disable the interrupt.  Similarly there would be
(essentially) no need to disable NAPI once a channel
was stopped.

Underlying all of this is that I started chasing a
hang that was occurring on suspend over a month ago.
It was hard to reproduce (hundreds or thousands of
suspend/resume cycles without hitting it), and one
of the few times I actually hit the problem it was
stuck in napi_disable(), apparently waiting for
NAPI_STATE_SCHED to get cleared by napi_complete().

My best guess about how this could occur was if there
were a race of some kind between the interrupt handler
(scheduling NAPI) and the poll function (completing
it).  I found a number of problems while looking
at this, and in the past few weeks I've posted some
fixes to improve things.  Still, even with some of
these fixes in place we have seen a hang (but now
even more rarely).

So this grand rework of suspending/stopping channels
is an attempt to resolve this hang on suspend.

The channel is now stopped early, and once stopped,
everything that completed prior to the channel being
stopped is polled before considering the suspend
function done.  A stopped channel won't interrupt,
so we don't bother disabling the completion interrupt,
with no interrupts, NAPI won't be scheduled, so there's
no need to disable NAPI either.

The net result is simpler, and seems logical, and
should preclude any possible race between the interrupt
handler and poll function.  I'm trying to solve the
hang problem analytically, because it takes *so* long
to reproduce.

I'm open to other suggestions.

					-Alex

>  From a quick look, virtio-net disables on both remove and freeze, for instance.
> 
>> Instead, enable NAPI in gsi_channel_start(), when the completion
>> interrupt is first enabled.  Disable it again in gsi_channel_stop(),
>> when finally disabling the interrupt.
>>
>> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
>> NAPI polling is done before moving on.
>>
>> Signed-off-by: Alex Elder <elder@...aro.org>
>> ---
> =
>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
>>          struct gsi_channel *channel = &gsi->channel[channel_id];
>>          int ret;
>>
>> -       /* Enable the completion interrupt */
>> +       /* Enable NAPI and the completion interrupt */
>> +       napi_enable(&channel->napi);
>>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
>>
>>          ret = __gsi_channel_start(channel, true);
>> -       if (ret)
>> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
>> +       if (!ret)
>> +               return 0;
>> +
>> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
>> +       napi_disable(&channel->napi);
>>
>>          return ret;
>>   }
> 
> subjective, but easier to parse when the normal control flow is linear
> and the error path takes a branch (or goto, if reused).
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ