netdev - Re: [PATCH net] ixgbe: check return value of napi_complete

Message-ID: <E4A7BD0D-B091-4FBD-94A9-F0104729DCF6@fb.com>
Date:   Thu, 20 Sep 2018 23:43:04 +0000
From:   Song Liu <songliubraving@...com>
To:     Eric Dumazet <eric.dumazet@...il.com>
CC:     Jeff Kirsher <jeffrey.t.kirsher@...el.com>,
        netdev <netdev@...r.kernel.org>,
        "intel-wired-lan@...ts.osuosl.org" <intel-wired-lan@...ts.osuosl.org>,
        Kernel Team <Kernel-team@...com>,
        "stable@...r.kernel.org" <stable@...r.kernel.org>,
        Alexei Starovoitov <ast@...nel.org>
Subject: Re: [PATCH net] ixgbe: check return value of napi_complete_done()



> On Sep 20, 2018, at 4:22 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> 
> 
> 
> On 09/20/2018 03:42 PM, Song Liu wrote:
>> 
>> 
>>> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher <jeffrey.t.kirsher@...el.com> wrote:
>>> 
>>> On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
>>>> On 09/20/2018 12:01 PM, Song Liu wrote:
>>>>> The NIC driver should only enable interrupts when napi_complete_done()
>>>>> returns true. This patch adds the check for ixgbe.
>>>>> 
>>>>> Cc: stable@...r.kernel.org # 4.10+
>>>>> Cc: Jeff Kirsher <jeffrey.t.kirsher@...el.com>
>>>>> Suggested-by: Eric Dumazet <edumazet@...gle.com>
>>>>> Signed-off-by: Song Liu <songliubraving@...com>
>>>>> ---
>>>> 
>>>> 
>>>> Well, unfortunately we do not know why this is needed,
>>>> this is why I have not yet sent this patch formally.
>>>> 
>>>> netpoll has correct synchronization :
>>>> 
>>>> poll_napi() places into napi->poll_owner current cpu number before
>>>> calling poll_one_napi()
>>>> 
>>>> netpoll_poll_lock() does also use napi->poll_owner
>>>> 
>>>> When netpoll calls the ixgbe poll() method, it passes a budget of 0,
>>>> meaning napi_complete_done() is not called.
>>>> 
>>>> As long as we cannot explain the problem properly in the changelog,
>>>> we should investigate, otherwise we will probably see dozens of
>>>> patches trying to fix a 'potential hazard'.
>>> 
>>> Agreed, which is why I have our validation and developers looking into it,
>>> while we test the current patch from Song.
>> 
>> I figured out what the issue is, and I have a proposal to fix it. I 
>> have verified that it fixes the issue in our tests, but Alexei suggests
>> that it may not be the right fix. 
>> 
>> Here is what happened:
>> 
>> netpoll tries to send skb with netpoll_start_xmit(). If that fails, it 
>> calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in 
>> the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs 
>> within the same NIC. 
>> 
>> This is problematic, because at the end napi_schedule() calls:
>> 
>>    ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>> 
>> which attaches these NAPIs to softnet_data on THIS CPU. This is done
>> via napi->poll_list. 
>> 
>> Then suddenly ksoftirqd on this CPU owns multiple NAPIs. And it will
>> not give up the ownership until it calls napi_complete_done(). However, 
>> for a very busy server, we usually use 16 CPUs to poll NAPI, so this
>> CPU can easily be overloaded. And as a result, each call of napi->poll() 
>> will hit budget (of 64), and it will not call napi_complete_done(), 
>> and the NAPI stays in the poll_list of this CPU. 
>> 
>> When this happens, the host usually cannot get out of this state until
>> we throttle/stop client traffic. 
>> 
>> 
>> I am pretty confident this is what happened. Please let me know if 
>> anything above doesn't make sense. 
>> 
>> 
>> Here is my proposal to fix it: Instead of polling all NAPIs within one
>> NIC, I would have netpoll poll only the NAPI that will free space
>> for netpoll_start_xmit(). I attached my two RFC patches at the end of 
>> this email. 
>> 
>> I chatted with Alexei about this. He thinks polling only one NAPI may 
>> not guarantee that netpoll makes progress on the TX queue we are aiming 
>> for. Also, the bigger problem may be that NAPIs can get pinned to one 
>> CPU and never get released. 
>> 
>> At this point, I really don't know what the best way to fix this is. 
>> 
>> I will also work on a repro with netperf. 
> 
> Thanks !
> 
>> 
>> Please let me know your suggestions. 
>> 
> 
> Yeah, maybe NICs using NAPI could simply not provide an ndo_poll_controller() method at all,
> since it is very risky (it can grab many NAPIs and end up in this locked situation).
> 
> poll_napi() could attempt to free skbs one napi at a time,
> without the current cpu stealing all NAPI.
> 
> 
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 57557a6a950cc9cdff959391576a03381d328c1a..a992971d366090ba69d5c1af32eadd554d6880cf 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -205,13 +205,8 @@ static void netpoll_poll_dev(struct net_device *dev)
>        }
> 
>        ops = dev->netdev_ops;
> -       if (!ops->ndo_poll_controller) {
> -               up(&ni->dev_lock);
> -               return;
> -       }
> -
> -       /* Process pending work on NIC */
> -       ops->ndo_poll_controller(dev);
> +       if (ops->ndo_poll_controller)
> +               ops->ndo_poll_controller(dev);
> 
>        poll_napi(dev);
> 

I tried skipping ndo_poll_controller() entirely here. That did avoid
the issue. However, netpoll then drops (fails to send) more packets. 
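
For anyone who wants to reason about the stuck state described earlier in
this thread, here is a rough userspace sketch of it. It is a toy model, not
kernel code: every toy_* name is made up for illustration, "overload" is
reduced to the assumption that at least a full budget of packets arrives on
each ring between polls, and the only kernel behavior it mirrors is that a
NAPI leaves the CPU's poll list only when poll() uses less than its budget
(the point where a driver would call napi_complete_done()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BUDGET    64   /* per-poll budget mentioned in this thread */
#define TOY_MAX_NAPIS 16   /* hypothetical upper bound for this sketch */

/* Hypothetical stand-in for one NAPI context. */
struct toy_napi {
	unsigned int backlog;   /* packets waiting in the ring */
	bool on_poll_list;      /* still owned by the polling CPU? */
};

/* Toy poll(): handle at most `budget` packets, return how many. */
static unsigned int toy_poll(struct toy_napi *n, unsigned int budget)
{
	unsigned int done = n->backlog < budget ? n->backlog : budget;

	n->backlog -= done;
	return done;
}

/*
 * One softirq pass on a single CPU over every NAPI it owns.
 * `arrivals` packets land on each owned ring before it is polled
 * (an assumption standing in for sustained client traffic).
 * A NAPI is released only when it used less than its budget.
 */
static void toy_softirq_pass(struct toy_napi *napis, size_t count,
			     unsigned int arrivals)
{
	for (size_t i = 0; i < count; i++) {
		struct toy_napi *n = &napis[i];

		if (!n->on_poll_list)
			continue;
		n->backlog += arrivals;
		if (toy_poll(n, TOY_BUDGET) < TOY_BUDGET)
			n->on_poll_list = false;   /* "napi_complete_done()" */
	}
}

/*
 * Put `count` NAPIs on one CPU's list (as ndo_poll_controller() does
 * when it schedules all of them), run `passes` passes, and return how
 * many NAPIs that CPU still owns afterwards.
 */
static size_t toy_run(size_t count, unsigned int arrivals, unsigned int passes)
{
	struct toy_napi napis[TOY_MAX_NAPIS];
	size_t owned = 0;

	assert(count <= TOY_MAX_NAPIS);
	for (size_t i = 0; i < count; i++) {
		napis[i].backlog = 0;
		napis[i].on_poll_list = true;
	}
	for (unsigned int p = 0; p < passes; p++)
		toy_softirq_pass(napis, count, arrivals);
	for (size_t i = 0; i < count; i++)
		owned += napis[i].on_poll_list ? 1 : 0;
	return owned;
}
```

In this model, toy_run(16, 64, 100) leaves all 16 NAPIs on the one CPU no
matter how many passes are run, while toy_run(16, 10, 100) releases all of
them on the first pass. That matches what we see: the host only recovers
once client traffic is throttled.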

Thanks,
Song