netdev - Re: [PATCH bpf v3] xdp: fix hang while unregistering device bound to xdp socket

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <625663fb-1b2b-c7b9-1af9-68bb80a3c12b@samsung.com>
Date:   Tue, 11 Jun 2019 18:42:22 +0300
From:   Ilya Maximets <i.maximets@...sung.com>
To:     Björn Töpel <bjorn.topel@...il.com>
Cc:     Jonathan Lemon <jonathan.lemon@...il.com>,
        Netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        Xdp <xdp-newbies@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Björn Töpel <bjorn.topel@...el.com>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        Jakub Kicinski <jakub.kicinski@...ronome.com>
Subject: Re: [PATCH bpf v3] xdp: fix hang while unregistering device bound
 to xdp socket

On 11.06.2019 15:13, Björn Töpel wrote:
> On Tue, 11 Jun 2019 at 10:42, Ilya Maximets <i.maximets@...sung.com> wrote:
>>
>> On 11.06.2019 11:09, Björn Töpel wrote:
>>> On Mon, 10 Jun 2019 at 22:49, Jonathan Lemon <jonathan.lemon@...il.com> wrote:
>>>>
>>>> On 10 Jun 2019, at 9:15, Ilya Maximets wrote:
>>>>
>>>>> Device that bound to XDP socket will not have zero refcount until the
>>>>> userspace application will not close it. This leads to hang inside
>>>>> 'netdev_wait_allrefs()' if device unregistering requested:
>>>>>
>>>>>   # ip link del p1
>>>>>   < hang on recvmsg on netlink socket >
>>>>>
>>>>>   # ps -x | grep ip
>>>>>   5126  pts/0    D+   0:00 ip link del p1
>>>>>
>>>>>   # journalctl -b
>>>>>
>>>>>   Jun 05 07:19:16 kernel:
>>>>>   unregister_netdevice: waiting for p1 to become free. Usage count = 1
>>>>>
>>>>>   Jun 05 07:19:27 kernel:
>>>>>   unregister_netdevice: waiting for p1 to become free. Usage count = 1
>>>>>   ...
>>>>>
>>>>> Fix that by implementing NETDEV_UNREGISTER event notification handler
>>>>> to properly clean up all the resources and unref device.
>>>>>
>>>>> This should also allow socket killing via ss(8) utility.
>>>>>
>>>>> Fixes: 965a99098443 ("xsk: add support for bind for Rx")
>>>>> Signed-off-by: Ilya Maximets <i.maximets@...sung.com>
>>>>> ---
>>>>>
>>>>> Version 3:
>>>>>
>>>>>     * Declaration lines ordered from longest to shortest.
>>>>>     * Checking of event type moved to the top to avoid unnecessary
>>>>>       locking.
>>>>>
>>>>> Version 2:
>>>>>
>>>>>     * Completely re-implemented using netdev event handler.
>>>>>
>>>>>  net/xdp/xsk.c | 65
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>>>  1 file changed, 64 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>>>> index a14e8864e4fa..273a419a8c4d 100644
>>>>> --- a/net/xdp/xsk.c
>>>>> +++ b/net/xdp/xsk.c
>>>>> @@ -693,6 +693,57 @@ static int xsk_mmap(struct file *file, struct
>>>>> socket *sock,
>>>>>                              size, vma->vm_page_prot);
>>>>>  }
>>>>>
>>>>> +static int xsk_notifier(struct notifier_block *this,
>>>>> +                     unsigned long msg, void *ptr)
>>>>> +{
>>>>> +     struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>>>>> +     struct net *net = dev_net(dev);
>>>>> +     int i, unregister_count = 0;
>>>>> +     struct sock *sk;
>>>>> +
>>>>> +     switch (msg) {
>>>>> +     case NETDEV_UNREGISTER:
>>>>> +             mutex_lock(&net->xdp.lock);
>>>>
>>>> The call is under the rtnl lock, and we're not modifying
>>>> the list, so this mutex shouldn't be needed.
>>>>
>>>
>>> The list can, however, be modified outside the rtnl lock (e.g. at
>>> socket creation). AFAIK the hlist cannot be traversed lock-less,
>>> right?
>>
>> We could use 'rcu_read_lock' instead and iterate with 'sk_for_each_rcu',
>> but we'll not be able to synchronize inside.
>>
>>>
>>>>
>>>>> +             sk_for_each(sk, &net->xdp.list) {
>>>>> +                     struct xdp_sock *xs = xdp_sk(sk);
>>>>> +
>>>>> +                     mutex_lock(&xs->mutex);
>>>>> +                     if (dev != xs->dev) {
>>>>> +                             mutex_unlock(&xs->mutex);
>>>>> +                             continue;
>>>>> +                     }
>>>>> +
>>>>> +                     sk->sk_err = ENETDOWN;
>>>>> +                     if (!sock_flag(sk, SOCK_DEAD))
>>>>> +                             sk->sk_error_report(sk);
>>>>> +
>>>>> +                     /* Wait for driver to stop using the xdp socket. */
>>>>> +                     xdp_del_sk_umem(xs->umem, xs);
>>>>> +                     xs->dev = NULL;
>>>>> +                     synchronize_net();
>>>> Isn't this by handled by the unregister_count case below?
>>>>
>>>
>>> To clarify, setting dev to NULL and xdp_del_sk_umem() + sync makes
>>> sure that a driver doesn't touch the Tx and Rx rings. Nothing can be
>>> assumed about completion + fill ring (umem), until zero-copy has been
>>> disabled via ndo_bpf.
>>>
>>>>> +
>>>>> +                     /* Clear device references in umem. */
>>>>> +                     xdp_put_umem(xs->umem);
>>>>> +                     xs->umem = NULL;
>>>>
>>>> This makes me uneasy.  We need to unregister the umem from
>>>> the device (xdp_umem_clear_dev()) but this can remove the umem
>>>> pages out from underneath the xsk.
>>>>
>>>
>>> Yes, this is scary. The socket is alive, and userland typically has
>>> the fill/completion rings mmapped. Then the umem refcount is decreased
>>> and can potentially free the umem (fill rings etc.), as Jonathan says,
>>> underneath the xsk. Also, setting the xs umem/dev to zero, while the
>>> socket is alive, would allow a user to re-setup the socket, which we
>>> don't want to allow.
>>>
>>>> Perhaps what's needed here is the equivalent of an unbind()
>>>> call that just detaches the umem/sk from the device, but does
>>>> not otherwise tear them down.
>>>>
>>>
>>> Yeah, I agree. A detached/zombie state is needed during the socket lifetime.
>>
>>
>> I could try to rip the 'xdp_umem_release()' apart so the 'xdp_umem_clear_dev()'
>> could be called separately. This will allow to not tear down the 'umem'.
>> However, it seems that it'll not be easy to synchronize all parts.
>> Any suggestions are welcome.
>>
> 
> Thanks for continuing to work on this, Ilya.
> 
> What need to be done is exactly an "unbind()", i.e. returning the
> socket to the state prior bind, but disallowing any changes from
> userland (e.g. setsockopt/bind). So, unbind() + track that we're in
> "unbound" mode. :-) I think breaking up xdp_umem_release() is good way
> to go.

Thanks, I'll move in this direction.

BTW, I'll be out of office from tomorrow until the end of the week.
So, I'll most probably return to this on Monday.

> 
>> Also, there is no way to not clear the 'dev' as we have to put the reference.
>> Maybe we could add the additional check to 'xsk_bind()' for current device
>> state (dev->reg_state == NETREG_REGISTERED). This will allow us to avoid
>> re-setup of the socket.
>>
> 
> Yes, and also make sure that the rest of the syscall implementations
> don't allow for re-setup.

OK.

> 
> 
> Björn
> 
>>>
>>>>
>>>>> +                     mutex_unlock(&xs->mutex);
>>>>> +                     unregister_count++;
>>>>> +             }
>>>>> +             mutex_unlock(&net->xdp.lock);
>>>>> +
>>>>> +             if (unregister_count) {
>>>>> +                     /* Wait for umem clearing completion. */
>>>>> +                     synchronize_net();
>>>>> +                     for (i = 0; i < unregister_count; i++)
>>>>> +                             dev_put(dev);
>>>>> +             }
>>>>> +
>>>>> +             break;
>>>>> +     }
>>>>> +
>>>>> +     return NOTIFY_DONE;
>>>>> +}
>>>>> +
>>>>>  static struct proto xsk_proto = {
>>>>>       .name =         "XDP",
>>>>>       .owner =        THIS_MODULE,
>>>>> @@ -727,7 +778,8 @@ static void xsk_destruct(struct sock *sk)
>>>>>       if (!sock_flag(sk, SOCK_DEAD))
>>>>>               return;
>>>>>
>>>>> -     xdp_put_umem(xs->umem);
>>>>> +     if (xs->umem)
>>>>> +             xdp_put_umem(xs->umem);
>>>> Not needed - xdp_put_umem() already does a null check.
>>
>> Indeed. Thanks.
>>
>>>> --
>>>> Jonathan
>>>>
>>>>
>>>>>
>>>>>       sk_refcnt_debug_dec(sk);
>>>>>  }
>>>>> @@ -784,6 +836,10 @@ static const struct net_proto_family
>>>>> xsk_family_ops = {
>>>>>       .owner  = THIS_MODULE,
>>>>>  };
>>>>>
>>>>> +static struct notifier_block xsk_netdev_notifier = {
>>>>> +     .notifier_call  = xsk_notifier,
>>>>> +};
>>>>> +
>>>>>  static int __net_init xsk_net_init(struct net *net)
>>>>>  {
>>>>>       mutex_init(&net->xdp.lock);
>>>>> @@ -816,8 +872,15 @@ static int __init xsk_init(void)
>>>>>       err = register_pernet_subsys(&xsk_net_ops);
>>>>>       if (err)
>>>>>               goto out_sk;
>>>>> +
>>>>> +     err = register_netdevice_notifier(&xsk_netdev_notifier);
>>>>> +     if (err)
>>>>> +             goto out_pernet;
>>>>> +
>>>>>       return 0;
>>>>>
>>>>> +out_pernet:
>>>>> +     unregister_pernet_subsys(&xsk_net_ops);
>>>>>  out_sk:
>>>>>       sock_unregister(PF_XDP);
>>>>>  out_proto:
>>>>> --
>>>>> 2.17.1
>>>
>>>
> 
>