netdev - Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <eba8e7e4-e24b-82cc-5823-906716e9edc2@fb.com>
Date:   Sat, 29 Apr 2017 18:35:40 -0700
From:   Alexei Starovoitov <ast@...com>
To:     Hannes Frederic Sowa <hannes@...essinduktion.org>,
        John Fastabend <john.fastabend@...il.com>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        "Andy Gospodarek" <andy@...yhouse.net>
CC:     Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Daniel Borkmann <borkmann@...earbox.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "xdp-newbies@...r.kernel.org" <xdp-newbies@...r.kernel.org>
Subject: Re: xdp_redirect ifindex vs port. Was: best API for returning/setting
 egress port?

On 4/28/17 12:43 PM, Hannes Frederic Sowa wrote:
> On 28.04.2017 07:30, Alexei Starovoitov wrote:
>> On 4/27/17 10:06 PM, John Fastabend wrote:
>>> That is more or less what I was thinking as well. The other question
>>> I have though is should we have a bpf_redirect() call for the simple
>>> case where I use the ifindex directly. This will be helpful for taking
>>> existing programs from tc_cls into xdp. I think it makes sense to have
>>> both bpf_tx_allports(), bpf_tx_port(), and bpf_redirect().
>>
>> I think so too.
>> Once netdevice is stored into netdev_array map the netdevice is pinned
>> and we need to figure out what to do if somebody tries to delete it.
>> Should we add a new netlink notifier that this netdev's refcnt is
>> almost zero and it's only in netdev_array(s) ?
>
> We basically do that automatically in netdev_wait_allrefs:
>
> pr_emerg("unregister_netdevice: waiting for %s to become free. Usage
> count = %d\n",
>          dev->name, refcnt);
>
> It is a very unpleasant warning and users probably think about a bug in
> the kernel at first.
>
> I don't think we should wait for user space to clean that up but have to
> do it automatically from the kernel. Maybe we can introduce a special
> value that basically NOPs the transmission. The hash table itself would
> install a netdevice notifier and would clean all tables. Could
> definitely cause some storm in the kernel, if a lot of keys are mapped
> to the same interface.
>
>> or should it be deleted from the array(s) automatically and
>> then user space will be notified post-deletion?
>> Both approaches have their pros and cons.
>
> I am leaning more towards deleting it automatically. But walking all
> tables and in there all keys might cause some unwanted load spikes.

yeah, when netdev is being unregistered we have to drop the ref
to avoid the warning.

Speaking of delete notifiers for maps.
Until recently all deletions were explicit and the program
could do extra perf_event_submit() if it needed to notify
user space of deletion.
But right now we have LRU map that deletes automatically and this
netdev_array will be deleting automatically too, so we probably
need some sort of generic map notifier for all map types.
It can be as simple as user space sleeping on read(map_fd, buf);
or waiting for epoll on map_fd and kernel wakes it up
when map element is deleted and pushes 'key/value' pair
into the buf. That should be generic for all map types,
but it needs some mechanism to avoid blocking, since number
of deletions can be huge. Something similar to perf's event record
'lost N records' should do the trick.