netdev - Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YazDh1HkLBM4BiCW@shredder>
Date:   Sun, 5 Dec 2021 15:49:59 +0200
From:   Ido Schimmel <idosch@...sch.org>
To:     Lahav Schlesinger <lschlesinger@...venets.com>
Cc:     netdev@...r.kernel.org, kuba@...nel.org, dsahern@...il.com
Subject: Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice
 bulk deletion

On Sun, Dec 05, 2021 at 02:11:00PM +0200, Lahav Schlesinger wrote:
> On Sun, Dec 05, 2021 at 11:53:03AM +0200, Ido Schimmel wrote:
> > CAUTION: External E-Mail - Use caution with links and attachments
> >
> >
> > On Thu, Dec 02, 2021 at 07:45:02PM +0200, Lahav Schlesinger wrote:
> > > Under large scale, some routers are required to support tens of thousands
> > > of devices at once, both physical and virtual (e.g. loopbacks, tunnels,
> > > vrfs, etc).
> > > At times such routers are required to delete massive amounts of devices
> > > at once, such as when a factory reset is performed on the router (causing
> > > a deletion of all devices), or when a configuration is restored after an
> > > upgrade, or as a request from an operator.
> > >
> > > Currently there are 2 means of deleting devices using Netlink:
> > > 1. Deleting a single device (either by ifindex using ifinfomsg::ifi_index,
> > > or by name using IFLA_IFNAME)
> > > 2. Delete all device that belong to a group (using IFLA_GROUP)
> > >
> > > Deletion of devices one-by-one has poor performance on large scale of
> > > devices compared to "group deletion":
> > > After all device are handled, netdev_run_todo() is called which
> > > calls rcu_barrier() to finish any outstanding RCU callbacks that were
> > > registered during the deletion of the device, then wait until the
> > > refcount of all the devices is 0, then perform final cleanups.
> > >
> > > However, calling rcu_barrier() is a very costly operation, each call
> > > taking in the order of 10s of milliseconds.
> > >
> > > When deleting a large number of device one-by-one, rcu_barrier()
> > > will be called for each device being deleted.
> > > As an example, following benchmark deletes 10K loopback devices,
> > > all of which are UP and with only IPv6 LLA being configured:
> > >
> > > 1. Deleting one-by-one using 1 thread : 243 seconds
> > > 2. Deleting one-by-one using 10 thread: 70 seconds
> > > 3. Deleting one-by-one using 50 thread: 54 seconds
> > > 4. Deleting all using "group deletion": 30 seconds
> > >
> > > Note that even though the deletion logic takes place under the rtnl
> > > lock, since the call to rcu_barrier() is outside the lock we gain
> > > some improvements.
> > >
> > > But, while "group deletion" is the fastest, it is not suited for
> > > deleting large number of arbitrary devices which are unknown a head of
> > > time. Furthermore, moving large number of devices to a group is also a
> > > costly operation.
> >
> > These are the number I get in a VM running on my laptop.
> >
> > Moving 16k dummy netdevs to a group:
> >
> > # time -p ip -b group.batch
> > real 1.91
> > user 0.04
> > sys 0.27
> >
> > Deleting the group:
> >
> > # time -p ip link del group 10
> > real 6.15
> > user 0.00
> > sys 3.02
> >
> 
> Hi Ido, in your tests in which state the dummy devices are before
> deleting/changing group?
> When they are DOWN I get similar numbers to yours (16k devices):
> 
> # time ip -b group_16000_batch
> real	0m0.640s
> user	0m0.152s
> sys	0m0.478s
> 
> # time ip link delete group 100
> real	0m5.324s
> user	0m0.017s
> sys	0m4.991s
> 
> But when the devices are in state UP, I get:
> 
> # time ip -b group_16000_batch
> real	0m48.605s
> user	0m0.218s
> sys	0m48.244s
> 
> # time ip link delete group 100
> real	1m13.219s
> user	0m0.010s
> sys	1m9.117s
> 
> And for completeness, setting the devices to DOWN prior to deleting them
> is as fast as deleting them in the first place while they're UP.
> 
> Also, while this is probably a minuscule issue, changing the group of
> 10ks+ of interfaces will result in a storm of netlink events that will
> make any userspace program listening on link events to spend time
> handling these events.  This will result in twice as many events
> compared to directly deleting the devices.

Yes, in my setup the netdevs were down. Looking at the code, I think the
reason for the 75x increase in latency is the fact that netlink
notifications are not generated when the netdev is down. See
netdev_state_change().

In your specific case, it is quite useless for the kernel to generate
16k notifications when moving the netdevs to a group since the entire
reason they are moved to a group is so that they could be deleted in a
batch.

I assume that there are other use cases where having the kernel suppress
notifications can be useful. Did you consider adding such a flag to the
request? I think such a mechanism is more generic/useful than an ad-hoc
API to delete a list of netdevs and should allow you to utilize the
existing group deletion mechanism.

> 
> > IMO, these numbers do not justify a new API. Also, your user space can
> > be taught to create all the netdevs in the same group to begin with:
> >
> > # ip link add name dummy1 group 10 type dummy
> > # ip link show dev dummy1
> > 10: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group 10 qlen 1000
> >     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff
> >
> > Moreover, unlike the list API that is specific to deletion, the group
> > API also lets you batch set operations:
> >
> > # ip link set group 10 mtu 2000
> > # ip link show dev dummy1
> > 10: dummy1: <BROADCAST,NOARP> mtu 2000 qdisc noop state DOWN mode
> > DEFAULT group 10 qlen 1000
> >     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff
> 
> The list API can be extended to support other operations as well
> (similar to group set operations, we can call do_setlink() for each
> device specified in an IFLA_IFINDEX).
> I didn't implement it in this patch because we don't have a use for it
> currently.
> 
> >
> > If you are using namespaces, then during "factory reset" you can delete
> > the namespace which should trigger batch deletion of the netdevs inside
> > it.
> >
> 
> In some scenarios we are required to delete only a subset of devices
> (e.g. when a physical link becomes DOWN, we need to delete all the VLANs
> and any tunnels configured on that device).  Furthermore, a user is
> allowed to load a new configuration in which he deletes only some of the
> devices (e.g. delete all of the loopbacks in the system), while not
> touching the other devices.
> 
> > >
> > > This patch adds support for passing an arbitrary list of ifindex of
> > > devices to delete with a new IFLA_IFINDEX attribute. A single message
> > > may contain multiple instances of this attribute).
> > > This gives a more fine-grained control over which devices to delete,
> > > while still resulting in rcu_barrier() being called only once.
> > > Indeed, the timings of using this new API to delete 10K devices is
> > > the same as using the existing "group" deletion.
> > >
> > > Signed-off-by: Lahav Schlesinger <lschlesinger@...venets.com>