netdev - Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion

Message-ID: <20211205121059.btshgxt7s7hfnmtr@kgollan-pc>
Date:   Sun, 5 Dec 2021 14:11:00 +0200
From:   Lahav Schlesinger <lschlesinger@...venets.com>
To:     Ido Schimmel <idosch@...sch.org>
Cc:     netdev@...r.kernel.org, kuba@...nel.org, dsahern@...il.com
Subject: Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice
 bulk deletion

On Sun, Dec 05, 2021 at 11:53:03AM +0200, Ido Schimmel wrote:
> On Thu, Dec 02, 2021 at 07:45:02PM +0200, Lahav Schlesinger wrote:
> > Under large scale, some routers are required to support tens of thousands
> > of devices at once, both physical and virtual (e.g. loopbacks, tunnels,
> > vrfs, etc).
> > At times such routers are required to delete massive amounts of devices
> > at once, such as when a factory reset is performed on the router (causing
> > a deletion of all devices), or when a configuration is restored after an
> > upgrade, or as a request from an operator.
> >
> > Currently there are 2 means of deleting devices using Netlink:
> > 1. Deleting a single device (either by ifindex using ifinfomsg::ifi_index,
> > or by name using IFLA_IFNAME)
> > 2. Deleting all devices that belong to a group (using IFLA_GROUP)
> >
> > Deletion of devices one-by-one has poor performance on large scale of
> > devices compared to "group deletion":
> > After all devices are handled, netdev_run_todo() is called which
> > calls rcu_barrier() to finish any outstanding RCU callbacks that were
> > registered during the deletion of the device, then wait until the
> > refcount of all the devices is 0, then perform final cleanups.
> >
> > However, calling rcu_barrier() is a very costly operation, each call
> > taking in the order of 10s of milliseconds.
> >
> > When deleting a large number of devices one-by-one, rcu_barrier()
> > will be called for each device being deleted.
> > As an example, following benchmark deletes 10K loopback devices,
> > all of which are UP and with only IPv6 LLA being configured:
> >
> > 1. Deleting one-by-one using 1 thread  : 243 seconds
> > 2. Deleting one-by-one using 10 threads: 70 seconds
> > 3. Deleting one-by-one using 50 threads: 54 seconds
> > 4. Deleting all using "group deletion" : 30 seconds
> >
> > Note that even though the deletion logic takes place under the rtnl
> > lock, since the call to rcu_barrier() is outside the lock we gain
> > some improvements.
> >
> > But, while "group deletion" is the fastest, it is not suited for
> > deleting a large number of arbitrary devices which are unknown ahead of
> > time. Furthermore, moving a large number of devices to a group is also a
> > costly operation.
>
> These are the numbers I get in a VM running on my laptop.
>
> Moving 16k dummy netdevs to a group:
>
> # time -p ip -b group.batch
> real 1.91
> user 0.04
> sys 0.27
>
> Deleting the group:
>
> # time -p ip link del group 10
> real 6.15
> user 0.00
> sys 3.02
>

Hi Ido, in your tests, what state are the dummy devices in before you
delete them or change their group?
When they are DOWN I get similar numbers to yours (16k devices):

# time ip -b group_16000_batch
real	0m0.640s
user	0m0.152s
sys	0m0.478s

# time ip link delete group 100
real	0m5.324s
user	0m0.017s
sys	0m4.991s

But when the devices are in state UP, I get:

# time ip -b group_16000_batch
real	0m48.605s
user	0m0.218s
sys	0m48.244s

# time ip link delete group 100
real	1m13.219s
user	0m0.010s
sys	1m9.117s
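
For anyone who wants to reproduce these runs, here is a minimal sketch of how
a batch file like group_16000_batch could be generated. The contents of my
batch file are not shown in this mail, so the one-command-per-device layout
and the device names dummy0..dummy15999 are assumptions made for
illustration; only the `link set dev NAME group N` syntax is standard
iproute2.

```python
def make_group_batch(names, group):
    """Return the text of an `ip -b` batch file that moves each device
    in `names` into `group`, one `link set` command per line."""
    return "".join(f"link set dev {name} group {group}\n" for name in names)


# Hypothetical device names; the real setup may use different ones.
names = [f"dummy{i}" for i in range(16000)]
batch_text = make_group_batch(names, 100)

if __name__ == "__main__":
    # Write the batch file so it can be fed to `ip -b`.
    with open("group_16000_batch", "w") as f:
        f.write(batch_text)
    print(batch_text.splitlines()[0])
```

The resulting file can then be passed to `ip -b group_16000_batch` as in the
timings above.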

And for completeness, setting the devices to DOWN prior to deleting them
takes about as long as deleting them directly while they're UP.
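
To put the four measurements above on a per-device basis, the following
sketch divides the wall-clock ("real") times by the 16k devices and compares
UP against DOWN. It uses only the numbers from the runs in this mail; the
helper names are made up for the example.

```python
DEVICES = 16000

# Wall-clock ("real") times in seconds, copied from the runs above.
timings = {
    ("DOWN", "change group"): 0.640,
    ("DOWN", "delete group"): 5.324,
    ("UP", "change group"): 48.605,
    ("UP", "delete group"): 60 + 13.219,  # 1m13.219s
}


def per_device_ms(total_seconds, devices=DEVICES):
    """Average cost per device in milliseconds."""
    return total_seconds / devices * 1000.0


def slowdown(op):
    """How many times slower `op` is with the devices UP than DOWN."""
    return timings[("UP", op)] / timings[("DOWN", op)]


for (state, op), secs in timings.items():
    print(f"{state:4} {op:12} {secs:8.3f} s total, "
          f"{per_device_ms(secs):.3f} ms/device")

for op in ("change group", "delete group"):
    print(f"{op}: UP is {slowdown(op):.1f}x slower than DOWN")
```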

Also, while this is probably a minor issue, changing the group of tens
of thousands of interfaces will cause a storm of netlink events, and any
userspace program listening for link events will have to spend time
handling them.  In total this produces twice as many events as deleting
the devices directly.

> IMO, these numbers do not justify a new API. Also, your user space can
> be taught to create all the netdevs in the same group to begin with:
>
> # ip link add name dummy1 group 10 type dummy
> # ip link show dev dummy1
> 10: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group 10 qlen 1000
>     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff
>
> Moreover, unlike the list API that is specific to deletion, the group
> API also lets you batch set operations:
>
> # ip link set group 10 mtu 2000
> # ip link show dev dummy1
> 10: dummy1: <BROADCAST,NOARP> mtu 2000 qdisc noop state DOWN mode DEFAULT group 10 qlen 1000
>     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff

The list API can be extended to support other operations as well
(similar to group set operations, we can call do_setlink() for each
device specified in an IFLA_IFINDEX).
I didn't implement it in this patch because we don't have a use for it
currently.
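
As a rough illustration of the list API, here is a sketch of how a single
RTM_DELLINK request carrying several IFLA_IFINDEX attributes might be laid
out on the wire. It only builds the bytes and does not send anything.
Several things here are assumptions: the numeric value of IFLA_IFINDEX (a
placeholder is used), the payload being a 32-bit integer, and ifi_index
being left at 0 when a list is given. The nlmsghdr, ifinfomsg and rtattr
layouts are the standard rtnetlink ones.

```python
import struct

RTM_DELLINK = 17        # from linux/rtnetlink.h
NLM_F_REQUEST = 0x01    # from linux/netlink.h
NLM_F_ACK = 0x04
AF_UNSPEC = 0

# Placeholder only: the real attribute number would come from the patch.
IFLA_IFINDEX_PLACEHOLDER = 0x3FFF


def rtattr_s32(attr_type, value):
    """struct rtattr (u16 len, u16 type) followed by a 4-byte payload.
    Header plus payload is 8 bytes, already 4-byte aligned."""
    return struct.pack("=HHi", 8, attr_type, value)


def build_bulk_dellink(ifindexes, attr_type, seq=1):
    """Build one RTM_DELLINK message with one attribute per ifindex."""
    attrs = b"".join(rtattr_s32(attr_type, idx) for idx in ifindexes)
    # struct ifinfomsg: family, pad, type, index, flags, change (16 bytes).
    # ifi_index is assumed to stay 0 when a list of devices is passed.
    ifinfo = struct.pack("=BxHiII", AF_UNSPEC, 0, 0, 0, 0)
    length = 16 + len(ifinfo) + len(attrs)
    # struct nlmsghdr: len, type, flags, seq, pid (16 bytes).
    header = struct.pack("=IHHII", length, RTM_DELLINK,
                         NLM_F_REQUEST | NLM_F_ACK, seq, 0)
    return header + ifinfo + attrs


msg = build_bulk_dellink([10, 11, 12], IFLA_IFINDEX_PLACEHOLDER)
print(len(msg), "bytes for 3 devices in a single request")
```

Because all the devices travel in one request, the kernel can handle them
in one pass, which is what lets rcu_barrier() be called only once.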

>
> If you are using namespaces, then during "factory reset" you can delete
> the namespace which should trigger batch deletion of the netdevs inside
> it.
>

In some scenarios we are required to delete only a subset of devices
(e.g. when a physical link goes DOWN, we need to delete all the VLANs
and any tunnels configured on that device).  Furthermore, a user is
allowed to load a new configuration that deletes only some of the
devices (e.g. all of the loopbacks in the system), while not
touching the other devices.

> >
> > This patch adds support for passing an arbitrary list of ifindexes of
> > devices to delete with a new IFLA_IFINDEX attribute (a single message
> > may contain multiple instances of this attribute).
> > This gives a more fine-grained control over which devices to delete,
> > while still resulting in rcu_barrier() being called only once.
> > Indeed, the time taken to delete 10K devices using this new API is
> > the same as using the existing "group" deletion.
> >
> > Signed-off-by: Lahav Schlesinger <lschlesinger@...venets.com>