netdev - net: Improve route scalability via support for nexthop objects

Message-ID: <cf07acf6-050c-ef1d-be64-4822476ef54e@gmail.com>
Date:   Thu, 14 Mar 2019 14:19:53 -0600
From:   David Ahern <dsahern@...il.com>
To:     David Miller <davem@...emloft.net>
Cc:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Roopa Prabhu <roopa@...ulusnetworks.com>,
        Ido Schimmel <idosch@...sch.org>
Subject: net: Improve route scalability via support for nexthop objects

TL;DR:
The nexthop changes are finally ready for consideration for inclusion
into the networking stack. The patch count currently stands at 86
including tests. The majority of those refactor the existing code
base, with the last 16 implementing the nexthop feature and selftests.

The first 27 patches move the IPv4 and IPv6 code to work with a
fib_nh_common, a new struct which contains the common elements of fib_nh
and fib6_nh, and then refactor the existing fib_dump_info to work for
both protocols. With that in place, the next 15 patches do more changes
to IPv4 to enable IPv6 gateways with IPv4 routes (a.k.a. RFC 5549) using
the RTA_VIA attribute.

From there the next 24 patches refactor IPv6, introducing a fib6_result
similar to IPv4's fib_result which allows a fib6_nh that is not within a
fib6_info and adding hooks to the ipv6 stubs (bump sernum, send route
notifications and delete routes based on nexthop updates).

This is followed by a few IPv4 exports and then the last 16 patches add
the nexthop feature.

I plan to start sending patches next week once net-next opens. Since it
will take a while to get all of them in, I wanted to make sure the end
goal is known and understood.

For anyone interested in seeing the patches ahead of time, they are here:
   https://github.com/dsahern/linux nexthops-v5.1-next-v2

(order of the patches may change)

==
Long version:

As mentioned at netconf in Seoul, we would like to introduce nexthops as
independent objects from the routes to better align with both routing
daemons and hardware and to improve route insertion times into the kernel.

This series adds nexthop objects with their own lifecycle. The model
retains a lot of the established semantics from routes. One difference
is that the behavior of nexthop objects aligns better with the target
users - routing daemons and switch ASICs. Specifically, with the
exception of the blackhole nexthop, all nexthops must reference a
netdevice and the device must be admin up with carrier. If a device goes
down (admin or carrier) the nexthop is evicted along with all routes
referencing it.

Workflow-wise, nexthops are created first:
  { nexthop }  --> { gateway, device }

And then prefixes are installed pointing to the nexthop by id:
  { prefix } ----> { nexthop }

with the resulting route looking very similar to the existing code:
  { prefix } ----> { nexthop }  --> { gateway, device }

A nexthop can be a group which references other nexthops:

            /---> { nexthop, weight }
  { nexthop }         ...
            \---> { nexthop, weight }

Prefixes referencing the group nexthop are then multipath routes:

                             /---> { nexthop, weight } --> { gw, dev }
  { prefix } ----> { nexthop }         ...
                             \---> { nexthop, weight } --> { gw, dev }

Nexthop data (gw, dev or entries in a group) can be updated atomically,
allowing for the efficient update of all prefixes in one replace command.
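
To make the indirection concrete, here is a minimal userspace sketch of
the model above. The struct and function names are made up for
illustration and are not the kernel's data structures; it also ignores
groups and any locking/RCU. It only shows why a single replace of the
nexthop is seen by every prefix that references it.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative userspace model only; not the kernel's structures. */
struct nexthop {
    uint32_t id;        /* nexthop id, as in "ip nexthop add id 1 ..." */
    char gateway[46];   /* gateway address in text form */
    char dev[16];       /* device name */
};

struct route {
    char prefix[50];            /* e.g. "10.1.1.0/24" */
    const struct nexthop *nh;   /* a route points at a nexthop, not at gw/dev */
};

/*
 * Overwrite gateway and device of an existing nexthop in place. Every
 * route holding a pointer to this object sees the new data without
 * being touched itself.
 */
static void nexthop_replace(struct nexthop *nh, const char *gw, const char *dev)
{
    snprintf(nh->gateway, sizeof(nh->gateway), "%s", gw);
    snprintf(nh->dev, sizeof(nh->dev), "%s", dev);
}
```

With two routes pointing at the same struct nexthop, one call to
nexthop_replace() changes the gateway and device both routes resolve to.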


Notifications
=============
1. A new rtnl group is defined, RTNLGRP_NEXTHOP.

Since its group id is > 31, applications need to use the
NETLINK_ADD_MEMBERSHIP setsockopt option to subscribe to the group:

     unsigned int group = RTNLGRP_NEXTHOP;
     setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
                &group, sizeof(group));

2. Nexthop notifications are generated for the usual add, delete,
replace lifecycle.

3. Notifications for route add, replace, and delete are identical to
the legacy ones with one new attribute, RTA_NHID, if the route
references a standalone nexthop object. This applies to route changes
and to any add, delete, replace of a nexthop object used by FIB entries.
This model retains backwards compatibility such that the existing
ecosystem of software that does not natively understand nexthop objects
is not impacted by the use of nexthop objects. The expectation is that
unknown attributes (the new RTA_NHID) are ignored by legacy apps.

4. Nexthop notifications are NOT generated when a nexthop is removed due
to a device event (e.g., admin or carrier down). Userspace is expected to
monitor link events and remove nexthops and routes associated with the
device. By extension, notifications are NOT generated for routes evicted
because of the removal of a nexthop when it is removed by a device event.
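
As a rough illustration of point 3, the sketch below walks a list of
route attributes with the standard RTA_* macros from <linux/rtnetlink.h>
and picks out one u32 attribute by type. The helper names are made up,
and the attribute type is passed in as a parameter so that the sketch
does not assume a particular numeric value for RTA_NHID. A legacy parser
that never asks for that type just steps over it in the same loop.

```c
#include <sys/socket.h>
#include <linux/rtnetlink.h>
#include <stdint.h>
#include <string.h>

/*
 * Append one u32 attribute at byte offset 'off' of a 4-byte aligned
 * buffer and return the offset just past it. Only used here to build
 * sample input; the caller must provide enough room.
 */
static int put_u32_attr(void *buf, int off, unsigned short type, uint32_t val)
{
    struct rtattr *rta = (struct rtattr *)((char *)buf + off);

    rta->rta_type = type;
    rta->rta_len = RTA_LENGTH(sizeof(val));
    memcpy(RTA_DATA(rta), &val, sizeof(val));
    return off + (int)RTA_SPACE(sizeof(val));
}

/*
 * Walk an attribute list and copy out the u32 payload of 'type'.
 * Returns 1 if found, 0 otherwise. Attributes of any other type are
 * skipped, which is how an app unaware of a new attribute ignores it.
 */
static int find_u32_attr(struct rtattr *rta, int len, unsigned short type,
                         uint32_t *out)
{
    for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
        if (rta->rta_type == type &&
            RTA_PAYLOAD(rta) >= (int)sizeof(*out)) {
            memcpy(out, RTA_DATA(rta), sizeof(*out));
            return 1;
        }
    }
    return 0;
}
```

An app that understands nexthop objects would call find_u32_attr() with
the RTA_NHID value from its kernel headers on the attributes following
the rtmsg header; one that does not simply never looks for it.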


Key Features
============
1. Atomic replace of the configuration data for any nexthop object - a
standalone nexthop or a group. This allows existing route entries to
have their nexthop config updated without the overhead of removing and
re-inserting (or replacing) the routes individually. Instead, one update
of the nexthop object implicitly updates all routes referencing it.

One limitation with the atomic replace is that a nexthop group can only
be replaced with a new group spec, and similarly a single nexthop can
only be replaced by a single nexthop spec. Specifically, a nexthop id
can not move between a single nexthop and a group nexthop except by
delete and add.

2. Blackhole nexthop: a nexthop object can be designated a blackhole
which means any lookups that resolve to it fail with the result
RTN_BLACKHOLE. Blackhole nexthops can be used with nexthop groups but
only as the sole nexthop. Combined with atomic replace this allows
routes to be installed pointing to a blackhole nexthop or group and then
switched to an actual gateway or multipath nexthop with a single replace
command (or vice versa, a gateway/device nexthop can be flipped to a
blackhole).

3. Nexthop groups for multipath routes. A nexthop group is a nexthop
that references other nexthops with a weight for weighted multipath. A
multipath group cannot be used as a nexthop in another nexthop group
(i.e., groups cannot be nested).

4. Multipath routes for IPv6 with device only nexthops. There is a
demonstrated need for this feature and the existing route semantics do
not allow it because of mistakes in the past implementation of multipath
routes. This series provides a means for that end - create a nexthop
that has a device only specification.

5. IPv6 nexthops with IPv4 routes for users wanting support of RFC 5549.
This feature is enabled natively (without nexthop objects) as a result
of the heavy refactoring of fib_nh and fib6_nh into a common nexthop.

6. Dramatic reduction in time to install routes in the kernel, most
notably with increasing number of legs in a multipath route. Formal data
will be presented with the nexthop commits.

7. Lower memory footprint for IPv6

While individual data structures show a minor increase in size

                      old   new
                      ===   ===
      fib_nh_common     -    72
      IPv4
        fib_nh        104   120
        fib_info      104   128*
        rtable        160   176

      IPv6
        fib6_nh        48    96
        fib6_info     224   160*
        rt6_info      224   224

        [*] with 0 nexthops; for 1 fib{6}_nh add the respective cost

the *effective* allocation sizes for fib{6}_info have not changed. The
data structure increase is due to a combination of factors: nexthop
reference and list_head tracking of fib entries along with IPv6 address
in IPv4 structures.

While IPv4 has consolidated similar nexthop data into single fib_info
instances that are referenced by multiple fib entries, IPv6 does not.
For IPv6 the nexthop data is repeated for each route. This means with
nexthop objects the memory overhead of IPv6 fib entries drops
significantly - especially with multipath routes.

8. Future extensions
I believe the code is set up to allow a future extension where apps can
pass a flag that effectively says "I understand nexthop objects - don't
expand them in the route dump" and for a sysctl knob for notifiers to do
the same once all apps running in the control plane are known to
understand nexthop objects. Together this means less data going from
kernel to userspace and less processing by userspace.
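
The constraints from features 1-3 above (what a replace may change, where
a blackhole may appear in a group, no nested groups) can be written down
as two small checks. This is a userspace sketch with made-up names, not
the validation code from the patches. Treating a blackhole as a kind of
single nexthop for replace purposes follows from the text allowing a
gateway/device nexthop to be flipped to a blackhole; rejecting an empty
group is my assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a nexthop object; not a kernel structure. */
enum nh_kind { NH_SINGLE, NH_BLACKHOLE, NH_GROUP };

struct nh_model {
    uint32_t id;
    enum nh_kind kind;
};

/*
 * Replace rule: a group can only be replaced by a group spec and a
 * single nexthop only by a single nexthop spec. Moving an id between
 * the two requires delete and add.
 */
static bool replace_allowed(enum nh_kind old_kind, enum nh_kind new_kind)
{
    return (old_kind == NH_GROUP) == (new_kind == NH_GROUP);
}

/*
 * Group rules: members may not be groups themselves, and a blackhole
 * nexthop is only allowed as the sole member.
 */
static bool group_members_valid(const struct nh_model *members, size_t n)
{
    size_t i;

    if (n == 0)
        return false;   /* assumption: an empty group is rejected */

    for (i = 0; i < n; i++) {
        if (members[i].kind == NH_GROUP)
            return false;   /* no nesting */
        if (members[i].kind == NH_BLACKHOLE && n > 1)
            return false;   /* blackhole must be alone */
    }
    return true;
}
```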

Examples
========
1. Single path
    $ ip nexthop add id 1 via 10.99.1.2 dev veth1
    $ ip route add 10.1.1.0/24 nhid 1

    $ ip next ls
    id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link

    $ ip ro ls
    10.1.1.0/24 nhid 1 scope link
    ...

2. ECMP
    $ ip nexthop add id 2 via 10.99.3.2 dev veth3
    $ ip nexthop add id 1001 group 1/2
      --> creates a nexthop group with 2 component nexthops:
          id 1 and id 2, both with the same weight

    $ ip route add 10.1.2.0/24 nhid 1001

    $ ip next ls
    id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
    id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
    id 1001 group 1/2

    $ ip ro ls
    10.1.1.0/24 nhid 1 scope link
    10.1.2.0/24 nhid 1001 scope link
    ...

3. Weighted multipath
    $ ip nexthop add id 1002 group 1,10/2,20
      --> creates a nexthop group with 2 component nexthops:
          id 1 with a weight of 10 and id 2 with a weight of 20

    $ ip route add 10.1.3.0/24 nhid 1002

    $ ip next ls
    id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
    id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
    id 1001 group 1/2
    id 1002 group 1,10/2,20

    $ ip ro ls
    10.1.1.0/24 nhid 1 scope link
    10.1.2.0/24 nhid 1001 scope link
    10.1.3.0/24 nhid 1002 scope link
    ...
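
For readers who want to play with the group syntax used in examples 2
and 3, here is a small sketch that parses a spec such as "1/2" or
"1,10/2,20" into id/weight pairs and then maps a flow hash onto a member
in proportion to the weights. All names are made up. The default weight
of 1 mirrors "both the same weight" above; rejecting a weight of 0 is my
assumption. The selection scheme is just one simple way to honor the
weights and is not the kernel's multipath selection code.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_MEMBERS 32   /* arbitrary limit for this sketch */

struct grp_member {
    uint32_t id;
    uint32_t weight;
};

/*
 * Parse "id[,weight]/id[,weight]/..." into 'out'. A missing weight
 * defaults to 1. Returns the number of members, or -1 on malformed input.
 */
static int parse_group(const char *spec, struct grp_member *out, int max)
{
    const char *p = spec;
    int n = 0;

    while (*p) {
        unsigned long id, weight = 1;
        char *end;

        if (n == max || !isdigit((unsigned char)*p))
            return -1;
        id = strtoul(p, &end, 10);
        p = end;

        if (*p == ',') {
            p++;
            if (!isdigit((unsigned char)*p))
                return -1;
            weight = strtoul(p, &end, 10);
            if (weight == 0)    /* assumption: weight 0 is invalid */
                return -1;
            p = end;
        }

        out[n].id = (uint32_t)id;
        out[n].weight = (uint32_t)weight;
        n++;

        if (*p == '/') {
            p++;
            if (!*p)            /* trailing '/' */
                return -1;
        } else if (*p) {
            return -1;          /* unexpected character */
        }
    }
    return n > 0 ? n : -1;
}

/*
 * Pick a member id for a flow hash so that each member receives a share
 * of hash values proportional to its weight. Requires n >= 1 and all
 * weights >= 1, which parse_group() guarantees.
 */
static uint32_t pick_member(const struct grp_member *m, int n, uint32_t hash)
{
    uint32_t total = 0, slot;
    int i;

    for (i = 0; i < n; i++)
        total += m[i].weight;

    slot = hash % total;
    for (i = 0; i < n; i++) {
        if (slot < m[i].weight)
            return m[i].id;
        slot -= m[i].weight;
    }
    return m[n - 1].id;     /* not reached */
}
```

With the spec from example 3, weights 10 and 20 mean that under this
scheme id 1 gets one third of the hash space and id 2 gets two thirds.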

Acknowledgements
================
- test writers - especially Stefano Brivio. The pmtu script has been
invaluable in verifying changes to the exception code (and it exposed a
few other bugs).

- kbuild robot for catching compile errors through the maze of config
options and Dan Carpenter's tools for catching a NULL vs IS_ERR check.


v2
- a few changes to the uapi - most notably, the requirement to have a v4
or v6 version of all single nexthops. A group is done as AF_UNSPEC, but
individual nexthops must be AF_INET or AF_INET6. This is forced by the
cached (per cpu) routes and exceptions, and the expectation in too many
places that a fib{6}_nh exists.

- backwards compatibility - route notifications changed to include
nexthop data such that legacy apps are not impacted. This drove much of
the refactoring towards a fib_nh_common but that also enabled the IPv6
gateways with IPv4 routes with trivial additional changes.