Message-Id: <20180901004954.7145-1-dsahern@kernel.org>
Date: Fri, 31 Aug 2018 17:49:35 -0700
From: dsahern@...nel.org
To: netdev@...r.kernel.org
Cc: roopa@...ulusnetworks.com, sharpd@...ulusnetworks.com,
idosch@...lanox.com, davem@...emloft.net,
David Ahern <dsahern@...il.com>
Subject: [PATCH RFC net-next 00/18] net: Improve route scalability via support for nexthop objects
From: David Ahern <dsahern@...il.com>
As mentioned at netconf in Seoul, we would like to introduce nexthops as
independent objects from the routes to better align with both routing
daemons and hardware and to improve route insertion times into the kernel.
This series adds nexthop objects with their own lifecycle. The model
retains a lot of the established semantics from routes and re-uses some
of the data structures like fib_nh and fib6_nh to more easily align with
the existing code. One difference with nexthop objects is that the behavior
better aligns with the target users: routing daemons and switch ASICs.
Specifically, with the exception of the blackhole nexthop, all nexthops
must reference a netdevice (or have a gateway that resolves to a device)
and the device must be admin up with carrier.
Prefixes are then installed pointing to the nexthop by id:
{ prefix } --> { nexthop } --> { gateway, device }
The nexthop object contains the gateway and device reference.
Benchmarks
The following data shows the route insert time for 720,022 routes (a full
IPv4 internet feed from August 28th). "current" means the current code
where a route insert specifies the device and gateway inline with the
prefix; the "nexthop" columns mean use of the nexthop objects.
        1-hop       1-hop      |  2-hops      2-hops
        current     nexthop    |  current     nexthop
-------------------------------|-------------------------
real    0m21.872s   0m12.982s  |  0m28.723s   0m12.406s
user    0m2.929s    0m1.816s   |  0m3.966s    0m1.935s
sys     0m13.469s   0m6.010s   |  0m18.992s   0m5.913s
With nexthop objects the time to insert the routes is reduced by about
40% (21.9s to 13.0s) with the kernel time cut by more than half. The
current model has a route insertion rate of about 33,000 prefixes/second
and with nexthop objects that increases to a little over 55,000
prefixes/second.
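As a rough cross-check of the figures above, the insertion rates and reductions can be recomputed from the table. This is only a sketch: the timings are copied from the table, and the helper names are made up for illustration.

```python
# Timings copied from the benchmark table ("real" and "sys", in seconds).
ROUTES = 720_022

timings = {
    "1-hop current": {"real": 21.872, "sys": 13.469},
    "1-hop nexthop": {"real": 12.982, "sys": 6.010},
    "2-hops current": {"real": 28.723, "sys": 18.992},
    "2-hops nexthop": {"real": 12.406, "sys": 5.913},
}


def insert_rate(real_seconds):
    """Prefixes inserted per second of wall-clock time."""
    return ROUTES / real_seconds


def reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return 1.0 - after / before


for name, t in timings.items():
    print(f"{name:15s} {insert_rate(t['real']):8.0f} prefixes/s")

one_hop_real = reduction(timings["1-hop current"]["real"],
                         timings["1-hop nexthop"]["real"])
one_hop_sys = reduction(timings["1-hop current"]["sys"],
                        timings["1-hop nexthop"]["sys"])
two_hop_sys_factor = (timings["2-hops current"]["sys"]
                      / timings["2-hops nexthop"]["sys"])
print(f"1-hop real reduced by {one_hop_real:.0%}, sys by {one_hop_sys:.0%}")
print(f"2-hops sys reduced by a factor of {two_hop_sys_factor:.1f}")
```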
For routes with multiple nexthops the install time is cut by more than
half with system time reduced by a factor of 3. Further, with nexthop
objects insert times for multipath routes drop to the same as single
path routes since the multipath spec is given once (i.e., with the
current model the time to insert routes increases with the number of
paths in the route, whereas with nexthop objects the paths are handled
once and the prefixes referencing the group are installed in constant
time).
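One way to see why the multipath cost stops growing is to count how many nexthop specifications have to be handled in each model. The count below is an illustrative model only, not a measurement, and the function names are hypothetical.

```python
def specs_handled_inline(num_routes, num_paths):
    """Current model: every route message carries all of its paths."""
    return num_routes * num_paths


def specs_handled_with_objects(num_routes, num_paths):
    """Nexthop objects: each path is created once, the group lists each
    member once, and every route carries a single nexthop id."""
    create_paths = num_paths
    create_group = num_paths
    route_refs = num_routes
    return create_paths + create_group + route_refs


routes = 720_022
for paths in (1, 2, 8):
    inline = specs_handled_inline(routes, paths)
    objects = specs_handled_with_objects(routes, paths)
    print(f"{paths} path(s): inline={inline:,} objects={objects:,}")
```

In this model the inline count scales with routes times paths, while the object count grows only by a handful of entries as paths are added.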
The difference between real and system times shows there is room for
improvement with the trie implementation. As an example, increasing
sync_pages from 128 to 1024 delays the call to synchronize_rcu, increasing
the insert rate to more than 78,000 prefixes/sec!
Some key features:
1. Allows atomic replace of any nexthop object - a nexthop or a group.
This allows existing route entries to have their nexthop updated
without the overhead of removing and re-inserting (or replacing)
them. Instead, one update of the nexthop object implicitly updates
all routes referencing it.
One limitation with the atomic replace is that a nexthop group can
only be replaced with a new group spec and similarly a nexthop can
only be replaced by a nexthop spec. Specifically, a nexthop id can
not move between a single nexthop and a group nexthop.
2. Blackhole nexthop: a nexthop object can be designated a blackhole,
which means packets for any lookup that resolves to it are dropped as
if the lookup failed with the result RTN_BLACKHOLE. Blackhole nexthops
can not be used with nexthop groups. Combined with atomic replace
this allows routes to be installed pointing to a blackhole nexthop
and then switched to an actual gateway with a single nexthop replace
command (or vice versa, a gateway nexthop is flipped to a blackhole).
3. Nexthop groups for multipath routes. A nexthop group is a nexthop
that references other nexthops. A nexthop group can not be used
as a nexthop in another nexthop group (i.e., groups can not be nested).
4. Multipath routes for IPv6 with device only nexthops. There is a
demonstrated need for this feature and the existing route semantics
do not allow it. This series provides a means for that end - create a
nexthop that has a device only specification.
5. Admin and carrier up are required. If the device goes down (admin or
carrier) the nexthop is removed, routes referencing the nexthop are
evicted, and any nexthop groups referencing it are adjusted.
6. Follow on patches will allow IPv6 nexthops with IPv4 routes for users
wanting support of RFC 5549.
7. Future extensions: active / backup nexthop. The nexthop groups are
structured to allow a new group type to be added. One example is a
group where a nexthop has a preferred device and gateway, but should
the device go down or the gateway not resolve, the backup nexthop is
used.
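The replace semantics and the constraints listed above (an id can not change kind, groups can not be nested, blackholes can not be group members) can be sketched with a toy model. This is a minimal illustration under those stated rules; every name in it is hypothetical and none of it is kernel or iproute2 code.

```python
# Toy model of nexthop objects. All names are hypothetical.

def make_single(gateway=None, dev=None, blackhole=False):
    """Describe a single nexthop; only a blackhole may lack gateway and device."""
    if not blackhole and gateway is None and dev is None:
        raise ValueError("nexthop needs a gateway or a device")
    return {"kind": "single", "gateway": gateway, "dev": dev,
            "blackhole": blackhole}


def make_group(nexthops, member_ids):
    """Describe a group; members must be non-blackhole single nexthops."""
    for nhid in member_ids:
        member = nexthops[nhid]
        if member["kind"] == "group":
            raise ValueError("groups can not be nested")
        if member["blackhole"]:
            raise ValueError("blackhole nexthops can not be used in groups")
    return {"kind": "group", "members": list(member_ids)}


def replace(nexthops, nhid, new):
    """Swap the object behind an id; the id may not change kind."""
    if nexthops[nhid]["kind"] != new["kind"]:
        raise ValueError("id can not move between a nexthop and a group")
    nexthops[nhid] = new  # one update; no route entry is touched


def resolve(nexthops, routes, prefix):
    """Return the single nexthops a prefix currently points at."""
    nh = nexthops[routes[prefix]]
    if nh["kind"] == "group":
        return [nexthops[m] for m in nh["members"]]
    return [nh]


# Two routes start out on a blackhole, then one replace flips both to a gateway.
nexthops = {1: make_single(blackhole=True)}
routes = {"10.1.1.0/24": 1, "10.1.2.0/24": 1}
before = resolve(nexthops, routes, "10.1.1.0/24")[0]["blackhole"]
replace(nexthops, 1, make_single(gateway="10.99.1.2", dev="veth1"))
after = [resolve(nexthops, routes, p)[0]["gateway"] for p in routes]
print(before, after)
```

The routes table is never modified by the replace, which is the point of the indirection: the cost of the update does not depend on how many prefixes reference the id.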
Additional Benefits
- smaller route notifications - messages contain a single nexthop id versus
the detailed nexthop specification. This is especially noticeable as the
number of paths increases. Smaller messages have a reduced load on
userspace as well.
- smaller memory footprint for IPv6 routes.
Examples
1. Single path
$ ip nexthop add id 1 via 10.99.1.2 dev veth1
$ ip route add 10.1.1.0/24 nhid 1
$ ip next ls
id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
$ ip ro ls
10.1.1.0/24 nhid 1 scope link
...
2. ECMP
$ ip nexthop add id 2 via 10.99.3.2 dev veth3
$ ip nexthop add id 1001 group 1/2
--> creates a nexthop group with 2 component nexthops:
id 1 and id 2 both the same weight
$ ip route add 10.1.2.0/24 nhid 1001
$ ip next ls
id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
id 1001 group 1/2
$ ip ro ls
10.1.1.0/24 nhid 1 scope link
10.1.2.0/24 nhid 1001 scope link
...
3. Weighted multipath
$ ip nexthop add id 1002 group 1,10/2,20
--> creates a nexthop group with 2 component nexthops:
id 1 with a weight of 10 and id 2 with a weight of 20
$ ip route add 10.1.3.0/24 nhid 1002
$ ip next ls
id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
id 1001 group 1/2
id 1002 group 1,10/2,20
$ ip ro ls
10.1.1.0/24 nhid 1 scope link
10.1.2.0/24 nhid 1001 scope link
10.1.3.0/24 nhid 1002 scope link
...
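The group syntax used in the ECMP and weighted examples can be read as `id[,weight]` entries separated by `/`. The sketch below parses that form and shows the resulting traffic split. It assumes a default weight of 1 when none is given (the examples only say the members have the same weight), and the hash-slot selection is a simple stand-in, not the kernel's path selection.

```python
def parse_group(spec):
    """Parse 'id[,weight]/id[,weight]/...' into a list of (id, weight)."""
    members = []
    for part in spec.split("/"):
        fields = part.split(",")
        nhid = int(fields[0])
        # Assumption: an omitted weight is treated as 1.
        weight = int(fields[1]) if len(fields) > 1 else 1
        members.append((nhid, weight))
    return members


def traffic_share(members):
    """Fraction of flows each member would receive, by weight."""
    total = sum(weight for _, weight in members)
    return {nhid: weight / total for nhid, weight in members}


def pick(members, flow_hash):
    """Map a flow hash onto a member in proportion to its weight."""
    total = sum(weight for _, weight in members)
    slot = flow_hash % total
    for nhid, weight in members:
        if slot < weight:
            return nhid
        slot -= weight


ecmp = parse_group("1/2")
weighted = parse_group("1,10/2,20")
print(ecmp, traffic_share(ecmp))
print(weighted, traffic_share(weighted))
```

With the weighted spec, id 2 carries twice the share of id 1, matching the 10 and 20 weights in example 3.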
Open Items
There is a long to-do list before this is ready (e.g., IPv6 multipath, lwt
encap, and updating mlxsw). The point of this RFC is to get comments on
the API and overall idea. Specifically, any interested parties should
think about the API, the objects, the workflow, how it fits, and the
possibilities for future extensions.
David Ahern (18):
net: Rename net/nexthop.h net/rtnh.h
net: ipv4: export fib_good_nh and fib_flush
net/ipv4: export fib_info_update_nh_saddr
net/ipv4: export fib_check_nh
net/ipv4: Define fib_get_nhs when CONFIG_IP_ROUTE_MULTIPATH is
disabled
net/ipv4: Create init and release helpers for fib_nh
net: ipv4: Add fib_nh to fib_result
net/ipv4: Move device validation to helper
net/ipv6: Create init and release helpers for fib6_nh
net/ipv6: Make fib6_nh optional at the end of fib6_info
net: Initial nexthop code
net/ipv4: Add nexthop helpers for ipv4 integration
net/ipv4: Convert existing use of fib_info to new helpers
net/ipv4: Allow routes to use nexthop objects
net/ipv6: Use helpers to access fib6_nh data
net/ipv6: Allow routes to use nexthop objects
net: Add support for nexthop groups
net/ipv4: Optimization for fib_info lookup
.../net/ethernet/mellanox/mlxsw/spectrum_router.c | 4 +-
drivers/net/ethernet/rocker/rocker_ofdpa.c | 20 +-
include/net/addrconf.h | 5 +
include/net/ip6_fib.h | 22 +-
include/net/ip6_route.h | 12 +-
include/net/ip_fib.h | 39 +-
include/net/net_namespace.h | 2 +
include/net/netns/nexthop.h | 18 +
include/net/nexthop.h | 253 +++-
include/net/rtnh.h | 34 +
include/trace/events/fib6.h | 15 +-
include/uapi/linux/nexthop.h | 56 +
include/uapi/linux/rtnetlink.h | 8 +
net/core/filter.c | 13 +-
net/core/lwtunnel.c | 2 +-
net/decnet/dn_fib.c | 2 +-
net/ipv4/Makefile | 2 +-
net/ipv4/fib_frontend.c | 60 +-
net/ipv4/fib_rules.c | 3 +-
net/ipv4/fib_semantics.c | 433 ++++--
net/ipv4/fib_trie.c | 54 +-
net/ipv4/ipmr.c | 2 +-
net/ipv4/nexthop.c | 1541 ++++++++++++++++++++
net/ipv4/route.c | 34 +-
net/ipv6/addrconf.c | 5 +-
net/ipv6/addrconf_core.c | 9 +
net/ipv6/af_inet6.c | 1 +
net/ipv6/ip6_fib.c | 27 +-
net/ipv6/ndisc.c | 15 +-
net/ipv6/route.c | 474 +++---
net/mpls/af_mpls.c | 2 +-
security/selinux/nlmsgtab.c | 5 +-
32 files changed, 2690 insertions(+), 482 deletions(-)
create mode 100644 include/net/netns/nexthop.h
create mode 100644 include/net/rtnh.h
create mode 100644 include/uapi/linux/nexthop.h
create mode 100644 net/ipv4/nexthop.c
--
2.11.0