Message-ID: <c428aa96-8570-1064-a37b-7b030cfa0a5a@gmail.com>
Date: Sun, 24 Mar 2019 06:56:42 -0600
From: David Ahern <dsahern@...il.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>,
David Miller <davem@...emloft.net>
Cc: netdev@...r.kernel.org, edumazet@...gle.com
Subject: Re: [PATCH net-next] ipv6: Move ipv6 stubs to a separate header file

On 3/23/19 9:55 PM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 09:40:23PM -0400, David Miller wrote:
>> From: David Ahern <dsahern@...nel.org>
>> Date: Fri, 22 Mar 2019 06:06:09 -0700
>>
>>> From: David Ahern <dsahern@...il.com>
>>>
>>> The number of stubs is growing and has nothing to do with addrconf.
>>> Move the definition of the stubs to a separate header file and update
>>> users. In the move, drop the vxlan specific comment before ipv6_stub.
>>>
>>> Code move only; no functional change intended.
>>>
>>> Signed-off-by: David Ahern <dsahern@...il.com>
>>
>> Eric, I fully support David's overall plan to make separate nexthop
>> objects as it will significantly empower the stack to do more sensible
>> things when links flap etc.
>
> let's agree to disagree.
> 'link flaps' were not mentioned in the cover letter for:
> "net: Improve route scalability via support for nexthop objects"
>
> The _only_ value of 86 patches is to align linux kernel routing
> with switch ASICs, because cumulus is trying to reuse iproute2
> to manage them.
> It was broken model to begin with and it keeps complicating routing
> when linux is used as a host while not achieving the goal of iproute2
> for switches.
> Can anyone use off the shelf linux to manage trident/tomahawk switches? Nope.
> brcm sdk is still necessary.
> nexthop objects are essential to configure enterprise switches.
> Clearly cumulus customers don't like iproute2 style because it's missing
> this feature, so David's proposal is to add that to the kernel.
> Even after kernel and iproute2 understand nexthop id the kernel is still
> not going to be competitive with switching os. The linux kernel is an OS
> to run on the host cpu and to run on a control plane cpu of a switch.
> That is all great, but the reasons to push routing into the kernel
> of control plane cpu were weak. It's not using these routes.
> Such architecture allowed temporary reuse of bgp daemons, but it fails to scale.
> No need to push route to the kernel when kernel won't use them.
> Hence an alternative proposal:
> - introduce hooks at netlink layer and steal back and forth messages
> from your favorite daemon without populating the kernel
> - same for iproute2 netlink interaction
>

The use case here is not just Cumulus or switchdev, but ANY OS using the
Linux API and the kernel to configure and manage its networking state
[1], and that includes XDP-based use cases [2] and routing on the
host [3]. This is not about iproute2 driving networking deployments. This
is about continuing to remove the 'fails to scale' notion, which *forces*
a NOS architecture away from the kernel databases as the single source
of truth and away from the kernel's IPC/notification mechanisms. That
choice has a subsequent impact: it negates the Linux ecosystem by
forcing all of the software running in the control plane to be
customized to work in some vendor's custom environment. You should read
the paper I wrote last summer [1].

This current patch set is not just about link flaps, but improving the
overall scaling properties of managing the FIB. This is about leveraging
existing ideas about network models and their scalability properties and
bringing that efficiency to Linux. With nexthops, the time to insert
routes is near constant regardless of the number of nexthops in the
route. So the time to insert a single path route and the time to insert
a route with 2, 4, 8, 16, 32, … nexthops is the same. That is a HUGE
scalability improvement from a simple idea. The “near” constant is
because of the need to expand nexthop definitions in the route
notifications to userspace to enable legacy applications to work with
this new API. In time, a lever can be added to skip the expansion and
let RTA_NHID alone identify the nexthop, allowing companies that know
they have no legacy apps needing the expansion to gain the full scaling
improvements.
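
The scaling argument above can be illustrated with a toy Python model.
This is a sketch, not kernel code: the `Fib` class, its methods, and the
`work` counter are hypothetical names invented for illustration, and the
counter is only a stand-in for insert time. A route either carries its
nexthops inline or refers to a preinstalled nexthop group by id, which is
the role RTA_NHID plays.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Nexthop:
    gateway: str
    dev: str


@dataclass
class Fib:
    """Toy FIB; names and structure are illustrative assumptions only."""
    groups: dict = field(default_factory=dict)  # nexthop id -> tuple of Nexthop
    routes: dict = field(default_factory=dict)  # prefix -> nexthop id or inline tuple
    work: int = 0                               # objects processed by route inserts

    def add_group(self, nhid, nexthops):
        # One-time cost, paid once per group rather than once per route.
        self.groups[nhid] = tuple(nexthops)

    def add_route_inline(self, prefix, nexthops):
        # Legacy model: every route insert handles each nexthop again.
        self.routes[prefix] = tuple(nexthops)
        self.work += len(nexthops)

    def add_route_nhid(self, prefix, nhid):
        # Nexthop-object model: the route stores only a reference.
        if nhid not in self.groups:
            raise KeyError(f"unknown nexthop id {nhid}")
        self.routes[prefix] = nhid
        self.work += 1


def insert_cost(num_routes, num_paths, use_nhid):
    """Count the work done by inserting num_routes routes with num_paths paths."""
    hops = [Nexthop(f"10.0.{i}.1", f"eth{i}") for i in range(num_paths)]
    fib = Fib()
    if use_nhid:
        fib.add_group(1, hops)
    for r in range(num_routes):
        prefix = f"192.168.{r}.0/24"
        if use_nhid:
            fib.add_route_nhid(prefix, 1)
        else:
            fib.add_route_inline(prefix, hops)
    return fib.work


for paths in (1, 2, 4, 8, 16, 32):
    print(paths,
          insert_cost(100, paths, use_nhid=False),
          insert_cost(100, paths, use_nhid=True))
```

In this model the inline cost grows linearly with the number of paths,
while the by-id cost stays flat. The model deliberately ignores the
expansion of nexthop definitions in notifications, which is the source of
the "near" in "near constant".
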
This change also enables many other key features:

1. IPv4 multipath routes are not evicted just because 1 hop goes down.
2. IPv6 multipath routes with device only nexthops (e.g., tunnels).
3. IPv6 nexthop with IPv4 route (aka, RFC 5549), which enables a more
   natural BGP unnumbered.
4. Lower memory consumption for IPv6 FIB entries which, unlike IPv4,
   have no sharing at all.
5. Allows atomic update of nexthop definitions with a single replace
   command as opposed to replacing the N routes using them.

The list goes on, but 2-5 of the above were in the cover letter I sent
on March 14.
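
Item 5 in the list can be sketched the same way. The following toy
Python model is an illustration under stated assumptions, not the
kernel's data structures: `replace_inline`, `replace_group`, and
`resolve` are hypothetical helpers, routes are plain dict entries, and
gateways are documentation addresses. It compares rewriting N routes
with replacing one shared nexthop definition.

```python
def replace_inline(routes, new_hops):
    """Legacy model: rewrite every route that carries the nexthops inline."""
    updates = 0
    for prefix in routes:
        routes[prefix] = tuple(new_hops)
        updates += 1
    return updates


def replace_group(groups, nhid, new_hops):
    """Nexthop-object model: one replace of the shared definition."""
    groups[nhid] = tuple(new_hops)
    return 1


def resolve(routes, groups, prefix):
    """Return the nexthops a route currently resolves to."""
    target = routes[prefix]
    return groups[target] if isinstance(target, int) else target


N = 1000
prefixes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(N)]
old = ("192.0.2.1", "192.0.2.2")
new = ("192.0.2.1", "192.0.2.3")

inline_routes = {p: old for p in prefixes}  # each route holds its own copy
groups = {1: old}                           # one shared definition
nhid_routes = {p: 1 for p in prefixes}      # each route holds only the id

inline_updates = replace_inline(inline_routes, new)
group_updates = replace_group(groups, 1, new)

print(inline_updates, group_updates)
```

In the inline model the routes disagree with one another until the last
of the N updates lands. In the by-id model every route resolves to the
new definition as soon as the single replace completes.
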
I have spent a lot of time over the last few years not just working on
features like VRF and MPLS, but improving the scaling properties of
Linux and removing this 'fails to scale' notion you and others hold.
This current patch set is just another step in that path.

[1] https://www.files.netdevconf.org/d/f982086fdd6946d9b596/
[2] http://vger.kernel.org/lpc_net2018_talks/dsa-xdp-kernel-tables-paper.pdf
[3] https://netdevconf.org/1.2/slides/oct7/01_ahern_microservice_net_vrf_on_host.pdf