netdev - Re: [PATCH RFC net-next 00/18] net: Improve route scalability via support for nexthop objects

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Wed, 12 Dec 2018 13:27:59 -0700
From:   David Ahern <dsahern@...il.com>
To:     Jan Maria Matejka <jan.matejka@....cz>
Cc:     linux-netdev <netdev@...r.kernel.org>,
        Ondrej Zajicek <santiago@...reenet.org>,
        David Miller <davem@...emloft.net>
Subject: Re: [PATCH RFC net-next 00/18] net: Improve route scalability via
 support for nexthop objects

On 12/11/18 5:52 AM, Jan Maria Matejka wrote:
> Hello!
> 
> On 9/4/18 5:57 PM, David Ahern wrote:
>> On 9/2/18 11:34 AM, David Miller wrote:
>>> From: dsahern@...nel.org
>>> Date: Fri, 31 Aug 2018 17:49:35 -0700
>>>
>>>> Examples
>>>> 1. Single path
>>>>     $ ip nexthop add id 1 via 10.99.1.2 dev veth1
>>>>     $ ip route add 10.1.1.0/24 nhid 1
>>>>
>>>>     $ ip next ls
>>>>     id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
>>>>
>>>>     $ ip ro ls
>>>>     10.1.1.0/24 nhid 1 scope link
>>>>     ...
>>>
>>> First of all, this whole idea is awesome!  But, you knew that already. :)
>>
>> :-)
> 
> Joining the wow's. This seems that it may save valuable kernel table
> synchronization time in BIRD.
> 
>>> However, I worry what happesn in a mixed environment where we have routing
>>> daemons and tools inserting nexthop based routes, and some doing things
>>> the old way using and expecting inline nexthop information in the routes.
>>>
>>> That mixed environment situation has to function correctly.  Older
>>> apps have to see the per-route nexthop info in the format and layout
>>> they expect (gw/dev pairs).  They cannot be expected to just studdenly
>>> understand the nexthop ID etc.
>>>
>>> Otherwise the concept and ideas are fine, so as long as you can resolve
>>> the mixed environment situation I fully support this work and look forward
>>> to it being in a state where I can integrate it :-)
>>>
>>
>> RTA_NH_ID is on par with other new attriwo parallel commands doing butes (RTA_ENCAP for example) --
>> userspace apps get a route attribute and have no idea what it means
>> until support is added (e.g., it took more than 2 years for libnl to get
>> support for RTA_ENCAP). I take your comment to mean you prefer this new
>> attribute to be treated differently -- assume apps are clueless unless
>> they indicate otherwise. Given the number of ioctl based apps that might
>> be the better option for this case.
>>
>> I can add an attribute for apps to specify 'hey, I understand nexthops'
>> on dump and get requests (per-app flag), and then I can add a sysctl
>> that controls whether the nexthop spec is included. The sysctl would be
>> for notifications and a global option for dumps/gets. Users who know
>> their OS is safe for the short form can set it and get the benefit of
>> smaller messages. While the biggest win here is pushing routes to the
>> kernel faster, there is also a gain with less data from the kernel in
>> route dumps and notifications, especially with multipath environments.
> 
> Personally, I use mixed environments where iproute2 version doesn't match
> kernel version most of the time and I'd like to use both new iproute2 and
> old BIRD or whatever at the same machine. For myself, not knowing exactly

That is a requirement for any solution but with limits. There is always
the problem of an older app running on a new kernel and not
understanding some whiz bang feature that another app configured. An
example of that today is lwt encaps.


> how to implement it, the best way how to interact with the nexthop layer,
> would be a flag set while opening the netlink socket, or maybe another
> RTMGRP_* like RTMGRP_IPV4_NEXTHOP and RTMGRP_IPV6_NEXTHOP. Who asks for
> these, they will get nexthop-ed notifications; who asks for
> RTMGRP_IPV[46]_ROUTE, gets them the old way. Anyway, I may live quite well
> with a system-wide knob in sysctl setting this.

Notifications (e.g, the RTMGRP's) do not allow per-app customizations -
an app registers for the event and gets the message however it is
created by the code implementing the NEW or DEL message.

Yes, I do plan a separate NEXTHOP RTMGRP.

Dave's comment is really about RTM_NEWROUTE and RTM_DELROUTE messages.
The most efficient implementation is to add only the NHID to the message
and skip the rest of the nexthop specs (dev, gw, encap, ...). That means
all apps listening for the notifications understand the abbreviated NHID
reference. A sysctl allows an admin to say 'yes, all apps are converted
and understand this nexthop thingy so use the most efficient message'.
Without it, route notifications can add the new NHID attribute but still
need to add the full nexthop spec.

An app requesting a dump of all entries in the FIB is a different
matter. In this case the requesting app can take the nexthop
efficiencies while processing the initial configuration without
affecting any other processes.

> 
> I'd also like to have documented that the multipath is forbidden to be
> recursive and only one level of depth is allowed. It seems OK for me on
> first sight how it works, just please write it down for future generations
> for them to say "it is OK to fail badly when kernel sends us a recursive
> nexthop group definition"

That is documented, but apps should be robust and never fail badly when
the kernel sends something unknown. That is the basic backwards/forwards
compatibility thing. If you do not understand an attribute in a message
handle it gracefully.

> 
> At last, what I consider important, is a namespace separation by
> nh_protocol. Let's suppose that there is a routing daemon maintaining some
> routes and also an operator who has her own routes. Then the operator does
> a stupid typo while running ip nexthop replace and instead of replacing
> her own nexthop, she replaces the routing daemon's nexthop.
> 
> I suggest that the key in the nexthop table should include nh_protocol
> together with nh_id to split these records. It also seems that there is

That creates a difference between routes and nexthops. With FIB entries
the protocol is not used for route adds / replace. Same problem but now
two different responses from the kernel. Consistency is important.


> only one nexthop table; is this OK wrt. multiple VRF's and so? There is
> also the nh_table parameter in struct nh_config which only validates the
> gateway in a given table. If I look correctly, this nexthop may be then
> used anywhere which is insane from my point of view.

Yes, it should be fine. e.g., consider VRF route leaking.

> 
> BTW, let's consider this:
> 
> # ip nexthop add id 1 via 10.99.0.1 dev veth1
> # ip nexthop add id 2 via 10.99.0.2 dev veth1
> # ip nexthop add id 3 via 10.99.0.3 dev veth1
> 
> and then both commands at the same time:
> 
> # ip nexthop replace id 1 group 2/3

you can not replace a non-group with a group or vice versa. You need to
choose a different id.

> # ip nexthop replace id 2 group 1/3

same problem here, but also remember these are successive operations -
only one change is allowed at a time due to the rtnl lock.

> 
> Is this synchronized and one of these commands fails? (I'm not familiar
> with the code such deeply, sorry if I'm asking for something obvious.)
> 
> Thank you for considering my objections and sorry for reviving such an
> old thread.
> 
> Maria
> BIRD developer

If you have time to add support to BIRD a second implementation helps
ensure we have the right API.