Message-ID: <08c51b7a-0e6d-45b4-81a3-cb3062eb855d@iogearbox.net>
Date: Fri, 13 Jun 2025 10:23:26 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: Ido Schimmel <idosch@...dia.com>, netdev@...r.kernel.org
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
edumazet@...gle.com, horms@...nel.org, donald.hunter@...il.com,
petrm@...dia.com, razor@...ckwall.org
Subject: Re: [PATCH net-next 1/2] neighbor: Add NTF_EXT_VALIDATED flag for
externally validated entries
Hi Ido,

On 6/11/25 4:15 PM, Ido Schimmel wrote:
> tl;dr
> =====
>
> Add a new neighbor flag ("extern_valid") that can be used to indicate to
> the kernel that a neighbor entry was learned and determined to be valid
> externally. The kernel will not try to remove or invalidate such an
> entry, leaving these decisions to the user space control plane. This is
> needed for EVPN multi-homing where a neighbor entry for a multi-homed
> host needs to be synced across all the VTEPs among which the host is
> multi-homed.
>
> Background
> ==========
>
> In a typical EVPN multi-homing setup each host is multi-homed using a
> set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
> switches (VTEPs). VTEPs that are connected to the same ES are called ES
> peers.
>
> When a neighbor entry is learned on a VTEP, it is distributed to both ES
> peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers
> use the neighbor entry when routing traffic towards the multi-homed host
> and remote VTEPs use it for ARP/NS suppression.
>
> Motivation
> ==========
>
> If the ES link between a host and the VTEP on which the neighbor entry
> was locally learned goes down, the EVPN MAC/IP advertisement route will
> be withdrawn and the neighbor entries will be removed from both ES peers
> and remote VTEPs. Routing towards the multi-homed host and ARP/NS
> suppression can fail until another ES peer locally learns the neighbor
> entry and distributes it via an EVPN MAC/IP advertisement route.
>
> "draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these
> intermittent failures by having the ES peers install the neighbor
> entries as before, but also injecting EVPN MAC/IP advertisement routes
> with a proxy indication. When the previously mentioned ES link goes down
> and the original EVPN MAC/IP advertisement route is withdrawn, the ES
> peers will not withdraw their neighbor entries, but instead start aging
> timers for the proxy indication.
>
> If an ES peer locally learns the neighbor entry (i.e., it becomes
> "reachable"), it will restart its aging timer for the entry and emit an
> EVPN MAC/IP advertisement route without a proxy indication. An ES peer
> will stop its aging timer for the proxy indication if it observes the
> removal of the proxy indication from at least one of the ES peers
> advertising the entry.
>
> In the event that the aging timer for the proxy indication expired, an
> ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer
> expired on all ES peers and they all withdrew their proxy
> advertisements, the neighbor entry will be completely removed from the
> EVPN fabric.
>
> Implementation
> ==============
>
> In the above scheme, when the control plane (e.g., FRR) advertises a
> neighbor entry with a proxy indication, it expects the corresponding
> entry in the data plane (i.e., the kernel) to remain valid and not be
> removed due to garbage collection. The control plane also expects the
> kernel to notify it if the entry was learned locally (i.e., became
> "reachable") so that it will remove the proxy indication from the EVPN
> MAC/IP advertisement route. That is why these entries cannot be
> programmed with dummy states such as "permanent" or "noarp".
Meaning that, in contrast to "permanent", the initial user-provided lladdr
can still be updated by the kernel if it learns that there was a
migration, right?
> Instead, add a new neighbor flag ("extern_valid") which indicates that
> the entry was learned and determined to be valid externally and should
> not be removed or invalidated by the kernel. The kernel can probe the
> entry and notify user space when it becomes "reachable". However, if the
> kernel does not receive a confirmation, have it return the entry to the
> "stale" state instead of the "failed" state.
>
> In other words, an entry marked with the "extern_valid" flag behaves
> like any other dynamically learned entry other than the fact that the
> kernel cannot remove or invalidate it.
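To check my reading of the quoted paragraph, here is a toy model of the probe outcome it describes. This is only a sketch, not kernel code: the `Nud` enum and `after_probe()` are hypothetical names made up for illustration, and the model ignores timers and garbage collection entirely.

```python
from enum import Enum


class Nud(Enum):
    """Subset of neighbor states named in the description (illustrative only)."""
    STALE = "stale"
    REACHABLE = "reachable"
    FAILED = "failed"


def after_probe(confirmed: bool, extern_valid: bool) -> Nud:
    """Return the state a probed entry ends up in.

    Hypothetical helper: a confirmed probe makes the entry reachable;
    an unconfirmed probe sends an "extern_valid" entry back to stale,
    while any other dynamically learned entry goes to failed.
    """
    if confirmed:
        return Nud.REACHABLE
    return Nud.STALE if extern_valid else Nud.FAILED
```

The only difference between the two kinds of entries in this model is the unconfirmed-probe branch, which matches the "behaves like any other dynamically learned entry" statement above.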
What is the expected neigh_flush_dev() behavior? I presume that in that case,
if the neigh entry is in use and was NUD_STALE, we go into the NUD_NONE state,
right? (Asking because NUD_PERMANENT skips all of that, and I wonder whether
NTF_EXT_VALIDATED should behave similarly or not.)
> One can argue that the "extern_valid" flag should not prevent garbage
> collection and that instead a neighbor entry should be programmed with
> both the "extern_valid" and "extern_learn" flags. There are two reasons
> for not doing that:
>
> 1. Unclear why a control plane would like to program an entry that the
> kernel cannot invalidate but can completely remove.
>
> 2. The "extern_learn" flag is used by FRR for neighbor entries learned
> on remote VTEPs (for ARP/NS suppression) whereas here we are
> concerned with local entries. This distinction is currently irrelevant
> for the kernel, but might be relevant in the future.
>
> Given that the flag only makes sense when the neighbor has a valid
> state, reject attempts to add a neighbor with an invalid state and with
> this flag set. For example:
>
> # ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid
> Error: Cannot create externally validated neighbor with an invalid state.
> # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
> # ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid
> Error: Cannot mark neighbor as externally validated with an invalid state.
>
> The above means that a neighbor cannot be created with the
> "extern_valid" flag and flags such as "use" or "managed" as they result
> in a neighbor being created with an invalid state ("none") and
> immediately getting probed:
>
> # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
> Error: Cannot create externally validated neighbor with an invalid state.
>
> However, these flags can be used together with "extern_valid" after the
> neighbor was created with a valid state:
>
> # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
> # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
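The acceptance rules shown in the quoted examples can be summarized with a small model. Again this is a sketch under assumptions, not the patch's logic: `check_extern_valid()` is a hypothetical function, treating "incomplete" as invalid is my assumption (the description only names "none" and "failed"), and the "use"/"managed" handling simply encodes the statement that these flags make a new neighbor start in "none".

```python
# States treated as invalid; "incomplete" is an assumption, the quoted
# description only mentions "none" and "failed".
INVALID_STATES = {"none", "failed", "incomplete"}

# Flags that, per the description, cause a new neighbor to be created
# in the "none" state and probed immediately.
PROBING_FLAGS = {"use", "managed"}

ERR_CREATE = "Cannot create externally validated neighbor with an invalid state."
ERR_MARK = "Cannot mark neighbor as externally validated with an invalid state."


def check_extern_valid(state, flags, existing_state=None):
    """Return None if the request is accepted, else the error message.

    state:          requested NUD state, e.g. "stale"
    flags:          set of flag names from the command line
    existing_state: state of an existing entry, or None when creating
    """
    if "extern_valid" not in flags:
        return None
    if existing_state is None:
        # Creating: probing flags force the initial state to "none".
        effective = "none" if flags & PROBING_FLAGS else state
        return ERR_CREATE if effective in INVALID_STATES else None
    # Replacing an existing entry: only the requested state matters here.
    return ERR_MARK if state in INVALID_STATES else None
```

For instance, `check_extern_valid("stale", {"extern_valid", "use"})` is rejected in this model, whereas the same request with `existing_state="stale"` is accepted, mirroring the add/replace pair quoted above.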
>
> One consequence of preventing the kernel from invalidating a neighbor
> entry is that by default it will only try to determine reachability
> using unicast probes. This can be changed using the "mcast_resolicit"
> sysctl:
>
> # sysctl net.ipv4.neigh.br0/10.mcast_resolicit
> 0
> # tcpdump -nn -e -i br0.10 -Q out arp &
> # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
> 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
> 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
> 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
> # sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3
> # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
> 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
> 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
> 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
> 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
> 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
> 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
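The probe sequence in the quoted tcpdump output can be reproduced with a few lines. This is a sketch that only mirrors the observed output: `probe_plan()` is a hypothetical helper, and the default of three unicast probes is an assumption taken from the three unicast requests shown in both runs.

```python
def probe_plan(lladdr, ucast_solicit=3, mcast_resolicit=0):
    """Return the destination MACs of the probes, in order.

    Unicast probes to the known lladdr come first, followed by
    mcast_resolicit broadcast probes (none when it is 0).
    """
    broadcast = "ff:ff:ff:ff:ff:ff"
    return [lladdr] * ucast_solicit + [broadcast] * mcast_resolicit
```

With `mcast_resolicit=0` the plan contains only the three unicast requests; with `mcast_resolicit=3` it is followed by three broadcast requests, as in the second capture.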
>
> iproute2 patches can be found here [2].
>
> [1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03
> [2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1
>
> Signed-off-by: Ido Schimmel <idosch@...dia.com>
[...]
Thanks,
Daniel