[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c110dcb5-cfd3-5abd-1533-4f9dc1d45531@gmail.com>
Date: Sun, 17 Jul 2022 11:31:30 +0100
From: Mike Manning <mvrmanning@...il.com>
To: Jan Luebbe <jluebbe@...net.de>, David Ahern <dsahern@...nel.org>,
Robert Shearman <robertshearman@...il.com>,
Andy Roulin <aroulin@...dia.com>
Cc: "David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
regressions@...ts.linux.dev,
linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [REGRESSION] connection timeout with routes to VRF
On 06/07/2022 19:49, Jan Luebbe wrote:
> On Sun, 2022-06-26 at 21:06 +0100, Mike Manning wrote:
> ...
>> Andy Roulin suggested the same fix to the same problem a few weeks back.
>> Let's do it along with a test case in fcnl-test.sh which covers all of
>> these vrf permutations.
>>
> Reverting 3c82a21f4320 would remove isolation between the default and other VRFs
> needed when no VRF route leaking has been configured between these: there may be
> unintended leaking of packets arriving on a device enslaved to an l3mdev due to
> the potential match on an unbound socket.
>
> Thanks for the explanation.
>
> VRF route leaking requires routes to be present for both ingress and egress
> VRFs,
> the testcase shown only has a route from default to red VRF. The implicit return
> path from red to default VRF due to match on unbound socket is no longer
> present.
>
>
> If there is a better configuration that makes this work in the general case
> without a change to the kernel, we'd be happy as well.
>
> In our full setup, the outbound TCP connection (from the default VRF) gets a
> local IP from the interface enslaved to the VRF. Before 3c82a21f4320, this would
> simply work.
>
> How would the return path route from the red VRF to the default VRF look in that
> case?
I am unaware of your topology, can you add a route in the red VRF table
(see 'ip route ls vrf red'), so 'ip route add vrf red <prefix> via
<next-hop>'.
The isolation between default and other VRFs necessary for forwarding
purposes means that running a local process in the default VRF to access
another VRF no longer works since the change made in 2018 that you
identified. So in your example, 'ip vrf exec red nc ...' will work, but
I assume that this is of no use to you.
> Match on unbound socket in all VRFs and not only in the default VRF should be
> possible by setting this option (see
> https://www.kernel.org/doc/Documentation/networking/vrf.txt):
>
>
> Do you mean unbound as in listening socket not bound to an IP with bind()? Or as
> in a socket in the default VRF?
Unbound meaning a socket in the default VRF, as opposed to a a socket
set into a VRF context by binding it to a VRF master interface using
SO_BINDTODEVICE. One must also be able to bind to an appropriate IP
address with bind() regardless of whether the socket is in the default
or another VRF, but that is not relevant here.
> sysctl net.ipv4.tcp_l3mdev_accept=1
>
>
> The sysctl docs sound like this should only apply to listening sockets. In this
> case, we have an unconnected outbound socket.
With this option disabled (by default), any stream socket in a VRF is
only selected for packets in that VRF, this is done in the input path
see e.g. tcp_v4_rcv() for IPv4.
> However, for this to work a change similar to the following is needed (I have
> shown the change to the macro for consistency with above, it is now an inline
> fn):
>
>
> I can also test on master and only used the macro form only because I wasn't
> completely sure how to translate it to the inline function form.
>
> ---
> include/net/inet_hashtables.h | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -300,9 +300,8 @@
> #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif,
> __sdif) \
> (((__sk)->sk_portpair == (__ports)) && \
> ((__sk)->sk_addrpair == (__cookie)) && \
> - (((__sk)->sk_bound_dev_if == (__dif)) || \
> - ((__sk)->sk_bound_dev_if == (__sdif))) && \
> - net_eq(sock_net(__sk), (__net)))
> + net_eq(sock_net(__sk), (__net)) && \
> + inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
> (__sdif)))
> #else /* 32-bit arch */
> #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
> const int __name __deprecated __attribute__((unused))
> @@ -311,9 +310,8 @@
> (((__sk)->sk_portpair == (__ports)) && \
> ((__sk)->sk_daddr == (__saddr)) && \
> ((__sk)->sk_rcv_saddr == (__daddr)) && \
> - (((__sk)->sk_bound_dev_if == (__dif)) || \
> - ((__sk)->sk_bound_dev_if == (__sdif))) && \
> - net_eq(sock_net(__sk), (__net)))
> + net_eq(sock_net(__sk), (__net)) && \
> + inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
> (__sdif)))
> #endif /* 64-bit arch */
>
> /* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need
>
> I can confirm that this gets my testcase working with
> net.ipv4.tcp_l3mdev_accept=1.
I can submit this change to kernel-net (modified for latest code) if
David is ok with this approach. It should not have a significant
performance impact (due to the additional kernel parameter check) for
most use-cases, as typically sdif = 0 for the unbound case. I am not in
a position to carry out any performance testing.
> This is to get the testcase to pass, I will leave it to others to comment on
> the testcase validity in terms of testing forwarding using commands on 1 device.
>
> So a network-namespace-based testcase would be preferred? We used the simple
> setup because it seemed easier to understand.
>
> The series that 3c82a21f4320 is part of were introduced into the kernel in 2018
> by the Vyatta team, who regularly run an extensive test suite for routing
> protocols
> for VRF functionality incl. all combinations of route leaking between default
> and
> other VRFs, so there is no known issue in this regard. I will attempt to reach
> out
> to them so as to advise them of this thread.
>
> Are these testcases public? Perhaps I could use them find a better configuration
> that handles our use-case.
The test automation to bring up network topologies is not public, but
the test cases would not be readily transferable for general use in any
case. I have advised the Vyatta team of this thread.
> Thanks,
>
> Jan
>
Powered by blists - more mailing lists