netdev - Re: [REGRESSION] connection timeout with routes to VRF

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <c7968dd38360ac4c7d7ed2a45ec1af3176a2ce6a.camel@lasnet.de>
Date:   Sun, 17 Jul 2022 21:27:16 +0200
From:   Jan Lübbe <jluebbe@...net.de>
To:     Mike Manning <mvrmanning@...il.com>,
        David Ahern <dsahern@...nel.org>,
        Robert Shearman <robertshearman@...il.com>,
        Andy Roulin <aroulin@...dia.com>
Cc:     "David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
        regressions@...ts.linux.dev,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [REGRESSION] connection timeout with routes to VRF

On Sun, 2022-07-17 at 11:31 +0100, Mike Manning wrote:
> On 06/07/2022 19:49, Jan Luebbe wrote:
> > On Sun, 2022-06-26 at 21:06 +0100, Mike Manning wrote:
> > ...
> > > Andy Roulin suggested the same fix to the same problem a few weeks back.
> > > Let's do it along with a test case in fcnl-test.sh which covers all of
> > > these vrf permutations.
> > > 
> > Reverting 3c82a21f4320 would remove isolation between the default and other VRFs
> > needed when no VRF route leaking has been configured between these: there may be
> > unintended leaking of packets arriving on a device enslaved to an l3mdev due to
> > the potential match on an unbound socket.
> > 
> > Thanks for the explanation.
> > 
> > VRF route leaking requires routes to be present for both ingress and egress
> > VRFs,
> > the testcase shown only has a route from default to red VRF. The implicit return
> > path from red to default VRF due to match on unbound socket is no longer
> > present.
> > 
> > 
> > If there is a better configuration that makes this work in the general case
> > without a change to the kernel, we'd be happy as well.
> > 
> > In our full setup, the outbound TCP connection (from the default VRF) gets a
> > local IP from the interface enslaved to the VRF. Before 3c82a21f4320, this would
> > simply work.
> > 
> > How would the return path route from the red VRF to the default VRF look in that
> > case?
> 
> I am unaware of your topology, can you add a route in the red VRF table
> (see 'ip route ls vrf red'), so 'ip route add vrf red <prefix> via
> <next-hop>'.

With 4.19 (and my workaround), the outbound packets were simply assigned an IP
from the red VRF if the have to leave the local machine. The IPs in the default
VRF are public IPs, which would not be routed back to the same machine. 


I'm not sure if it helps, but I'll try to explain our topology:
https://gist.github.com/jluebbe/001c7b9ba531ad04d7e5c0a58f400967

Were using VRF for the https://freifunk-bs.de/ network. In the diagram above, a
'freifunk' VRF is used to create an overlay network, consisting of Wireguard
tunnels with OSPF and BGP between the concentrators and exists. 

Each of the these servers has the main VRF with the physical network interface
and externally routeable IPs, which is used to transport the Wireguard packets
and management access.

The 'freifunk' VRF encapsulates the overlay traffic, so that it is only routed
to other interfaces assigned to the VRF. Previously, we used policy routing for
this, but VRF has made the setup much easier to understand and debug. :)

> The isolation between default and other VRFs necessary for forwarding
> purposes means that running a local process in the default VRF to access
> another VRF no longer works since the change made in 2018 that you
> identified. So in your example, 'ip vrf exec red nc ...' will work, but
> I assume that this is of no use to you.

I don't actually use forwarding across VRFs via route leaking, but only for
locally originating connections. A simple example is that a 'ssh <IP accessible
via VRF>' just worked for non-root users, which also makes ssh's ProxyJump work
to hosts accessible via the VRF. Both 'ip vrf exec' and binding to the VRF
interface would require root.

For real services (such as bind), we already used 'ip vrf exec ...' and that was
(and is) working fine.

> > Match on unbound socket in all VRFs and not only in the default VRF should be
> > possible by setting this option (see
> > https://www.kernel.org/doc/Documentation/networking/vrf.txt):
> > 
> > 
> > Do you mean unbound as in listening socket not bound to an IP with bind()? Or as
> > in a socket in the default VRF?
> 
> Unbound meaning a socket in the default VRF, as opposed to a a socket
> set into a VRF context by binding it to a VRF master interface using
> SO_BINDTODEVICE. One must also be able to bind to an appropriate IP
> address with bind() regardless of whether the socket is in the default
> or another VRF, but that is not relevant here.

OK.

> > sysctl net.ipv4.tcp_l3mdev_accept=1
> > 
> > 
> > The sysctl docs sound like this should only apply to listening sockets. In this
> > case, we have an unconnected outbound socket.
> 
> With this option disabled (by default), any stream socket in a VRF is
> only selected for packets in that VRF, this is done in the input path
> see e.g. tcp_v4_rcv() for IPv4.
> 

Yes, we've been using this option for a long time to make sshd accessible from
both the default VRF and the red VRF.

> > However, for this to work a change similar to the following is needed (I have
> > shown the change to the macro for consistency with above, it is now an inline
> > fn):
> > 
> > 
> > I can also test on master and only used the macro form only because I wasn't
> > completely sure how to translate it to the inline function form.
> > 
> > ---
> >  include/net/inet_hashtables.h |   10 ++++------
> >  1 file changed, 4 insertions(+), 6 deletions(-)
> > 
> > --- a/include/net/inet_hashtables.h
> > +++ b/include/net/inet_hashtables.h
> > @@ -300,9 +300,8 @@
> >  #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif,
> > __sdif) \
> >         (((__sk)->sk_portpair == (__ports))                     &&      \
> >          ((__sk)->sk_addrpair == (__cookie))                    &&      \
> > -        (((__sk)->sk_bound_dev_if == (__dif))                  ||      \
> > -         ((__sk)->sk_bound_dev_if == (__sdif)))                &&      \
> > -        net_eq(sock_net(__sk), (__net)))
> > +        net_eq(sock_net(__sk), (__net))                        &&      \
> > +        inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
> > (__sdif)))
> >  #else /* 32-bit arch */
> >  #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
> >         const int __name __deprecated __attribute__((unused))
> > @@ -311,9 +310,8 @@
> >         (((__sk)->sk_portpair == (__ports))             &&              \
> >          ((__sk)->sk_daddr      == (__saddr))           &&              \
> >          ((__sk)->sk_rcv_saddr  == (__daddr))           &&              \
> > -        (((__sk)->sk_bound_dev_if == (__dif))          ||              \
> > -         ((__sk)->sk_bound_dev_if == (__sdif)))        &&              \
> > -        net_eq(sock_net(__sk), (__net)))
> > +        net_eq(sock_net(__sk), (__net))                &&              \
> > +        inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
> > (__sdif)))
> >  #endif /* 64-bit arch */
> > 
> >  /* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need
> > 
> > I can confirm that this gets my testcase working with 
> > net.ipv4.tcp_l3mdev_accept=1.
> 
> I can submit this change to kernel-net (modified for latest code) if
> David is ok with this approach. It should not have a significant
> performance impact (due to the additional kernel parameter check) for
> most use-cases, as typically sdif = 0 for the unbound case. I am not in
> a position to carry out any performance testing.

I've tried your patch on our production network and notices an additional
complication I've missed so far: It only fixes v4.

As there is no tcp_l3mdev_accept for v6 as far as I can tell, I've updated my
workaround (see below) and we've been using it since last week without obvious
issues.

> > This is to get the testcase to pass, I will leave it to others to comment on
> > the testcase validity in terms of testing forwarding using commands on 1 device.
> > 
> > So a network-namespace-based testcase would be preferred? We used the simple
> > setup because it seemed easier to understand.
> > 
> > The series that 3c82a21f4320 is part of were introduced into the kernel in 2018
> > by the Vyatta team, who regularly run an extensive test suite for routing
> > protocols for VRF functionality incl. all combinations of route leaking between default
> > and other VRFs, so there is no known issue in this regard. I will attempt to reach
> > out to them so as to advise them of this thread.
> > 
> > Are these testcases public? Perhaps I could use them find a better configuration
> > that handles our use-case.
> 
> The test automation to bring up network topologies is not public, but
> the test cases would not be readily transferable for general use in any
> case. I have advised the Vyatta team of this thread.

Ah, thanks!

My current workaround on top of 5.18.11 (derived from a 3c82a21f4320 revert):

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 81b965953036..1944f8730b42 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -110,8 +110,9 @@ int inet6_hash(struct sock *sk);
 	 ((__sk)->sk_family == AF_INET6)			&&	\
 	 ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr))		&&	\
 	 ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr))	&&	\
-	 (((__sk)->sk_bound_dev_if == (__dif))	||			\
-	  ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
+	 (!(__sk)->sk_bound_dev_if	||				\
+	   ((__sk)->sk_bound_dev_if == (__dif))	||			\
+	   ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
 	 net_eq(sock_net(__sk), (__net)))
 
 #endif /* _INET6_HASHTABLES_H */
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 98e1ec1a14f0..41e7f20d7e51 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -310,8 +310,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
 	(((__sk)->sk_portpair == (__ports))			&&	\
 	 ((__sk)->sk_addrpair == (__cookie))			&&	\
-	 (((__sk)->sk_bound_dev_if == (__dif))			||	\
-	  ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
+	 (!(__sk)->sk_bound_dev_if	||				\
+	   ((__sk)->sk_bound_dev_if == (__dif))			||	\
+	   ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
 	 net_eq(sock_net(__sk), (__net)))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
@@ -321,8 +322,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 	(((__sk)->sk_portpair == (__ports))		&&		\
 	 ((__sk)->sk_daddr	== (__saddr))		&&		\
 	 ((__sk)->sk_rcv_saddr	== (__daddr))		&&		\
-	 (((__sk)->sk_bound_dev_if == (__dif))		||		\
-	  ((__sk)->sk_bound_dev_if == (__sdif)))	&&		\
+	 (!(__sk)->sk_bound_dev_if	||				\
+	   ((__sk)->sk_bound_dev_if == (__dif))		||		\
+	   ((__sk)->sk_bound_dev_if == (__sdif)))	&&		\
 	 net_eq(sock_net(__sk), (__net)))
 #endif /* 64-bit arch */
 

Thanks again,
Jan