Message-ID: <d38916e3-c6d7-d5f4-4815-8877efc50a2a@gmail.com>
Date: Mon, 16 Aug 2021 12:08:00 -0600
From: David Ahern <dsahern@...il.com>
To: Lahav Schlesinger <lschlesinger@...venets.com>,
netdev@...r.kernel.org
Cc: dsahern@...nel.org, davem@...emloft.net, kuba@...nel.org
Subject: Re: [PATCH] vrf: Reset skb conntrack connection on VRF rcv
On 8/15/21 6:00 AM, Lahav Schlesinger wrote:
> To fix the "reverse-NAT" for replies.
>
> When a packet is sent over a VRF, the POST_ROUTING hooks are called
> twice: Once from the VRF interface, and once from the "actual"
> interface the packet will be sent from:
> 1) First SNAT: l3mdev_l3_out() -> vrf_l3_out() -> .. -> vrf_output_direct()
> This causes the POST_ROUTING hooks to run.
> 2) Second SNAT: 'ip_output()' calls POST_ROUTING hooks again.
>
> Similarly for replies, first ip_rcv() calls PRE_ROUTING hooks, and
> second vrf_l3_rcv() calls them again.
>
> As an example, consider the following SNAT rule:
>> iptables -t nat -A POSTROUTING -p udp -m udp --dport 53 -j SNAT --to-source 2.2.2.2 -o vrf_1
>
> In this case sending over a VRF will create 2 conntrack entries.
> The first is from the VRF interface, which performs the IP SNAT.
> The second will run the SNAT, but since the "expected reply" will remain
> the same, conntrack randomizes the source port of the packet:
> e.g. with a socket bound to 1.1.1.1:10000 sending to 3.3.3.3:53, the conntrack
> entries are:
> udp 17 29 src=2.2.2.2 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=61033 packets=0 bytes=0 mark=0 use=1
> udp 17 29 src=1.1.1.1 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=10000 packets=0 bytes=0 mark=0 use=1
>
> i.e. First SNAT IP from 1.1.1.1 --> 2.2.2.2, and second the src port is
> SNAT-ed from 10000 --> 61033.
>
> But when a reply is sent (3.3.3.3:53 -> 2.2.2.2:61033) only the later
> conntrack entry is matched:
> udp 17 29 src=2.2.2.2 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 src=3.3.3.3 dst=2.2.2.2 sport=53 dport=61033 packets=1 bytes=49 mark=0 use=1
> udp 17 28 src=1.1.1.1 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=10000 packets=0 bytes=0 mark=0 use=1
>
> And a "port 61033 unreachable" ICMP packet is sent back.
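For anyone reading the two listings above, here is a minimal Python sketch that splits each entry into its original and reply tuples and reports whether it is still [UNREPLIED]. The helper name parse_entry is hypothetical, and the regex only assumes the exact field layout quoted above; it is not a general conntrack parser.

```python
import re

# Matches one "src=.. dst=.. sport=.. dport=.." group; each entry has two
# of them: the original direction first, then the expected reply.
TUPLE_RE = re.compile(r"src=(\S+) dst=(\S+) sport=(\d+) dport=(\d+)")

def parse_entry(line):
    """Split one conntrack listing line into original/reply tuples."""
    tuples = []
    for m in TUPLE_RE.finditer(line):
        src, dst, sport, dport = m.groups()
        tuples.append((src, int(sport), dst, int(dport)))
    original, reply = tuples
    return {
        "original": original,
        "reply": reply,
        "unreplied": "[UNREPLIED]" in line,
    }

# The two entries as quoted after the reply was sent.
entries = [
    "udp 17 29 src=2.2.2.2 dst=3.3.3.3 sport=10000 dport=53 packets=1 "
    "bytes=68 src=3.3.3.3 dst=2.2.2.2 sport=53 dport=61033 packets=1 "
    "bytes=49 mark=0 use=1",
    "udp 17 28 src=1.1.1.1 dst=3.3.3.3 sport=10000 dport=53 packets=1 "
    "bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=10000 "
    "packets=0 bytes=0 mark=0 use=1",
]

parsed = [parse_entry(e) for e in entries]
for p in parsed:
    state = "unreplied" if p["unreplied"] else "replied"
    print(p["original"], "->", p["reply"], state)
```

Only the entry whose reply tuple ends in port 61033 has seen a reply; the entry that maps back to 1.1.1.1 is still unreplied.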
>
> The issue is that when PRE_ROUTING hooks are called from vrf_l3_rcv(),
> the skb already has a conntrack flow attached to it, which means
> nf_conntrack_in() will not resolve the flow again.
>
> This means only the dest port is "reverse-NATed" (61033 -> 10000) but
> the dest IP remains 2.2.2.2, and since the socket is bound to 1.1.1.1 it's
> not received.
> This can be verified by logging the 4-tuple of the packet in '__udp4_lib_rcv()'.
>
> The fix is then to reset the flow when skb is received on a VRF, to let
> conntrack resolve the flow again (which now will hit the earlier flow).
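The receive path described above can be sketched as a toy model in Python. This is not kernel code: the names Pkt, TABLE, prerouting and receive are hypothetical, the table holds only the two reply tuples from the listing in the commit message, and "resetting the flow" is modelled simply as forgetting the attached entry between the two PRE_ROUTING passes.

```python
from collections import namedtuple

Pkt = namedtuple("Pkt", "src sport dst dport")

# Reply tuple -> (addr, port) to restore as destination on reverse NAT.
# Values come from the conntrack listing in the commit message.
TABLE = {
    Pkt("3.3.3.3", 53, "2.2.2.2", 61033): ("2.2.2.2", 10000),  # later entry
    Pkt("3.3.3.3", 53, "2.2.2.2", 10000): ("1.1.1.1", 10000),  # earlier entry
}

def prerouting(pkt, attached):
    """One PRE_ROUTING pass: resolve a flow only if none is attached."""
    if attached is not None:
        return pkt, attached            # flow already attached, no new lookup
    if pkt not in TABLE:
        return pkt, None                # no matching entry, packet unchanged
    dst, dport = TABLE[pkt]
    return pkt._replace(dst=dst, dport=dport), pkt

def receive(pkt, reset_on_vrf_rcv):
    """Run both PRE_ROUTING passes, optionally resetting the flow between."""
    pkt, flow = prerouting(pkt, None)   # hooks called from ip_rcv()
    if reset_on_vrf_rcv:
        flow = None                     # model of resetting the skb's flow
    pkt, flow = prerouting(pkt, flow)   # hooks called again from vrf_l3_rcv()
    return pkt

reply = Pkt("3.3.3.3", 53, "2.2.2.2", 61033)
before_fix = receive(reply, reset_on_vrf_rcv=False)
after_fix = receive(reply, reset_on_vrf_rcv=True)
print("without reset:", before_fix.dst, before_fix.dport)
print("with reset:   ", after_fix.dst, after_fix.dport)
```

In this model the packet ends up addressed to 2.2.2.2:10000 without the reset and to 1.1.1.1:10000 with it, which matches the behaviour the commit message describes for a socket bound to 1.1.1.1.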
>
> To reproduce (without the fix, running 'bash ./repro.sh' will not print
> "Got pkt_to_nat_port"):
> $ cat run_in_A1.py
> import logging
> logging.getLogger("scapy.runtime").setLevel(logging.ERROR)
> from scapy.all import *
> import argparse
>
> def get_packet_to_send(udp_dst_port, msg_name):
>     return Ether(src='11:22:33:44:55:66', dst=iface_mac)/ \
>            IP(src='3.3.3.3', dst='2.2.2.2')/ \
>            UDP(sport=53, dport=udp_dst_port)/ \
>            Raw(f'{msg_name}\x0012345678901234567890')
>
> parser = argparse.ArgumentParser()
> parser.add_argument('-iface_mac', dest="iface_mac", type=str, required=True,
>                     help="From run_in_A2.py")
> parser.add_argument('-socket_port', dest="socket_port", type=str,
>                     required=True, help="From run_in_A2.py")
> parser.add_argument('-v1_mac', dest="v1_mac", type=str, required=True,
>                     help="From script")
>
> args, _ = parser.parse_known_args()
> iface_mac = args.iface_mac
> socket_port = int(args.socket_port)
> v1_mac = args.v1_mac
>
> print(f'Source port before NAT: {socket_port}')
>
> while True:
>     pkts = sniff(iface='_v0', store=True, count=1, timeout=10)
>     if 0 == len(pkts):
>         print('Something failed, rerun the script :(', flush=True)
>         break
>     pkt = pkts[0]
>     if not pkt.haslayer('UDP'):
>         continue
>
>     pkt_sport = pkt.getlayer('UDP').sport
>     print(f'Source port after NAT: {pkt_sport}', flush=True)
>
>     pkt_to_send = get_packet_to_send(pkt_sport, 'pkt_to_nat_port')
>     sendp(pkt_to_send, '_v0', verbose=False) # Will not be received
>
>     pkt_to_send = get_packet_to_send(socket_port, 'pkt_to_socket_port')
>     sendp(pkt_to_send, '_v0', verbose=False)
>     break
>
> $ cat run_in_A2.py
> import socket
> import netifaces
>
> print(f"{netifaces.ifaddresses('e00000')[netifaces.AF_LINK][0]['addr']}",
> flush=True)
> s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
> str('vrf_1' + '\0').encode('utf-8'))
> s.connect(('3.3.3.3', 53))
> print(f'{s.getsockname()[1]}', flush=True)
> s.settimeout(5)
>
> while True:
>     try:
>         # Periodically send in order to keep the conntrack entry alive.
>         s.send(b'a'*40)
>         resp = s.recvfrom(1024)
>         msg_name = resp[0].decode('utf-8').split('\0')[0]
>         print(f"Got {msg_name}", flush=True)
>     except Exception as e:
>         pass
>
> $ cat repro.sh
> ip netns del A1 2> /dev/null
> ip netns del A2 2> /dev/null
> ip netns add A1
> ip netns add A2
>
> ip -n A1 link add _v0 type veth peer name _v1 netns A2
> ip -n A1 link set _v0 up
>
> ip -n A2 link add e00000 type bond
> ip -n A2 link add lo0 type dummy
> ip -n A2 link add vrf_1 type vrf table 10001
> ip -n A2 link set vrf_1 up
> ip -n A2 link set e00000 master vrf_1
>
> ip -n A2 addr add 1.1.1.1/24 dev e00000
> ip -n A2 link set e00000 up
> ip -n A2 link set _v1 master e00000
> ip -n A2 link set _v1 up
> ip -n A2 link set lo0 up
> ip -n A2 addr add 2.2.2.2/32 dev lo0
>
> ip -n A2 neigh add 1.1.1.10 lladdr 77:77:77:77:77:77 dev e00000
> ip -n A2 route add 3.3.3.3/32 via 1.1.1.10 dev e00000 table 10001
>
> ip netns exec A2 iptables -t nat -A POSTROUTING -p udp -m udp --dport 53 -j \
> SNAT --to-source 2.2.2.2 -o vrf_1
>
> sleep 5
> ip netns exec A2 python3 run_in_A2.py > x &
> XPID=$!
> sleep 5
>
> IFACE_MAC=`sed -n 1p x`
> SOCKET_PORT=`sed -n 2p x`
> V1_MAC=`ip -n A2 link show _v1 | sed -n 2p | awk '{print $2}'`
> ip netns exec A1 python3 run_in_A1.py -iface_mac ${IFACE_MAC} -socket_port \
>     ${SOCKET_PORT} -v1_mac ${V1_MAC}
> sleep 5
>
> kill -9 $XPID
> wait $XPID 2> /dev/null
> ip netns del A1
> ip netns del A2
> tail x -n 2
> rm x
> set +x
>
> Signed-off-by: Lahav Schlesinger <lschlesinger@...venets.com>
> ---
> drivers/net/vrf.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
Thanks for the detailed explanation and use case.
Looks correct to me.
Reviewed-by: David Ahern <dsahern@...nel.org>