[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20211013122843.wxj7jtyzifwng3j4@kgollan-pc>
Date: Wed, 13 Oct 2021 15:28:43 +0300
From: Lahav Schlesinger <lschlesinger@...venets.com>
To: Eugene Crosser <crosser@...rage.org>
Cc: netdev@...r.kernel.org, netfilter-devel@...r.kernel.org,
David Ahern <dsahern@...nel.org>
Subject: Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb
conntrack connection on VRF rcv" breaks expected netfilter behaviour
On Tue, Oct 12, 2021 at 03:28:39PM +0200, Eugene Crosser wrote:
> CAUTION: External E-Mail - Use caution with links and attachments
>
> Date: Tue, 12 Oct 2021 15:28:39 +0200
> From: Eugene Crosser <crosser@...rage.org>
> To: netdev@...r.kernel.org
> Cc: netfilter-devel@...r.kernel.org, Lahav Schlesinger
> <lschlesinger@...venets.com>, David Ahern <dsahern@...nel.org>
> Subject: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb
> conntrack connection on VRF rcv" breaks expected netfilter behaviour
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
> Thunderbird/78.13.0
>
> Hello all,
>
>
> Commit mentioned in the subject was intended to avoid creation of stray
> conntrack entries when input interface is enslaved in a VRF, and thus
> prerouting conntrack hook is called twice: once in the context of the
> original input interface, and once in the context of the VRF interface.
> Solution was to nuke conntrack related data associated with the skb when
> it enters VRF context.
>
>
>
> However this breaks netfilter operation. Imagine a use case when
> conntrack zone must be assigned based on the (original, "real") input
> interface, rather than VRF interface (that can enslave multiple "real"
> interfaces, that would become indistinguishable). One could create
> netfilter rules similar to these:
>
>
>
> chain rawprerouting {
>
> type filter hook prerouting priority raw;
>
> iif realiface1 ct zone set 1 return
>
> iif realiface2 ct zone set 2 return
>
> }
>
>
>
> This works before the mentioned commit, but not after: zone assignment
> is "forgotten", and any subsequent NAT or filtering that is dependent on
> the conntrack zone does not work.
>
>
>
> There is a reproducer script at the bottom of this message that
> demonstrates the difference in behaviour.
>
>
>
> Maybe a better solution for stray conntrack entries would be to
> introduce finer control in netfilter? One possible idea would be to
> implement both "track" and "notrack" targets; then a working
> configuration would look like this:
>
>
>
> chain rawprerouting {
>
> type filter hook prerouting priority raw;
>
> iif realiface1 ct zone set 1 notrack
>
> iif realiface2 ct zone set 2 notrack
>
> iif vrfmaster track
>
> }
>
>
>
> so in the original input interface context, zone is assigned, but
> conntrack processing itself is shortcircuited. When the packet enters
> VRF context, conntracking is reenabled, so one entry is created, in the
> zone assigned at an earlier stage.
>
>
>
> This is just an idea, I don't have enough knowledge to judge how
> workable is it.
>
>
>
> For reference, this is a thread about the issue in netfilter-devel:
> https://marc.info/?t=163310182600001&r=1&w=2
>
>
>
> Thank you,
>
>
>
> Eugene
>
>
>
> ==========
>
> #!/bin/sh
>
>
>
> # This script demonstrates unexpected change of nftables behaviour
>
> # caused by commit 09e856d54bda5f28 ""vrf: Reset skb conntrack
>
> # connection on VRF rcv"
>
> #
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=09e856d54bda5f288ef8437a90ab2b9b3eab83d1
>
> #
>
> # Before the commit, it was possible to assign conntrack zone to a
>
> # packet (or mark it for `notracking`) in the prerouting chanin, raw
>
> # priority, based on the `iif` (interface from which the packet
>
> # arrived).
>
> # After the change, # if the interface is enslaved in a VRF, such
>
> # assignment is lost. Instead, assignment based on the `iif` matching
>
> # the VRF master interface is honored. Thus it is impossible to
>
> # distinguish packets based on the original interface.
>
> #
>
> # This script demonstrates this change of behaviour: conntrack zone 1
>
> # or 2 is assigned depending on the match with the original interface
>
> # or the vrf master interface. It can be observed that conntrack entry
>
> # appears in different zone in the kernel versions before and after
>
> # the commit. Additionaly, the script produces netfilter trace files
>
> # that can be used for debugging the issue.
>
>
>
> IPIN=172.30.30.1
>
> IPOUT=172.30.30.2
>
> PFXL=30
>
>
>
> ip li sh vein >/dev/null 2>&1 && ip li del vein
>
> ip li sh tvrf >/dev/null 2>&1 && ip li del tvrf
>
> nft list table testct >/dev/null 2>&1 && nft delete table testct
>
>
>
> ip li add vein type veth peer veout
>
> ip li add tvrf type vrf table 9876
>
> ip li set veout master tvrf
>
> ip li set vein up
>
> ip li set veout up
>
> ip li set tvrf up
>
> /sbin/sysctl -w net.ipv4.conf.veout.accept_local=1
>
> /sbin/sysctl -w net.ipv4.conf.veout.rp_filter=0
>
> ip addr add $IPIN/$PFXL dev vein
>
> ip addr add $IPOUT/$PFXL dev veout
>
>
>
> nft -f - <<__END__
>
> table testct {
>
> chain rawpre {
>
> type filter hook prerouting priority raw;
>
> iif { veout, tvrf } meta nftrace set 1
>
> iif veout ct zone set 1 return
>
> iif tvrf ct zone set 2 return
>
> notrack
>
> }
>
> chain rawout {
>
> type filter hook output priority raw;
>
> notrack
>
> }
>
> }
>
> __END__
>
>
>
> uname -rv
>
> conntrack -F
>
> stdbuf -o0 nft monitor trace >nftrace.`uname -r`.txt &
>
> monpid=$!
>
> ping -W 1 -c 1 -I vein $IPOUT
>
> conntrack -L
>
> sleep 1
>
> kill -15 $monpid
>
> wait
>
Hi Eugene, I apologie for the late response (was on vacation).
The call to nf_reset_ct() I added was to match the existing call in the
egress flow, which I didn't want to change in order to not break
existing behaviour (which I unintentionally still did :-)).
Seems like any combination of calling nf_reset_ct() will lead to
something breaking. So continuing on what Florian suggested, another
possibility is to make the calls to nf_reset_ct() in both ingress and egress
flow configurable (procfs or new flags to RTM_NEWLINK).
One benefit of this is that disabling nf_reset_ct() on the egress flow will
mean no port SNAT will take place when SNAT rule is installed on a VRF
(as I described in my original commit), which can break applications
that depend on using a specific source port.
I'll need to think about it some more though, I don't have any better solution
right now.
Powered by blists - more mailing lists