[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1518615241.1215.61.camel@cohaesio.com>
Date: Wed, 14 Feb 2018 13:34:13 +0000
From: "Anders K. Pedersen | Cohaesio" <akp@...aesio.com>
To: "weiwan@...gle.com" <weiwan@...gle.com>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"netfilter-devel@...r.kernel.org" <netfilter-devel@...r.kernel.org>
Subject: Re: [bisected] Forwarded packets occasionally has loopback output
interface in Netfilter
On man, 2018-01-22 at 15:54 +0100, Anders K. Pedersen | Cohaesio wrote:
> On tor, 2018-01-11 at 10:18 -0800, Wei Wang wrote:
> > On Thu, Jan 11, 2018 at 9:25 AM, Anders K. Pedersen | Cohaesio
> > <akp@...aesio.com> wrote:
> > > On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio
> > > wrote:
> > > > Hello,
> > > >
> > > > On one of our border routers, Netfilter is occasionally logging
> > > > packets
> > > > with "OUT=lo" (output interface lo) even though the packet should
> > > > be
> > > > going out via a regular interface. This behavior is present on
> > > > Linux
> > > > 4.13.0 to 4.14.9, and a bisection of the problem points to
> > > >
> > > > [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call
> > > > dst_dev_put() properly
> > > >
> > > > as the first bad commit. This commit adds dst_dev_put() calls
> > > > before
> > > > some dst_release() calls, and dst_dev_put() does
> > > >
> > > > dst->dev = dev_net(dst->dev)->loopback_dev;
> > > >
> > > > (among other things), which fits the problem we're seeing.
> > > >
> > > > The essential part of our nftables rule set that shows this
> > > > behavior is
> > > >
> > > > chain forward {
> > > > type filter hook forward priority 0;
> > > >
> > > > meta oif { $internal_interfaces } accept
> > > >
> > > > meta oif lo ip daddr != 127.0.0.0/8 \
> > > > log group 0 snaplen 80 prefix "oif-lo"
> > > > counter
> > > >
> > > > ip saddr { $our_ip_series } \
> > > > flow table acct_out \
> > > > { meta oif . rt nexthop . ip saddr
> > > > timeout 12m counter } \
> > > > accept
> > > >
> > > > log group 0 snaplen 80 prefix "DROP" counter drop
> > > > }
> > > >
> > > > The router only does stateless packet filtering and no
> > > > redirection or
> > > > rewriting of the packets (connection tracking, NAT, ipvs etc. are
> > > > not
> > > > even compiled for this kernel).
> > > >
> > > > As a result of this problem we see packets that should be going
> > > > to an
> > > > internal interface (and thus accepted by the first rule above)
> > > > being
> > > > logged and dropped by the last rule. Some examples:
> > > >
> > > > Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118
> > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF
> > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705
> > > > WINDOW=3295 ACK URGP=0 MARK=0
> > > > Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118
> > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF
> > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705
> > > > WINDOW=3295 ACK URGP=0 MARK=0
> > > >
> > > > Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF
> > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF
> > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > >
> > > > Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF
> > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF
> > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > >
> > > > It also happens for outbound traffic, where the packets are
> > > > logged and
> > > > counted in the acct_out flow table with "meta oif" = "lo", but a
> > > > correct "rt nexthop" - an example:
> > > >
> > > > Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo
> > > > MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25
> > > > DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF
> > > > PROTO=TCP SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901
> > > > WINDOW=12618 ACK URGP=0 MARK=0
> > > >
> > > > # nft list flow table filter acct_out|tr ',' '\n'|grep lo
> > > > flow table acct_out {
> > > > "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter
> > > > packets 1 bytes 40
> > > >
> > > > I don't know if these packets are actually sent out on the
> > > > correct
> > > > outbound interface thanks to the proper nexthop (the MAC=
> > > > information
> > > > in the Netfilter log is from the received packet and thus not
> > > > useful
> > > > here).
> > > >
> > > > I tried running a tcpdump on the lo interface to see if these
> > > > packets
> > > > would show up there, but during the three days I had it running,
> > > > it
> > > > only logged one such packet, while Netfilter logs 20+ outbound
> > > > packets
> > > > every day, and the one packet logged by tcpdump was *not* logged
> > > > by
> > > > Netfilter.
> > >
> > > Further testing of the individual parts of the first bad commit
> > > shows
> > > that the five first additions of the dst_dev_put() call doesn't
> > > trigger
> > > the problem, while the last one does (also without the first five),
> > > so
> > > the problematic part is:
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index 3dee004..d986d80 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1369,6 +1372,7 @@ static bool rt_cache_route(struct fib_nh *nh,
> > > struct rtable *rt)
> > > prev = cmpxchg(p, orig, rt);
> > > if (prev == orig) {
> > > if (orig) {
> > > + dst_dev_put(&orig->dst);
> > > dst_release(&orig->dst);
> > > rt_free(orig);
> > > }
> > >
> >
> > Hi Anders,
> >
> > First of all, the bad commit you pointed to is part of the effort to
> > remove dst garbage collector. And it is expected that certain routing
> > behavior will change after this whole patch set.
> > In your specific case:
> > Specifically for the one line change you pointed above,
> > rt_cache_route() is called to replace an existing cached route with a
> > new one because the existing cached route is already invalid. And the
> > criteria to determine if it is valid is in rt_cache_valid(). If a
> > route becomes invalid, it means the route is either being deleted, or
> > a new route is being inserted to the same table, or there is certain
> > netdev event happening that could make this route unusable.
>
> Hi Wei,
>
> I looked into this and activated debug logging for all route changes
> from the Quagga/zebra routing daemons, and none of the logged packets
> with OUT=lo happened within a minute of a route change for a prefix
> that covered the destination IP in the packet.
>
> Also, some of the logged packets with OUT=lo are for destination IPs
> that are on the subnet of a directly connected interface on the router,
> so there are no dynamic route changes and there haven't been any link
> changes either.
>
> Which other "certain netdev event" could cause this?
>
> > In all
> > cases, a route re-lookup is required. dst_dev_put() here does cleanup
> > work to free up the device refcount and all packets using this route
> > will be dropped. And as you said, you don't really see the packets
> > going out of lo.
>
> I wrote that we don't see the same packets being logged *and* show up
> in a "tcpdump -i lo", but we do see packets in the tcpdump that should
> have been routed out a regular interface - it's just not the same
> packets that netfilter is logging (and it happens less frequently). It
> looks like it depends on how far the packet is in the path through the
> kernel, when dst_dev_put() is called.
I spent some more time testing this in order to understand the scenario
that causes this. It turns out that the packets being dropped due to
OUT=lo only happen for the __mkroute_input() -> rt_set_nexthop() ->
rt_cache_route() -> dst_dev_put() call chain, when rt_cache_valid()
returns false due to rt_is_expired(), which compares rth->rt_genid
with rt_genid_ipv4(dev_net(rth->dst.dev)).
The latter is bumped every time *any* fib route is changed (and in
other cases), which thus invalidates everything in the route cache. On
a router with full BGP feeds that happens quite frequently, and I
believe the scenario we're seeing is:
- Packet A received for destination X is assigned rth R
- Some route in the FIB is added, deleted, or changed, so rt_genid is
bumped
- Packet B received for destination X before packet A has been
forwarded
- __mkroute_input() -> rt_cache_valid() -> rt_is_expired() determines
that rth R is no longer valid
- A new rth S is created and rth R is invalidated via __mkroute_input()
-> rt_set_nexthop() -> rt_cache_route() -> dst_dev_put()
- Packet A now has rth R with dst->dev = dst->output = dst_discard_out
and dev_net(dst->dev)->loopback_dev as it moves through the kernel
forwarding (and Netfilter) logic, so it gets dropped
To avoid this I developed the patch below, which updates rth->rt_genid
to the current value of rt_genid_ipv4(dev_net(rth->dst.dev)), if the
information in it is the same as what would be put into a new rth. This
avoids invalidating the old rth and allocating a new one with the same
information, and packet A from the scenario above is no longer dropped.
The patch has a lot of net_warn_ratelimited() for debugging, but none
of them have been triggered in my testing. I also revived and abused
the in_hit counter to validate that the code path is being hit. After a
few days with the patch there haven't been any more packet drops, and
we have the following statistics:
# rtstat -k entries,in_hit,in_slow_tot,in_slow_mc,in_no_route
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
entries| in_hit|in_slow_|in_slow_|in_no_ro|
| | tot| mc| ute|
1015| 2305460| 1177662| 758764| 1080447|
Would a cleaned up (i.e. without the warnings and counter) version of
this patch be appropriate for submission, or are there cases where this
would break things?
--- linux-4.14.18/include/net/route.h.orig 2017-11-12 19:46:13.000000000 +0100
+++ linux-4.14.18/include/net/route.h 2018-02-11 11:20:34.133788420 +0100
@@ -96,6 +96,7 @@
};
struct rt_cache_stat {
+ unsigned int in_hit;
unsigned int in_slow_tot;
unsigned int in_slow_mc;
unsigned int in_no_route;
--- linux-4.14.18/net/ipv4/route.c.orig 2018-02-11 11:12:41.801093116 +0100
+++ linux-4.14.18/net/ipv4/route.c 2018-02-14 07:27:54.290124737 +0100
@@ -295,7 +295,7 @@
seq_printf(seq,"%08x %08x %08x %08x %08x %08x %08x %08x "
" %08x %08x %08x %08x %08x %08x %08x %08x %08x \n",
dst_entries_get_slow(&ipv4_dst_ops),
- 0, /* st->in_hit */
+ st->in_hit, /* 0, / * st->in_hit */
st->in_slow_tot,
st->in_slow_mc,
st->in_no_route,
@@ -1668,6 +1668,7 @@
{
struct fib_nh_exception *fnhe;
struct rtable *rth;
+ struct fib_nh *nh;
int err;
struct in_device *out_dev;
bool do_cache;
@@ -1713,23 +1714,110 @@
fnhe = find_exception(&FIB_RES_NH(*res), daddr);
if (do_cache) {
+ nh = &FIB_RES_NH(*res);
if (fnhe) {
rth = rcu_dereference(fnhe->fnhe_rth_input);
if (rth && rth->dst.expires &&
time_after(jiffies, rth->dst.expires)) {
- ip_del_fnhe(&FIB_RES_NH(*res), daddr);
+ ip_del_fnhe(nh, daddr);
fnhe = NULL;
} else {
goto rt_cache;
}
}
- rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
+ rth = rcu_dereference(nh->nh_rth_input);
rt_cache:
if (rt_cache_valid(rth)) {
skb_dst_set_noref(skb, &rth->dst);
goto out;
+ } else if (rth &&
+ rth->dst.obsolete == DST_OBSOLETE_FORCE_CHK &&
+ rth->dst.error == 0 &&
+ rth->dst.dev == out_dev->dev &&
+ rth->dst.ops == &ipv4_dst_ops &&
+#ifdef CONFIG_XFRM
+ rth->dst.xfrm == NULL &&
+#endif
+ rth->dst.flags ==
+ ((IN_DEV_CONF_GET(in_dev, NOPOLICY) ?
+ DST_NOPOLICY : 0) |
+ (IN_DEV_CONF_GET(out_dev, NOXFRM) ?
+ DST_NOXFRM : 0)) &&
+ rth->dst.lwtstate == nh->nh_lwtstate &&
+ rth->rt_type == res->type &&
+ rth->rt_table_id ==
+ (res->table ? res->table->tb_id : 0) &&
+ rth->rt_gateway ==
+ ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ?
+ nh->nh_gw : 0) &&
+ rth->rt_uses_gateway ==
+ ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ?
+ 1 : 0)) {
+ rth->rt_genid = rt_genid_ipv4(dev_net(rth->dst.dev));
+ skb_dst_set_noref(skb, &rth->dst);
+ RT_CACHE_STAT_INC(in_hit);
+ if (rth->dst.header_len != 0)
+ net_warn_ratelimited("rth->dst.header_len (%x) != 0\n", rth->dst.header_len);
+ if (rth->dst.trailer_len != 0)
+ net_warn_ratelimited("rth->dst.trailer_len (%x) != 0\n", rth->dst.trailer_len);
+ if (rth->rt_is_input != 1)
+ net_warn_ratelimited("rth->rt_is_input != 1\n");
+ if (rth->rt_flags != 0)
+ net_warn_ratelimited("rth->rt_flags (%x) != 0\n", rth->rt_flags);
+ if (rth->rt_iif != 0)
+ net_warn_ratelimited("rth->rt_iif (%x) != 0\n", rth->rt_iif);
+ if (rth->rt_pmtu != 0)
+ net_warn_ratelimited("rth->rt_pmtu (%x) != 0\n", rth->rt_pmtu);
+ if (rth->dst.output != ip_output)
+ net_warn_ratelimited("rth->dst.output != ip_output\n");
+ if (rth->dst.input != ip_forward)
+ net_warn_ratelimited("rth->dst.input != ip_forward\n");
+ goto out;
+ } else if (rth &&
+ rth->dst.obsolete == DST_OBSOLETE_FORCE_CHK) {
+ if (rth->dst.error != 0)
+ net_warn_ratelimited("rth->dst.error (%x) != 0\n", rth->dst.error);
+ if (rth->dst.dev != out_dev->dev)
+ net_warn_ratelimited("mkroute_input: rth->dst.dev != out_dev->dev\n");
+ if (rth->dst.ops != &ipv4_dst_ops)
+ net_warn_ratelimited("mkroute_input: rth->dst.ops != &ipv4_dst_ops\n");
+#ifdef CONFIG_XFRM
+ if (rth->dst.xfrm != NULL)
+ net_warn_ratelimited("mkroute_input: rth->dst.xfrm != NULL\n");
+#endif
+ if (rth->dst.flags !=
+ ((IN_DEV_CONF_GET(in_dev, NOPOLICY) ?
+ DST_NOPOLICY : 0) |
+ (IN_DEV_CONF_GET(out_dev, NOXFRM) ?
+ DST_NOXFRM : 0)))
+ net_warn_ratelimited("rth->dst.flags (%x) != %x\n",
+ rth->dst.flags,
+ (IN_DEV_CONF_GET(in_dev, NOPOLICY) ?
+ DST_NOPOLICY : 0) |
+ (IN_DEV_CONF_GET(out_dev, NOXFRM) ?
+ DST_NOXFRM : 0));
+ if (rth->rt_type != res->type)
+ net_warn_ratelimited("rth->rt_type (%x) != res->type (%x)\n",
+ rth->rt_type, res->type);
+ if (rth->rt_table_id !=
+ (res->table ? res->table->tb_id : 0))
+ net_warn_ratelimited("rth->rt_table_id (%x) != (%x)\n",
+ rth->rt_table_id, res->table ? res->table->tb_id : 0);
+ if (rth->rt_gateway !=
+ ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ?
+ nh->nh_gw : 0))
+ net_warn_ratelimited("rth->rt_gateway (%x) != (%x)\n",
+ rth->rt_gateway, (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? nh->nh_gw : 0);
+ if (rth->rt_uses_gateway !=
+ ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ?
+ 1 : 0))
+ net_warn_ratelimited("rth->rt_uses_gateway (%x) != (%x)\n",
+ rth->rt_uses_gateway, (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? 1 : 0);
+ if (rth->dst.lwtstate != nh->nh_lwtstate)
+ net_warn_ratelimited("rth->dst.lwtstate (%x) != nh->nh_lwtstate (%x)\n",
+ rth->dst.lwtstate, nh->nh_lwtstate);
}
}
Regards,
Anders K. Pedersen
Powered by blists - more mailing lists