Message-ID: <CAE4R7bB22KZjRFW1N70moF0Zm3biW8YgKzJeFWU+noT8BeSASA@mail.gmail.com>
Date: Tue, 23 Jun 2015 09:30:15 -0700
From: Scott Feldman <sfeldma@...il.com>
To: Andy Gospodarek <gospo@...ulusnetworks.com>
Cc: Netdev <netdev@...r.kernel.org>,
"David S. Miller" <davem@...emloft.net>, ddutt@...ulusnetworks.com,
Alexander Duyck <alexander.duyck@...il.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
"stephen@...workplumber.org" <stephen@...workplumber.org>
Subject: Re: [PATCH net-next 2/2 v6] net: ipv4 sysctl option to ignore routes
when nexthop link is down
On Tue, Jun 23, 2015 at 8:51 AM, Andy Gospodarek
<gospo@...ulusnetworks.com> wrote:
> This feature is only enabled with the new per-interface or ipv4 global
> sysctls called 'ignore_routes_with_linkdown'.
[cut]
checkpatch.pl says:
WARNING: suspect code indent for conditional statements (16, 20)
#293: FILE: net/ipv4/fib_semantics.c:1047:
+ if (fi->fib_nh->nh_flags & RTNH_F_LINKDOWN) {
+ in_dev = __in_dev_get_rcu(fi->fib_nh->nh_dev);
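For reference, here is a minimal userspace sketch of how that hunk reads once the body sits one tab deeper than the 'if'. The stripped-down structs and the dump_flags() wrapper are hypothetical stand-ins so the fragment compiles outside the kernel. Only the flag test mirrors the patch, and the flag values are local to the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the kernel flag bits; values are for this sketch only. */
#define RTNH_F_DEAD	1
#define RTNH_F_LINKDOWN	16

/* Hypothetical, stripped-down versions of the kernel structures. */
struct in_device {
	int ignore_routes_with_linkdown;	/* per-interface sysctl value */
};

struct fib_nh {
	unsigned int nh_flags;
	struct in_device *in_dev;	/* NULL if the device has no IPv4 config */
};

/* Hypothetical helper: flags reported to userspace for one nexthop. */
static unsigned int dump_flags(const struct fib_nh *nh)
{
	unsigned int flags;

	assert(nh != NULL);
	flags = nh->nh_flags;
	if (nh->nh_flags & RTNH_F_LINKDOWN) {
		const struct in_device *in_dev = nh->in_dev;

		/* Body indented one tab past the 'if', as checkpatch expects. */
		if (in_dev && in_dev->ignore_routes_with_linkdown)
			flags |= RTNH_F_DEAD;
	}
	return flags;
}
```

With the sysctl off, or with no in_device attached, a link-down nexthop keeps only the linkdown bit.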
>
> net.ipv4.conf.all.ignore_routes_with_linkdown = 0
> net.ipv4.conf.default.ignore_routes_with_linkdown = 0
> net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
> ...
>
> When the above sysctls are set, the kernel will report to userspace that a route is
> dead and will no longer resolve to this nexthop when performing a fib
> lookup. This will signal to userspace that the route will not be
> selected. The signalling of a RTNH_F_DEAD is only passed to userspace
> if the sysctl is enabled and link is down. This was done as without it
> the netlink listeners would have no idea whether or not a nexthop would
> be selected. The kernel only sets RTNH_F_DEAD internally if the
> interface has IFF_UP cleared.
>
> With the new sysctl set, the following behavior can be observed
> (interface p8p1 is link-down):
>
> default via 10.0.5.2 dev p9p1
> 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
> 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
> 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead linkdown
> 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead linkdown
> 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
> 90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1
> cache
> local 80.0.0.1 dev lo src 80.0.0.1
> cache <local>
> 80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15
> cache
>
> While the route does remain in the table (so it can be modified if
> needed rather than being wiped away as it would be if IFF_UP was
> cleared), the proper next-hop is chosen automatically when the link is
> down. Now interface p8p1 is linked-up:
>
> default via 10.0.5.2 dev p9p1
> 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
> 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
> 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1
> 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1
> 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
> 192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2
> 90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1
> cache
> local 80.0.0.1 dev lo src 80.0.0.1
> cache <local>
> 80.0.0.2 dev p8p1 src 80.0.0.1
> cache
>
> and the output changes to what one would expect.
>
> If the sysctl is not set, the following output would be expected when
> p8p1 is down:
>
> default via 10.0.5.2 dev p9p1
> 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
> 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
> 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 linkdown
> 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 linkdown
> 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
>
> Since the dead flag does not appear, there should be no expectation that
> the kernel would skip using this route due to link being down.
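To make the three outputs above concrete, here is a rough userspace model of the nexthop filter, not the kernel code itself. The struct, the pick_nexthop() helper and the lowest-metric selection are assumptions made for illustration (the real lookup walks the trie and its aliases). The skip conditions follow the fib_trie.c hunk in this patch, and FIB_LOOKUP_IGNORE_LINKSTATE uses the value 2 from the fib_rules.h hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the kernel flag bits used by the sketch. */
#define RTNH_F_DEAD			1
#define RTNH_F_LINKDOWN			16
#define FIB_LOOKUP_IGNORE_LINKSTATE	2

/* Hypothetical flattened view of one candidate route/nexthop. */
struct nexthop {
	const char *dev;
	unsigned int metric;
	unsigned int flags;
	int ignore_linkdown;	/* sysctl value on the nexthop's interface */
};

/* Return the usable nexthop with the lowest metric, or NULL if none. */
static const struct nexthop *pick_nexthop(const struct nexthop *nhs, size_t n,
					  unsigned int fib_flags)
{
	const struct nexthop *best = NULL;
	size_t i;

	assert(nhs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		const struct nexthop *nh = &nhs[i];

		if (nh->flags & RTNH_F_DEAD)
			continue;
		/* Skip link-down nexthops only if the sysctl is set and the
		 * caller did not ask to ignore link state.
		 */
		if (nh->ignore_linkdown &&
		    (nh->flags & RTNH_F_LINKDOWN) &&
		    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
			continue;
		if (!best || nh->metric < best->metric)
			best = nh;
	}
	return best;
}
```

Fed the two 90.0.0.0/24 entries from the example, the model picks p7p1 when the sysctl is set and p8p1 is link-down. It picks p8p1 when the sysctl is clear or when the caller passes FIB_LOOKUP_IGNORE_LINKSTATE.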
>
> v2: Split kernel changes into 2 patches, this actually makes a
> behavioral change if the sysctl is set. Also took suggestion from Alex
> to simplify code by only checking sysctl during fib lookup and
> suggestion from Scott to add a per-interface sysctl.
>
> v3: Code clean-ups to make it more readable and efficient as well as a
> reverse path check fix.
>
> v4: Drop binary sysctl
>
> v5: Whitespace fixups from Dave
>
> v6: Style changes from Dave and checkpatch suggestions
>
> Signed-off-by: Andy Gospodarek <gospo@...ulusnetworks.com>
> Signed-off-by: Dinesh Dutt <ddutt@...ulusnetworks.com>
> Acked-by: Scott Feldman <sfeldma@...il.com>
> ---
> include/linux/inetdevice.h | 3 +++
> include/net/fib_rules.h | 3 ++-
> include/net/ip_fib.h | 16 +++++++++-------
> include/uapi/linux/ip.h | 1 +
> net/ipv4/devinet.c | 2 ++
> net/ipv4/fib_frontend.c | 6 +++---
> net/ipv4/fib_rules.c | 5 +++--
> net/ipv4/fib_semantics.c | 32 +++++++++++++++++++++++++++-----
> net/ipv4/fib_trie.c | 6 ++++++
> net/ipv4/netfilter/ipt_rpfilter.c | 2 +-
> net/ipv4/route.c | 10 +++++-----
> 11 files changed, 62 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
> index 0a21fbe..a4328ce 100644
> --- a/include/linux/inetdevice.h
> +++ b/include/linux/inetdevice.h
> @@ -120,6 +120,9 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev)
> || (!IN_DEV_FORWARD(in_dev) && \
> IN_DEV_ORCONF((in_dev), ACCEPT_REDIRECTS)))
>
> +#define IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) \
> + IN_DEV_CONF_GET((in_dev), IGNORE_ROUTES_WITH_LINKDOWN)
> +
> #define IN_DEV_ARPFILTER(in_dev) IN_DEV_ORCONF((in_dev), ARPFILTER)
> #define IN_DEV_ARP_ACCEPT(in_dev) IN_DEV_ORCONF((in_dev), ARP_ACCEPT)
> #define IN_DEV_ARP_ANNOUNCE(in_dev) IN_DEV_MAXCONF((in_dev), ARP_ANNOUNCE)
> diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
> index 6d67383..903a55e 100644
> --- a/include/net/fib_rules.h
> +++ b/include/net/fib_rules.h
> @@ -36,7 +36,8 @@ struct fib_lookup_arg {
> void *result;
> struct fib_rule *rule;
> int flags;
> -#define FIB_LOOKUP_NOREF 1
> +#define FIB_LOOKUP_NOREF 1
> +#define FIB_LOOKUP_IGNORE_LINKSTATE 2
> };
>
> struct fib_rules_ops {
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index f73d27c..49c142b 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -226,7 +226,7 @@ static inline struct fib_table *fib_new_table(struct net *net, u32 id)
> }
>
> static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
> - struct fib_result *res)
> + struct fib_result *res, unsigned int flags)
> {
> struct fib_table *tb;
> int err = -ENETUNREACH;
> @@ -234,7 +234,7 @@ static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
> rcu_read_lock();
>
> tb = fib_get_table(net, RT_TABLE_MAIN);
> - if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
> + if (tb && !fib_table_lookup(tb, flp, res, flags | FIB_LOOKUP_NOREF))
> err = 0;
>
> rcu_read_unlock();
> @@ -249,16 +249,18 @@ void __net_exit fib4_rules_exit(struct net *net);
> struct fib_table *fib_new_table(struct net *net, u32 id);
> struct fib_table *fib_get_table(struct net *net, u32 id);
>
> -int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res);
> +int __fib_lookup(struct net *net, struct flowi4 *flp,
> + struct fib_result *res, unsigned int flags);
>
> static inline int fib_lookup(struct net *net, struct flowi4 *flp,
> - struct fib_result *res)
> + struct fib_result *res, unsigned int flags)
> {
> struct fib_table *tb;
> int err;
>
> + flags |= FIB_LOOKUP_NOREF;
> if (net->ipv4.fib_has_custom_rules)
> - return __fib_lookup(net, flp, res);
> + return __fib_lookup(net, flp, res, flags);
>
> rcu_read_lock();
>
> @@ -266,11 +268,11 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp,
>
> for (err = 0; !err; err = -ENETUNREACH) {
> tb = rcu_dereference_rtnl(net->ipv4.fib_main);
> - if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
> + if (tb && !fib_table_lookup(tb, flp, res, flags))
> break;
>
> tb = rcu_dereference_rtnl(net->ipv4.fib_default);
> - if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
> + if (tb && !fib_table_lookup(tb, flp, res, flags))
> break;
> }
>
> diff --git a/include/uapi/linux/ip.h b/include/uapi/linux/ip.h
> index 4119594..08f894d 100644
> --- a/include/uapi/linux/ip.h
> +++ b/include/uapi/linux/ip.h
> @@ -164,6 +164,7 @@ enum
> IPV4_DEVCONF_ROUTE_LOCALNET,
> IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL,
> IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL,
> + IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN,
> __IPV4_DEVCONF_MAX
> };
>
> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index 419d23c..7498716 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -2169,6 +2169,8 @@ static struct devinet_sysctl_table {
> "igmpv2_unsolicited_report_interval"),
> DEVINET_SYSCTL_RW_ENTRY(IGMPV3_UNSOLICITED_REPORT_INTERVAL,
> "igmpv3_unsolicited_report_interval"),
> + DEVINET_SYSCTL_RW_ENTRY(IGNORE_ROUTES_WITH_LINKDOWN,
> + "ignore_routes_with_linkdown"),
>
> DEVINET_SYSCTL_FLUSHING_ENTRY(NOXFRM, "disable_xfrm"),
> DEVINET_SYSCTL_FLUSHING_ENTRY(NOPOLICY, "disable_policy"),
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 534eb14..6bbc549 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -280,7 +280,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
> fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
> fl4.flowi4_scope = scope;
> fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
> - if (!fib_lookup(net, &fl4, &res))
> + if (!fib_lookup(net, &fl4, &res, 0))
> return FIB_RES_PREFSRC(net, res);
> } else {
> scope = RT_SCOPE_LINK;
> @@ -319,7 +319,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
> fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
>
> net = dev_net(dev);
> - if (fib_lookup(net, &fl4, &res))
> + if (fib_lookup(net, &fl4, &res, 0))
> goto last_resort;
> if (res.type != RTN_UNICAST &&
> (res.type != RTN_LOCAL || !IN_DEV_ACCEPT_LOCAL(idev)))
> @@ -354,7 +354,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
> fl4.flowi4_oif = dev->ifindex;
>
> ret = 0;
> - if (fib_lookup(net, &fl4, &res) == 0) {
> + if (fib_lookup(net, &fl4, &res, FIB_LOOKUP_IGNORE_LINKSTATE) == 0) {
> if (res.type == RTN_UNICAST)
> ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
> }
> diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
> index 5615198..18123d5 100644
> --- a/net/ipv4/fib_rules.c
> +++ b/net/ipv4/fib_rules.c
> @@ -47,11 +47,12 @@ struct fib4_rule {
> #endif
> };
>
> -int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
> +int __fib_lookup(struct net *net, struct flowi4 *flp,
> + struct fib_result *res, unsigned int flags)
> {
> struct fib_lookup_arg arg = {
> .result = res,
> - .flags = FIB_LOOKUP_NOREF,
> + .flags = flags,
> };
> int err;
>
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index b1b305b..24e7cef 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -623,7 +623,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
> /* It is not necessary, but requires a bit of thinking */
> if (fl4.flowi4_scope < RT_SCOPE_LINK)
> fl4.flowi4_scope = RT_SCOPE_LINK;
> - err = fib_lookup(net, &fl4, &res);
> + err = fib_lookup(net, &fl4, &res,
> + FIB_LOOKUP_IGNORE_LINKSTATE);
> if (err) {
> rcu_read_unlock();
> return err;
> @@ -1035,12 +1036,19 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
> nla_put_in_addr(skb, RTA_PREFSRC, fi->fib_prefsrc))
> goto nla_put_failure;
> if (fi->fib_nhs == 1) {
> + struct in_device *in_dev;
> +
> if (fi->fib_nh->nh_gw &&
> nla_put_in_addr(skb, RTA_GATEWAY, fi->fib_nh->nh_gw))
> goto nla_put_failure;
> if (fi->fib_nh->nh_oif &&
> nla_put_u32(skb, RTA_OIF, fi->fib_nh->nh_oif))
> goto nla_put_failure;
> + if (fi->fib_nh->nh_flags & RTNH_F_LINKDOWN) {
> + in_dev = __in_dev_get_rcu(fi->fib_nh->nh_dev);
> + if (in_dev && IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev))
> + rtm->rtm_flags |= RTNH_F_DEAD;
> + }
> #ifdef CONFIG_IP_ROUTE_CLASSID
> if (fi->fib_nh[0].nh_tclassid &&
> nla_put_u32(skb, RTA_FLOW, fi->fib_nh[0].nh_tclassid))
> @@ -1057,11 +1065,19 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
> goto nla_put_failure;
>
> for_nexthops(fi) {
> + struct in_device *in_dev;
> +
> rtnh = nla_reserve_nohdr(skb, sizeof(*rtnh));
> if (!rtnh)
> goto nla_put_failure;
>
> rtnh->rtnh_flags = nh->nh_flags & 0xFF;
> + if (nh->nh_flags & RTNH_F_LINKDOWN) {
> + in_dev = __in_dev_get_rcu(nh->nh_dev);
> + if (in_dev &&
> + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev))
> + rtnh->rtnh_flags |= RTNH_F_DEAD;
> + }
> rtnh->rtnh_hops = nh->nh_weight - 1;
> rtnh->rtnh_ifindex = nh->nh_oif;
>
> @@ -1310,16 +1326,22 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
> void fib_select_multipath(struct fib_result *res)
> {
> struct fib_info *fi = res->fi;
> + struct in_device *in_dev;
> int w;
>
> spin_lock_bh(&fib_multipath_lock);
> if (fi->fib_power <= 0) {
> int power = 0;
> change_nexthops(fi) {
> - if (!(nexthop_nh->nh_flags & RTNH_F_DEAD)) {
> - power += nexthop_nh->nh_weight;
> - nexthop_nh->nh_power = nexthop_nh->nh_weight;
> - }
> + in_dev = __in_dev_get_rcu(nexthop_nh->nh_dev);
> + if (nexthop_nh->nh_flags & RTNH_F_DEAD)
> + continue;
> + if (in_dev &&
> + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> + nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
> + continue;
> + power += nexthop_nh->nh_weight;
> + nexthop_nh->nh_power = nexthop_nh->nh_weight;
> } endfor_nexthops(fi);
> fi->fib_power = power;
> if (power <= 0) {
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 6c666a9..15d3261 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1412,9 +1412,15 @@ found:
> continue;
> for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
> const struct fib_nh *nh = &fi->fib_nh[nhsel];
> + struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev);
>
> if (nh->nh_flags & RTNH_F_DEAD)
> continue;
> + if (in_dev &&
> + IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
> + nh->nh_flags & RTNH_F_LINKDOWN &&
> + !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
> + continue;
> if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
> continue;
>
> diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
> index 4bfaedf..8618fd1 100644
> --- a/net/ipv4/netfilter/ipt_rpfilter.c
> +++ b/net/ipv4/netfilter/ipt_rpfilter.c
> @@ -40,7 +40,7 @@ static bool rpfilter_lookup_reverse(struct flowi4 *fl4,
> struct net *net = dev_net(dev);
> int ret __maybe_unused;
>
> - if (fib_lookup(net, fl4, &res))
> + if (fib_lookup(net, fl4, &res, FIB_LOOKUP_IGNORE_LINKSTATE))
> return false;
>
> if (res.type != RTN_UNICAST) {
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f605598..d0362a2 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -747,7 +747,7 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
> if (!(n->nud_state & NUD_VALID)) {
> neigh_event_send(n, NULL);
> } else {
> - if (fib_lookup(net, fl4, &res) == 0) {
> + if (fib_lookup(net, fl4, &res, 0) == 0) {
> struct fib_nh *nh = &FIB_RES_NH(res);
>
> update_or_create_fnhe(nh, fl4->daddr, new_gw,
> @@ -975,7 +975,7 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
> return;
>
> rcu_read_lock();
> - if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) {
> + if (fib_lookup(dev_net(dst->dev), fl4, &res, 0) == 0) {
> struct fib_nh *nh = &FIB_RES_NH(res);
>
> update_or_create_fnhe(nh, fl4->daddr, 0, mtu,
> @@ -1186,7 +1186,7 @@ void ip_rt_get_source(u8 *addr, struct sk_buff *skb, struct rtable *rt)
> fl4.flowi4_mark = skb->mark;
>
> rcu_read_lock();
> - if (fib_lookup(dev_net(rt->dst.dev), &fl4, &res) == 0)
> + if (fib_lookup(dev_net(rt->dst.dev), &fl4, &res, 0) == 0)
> src = FIB_RES_PREFSRC(dev_net(rt->dst.dev), res);
> else
> src = inet_select_addr(rt->dst.dev,
> @@ -1716,7 +1716,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
> fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
> fl4.daddr = daddr;
> fl4.saddr = saddr;
> - err = fib_lookup(net, &fl4, &res);
> + err = fib_lookup(net, &fl4, &res, 0);
> if (err != 0) {
> if (!IN_DEV_FORWARD(in_dev))
> err = -EHOSTUNREACH;
> @@ -2123,7 +2123,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
> goto make_route;
> }
>
> - if (fib_lookup(net, fl4, &res)) {
> + if (fib_lookup(net, fl4, &res, 0)) {
> res.fi = NULL;
> res.table = NULL;
> if (fl4->flowi4_oif) {
> --
> 1.9.3
>
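The fib_select_multipath() change is easier to check with the nesting flattened out, so here is a small userspace sketch of the power refill as I read it. The struct and the refill_power() helper are hypothetical and the locking is omitted. The two 'continue' conditions are taken from the hunk above.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the kernel flag bits used by the sketch. */
#define RTNH_F_DEAD	1
#define RTNH_F_LINKDOWN	16

/* Hypothetical reduced multipath nexthop. */
struct mp_nexthop {
	unsigned int flags;
	int weight;
	int power;
	int ignore_linkdown;	/* sysctl value on the nexthop's interface */
};

/* Refill per-nexthop power and return the total, skipping unusable hops. */
static int refill_power(struct mp_nexthop *nhs, size_t n)
{
	int power = 0;
	size_t i;

	assert(nhs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		struct mp_nexthop *nh = &nhs[i];

		if (nh->flags & RTNH_F_DEAD)
			continue;
		if (nh->ignore_linkdown && (nh->flags & RTNH_F_LINKDOWN))
			continue;
		power += nh->weight;
		nh->power = nh->weight;
	}
	return power;
}
```

A link-down nexthop contributes no weight while the sysctl is set on its interface, so traffic is spread over the remaining hops only.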