[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHsH6GtXWmOUGycPU4xJoSVDEGXPb+ziKb88YJvZXchsAm2fWA@mail.gmail.com>
Date: Tue, 11 Jun 2024 10:07:52 -0700
From: Eyal Birger <eyal.birger@...il.com>
To: Antony Antony <antony@...nome.org>
Cc: davem@...emloft.net, dsahern@...nel.org, edumazet@...gle.com,
kuba@...nel.org, pabeni@...hat.com, steffen.klassert@...unet.com,
herbert@...dor.apana.org.au, pablo@...filter.org, paul.wouters@...en.io,
nharold@...gle.com, mcr@...delman.ca, devel@...ux-ipsec.org,
netdev@...r.kernel.org
Subject: Re: [PATCH ipsec-next,v4] xfrm: support sending NAT keepalives in ESP
in UDP states
Hi Antony,
On Mon, Jun 10, 2024 at 10:45 PM Antony Antony <antony@...nome.org> wrote:
>
> Hi Eyal,
>
> On Mon, May 27, 2024 at 08:29:14PM -0700, Eyal Birger wrote:
> > Add the ability to send out RFC-3948 NAT keepalives from the xfrm stack.
> >
> > To use, Userspace sets an XFRM_NAT_KEEPALIVE_INTERVAL integer property when
> > creating XFRM outbound states which denotes the number of seconds between
> > keepalive messages.
> >
> > Keepalive messages are sent from a per net delayed work which iterates over
> > the xfrm states. The logic is guarded by the xfrm state spinlock due to the
> > xfrm state walk iterator.
> >
> > Possible future enhancements:
> >
> > - Adding counters to keep track of sent keepalives.
> > - deduplicate NAT keepalives between states sharing the same nat keepalive
> > parameters.
> > - provisioning hardware offloads for devices capable of implementing this.
> > - revise xfrm state list to use an rcu list in order to avoid running this
> > under spinlock.
> >
> > Suggested-by: Paul Wouters <paul.wouters@...en.io>
> > Signed-off-by: Eyal Birger <eyal.birger@...il.com>
> >
> > ---
> > v4: rebase and explicitly check that keepalive is only configured on
> > outbound SAs
> > v3: add missing ip6_checksum header
> > v2: change xfrm compat to include the new attribute
> > ---
> > include/net/ipv6_stubs.h | 3 +
> > include/net/netns/xfrm.h | 1 +
> > include/net/xfrm.h | 10 ++
> > include/uapi/linux/xfrm.h | 1 +
> > net/ipv6/af_inet6.c | 1 +
> > net/ipv6/xfrm6_policy.c | 7 +
> > net/xfrm/Makefile | 3 +-
> > net/xfrm/xfrm_compat.c | 6 +-
> > net/xfrm/xfrm_nat_keepalive.c | 292 ++++++++++++++++++++++++++++++++++
> > net/xfrm/xfrm_policy.c | 8 +
> > net/xfrm/xfrm_state.c | 17 ++
> > net/xfrm/xfrm_user.c | 15 ++
> > 12 files changed, 361 insertions(+), 3 deletions(-)
> > create mode 100644 net/xfrm/xfrm_nat_keepalive.c
> >
> > diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h
> > index 485c39a89866..11cefd50704d 100644
> > --- a/include/net/ipv6_stubs.h
> > +++ b/include/net/ipv6_stubs.h
> > @@ -9,6 +9,7 @@
> > #include <net/flow.h>
> > #include <net/neighbour.h>
> > #include <net/sock.h>
> > +#include <net/ipv6.h>
> >
> > /* structs from net/ip6_fib.h */
> > struct fib6_info;
> > @@ -72,6 +73,8 @@ struct ipv6_stub {
> > int (*output)(struct net *, struct sock *, struct sk_buff *));
> > struct net_device *(*ipv6_dev_find)(struct net *net, const struct in6_addr *addr,
> > struct net_device *dev);
> > + int (*ip6_xmit)(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
> > + __u32 mark, struct ipv6_txoptions *opt, int tclass, u32 priority);
> > };
> > extern const struct ipv6_stub *ipv6_stub __read_mostly;
> >
> > diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h
> > index 423b52eca908..d489d9250bff 100644
> > --- a/include/net/netns/xfrm.h
> > +++ b/include/net/netns/xfrm.h
> > @@ -83,6 +83,7 @@ struct netns_xfrm {
> >
> > spinlock_t xfrm_policy_lock;
> > struct mutex xfrm_cfg_mutex;
> > + struct delayed_work nat_keepalive_work;
> > };
> >
> > #endif
> > diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> > index 7c9be06f8302..e208907b1a00 100644
> > --- a/include/net/xfrm.h
> > +++ b/include/net/xfrm.h
> > @@ -229,6 +229,10 @@ struct xfrm_state {
> > struct xfrm_encap_tmpl *encap;
> > struct sock __rcu *encap_sk;
> >
> > + /* NAT keepalive */
> > + u32 nat_keepalive_interval; /* seconds */
> > + time64_t nat_keepalive_expiration;
> > +
> > /* Data for care-of address */
> > xfrm_address_t *coaddr;
> >
> > @@ -2200,4 +2204,10 @@ static inline int register_xfrm_state_bpf(void)
> > }
> > #endif
> >
> > +int xfrm_nat_keepalive_init(unsigned short family);
> > +void xfrm_nat_keepalive_fini(unsigned short family);
> > +int xfrm_nat_keepalive_net_init(struct net *net);
> > +int xfrm_nat_keepalive_net_fini(struct net *net);
> > +void xfrm_nat_keepalive_state_updated(struct xfrm_state *x);
> > +
> > #endif /* _NET_XFRM_H */
> > diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
> > index 18ceaba8486e..7744441c8d5f 100644
> > --- a/include/uapi/linux/xfrm.h
> > +++ b/include/uapi/linux/xfrm.h
> > @@ -321,6 +321,7 @@ enum xfrm_attr_type_t {
> > XFRMA_IF_ID, /* __u32 */
> > XFRMA_MTIMER_THRESH, /* __u32 in seconds for input SA */
> > XFRMA_SA_DIR, /* __u8 */
> > + XFRMA_NAT_KEEPALIVE_INTERVAL, /* __u32 in seconds for NAT keepalive */
> > __XFRMA_MAX
> >
> > #define XFRMA_OUTPUT_MARK XFRMA_SET_MARK /* Compatibility */
> > diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> > index 8041dc181bd4..2b893858b9a9 100644
> > --- a/net/ipv6/af_inet6.c
> > +++ b/net/ipv6/af_inet6.c
> > @@ -1060,6 +1060,7 @@ static const struct ipv6_stub ipv6_stub_impl = {
> > .nd_tbl = &nd_tbl,
> > .ipv6_fragment = ip6_fragment,
> > .ipv6_dev_find = ipv6_dev_find,
> > + .ip6_xmit = ip6_xmit,
> > };
> >
> > static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = {
> > diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
> > index 42fb6996b077..f03dbc011e65 100644
> > --- a/net/ipv6/xfrm6_policy.c
> > +++ b/net/ipv6/xfrm6_policy.c
> > @@ -285,8 +285,14 @@ int __init xfrm6_init(void)
> > ret = register_pernet_subsys(&xfrm6_net_ops);
> > if (ret)
> > goto out_protocol;
> > +
> > + ret = xfrm_nat_keepalive_init(AF_INET6);
> > + if (ret)
> > + goto out_nat_keepalive;
> > out:
> > return ret;
> > +out_nat_keepalive:
> > + unregister_pernet_subsys(&xfrm6_net_ops);
> > out_protocol:
> > xfrm6_protocol_fini();
> > out_state:
> > @@ -298,6 +304,7 @@ int __init xfrm6_init(void)
> >
> > void xfrm6_fini(void)
> > {
> > + xfrm_nat_keepalive_fini(AF_INET6);
> > unregister_pernet_subsys(&xfrm6_net_ops);
> > xfrm6_protocol_fini();
> > xfrm6_policy_fini();
> > diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
> > index 547cec77ba03..512e0b2f8514 100644
> > --- a/net/xfrm/Makefile
> > +++ b/net/xfrm/Makefile
> > @@ -13,7 +13,8 @@ endif
> >
> > obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_hash.o \
> > xfrm_input.o xfrm_output.o \
> > - xfrm_sysctl.o xfrm_replay.o xfrm_device.o
> > + xfrm_sysctl.o xfrm_replay.o xfrm_device.o \
> > + xfrm_nat_keepalive.o
> > obj-$(CONFIG_XFRM_STATISTICS) += xfrm_proc.o
> > obj-$(CONFIG_XFRM_ALGO) += xfrm_algo.o
> > obj-$(CONFIG_XFRM_USER) += xfrm_user.o
> > diff --git a/net/xfrm/xfrm_compat.c b/net/xfrm/xfrm_compat.c
> > index 703d4172c7d7..91357ccaf4af 100644
> > --- a/net/xfrm/xfrm_compat.c
> > +++ b/net/xfrm/xfrm_compat.c
> > @@ -131,6 +131,7 @@ static const struct nla_policy compat_policy[XFRMA_MAX+1] = {
> > [XFRMA_IF_ID] = { .type = NLA_U32 },
> > [XFRMA_MTIMER_THRESH] = { .type = NLA_U32 },
> > [XFRMA_SA_DIR] = NLA_POLICY_RANGE(NLA_U8, XFRM_SA_DIR_IN, XFRM_SA_DIR_OUT),
> > + [XFRMA_NAT_KEEPALIVE_INTERVAL] = { .type = NLA_U32 },
> > };
> >
> > static struct nlmsghdr *xfrm_nlmsg_put_compat(struct sk_buff *skb,
> > @@ -280,9 +281,10 @@ static int xfrm_xlate64_attr(struct sk_buff *dst, const struct nlattr *src)
> > case XFRMA_IF_ID:
> > case XFRMA_MTIMER_THRESH:
> > case XFRMA_SA_DIR:
> > + case XFRMA_NAT_KEEPALIVE_INTERVAL:
> > return xfrm_nla_cpy(dst, src, nla_len(src));
> > default:
> > - BUILD_BUG_ON(XFRMA_MAX != XFRMA_SA_DIR);
> > + BUILD_BUG_ON(XFRMA_MAX != XFRMA_NAT_KEEPALIVE_INTERVAL);
> > pr_warn_once("unsupported nla_type %d\n", src->nla_type);
> > return -EOPNOTSUPP;
> > }
> > @@ -437,7 +439,7 @@ static int xfrm_xlate32_attr(void *dst, const struct nlattr *nla,
> > int err;
> >
> > if (type > XFRMA_MAX) {
> > - BUILD_BUG_ON(XFRMA_MAX != XFRMA_SA_DIR);
> > + BUILD_BUG_ON(XFRMA_MAX != XFRMA_NAT_KEEPALIVE_INTERVAL);
> > NL_SET_ERR_MSG(extack, "Bad attribute");
> > return -EOPNOTSUPP;
> > }
> > diff --git a/net/xfrm/xfrm_nat_keepalive.c b/net/xfrm/xfrm_nat_keepalive.c
> > new file mode 100644
> > index 000000000000..82f0a301683f
> > --- /dev/null
> > +++ b/net/xfrm/xfrm_nat_keepalive.c
> > @@ -0,0 +1,292 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * xfrm_nat_keepalive.c
> > + *
> > + * (c) 2024 Eyal Birger <eyal.birger@...il.com>
> > + */
> > +
> > +#include <net/inet_common.h>
> > +#include <net/ip6_checksum.h>
> > +#include <net/xfrm.h>
> > +
> > +static DEFINE_PER_CPU(struct sock *, nat_keepalive_sk_ipv4);
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +static DEFINE_PER_CPU(struct sock *, nat_keepalive_sk_ipv6);
> > +#endif
> > +
> > +struct nat_keepalive {
> > + struct net *net;
> > + u16 family;
> > + xfrm_address_t saddr;
> > + xfrm_address_t daddr;
> > + __be16 encap_sport;
> > + __be16 encap_dport;
> > + __u32 smark;
> > +};
> > +
> > +static void nat_keepalive_init(struct nat_keepalive *ka, struct xfrm_state *x)
> > +{
> > + ka->net = xs_net(x);
> > + ka->family = x->props.family;
> > + ka->saddr = x->props.saddr;
> > + ka->daddr = x->id.daddr;
> > + ka->encap_sport = x->encap->encap_sport;
> > + ka->encap_dport = x->encap->encap_dport;
> > + ka->smark = xfrm_smark_get(0, x);
> > +}
> > +
> > +static int nat_keepalive_send_ipv4(struct sk_buff *skb,
> > + struct nat_keepalive *ka)
> > +{
> > + struct net *net = ka->net;
> > + struct flowi4 fl4;
> > + struct rtable *rt;
> > + struct sock *sk;
> > + __u8 tos = 0;
> > + int err;
> > +
> > + flowi4_init_output(&fl4, 0 /* oif */, skb->mark, tos,
> > + RT_SCOPE_UNIVERSE, IPPROTO_UDP, 0,
> > + ka->daddr.a4, ka->saddr.a4, ka->encap_dport,
> > + ka->encap_sport, sock_net_uid(net, NULL));
> > +
> > + rt = ip_route_output_key(net, &fl4);
> > + if (IS_ERR(rt))
> > + return PTR_ERR(rt);
> > +
> > + skb_dst_set(skb, &rt->dst);
> > +
> > + sk = *this_cpu_ptr(&nat_keepalive_sk_ipv4);
> > + sock_net_set(sk, net);
> > + err = ip_build_and_send_pkt(skb, sk, fl4.saddr, fl4.daddr, NULL, tos);
> > + sock_net_set(sk, &init_net);
> > + return err;
> > +}
> > +
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +static int nat_keepalive_send_ipv6(struct sk_buff *skb,
> > + struct nat_keepalive *ka,
> > + struct udphdr *uh)
> > +{
> > + struct net *net = ka->net;
> > + struct dst_entry *dst;
> > + struct flowi6 fl6;
> > + struct sock *sk;
> > + __wsum csum;
> > + int err;
> > +
> > + csum = skb_checksum(skb, 0, skb->len, 0);
> > + uh->check = csum_ipv6_magic(&ka->saddr.in6, &ka->daddr.in6,
> > + skb->len, IPPROTO_UDP, csum);
> > + if (uh->check == 0)
> > + uh->check = CSUM_MANGLED_0;
> > +
> > + memset(&fl6, 0, sizeof(fl6));
> > + fl6.flowi6_mark = skb->mark;
> > + fl6.saddr = ka->saddr.in6;
> > + fl6.daddr = ka->daddr.in6;
> > + fl6.flowi6_proto = IPPROTO_UDP;
> > + fl6.fl6_sport = ka->encap_sport;
> > + fl6.fl6_dport = ka->encap_dport;
> > +
> > + sk = *this_cpu_ptr(&nat_keepalive_sk_ipv6);
> > + sock_net_set(sk, net);
> > + dst = ipv6_stub->ipv6_dst_lookup_flow(net, sk, &fl6, NULL);
> > + if (IS_ERR(dst))
> > + return PTR_ERR(dst);
> > +
> > + skb_dst_set(skb, dst);
> > + err = ipv6_stub->ip6_xmit(sk, skb, &fl6, skb->mark, NULL, 0, 0);
> > + sock_net_set(sk, &init_net);
> > + return err;
> > +}
> > +#endif
> > +
> > +static void nat_keepalive_send(struct nat_keepalive *ka)
> > +{
> > + const int nat_ka_hdrs_len = max(sizeof(struct iphdr),
> > + sizeof(struct ipv6hdr)) +
> > + sizeof(struct udphdr);
> > + const u8 nat_ka_payload = 0xFF;
> > + int err = -EAFNOSUPPORT;
> > + struct sk_buff *skb;
> > + struct udphdr *uh;
> > +
> > + skb = alloc_skb(nat_ka_hdrs_len + sizeof(nat_ka_payload), GFP_ATOMIC);
> > + if (unlikely(!skb))
> > + return;
> > +
> > + skb_reserve(skb, nat_ka_hdrs_len);
> > +
> > + skb_put_u8(skb, nat_ka_payload);
> > +
> > + uh = skb_push(skb, sizeof(*uh));
> > + uh->source = ka->encap_sport;
> > + uh->dest = ka->encap_dport;
> > + uh->len = htons(skb->len);
> > + uh->check = 0;
> > +
> > + skb->mark = ka->smark;
> > +
> > + switch (ka->family) {
> > + case AF_INET:
> > + err = nat_keepalive_send_ipv4(skb, ka);
> > + break;
> > +#if IS_ENABLED(CONFIG_IPV6)
> > + case AF_INET6:
> > + err = nat_keepalive_send_ipv6(skb, ka, uh);
> > + break;
> > +#endif
> > + }
> > + if (err)
> > + kfree_skb(skb);
> > +}
> > +
> > +struct nat_keepalive_work_ctx {
> > + time64_t next_run;
> > + time64_t now;
> > +};
> > +
> > +static int nat_keepalive_work_single(struct xfrm_state *x, int count, void *ptr)
> > +{
> > + struct nat_keepalive_work_ctx *ctx = ptr;
> > + bool send_keepalive = false;
> > + struct nat_keepalive ka;
> > + time64_t next_run;
> > + u32 interval;
> > + int delta;
> > +
> > + interval = x->nat_keepalive_interval;
> > + if (!interval)
> > + return 0;
> > +
> > + spin_lock(&x->lock);
> > +
> > + delta = (int)(ctx->now - x->lastused);
> > + if (delta < interval) {
> > + x->nat_keepalive_expiration = ctx->now + interval - delta;
> > + next_run = x->nat_keepalive_expiration;
> > + } else if (x->nat_keepalive_expiration > ctx->now) {
> > + next_run = x->nat_keepalive_expiration;
> > + } else {
> > + next_run = ctx->now + interval;
> > + nat_keepalive_init(&ka, x);
> > + send_keepalive = true;
> > + }
> > +
> > + spin_unlock(&x->lock);
> > +
> > + if (send_keepalive)
> > + nat_keepalive_send(&ka);
> > +
> > + if (!ctx->next_run || next_run < ctx->next_run)
> > + ctx->next_run = next_run;
> > + return 0;
> > +}
> > +
> > +static void nat_keepalive_work(struct work_struct *work)
> > +{
> > + struct nat_keepalive_work_ctx ctx;
> > + struct xfrm_state_walk walk;
> > + struct net *net;
> > +
> > + ctx.next_run = 0;
> > + ctx.now = ktime_get_real_seconds();
> > +
> > + net = container_of(work, struct net, xfrm.nat_keepalive_work.work);
> > + xfrm_state_walk_init(&walk, IPPROTO_ESP, NULL);
> > + xfrm_state_walk(net, &walk, nat_keepalive_work_single, &ctx);
> > + xfrm_state_walk_done(&walk, net);
> > + if (ctx.next_run)
> > + schedule_delayed_work(&net->xfrm.nat_keepalive_work,
> > + (ctx.next_run - ctx.now) * HZ);
> > +}
> > +
> > +static int nat_keepalive_sk_init(struct sock * __percpu *socks,
> > + unsigned short family)
> > +{
> > + struct sock *sk;
> > + int err, i;
> > +
> > + for_each_possible_cpu(i) {
> > + err = inet_ctl_sock_create(&sk, family, SOCK_RAW, IPPROTO_UDP,
> > + &init_net);
> > + if (err < 0)
> > + goto err;
> > +
> > + *per_cpu_ptr(socks, i) = sk;
> > + }
> > +
> > + return 0;
> > +err:
> > + for_each_possible_cpu(i)
> > + inet_ctl_sock_destroy(*per_cpu_ptr(socks, i));
> > + return err;
> > +}
> > +
> > +static void nat_keepalive_sk_fini(struct sock * __percpu *socks)
> > +{
> > + int i;
> > +
> > + for_each_possible_cpu(i)
> > + inet_ctl_sock_destroy(*per_cpu_ptr(socks, i));
> > +}
> > +
> > +void xfrm_nat_keepalive_state_updated(struct xfrm_state *x)
> > +{
> > + struct net *net;
> > +
> > + if (!x->nat_keepalive_interval)
> > + return;
> > +
> > + net = xs_net(x);
> > + schedule_delayed_work(&net->xfrm.nat_keepalive_work, 0);
> > +}
> > +
> > +int __net_init xfrm_nat_keepalive_net_init(struct net *net)
> > +{
> > + INIT_DELAYED_WORK(&net->xfrm.nat_keepalive_work, nat_keepalive_work);
> > + return 0;
> > +}
> > +
> > +int xfrm_nat_keepalive_net_fini(struct net *net)
> > +{
> > + cancel_delayed_work_sync(&net->xfrm.nat_keepalive_work);
> > + return 0;
> > +}
> > +
> > +int xfrm_nat_keepalive_init(unsigned short family)
> > +{
> > + int err = -EAFNOSUPPORT;
> > +
> > + switch (family) {
> > + case AF_INET:
> > + err = nat_keepalive_sk_init(&nat_keepalive_sk_ipv4, PF_INET);
> > + break;
> > +#if IS_ENABLED(CONFIG_IPV6)
> > + case AF_INET6:
> > + err = nat_keepalive_sk_init(&nat_keepalive_sk_ipv6, PF_INET6);
> > + break;
> > +#endif
> > + }
> > +
> > + if (err)
> > + pr_err("xfrm nat keepalive init: failed to init err:%d\n", err);
> > + return err;
> > +}
> > +EXPORT_SYMBOL_GPL(xfrm_nat_keepalive_init);
> > +
> > +void xfrm_nat_keepalive_fini(unsigned short family)
> > +{
> > + switch (family) {
> > + case AF_INET:
> > + nat_keepalive_sk_fini(&nat_keepalive_sk_ipv4);
> > + break;
> > +#if IS_ENABLED(CONFIG_IPV6)
> > + case AF_INET6:
> > + nat_keepalive_sk_fini(&nat_keepalive_sk_ipv6);
> > + break;
> > +#endif
> > + }
> > +}
> > +EXPORT_SYMBOL_GPL(xfrm_nat_keepalive_fini);
> > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> > index 298b3a9eb48d..580c27ac7778 100644
> > --- a/net/xfrm/xfrm_policy.c
> > +++ b/net/xfrm/xfrm_policy.c
> > @@ -4288,8 +4288,14 @@ static int __net_init xfrm_net_init(struct net *net)
> > if (rv < 0)
> > goto out_sysctl;
> >
> > + rv = xfrm_nat_keepalive_net_init(net);
> > + if (rv < 0)
> > + goto out_nat_keepalive;
> > +
> > return 0;
> >
> > +out_nat_keepalive:
> > + xfrm_sysctl_fini(net);
> > out_sysctl:
> > xfrm_policy_fini(net);
> > out_policy:
> > @@ -4302,6 +4308,7 @@ static int __net_init xfrm_net_init(struct net *net)
> >
> > static void __net_exit xfrm_net_exit(struct net *net)
> > {
> > + xfrm_nat_keepalive_net_fini(net);
> > xfrm_sysctl_fini(net);
> > xfrm_policy_fini(net);
> > xfrm_state_fini(net);
> > @@ -4363,6 +4370,7 @@ void __init xfrm_init(void)
> > #endif
> >
> > register_xfrm_state_bpf();
> > + xfrm_nat_keepalive_init(AF_INET);
> > }
> >
> > #ifdef CONFIG_AUDITSYSCALL
> > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > index 649bb739df0d..abadc857cd45 100644
> > --- a/net/xfrm/xfrm_state.c
> > +++ b/net/xfrm/xfrm_state.c
> > @@ -715,6 +715,7 @@ int __xfrm_state_delete(struct xfrm_state *x)
> > if (x->id.spi)
> > hlist_del_rcu(&x->byspi);
> > net->xfrm.state_num--;
> > + xfrm_nat_keepalive_state_updated(x);
> > spin_unlock(&net->xfrm.xfrm_state_lock);
> >
> > if (x->encap_sk)
> > @@ -1453,6 +1454,7 @@ static void __xfrm_state_insert(struct xfrm_state *x)
> > net->xfrm.state_num++;
> >
> > xfrm_hash_grow_check(net, x->bydst.next != NULL);
> > + xfrm_nat_keepalive_state_updated(x);
> > }
> >
> > /* net->xfrm.xfrm_state_lock is held */
> > @@ -2871,6 +2873,21 @@ int __xfrm_init_state(struct xfrm_state *x, bool init_replay, bool offload,
> > goto error;
> > }
> >
> > + if (x->nat_keepalive_interval) {
> > + if (x->dir != XFRM_SA_DIR_OUT) {
> > + NL_SET_ERR_MSG(extack, "NAT keepalive is only supported for outbound SAs");
> > + err = -EINVAL;
> > + goto error;
> > + }
> > +
> > + if (!x->encap || x->encap->encap_type != UDP_ENCAP_ESPINUDP) {
> > + NL_SET_ERR_MSG(extack,
> > + "NAT keepalive is only supported for UDP encapsulation");
> > + err = -EINVAL;
> > + goto error;
> > + }
> > + }
> > +
> > error:
> > return err;
> > }
> > diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
> > index e83c687bd64e..a552cfa623ea 100644
> > --- a/net/xfrm/xfrm_user.c
> > +++ b/net/xfrm/xfrm_user.c
> > @@ -833,6 +833,10 @@ static struct xfrm_state *xfrm_state_construct(struct net *net,
> > if (attrs[XFRMA_SA_DIR])
> > x->dir = nla_get_u8(attrs[XFRMA_SA_DIR]);
> >
> > + if (attrs[XFRMA_NAT_KEEPALIVE_INTERVAL])
> > + x->nat_keepalive_interval =
> > + nla_get_u32(attrs[XFRMA_NAT_KEEPALIVE_INTERVAL]);
> > +
> > err = __xfrm_init_state(x, false, attrs[XFRMA_OFFLOAD_DEV], extack);
> > if (err)
> > goto error;
> > @@ -1288,6 +1292,13 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
> > }
> > if (x->dir)
> > ret = nla_put_u8(skb, XFRMA_SA_DIR, x->dir);
> > +
> > + if (x->nat_keepalive_interval) {
> > + ret = nla_put_u32(skb, XFRMA_NAT_KEEPALIVE_INTERVAL,
> > + x->nat_keepalive_interval);
> > + if (ret)
> > + goto out;
> > + }
> > out:
> > return ret;
> > }
> > @@ -3165,6 +3176,7 @@ const struct nla_policy xfrma_policy[XFRMA_MAX+1] = {
> > [XFRMA_IF_ID] = { .type = NLA_U32 },
> > [XFRMA_MTIMER_THRESH] = { .type = NLA_U32 },
> > [XFRMA_SA_DIR] = NLA_POLICY_RANGE(NLA_U8, XFRM_SA_DIR_IN, XFRM_SA_DIR_OUT),
> > + [XFRMA_NAT_KEEPALIVE_INTERVAL] = { .type = NLA_U32 },
>
> What would happen if the value is set to 0? Should it be allowed? I don't
> see any use for it. If not, should we make the range from 1 to UINT32_MAX?
> This way, we can avoid any potential issues with the value 0, ensuring
> proper functionality.
The xfrm_state only has x->nat_keepalive_interval so if you set the value to
'0' it would be the same as not setting it.
>
> I also wonder about limiting the maxium value to x->lft.hard_add_expires_seconds.
> Adding XFRMA_NAT_KEEPALIVE_INTERVAL > x->lft.hard_add_expires_seconds does
> not make sesne to me. Just another sanity check.
I don't understand the relation between the NAT-T keepalive and
"hard_add_expires_seconds". That number seems to be derived from the acquire
expiration sysctl whereas NAT-T is related to the network between hosts and
firewall timeouts there. As such, i'm not sure what the appropriate upper bound
for this configuration would be or whether it's needed.
>
> > };
> > EXPORT_SYMBOL_GPL(xfrma_policy);
> >
> > @@ -3474,6 +3486,9 @@ static inline unsigned int xfrm_sa_len(struct xfrm_state *x)
> > if (x->dir)
> > l += nla_total_size(sizeof(x->dir));
> >
> > + if (x->nat_keepalive_interval)
> > + l += nla_total_size(sizeof(x->nat_keepalive_interval));
> > +
> > return l;
> > }
> >
> > --
> > 2.34.1
>
> One curious question: What happens if the NAT gateway in between returns an
> ICMP unreachable error as a response? If the IKE daemon was sending it,
> IKEd would notice the ICMP errors and possibly take action. This is
> something to consider. For example, this might be important to handle when
> offloading on an Android phone use case. Somehow, the IKE daemon should be
> woken up to handle these errors; otherwise, you risk unnecessary battery
> drain. Or if you are server, flodding lot of nat-keep-alive.
>
> 07:33:30.839377 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 29)
> 192.1.3.33.4500 > 192.1.2.23.4500: [no cksum] isakmp-nat-keep-alive
> 07:33:30.840014 IP (tos 0xc0, ttl 63, id 17350, offset 0, flags [none], proto ICMP (1), length 57)
> 192.1.2.23 > 192.1.3.33: ICMP 192.1.2.23 udp port 4500 unreachable, length 37
> IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 29)
> 192.1.3.33.4500 > 192.1.2.23.4500: [no cksum] isakmp-nat-keep-alive
That's an interesting observation. Do IKE daemons currently handle this
situation?
If the route between hosts isn't available, any traffic related to the xfrm
state would fail in a similar way, which the IKE daemon could observe as well.
Is there such handling done today?
>
> I ran a few quick tests and the patch works well.
Thanks!
Eyal.
Powered by blists - more mailing lists