netdev - Re: [PATCH net] tcp: drop skb extensions before skb_attempt_defer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iJ=qUt=tgPUMUcAjeNunuYByMNCOcTzqABe_qLu-BiAPQ@mail.gmail.com>
Date: Tue, 11 Feb 2025 20:17:17 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Sabrina Dubroca <sd@...asysnail.net>
Cc: netdev@...r.kernel.org, Paolo Abeni <pabeni@...hat.com>, 
	Neal Cardwell <ncardwell@...gle.com>, Kuniyuki Iwashima <kuniyu@...zon.com>, 
	David Ahern <dsahern@...nel.org>, Xiumei Mu <xmu@...hat.com>, Paul Moore <paul@...l-moore.com>
Subject: Re: [PATCH net] tcp: drop skb extensions before skb_attempt_defer_free

On Tue, Feb 11, 2025 at 7:51 PM Sabrina Dubroca <sd@...asysnail.net> wrote:
>
> 2025-02-10, 17:24:43 +0100, Eric Dumazet wrote:
> > On Mon, Feb 10, 2025 at 5:02 PM Sabrina Dubroca <sd@...asysnail.net> wrote:
> > >
> > > Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> > > running tests that boil down to:
> > >  - create a pair of netns
> > >  - run a basic TCP test over ipcomp6
> > >  - delete the pair of netns
> > >
> > > The xfrm_state found on spi_byaddr was not deleted at the time we
> > > delete the netns, because we still have a reference on it. This
> > > lingering reference comes from a secpath (which holds a ref on the
> > > xfrm_state), which is still attached to an skb. This skb is not
> > > leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> > > skb_attempt_defer_free.
> > >
> > > The problem happens when we defer freeing an skb (push it on one CPU's
> > > defer_list), and don't flush that list before the netns is deleted. In
> > > that case, we still have a reference on the xfrm_state that we don't
> > > expect at this point.
> > >
> > > tcp_eat_recv_skb is currently the only caller of skb_attempt_defer_free,
> > > so I'm fixing it here. This patch also adds a DEBUG_NET_WARN_ON_ONCE
> > > in skb_attempt_defer_free, to make sure we don't re-introduce this
> > > problem.
> > >
> > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > > Reported-by: Xiumei Mu <xmu@...hat.com>
> > > Signed-off-by: Sabrina Dubroca <sd@...asysnail.net>
> > > ---
> > > A few comments:
> > >  - AFAICT this could not happen before 68822bdf76f1, since we would
> > >    have emptied the (per-socket) defer_list before getting to ->exit()
> > >    for the netns
> > >  - I thought about dropping the extensions at the same time as we
> > >    already drop the dst, but Paolo said this is probably not correct due
> > >    to IP_CMSG_PASSSEC
> >
> > I think we discussed this issue in the past.
> >
> > Are you sure IP_CMSG_PASSSEC  is ever used by TCP ?
>
> After checking, I don't think so. The only way TCP can get to
> IP_CMSG_PASSSEC is through the error queue, so it shouldn't matter.
>
> The original commit (2c7946a7bf45 ("[SECURITY]: TCP/UDP getpeersec"))
> also says that TCP should be using SO_PEERSEC for that purpose
> (although likely based on the secpath as well, but not packet per
> packet).
>
> Based on the chat you had with Paul Moore back in November, it seems
> any point after tcp_filter should be fine:
>
> https://lore.kernel.org/netdev/CAHC9VhS3yuwrOPcH5_iRy50O_TtBCh_OVWHZgzfFTYqyfrw_zQ@mail.gmail.com
>
>
> > Many layers in TCP can aggregate packets, are they aware of XFRM yet ?
>
> I'm not so familiar with the depths of TCP, but with what you're
> suggesting below, AFAIU the cleanup should happen before any
> aggregation attempt (well, there's GRO...).
>
>
> [...]
> > If we think about it, storing thousands of packets in TCP sockets receive queues
> > with XFRM state is consuming memory for absolutely no purpose.
>
> True. I went with the simpler (less likely to break things
> unexpectedly) fix for v1.
>
> > It is worth noting MPTCP  calls skb_ext_reset(skb) after
> > commit 4e637c70b503b686aae45716a25a94dc3a434f3a ("mptcp: attempt
> > coalescing when moving skbs to mptcp rx queue")
> >
> > I would suggest calling secpath_reset() earlier in TCP, from BH
> > handler, while cpu caches are hot,
> > instead of waiting for recvmsg() to drain the receive queue much later ?
>
> Ok. So in the end it would look a lot like what you proposed in a
> discussion with Ilya:
> https://lore.kernel.org/netdev/CANn89i+JdDukwEhZ%3D41FxY-w63eER6JVixkwL+s2eSOjo6aWEQ@mail.gmail.com/
>
> (as Paolo noticed, we can't just do skb_ext_reset because at least in
> tcp_data_queue the MPTCP extension has just been attached)
>
> An additional patch could maybe add DEBUG_NET_WARN_ON_ONCE at the time
> we add skbs to sk_receive_queue, to check we didn't miss (or remove in
> the future) places where the dst or secpath should have been dropped?

Sure, adding the  DEBUG_NET_WARN_ON_ONCE() is absolutely fine.

>
> -------- 8< --------
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 5b2b04835688..87c1e98d76cf 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -683,6 +683,12 @@ void tcp_fin(struct sock *sk);
>  void tcp_check_space(struct sock *sk);
>  void tcp_sack_compress_send_ack(struct sock *sk);
>
> +static inline void tcp_skb_cleanup(struct sk_buff *skb)
> +{
> +       skb_dst_drop(skb);
> +       secpath_reset(skb);
> +}
> +
>  /* tcp_timer.c */
>  void tcp_init_xmit_timers(struct sock *);
>  static inline void tcp_clear_xmit_timers(struct sock *sk)
> diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
> index 0f523cbfe329..b815b9fc604c 100644
> --- a/net/ipv4/tcp_fastopen.c
> +++ b/net/ipv4/tcp_fastopen.c
> @@ -178,7 +178,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
>         if (!skb)
>                 return;
>
> -       skb_dst_drop(skb);
> +       tcp_cleanup_skb(skb);
>         /* segs_in has been initialized to 1 in tcp_create_openreq_child().
>          * Hence, reset segs_in to 0 before calling tcp_segs_in()
>          * to avoid double counting.  Also, tcp_segs_in() expects
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index eb82e01da911..bb0811c38908 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5245,7 +5245,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
>                 __kfree_skb(skb);
>                 return;
>         }
> -       skb_dst_drop(skb);
> +       tcp_cleanup_skb(skb);
>         __skb_pull(skb, tcp_hdr(skb)->doff * 4);
>
>         reason = SKB_DROP_REASON_NOT_SPECIFIED;
> @@ -6226,7 +6226,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>                         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
>
>                         /* Bulk data transfer: receiver */
> -                       skb_dst_drop(skb);
> +                       tcp_cleanup_skb(skb);
>                         __skb_pull(skb, tcp_header_len);
>                         eaten = tcp_queue_rcv(sk, skb, &fragstolen);
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index cc2b5194a18d..2632844d2c35 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2027,7 +2027,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
>          */
>         skb_condense(skb);
>
> -       skb_dst_drop(skb);
> +       tcp_cleanup_skb(skb);
>
>         if (unlikely(tcp_checksum_complete(skb))) {
>                 bh_unlock_sock(sk);
> --

This looks much better to me ;)