netdev - Re: [PATCH net] tcp: drop skb extensions before skb_attempt_defer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CANn89iJbzed1HnW7QHSRWno92hLAbQH+iaitAutqRh=CK9koaw@mail.gmail.com>
Date: Mon, 10 Feb 2025 17:24:43 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Sabrina Dubroca <sd@...asysnail.net>
Cc: netdev@...r.kernel.org, Paolo Abeni <pabeni@...hat.com>, 
	Neal Cardwell <ncardwell@...gle.com>, Kuniyuki Iwashima <kuniyu@...zon.com>, 
	David Ahern <dsahern@...nel.org>, Xiumei Mu <xmu@...hat.com>
Subject: Re: [PATCH net] tcp: drop skb extensions before skb_attempt_defer_free

On Mon, Feb 10, 2025 at 5:02 PM Sabrina Dubroca <sd@...asysnail.net> wrote:
>
> Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> running tests that boil down to:
>  - create a pair of netns
>  - run a basic TCP test over ipcomp6
>  - delete the pair of netns
>
> The xfrm_state found on spi_byaddr was not deleted at the time we
> delete the netns, because we still have a reference on it. This
> lingering reference comes from a secpath (which holds a ref on the
> xfrm_state), which is still attached to an skb. This skb is not
> leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> skb_attempt_defer_free.
>
> The problem happens when we defer freeing an skb (push it on one CPU's
> defer_list), and don't flush that list before the netns is deleted. In
> that case, we still have a reference on the xfrm_state that we don't
> expect at this point.
>
> tcp_eat_recv_skb is currently the only caller of skb_attempt_defer_free,
> so I'm fixing it here. This patch also adds a DEBUG_NET_WARN_ON_ONCE
> in skb_attempt_defer_free, to make sure we don't re-introduce this
> problem.
>
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Reported-by: Xiumei Mu <xmu@...hat.com>
> Signed-off-by: Sabrina Dubroca <sd@...asysnail.net>
> ---
> A few comments:
>  - AFAICT this could not happen before 68822bdf76f1, since we would
>    have emptied the (per-socket) defer_list before getting to ->exit()
>    for the netns
>  - I thought about dropping the extensions at the same time as we
>    already drop the dst, but Paolo said this is probably not correct due
>    to IP_CMSG_PASSSEC

I think we discussed this issue in the past.

Are you sure IP_CMSG_PASSSEC  is ever used by TCP ?

Many layers in TCP can aggregate packets, are they aware of XFRM yet ?

>  - I'm planning to rework the "synchronous" removal of xfrm_states
>    (commit f75a2804da39 ("xfrm: destroy xfrm_state synchronously on
>    net exit path")), which may also be able to fix this problem, but
>    it is going to be a lot more complex than this patch
>
>  net/core/skbuff.c | 1 +
>  net/ipv4/tcp.c    | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 6a99c453397f..abd0371bc51a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -7047,6 +7047,7 @@ nodefer:  kfree_skb_napi_cache(skb);
>
>         DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
>         DEBUG_NET_WARN_ON_ONCE(skb->destructor);
> +       DEBUG_NET_WARN_ON_ONCE(skb_has_extensions(skb));
>
>         sd = &per_cpu(softnet_data, cpu);
>         defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 0d704bda6c41..e60f642485ee 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1524,6 +1524,7 @@ static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
>                 sock_rfree(skb);
>                 skb->destructor = NULL;
>                 skb->sk = NULL;
> +               skb_ext_reset(skb);

If we think about it, storing thousands of packets in TCP sockets receive queues
with XFRM state is consuming memory for absolutely no purpose.

It is worth noting MPTCP  calls skb_ext_reset(skb) after
commit 4e637c70b503b686aae45716a25a94dc3a434f3a ("mptcp: attempt
coalescing when moving skbs to mptcp rx queue")

I would suggest calling secpath_reset() earlier in TCP, from BH
handler, while cpu caches are hot,
instead of waiting for recvmsg() to drain the receive queue much later ?