Message-ID: <Z6ucMj5FukT_lecR@hog>
Date: Tue, 11 Feb 2025 19:51:30 +0100
From: Sabrina Dubroca <sd@...asysnail.net>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, Paolo Abeni <pabeni@...hat.com>,
Neal Cardwell <ncardwell@...gle.com>,
Kuniyuki Iwashima <kuniyu@...zon.com>,
David Ahern <dsahern@...nel.org>, Xiumei Mu <xmu@...hat.com>,
Paul Moore <paul@...l-moore.com>
Subject: Re: [PATCH net] tcp: drop skb extensions before
skb_attempt_defer_free
2025-02-10, 17:24:43 +0100, Eric Dumazet wrote:
> On Mon, Feb 10, 2025 at 5:02 PM Sabrina Dubroca <sd@...asysnail.net> wrote:
> >
> > Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> > running tests that boil down to:
> > - create a pair of netns
> > - run a basic TCP test over ipcomp6
> > - delete the pair of netns
> >
> > The xfrm_state found on spi_byaddr was not deleted at the time we
> > delete the netns, because we still have a reference on it. This
> > lingering reference comes from a secpath (which holds a ref on the
> > xfrm_state), which is still attached to an skb. This skb is not
> > leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> > skb_attempt_defer_free.
> >
> > The problem happens when we defer freeing an skb (push it on one CPU's
> > defer_list), and don't flush that list before the netns is deleted. In
> > that case, we still have a reference on the xfrm_state that we don't
> > expect at this point.
> >
> > tcp_eat_recv_skb is currently the only caller of skb_attempt_defer_free,
> > so I'm fixing it here. This patch also adds a DEBUG_NET_WARN_ON_ONCE
> > in skb_attempt_defer_free, to make sure we don't re-introduce this
> > problem.
> >
> > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > Reported-by: Xiumei Mu <xmu@...hat.com>
> > Signed-off-by: Sabrina Dubroca <sd@...asysnail.net>
> > ---
> > A few comments:
> > - AFAICT this could not happen before 68822bdf76f1, since we would
> > have emptied the (per-socket) defer_list before getting to ->exit()
> > for the netns
> > - I thought about dropping the extensions at the same time as we
> > already drop the dst, but Paolo said this is probably not correct due
> > to IP_CMSG_PASSSEC
>
> I think we discussed this issue in the past.
>
> Are you sure IP_CMSG_PASSSEC is ever used by TCP ?
After checking, I don't think so. The only way TCP can get to
IP_CMSG_PASSSEC is through the error queue, so it shouldn't matter.
The original commit (2c7946a7bf45 ("[SECURITY]: TCP/UDP getpeersec"))
also says that TCP should be using SO_PEERSEC for that purpose
(which is likely also based on the secpath, just not per packet).
Based on the chat you had with Paul Moore back in November, it seems
any point after tcp_filter should be fine:
https://lore.kernel.org/netdev/CAHC9VhS3yuwrOPcH5_iRy50O_TtBCh_OVWHZgzfFTYqyfrw_zQ@mail.gmail.com
> Many layers in TCP can aggregate packets, are they aware of XFRM yet ?
I'm not so familiar with the depths of TCP, but with what you're
suggesting below, AFAIU the cleanup should happen before any
aggregation attempt (well, there's GRO...).
[...]
> If we think about it, storing thousands of packets in TCP sockets receive queues
> with XFRM state is consuming memory for absolutely no purpose.
True. I went with the simpler (less likely to break things
unexpectedly) fix for v1.
> It is worth noting MPTCP calls skb_ext_reset(skb) after
> commit 4e637c70b503b686aae45716a25a94dc3a434f3a ("mptcp: attempt
> coalescing when moving skbs to mptcp rx queue")
>
> I would suggest calling secpath_reset() earlier in TCP, from BH
> handler, while cpu caches are hot,
> instead of waiting for recvmsg() to drain the receive queue much later ?
Ok. So in the end it would look a lot like what you proposed in a
discussion with Ilya:
https://lore.kernel.org/netdev/CANn89i+JdDukwEhZ%3D41FxY-w63eER6JVixkwL+s2eSOjo6aWEQ@mail.gmail.com/
(as Paolo noticed, we can't just do skb_ext_reset because at least in
tcp_data_queue the MPTCP extension has just been attached)
An additional patch could maybe add DEBUG_NET_WARN_ON_ONCE at the time
we add skbs to sk_receive_queue, to check we didn't miss (or remove in
the future) places where the dst or secpath should have been dropped?
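Purely as a sketch of what I have in mind (the helper name is made up,
and where exactly to call it -- tcp_queue_rcv, tcp_add_backlog -- is a
guess, not something I've tested):

static inline void tcp_check_skb_clean(const struct sk_buff *skb)
{
	/* dst and secpath should have been dropped before queueing */
	DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
	DEBUG_NET_WARN_ON_ONCE(secpath_exists(skb));
}
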
-------- 8< --------
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5b2b04835688..87c1e98d76cf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -683,6 +683,12 @@ void tcp_fin(struct sock *sk);
void tcp_check_space(struct sock *sk);
void tcp_sack_compress_send_ack(struct sock *sk);
+static inline void tcp_cleanup_skb(struct sk_buff *skb)
+{
+ skb_dst_drop(skb);
+ secpath_reset(skb);
+}
+
/* tcp_timer.c */
void tcp_init_xmit_timers(struct sock *);
static inline void tcp_clear_xmit_timers(struct sock *sk)
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 0f523cbfe329..b815b9fc604c 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -178,7 +178,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
if (!skb)
return;
- skb_dst_drop(skb);
+ tcp_cleanup_skb(skb);
/* segs_in has been initialized to 1 in tcp_create_openreq_child().
* Hence, reset segs_in to 0 before calling tcp_segs_in()
* to avoid double counting. Also, tcp_segs_in() expects
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eb82e01da911..bb0811c38908 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5245,7 +5245,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
__kfree_skb(skb);
return;
}
- skb_dst_drop(skb);
+ tcp_cleanup_skb(skb);
__skb_pull(skb, tcp_hdr(skb)->doff * 4);
reason = SKB_DROP_REASON_NOT_SPECIFIED;
@@ -6226,7 +6226,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
/* Bulk data transfer: receiver */
- skb_dst_drop(skb);
+ tcp_cleanup_skb(skb);
__skb_pull(skb, tcp_header_len);
eaten = tcp_queue_rcv(sk, skb, &fragstolen);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cc2b5194a18d..2632844d2c35 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2027,7 +2027,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
*/
skb_condense(skb);
- skb_dst_drop(skb);
+ tcp_cleanup_skb(skb);
if (unlikely(tcp_checksum_complete(skb))) {
bh_unlock_sock(sk);
--
Sabrina