Message-ID: <20250129123129.0c102586@samweis>
Date: Wed, 29 Jan 2025 12:31:29 +0100
From: Thomas Bogendoerfer <tbogendoerfer@...e.de>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Paolo Abeni <pabeni@...hat.com>, "David S. Miller"
 <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, Simon Horman
 <horms@...nel.org>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 net] gro_cells: Avoid packet re-ordering for cloned
 skbs

On Thu, 23 Jan 2025 11:43:05 +0100
Eric Dumazet <edumazet@...gle.com> wrote:

> On Thu, Jan 23, 2025 at 11:42 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >
> > On 1/23/25 11:07 AM, Eric Dumazet wrote:  
> > > On Thu, Jan 23, 2025 at 9:43 AM Paolo Abeni <pabeni@...hat.com> wrote:  
> > >> On 1/21/25 12:50 PM, Thomas Bogendoerfer wrote:  
> > >>> gro_cells_receive() passes a cloned skb directly up the stack and
> > >>> could cause re-ordering against segments still in GRO. To avoid
> > >>> this queue cloned skbs and use gro_normal_one() to pass it during
> > >>> normal NAPI work.
> > >>>
> > >>> Fixes: c9e6bc644e55 ("net: add gro_cells infrastructure")
> > >>> Suggested-by: Eric Dumazet <edumazet@...gle.com>
> > >>> Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@...e.de>
> > >>> --
> > >>> v2: don't use skb_copy(), but make decision how to pass cloned skbs in
> > >>>     napi poll function (suggested by Eric)
> > >>> v1: https://lore.kernel.org/lkml/20250109142724.29228-1-tbogendoerfer@suse.de/
> > >>>
> > >>>  net/core/gro_cells.c | 9 +++++++--
> > >>>  1 file changed, 7 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/net/core/gro_cells.c b/net/core/gro_cells.c
> > >>> index ff8e5b64bf6b..762746d18486 100644
> > >>> --- a/net/core/gro_cells.c
> > >>> +++ b/net/core/gro_cells.c
> > >>> @@ -2,6 +2,7 @@
> > >>>  #include <linux/skbuff.h>
> > >>>  #include <linux/slab.h>
> > >>>  #include <linux/netdevice.h>
> > >>> +#include <net/gro.h>
> > >>>  #include <net/gro_cells.h>
> > >>>  #include <net/hotdata.h>
> > >>>
> > >>> @@ -20,7 +21,7 @@ int gro_cells_receive(struct gro_cells *gcells, struct sk_buff *skb)
> > >>>       if (unlikely(!(dev->flags & IFF_UP)))
> > >>>               goto drop;
> > >>>
> > >>> -     if (!gcells->cells || skb_cloned(skb) || netif_elide_gro(dev)) {
> > >>> +     if (!gcells->cells || netif_elide_gro(dev)) {
> > >>>               res = netif_rx(skb);
> > >>>               goto unlock;
> > >>>       }
> > >>> @@ -58,7 +59,11 @@ static int gro_cell_poll(struct napi_struct *napi, int budget)
> > >>>               skb = __skb_dequeue(&cell->napi_skbs);
> > >>>               if (!skb)
> > >>>                       break;
> > >>> -             napi_gro_receive(napi, skb);
> > >>> +             /* Core GRO stack does not play well with clones. */
> > >>> +             if (skb_cloned(skb))
> > >>> +                     gro_normal_one(napi, skb, 1);
> > >>> +             else
> > >>> +                     napi_gro_receive(napi, skb);  
> > >>
> > >> I must admit it's not clear to me how/why the above will avoid OoO. I
> > >> assume OoO happens when we observe both cloned and uncloned packets
> > >> belonging to the same connection/flow.
> > >>
> > >> What if we have a (uncloned) packet for the relevant flow in the GRO,
> > >> 'rx_count - 1' packets already sitting in 'rx_list' and a cloned packet
> > >> for the critical flow reaches gro_cells_receive()?
> > >>
> > >> Don't we need to unconditionally flush any packets belonging to the same
> > >> flow?  
> > >
> > > It would only matter if we had 2 or more segments that would belong
> > > to the same flow and packet train (potential 'GRO super packet'), with
> > > the 'cloned'
> > > status being of mixed value on various segments.
> > >
> > > In practice, the cloned status will be the same for all segments.  
> >
> > I agree with the above, but my doubt is: does the above also mean that
> > in practice there are no OoO to deal with, even without this patch?
> >
> > To rephrase my doubt: which scenario is addressed by this patch that
> > would lead to OoO without it?  
> 
> Fair point, a detailed changelog would be really nice.

My test scenario is simple:

TCP Sender in namespace A -> ip6_tunnel -> ipvlan -> ipvlan -> ip6_tunnel -> TCP receiver

Sender does continuous writes in 15k chunks, receiver reads data from socket in a loop.
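
A sender/receiver pair of this kind can be sketched roughly as follows. This is an illustrative sketch only, not the test program used here: the function names send_chunks() and drain() are made up, the exact chunk size of 15000 bytes is an assumption based on "15k", and fd is assumed to be an already connected stream socket (any stream file descriptor works for trying it out).

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 15000 /* assumed value for the "15k chunks" */

/* Hypothetical helper: write `count` chunks of CHUNK bytes to fd.
 * Returns the number of bytes written, or -1 on a write error. */
ssize_t send_chunks(int fd, int count)
{
	static char buf[CHUNK];
	ssize_t total = 0;

	assert(count >= 0);
	memset(buf, 'x', sizeof(buf));

	for (int i = 0; i < count; i++) {
		size_t off = 0;

		/* write() may be partial, so loop until the chunk is out */
		while (off < sizeof(buf)) {
			ssize_t n = write(fd, buf + off, sizeof(buf) - off);

			if (n < 0)
				return -1;
			off += (size_t)n;
		}
		total += (ssize_t)sizeof(buf);
	}
	return total;
}

/* Hypothetical helper: read from fd in a loop until EOF.
 * Returns the number of bytes read, or -1 on a read error. */
ssize_t drain(int fd)
{
	char buf[65536];
	ssize_t total = 0;
	ssize_t n;

	while ((n = read(fd, buf, sizeof(buf))) > 0)
		total += n;

	return n < 0 ? -1 : total;
}
```

In the scenario above the sender side would call something like send_chunks() on a TCP socket connected to the receiver, and the receiver would call drain() on the accepted socket.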

And that is what I see:

   40   0.002766      1000::1 → 2000::1      TCP 15088 51238 → 5060 [PSH, ACK] Seq=2862576060 Ack=1152583678 Win=65536 Len=15000 TSval=3343493494 TSecr=629171944
   41   0.002844      1000::1 → 2000::1      TCP 9816 51238 → 5060 [PSH, ACK] Seq=2862591060 Ack=1152583678 Win=65536 Len=9728 TSval=3343493494 TSecr=629171944
   42   0.004122      1000::1 → 2000::1      TCP 1468 [TCP Previous segment not captured] 51238 → 5060 [ACK] Seq=2862642188 Ack=1152583678 Win=65536 Len=1380 TSval=3343493496 TSecr=629171946
   43   0.004128      1000::1 → 2000::1      TCP 20788 [TCP Out-Of-Order] 51238 → 5060 [PSH, ACK] Seq=2862600788 Ack=1152583678 Win=65536 Len=20700 TSval=3343493496 TSecr=629171946
   44   0.004133      1000::1 → 2000::1      TCP 20788 [TCP Out-Of-Order] 51238 → 5060 [PSH, ACK] Seq=2862621488 Ack=1152583678 Win=65536 Len=20700 TSval=3343493496 TSecr=629171946
   45   0.004169      1000::1 → 2000::1      TCP 500 [TCP Previous segment not captured] 51238 → 5060 [PSH, ACK] Seq=2862665648 Ack=1152583678 Win=65536 Len=412 TSval=3343493496 TSecr=629171946
   46   0.004180      1000::1 → 2000::1      TCP 22168 [TCP Out-Of-Order] 51238 → 5060 [PSH, ACK] Seq=2862643568 Ack=1152583678 Win=65536 Len=22080 TSval=3343493496 TSecr=629171946
   47   0.004187      1000::1 → 2000::1      TCP 13888 51238 → 5060 [PSH, ACK] Seq=2862666060 Ack=1152583678 Win=65536 Len=13800 TSval=3343493496 TSecr=629171946
   48   0.004201      1000::1 → 2000::1      TCP 1288 51238 → 5060 [PSH, ACK] Seq=2862679860 Ack=1152583678 Win=65536 Len=1200 TSval=3343493496 TSecr=629171946
   49   0.004273      1000::1 → 2000::1      TCP 13888 51238 → 5060 [PSH, ACK] Seq=2862681060 Ack=1152583678 Win=65536 Len=13800 TSval=3343493496 TSecr=629171946

IMHO these ooO packets are retransmits for segments still waiting in GRO. With the
v2 patch applied, the trace looks like this:

 2856   9.526256      1000::1 → 2000::1      TCP 64948 50452 → 5060 [PSH, ACK] Seq=1871837193 Ack=209151777 Win=65536 Len=64860 TSval=2755210164 TSecr=2795235137
 2857   9.526258      1000::1 → 2000::1      TCP 5480 50452 → 5060 [PSH, ACK] Seq=1871902053 Ack=209151777 Win=65536 Len=5392 TSval=2755210164 TSecr=2795235137
 2858   9.535262      1000::1 → 2000::1      TCP 1340 [TCP Retransmission] 50452 → 5060 [ACK] Seq=1871906193 Ack=209151777 Win=65536 Len=1252 TSval=2755210174 TSecr=2795235137
 2859   9.585477      1000::1 → 2000::1      TCP 64948 50452 → 5060 [PSH, ACK] Seq=1871907445 Ack=209151777 Win=65536 Len=64860 TSval=2755210224 TSecr=2795235197
 2860   9.585486      1000::1 → 2000::1      TCP 64948 50452 → 5060 [PSH, ACK] Seq=1871972305 Ack=209151777 Win=65536 Len=64860 TSval=2755210224 TSecr=2795235197

Looks ok to me, but without a GRO flush there is still a chance of ooO packets.
I've worked on a new patch (below as an RFC) which pushes the check for skb_cloned()
into GRO. The result is comparable to the v2 patch:

  604   1.987863      1000::1 → 2000::1      TCP 64948 57278 → 5060 [PSH, ACK] Seq=1220895319 Ack=1484877190 Win=65536 Len=64860 TSval=646104760 TSecr=459787214
  605   1.987866      1000::1 → 2000::1      TCP 16488 57278 → 5060 [PSH, ACK] Seq=1220960179 Ack=1484877190 Win=65536 Len=16400 TSval=646104760 TSecr=459787214
  606   1.998231      1000::1 → 2000::1      TCP 1308 [TCP Retransmission] 57278 → 5060 [ACK] Seq=1220975359 Ack=1484877190 Win=65536 Len=1220 TSval=646104771 TSecr=459787214
  607   2.049288      1000::1 → 2000::1      TCP 64948 57278 → 5060 [PSH, ACK] Seq=1220976579 Ack=1484877190 Win=65536 Len=64860 TSval=646104822 TSecr=459787276
  608   2.049304      1000::1 → 2000::1      TCP 64948 57278 → 5060 [PSH, ACK] Seq=1221041439 Ack=1484877190 Win=65536 Len=64860 TSval=646104822 TSecr=459787276



diff --git a/net/core/gro_cells.c b/net/core/gro_cells.c
index ff8e5b64bf6b..06e6889138ba 100644
--- a/net/core/gro_cells.c
+++ b/net/core/gro_cells.c
@@ -20,7 +20,7 @@ int gro_cells_receive(struct gro_cells *gcells, struct sk_buff *skb)
 	if (unlikely(!(dev->flags & IFF_UP)))
 		goto drop;
 
-	if (!gcells->cells || skb_cloned(skb) || netif_elide_gro(dev)) {
+	if (!gcells->cells || netif_elide_gro(dev)) {
 		res = netif_rx(skb);
 		goto unlock;
 	}
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 2308665b51c5..66a2bb849e85 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -322,6 +322,12 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
 	if (!p)
 		goto out_check_final;
 
+	if (unlikely(skb_cloned(skb))) {
+		NAPI_GRO_CB(skb)->flush |= 1;
+		NAPI_GRO_CB(skb)->same_flow = 0;
+		return p;
+	}
+
 	th2 = tcp_hdr(p);
 	flush = (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index a5be6e4ed326..a9c85b0556ce 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -647,6 +647,11 @@ struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
 	struct sock *sk = NULL;
 	struct sk_buff *pp;
 
+	if (unlikely(skb_cloned(skb))) {
+		NAPI_GRO_CB(skb)->same_flow = 0;
+		goto flush;
+	}
+
 	if (unlikely(!uh))
 		goto flush;
 
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index b41152dd4246..b754747e3e8a 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -134,6 +134,11 @@ struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb)
 	struct sock *sk = NULL;
 	struct sk_buff *pp;
 
+	if (unlikely(skb_cloned(skb))) {
+		NAPI_GRO_CB(skb)->same_flow = 0;
+		goto flush;
+	}
+
 	if (unlikely(!uh))
 		goto flush;
 

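The ordering difference between bypassing GRO for a cloned skb and flushing the held segments of the flow first can be illustrated with a toy userspace model. This is only a sketch under simplifying assumptions: a single flow, no merging, and "GRO" reduced to a FIFO of held sequence numbers. All names (toy_gro, toy_receive, toy_run_in_order, the policy enum) are hypothetical and do not exist in the kernel. The FLUSH_THEN_PASS case is loosely modelled on the tcp_gro_receive() hunk above, where a cloned skb matching a held packet sets flush and returns that packet.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model. None of these types or functions exist in the kernel. */
enum clone_policy {
	BYPASS_GRO,      /* cloned segment goes straight up, held ones stay put */
	FLUSH_THEN_PASS  /* held segments of the flow are flushed first */
};

#define MAX_SEGS 8

struct toy_gro {
	enum clone_policy policy;
	unsigned int held[MAX_SEGS];      /* seqs still waiting in "GRO" */
	size_t n_held;
	unsigned int delivered[MAX_SEGS]; /* seqs in the order the stack sees them */
	size_t n_delivered;
};

static void deliver(struct toy_gro *g, unsigned int seq)
{
	assert(g->n_delivered < MAX_SEGS);
	g->delivered[g->n_delivered++] = seq;
}

/* Hand all held segments to the stack, oldest first. */
static void flush_held(struct toy_gro *g)
{
	for (size_t i = 0; i < g->n_held; i++)
		deliver(g, g->held[i]);
	g->n_held = 0;
}

static void toy_receive(struct toy_gro *g, unsigned int seq, int cloned)
{
	if (!cloned) {
		/* uncloned segments are held back for aggregation */
		assert(g->n_held < MAX_SEGS);
		g->held[g->n_held++] = seq;
		return;
	}
	if (g->policy == FLUSH_THEN_PASS)
		flush_held(g);
	deliver(g, seq);
}

/* Feed segments 1, 2 (uncloned), 3 (cloned), 4 (uncloned).
 * Returns 1 if the stack saw ascending sequence numbers, 0 otherwise. */
int toy_run_in_order(enum clone_policy policy)
{
	struct toy_gro g = { .policy = policy };

	toy_receive(&g, 1, 0);
	toy_receive(&g, 2, 0);
	toy_receive(&g, 3, 1);
	toy_receive(&g, 4, 0);
	flush_held(&g); /* end of the poll round */

	for (size_t i = 1; i < g.n_delivered; i++)
		if (g.delivered[i] < g.delivered[i - 1])
			return 0;
	return 1;
}
```

With this input, toy_run_in_order(BYPASS_GRO) returns 0 because segment 3 overtakes the held segments 1 and 2, while toy_run_in_order(FLUSH_THEN_PASS) returns 1.
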
What do you think about this approach?

Thomas.

-- 
SUSE Software Solutions Germany GmbH
HRB 36809 (AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew McDonald, Werner Knoblich
