Message-ID: <20250129123129.0c102586@samweis>
Date: Wed, 29 Jan 2025 12:31:29 +0100
From: Thomas Bogendoerfer <tbogendoerfer@...e.de>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Paolo Abeni <pabeni@...hat.com>, "David S. Miller"
<davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, Simon Horman
<horms@...nel.org>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 net] gro_cells: Avoid packet re-ordering for cloned
skbs
On Thu, 23 Jan 2025 11:43:05 +0100
Eric Dumazet <edumazet@...gle.com> wrote:
> On Thu, Jan 23, 2025 at 11:42 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >
> > On 1/23/25 11:07 AM, Eric Dumazet wrote:
> > > On Thu, Jan 23, 2025 at 9:43 AM Paolo Abeni <pabeni@...hat.com> wrote:
> > >> On 1/21/25 12:50 PM, Thomas Bogendoerfer wrote:
> > >>> gro_cells_receive() passes a cloned skb directly up the stack and
> > >>> could cause re-ordering against segments still in GRO. To avoid
> > >>> this, queue cloned skbs and use gro_normal_one() to pass them during
> > >>> normal NAPI work.
> > >>>
> > >>> Fixes: c9e6bc644e55 ("net: add gro_cells infrastructure")
> > >>> Suggested-by: Eric Dumazet <edumazet@...gle.com>
> > >>> Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@...e.de>
> > >>> --
> > >>> v2: don't use skb_copy(), but decide how to pass cloned skbs in the
> > >>> napi poll function (suggested by Eric)
> > >>> v1: https://lore.kernel.org/lkml/20250109142724.29228-1-tbogendoerfer@suse.de/
> > >>>
> > >>> net/core/gro_cells.c | 9 +++++++--
> > >>> 1 file changed, 7 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/net/core/gro_cells.c b/net/core/gro_cells.c
> > >>> index ff8e5b64bf6b..762746d18486 100644
> > >>> --- a/net/core/gro_cells.c
> > >>> +++ b/net/core/gro_cells.c
> > >>> @@ -2,6 +2,7 @@
> > >>>  #include <linux/skbuff.h>
> > >>>  #include <linux/slab.h>
> > >>>  #include <linux/netdevice.h>
> > >>> +#include <net/gro.h>
> > >>>  #include <net/gro_cells.h>
> > >>>  #include <net/hotdata.h>
> > >>>
> > >>> @@ -20,7 +21,7 @@ int gro_cells_receive(struct gro_cells *gcells, struct sk_buff *skb)
> > >>>  	if (unlikely(!(dev->flags & IFF_UP)))
> > >>>  		goto drop;
> > >>>
> > >>> -	if (!gcells->cells || skb_cloned(skb) || netif_elide_gro(dev)) {
> > >>> +	if (!gcells->cells || netif_elide_gro(dev)) {
> > >>>  		res = netif_rx(skb);
> > >>>  		goto unlock;
> > >>>  	}
> > >>> @@ -58,7 +59,11 @@ static int gro_cell_poll(struct napi_struct *napi, int budget)
> > >>>  		skb = __skb_dequeue(&cell->napi_skbs);
> > >>>  		if (!skb)
> > >>>  			break;
> > >>> -		napi_gro_receive(napi, skb);
> > >>> +		/* Core GRO stack does not play well with clones. */
> > >>> +		if (skb_cloned(skb))
> > >>> +			gro_normal_one(napi, skb, 1);
> > >>> +		else
> > >>> +			napi_gro_receive(napi, skb);
> > >>
> > >> I must admit it's not clear to me how/why the above will avoid OoO. I
> > >> assume OoO happens when we observe both cloned and uncloned packets
> > >> belonging to the same connection/flow.
> > >>
> > >> What if we have a (uncloned) packet for the relevant flow in the GRO,
> > >> 'rx_count - 1' packets already sitting in 'rx_list' and a cloned packet
> > >> for the critical flow reaches gro_cells_receive()?
> > >>
> > >> Don't we need to unconditionally flush any packets belonging to the same
> > >> flow?
> > >
> > > It would only matter if we had 2 or more segments that would belong
> > > to the same flow and packet train (potential 'GRO super packet'), with
> > > the 'cloned'
> > > status being of mixed value on various segments.
> > >
> > > In practice, the cloned status will be the same for all segments.
> >
> > I agree with the above, but my doubt is: does the above also mean that
> > in practice there are no OoO to deal with, even without this patch?
> >
> > To rephrase my doubt: which scenario is addressed by this patch that
> > would lead to OoO without it?
>
> Fair point, a detailed changelog would be really nice.
My test scenario is simple:
TCP Sender in namespace A -> ip6_tunnel -> ipvlan -> ipvlan -> ip6_tunnel -> TCP receiver
Sender does continuous writes in 15k chunks, receiver reads data from the socket in a loop.
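In case it helps with reproducing, here is a minimal sketch of the kind of
test program I mean (reconstructed for illustration, not the exact tool I
used; the port number just matches the traces):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT	5060		/* the port seen in the traces below */
#define CHUNK	(15 * 1024)	/* sender writes in 15k chunks */

int main(int argc, char **argv)
{
	static char buf[CHUNK];
	struct sockaddr_in6 a = { .sin6_family = AF_INET6,
				  .sin6_port = htons(PORT) };
	int s = socket(AF_INET6, SOCK_STREAM, 0);

	if (argc > 1) {		/* sender: tcptest <receiver addr> */
		if (inet_pton(AF_INET6, argv[1], &a.sin6_addr) != 1 ||
		    connect(s, (struct sockaddr *)&a, sizeof(a)))
			return perror("connect"), 1;
		for (;;)	/* continuous writes in 15k chunks */
			if (write(s, buf, sizeof(buf)) < 0)
				return perror("write"), 1;
	} else {		/* receiver: tcptest */
		int c;

		a.sin6_addr = in6addr_any;
		if (bind(s, (struct sockaddr *)&a, sizeof(a)) ||
		    listen(s, 1) || (c = accept(s, NULL, NULL)) < 0)
			return perror("accept"), 1;
		while (read(c, buf, sizeof(buf)) > 0)
			;	/* read from the socket in a loop */
	}
	return 0;
}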
With that traffic running, this is what I see:
40 0.002766 1000::1 → 2000::1 TCP 15088 51238 → 5060 [PSH, ACK] Seq=2862576060 Ack=1152583678 Win=65536 Len=15000 TSval=3343493494 TSecr=629171944
41 0.002844 1000::1 → 2000::1 TCP 9816 51238 → 5060 [PSH, ACK] Seq=2862591060 Ack=1152583678 Win=65536 Len=9728 TSval=3343493494 TSecr=629171944
42 0.004122 1000::1 → 2000::1 TCP 1468 [TCP Previous segment not captured] 51238 → 5060 [ACK] Seq=2862642188 Ack=1152583678 Win=65536 Len=1380 TSval=3343493496 TSecr=629171946
43 0.004128 1000::1 → 2000::1 TCP 20788 [TCP Out-Of-Order] 51238 → 5060 [PSH, ACK] Seq=2862600788 Ack=1152583678 Win=65536 Len=20700 TSval=3343493496 TSecr=629171946
44 0.004133 1000::1 → 2000::1 TCP 20788 [TCP Out-Of-Order] 51238 → 5060 [PSH, ACK] Seq=2862621488 Ack=1152583678 Win=65536 Len=20700 TSval=3343493496 TSecr=629171946
45 0.004169 1000::1 → 2000::1 TCP 500 [TCP Previous segment not captured] 51238 → 5060 [PSH, ACK] Seq=2862665648 Ack=1152583678 Win=65536 Len=412 TSval=3343493496 TSecr=629171946
46 0.004180 1000::1 → 2000::1 TCP 22168 [TCP Out-Of-Order] 51238 → 5060 [PSH, ACK] Seq=2862643568 Ack=1152583678 Win=65536 Len=22080 TSval=3343493496 TSecr=629171946
47 0.004187 1000::1 → 2000::1 TCP 13888 51238 → 5060 [PSH, ACK] Seq=2862666060 Ack=1152583678 Win=65536 Len=13800 TSval=3343493496 TSecr=629171946
48 0.004201 1000::1 → 2000::1 TCP 1288 51238 → 5060 [PSH, ACK] Seq=2862679860 Ack=1152583678 Win=65536 Len=1200 TSval=3343493496 TSecr=629171946
49 0.004273 1000::1 → 2000::1 TCP 13888 51238 → 5060 [PSH, ACK] Seq=2862681060 Ack=1152583678 Win=65536 Len=13800 TSval=3343493496 TSecr=629171946
IMHO these OoO segments are retransmits for segments still sitting in GRO.
With the v2 patch applied the trace looks like this:
2856 9.526256 1000::1 → 2000::1 TCP 64948 50452 → 5060 [PSH, ACK] Seq=1871837193 Ack=209151777 Win=65536 Len=64860 TSval=2755210164 TSecr=2795235137
2857 9.526258 1000::1 → 2000::1 TCP 5480 50452 → 5060 [PSH, ACK] Seq=1871902053 Ack=209151777 Win=65536 Len=5392 TSval=2755210164 TSecr=2795235137
2858 9.535262 1000::1 → 2000::1 TCP 1340 [TCP Retransmission] 50452 → 5060 [ACK] Seq=1871906193 Ack=209151777 Win=65536 Len=1252 TSval=2755210174 TSecr=2795235137
2859 9.585477 1000::1 → 2000::1 TCP 64948 50452 → 5060 [PSH, ACK] Seq=1871907445 Ack=209151777 Win=65536 Len=64860 TSval=2755210224 TSecr=2795235197
2860 9.585486 1000::1 → 2000::1 TCP 64948 50452 → 5060 [PSH, ACK] Seq=1871972305 Ack=209151777 Win=65536 Len=64860 TSval=2755210224 TSecr=2795235197
Looks ok to me, but without a GRO flush there is still a chance of OoO packets.
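The reason is that gro_normal_one() (roughly as in include/net/gro.h,
quoted from memory here) only appends the skb to the per-NAPI batch list;
it knows nothing about segments of the same flow still held in the GRO
hash:

static inline void gro_normal_one(struct napi_struct *napi,
				  struct sk_buff *skb, int segs)
{
	list_add_tail(&skb->list, &napi->rx_list);
	napi->rx_count += segs;
	if (napi->rx_count >= READ_ONCE(net_hotdata.gro_normal_batch))
		gro_normal_list(napi);
}

So a cloned skb queued this way can still overtake an older segment of the
same flow that GRO is holding back.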
I've worked on a new patch (below as an RFC) which pushes the skb_cloned()
check into GRO, so that a cloned skb flushes a matching segment GRO already
holds and is itself passed up unmerged. The result is comparable to the v2
patch:
604 1.987863 1000::1 → 2000::1 TCP 64948 57278 → 5060 [PSH, ACK] Seq=1220895319 Ack=1484877190 Win=65536 Len=64860 TSval=646104760 TSecr=459787214
605 1.987866 1000::1 → 2000::1 TCP 16488 57278 → 5060 [PSH, ACK] Seq=1220960179 Ack=1484877190 Win=65536 Len=16400 TSval=646104760 TSecr=459787214
606 1.998231 1000::1 → 2000::1 TCP 1308 [TCP Retransmission] 57278 → 5060 [ACK] Seq=1220975359 Ack=1484877190 Win=65536 Len=1220 TSval=646104771 TSecr=459787214
607 2.049288 1000::1 → 2000::1 TCP 64948 57278 → 5060 [PSH, ACK] Seq=1220976579 Ack=1484877190 Win=65536 Len=64860 TSval=646104822 TSecr=459787276
608 2.049304 1000::1 → 2000::1 TCP 64948 57278 → 5060 [PSH, ACK] Seq=1221041439 Ack=1484877190 Win=65536 Len=64860 TSval=646104822 TSecr=459787276
diff --git a/net/core/gro_cells.c b/net/core/gro_cells.c
index ff8e5b64bf6b..06e6889138ba 100644
--- a/net/core/gro_cells.c
+++ b/net/core/gro_cells.c
@@ -20,7 +20,7 @@ int gro_cells_receive(struct gro_cells *gcells, struct sk_buff *skb)
 	if (unlikely(!(dev->flags & IFF_UP)))
 		goto drop;
 
-	if (!gcells->cells || skb_cloned(skb) || netif_elide_gro(dev)) {
+	if (!gcells->cells || netif_elide_gro(dev)) {
 		res = netif_rx(skb);
 		goto unlock;
 	}
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 2308665b51c5..66a2bb849e85 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -322,6 +322,12 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
 	if (!p)
 		goto out_check_final;
 
+	if (unlikely(skb_cloned(skb))) {
+		NAPI_GRO_CB(skb)->flush |= 1;
+		NAPI_GRO_CB(skb)->same_flow = 0;
+		return p;
+	}
+
 	th2 = tcp_hdr(p);
 	flush = (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index a5be6e4ed326..a9c85b0556ce 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -647,6 +647,11 @@ struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
 	struct sock *sk = NULL;
 	struct sk_buff *pp;
 
+	if (unlikely(skb_cloned(skb))) {
+		NAPI_GRO_CB(skb)->same_flow = 0;
+		goto flush;
+	}
+
 	if (unlikely(!uh))
 		goto flush;
 
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index b41152dd4246..b754747e3e8a 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -134,6 +134,11 @@ struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb)
 	struct sock *sk = NULL;
 	struct sk_buff *pp;
 
+	if (unlikely(skb_cloned(skb))) {
+		NAPI_GRO_CB(skb)->same_flow = 0;
+		goto flush;
+	}
+
 	if (unlikely(!uh))
 		goto flush;
 
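To explain why this flushes the flow: when a gro_receive callback returns
a non-NULL skb, dev_gro_receive() completes it immediately (sketched from
net/core/gro.c, from memory, so details may differ):

	pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
				ipv6_gro_receive, inet_gro_receive,
				&gro_list->list, skb);
	...
	if (pp) {
		skb_list_del_init(pp);
		napi_gro_complete(napi, pp);
		...
	}

So returning p in tcp_gro_receive() pushes the held segment of the flow up
the stack first, while flush/same_flow make sure the cloned skb itself
takes the normal path without being merged or held.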
What do you think about this approach?
Thomas.
--
SUSE Software Solutions Germany GmbH
HRB 36809 (AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew McDonald, Werner Knoblich