lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4b3ede21-27cf-bc17-be71-4c388e670f2c@kernel.org>
Date: Tue, 29 Oct 2024 23:17:40 +0200 (EET)
From: Ilpo Järvinen <ij@...nel.org>
To: Paolo Abeni <pabeni@...hat.com>
cc: chia-yu.chang@...ia-bell-labs.com, netdev@...r.kernel.org, 
    davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org, 
    dsahern@...nel.org, netfilter-devel@...r.kernel.org, kadlec@...filter.org, 
    coreteam@...filter.org, pablo@...filter.org, bpf@...r.kernel.org, 
    joel.granados@...nel.org, linux-fsdevel@...r.kernel.org, kees@...nel.org, 
    mcgrof@...nel.org, ncardwell@...gle.com, 
    koen.de_schepper@...ia-bell-labs.com, g.white@...leLabs.com, 
    ingemar.s.johansson@...csson.com, mirja.kuehlewind@...csson.com, 
    cheshire@...le.com, rs.ietf@....at, Jason_Livingood@...cast.com, 
    vidhi_goel@...le.com
Subject: Re: [PATCH v4 net-next 09/14] gro: prevent ACE field corruption &
 better AccECN handling

On Tue, 29 Oct 2024, Paolo Abeni wrote:
> On 10/21/24 23:59, chia-yu.chang@...ia-bell-labs.com wrote:
> > From: Ilpo Järvinen <ij@...nel.org>
> > 
> > There are important differences in how the CWR field behaves
> > in RFC3168 and AccECN. With AccECN, CWR flag is part of the
> > ACE counter and its changes are important so adjust the flags
> > changed mask accordingly.
> > 
> > Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> > corrupting CWR flag somewhere.
> > 
> > Signed-off-by: Ilpo Järvinen <ij@...nel.org>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@...ia-bell-labs.com>
> > ---
> >  net/ipv4/tcp_offload.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> > index 0b05f30e9e5f..f59762d88c38 100644
> > --- a/net/ipv4/tcp_offload.c
> > +++ b/net/ipv4/tcp_offload.c
> > @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> >  	th2 = tcp_hdr(p);
> >  	flush = (__force int)(flags & TCP_FLAG_CWR);
> >  	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> > -		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> > +		  ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> 
> If I read correctly, if the peer is using RFC3168 and TSO_ECN, GRO will
> now pump into the stack twice the number of packets it was doing prior
> to this patch, am I correct?
> 
> That is likely causing measurable performance regressions.

Hi Paolo,

Thanks for taking a look!

While it's true on surface that this might cause some more packets with 
RFC3168 (by design, as network cannot know if the sender is using RFC3168 
or not), the important question is the scale how many of extra packets 
will occur in practice.

First of all, RFC3168 requires CWR flag to be sent no more frequently 
than once per window of data, or in other words, once per RTT. And that
means just one packet, not e.g. all packets of a super-skb (the RFC3168
signalling will lose its integrity if this is violated by the sender).

Secondly, the TCP sender uses CWR flag to indicate it just halved its 
congestion window which mean it is sending half the amount of packets in 
this window than in the previous window (analoguous to halving sending 
rate). 2 RTTs with CWR each means two window reductions (this behavior 
is spec'ed in RFC3168).

So lets say the sender was using 100 packets congestion window, this 
change will add one packet to 50 packets on this next RTT. Note those are
raw numbers of packets on wire and do not tell how many packets GRO 
combined into each super-skb which will wary case-by-case basis. 
Regardless, I suspect the extra packet added to the half of the packets 
will be hard/impossible to measure to cause a performance regression.

This change would double the number of packets only if the congestion 
window is 1 or 2 packets and in that case TSO/GSO/GRO benefits will be 
pretty small to begin with (or even counterproductive). Also, the 
traditional TCP congestion control (RFC3168 included) has many issues 
anyway with that small windows because it doesn't deal with fractional 
congestion windows well.

> >  	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> >  	for (i = sizeof(*th); i < thlen; i += 4)
> >  		flush |= *(u32 *)((u8 *)th + i) ^
> > @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> >  	shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> >  
> >  	if (th->cwr)
> > -		shinfo->gso_type |= SKB_GSO_TCP_ECN;
> > +		shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> 
> If this packet is forwarded, it will not leverage TSO anymore - with
> current H/W.
> 
> I think we need a way to enable this feature conditionally, but I fear
> another sysctl will be ugly and the additional conditionals will not be
> good for GRO.
>
> Smarter suggestions welcome ;)

Well, it is already very selectively _conditional_, SKB_GSO_TCP_ACCECN is 
only set for the skb when CWR is set. That is, once per RTT (data window) 
when it comes to RFC3168. 

I don't have any source for this (other than reading many many tcpdumps 
in the past) but I believe the percentage of packets with CWR set (due to 
RFC3168 signalling) is going to be very small overall.

Do you think that is not good enough?

To answer more generally to your suggestion on making it conditional based 
on some other logic, it would mean you accept network middleboxes are 
allowed to corrupt AccECN ACE field when forwarding. If RFC3168 TSO/GSO 
trickery remains in use (without a middlebox explicitly tracking the 
connection had negotiated RFC3168), a forwarder won't be able to reproduce 
the exactly same stream of TCP packets headers thus corrupting non-RFC3168 
use of CWR flag. It's not something any middlebox should be doing (I hope 
we agree on this as a general principle)!

-- 
 i.
 
> Cheers,
> 
> Paolo
> 
> >  }
> >  EXPORT_SYMBOL(tcp_gro_complete);
> >  
> 
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ