netdev - Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP

Message-ID: <CANn89iJwokqZC9P3Ycy4ZWpmT1QhC0qD79y1K1eg2UUAcAj-Lw@mail.gmail.com>
Date: Thu, 14 Dec 2023 17:07:02 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Geoff Blake <blakgeof@...zon.com>
Cc: Salvatore Dipietro <dipiets@...zon.com>, alisaidi@...zon.com, benh@...zon.com, 
	davem@...emloft.net, dipietro.salvatore@...il.com, dsahern@...nel.org, 
	kuba@...nel.org, netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY
 flag is set

On Thu, Dec 14, 2023 at 4:52 PM Geoff Blake <blakgeof@...zon.com> wrote:
>
> Thanks for helping dig in here Eric, but what is supposed to happen on TX
> completion? We're unfamiliar with TCP small queues besides finding your old
> LKML listing that states a tasklet is supposed to run if there is pending
> data.  So we need a bit more guidance if you could.
>
> I think it's supposed to call tcp_wfree() when the skb is destructed and
> that invokes the tasklet?  There is also sock_wfree(), which does not appear
> to have the linkage to the tasklet by design.
>
> We did attach probes at one point to look at whether there was a chance an
> interrupt went missing (but don't have them on-hand anymore), but we
> always saw the TX completion happen. When the 40ms latency happened
> we'd see that the completion had happened just after the other packet decided to
> be corked.  But it certainly doesn't hurt to double check.

When TX completion happens while autocorking was set up, the TSQ_THROTTLED
bit has been set on sk->sk_tsq_flags, so the TSQ logic should call
tcp_tsq_handler() -> tcp_tsq_write() -> tcp_write_xmit()

tcp_write_xmit() should send the pending packet (if CWND and other
constraints allow this)

Autocorking is all about giving the application a chance to
add more bytes to the pending skb before TX completion happens
(typically in less than 100 usec on an idle qdisc/NIC)

If your life depends on not waiting for this delay, you have two options:

1) use MSG_EOR
2) disable autocorking (/proc/sys/net/ipv4/tcp_autocorking)
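
For illustration, here is a minimal userspace sketch of the two options
above, in Python on Linux. The loopback server, the helper names
(autocorking_enabled, send_record, demo) and the 4-byte payload are
assumptions made for the example and are not taken from this thread.
It only reads the sysctl; writing 0 to it needs root.

```python
import socket

AUTOCORK_SYSCTL = "/proc/sys/net/ipv4/tcp_autocorking"


def autocorking_enabled(path=AUTOCORK_SYSCTL):
    """Return True/False for the sysctl value, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip() != "0"
    except OSError:
        return None


def send_record(sock, payload):
    """Option 1: send payload with the MSG_EOR flag set on this send call."""
    return sock.send(payload, socket.MSG_EOR)


def recv_exact(sock, n):
    """Read exactly n bytes (or fewer if the peer closes early)."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            break
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def demo():
    """Loopback round trip: TCP_NODELAY client sending one MSG_EOR record."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # hypothetical test endpoint, any free port
    srv.listen(1)
    cli = socket.create_connection(srv.getsockname())
    cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    conn, _ = srv.accept()
    try:
        sent = send_record(cli, b"ping")
        data = recv_exact(conn, sent)
    finally:
        conn.close()
        cli.close()
        srv.close()
    return sent, data


if __name__ == "__main__":
    print("autocorking enabled:", autocorking_enabled())
    print(demo())
```

Option 2 is system-wide, while MSG_EOR is chosen per send call, so the
flag is the narrower of the two knobs.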



>
> - Geoff Blake
>
> On Thu, 14 Dec 2023, Eric Dumazet wrote:
>
> >
> > On Wed, Dec 13, 2023 at 10:30 PM Salvatore Dipietro <dipiets@...zon.com> wrote:
> > >
> > > > It looks like the above disables autocorking even after the userspace
> > > > sets TCP_CORK. Am I reading it correctly? Is that expected?
> > >
> > > I have tested a new version of the patch which can target only TCP_NODELAY.
> > > Results using previous benchmark are identical. I will submit it in a new
> > > patch version.
> > >
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -716,7 +716,8 @@
> > >
> > >         tcp_mark_urg(tp, flags);
> > >
> > > -       if (tcp_should_autocork(sk, skb, size_goal)) {
> > > +       if (!(nonagle & TCP_NAGLE_OFF) &&
> > > +           tcp_should_autocork(sk, skb, size_goal)) {
> > >
> > >                 /* avoid atomic op if TSQ_THROTTLED bit is already set */
> > >                 if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
> > >
> > >
> > >
> > > > Also I wonder about these 40ms delays, TCP small queue handler should
> > > > kick when the prior skb is TX completed.
> > > >
> > > > It seems the issue is on the driver side ?
> > > >
> > > > Salvatore, which driver are you using ?
> > >
> > > I am using ENA driver.
> > >
> > > Eric, can you please clarify where you think the problem is?
> > >
> >
> > The following bpftrace program can double-check whether the ena driver
> > is holding TCP skbs too long:
> >
> > bpftrace -e 'k:dev_hard_start_xmit {
> >  $skb = (struct sk_buff *)arg0;
> >  // fclone == 2 is SKB_FCLONE_CLONE: the clone TCP hands to the driver
> >  if ($skb->fclone == 2) {
> >   @start[$skb] = nsecs;
> >  }
> > }
> > k:__kfree_skb {
> >  $skb = (struct sk_buff *)arg0;
> >  // skb freed after TX completion: record elapsed time in usec
> >  if ($skb->fclone == 2 && @start[$skb]) {
> >   @tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
> >   delete(@start[$skb]);
> >  }
> > } END { clear(@start); }'
> >
> > iroa21:/home/edumazet# ./trace-tx-completion.sh
> > Attaching 3 probes...
> > ^C
> >
> >
> > @tx_compl_usecs:
> > [2, 4)                13 |                                                    |
> > [4, 8)               182 |                                                    |
> > [8, 16)          2379007 |@@@@@@@@@@@@@@@                                     |
> > [16, 32)         7865369 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32, 64)         6040939 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
> > [64, 128)         199255 |@                                                   |
> > [128, 256)          9235 |                                                    |
> > [256, 512)            89 |                                                    |
> > [512, 1K)             37 |                                                    |
> > [1K, 2K)              19 |                                                    |
> > [2K, 4K)              56 |                                                    |
> >
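
As a quick reading of the histogram above, the following Python sketch
sums the buckets and reports how many completions finished below a given
bound. The bucket counts are copied from the output above; the variable
and function names are made up for the example.

```python
# (lower_usec, upper_usec, count) per bucket, copied from @tx_compl_usecs.
HIST = [
    (2, 4, 13),
    (4, 8, 182),
    (8, 16, 2379007),
    (16, 32, 7865369),
    (32, 64, 6040939),
    (64, 128, 199255),
    (128, 256, 9235),
    (256, 512, 89),
    (512, 1024, 37),
    (1024, 2048, 19),
    (2048, 4096, 56),
]


def total_count(hist):
    """Total number of TX completions recorded."""
    return sum(count for _, _, count in hist)


def count_below(hist, limit_usec):
    """Completions in buckets whose upper bound is <= limit_usec."""
    return sum(count for _, hi, count in hist if hi <= limit_usec)


if __name__ == "__main__":
    total = total_count(HIST)
    fast = count_below(HIST, 128)
    print(f"total={total} below_128us={fast} ({100.0 * fast / total:.3f}%)")
    print(f"at_or_above_1ms={total - count_below(HIST, 1024)}")
```

On this sample more than 99.9% of completions land below 128 usec, and
75 of them take 1 ms or longer.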