Message-ID: <CANn89iLJOi+qempjx0AtWWtbX94nd4Hi9zRyFsdgmcKiq==N7Q@mail.gmail.com>
Date: Thu, 14 Dec 2023 10:05:19 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Salvatore Dipietro <dipiets@...zon.com>
Cc: alisaidi@...zon.com, benh@...zon.com, blakgeof@...zon.com,
davem@...emloft.net, dipietro.salvatore@...il.com, dsahern@...nel.org,
kuba@...nel.org, netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY
flag is set
On Thu, Dec 14, 2023 at 9:40 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Wed, Dec 13, 2023 at 10:30 PM Salvatore Dipietro <dipiets@...zon.com> wrote:
> >
> > > It looks like the above disables autocorking even after the userspace
> > > sets TCP_CORK. Am I reading it correctly? Is that expected?
> >
> > I have tested a new version of the patch which can target only TCP_NODELAY.
> > Results using previous benchmark are identical. I will submit it in a new
> > patch version.
>
> Well, I do not think we will accept a patch there, because you are
> basically working around the root cause for a certain variety of
> workloads.
>
> Issue would still be there for applications not using TCP_NODELAY
>
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -716,7 +716,8 @@
> >
> > tcp_mark_urg(tp, flags);
> >
> > - if (tcp_should_autocork(sk, skb, size_goal)) {
> > + if (!(nonagle & TCP_NAGLE_OFF) &&
> > + tcp_should_autocork(sk, skb, size_goal)) {
> >
> > /* avoid atomic op if TSQ_THROTTLED bit is already set */
> > if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
> >
> >
> >
> > > Also I wonder about these 40ms delays, TCP small queue handler should
> > > kick when the prior skb is TX completed.
> > >
> > > It seems the issue is on the driver side ?
> > >
> > > Salvatore, which driver are you using ?
> >
> > I am using ENA driver.
> >
> > Eric can you please clarify where do you think the problem is?
>
> The problem is that the TSQ logic is not working properly, probably
> because the driver holds a packet that has already been sent.
>
> TX completion seems to be delayed until the next transmit happens on
> the transmit queue.
>
> I suspect some kind of missed interrupt or a race.
>
> virtio_net is known to have a similar issue (not sure whether this
> has been fixed lately).
>
> The ena_io_poll() and ena_intr_msix_io() logic, playing with
> ena_napi->interrupts_masked, seems convoluted/risky to me.
>
> ena_start_xmit() also seems to have bugs vs xmit_more logic, but this
> is orthogonal.
>
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index c44c44e26ddfe74a93b7f1fb3c3ca90f978909e2..5282e718699ba9e64765bea2435e1c5a55aaa89b 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -3235,6 +3235,8 @@ static netdev_tx_t ena_start_xmit(struct sk_buff *skb, struct net_device *dev)
>
> error_drop_packet:
> dev_kfree_skb(skb);
> + /* Make sure to ring the doorbell. */
> + ena_ring_tx_doorbell(tx_ring);
> return NETDEV_TX_OK;
> }
ena_io_poll() has a race around
u64_stats_update_begin(&tx_ring->syncp)/u64_stats_update_end(&tx_ring->syncp).
These updates should be done by this thread while it still owns the NAPI
SCHED bit. Doing anything that might be racy after napi_complete_done() is
a bug. In this case, the race could make ena_get_stats64() spin forever.
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index c44c44e26ddfe74a93b7f1fb3c3ca90f978909e2..e3464adfd0b791af621c92a651125ced2ad2de8a 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2017,7 +2017,6 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
int tx_work_done;
int rx_work_done = 0;
int tx_budget;
- int napi_comp_call = 0;
int ret;
tx_ring = ena_napi->tx_ring;
@@ -2038,6 +2037,11 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
if (likely(budget))
rx_work_done = ena_clean_rx_irq(rx_ring, napi, budget);
+ u64_stats_update_begin(&tx_ring->syncp);
+ tx_ring->tx_stats.tx_poll++;
+ u64_stats_update_end(&tx_ring->syncp);
+ WRITE_ONCE(tx_ring->tx_stats.last_napi_jiffies, jiffies);
+
/* If the device is about to reset or down, avoid unmask
* the interrupt and return 0 so NAPI won't reschedule
*/
@@ -2047,7 +2051,9 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
ret = 0;
} else if ((budget > rx_work_done) && (tx_budget > tx_work_done)) {
- napi_comp_call = 1;
+ u64_stats_update_begin(&tx_ring->syncp);
+ tx_ring->tx_stats.napi_comp++;
+ u64_stats_update_end(&tx_ring->syncp);
/* Update numa and unmask the interrupt only when schedule
* from the interrupt context (vs from sk_busy_loop)
@@ -2071,13 +2077,6 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
ret = budget;
}
- u64_stats_update_begin(&tx_ring->syncp);
- tx_ring->tx_stats.napi_comp += napi_comp_call;
- tx_ring->tx_stats.tx_poll++;
- u64_stats_update_end(&tx_ring->syncp);
-
- tx_ring->tx_stats.last_napi_jiffies = jiffies;
-
return ret;
}
@@ -3235,6 +3234,8 @@ static netdev_tx_t ena_start_xmit(struct sk_buff *skb, struct net_device *dev)
error_drop_packet:
dev_kfree_skb(skb);
+ /* Make sure to ring the doorbell. */
+ ena_ring_tx_doorbell(tx_ring);
return NETDEV_TX_OK;
}