Message-ID: <CANn89iLJOi+qempjx0AtWWtbX94nd4Hi9zRyFsdgmcKiq==N7Q@mail.gmail.com>
Date: Thu, 14 Dec 2023 10:05:19 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Salvatore Dipietro <dipiets@...zon.com>
Cc: alisaidi@...zon.com, benh@...zon.com, blakgeof@...zon.com,
davem@...emloft.net, dipietro.salvatore@...il.com, dsahern@...nel.org,
kuba@...nel.org, netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY
flag is set
On Thu, Dec 14, 2023 at 9:40 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Wed, Dec 13, 2023 at 10:30 PM Salvatore Dipietro <dipiets@...zon.com> wrote:
> >
> > > It looks like the above disables autocorking even after the userspace
> > > sets TCP_CORK. Am I reading it correctly? Is that expected?
> >
> > I have tested a new version of the patch which can target only TCP_NODELAY.
> > Results using previous benchmark are identical. I will submit it in a new
> > patch version.
>
> Well, I do not think we will accept a patch there, because you are
> basically working around the root cause for a certain variety of
> workloads.
>
> Issue would still be there for applications not using TCP_NODELAY
>
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -716,7 +716,8 @@
> >
> > tcp_mark_urg(tp, flags);
> >
> > - if (tcp_should_autocork(sk, skb, size_goal)) {
> > + if (!(nonagle & TCP_NAGLE_OFF) &&
> > + tcp_should_autocork(sk, skb, size_goal)) {
> >
> > /* avoid atomic op if TSQ_THROTTLED bit is already set */
> > if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
> >
> >
> >
> > > Also I wonder about these 40ms delays, TCP small queue handler should
> > > kick when the prior skb is TX completed.
> > >
> > > It seems the issue is on the driver side ?
> > >
> > > Salvatore, which driver are you using ?
> >
> > I am using ENA driver.
> >
> > Eric can you please clarify where do you think the problem is?
>
> The problem is that the TSQ logic is not working properly, probably
> because the driver holds a packet that has already been sent.
>
> TX completion seems to be delayed until the next transmit happens on
> the transmit queue.
>
> I suspect some kind of missed interrupt or a race.
>
> virtio_net is known to have a similar issue (not sure whether this
> has been fixed lately).
>
> The ena_io_poll() and ena_intr_msix_io() logic, playing with
> ena_napi->interrupts_masked, seems convoluted/risky to me.
>
> ena_start_xmit() also seems to have bugs vs xmit_more logic, but this
> is orthogonal.
>
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index c44c44e26ddfe74a93b7f1fb3c3ca90f978909e2..5282e718699ba9e64765bea2435e1c5a55aaa89b 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -3235,6 +3235,8 @@ static netdev_tx_t ena_start_xmit(struct sk_buff *skb, struct net_device *dev)
>
> error_drop_packet:
> dev_kfree_skb(skb);
> + /* Make sure to ring the doorbell. */
> + ena_ring_tx_doorbell(tx_ring);
> return NETDEV_TX_OK;
> }
ena_io_poll() has a race around
u64_stats_update_begin(&tx_ring->syncp)/u64_stats_update_end(&tx_ring->syncp).
These updates should be done by this thread while it still owns the NAPI
SCHED bit. Doing anything that might be racy after napi_complete_done() is
a bug. In this case, the race could make ena_get_stats64() spin forever.
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index c44c44e26ddfe74a93b7f1fb3c3ca90f978909e2..e3464adfd0b791af621c92a651125ced2ad2de8a 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2017,7 +2017,6 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
int tx_work_done;
int rx_work_done = 0;
int tx_budget;
- int napi_comp_call = 0;
int ret;
tx_ring = ena_napi->tx_ring;
@@ -2038,6 +2037,11 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
if (likely(budget))
rx_work_done = ena_clean_rx_irq(rx_ring, napi, budget);
+ u64_stats_update_begin(&tx_ring->syncp);
+ tx_ring->tx_stats.tx_poll++;
+ u64_stats_update_end(&tx_ring->syncp);
+ WRITE_ONCE(tx_ring->tx_stats.last_napi_jiffies, jiffies);
+
/* If the device is about to reset or down, avoid unmask
* the interrupt and return 0 so NAPI won't reschedule
*/
@@ -2047,7 +2051,9 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
ret = 0;
} else if ((budget > rx_work_done) && (tx_budget > tx_work_done)) {
- napi_comp_call = 1;
+ u64_stats_update_begin(&tx_ring->syncp);
+ tx_ring->tx_stats.napi_comp++;
+ u64_stats_update_end(&tx_ring->syncp);
/* Update numa and unmask the interrupt only when schedule
* from the interrupt context (vs from sk_busy_loop)
@@ -2071,13 +2077,6 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
ret = budget;
}
- u64_stats_update_begin(&tx_ring->syncp);
- tx_ring->tx_stats.napi_comp += napi_comp_call;
- tx_ring->tx_stats.tx_poll++;
- u64_stats_update_end(&tx_ring->syncp);
-
- tx_ring->tx_stats.last_napi_jiffies = jiffies;
-
return ret;
}
@@ -3235,6 +3234,8 @@ static netdev_tx_t ena_start_xmit(struct sk_buff *skb, struct net_device *dev)
error_drop_packet:
dev_kfree_skb(skb);
+ /* Make sure to ring the doorbell. */
+ ena_ring_tx_doorbell(tx_ring);
return NETDEV_TX_OK;
}