Message-ID: <f3868a5f9abe263f4ebebd21382cd022afa6a029.camel@calian.com>
Date: Wed, 11 May 2022 17:16:55 +0000
From: Robert Hancock <robert.hancock@...ian.com>
To: "kuba@...nel.org" <kuba@...nel.org>
CC: "pabeni@...hat.com" <pabeni@...hat.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"davem@...emloft.net" <davem@...emloft.net>,
"michal.simek@...inx.com" <michal.simek@...inx.com>,
"radhey.shyam.pandey@...inx.com" <radhey.shyam.pandey@...inx.com>,
"edumazet@...gle.com" <edumazet@...gle.com>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>
Subject: Re: [PATCH net-next v5] net: axienet: Use NAPI for TX completion path
On Tue, 2022-05-10 at 18:56 -0700, Jakub Kicinski wrote:
> On Mon, 9 May 2022 11:30:39 -0600 Robert Hancock wrote:
> > This driver was using the TX IRQ handler to perform all TX completion
> > tasks. Under heavy TX network load, this can cause significant irqs-off
> > latencies (found to be in the hundreds of microseconds using ftrace).
> > This can cause other issues, such as overrunning serial UART FIFOs when
> > using high baud rates with limited UART FIFO sizes.
> >
> > Switch to using a NAPI poll handler to perform the TX completion work
> > to get this out of hard IRQ context and avoid the IRQ latency impact.
> > A separate poll handler is used for TX and RX since they have separate
> > IRQs on this controller, so that the completion work for each of them
> > stays on the same CPU as the interrupt.
> >
> > Testing on a Xilinx MPSoC ZU9EG platform using iperf3 from a Linux PC
> > through a switch at 1G link speed showed no significant change in TX or
> > RX throughput, with approximately 941 Mbps before and after. Hard IRQ
> > time in the TX throughput test was significantly reduced from 12% to
> > below 1% on the CPU handling TX interrupts, with total hard+soft IRQ CPU
> > usage dropping from about 56% down to 48%.
> >
> > Signed-off-by: Robert Hancock <robert.hancock@...ian.com>
> > ---
> >
> > Changed since v4: Added locking to protect TX ring tail pointer against
> > concurrent access by TX transmit and TX poll paths.
>
> Hi, sorry for a late reply there's just too many patches to look at
> lately.
>
> The lock is slightly concerning, the driver follows the usual wake up
> scheme based on memory barriers. If we add the lock we should probably
> take the barriers out.

So there are basically two places where there is contention: axienet_start_xmit,
which advances the tail pointer after adding more entries to the TX ring, and
the TX poll function, which calls axienet_check_tx_bd_space and uses the tail
pointer to see whether there is enough space in the TX ring to wake the queue.
I suppose barriers are likely sufficient if the code updating the ring pointer
is more careful about how it does so. For example, the snippet quoted below
increments the pointer and then resets it to 0 if it is past the end of the
ring; this would need to change to update the pointer only once, with no
intermediate state where it holds an invalid position.
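
As a rough illustration of the single-update idea, here is a small userspace
model rather than the actual driver change. The struct tx_ring type and both
helper names are hypothetical, and C11 atomics stand in for what would be
WRITE_ONCE()/READ_ONCE() plus the appropriate barriers in the kernel. The
wrapped index is computed in a local variable and published with one store, so
a concurrent reader can never observe tail == num:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's private state; only the fields
 * needed for this illustration are modelled. */
struct tx_ring {
	_Atomic uint32_t tail;	/* written by the xmit path, read by the poll path */
	uint32_t num;		/* number of descriptors in the ring */
};

/* xmit side: compute the wrapped index locally, then publish it with a
 * single store. There is no window in which tail holds an out-of-range value. */
static void tx_ring_advance_tail(struct tx_ring *r)
{
	uint32_t next;

	assert(r->num > 0);
	next = atomic_load_explicit(&r->tail, memory_order_relaxed) + 1;
	if (next >= r->num)
		next = 0;
	atomic_store_explicit(&r->tail, next, memory_order_release);
}

/* poll side: read the index exactly once and work from that snapshot. */
static uint32_t tx_ring_read_tail(struct tx_ring *r)
{
	return atomic_load_explicit(&r->tail, memory_order_acquire);
}
```

Whether release/acquire ordering is the right strength for the real driver
depends on what else the poll path reads; the sketch only shows the
"one store, always valid" property.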

However, I think the stability issue I saw earlier was not actually due to
these changes, but to similar changes in v1 of the "net: macb: use NAPI for TX
completion path" patch. That driver previously relied on the TX completion path
being protected by a spinlock taken in the IRQ handler, and that protection was
lost when the TX completion was moved to a poll function.
>
> We can also try to avoid the lock and drill into what the issue is.
> At a quick look it seems like there is a barrier missing between setup
> of the descriptors and kicking the transfer off:
>
> diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
> index d6fc3f7acdf0..9e244b73a0ca 100644
> --- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
> +++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
> @@ -878,10 +878,11 @@ axienet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  	cur_p->skb = skb;
>
>  	tail_p = lp->tx_bd_p + sizeof(*lp->tx_bd_v) * lp->tx_bd_tail;
> -	/* Start the transfer */
> -	axienet_dma_out_addr(lp, XAXIDMA_TX_TDESC_OFFSET, tail_p);
>  	if (++lp->tx_bd_tail >= lp->tx_bd_num)
>  		lp->tx_bd_tail = 0;
> +	wmb(); // possibly dma_wmb()
I think the MMIO write in axienet_dma_out_addr is supposed to be an implicit
barrier, so that shouldn't be needed?
> +	/* Start the transfer */
> +	axienet_dma_out_addr(lp, XAXIDMA_TX_TDESC_OFFSET, tail_p);
>
>  	/* Stop queue if next transmit may not have space */
>  	if (axienet_check_tx_bd_space(lp, MAX_SKB_FRAGS + 1)) {
--
Robert Hancock
Senior Hardware Designer, Calian Advanced Technologies
www.calian.com