netdev - Re: [PATCH net-next v5] net: axienet: Use NAPI for TX completion path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20220510185639.1c6d6c8a@kernel.org>
Date:   Tue, 10 May 2022 18:56:39 -0700
From:   Jakub Kicinski <kuba@...nel.org>
To:     Robert Hancock <robert.hancock@...ian.com>
Cc:     netdev@...r.kernel.org, radhey.shyam.pandey@...inx.com,
        davem@...emloft.net, edumazet@...gle.com, pabeni@...hat.com,
        michal.simek@...inx.com, linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH net-next v5] net: axienet: Use NAPI for TX completion
 path

On Mon,  9 May 2022 11:30:39 -0600 Robert Hancock wrote:
> This driver was using the TX IRQ handler to perform all TX completion
> tasks. Under heavy TX network load, this can cause significant irqs-off
> latencies (found to be in the hundreds of microseconds using ftrace).
> This can cause other issues, such as overrunning serial UART FIFOs when
> using high baud rates with limited UART FIFO sizes.
> 
> Switch to using a NAPI poll handler to perform the TX completion work
> to get this out of hard IRQ context and avoid the IRQ latency impact.
> A separate poll handler is used for TX and RX since they have separate
> IRQs on this controller, so that the completion work for each of them
> stays on the same CPU as the interrupt.
> 
> Testing on a Xilinx MPSoC ZU9EG platform using iperf3 from a Linux PC
> through a switch at 1G link speed showed no significant change in TX or
> RX throughput, with approximately 941 Mbps before and after. Hard IRQ
> time in the TX throughput test was significantly reduced from 12% to
> below 1% on the CPU handling TX interrupts, with total hard+soft IRQ CPU
> usage dropping from about 56% down to 48%.
> 
> Signed-off-by: Robert Hancock <robert.hancock@...ian.com>
> ---
> 
> Changed since v4: Added locking to protect TX ring tail pointer against
> concurrent access by TX transmit and TX poll paths.

Hi, sorry for a late reply there's just too many patches to look at
lately.

The lock is slightly concerning, the driver follows the usual wake up 
scheme based on memory barriers. If we add the lock we should probably
take the barriers out.

We can also try to avoid the lock and drill into what the issue is.
At a quick look it seems like there is a barrier missing between setup
of the descriptors and kicking the transfer off:

diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
index d6fc3f7acdf0..9e244b73a0ca 100644
--- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
+++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
@@ -878,10 +878,11 @@ axienet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	cur_p->skb = skb;
 
 	tail_p = lp->tx_bd_p + sizeof(*lp->tx_bd_v) * lp->tx_bd_tail;
-	/* Start the transfer */
-	axienet_dma_out_addr(lp, XAXIDMA_TX_TDESC_OFFSET, tail_p);
 	if (++lp->tx_bd_tail >= lp->tx_bd_num)
 		lp->tx_bd_tail = 0;
+	wmb(); // possibly dma_wmb()
+	/* Start the transfer */
+	axienet_dma_out_addr(lp, XAXIDMA_TX_TDESC_OFFSET, tail_p);
 
 	/* Stop queue if next transmit may not have space */
 	if (axienet_check_tx_bd_space(lp, MAX_SKB_FRAGS + 1)) {