Message-ID: <40d30157cafc1fde60027e5f8cfcbdfb08a63a33.camel@calian.com>
Date: Tue, 3 May 2022 18:34:13 +0000
From: Robert Hancock <robert.hancock@...ian.com>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"pabeni@...hat.com" <pabeni@...hat.com>
CC: "nicolas.ferre@...rochip.com" <nicolas.ferre@...rochip.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"claudiu.beznea@...rochip.com" <claudiu.beznea@...rochip.com>,
"kuba@...nel.org" <kuba@...nel.org>,
"edumazet@...gle.com" <edumazet@...gle.com>,
"tomas.melin@...sala.com" <tomas.melin@...sala.com>
Subject: Re: [PATCH net-next 2/2] net: macb: use NAPI for TX completion path
On Tue, 2022-05-03 at 09:22 +0200, Paolo Abeni wrote:
> On Fri, 2022-04-29 at 23:09 +0000, Robert Hancock wrote:
> > On Fri, 2022-04-29 at 16:31 -0600, Robert Hancock wrote:
> > > This driver was using the TX IRQ handler to perform all TX completion
> > > tasks. Under heavy TX network load, this can cause significant irqs-off
> > > latencies (found to be in the hundreds of microseconds using ftrace).
> > > This can cause other issues, such as overrunning serial UART FIFOs when
> > > using high baud rates with limited UART FIFO sizes.
> > >
> > > Switch to using the NAPI poll handler to perform the TX completion work
> > > to get this out of hard IRQ context and avoid the IRQ latency impact.
> > >
> > > The TX Used Bit Read interrupt (TXUBR) handling also needs to be moved
> > > into the NAPI poll handler to maintain the proper order of operations. A
> > > flag is used to notify the poll handler that a UBR condition needs to be
> > > handled. The macb_tx_restart handler has had some locking added for
> > > global register access, since this could now potentially happen
> > > concurrently on different queues.
> > >
> > > Signed-off-by: Robert Hancock <robert.hancock@...ian.com>
> > > ---
> > > drivers/net/ethernet/cadence/macb.h      |   1 +
> > > drivers/net/ethernet/cadence/macb_main.c | 138 +++++++++++++----------
> > > 2 files changed, 80 insertions(+), 59 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
> > > index f0a7d8396a4a..5355cef95a9b 100644
> > > --- a/drivers/net/ethernet/cadence/macb.h
> > > +++ b/drivers/net/ethernet/cadence/macb.h
> > > @@ -1209,6 +1209,7 @@ struct macb_queue {
> > > struct macb_tx_skb *tx_skb;
> > > dma_addr_t tx_ring_dma;
> > > struct work_struct tx_error_task;
> > > + bool txubr_pending;
> > >
> > > dma_addr_t rx_ring_dma;
> > > dma_addr_t rx_buffers_dma;
> > > diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> > > index 160dc5ad84ae..1cb8afb8ef0d 100644
> > > --- a/drivers/net/ethernet/cadence/macb_main.c
> > > +++ b/drivers/net/ethernet/cadence/macb_main.c
> > > @@ -959,7 +959,7 @@ static int macb_halt_tx(struct macb *bp)
> > > return -ETIMEDOUT;
> > > }
> > >
> > > -static void macb_tx_unmap(struct macb *bp, struct macb_tx_skb *tx_skb)
> > > +static void macb_tx_unmap(struct macb *bp, struct macb_tx_skb *tx_skb, int budget)
> > > {
> > > if (tx_skb->mapping) {
> > > if (tx_skb->mapped_as_page)
> > > @@ -972,7 +972,7 @@ static void macb_tx_unmap(struct macb *bp, struct macb_tx_skb *tx_skb)
> > > }
> > >
> > > if (tx_skb->skb) {
> > > - dev_kfree_skb_any(tx_skb->skb);
> > > + napi_consume_skb(tx_skb->skb, budget);
> > > tx_skb->skb = NULL;
> > > }
> > > }
> > > @@ -1025,12 +1025,13 @@ static void macb_tx_error_task(struct work_struct *work)
> > > (unsigned int)(queue - bp->queues),
> > > queue->tx_tail, queue->tx_head);
> > >
> > > - /* Prevent the queue IRQ handlers from running: each of them may call
> > > - * macb_tx_interrupt(), which in turn may call netif_wake_subqueue().
> > > + /* Prevent the queue NAPI poll from running, as it calls
> > > + * macb_tx_complete(), which in turn may call netif_wake_subqueue().
> > > * As explained below, we have to halt the transmission before updating
> > > * TBQP registers so we call netif_tx_stop_all_queues() to notify the
> > > * network engine about the macb/gem being halted.
> > > */
> > > + napi_disable(&queue->napi);
> > > spin_lock_irqsave(&bp->lock, flags);
> > >
> > > /* Make sure nobody is trying to queue up new packets */
> > > @@ -1058,7 +1059,7 @@ static void macb_tx_error_task(struct work_struct *work)
> > > if (ctrl & MACB_BIT(TX_USED)) {
> > > /* skb is set for the last buffer of the frame */
> > > while (!skb) {
> > > - macb_tx_unmap(bp, tx_skb);
> > > + macb_tx_unmap(bp, tx_skb, 0);
> > > tail++;
> > > tx_skb = macb_tx_skb(queue, tail);
> > > skb = tx_skb->skb;
> > > @@ -1088,7 +1089,7 @@ static void macb_tx_error_task(struct work_struct *work)
> > > desc->ctrl = ctrl | MACB_BIT(TX_USED);
> > > }
> > >
> > > - macb_tx_unmap(bp, tx_skb);
> > > + macb_tx_unmap(bp, tx_skb, 0);
> > > }
> > >
> > > /* Set end of TX queue */
> > > @@ -1118,25 +1119,28 @@ static void macb_tx_error_task(struct work_struct *work)
> > > macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
> > >
> > > spin_unlock_irqrestore(&bp->lock, flags);
> > > + napi_enable(&queue->napi);
> > > }
> > >
> > > -static void macb_tx_interrupt(struct macb_queue *queue)
> > > +static bool macb_tx_complete_pending(struct macb_queue *queue)
> > > +{
> > > + if (queue->tx_head != queue->tx_tail) {
> > > + /* Make hw descriptor updates visible to CPU */
> > > + rmb();
> > > +
> > > if (macb_tx_desc(queue, queue->tx_tail)->ctrl & MACB_BIT(TX_USED))
> > > + return true;
> > > + }
> > > + return false;
> > > +}
> > > +
> > > +static void macb_tx_complete(struct macb_queue *queue, int budget)
> > > {
> > > unsigned int tail;
> > > unsigned int head;
> > > - u32 status;
> > > struct macb *bp = queue->bp;
> > > u16 queue_index = queue - bp->queues;
> > >
> > > - status = macb_readl(bp, TSR);
> > > - macb_writel(bp, TSR, status);
> > > -
> > > - if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
> > > - queue_writel(queue, ISR, MACB_BIT(TCOMP));
> > > -
> > > - netdev_vdbg(bp->dev, "macb_tx_interrupt status = 0x%03lx\n",
> > > - (unsigned long)status);
> > > -
> > > head = queue->tx_head;
> > > for (tail = queue->tx_tail; tail != head; tail++) {
> > > struct macb_tx_skb *tx_skb;
> > > @@ -1182,7 +1186,7 @@ static void macb_tx_interrupt(struct macb_queue *queue)
> > > }
> > >
> > > /* Now we can safely release resources */
> > > - macb_tx_unmap(bp, tx_skb);
> > > + macb_tx_unmap(bp, tx_skb, budget);
> > >
> > > /* skb is set only for the last buffer of the frame.
> > > * WARNING: at this point skb has been freed by
> > > @@ -1569,23 +1573,55 @@ static bool macb_rx_pending(struct macb_queue *queue)
> > > return (desc->addr & MACB_BIT(RX_USED)) != 0;
> > > }
> > >
> > > +static void macb_tx_restart(struct macb_queue *queue)
> > > +{
> > > + unsigned int head = queue->tx_head;
> > > + unsigned int tail = queue->tx_tail;
> > > + struct macb *bp = queue->bp;
> > > + unsigned int head_idx, tbqp;
> > > +
> > > + if (head == tail)
> > > + return;
> > > +
> > > + tbqp = queue_readl(queue, TBQP) / macb_dma_desc_get_size(bp);
> > > + tbqp = macb_adj_dma_desc_idx(bp, macb_tx_ring_wrap(bp, tbqp));
> > > + head_idx = macb_adj_dma_desc_idx(bp, macb_tx_ring_wrap(bp, head));
> > > +
> > > + if (tbqp == head_idx)
> > > + return;
> > > +
> > > + spin_lock_irq(&bp->lock);
> > > + macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
> > > + spin_unlock_irq(&bp->lock);
> > > +}
> > > +
> > > static int macb_poll(struct napi_struct *napi, int budget)
> > > {
> > > struct macb_queue *queue = container_of(napi, struct macb_queue, napi);
> > > struct macb *bp = queue->bp;
> > > int work_done;
> > >
> > > + macb_tx_complete(queue, budget);
> > > +
> > > + rmb(); // ensure txubr_pending is up to date
> > > + if (queue->txubr_pending) {
> > > + queue->txubr_pending = false;
> > > + netdev_vdbg(bp->dev, "poll: tx restart\n");
> > > + macb_tx_restart(queue);
> > > + }
> > > +
> >
> > Thinking about this a bit more, I wonder if we could just do this
> > tx_restart call unconditionally here? Then we wouldn't need this
> > txubr_pending flag at all. CCing Tomas Melin, who worked on this
> > tx_restart code recently.
>
> Looking only at the code, and lacking the NIC specs, the two
> alternatives look functionally equivalent.
>
> Performance-wise it could matter. It depends on the relative cost of
> the ISR + memory barriers vs. restarting TX: by removing txubr_pending
> you trade the former for the latter.
>
> I guess the easiest way to check is to do performance comparisons of
> the two options. I hope you have the relevant H/W handy ;)
I've done some benchmarks with this under a moderate TX load (about 600 Mbps)
on the Xilinx ZynqMP platform. It looks like the softirq CPU usage is a few
percent higher when doing an unconditional TX restart (around 52% vs. 48-50%
with the code as submitted). So it's not earth-shattering or highly conclusive
either way, but it seems the way I had it is likely more efficient.
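
For concreteness, the unconditional variant compared above amounts to dropping
the txubr_pending flag and the rmb() from macb_poll() and always attempting
the restart, relying on macb_tx_restart()'s early-out checks. A minimal sketch
of that variant (not the exact code that was benchmarked):

static int macb_poll(struct napi_struct *napi, int budget)
{
	struct macb_queue *queue = container_of(napi, struct macb_queue, napi);
	struct macb *bp = queue->bp;
	int work_done = 0;	/* updated by the elided RX processing */

	macb_tx_complete(queue, budget);

	/* Unconditional restart: macb_tx_restart() returns early when the
	 * queue is empty or when TBQP has already caught up with the
	 * software head, so this is safe, just potentially costlier (a
	 * TBQP register read and possibly a TSTART write under bp->lock
	 * on every poll) than checking a flag set by the ISR.
	 */
	macb_tx_restart(queue);

	/* ... RX processing and NAPI completion as in the patch above ... */

	return work_done;
}

The tradeoff Paolo describes is visible here: the flag version pays for a
wmb()/rmb() pair and a conditional on every poll, while this version pays for
the TBQP read (and sometimes a locked TSTART write) on every poll.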
>
> Thanks,
>
> Paolo
>
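
For readers without the full patch in front of them: the interrupt-handler
side of the txubr_pending mechanism is not quoted in this thread. Based on the
commit message and the poll-side code above, it would take roughly the
following shape inside the MACB interrupt handler (a sketch only; the exact
interrupt-mask writes are assumptions modeled on the existing driver, not the
submitted hunk):

	if (status & MACB_BIT(TXUBR)) {
		/* Defer the "used bit read" restart to the NAPI poll
		 * handler rather than restarting TX in hard IRQ context.
		 */
		queue->txubr_pending = true;
		wmb();	/* pairs with the rmb() in macb_poll() */
	}

	if (status & (bp->rx_intr_mask | MACB_BIT(TCOMP) | MACB_BIT(TXUBR))) {
		/* Mask the interrupts now handled in the poll loop and
		 * schedule NAPI; the poll handler re-enables them via IER
		 * when it completes.
		 */
		queue_writel(queue, IDR, bp->rx_intr_mask | MACB_BIT(TCOMP));
		if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
			queue_writel(queue, ISR,
				     MACB_BIT(RCOMP) | MACB_BIT(TCOMP));
		napi_schedule(&queue->napi);
	}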