Message-ID: <d811db1e-6aef-40f2-b152-7d77b8a8c2b3@amd.com>
Date: Mon, 9 Sep 2024 18:32:26 -0700
From: "Nelson, Shannon" <shannon.nelson@....com>
To: Sean Anderson <sean.anderson@...ux.dev>,
"David S . Miller" <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Radhey Shyam Pandey <radhey.shyam.pandey@....com>, netdev@...r.kernel.org
Cc: linux-arm-kernel@...ts.infradead.org, Michal Simek
<michal.simek@....com>, linux-kernel@...r.kernel.org,
Heng Qi <hengqi@...ux.alibaba.com>
Subject: Re: [RFC PATCH net-next v2 6/6] net: xilinx: axienet: Enable adaptive
IRQ coalescing with DIM
On 9/9/2024 4:52 PM, Sean Anderson wrote:
>
> The default RX IRQ coalescing settings of one IRQ per packet can represent
> a significant CPU load. However, increasing the coalescing unilaterally
> can result in undesirable latency under low load. Adaptive IRQ
> coalescing with DIM offers a way to adjust the coalescing settings based
> on load.
>
> This device only supports "CQE" mode [1], where each packet resets the
> timer. Therefore, an interrupt is fired either when we receive
> coalesce_count_rx packets or when the interface is idle for
> coalesce_usec_rx. With this in mind, consider the following scenarios:
>
> Link saturated
> Here we want to set coalesce_count_rx to a large value, in order to
> coalesce more packets and reduce CPU load. coalesce_usec_rx should
> be set to at least the time for one packet. Otherwise the link will
> be "idle" and we will get an interrupt for each packet anyway.
>
> Bursts of packets
> Each burst should be coalesced into a single interrupt, although it
> may be prudent to reduce coalesce_count_rx for better latency.
> coalesce_usec_rx should be set to at least the time for one packet
> so bursts are coalesced. However, additional time beyond the packet
> time will just increase latency at the end of a burst.
>
> Sporadic packets
> Due to low load, we can set coalesce_count_rx to 1 in order to
> reduce latency to the minimum. coalesce_usec_rx does not matter in
> this case.
>
> Based on this analysis, I expected the CQE profiles to look something
> like
>
> usec = 0, pkts = 1 // Low load
> usec = 16, pkts = 4
> usec = 16, pkts = 16
> usec = 16, pkts = 64
> usec = 16, pkts = 256 // High load
>
> Where usec is set to 16 to be a few us greater than the 12.3 us packet
> time of a 1500 MTU packet at 1 GBit/s. However, the CQE profile is
> instead
>
> usec = 2, pkts = 256 // Low load
> usec = 8, pkts = 128
> usec = 16, pkts = 64
> usec = 32, pkts = 64
> usec = 64, pkts = 64 // High load
>
> I found this very surprising. The number of coalesced packets
> *decreases* as load increases. But as load increases we have more
> opportunities to coalesce packets without affecting latency as much.
> Additionally, the profile *increases* the usec as the load increases.
> But as load increases, the gaps between packets will tend to become
> smaller, making it possible to *decrease* usec for better latency at the
> end of a "burst".
>
> I consider the default CQE profile unsuitable for this NIC. Therefore,
> we use the first profile outlined in this commit instead.
> coalesce_usec_rx is set to 16 by default, but the user can customize it.
> This may be necessary if they are using jumbo frames. I think adjusting
> the profile times based on the link speed/MTU would be a good improvement
> for generic DIM.
>
> In addition to the above profile problems, I noticed the following
> additional issues with DIM while testing:
>
> - DIM tends to "wander" at low load, since the performance gradient
> is pretty flat. If you only have 10 packets/ms anyway, then adjusting
> the coalescing settings will not affect throughput very much.
> - DIM takes a long time to adjust back to low indices when load is
> decreased following a period of high load. This is because it only
> re-evaluates its settings once every 64 interrupts. However, at low
> load, accumulating 64 interrupts can take several seconds.
>
> Finally: performance. This patch increases receive throughput with
> iperf3 from 840 Mbits/sec to 938 Mbits/sec, decreases interrupts from
> 69920/sec to 316/sec, and decreases CPU utilization (4x Cortex-A53) from
> 43% to 9%.
>
> [1] Who names this stuff?
>
> Signed-off-by: Sean Anderson <sean.anderson@...ux.dev>
> ---
> Heng, maybe you have some comments on DIM regarding the above?
>
> Changes in v2:
> - Don't take the RTNL in axienet_rx_dim_work to avoid deadlock. Instead,
> calculate a partial cr update that axienet_update_coalesce_rx can
> perform under a spin lock.
> - Use READ/WRITE_ONCE when accessing/modifying rx_irqs
>
> drivers/net/ethernet/xilinx/Kconfig | 1 +
> drivers/net/ethernet/xilinx/xilinx_axienet.h | 10 ++-
> .../net/ethernet/xilinx/xilinx_axienet_main.c | 80 +++++++++++++++++--
> 3 files changed, 82 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/xilinx/Kconfig b/drivers/net/ethernet/xilinx/Kconfig
> index 35d96c633a33..7502214cc7d5 100644
> --- a/drivers/net/ethernet/xilinx/Kconfig
> +++ b/drivers/net/ethernet/xilinx/Kconfig
> @@ -28,6 +28,7 @@ config XILINX_AXI_EMAC
> depends on HAS_IOMEM
> depends on XILINX_DMA
> select PHYLINK
> + select DIMLIB
> help
> This driver supports the 10/100/1000 Ethernet from Xilinx for the
> AXI bus interface used in Xilinx Virtex FPGAs and Soc's.
> diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet.h b/drivers/net/ethernet/xilinx/xilinx_axienet.h
> index 33d05e55567e..b6604e354de7 100644
> --- a/drivers/net/ethernet/xilinx/xilinx_axienet.h
> +++ b/drivers/net/ethernet/xilinx/xilinx_axienet.h
> @@ -9,6 +9,7 @@
> #ifndef XILINX_AXIENET_H
> #define XILINX_AXIENET_H
>
> +#include <linux/dim.h>
> #include <linux/netdevice.h>
> #include <linux/spinlock.h>
> #include <linux/interrupt.h>
> @@ -123,8 +124,7 @@
> /* Default TX/RX Threshold and delay timer values for SGDMA mode */
> #define XAXIDMA_DFT_TX_THRESHOLD 24
> #define XAXIDMA_DFT_TX_USEC 50
> -#define XAXIDMA_DFT_RX_THRESHOLD 1
> -#define XAXIDMA_DFT_RX_USEC 50
> +#define XAXIDMA_DFT_RX_USEC 16
>
> #define XAXIDMA_BD_CTRL_TXSOF_MASK 0x08000000 /* First tx packet */
> #define XAXIDMA_BD_CTRL_TXEOF_MASK 0x04000000 /* Last tx packet */
> @@ -484,6 +484,9 @@ struct skbuf_dma_descriptor {
> * @regs: Base address for the axienet_local device address space
> * @dma_regs: Base address for the axidma device address space
> * @napi_rx: NAPI RX control structure
> + * @rx_dim: DIM state for the receive queue
> + * @rx_irqs: Number of interrupts
> + * @rx_dim_enabled: Whether DIM is enabled or not
nit: These should be in the same order as in the actual struct
sln
> * @rx_cr_lock: Lock protecting @rx_dma_cr, its register, and @rx_dma_started
> * @rx_dma_cr: Nominal content of RX DMA control register
> * @rx_dma_started: Set when RX DMA is started
> @@ -566,6 +569,9 @@ struct axienet_local {
> void __iomem *dma_regs;
>
> struct napi_struct napi_rx;
> + struct dim rx_dim;
> + bool rx_dim_enabled;
> + u16 rx_irqs;
> spinlock_t rx_cr_lock;
> u32 rx_dma_cr;
> bool rx_dma_started;
> diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
> index eb9600417d81..194ae87f534a 100644
> --- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
> +++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
> @@ -1279,6 +1279,18 @@ static int axienet_rx_poll(struct napi_struct *napi, int budget)
> axienet_dma_out_addr(lp, XAXIDMA_RX_TDESC_OFFSET, tail_p);
>
> if (packets < budget && napi_complete_done(napi, packets)) {
> + if (READ_ONCE(lp->rx_dim_enabled)) {
> + struct dim_sample sample = {
> + .time = ktime_get(),
> + /* Safe because we are the only writer */
> + .pkt_ctr = u64_stats_read(&lp->rx_packets),
> + .byte_ctr = u64_stats_read(&lp->rx_bytes),
> + .event_ctr = READ_ONCE(lp->rx_irqs),
> + };
> +
> + net_dim(&lp->rx_dim, sample);
> + }
> +
> /* Re-enable RX completion interrupts. This should
> * cause an immediate interrupt if any RX packets are
> * already pending.
> @@ -1373,6 +1385,7 @@ static irqreturn_t axienet_rx_irq(int irq, void *_ndev)
> */
> u32 cr;
>
> + WRITE_ONCE(lp->rx_irqs, READ_ONCE(lp->rx_irqs) + 1);
> spin_lock(&lp->rx_cr_lock);
> cr = lp->rx_dma_cr;
> cr &= ~(XAXIDMA_IRQ_IOC_MASK | XAXIDMA_IRQ_DELAY_MASK);
> @@ -1670,6 +1683,7 @@ static int axienet_open(struct net_device *ndev)
> if (lp->eth_irq > 0)
> free_irq(lp->eth_irq, ndev);
> err_phy:
> + cancel_work_sync(&lp->rx_dim.work);
> cancel_delayed_work_sync(&lp->stats_work);
> phylink_stop(lp->phylink);
> phylink_disconnect_phy(lp->phylink);
> @@ -1696,6 +1710,7 @@ static int axienet_stop(struct net_device *ndev)
> napi_disable(&lp->napi_rx);
> }
>
> + cancel_work_sync(&lp->rx_dim.work);
> cancel_delayed_work_sync(&lp->stats_work);
>
> phylink_stop(lp->phylink);
> @@ -2068,6 +2083,31 @@ static void axienet_update_coalesce_rx(struct axienet_local *lp, u32 cr,
> spin_unlock_irq(&lp->rx_cr_lock);
> }
>
> +/**
> + * axienet_dim_coalesce_count_rx() - RX coalesce count for DIM
> + * @lp: Device private data
> + */
> +static u32 axienet_dim_coalesce_count_rx(struct axienet_local *lp)
> +{
> + return 1 << (lp->rx_dim.profile_ix << 1);
> +}
> +
> +/**
> + * axienet_rx_dim_work() - Adjust RX DIM settings
> + * @work: The work struct
> + */
> +static void axienet_rx_dim_work(struct work_struct *work)
> +{
> + struct axienet_local *lp =
> + container_of(work, struct axienet_local, rx_dim.work);
> + u32 cr = axienet_calc_cr(lp, axienet_dim_coalesce_count_rx(lp), 0);
> + u32 mask = XAXIDMA_COALESCE_MASK | XAXIDMA_IRQ_IOC_MASK |
> + XAXIDMA_IRQ_ERROR_MASK;
> +
> + axienet_update_coalesce_rx(lp, cr, mask);
> + lp->rx_dim.state = DIM_START_MEASURE;
> +}
> +
> /**
> * axienet_set_cr_tx() - Set TX CR
> * @lp: Device private data
> @@ -2118,6 +2158,8 @@ axienet_ethtools_get_coalesce(struct net_device *ndev,
> struct axienet_local *lp = netdev_priv(ndev);
> u32 cr;
>
> + ecoalesce->use_adaptive_rx_coalesce = lp->rx_dim_enabled;
> +
> spin_lock_irq(&lp->rx_cr_lock);
> cr = lp->rx_dma_cr;
> spin_unlock_irq(&lp->rx_cr_lock);
> @@ -2154,7 +2196,9 @@ axienet_ethtools_set_coalesce(struct net_device *ndev,
> struct netlink_ext_ack *extack)
> {
> struct axienet_local *lp = netdev_priv(ndev);
> - u32 cr;
> + bool new_dim = ecoalesce->use_adaptive_rx_coalesce;
> + bool old_dim = lp->rx_dim_enabled;
> + u32 cr, mask = ~XAXIDMA_CR_RUNSTOP_MASK;
>
> if (!ecoalesce->rx_max_coalesced_frames ||
> !ecoalesce->tx_max_coalesced_frames) {
> @@ -2162,7 +2206,7 @@ axienet_ethtools_set_coalesce(struct net_device *ndev,
> return -EINVAL;
> }
>
> - if ((ecoalesce->rx_max_coalesced_frames > 1 &&
> + if (((ecoalesce->rx_max_coalesced_frames > 1 || new_dim) &&
> !ecoalesce->rx_coalesce_usecs) ||
> (ecoalesce->tx_max_coalesced_frames > 1 &&
> !ecoalesce->tx_coalesce_usecs)) {
> @@ -2171,9 +2215,27 @@ axienet_ethtools_set_coalesce(struct net_device *ndev,
> return -EINVAL;
> }
>
> - cr = axienet_calc_cr(lp, ecoalesce->rx_max_coalesced_frames,
> - ecoalesce->rx_coalesce_usecs);
> - axienet_update_coalesce_rx(lp, cr, ~XAXIDMA_CR_RUNSTOP_MASK);
> + if (new_dim && !old_dim) {
> + cr = axienet_calc_cr(lp, axienet_dim_coalesce_count_rx(lp),
> + ecoalesce->rx_coalesce_usecs);
> + } else if (!new_dim) {
> + if (old_dim) {
> + WRITE_ONCE(lp->rx_dim_enabled, false);
> + napi_synchronize(&lp->napi_rx);
> + flush_work(&lp->rx_dim.work);
> + }
> +
> + cr = axienet_calc_cr(lp, ecoalesce->rx_max_coalesced_frames,
> + ecoalesce->rx_coalesce_usecs);
> + } else {
> + /* Dummy value for count just to calculate timer */
> + cr = axienet_calc_cr(lp, 2, ecoalesce->rx_coalesce_usecs);
> + mask = XAXIDMA_DELAY_MASK | XAXIDMA_IRQ_DELAY_MASK;
> + }
> +
> + axienet_update_coalesce_rx(lp, cr, mask);
> + if (new_dim && !old_dim)
> + WRITE_ONCE(lp->rx_dim_enabled, true);
>
> cr = axienet_calc_cr(lp, ecoalesce->tx_max_coalesced_frames,
> ecoalesce->tx_coalesce_usecs);
> @@ -2415,7 +2477,8 @@ axienet_ethtool_get_rmon_stats(struct net_device *dev,
>
> static const struct ethtool_ops axienet_ethtool_ops = {
> .supported_coalesce_params = ETHTOOL_COALESCE_MAX_FRAMES |
> - ETHTOOL_COALESCE_USECS,
> + ETHTOOL_COALESCE_USECS |
> + ETHTOOL_COALESCE_USE_ADAPTIVE_RX,
> .get_drvinfo = axienet_ethtools_get_drvinfo,
> .get_regs_len = axienet_ethtools_get_regs_len,
> .get_regs = axienet_ethtools_get_regs,
> @@ -2964,7 +3027,10 @@ static int axienet_probe(struct platform_device *pdev)
>
> spin_lock_init(&lp->rx_cr_lock);
> spin_lock_init(&lp->tx_cr_lock);
> - lp->rx_dma_cr = axienet_calc_cr(lp, XAXIDMA_DFT_RX_THRESHOLD,
> + INIT_WORK(&lp->rx_dim.work, axienet_rx_dim_work);
> + lp->rx_dim_enabled = true;
> + lp->rx_dim.profile_ix = 1;
> + lp->rx_dma_cr = axienet_calc_cr(lp, axienet_dim_coalesce_count_rx(lp),
> XAXIDMA_DFT_RX_USEC);
> lp->tx_dma_cr = axienet_calc_cr(lp, XAXIDMA_DFT_TX_THRESHOLD,
> XAXIDMA_DFT_TX_USEC);
> --
> 2.35.1.1320.gc452695387.dirty
>
>