Message-ID: <ZxYxqhj7cesDO8-j@archie.me>
Date: Mon, 21 Oct 2024 17:49:14 +0700
From: Bagas Sanjaya <bagasdotme@...il.com>
To: Joe Damato <jdamato@...tly.com>,
Linux Networking <netdev@...r.kernel.org>
Cc: namangulati@...gle.com, edumazet@...gle.com, amritha.nambiar@...el.com,
sridhar.samudrala@...el.com, sdf@...ichev.me, peter@...eblog.net,
m2shafiei@...terloo.ca, bjorn@...osinc.com, hch@...radead.org,
willy@...radead.org, willemdebruijn.kernel@...il.com,
skhawaja@...gle.com, kuba@...nel.org,
Martin Karsten <mkarsten@...terloo.ca>,
"David S. Miller" <davem@...emloft.net>,
Paolo Abeni <pabeni@...hat.com>, Jonathan Corbet <corbet@....net>,
Linux Documentation <linux-doc@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux BPF <bpf@...r.kernel.org>
Subject: Re: [PATCH net-next v2 6/6] docs: networking: Describe irq suspension
On Mon, Oct 21, 2024 at 01:53:01AM +0000, Joe Damato wrote:
> diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
> index dfa5d549be9c..3b43477a52ce 100644
> --- a/Documentation/networking/napi.rst
> +++ b/Documentation/networking/napi.rst
> @@ -192,6 +192,28 @@ is reused to control the delay of the timer, while
> ``napi_defer_hard_irqs`` controls the number of consecutive empty polls
> before NAPI gives up and goes back to using hardware IRQs.
>
> +The above parameters can also be set on a per-NAPI basis using netlink via
> +netdev-genl. This can be done programmatically in a user application or by
> +using a script included in the kernel source tree: ``tools/net/ynl/cli.py``.
> +
> +For example, using the script:
> +
> +.. code-block:: bash
> +
> + $ kernel-source/tools/net/ynl/cli.py \
> + --spec Documentation/netlink/specs/netdev.yaml \
> + --do napi-set \
> + --json='{"id": 345,
> + "defer-hard-irqs": 111,
> + "gro-flush-timeout": 11111}'
> +
> +Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
> +via netdev-genl. There is no global sysfs parameter for this value.
In JSON, both gro-flush-timeout and irq-suspend-timeout parameter
names are written with hyphens, but the rest of the docs use underscores
(that is, gro_flush_timeout and irq_suspend_timeout), right?
> +
> +``irq_suspend_timeout`` is used to determine how long an application can
> +completely suspend IRQs. It is used in combination with SO_PREFER_BUSY_POLL,
> +which can be set on a per-epoll context basis with the ``EPIOCSPARAMS`` ioctl.
> +
> .. _poll:
>
> Busy polling
> @@ -207,6 +229,46 @@ selected sockets or using the global ``net.core.busy_poll`` and
> ``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
> also exists.
>
> +epoll-based busy polling
> +------------------------
> +
> +It is possible to trigger packet processing directly from calls to
> +``epoll_wait``. In order to use this feature, a user application must ensure
> +all file descriptors which are added to an epoll context have the same NAPI ID.
> +
> +If the application uses a dedicated acceptor thread, the application can obtain
> +the NAPI ID of the incoming connection using SO_INCOMING_NAPI_ID and then
> +distribute that file descriptor to a worker thread. The worker thread would add
> +the file descriptor to its epoll context. This would ensure each worker thread
> +has an epoll context with FDs that have the same NAPI ID.
> +
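For readers following along, here is a minimal sketch of what the acceptor
side could look like. SO_INCOMING_NAPI_ID and getsockopt() are the real
interfaces; ``struct napi_map``, ``MAX_WORKERS``, ``incoming_napi_id()`` and
``worker_for_napi_id()`` are hypothetical names made up for illustration, and
the fallback value 56 is an assumption that only holds for architectures
using the asm-generic socket options.

```c
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56 /* assumption: asm-generic value */
#endif

#define MAX_WORKERS 64 /* hypothetical upper bound on worker threads */

/* Hypothetical table: slot i holds the NAPI ID owned by worker i. */
struct napi_map {
    unsigned int ids[MAX_WORKERS];
    unsigned int used;
};

/*
 * Return the NAPI ID recorded for the socket, 0 if none has been
 * recorded yet, or -1 if getsockopt() fails.
 */
static long incoming_napi_id(int fd)
{
    unsigned int id = 0;
    socklen_t len = sizeof(id);

    if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &id, &len) < 0)
        return -1;
    return (long)id;
}

/*
 * Return the index of the worker that owns napi_id, assigning the next
 * free worker the first time an ID is seen. Returns -1 when every
 * worker already owns a different NAPI ID, so that no epoll context
 * ever mixes IDs.
 */
static int worker_for_napi_id(struct napi_map *m, unsigned int napi_id,
                              unsigned int nworkers)
{
    unsigned int i;

    for (i = 0; i < m->used; i++)
        if (m->ids[i] == napi_id)
            return (int)i;
    if (m->used >= nworkers || m->used >= MAX_WORKERS)
        return -1;
    m->ids[m->used] = napi_id;
    return (int)m->used++;
}
```

The acceptor would call incoming_napi_id() on each accepted fd and hand the
fd to the worker returned by worker_for_napi_id(). What to do with an ID of
0 or a -1 result is left to the application.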
> +Alternatively, if the application uses SO_REUSEPORT, a BPF or eBPF program can
> +be inserted to distribute incoming connections to threads such that each thread
> +is only given incoming connections with the same NAPI ID. Care must be taken to
> +handle cases where a system may have multiple NICs.
> +
> +In order to enable busy polling, there are two choices:
> +
> +1. ``/proc/sys/net/core/busy_poll`` can be set with a time in microseconds to
> + busy loop waiting for events. This is a system-wide setting and will cause
> + all epoll-based applications to busy poll when they call epoll_wait. This
> + may not be desirable as many applications may not have the need to busy poll.
> +
> +2. Applications using recent kernels can issue an ioctl on the epoll context
> + file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``) ``struct
> + epoll_params``, which user programs can define as follows:
> +
> +.. code-block:: c
> +
> + struct epoll_params {
> + uint32_t busy_poll_usecs;
> + uint16_t busy_poll_budget;
> + uint8_t prefer_busy_poll;
> +
> + /* pad the struct to a multiple of 64bits */
> + uint8_t __pad;
> + };
> +
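It might help readers to see the struct in use. Below is a minimal sketch,
not taken from the patch: ``epoll_params_init()`` and ``epoll_set_busy_poll()``
are hypothetical helper names, and the fallback ioctl definitions are an
assumption based on my reading of include/uapi/linux/eventpoll.h, intended
only for systems whose headers do not provide them yet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

/* Fallback for older headers (assumption: mirrors the uapi header). */
#ifndef EPIOCSPARAMS
struct epoll_params {
    uint32_t busy_poll_usecs;
    uint16_t busy_poll_budget;
    uint8_t prefer_busy_poll;

    /* pad the struct to a multiple of 64 bits */
    uint8_t __pad;
};
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#define EPIOCGPARAMS _IOR(EPOLL_IOC_TYPE, 0x02, struct epoll_params)
#endif

static_assert(sizeof(struct epoll_params) == 8,
              "epoll_params is expected to be 64 bits");

/*
 * Fill *p with the requested settings and zero the pad byte.
 * Returns 0, or -1 if budget does not fit in the 16-bit field.
 */
static int epoll_params_init(struct epoll_params *p, uint32_t usecs,
                             unsigned int budget, int prefer)
{
    if (budget > UINT16_MAX)
        return -1;
    memset(p, 0, sizeof(*p));
    p->busy_poll_usecs = usecs;
    p->busy_poll_budget = (uint16_t)budget;
    p->prefer_busy_poll = prefer ? 1 : 0;
    return 0;
}

/*
 * Apply *p to the epoll context epfd. Returns 0 on success, or -1 with
 * errno set, for example when the running kernel lacks the ioctl.
 */
static int epoll_set_busy_poll(int epfd, const struct epoll_params *p)
{
    return ioctl(epfd, EPIOCSPARAMS, p);
}
```

A caller would create the epoll context with epoll_create1(), fill the struct
(say, 200 usecs, a budget of 64 and prefer_busy_poll set), and then call
epoll_set_busy_poll() before entering its epoll_wait loop. As far as I
recall, the kernel rejects a nonzero pad byte, hence the memset().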
> IRQ mitigation
> ---------------
>
> @@ -222,12 +284,78 @@ Such applications can pledge to the kernel that they will perform a busy
> polling operation periodically, and the driver should keep the device IRQs
> permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
> socket option. To avoid system misbehavior the pledge is revoked
> -if ``gro_flush_timeout`` passes without any busy poll call.
> +if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
> +busy polling applications, the ``prefer_busy_poll`` field of ``struct
> +epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be issued to
> +enable this mode. See the above section for more details.
>
> The NAPI budget for busy polling is lower than the default (which makes
> sense given the low latency intention of normal busy polling). This is
> not the case with IRQ mitigation, however, so the budget can be adjusted
> -with the ``SO_BUSY_POLL_BUDGET`` socket option.
> +with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
> +applications, the ``busy_poll_budget`` field can be adjusted to the desired value
> +in ``struct epoll_params`` and set on a specific epoll context using the ``EPIOCSPARAMS``
> +ioctl. See the above section for more details.
> +
> +It is important to note that choosing a large value for ``gro_flush_timeout``
> +will defer IRQs to allow for better batch processing, but will induce latency
> +when the system is not fully loaded. Choosing a small value for
> +``gro_flush_timeout`` can cause device IRQs and softirq processing to interfere
> +with the user application which is attempting to busy poll. This value
> +should be chosen carefully with these tradeoffs in mind. epoll-based busy
> +polling applications may be able to mitigate how much user processing happens
> +by choosing an appropriate value for ``maxevents``.
> +
> +Users may want to consider an alternate approach, IRQ suspension, to help deal
> +with these tradeoffs.
> +
> +IRQ suspension
> +--------------
> +
> +IRQ suspension is a mechanism wherein device IRQs are masked while epoll
> +triggers NAPI packet processing.
> +
> +While application calls to epoll_wait successfully retrieve events, the kernel will
> +defer the IRQ suspension timer. If the kernel does not retrieve any events
> +while busy polling (for example, because network traffic levels subsided), IRQ
> +suspension is disabled and the IRQ mitigation strategies described above are
> +engaged.
> +
> +This allows users to balance CPU consumption with network processing
> +efficiency.
> +
> +To use this mechanism:
> +
> + 1. The per-NAPI config parameter ``irq_suspend_timeout`` should be set to the
> + maximum time (in nanoseconds) the application can have its IRQs
> + suspended. This is done using netlink, as described above. This timeout
> + serves as a safety mechanism to restart driver interrupt processing if
> + the application has stalled. This value should be chosen so that it covers
> + the amount of time the user application needs to process data from its
> + call to epoll_wait, noting that applications can control how much data
> + they retrieve by setting ``maxevents`` when calling epoll_wait.
> +
> + 2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
> + and ``napi_defer_hard_irqs`` can be set to low values. They will be used
> + to defer IRQs after busy poll has found no data.
> +
> + 3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
> + the ``EPIOCSPARAMS`` ioctl as described above.
> +
> + 4. The application uses epoll as described above to trigger NAPI packet
> + processing.
> +
> +As mentioned above, as long as subsequent calls to epoll_wait return events to
> +userland, the ``irq_suspend_timeout`` is deferred and IRQs are disabled. This
> +allows the application to process data without interference.
> +
> +Once a call to epoll_wait results in no events being found, IRQ suspension is
> +automatically disabled and the ``gro_flush_timeout`` and
> +``napi_defer_hard_irqs`` mitigation mechanisms take over.
> +
> +It is expected that ``irq_suspend_timeout`` will be set to a value much larger
> +than ``gro_flush_timeout`` as ``irq_suspend_timeout`` should suspend IRQs for
> +the duration of one userland processing cycle.
>
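Since the text ties ``irq_suspend_timeout`` to the length of one userland
processing cycle, a small sketch of such a cycle may be useful. It only
shows how capping ``maxevents`` bounds the work done per epoll_wait call; the
suspension itself happens in the kernel. ``MAX_EVENTS``, ``process_cycle()``
and ``count_event()`` are hypothetical names, and the cap of 8 is an
arbitrary value chosen for illustration.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 8 /* hypothetical cap on events handled per cycle */

typedef void (*event_fn)(const struct epoll_event *ev);

/* Example handler: only counts the events it is given. */
static int handled;

static void count_event(const struct epoll_event *ev)
{
    (void)ev;
    handled++;
}

/*
 * Run one processing cycle: fetch at most MAX_EVENTS ready events and
 * pass each one to handle(). Returns the number of events handled,
 * 0 if none were ready within timeout_ms, or -1 on error.
 *
 * Keeping the work per cycle bounded is what lets a cycle finish
 * before the per-NAPI irq_suspend_timeout would expire.
 */
static int process_cycle(int epfd, int timeout_ms, event_fn handle)
{
    struct epoll_event evs[MAX_EVENTS];
    int i, n;

    n = epoll_wait(epfd, evs, MAX_EVENTS, timeout_ms);
    for (i = 0; i < n; i++)
        handle(&evs[i]);
    return n;
}
```

An application would call process_cycle() in a loop. A return value of 0
corresponds to the case described above where no events are found and the
``gro_flush_timeout`` and ``napi_defer_hard_irqs`` mechanisms take over.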
> .. _threaded:
>
The rest LGTM, thanks!
Reviewed-by: Bagas Sanjaya <bagasdotme@...il.com>
--
An old man doll... just what I always wanted! - Clara