Message-ID: <20251112192408.3646835-1-lrizzo@google.com>
Date: Wed, 12 Nov 2025 19:24:02 +0000
From: Luigi Rizzo <lrizzo@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>, Marc Zyngier <maz@...nel.org>,
Luigi Rizzo <rizzo.unipi@...il.com>, Paolo Abeni <pabeni@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, Sean Christopherson <seanjc@...gle.com>,
Jacob Pan <jacob.jun.pan@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
Bjorn Helgaas <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
Luigi Rizzo <lrizzo@...gle.com>
Subject: [PATCH 0/6] platform wide software interrupt moderation
Platform wide software interrupt moderation specifically addresses a
limitation of platforms, from many vendors, whose I/O performance drops
significantly when the total rate of MSI-X interrupts is too high (e.g.
1-3M intr/s depending on the platform).
Conventional interrupt moderation operates separately on each source,
hence the configuration should target the worst case. On large servers
with hundreds of interrupt sources, keeping the total rate bounded would
require delays of 100-200us; and adaptive moderation would have to reach
those delays with as little as 10K intr/s per source. These values are
unacceptable for RPC or transactional workloads.
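
The arithmetic behind those figures can be sketched as follows. This is
only an illustration: the helper names are hypothetical, and the inputs
used below (200 sources, total limits of 1M and 2M intr/s) are assumed
values picked from the ranges mentioned above, not measurements.

```c
#include <stdint.h>

/*
 * Smallest fixed per-source delay (in microseconds) that keeps the
 * aggregate rate at or below total_limit intr/s in the worst case,
 * i.e. when every one of nr_sources fires once per delay period.
 * Hypothetical helper, for illustration only.
 */
static uint64_t min_delay_us(uint64_t nr_sources, uint64_t total_limit)
{
	/* delay = 1e6 * nr_sources / total_limit, rounded up */
	return (1000000ULL * nr_sources + total_limit - 1) / total_limit;
}

/* Per-source rate (intr/s) above which a fixed delay starts to bite. */
static uint64_t rate_at_delay(uint64_t delay_us)
{
	return 1000000ULL / delay_us;
}
```

With 200 sources, min_delay_us(200, 2000000) gives 100 and
min_delay_us(200, 1000000) gives 200, while rate_at_delay(100) gives
10000: a source would have to be fully delayed already at 10K intr/s.
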
To address this problem, this code efficiently measures the total and
per-CPU interrupt rates, so that individual moderation delays can be
adjusted based on actual global and local load. This way, the system
controls both global interrupt rates and individual CPU load, and
tunes delays so they are normally 0 or very small except during actual
local/global overload.
Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay; interrupt rate; and fraction of time
spent in hardirq), and per-CPU control loops adjust actual delays to try
and keep metrics within the bounds.
There is no need for exact targets, because the system is adaptive.
Values like delay_us=100, target_irq_rate=1000000, hardirq_percent=70
are good almost everywhere.
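
As a rough illustration of how such a per-CPU control loop could behave,
here is a minimal sketch. It is not the algorithm implemented in the
patches: the struct, the function name, the step size (1/8 of the
maximum) and the halving decay are all assumptions made for the example.
Only the three target names come from the text above.

```c
#include <stdint.h>

/* Hypothetical container for the three administrator-set upper bounds. */
struct mod_targets {
	uint32_t delay_us;        /* maximum moderation delay */
	uint32_t target_irq_rate; /* total interrupts per second */
	uint32_t hardirq_percent; /* share of CPU time spent in hardirq */
};

/*
 * One step of an illustrative control loop: grow the delay while either
 * metric exceeds its bound, shrink it otherwise, and always stay within
 * [0, t->delay_us].
 */
static uint32_t next_delay_us(uint32_t cur_us, uint32_t total_rate,
			      uint32_t hardirq_pct,
			      const struct mod_targets *t)
{
	uint32_t step = t->delay_us / 8 ? t->delay_us / 8 : 1;

	if (total_rate > t->target_irq_rate ||
	    hardirq_pct > t->hardirq_percent) {
		/* Overload: back off additively, capped at the maximum. */
		uint64_t grown = (uint64_t)cur_us + step;

		return grown > t->delay_us ? t->delay_us : (uint32_t)grown;
	}
	/* No overload: decay quickly so the delay returns to 0. */
	return cur_us / 2;
}
```

With the example targets delay_us=100, target_irq_rate=1000000 and
hardirq_percent=70, a CPU that sees 2M intr/s globally moves from a
delay of 0 to 12us in one step, and a delay of 40us drops to 20us once
both metrics are back within bounds.
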
The system does not rely on any special hardware feature except for the
MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X.
Boot defaults are set via module parameters (/sys/module/irq_moderation
and /sys/module/${DRIVER}). Settings can be changed at runtime via
/proc/irq/moderation, which is also used to export statistics. Moderation
on an individual irq can be turned on/off via /proc/irq/NN/moderation.
PERFORMANCE BENEFITS
Below are some experimental results under high load (before/after rates
are measured with conventional moderation or with this change):
- 100Gbps NIC, 32 queues: rx goes from 50-60 Gbps to 92.8 Gbps (line rate).
- 200Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 12 SSDs, 96 queues: 4K random read goes from 6 MIOPS to 20.5 MIOPS (device max).
In all cases, latency up to p95 is unaffected at low/moderate load,
even when compared with no moderation at all.
IMPLEMENTATION
- Most of the code, including module parameters and procfs hooks for
configuration and telemetry, is in two files:
include/linux/irq_moderation.h and kernel/irq/irq_moderation.c.
- struct irq_desc is extended with a list entry and one field indicating
whether this source should use moderation.
- handle_irq_event() and sysvec_posted_msi_notification() have small
inline hooks to track interrupts and trigger moderation as needed.
- per-CPU state is initialized via hooks in per-architecture files.
- optional device driver module parameters can be added to set driver
defaults to enable/disable moderation.
Luigi Rizzo (6):
genirq: platform wide interrupt moderation: Documentation, Kconfig,
irq_desc
genirq: soft_moderation: add base files, procfs hooks
genirq: soft_moderation: activate hooks in handle_irq_event()
genirq: soft_moderation: implement adaptive moderation
x86/irq: soft_moderation: add support for posted_msi (intel)
genirq: soft_moderation: implement per-driver defaults (nvme and vfio)
Documentation/core-api/irq/index.rst | 1 +
Documentation/core-api/irq/irq-moderation.rst | 215 ++++++++
arch/x86/kernel/cpu/common.c | 1 +
arch/x86/kernel/irq.c | 12 +
drivers/irqchip/irq-gic-v3.c | 2 +
drivers/nvme/host/pci.c | 3 +
drivers/vfio/pci/vfio_pci_intrs.c | 3 +
include/linux/interrupt.h | 28 +
include/linux/irq_moderation.h | 265 ++++++++++
include/linux/irqdesc.h | 5 +
kernel/irq/Kconfig | 11 +
kernel/irq/Makefile | 1 +
kernel/irq/handle.c | 3 +
kernel/irq/irq_moderation.c | 482 ++++++++++++++++++
kernel/irq/irqdesc.c | 1 +
kernel/irq/proc.c | 2 +
16 files changed, 1035 insertions(+)
create mode 100644 Documentation/core-api/irq/irq-moderation.rst
create mode 100644 include/linux/irq_moderation.h
create mode 100644 kernel/irq/irq_moderation.c
--
2.51.2.1041.gc1ab5b90ca-goog