Message-ID: <20251116182839.939139-1-lrizzo@google.com>
Date: Sun, 16 Nov 2025 18:28:31 +0000
From: Luigi Rizzo <lrizzo@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>, Marc Zyngier <maz@...nel.org>,
Luigi Rizzo <rizzo.unipi@...il.com>, Paolo Abeni <pabeni@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, Sean Christopherson <seanjc@...gle.com>,
Jacob Pan <jacob.jun.pan@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
Bjorn Helgaas <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
Luigi Rizzo <lrizzo@...gle.com>
Subject: [PATCH v2 0/8] platform wide software interrupt moderation
Platform wide software interrupt moderation specifically addresses a
limitation of platforms, from many vendors, whose I/O performance drops
significantly when the total rate of MSI-X interrupts is too high (e.g.
1-3M intr/s depending on the platform).
Conventional interrupt moderation operates separately on each source,
hence the configuration should target the worst case. On large servers
with hundreds of interrupt sources, keeping the total rate bounded would
require delays of 100-200us; and adaptive moderation would have to reach
those delays with as little as 10K intr/s per source. These values are
unacceptable for RPC or transactional workloads.
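As a rough illustration of the arithmetic behind these numbers, the C sketch
below computes the per-source delay needed to bound the aggregate rate when
every source interrupts as fast as its delay allows. The function names and
the source counts used in the example (100 and 200) are assumptions chosen
for illustration; they are not part of this series.

```c
#include <stdint.h>

/*
 * Smallest per-source delay (in microseconds) that bounds the aggregate
 * rate in the worst case. A source with a delay of d us fires at most
 * 1e6/d times per second, so n sources produce at most n * 1e6 / d
 * interrupts per second in total. max_total_rate must be nonzero.
 */
static uint64_t worst_case_delay_us(uint64_t nr_sources, uint64_t max_total_rate)
{
	/* Round up so that the bound is never exceeded. */
	return (nr_sources * 1000000ULL + max_total_rate - 1) / max_total_rate;
}

/*
 * Highest rate (intr/s) a single source can reach with the given delay.
 * delay_us must be nonzero.
 */
static uint64_t per_source_rate_cap(uint64_t delay_us)
{
	return 1000000ULL / delay_us;
}
```

With a budget of 1M intr/s, worst_case_delay_us(200, 1000000) yields 200us,
and per_source_rate_cap(100) shows that a 100us delay leaves each source at
most 10K intr/s.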
To address this problem, this code efficiently measures the total and
per-CPU interrupt rates, so that individual moderation delays can be
adjusted based on actual global and local load. This way, the system
controls both global interrupt rates and individual CPU load, and
tunes delays so they are normally 0 or very small except during actual
local/global overload.
Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay; interrupt rate; and fraction of time
spent in hardirq), and per-CPU control loops adjust actual delays to try
and keep metrics within the bounds.
There is no need for exact targets, because the system is adaptive.
Values like delay_us=100, target_irq_rate=1000000, hardirq_percent=70
are good almost everywhere.
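To make the idea of a per-CPU control loop concrete, here is a minimal sketch
of one adjustment step. It is NOT the algorithm implemented in this series:
the struct layouts, the function names, and the double/halve policy are all
assumptions made for illustration. Only the three administrator targets
(delay_us, target_irq_rate, hardirq_percent) come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Administrator-supplied upper bounds (struct layout is illustrative). */
struct mod_targets {
	uint32_t delay_us;        /* maximum moderation delay */
	uint32_t target_irq_rate; /* maximum total interrupts per second */
	uint32_t hardirq_percent; /* maximum share of time spent in hardirq */
};

/* Load observed over the last sampling interval (hypothetical inputs). */
struct mod_sample {
	uint32_t total_irq_rate;  /* system-wide intr/s */
	uint32_t hardirq_percent; /* this CPU's time in hardirq, 0-100 */
};

/* True if either the global rate or the local hardirq share is over target. */
static bool overloaded(const struct mod_targets *t, const struct mod_sample *s)
{
	return s->total_irq_rate > t->target_irq_rate ||
	       s->hardirq_percent > t->hardirq_percent;
}

/*
 * One control step for one CPU: double the delay while overloaded,
 * clamped to the configured maximum, and halve it toward 0 otherwise.
 * Assumes cur_us never exceeds t->delay_us.
 */
static uint32_t next_delay_us(const struct mod_targets *t,
			      const struct mod_sample *s, uint32_t cur_us)
{
	if (overloaded(t, s)) {
		uint32_t grown = cur_us ? cur_us * 2 : 1;

		return grown > t->delay_us ? t->delay_us : grown;
	}
	return cur_us / 2;
}
```

Under this assumed policy the delay decays to 0 within a few steps once load
falls below both targets, which matches the stated goal of keeping delays at
0 or very small outside of overload.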
The system does not rely on any special hardware feature other than
devices recording pending interrupts.
Defaults are set at boot via module parameters (/sys/module/irq_moderation
and /sys/module/${DRIVER}) and can be changed at runtime via
/proc/irq/moderation, which also exports statistics. Moderation on an
individual irq can be turned on/off via /proc/irq/NN/moderation.
PERFORMANCE BENEFITS:
Below are some experimental results under high load (before/after rates
are measured with conventional moderation and with this change, respectively):
- 100Gbps NIC, 32 queues: rx goes from 50-60Gbps to 92.8 Gbps (line rate).
- 200Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 12 SSDs, 96 queues: 4K random read goes from 6 MIOPS to 20.5 MIOPS (device max).
In all cases, latency up to p95 is unaffected at low/moderate load,
even if compared with no moderation at all.
Changes in v2:
- many style fixes (mostly on comments) based on reviewers' comments on v1
- removed background from Documentation/core-api/irq/irq-moderation.rst
- split procfs handlers
- moved internal details to kernel/irq/irq_moderation.h
- use cpu hotplug for per-CPU setup, removed unnecessary arch-specific changes
- select suitable irqs based on !irqd_is_level_type(irqd) && irqd_is_single_target(irqd)
- use a static_key to enable/disable the feature
There are two open comments from v1 for which I would like the maintainers'
clarification:
- handle_irq_event() calls irq_moderation_hook() after releasing the lock,
so it can call disable_irq_nosync(). It may be possible to move the
call before releasing the lock and use __disable_irq(). I am not sure
if there is any benefit in making the change.
- the timer callback calls handle_irq_event_percpu(desc) on moderated
irqdesc (which have irqd_irq_disabled(irqd) == 1) without changing
IRQD_IRQ_INPROGRESS. I am not sure whether this should be protected with
the following, and especially where it would make a difference
(specifically because the desc is disabled during this sequence).
    raw_spin_lock(&desc->lock);
    irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
    raw_spin_unlock(&desc->lock);
    handle_irq_event_percpu(desc); // <--
    raw_spin_lock(&desc->lock);
    irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
    raw_spin_unlock(&desc->lock);
Luigi Rizzo (8):
genirq: platform wide interrupt moderation: Documentation, Kconfig,
irq_desc
genirq: soft_moderation: add base files, procfs
genirq: soft_moderation: implement fixed moderation
genirq: soft_moderation: implement adaptive moderation
x86/irq: soft_moderation: add support for posted_msi (intel)
genirq: soft_moderation: helpers for per-driver defaults
nvme-pci: add module parameter for default moderation mode
vfio-pci: add module parameter for default moderation mode
Documentation/core-api/irq/index.rst | 1 +
Documentation/core-api/irq/irq-moderation.rst | 154 +++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/irq.c | 13 +
drivers/nvme/host/pci.c | 3 +
drivers/vfio/pci/vfio_pci_intrs.c | 3 +
include/linux/interrupt.h | 19 +
include/linux/irqdesc.h | 18 +
kernel/irq/Kconfig | 12 +
kernel/irq/Makefile | 1 +
kernel/irq/handle.c | 3 +
kernel/irq/irq_moderation.c | 606 ++++++++++++++++++
kernel/irq/irq_moderation.h | 330 ++++++++++
kernel/irq/irqdesc.c | 1 +
kernel/irq/proc.c | 3 +
15 files changed, 1168 insertions(+), 1 deletion(-)
create mode 100644 Documentation/core-api/irq/irq-moderation.rst
create mode 100644 kernel/irq/irq_moderation.c
create mode 100644 kernel/irq/irq_moderation.h
--
2.52.0.rc1.455.g30608eb744-goog