Message-ID: <20251217112128.1401896-1-lrizzo@google.com>
Date: Wed, 17 Dec 2025 11:21:25 +0000
From: Luigi Rizzo <lrizzo@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>, Marc Zyngier <maz@...nel.org>,
Luigi Rizzo <rizzo.unipi@...il.com>, Paolo Abeni <pabeni@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, Sean Christopherson <seanjc@...gle.com>,
Jacob Pan <jacob.jun.pan@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
Bjorn Helgaas <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
Luigi Rizzo <lrizzo@...gle.com>
Subject: [PATCH-v3 0/3] Global Software Interrupt Moderation (GSIM)
Global Software Interrupt Moderation (GSIM) addresses a limitation of
platforms from many vendors, whose I/O performance drops significantly
when the total rate of MSI-X interrupts is too high (e.g. 1-3M intr/s,
depending on the platform).
Conventional interrupt moderation, typically implemented in hardware
by NICs or storage devices, operates separately on each source (e.g. a
completion queue). Large servers can have hundreds of sources, and without
knowledge of global activity, keeping the total rate bounded would require
moderation delays of 100-200us, and adaptive moderation would have to
reach those delays with as little as 10K intr/s per source. These values
are unacceptable for RPC or transactional workloads.
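The figures above follow from simple arithmetic, sketched below. The
function names and the example values (200 sources, a 2M intr/s platform
limit) are assumptions chosen for illustration; they are not taken from
the patches.

```c
#include <stdio.h>

/*
 * Illustrative arithmetic only; names are hypothetical and not part of
 * the patch series.
 *
 * A source moderated with a fixed delay of delay_us fires at most once
 * every delay_us microseconds, so nr_sources independent sources can
 * generate up to nr_sources * 1e6 / delay_us interrupts per second.
 */
static double worst_case_total_rate(unsigned int nr_sources, double delay_us)
{
	return nr_sources * 1e6 / delay_us;
}

/* Smallest per-source delay (in us) that keeps the total under max_rate. */
static double required_delay_us(unsigned int nr_sources, double max_rate)
{
	return nr_sources * 1e6 / max_rate;
}

/* Assumed example: 200 sources on a platform limited to 2M intr/s. */
static void print_example(void)
{
	double d = required_delay_us(200, 2e6);

	printf("per-source delay needed: %.0f us\n", d);
	printf("per-source rate at that delay: %.0f intr/s\n",
	       worst_case_total_rate(1, d));
}
```

Under these assumptions each source needs a 100us delay and is capped at
10K intr/s, even when every other source is idle. That is the cost of
moderating each source without knowledge of the global rate.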
To address this problem, GSIM efficiently measures the total and
per-CPU interrupt rates, so that individual moderation delays can be
dynamically adjusted based on actual global and local load. This way,
delays are normally 0 or very small except during actual
local/global overload.
As an additional benefit, GSIM also monitors the percentage of time
spent by each CPU in hardirq, and can use moderation to reserve some
time for other, lower priority, tasks.
Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay; interrupt rate; and percentage of time
spent in hardirq), and which interrupt sources should be moderated (can be
done per-interrupt, per device, or globally). Independent per-CPU control
loops adjust actual delays to try and keep metrics within the targets.
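As a rough illustration of what one step of such a per-CPU control loop
could look like, here is a minimal sketch. It is NOT the algorithm in the
patches: the struct names, the additive-increase/halving policy and the
step size are all assumptions made for this example. Only the three
targets (delay_us, target_irq_rate, hardirq_percent) come from the
description above.

```c
#include <stdint.h>

/* Hypothetical types for illustration; not the structures in the series. */
struct gsim_targets {
	uint32_t delay_us;		/* maximum moderation delay */
	uint32_t target_irq_rate;	/* maximum total interrupts per second */
	uint32_t hardirq_percent;	/* maximum share of CPU time in hardirq */
};

struct gsim_sample {
	uint32_t global_irq_rate;	/* measured total intr/s */
	uint32_t hardirq_percent;	/* measured hardirq time on this CPU */
};

/*
 * One control step: grow the delay while any target is exceeded, and
 * decay it towards zero otherwise. The result never exceeds t->delay_us.
 */
static uint32_t gsim_next_delay_us(uint32_t cur_delay_us,
				   const struct gsim_targets *t,
				   const struct gsim_sample *s)
{
	int overloaded = s->global_irq_rate > t->target_irq_rate ||
			 s->hardirq_percent > t->hardirq_percent;
	/* Assumed step: 1/8 of the maximum delay, at least 1us. */
	uint32_t step = t->delay_us / 8 ? t->delay_us / 8 : 1;

	if (!overloaded)
		return cur_delay_us / 2;	/* back off quickly when idle */

	if (cur_delay_us >= t->delay_us || t->delay_us - cur_delay_us < step)
		return t->delay_us;		/* clamp at the configured maximum */

	return cur_delay_us + step;
}
```

With delay_us=100, this sketch leaves the delay at 0 while both metrics
are within their targets, and raises it in 12us steps up to 100us while
either target is exceeded.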
The system is adaptive, and moderation affects only latency, not
throughput, and only in high load scenarios. Hence, targets don't need
to match precisely the platform limits, and one can make conservative
and robust choices. Values like delay_us=100, target_irq_rate=1000000,
hardirq_percent=70 are a very good starting point.
GSIM does not rely on any special hardware feature.
Defaults are set at boot via module parameters
irq_moderation.${NAME}=${VALUE}
and can be changed at runtime with
echo ${NAME}=${VALUE} > /proc/irq/soft_moderation
/proc/irq/soft_moderation is also used to export statistics.
Moderation on individual interrupts can be turned on/off at runtime with
echo 1 > /proc/irq/NN/moderation # use 0 to disable
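For tools that prefer not to shell out, the same runtime update can be
done from C. The helper below is a hypothetical sketch (gsim_write_param
is not part of the series) and assumes the file accepts the same
NAME=VALUE text that echo would write.

```c
#include <stdio.h>

/*
 * Write "NAME=VALUE\n" to a control file such as
 * /proc/irq/soft_moderation. Hypothetical helper, for illustration.
 * Returns 0 on success, -1 on error.
 */
static int gsim_write_param(const char *path, const char *name,
			    unsigned long value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "%s=%lu\n", name, value) > 0;

	/* fclose() flushes the buffer, so write errors surface here. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

For example, gsim_write_param("/proc/irq/soft_moderation", "delay_us", 100)
would be the equivalent of the echo command above with NAME=delay_us and
VALUE=100.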
PERFORMANCE BENEFITS:
Below are some experimental results under high load comparing conventional
moderation with GSIM:
- 100Gbps NIC, 32 queues: rx goes from 50 Gbps to 92.8 Gbps (line rate).
- 200Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 12 SSD, 96 queues: 4K random read goes from 6M to 20.5M IOPS (device max).
In all cases, latency up to p95 is unaffected at low/moderate load,
even when compared with no moderation at all.
Changes in v3:
- clearly documented architecture in kernel/irq/irq_moderation.c
including how to handle enable/disable/mask, interrupt migration,
hotplug and suspend.
- split implementation into 4 files, irq_moderation.[ch] and
irq_moderation_hook.[ch], for better separation of control plane and
"dataplane" (functions run on each interrupt)
- limited scope to handle_edge_irq() and handle_fasteoi_irq() which
have been tested on actual hardware.
- tested on Intel (also with intremap=posted_msi), AMD, ARM, with NIC,
nvme, vfio
Changes in v2:
- many style fixes (mostly on comments) based on reviewers' comments on v1
- removed background from Documentation/core-api/irq/irq-moderation.rst
- split procfs handlers
- moved internal details to kernel/irq/irq_moderation.h
- use cpu hotplug for per-CPU setup, removed unnecessary arch-specific changes
- select suitable irqs based on !irqd_is_level_type(irqd) && irqd_is_single_target(irqd)
- use a static_key to enable/disable the feature
There are two open comments from v1 on which I would like the
maintainers' clarification:
- handle_irq_event() calls irq_moderation_hook() after releasing the lock,
so it can call disable_irq_nosync(). It may be possible to move the
call before releasing the lock and use __disable_irq(). I am not sure
if there is any benefit in making the change.
- the timer callback calls handle_irq_event_percpu(desc) on moderated
irqdesc (which have irqd_irq_disabled(irqd) == 1) without changing
IRQD_IRQ_INPROGRESS. I am not sure if this should be protected with
the following, and especially where it would make a difference
(specifically because the desc is disabled during this sequence).
raw_spin_lock(&desc->lock);
irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
raw_spin_unlock(&desc->lock);
handle_irq_event_percpu(desc); // <--
raw_spin_lock(&desc->lock);
irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
raw_spin_unlock(&desc->lock);
Luigi Rizzo (3):
genirq: Fixed Global Software Interrupt Moderation (GSIM)
genirq: Adaptive Global Software Interrupt Moderation (GSIM)
genirq: Configurable default mode for GSIM
include/linux/irqdesc.h | 28 ++
kernel/irq/Kconfig | 12 +
kernel/irq/Makefile | 1 +
kernel/irq/chip.c | 16 +-
kernel/irq/irq_moderation.c | 613 +++++++++++++++++++++++++++++++
kernel/irq/irq_moderation.h | 135 +++++++
kernel/irq/irq_moderation_hook.c | 157 ++++++++
kernel/irq/irq_moderation_hook.h | 102 +++++
kernel/irq/irqdesc.c | 1 +
kernel/irq/manage.c | 4 +
kernel/irq/proc.c | 3 +
11 files changed, 1071 insertions(+), 1 deletion(-)
create mode 100644 kernel/irq/irq_moderation.c
create mode 100644 kernel/irq/irq_moderation.h
create mode 100644 kernel/irq/irq_moderation_hook.c
create mode 100644 kernel/irq/irq_moderation_hook.h
--
2.52.0.305.g3fc767764a-goog