Message-ID: <20251217112128.1401896-1-lrizzo@google.com>
Date: Wed, 17 Dec 2025 11:21:25 +0000
From: Luigi Rizzo <lrizzo@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>, Marc Zyngier <maz@...nel.org>, 
	Luigi Rizzo <rizzo.unipi@...il.com>, Paolo Abeni <pabeni@...hat.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Sean Christopherson <seanjc@...gle.com>, 
	Jacob Pan <jacob.jun.pan@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org, 
	Bjorn Helgaas <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>, 
	Luigi Rizzo <lrizzo@...gle.com>
Subject: [PATCH-v3 0/3] Global Software Interrupt Moderation (GSIM)

Global Software Interrupt Moderation (GSIM) addresses a limitation of
platforms from many vendors, whose I/O performance drops significantly
when the total rate of MSI-X interrupts is too high (e.g. 1-3M intr/s,
depending on the platform).

Conventional interrupt moderation, typically implemented in hardware
by NICs or storage devices, operates separately on each source (e.g. a
completion queue). Large servers can have hundreds of sources, and without
knowledge of global activity, keeping the total rate bounded would require
moderation delays of 100-200us, and adaptive moderation would have to
reach those delays with as little as 10K intr/s per source. These values
are unacceptable for RPC or transactional workloads.
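
As a rough illustration of the arithmetic above (a sketch, not taken from
the patches): if each of N sources may fire once per delay period, the
worst-case total rate is N / delay, so bounding the total at R intr/s
requires delay >= N / R. The helper name and the sample values are
hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch series: smallest fixed
 * per-source moderation delay (in microseconds, rounded up) that keeps
 * the worst-case aggregate interrupt rate at or below max_total_rate
 * when every one of nr_sources fires once per delay period.
 */
static uint64_t min_static_delay_us(uint64_t nr_sources, uint64_t max_total_rate)
{
	if (max_total_rate == 0)
		return UINT64_MAX;	/* no finite delay satisfies a zero budget */
	/* delay = nr_sources / max_total_rate seconds, expressed in us */
	return (nr_sources * 1000000ULL + max_total_rate - 1) / max_total_rate;
}
```

With 200 sources and a budget of 1M intr/s this gives 200us, and with
100 sources it gives 100us, i.e. each source is held to 10K intr/s.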

To address this problem, GSIM efficiently measures the total and
per-CPU interrupt rates, so that individual moderation delays can be
dynamically adjusted based on actual global and local load. This way,
delays are normally 0 or very small except during actual
local/global overload.

As an additional benefit, GSIM also monitors the percentage of time
spent by each CPU in hardirq, and can use moderation to reserve some
time for other, lower priority, tasks.

Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay; interrupt rate; and percentage of time
spent in hardirq), and which interrupt sources should be moderated (can be
done per-interrupt, per device, or globally). Independent per-CPU control
loops adjust actual delays to try and keep metrics within the targets.
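
To make the idea of a control loop concrete, here is a minimal sketch of
one possible update step. It is an assumption for illustration only: the
actual algorithm is in the patches and may differ. The struct, the
function name, the step size and the choice of which measured rate is
passed in are all hypothetical; only the three target names come from
this description.

```c
#include <stdint.h>

/* Hypothetical targets, named after the parameters described above. */
struct mod_targets {
	uint32_t delay_us;		/* maximum moderation delay */
	uint32_t target_irq_rate;	/* interrupts per second */
	uint32_t hardirq_percent;	/* max share of time in hardirq */
};

/*
 * Illustrative update step only, not the code from the series. Grow the
 * delay while either metric exceeds its target, shrink it otherwise,
 * and clamp the result to [0, delay_us].
 */
static uint32_t next_delay_us(const struct mod_targets *t, uint32_t cur_delay_us,
			      uint32_t irq_rate, uint32_t hardirq_pct)
{
	/* Assumed step size: one eighth of the maximum, at least 1us. */
	uint32_t step = t->delay_us / 8 ? t->delay_us / 8 : 1;

	if (irq_rate > t->target_irq_rate || hardirq_pct > t->hardirq_percent) {
		uint32_t next = cur_delay_us + step;

		return next > t->delay_us ? t->delay_us : next;
	}
	return cur_delay_us > step ? cur_delay_us - step : 0;
}
```

In this sketch the delay decays back to 0 once both metrics are within
their targets, which matches the intent that moderation only adds
latency during overload.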

The system is adaptive, and moderation does not affect throughput but
only latency and only in high load scenarios. Hence, targets don't need
to match precisely the platform limits, and one can make conservative
and robust choices. Values like delay_us=100, target_irq_rate=1000000,
hardirq_percent=70 are a very good starting point.

GSIM does not rely on any special hardware feature.

Defaults are set at boot via module parameters

    irq_moderation.${NAME}=${VALUE}

and can be changed at runtime with

    echo ${NAME}=${VALUE} > /proc/irq/soft_moderation

/proc/irq/soft_moderation is also used to export statistics.

Moderation on individual interrupts can be turned on/off at runtime with

    echo 1 > /proc/irq/NN/moderation  # use 0 to disable

PERFORMANCE BENEFITS:
Below are some experimental results under high load comparing conventional
moderation with GSIM:

- 100Gbps NIC, 32 queues: rx goes from 50 Gbps to 92.8 Gbps (line rate).
- 200Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 12 SSD, 96 queues: 4K random read goes from 6M to 20.5M IOPS (device max).

In all cases, latency up to p95 is unaffected at low/moderate load,
even if compared with no moderation at all.

Changes in v3:
- clearly documented architecture in kernel/irq/irq_moderation.c
  including how to handle enable/disable/mask, interrupt migration,
  hotplug and suspend.
- split implementation into 4 files, irq_moderation.[ch] and
  irq_moderation_hook.[ch], for better separation of control plane and
  "dataplane" (functions run on each interrupt)
- limited scope to handle_edge_irq() and handle_fasteoi_irq() which
  have been tested on actual hardware.
- tested on Intel (also with intremap=posted_msi), AMD, ARM, with NIC,
  nvme, vfio

Changes in v2:
- many style fixes (mostly on comments) based on reviewers' comments on v1
- removed background from Documentation/core-api/irq/irq-moderation.rst
- split procfs handlers
- moved internal details to kernel/irq/irq_moderation.h
- use cpu hotplug for per-CPU setup, removed unnecessary arch-specific changes
- select suitable irqs based on !irqd_is_level_type(irqd) && irqd_is_single_target(irqd)
- use a static_key to enable/disable the feature

There are two open comments from v1 for which I would like the
maintainers' clarification:

- handle_irq_event() calls irq_moderation_hook() after releasing the lock,
  so it can call disable_irq_nosync(). It may be possible to move the
  call before releasing the lock and use __disable_irq(). I am not sure
  if there is any benefit in making the change.

- the timer callback calls handle_irq_event_percpu(desc) on moderated
  irqdesc (which have irqd_irq_disabled(irqd) == 1) without changing
  IRQD_IRQ_INPROGRESS. I am not sure if this should be protected with
  the following, and especially where it would make a difference
  (specifically because the desc is disabled during this sequence).

     raw_spin_lock(&desc->lock);
     irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
     raw_spin_unlock(&desc->lock);

     handle_irq_event_percpu(desc); // <--

     raw_spin_lock(&desc->lock);
     irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
     raw_spin_unlock(&desc->lock);


Luigi Rizzo (3):
  genirq: Fixed Global Software Interrupt Moderation (GSIM)
  genirq: Adaptive Global Software Interrupt Moderation (GSIM)
  genirq: Configurable default mode for GSIM

 include/linux/irqdesc.h          |  28 ++
 kernel/irq/Kconfig               |  12 +
 kernel/irq/Makefile              |   1 +
 kernel/irq/chip.c                |  16 +-
 kernel/irq/irq_moderation.c      | 613 +++++++++++++++++++++++++++++++
 kernel/irq/irq_moderation.h      | 135 +++++++
 kernel/irq/irq_moderation_hook.c | 157 ++++++++
 kernel/irq/irq_moderation_hook.h | 102 +++++
 kernel/irq/irqdesc.c             |   1 +
 kernel/irq/manage.c              |   4 +
 kernel/irq/proc.c                |   3 +
 11 files changed, 1071 insertions(+), 1 deletion(-)
 create mode 100644 kernel/irq/irq_moderation.c
 create mode 100644 kernel/irq/irq_moderation.h
 create mode 100644 kernel/irq/irq_moderation_hook.c
 create mode 100644 kernel/irq/irq_moderation_hook.h

-- 
2.52.0.305.g3fc767764a-goog

