Message-ID: <20251112192408.3646835-1-lrizzo@google.com>
Date: Wed, 12 Nov 2025 19:24:02 +0000
From: Luigi Rizzo <lrizzo@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>, Marc Zyngier <maz@...nel.org>, 
	Luigi Rizzo <rizzo.unipi@...il.com>, Paolo Abeni <pabeni@...hat.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Sean Christopherson <seanjc@...gle.com>, 
	Jacob Pan <jacob.jun.pan@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org, 
	Bjorn Helgaas <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>, 
	Luigi Rizzo <lrizzo@...gle.com>
Subject: [PATCH 0/6] platform wide software interrupt moderation

Platform wide software interrupt moderation specifically addresses a
limitation of platforms, from many vendors, whose I/O performance drops
significantly when the total rate of MSI-X interrupts is too high (e.g.
1-3M intr/s depending on the platform).

Conventional interrupt moderation operates separately on each source,
so its configuration must target the worst case. On large servers
with hundreds of interrupt sources, keeping the total rate bounded would
require delays of 100-200us, and adaptive moderation would have to reach
those delays with as little as 10K intr/s per source. These values are
unacceptable for RPC or transactional workloads.
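
As a rough illustration of the arithmetic behind these figures: if each
of N sources is limited to one interrupt per delay period D, the
worst-case total rate is N/D, so bounding the total at R requires
D >= N/R. The function names and the specific source counts below are
illustrative assumptions chosen within the ranges quoted above, not
measurements from the patch.

```python
def min_delay_us(num_sources: int, max_total_rate: float) -> float:
    """Smallest fixed per-source delay, in microseconds, that keeps the
    worst-case total rate (num_sources / delay) at or below
    max_total_rate (interrupts per second)."""
    return num_sources * 1e6 / max_total_rate


def fair_share_rate(num_sources: int, max_total_rate: float) -> float:
    """Per-source rate (intr/s) at which all sources together reach
    the platform-wide bound."""
    return max_total_rate / num_sources


# Hypothetical platform: 200 sources, 1M intr/s total budget.
print(min_delay_us(200, 1_000_000))     # delay needed, in us
# Hypothetical platform: 100 sources sharing the same 1M intr/s budget.
print(fair_share_rate(100, 1_000_000))  # per-source rate at the bound
```

With a fixed per-source setting, every source pays this delay even when
the other sources are idle, which is the cost the approach described
next tries to avoid.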

To address this problem, this code efficiently measures the total and
per-CPU interrupt rates, so that individual moderation delays can be
adjusted based on actual global and local load. This way, the system
controls both global interrupt rates and individual CPU load, and
tunes delays so they are normally 0 or very small except during actual
local/global overload.

Configuration is easy and robust. System administrators specify the
maximum targets (moderation delay, interrupt rate, and fraction of time
spent in hardirq), and per-CPU control loops adjust the actual delays to
try to keep the metrics within those bounds.

There is no need for exact targets, because the system is adaptive.
Values like delay_us=100, target_irq_rate=1000000, hardirq_percent=70
are good almost everywhere.
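
To make the idea of a per-CPU control loop concrete, here is a minimal
sketch of one update step. It is not the algorithm in the patches: the
function name, the grow/decay step sizes, and the way the two metrics
are combined are all assumptions for illustration. Only the three
parameter names and their example values come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Targets:
    # Parameter names and values as quoted in the cover letter.
    delay_us: int = 100               # upper bound on the moderation delay
    target_irq_rate: int = 1_000_000  # platform-wide interrupts per second
    hardirq_percent: int = 70         # max share of CPU time in hardirq


def next_delay_us(cur_delay_us: float, global_irq_rate: float,
                  local_hardirq_percent: float, t: Targets) -> float:
    """One step of a hypothetical per-CPU control loop.

    The delay grows while either metric exceeds its target and shrinks
    otherwise, so it settles at 0 when there is no overload. The step
    sizes (grow by 10% of the maximum, halve on decay) are arbitrary.
    """
    overloaded = (global_irq_rate > t.target_irq_rate
                  or local_hardirq_percent > t.hardirq_percent)
    if overloaded:
        return min(float(t.delay_us), cur_delay_us + t.delay_us / 10)
    halved = cur_delay_us / 2
    return halved if halved >= 1 else 0.0


# Global overload pushes the delay up; a quiet system lets it decay.
t = Targets()
d = next_delay_us(0.0, 2_000_000, 10, t)  # global rate above target
d = next_delay_us(d, 500_000, 10, t)      # back under both targets
print(d)
```

Because the loop reacts to measured load rather than to the configured
values directly, the targets only need to be in the right range, which
is consistent with the claim that exact tuning is unnecessary.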

The system does not rely on any special hardware feature except for the
MSI-X Pending Bit Array (PBA), a mandatory component of MSI-X.

Boot defaults are set via module parameters (/sys/module/irq_moderation
and /sys/module/${DRIVER}). Settings can be changed at runtime via
/proc/irq/moderation, which is also used to export statistics. Moderation
on individual IRQs can be turned on/off via /proc/irq/NN/moderation.

PERFORMANCE BENEFITS:
Below are some experimental results under high load (the "before" rates
are measured with conventional moderation, the "after" rates with this
change):

- 100 Gbps NIC, 32 queues: rx goes from 50-60 Gbps to 92.8 Gbps (line rate).
- 200 Gbps NIC, 10 VMs (total 160 queues): rx goes from 30 Gbps to 190 Gbps (line rate).
- 12 SSDs, 96 queues: 4K random read goes from 6 MIOPS to 20.5 MIOPS (device max).

In all cases, latency up to p95 is unaffected at low/moderate load,
even when compared with no moderation at all.

IMPLEMENTATION:
- Most of the code, including module parameters and procfs hooks for
  configuration and telemetry, is in two files
  include/linux/irq_moderation.h and kernel/irq/irq_moderation.c.

- struct irq_desc is extended with a list entry and one field indicating
  whether this source should use moderation.

- handle_irq_event() and sysvec_posted_msi_notification() have small
  inline hooks to track interrupts and trigger moderation as needed.

- Per-CPU state is initialized via hooks in per-architecture files.

- Optional device driver module parameters can be added to set driver
  defaults to enable/disable moderation.
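
To illustrate what moderating a single source achieves, here is a toy
model, not the code in the patches. It assumes that after each delivered
interrupt the source is held off for a fixed delay, and that any events
arriving during the hold-off collapse into one pending interrupt (much
as the MSI-X PBA keeps a single pending bit per vector). The function
name and the hold-off model are assumptions for illustration.

```python
def delivered_interrupts(arrivals_us, delay_us):
    """Count handler invocations for one source under a toy model.

    arrivals_us: event timestamps in microseconds.
    delay_us:    hold-off applied after each delivered interrupt.
    """
    delivered = 0
    unmask_at = 0.0   # time at which the source accepts interrupts again
    pending = False   # an event was latched while the source was held off
    for t in sorted(arrivals_us):
        if pending and t >= unmask_at:
            # The latched event is delivered when the hold-off ends,
            # which starts a new hold-off period.
            delivered += 1
            unmask_at += delay_us
            pending = False
        if t >= unmask_at:
            delivered += 1
            unmask_at = t + delay_us
        else:
            pending = True
    if pending:
        delivered += 1  # the last latched event is delivered eventually
    return delivered


# 100 events spaced 10us apart, with and without a 100us hold-off.
events = [10 * i for i in range(100)]
print(delivered_interrupts(events, 0), delivered_interrupts(events, 100))
```

With a zero delay every event reaches the handler, so the hook adds
no coalescing at all when the control loop has set the delay to 0.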

Luigi Rizzo (6):
  genirq: platform wide interrupt moderation: Documentation, Kconfig,
    irq_desc
  genirq: soft_moderation: add base files, procfs hooks
  genirq: soft_moderation: activate hooks in handle_irq_event()
  genirq: soft_moderation: implement adaptive moderation
  x86/irq: soft_moderation: add support for posted_msi (intel)
  genirq: soft_moderation: implement per-driver defaults (nvme and vfio)

 Documentation/core-api/irq/index.rst          |   1 +
 Documentation/core-api/irq/irq-moderation.rst | 215 ++++++++
 arch/x86/kernel/cpu/common.c                  |   1 +
 arch/x86/kernel/irq.c                         |  12 +
 drivers/irqchip/irq-gic-v3.c                  |   2 +
 drivers/nvme/host/pci.c                       |   3 +
 drivers/vfio/pci/vfio_pci_intrs.c             |   3 +
 include/linux/interrupt.h                     |  28 +
 include/linux/irq_moderation.h                | 265 ++++++++++
 include/linux/irqdesc.h                       |   5 +
 kernel/irq/Kconfig                            |  11 +
 kernel/irq/Makefile                           |   1 +
 kernel/irq/handle.c                           |   3 +
 kernel/irq/irq_moderation.c                   | 482 ++++++++++++++++++
 kernel/irq/irqdesc.c                          |   1 +
 kernel/irq/proc.c                             |   2 +
 16 files changed, 1035 insertions(+)
 create mode 100644 Documentation/core-api/irq/irq-moderation.rst
 create mode 100644 include/linux/irq_moderation.h
 create mode 100644 kernel/irq/irq_moderation.c

-- 
2.51.2.1041.gc1ab5b90ca-goog

