Message-ID: <CAMOZA0L3+ohfgNfDr50-rcNNnssK0q8Snde8FrWnfn8YcWH=Ew@mail.gmail.com>
Date: Fri, 21 Nov 2025 11:58:49 +0100
From: Luigi Rizzo <lrizzo@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Marc Zyngier <maz@...nel.org>, Luigi Rizzo <rizzo.unipi@...il.com>, 
	Paolo Abeni <pabeni@...hat.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Sean Christopherson <seanjc@...gle.com>, linux-kernel@...r.kernel.org, 
	linux-arch@...r.kernel.org, Bjorn Helgaas <bhelgaas@...gle.com>, 
	Willem de Bruijn <willemb@...gle.com>
Subject: Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation

On Wed, Nov 19, 2025 at 3:43 PM Thomas Gleixner <tglx@...utronix.de> wrote:
>
...
>
> What you really want is to define the semantics and requirements of this
> moderation mechanism versus (no particular order):
>
>     1) The existing semantics of handle_edge_irq() and
>        handle_fasteoi_irq(), which need to be guaranteed.
>
>        That's one of the most crucial points because edge type
>        interrupts have fire and forget semantics on the device side and
>        there is no guarantee that the device will send another one in
>        time. People including me wasted tons of time to decode such
>        problems in the past.
>
>     2) The existing semantics of disable/enable_irq(), irq_shutdown(),
>        synchronize_irq(), which need to be preserved.
>
>     3) Teardown of an interrupt, which also relates to shutdown and
>        synchronizing.
>
>     4) Interrupts coming in while in moderated state and "masked". That
>        can be spurious interrupts or happen due to unmasking by
>        management code, e.g. enable_irq().
>
>     5) CPU hotunplug. Define the exact point where the list has to be
>        cleaned out and the timer canceled and it's guaranteed that
>        there can't be interrupts added back to the moderation list.
>
>     6) Affinity changes. Are there any implications when the descriptor
>        is enqueued on the old target and waiting for the timer to unmask
>        it? That assumes that CPU hotunplug related affinity changes are
>        not affected because those interrupts are guaranteed not to be
>        moderated.
>
>     7) Suspend/resume. That needs some deep analysis of the potential
>        state space.
>
>     8) Eligibility beyond edge, single target, etc. That might need
>        actual opt-in support from the interrupt chip hierarchy as
>        otherwise this is going to end up in a continuous stream of
>        undecodable bug reports, which others have to deal with :(
>        See my remark vs. unmaskable MSI above.
>
> Once you figured all of that out, it becomes pretty clear how the
> implementation has to look like and you can start building it up piece
> wise by doing the necessary refactoring and preparatory changes first so
> that the actual functionality falls into place at the end.
>
> If you start the other way round by glueing the functionality somewhere
> into interrupt handling first and then trying to address the resulting
> fallout, you'll get nowhere. Trust me that a lot of smart people have
> failed that way.
>
> That said, I'm happy to answer _informed_ questions about semantics,
> rules and requirements if they are not obvious from the code.

Addressing your bullets above, here is the design; if you have time,
let me know what you think.

BASIC MECHANISM
The basic control mechanism changes interrupt state as follows:

- for moderated interrupts, call __disable_irq(desc) and put the desc
  in a per-CPU list with IRQD_IRQ_INPROGRESS set.
  Set the _IRQ_DISABLE_UNLAZY status flag so that disabling also
  masks the interrupt (it works even without that, but experiments
  with greedy NVMe devices show clear benefits).

  NOTE: we need to patch irq_can_handle_pm() so it returns false
  for irqdescs that are on a moderation list; otherwise we get a
  deadlock when an interrupt (generated before the mask) comes in.

- when the moderation timer fires, remove the irqdesc from the list,
  clear IRQD_IRQ_INPROGRESS and call __enable_irq(desc).
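
To make this concrete, here is a minimal sketch of the two paths.
struct irq_moderation, irq_mod, irq_moderation_enqueue(),
irq_moderation_timer_fn(), the desc->mod_node list field and the
moderation_delay_ns knob are hypothetical names used only for
illustration; __disable_irq()/__enable_irq(), irqd_set()/irqd_clear()
and the hrtimer/per-CPU APIs are the existing interfaces.

/* Sketch only, not the actual patch. */
#include <linux/irq.h>
#include <linux/irqdesc.h>
#include <linux/hrtimer.h>
#include <linux/percpu.h>
#include <linux/list.h>
#include "internals.h"          /* __disable_irq(), __enable_irq(), irqd_set() */

struct irq_moderation {
        struct list_head  mod_list;       /* moderated irqdescs on this CPU */
        struct hrtimer    mod_timer;      /* per-CPU moderation timer */
        bool              moderation_on;  /* set/cleared by hotplug callbacks */
};
static DEFINE_PER_CPU(struct irq_moderation, irq_mod);

static unsigned long moderation_delay_ns = 50 * NSEC_PER_USEC;  /* hypothetical knob */

/*
 * Called from the flow handler with desc->lock held and interrupts off.
 * desc->mod_node is a hypothetical list node added to struct irq_desc.
 */
static void irq_moderation_enqueue(struct irq_desc *desc)
{
        struct irq_moderation *mod = this_cpu_ptr(&irq_mod);

        if (!mod->moderation_on)
                return;

        /*
         * _IRQ_DISABLE_UNLAZY is set up front for moderatable interrupts,
         * so __disable_irq() also masks the line at the chip.
         */
        __disable_irq(desc);

        /* Keep free_irq()/affinity changes waiting until the timer fires. */
        irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
        list_add_tail(&desc->mod_node, &mod->mod_list);

        if (!hrtimer_active(&mod->mod_timer))
                hrtimer_start(&mod->mod_timer, ns_to_ktime(moderation_delay_ns),
                              HRTIMER_MODE_REL_PINNED);
}

/*
 * Per-CPU timer, expiring in hardirq context on this CPU, so the local
 * list cannot change underneath us. Undo the above for every queued desc.
 */
static enum hrtimer_restart irq_moderation_timer_fn(struct hrtimer *t)
{
        struct irq_moderation *mod = container_of(t, struct irq_moderation, mod_timer);
        struct irq_desc *desc, *tmp;

        list_for_each_entry_safe(desc, tmp, &mod->mod_list, mod_node) {
                raw_spin_lock(&desc->lock);
                list_del_init(&desc->mod_node);
                irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
                __enable_irq(desc);     /* resends/unmasks pending events */
                raw_spin_unlock(&desc->lock);
        }
        return HRTIMER_NORESTART;
}

The enqueue call site in handle_edge_irq()/handle_fasteoi_irq() is of
course where the semantics from your points 1) and 4) have to be
preserved.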

RATIONALE:
- why disable/enable instead of mask/unmask:
  disable/enable already supports calls from different layers
  (infrastructure and interrupt handlers), blocks stray interrupts,
  and reissues pending events on enable. If I were to use mask/unmask
  directly, I would have to replicate most of the disable logic and risk
  introducing subtle bugs.

  Setting _IRQ_DISABLE_UNLAZY makes the mask immediate, which, based on
  tests with greedy NVMe devices, seems beneficial.
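
  For reference, the flag could be set once through the existing helper
  when an interrupt is marked as moderatable (irq_moderation_enable() is
  just a hypothetical name for wherever that opt-in ends up):

/* Hypothetical opt-in: mark an interrupt as moderatable. */
static void irq_moderation_enable(unsigned int irq)
{
        /*
         * With _IRQ_DISABLE_UNLAZY set, __disable_irq() masks the line at
         * the chip immediately instead of lazily on the next interrupt.
         */
        irq_set_status_flags(irq, IRQ_DISABLE_UNLAZY);
}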

- why IRQD_IRQ_INPROGRESS instead of using another flag:
  IRQD_IRQ_INPROGRESS seems to be the right way to implement synchronization
  with infrastructure events (shutdown etc.). One can argue that while an
  irqdesc is on the timer list it is not really "inprogress" and could be
  preempted (with huge pain, as it could be on a different CPU),
  but for practical purposes most of the actions that block on INPROGRESS
  should also wait for the irqdesc to come off the timer list.

- what about spurious interrupts:
  there is no way to block a device from sending them one after another,
  so using the existing protection (don't call the handler if disabled)
  seems the easiest way to deal with them.
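
  A sketch of that guard, reusing the hypothetical mod_node field from the
  earlier sketch (where exactly it hooks into irq_can_handle_pm() is the
  part that still needs discussion):

/*
 * Hypothetical extra test for irq_can_handle_pm() (or wherever it fits):
 * refuse to run the handler while the desc is parked on a moderation
 * list, record the event and let __enable_irq() replay it when the
 * moderation timer fires.
 */
static inline bool irq_can_handle_moderation(struct irq_desc *desc)
{
        if (list_empty(&desc->mod_node))        /* hypothetical field */
                return true;

        desc->istate |= IRQS_PENDING;
        return false;
}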

HANDLING OF VARIOUS EVENTS:

INTERRUPT TEARDOWN
  free_irq() uses synchronize_irq() before proceeding, which waits until
  IRQD_IRQ_INPROGRESS is clear. This guarantees that a moderated interrupt
  will not be destroyed while it is on the timer list.
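
  Roughly, the piece of that synchronization that matters here is the busy
  wait on the in-progress bit (very much simplified; the real
  __synchronize_hardirq() also deals with threaded handlers and chip-level
  in-flight checks):

/* Simplified view only, not the actual kernel code. */
static void wait_until_off_timer_list(struct irq_desc *desc)
{
        /* A moderated desc keeps IRQD_IRQ_INPROGRESS set until the timer fires. */
        while (irqd_irq_inprogress(&desc->irq_data))
                cpu_relax();
}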

INTERRUPT MIGRATION
  Migration changes the affinity mask, but the change takes effect only
  when the handler is about to run. That path too is blocked by
  IRQD_IRQ_INPROGRESS being set, so the new CPU will be used only after
  the interrupt leaves the timer list.

HOTPLUG
  When a CPU goes down, the system moves timers and reassigns interrupt
  affinity to another CPU. The easiest way to guarantee that pending events
  are handled correctly is a per-CPU "moderation_on" flag, managed as follows
  by hotplug callbacks at CPUHP_ONLINE (or nearby); a sketch of these
  callbacks follows below.

  - on setup, set the flag. Only from then on can interrupts be moderated.
  - on shutdown, with interrupts disabled, clear the flag, thus preventing
    more interrupts from being moderated on that CPU; process the list of
    moderated interrupts (as if the timer had fired), and cancel the timer.

  This avoids depending on a specific state: no moderation happens before
  the CPU is fully online, and cleanup happens before any teardown can
  interfere with moderation timers.
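
  A sketch of those callbacks, using the hypothetical per-CPU state from
  the first sketch; CPUHP_AP_ONLINE_DYN stands in for the final
  "CPUHP_ONLINE or nearby" state, and irq_moderation_flush() is a
  hypothetical helper that drains the list exactly like the timer callback
  (mod_list/mod_timer initialization omitted):

#include <linux/init.h>
#include <linux/cpuhotplug.h>

static int irq_moderation_cpu_online(unsigned int cpu)
{
        struct irq_moderation *mod = per_cpu_ptr(&irq_mod, cpu);

        /* Only from this point on may interrupts be moderated on this CPU. */
        mod->moderation_on = true;
        return 0;
}

static int irq_moderation_cpu_offline(unsigned int cpu)
{
        struct irq_moderation *mod = per_cpu_ptr(&irq_mod, cpu);
        unsigned long flags;

        /* For AP states the teardown callback runs on the outgoing CPU. */
        local_irq_save(flags);
        mod->moderation_on = false;             /* no new entries */
        irq_moderation_flush(mod);              /* as if the timer had fired */
        hrtimer_cancel(&mod->mod_timer);
        local_irq_restore(flags);
        return 0;
}

static int __init irq_moderation_hotplug_init(void)
{
        return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
                                 "genirq/soft_moderation:online",
                                 irq_moderation_cpu_online,
                                 irq_moderation_cpu_offline);
}
core_initcall(irq_moderation_hotplug_init);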


SUSPEND
  The hotplug callbacks are also invoked during deep suspend, which makes
  that path safe.

BOOT PROCESSOR
  Hotplug callbacks are not invoked for the boot processor. However, the
  boot processor is the last one to go down, and since there is no other
  place to run the timer callbacks, they will run where they are supposed to.

---
cheers
luigi
