linux-kernel - Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <875xb2jh2f.ffs@tglx>
Date: Sat, 22 Nov 2025 15:08:24 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Luigi Rizzo <lrizzo@...gle.com>
Cc: Marc Zyngier <maz@...nel.org>, Luigi Rizzo <rizzo.unipi@...il.com>,
 Paolo Abeni <pabeni@...hat.com>, Andrew Morton
 <akpm@...ux-foundation.org>, Sean Christopherson <seanjc@...gle.com>,
 linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org, Bjorn Helgaas
 <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>
Subject: Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation

On Fri, Nov 21 2025 at 11:58, Luigi Rizzo wrote:
> On Wed, Nov 19, 2025 at 3:43 PM Thomas Gleixner <tglx@...utronix.de> wrote:
> Addressing your bullets above here is the design,
> if you have time let me know what you think,
>
> BASIC MECHANISM
> The basic control mechanism changes interrupt state as follows:
>
> - for moderated interrupts, call __disable_irq(desc) and put the desc
>   in a per-CPU list with IRQD_IRQ_INPROGRESS set.
>   Set the _IRQ_DISABLE_UNLAZY status flag so disable will also
>   mask the interrupt (it works even without that, but experiments
>   with greedy NVME devices show clear benefits).

What sets that flag? If you want to fiddle with it in this moderation
muck behind the scene, no.

With this new scheme where the moderation handling never invokes the
handler after the time is over, you can't actually mask upfront
unconditionally unless there is a clear indication from the underlying
interrupt chain that this works correctly with edge type interrupts.

It only works when it is guaranteed that the interrupt chip which
actually provides the masking functionality will latch an interrupt
raised in the device and then raise it down the chain when unmasked.

PCIe mandates that behaviour for MSI-X and maskable MSI, but it's not
guaranteed for other device types. And out of painful experience it's
not even true for some broken PCIe devices.

>   NOTE: we need to patch irq_can_handle_pm() so it will return false
>   on irqdesc that are in a moderation list, otherwise we have
>   a deadlock when an interrupt (generated before the mask) comes in.

I told you that you need state :)

> - when the moderation timer fires, remove the irqdesc from the list,
>   clear IRQD_IRQ_INPROGRESS and call __enable_irq(desc).
>
> RATIONALE:
> - why disable/enable instead of mask/unmask:
>   disable/enable already supports call from different layers
>   (infrastructure and interrupt handlers), blocks stray interrupts,
>   and reissues pending events on enable. If I were to use mask/unmask
>   directly, I would have to replicate most of the disable logic and risk
>   introducing subtle bugs.

It's not that much to replicate, but fair enough as long as it happens
under the lock when entering the handler. 

>   Setting _IRQ_DISABLE_UNLAZY makes the mask immediate, which based on
>   tests (greedy NVME devices) seems beneficial.

See above. What you really want is:

    1) An indicator in the irq chip which tells that the above
       requirement of raising a pending interrupt on unmask is
       "guaranteed" :)

    2) A dedicated variant of __disable_irq() which invokes
       __irq_disable() with the following twist

       disable_moderated_irq(desc)
          force = !irqd_irq_masked(..) && irq_moderate_unlazy(desc[chip]);
          if (!desc->disable++ || force)
          	irq_disable_moderated(desc, force)
                    force |= irq_settings_disable_unlazy(desc);
                    __irq_disable(desc, force);

That requires to have a new MODERATE_UNLAZY flag, which can be set by
infrastructure, e.g. the PCI/MSI core as that knows whether the
underlying device MSI implementation provides masking. That should be
probably a feature in the interrupt chip.

That needs a command line option to prevent MODERATE_UNLAZY from being
effective.

That preserves the lazy masking behaviour for non-moderated situations,
which is not only a correctness measure (see above) but preferred even
for correctly working hardware because it avoids going out to the
hardware (e.g. writing to PCI config space twice) in many cases.

Though all of this wants to be a separate step _after_ the initial
functionality has been worked out and stabilized. It's really an
optimization on top of the base functionality and we are not going to
optimize prematurely.

> - why IRQD_IRQ_INPROGRESS instead of using another flag:
>   IRQD_IRQ_INPROGRESS seems the way to implement synchronization with
>   infrastructure events (shutdown etc.). We can argue that when an
>   irqdesc is in the timer list it is not "inprogress" and could be
>   preempted (with huge pain, as it could be on a different CPU),
>   but for practical purposes most of the actions that block on INPROGRESS
>   should also wait for the irqdesc to come off the timer list.

As long as those functions are not polling the bit with interrupts
disabled that's a correct assumption.

> - what about spurious interrupts:
>   there is no way to block a device from sending them one after another,
>   so using the existing protection (don't call the handler if disabled)
>   seems the easiest way to deal with them.

Correct.

> HANDLING OF VARIOUS EVENTS:
>
> INTERRUPT TEARDOWN
>   free_irq() uses synchronize_irq() before proceeding, which waits until
>   IRQD_IRQ_INPROGRESS is clear. This guarantees that a moderated interrupt
>   will not be destroyed while it is on the timer list.

That requires a new variant for __enable_irq() because free_irq() shuts
the interrupt down and __enable_irq() would just start it up again...

> INTERRUPT MIGRATION
>   Migration changes the affinity mask, but it takes effect only when
>   trying to run the handler. That too is prevented by IRQD_IRQ_INPROGRESS
>   being set, so the new CPU will be used only after the interrupt exits
>   the timer list.

How is affinity setting related to trying to run the handler? It steers
the interrupt to a new target CPU so that the next interrupt raised in
the device ends up at the new target.

So if that happens (and the interrupt is not masked) then the handler on
the new target CPU will poll for $DELAY until the other side fires the
timer and clears the bit. That's insane and needs to be prevented by
checking moderated state.

> HOTPLUG
>   During shutdown the system moves timers and reassigns interrupt affinity
>   to a new CPU. The easiest way to guarantee that pending events are
>   handled correctly is to use a per-CPU "moderation_on" flag managed as
>   follows by hotplug callbacks on CPUHP_ONLINE (or nearby)

That can be anywhere in the state machine between ONLINE and ACTIVE.

>   - on setup, set the flag. Only now interrupts can be moderated.
>   - on shutdown, with interrupts disabled, clear the flag thus preventing
>     more interrupts to be moderated on that CPU; process the list of moderated
>     interrupts (as if the timer had fired), and cancel the timer.

Ok.

> SUSPEND
>   The hotplug callbacks are also invoked during deep suspend so that
>   makes it safe.
>
> BOOT PROCESSOR
>   Hotplug callbacks are not invoked for the boot processor. However the
>   boot processor is the last one to go, and since there is no other place
>   to run the timer callbacks, they will be run where they are supposed to.

That's only true for hibernation, aka. suspend to disk, but not for
suspend to RAM/IDLE.

You want to treat all of these scenarios the same and just make sure
that the moderation muck is disabled and nothing hangs on a list
especially not with the INPROGRESS bit set.

Thanks,

        tglx