linux-kernel - Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <877bvmm6b2.ffs@tglx>
Date: Wed, 19 Nov 2025 15:43:29 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Luigi Rizzo <lrizzo@...gle.com>
Cc: Marc Zyngier <maz@...nel.org>, Luigi Rizzo <rizzo.unipi@...il.com>,
 Paolo Abeni <pabeni@...hat.com>, Andrew Morton
 <akpm@...ux-foundation.org>, Sean Christopherson <seanjc@...gle.com>,
 linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org, Bjorn Helgaas
 <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>
Subject: Re: [PATCH v2 3/8] genirq: soft_moderation: implement fixed moderation

On Wed, Nov 19 2025 at 00:06, Luigi Rizzo wrote:
> On Tue, Nov 18, 2025 at 7:25 PM Luigi Rizzo <lrizzo@...gle.com> wrote:
>> > That's kinda true for the per interrupt accounting, but if you look at
>> > it from a per CPU accounting perspective then you still can handle them
>> > individually and mask them when they arrive within or trigger a delay
>> > window. That means:
>>
>> ok, so to sum up, what are the correct methods to do the masking
>> you suggest, should i use mask_irq()/unmask_irq() ?
>
> I tried the following instead of disable_irq()/enable_irq().
> This seems to work also for posted_msi with no arch-specific code, as
> you suggested.

:)

> The drain_desc_list() function below should be usable directly with
> hotplug remove callbacks.

Emphasis on *should*.

> I still need to figure out if there is anything special needed when
> an interrupt is removed, or the system goes into suspend,
> while irq_desc's are in the moderation list.

Yes quite some and you forgot to mention affinity changes, interrupt
disable/enable semantics, synchronization, state consistency ...

> /* run this to mask and moderate an interrupt */
> void irq_mod_mask(struct irq_desc *desc)
> {
>         scoped_irqdesc_get_and_buslock(desc->irq_data.irq,
> IRQ_GET_DESC_CHECK_GLOBAL) {

Seriously? Why do you want to look up the descriptor again when you
already have the pointer? Copying random code together does not qualify
as programming.

>                 struct irq_mod_state *ms = this_cpu_ptr(&irq_mod_state);
>
>                 /* Add to list of moderated desc on this CPU */
>                 list_add(&desc->mod.ms_node, &ms->descs);
>                 mask_irq(scoped_irqdesc);
>                 ms->mask_irq++;
>                 if (!hrtimer_is_queued(&ms->timer))
>                         irq_moderation_start_timer(ms);
>         }

Assuming you still invoke this from handle_irq_event(), which I think is
the wrong place, then I really have to ask how that is supposed to work
correctly especially with handle_fasteoi_irq(), which is used on
ARM64. Why?

handle_fasteoi_irq()
        ....
        handle_irq_event() {
             ...
             moderation()
                mask();
        }
        
        cond_unmask_eoi_irq()
             unmask();

You need state to prevent that. There is a similar, but more subtle
issue in handle_edge_irq(), which is used on x86.

You got away with this so far due to the disable_irq() abuse and with
your unmask PoC you didn't notice the rare issue on x86, but with
handle_fasteoi_irq() this clearly can't ever work.

> /* run this on the timer callback */
> static int drain_desc_list(struct irq_mod_state *ms)
> {
>         struct irq_desc *desc, *next;
>         uint srcs = 0;
>
>         /* Last round, remove from list and unmask_irq(). */
>         list_for_each_entry_safe(desc, next, &ms->descs, mod.ms_node) {
>                 list_del_init(&desc->mod.ms_node);
>                 srcs++;
>                 ms->unmask_irq++;
>                 scoped_irqdesc_get_and_buslock(desc->irq_data.irq,
> IRQ_GET_DESC_CHECK_GLOBAL) {
>                         unmask_irq(scoped_irqdesc);

That's broken vs. state changes which occured while the descriptor was
queued. This cannot unconditionally unmask for bloody obvious reasons.

There is also the opposite: Interrupt is moderated, masked, queued and
some management code unmasks. Is that correct while it is still in the
list and marked moderated? It probably is, but only if done right and
clearly documented.

While talking about mask/unmask. You need to ensure that the interrupt
is actually maskable to be elegible for moderation. MSI does not
guarantee that, which is not an issue on ARM64 as that can mask at the
CPU level, but it is an issue on x86 which does not provide such a
mechanism.

You need to understand the intricacies of interrupt handling and
interrupt management first instead of jumping again to "works for me and
should be usable" conclusions.

Now that you know that the mask approach "works", you can stop this
"hack it into submission" frenzy and take a step back and really study
the overall problem space to come up with a proper design and
integration for this.

What you really want is to define the semantics and requirements of this
moderation mechanism versus (no particular order):

    1) The existing semantics of handle_edge_irq() and
       handle_fasteoi_irq(), which need to be guaranteed.

       That's one of the most crucial points because edge type
       interrupts have fire and forget semantics on the device side and
       there is no guarantee that the device will send another one in
       time. People including me wasted tons of time to decode such
       problems in the past.

    2) The existing semantics of disable/enable_irq(), irq_shutdown(),
       synchronize_irq(), which need to be preserved.

    3) Teardown of an interrupt, which also relates to shutdown and
       synchronizing.

    4) Interrupts coming in while in moderated state and "masked". That
       can be spurious interrupts or happen due to unmasking by
       management code, e.g. enable_irq().

    5) CPU hotunplug. Define the exact point where the list has to be
       cleaned out and the timer canceled and it's guaranteed that
       there can't be interrupts added back to the moderation list.

    6) Affinity changes. Are there any implications when the descriptor
       is enqueued on the old target and waiting for the timer to unmask
       it? That assumes that CPU hotunplug related affinity changes are
       not affected because those interrupts are guaranteed not to be
       moderated.

    7) Suspend/resume. That needs some deep analysis of the potential
       state space.

    8) Eligibility beyond edge, single target, etc. That might need
       actual opt-in support from the interrupt chip hierarchy as
       otherwise this is going to end up in a continous stream of
       undecodable bug reports, which others have to deal with :(
       See my remark vs. unmaskable MSI above.

Once you figured all of that out, it becomes pretty clear how the
implementation has to look like and you can start building it up piece
wise by doing the necessary refactoring and preparatory changes first so
that the actual functionality falls into place at the end.

If you start the other way round by glueing the functionality somewhere
into interrupt handling first and then trying to address the resulting
fallout, you'll get nowhere. Trust me that a lot of smart people have
failed that way.

That said, I'm happy to answer _informed_ questions about semantics,
rules and requirements if they are not obvious from the code.

Good luck!

Thanks,

        tglx