Message-ID: <CAMOZA0KKJ9S45-LnLtYKn-L8dL71tsfs29c6ZL3bkuTcNXorAw@mail.gmail.com>
Date: Mon, 17 Nov 2025 22:34:13 +0100
From: Luigi Rizzo <lrizzo@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Marc Zyngier <maz@...nel.org>, Luigi Rizzo <rizzo.unipi@...il.com>, 
	Paolo Abeni <pabeni@...hat.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Sean Christopherson <seanjc@...gle.com>, Jacob Pan <jacob.jun.pan@...ux.intel.com>, 
	linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org, 
	Bjorn Helgaas <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>
Subject: Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation

On Mon, Nov 17, 2025 at 9:51 PM Thomas Gleixner <tglx@...utronix.de> wrote:
>
> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> > Add two control parameters (target_irq_rate and hardirq_percent)
> > to indicate the desired maximum values for these two metrics.
> >
> > Every update_ms the hook in handle_irq_event() recomputes the total and
> > local interrupt rate and the amount of time spent in hardirq, compares
> > the values with the targets, and adjusts the moderation delay up or down.
> >
> > The interrupt rate is computed in a scalable way by counting interrupts
> > per-CPU, and aggregating the value in a global variable only every
> > update_ms. Only CPUs that actively process interrupts are actually
> > accessing the shared variable, so the system scales well even on very
> > large servers.
>
> You still fail to explain why this is required and why a per CPU
> moderation is not sufficient and what the benefits of the approach are.

It was explained in the first patch of the series and in Documentation/.

(First-world problem, for sure: I have examples on AMD, Intel and Arm,
all of them with 100+ CPUs per NUMA node, and 160-480 CPUs total.)
On some of the above platforms, MSI-X interrupts cause heavy serialization
of all other PCIe requests. As a result, when the total interrupt rate exceeds
1-2M intr/s, I/O throughput degrades by 4x or more.

To deal with this using per-CPU moderation, without shared state, each CPU
cannot be allowed more than some 5K intr/s, which means fixed moderation
would have to be set at 200us, and adaptive moderation would have to jump
to such delays as soon as the local rate reaches 5K intr/s.

In reality, it is very unlikely that all CPUs are handling such high
rates, so if we know better, we can adjust or remove moderation individually,
based on the actual local and total interrupt rates and the number of active CPUs.

The purpose of this global mechanism is to figure out whether we are
approaching a dangerous rate, and do individual tuning.

> > +     /* Compare with global and per-CPU targets. */
> > +     global_rate_high = irq_rate > target_rate;
> > +     my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
> > [...]
> > +     /* Moderate on this CPU only if both global and local rates are high. */
>
> Because it's desired that CPUs can be starved by interrupts when enough
> other CPUs only have a very low rate? I'm failing to understand that
> logic and the comprehensive rationale in the change log does not help either.

The comment could be worded better: s/moderate/bump the delay up/

The mechanism aims to keep total_rate < target, by gently kicking
individual delays up or down based on the condition:

  total_rate > target && local_rate > target / active_cpus ? bump_up()
                                                           : bump_down()

If the control is effective, the total rate will stay within bound and
nobody suffers: neither the CPUs handling <1K intr/s, nor the lonely
CPU handling 100K+ intr/s.

If the rates suddenly go up, the CPUs with higher rates will be moderated more,
hopefully converging to a new equilibrium.
Like any control system, it has limits on what it can do.

> > [...]
> > +     if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
> > +             below_target = false;
> > +
> > +     if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
> > +             below_target = false;
>
> So that rate limits a per CPU overload, but only when IRQTIME accounting
> is enabled. Oh well...

I can add checks to disallow setting the per-CPU overload target when
IRQTIME accounting is not enabled.

>
> > +     /* Controller: adjust delay with exponential decay/grow. */
> > +     if (below_target) {
> > +             ms->mod_ns -= ms->mod_ns * steps / (steps + irq_mod_info.decay_factor);
> > +             if (ms->mod_ns < 100)
> > +                     ms->mod_ns = 0;
>
> These random numbers used to limit things are truly interesting and
> easy to understand - NOT.
>
> > +     } else {
> > +             /* Exponential grow does not restart if value is too small. */
> > +             if (ms->mod_ns < 500)
> > +                     ms->mod_ns = 500;
> > +             ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
> > +             if (ms->mod_ns > ms->delay_ns)
> > +                     ms->mod_ns = ms->delay_ns;
> > +     }
>
> Why does this need separate grow and decay factors? Just because more
> knobs are better?

Like in TCP: brake aggressively (the grow factor is smaller) to respond
quickly to overload, and accelerate prudently (the decay factor is larger)
to avoid reacting too optimistically.

thanks again for the attention
luigi
