linux-kernel - Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive moderation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87cy5g6z0q.ffs@tglx>
Date: Tue, 18 Nov 2025 00:05:25 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Luigi Rizzo <lrizzo@...gle.com>
Cc: Marc Zyngier <maz@...nel.org>, Luigi Rizzo <rizzo.unipi@...il.com>,
 Paolo Abeni <pabeni@...hat.com>, Andrew Morton
 <akpm@...ux-foundation.org>, Sean Christopherson <seanjc@...gle.com>,
 Jacob Pan <jacob.jun.pan@...ux.intel.com>, linux-kernel@...r.kernel.org,
 linux-arch@...r.kernel.org, Bjorn Helgaas <bhelgaas@...gle.com>, Willem de
 Bruijn <willemb@...gle.com>
Subject: Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive
 moderation

On Mon, Nov 17 2025 at 22:34, Luigi Rizzo wrote:
> On Mon, Nov 17, 2025 at 9:51 PM Thomas Gleixner <tglx@...utronix.de> wrote:
>> On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
>> > Add two control parameters (target_irq_rate and hardirq_percent)
>> > to indicate the desired maximum values for these two metrics.
>> >
>> > Every update_ms the hook in handle_irq_event() recomputes the total and
>> > local interrupt rate and the amount of time spent in hardirq, compares
>> > the values with the targets, and adjusts the moderation delay up or down.
>> >
>> > The interrupt rate is computed in a scalable way by counting interrupts
>> > per-CPU, and aggregating the value in a global variable only every
>> > update_ms. Only CPUs that actively process interrupts are actually
>> > accessing the shared variable, so the system scales well even on very
>> > large servers.
>>
>> You still fail to explain why this is required and why a per CPU
>> moderation is not sufficient and what the benefits of the approach are.
>
> It was explained in the first patch of the series and in Documentation/.

Change logs of a patch have to be self contained whether you like it or
not. I'm not chasing random bits of information accross a series or
cover letter and once a patch is applied it becomes even harder to make
sense of it when the change log contains zero information.

> (First world problem, for sure: I have examples for AMD, Intel, Arm,
> all of them with 100+ CPUs per numa node, and 160-480 CPUs total)
> On some of the above platforms, MSIx interrupts cause heavy serialization
> of all other PCIe requests. As a result, when the total interrupt rate exceeds
> 1-2M intrs/s, I/O throughput degrades by up to 4x and more.
> 
> To deal with this with per CPU moderation, without shared state, each CPU
> cannot allow more than some 5Kintrs/s, which means fixed moderation
> should be set at 200us, and adaptive moderation should jump to such
> delays as soon as the local rate reaches 5K intrs/s.
>
> In reality, it is very unlikely that all CPUs are actively handling such high
> rates, so if we know better, we can adjust or remove moderation individually,
> based on actual local and total interrupt rates and number of active CPUs.
> 
> The purpose of this global mechanism is to figure out whether we are
> approaching a dangerous rate, and do individual tuning.

Makes sense, but I'm not convinced yet that this needs to be as complex
as it is.

>> > +     /* Compare with global and per-CPU targets. */
>> > +     global_rate_high = irq_rate > target_rate;
>> > +     my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
>> > [...]
>> > +     /* Moderate on this CPU only if both global and local rates are high. */
>>
>> Because it's desired that CPUs can be starved by interrupts when enough
>> other CPUs only have a very low rate? I'm failing to understand that
>> logic and the comprehensive rationale in the change log does not help either.
>
> The comment could be worded better, s/moderate/bump delay up/
>
> The mechanism aims to make total_rate < target, by gently kicking
> individual delays up or down based on the condition
>  total_rate > target && local_rate > target / active_cpus ? bump_up()
> : bump_down()
>
> If the control is effective, the total rate will be within bound and
> nobody suffers,
> neither the CPUs handing <1K intr/s, nor the lonely CPU handling 100K+ intr/s
>
> If suddenly the rates go up, the CPUs with higher rates will be moderated more,
> hopefully converging to a new equilibrium.
> As any control system it has limits on what it can do.

I understand that, but without proper information in code and change log
anyone exposed to this code 6 months down the road will bump his head on
the wall when staring at it (including you).

>> > [...]
>> > +     if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
>> > +             below_target = false;
>> > +
>> > +     if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
>> > +             below_target = false;
>>
>> So that rate limits a per CPU overload, but only when IRQTIME accounting
>> is enabled. Oh well...
>
> I can add checks to disallow setting the per-CPU overload when IRQTIME
> accounting is not present.

That solves what? It disables the setting, but that does not make the
functionality any different. Also the compiler is smart enough to
eliminate all that code because the return value of hardirq_high() is
constant.

>> > +     } else {
>> > +             /* Exponential grow does not restart if value is too small. */
>> > +             if (ms->mod_ns < 500)
>> > +                     ms->mod_ns = 500;
>> > +             ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
>> > +             if (ms->mod_ns > ms->delay_ns)
>> > +                     ms->mod_ns = ms->delay_ns;
>> > +     }
>>
>> Why does this need separate grow and decay factors? Just because more
>> knobs are better?
>
> Like in TCP, brake aggressively (grow factor is smaller) to respond
> quickly to overload,
> and accelerate prudently (decay factor is higher) to avoid reacting
> too optimistically.

Why do I have to ask for all this information piecewise?

Thanks,

        tglx