Message-ID: <87jyzo757y.ffs@tglx>
Date: Mon, 17 Nov 2025 21:51:29 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Luigi Rizzo <lrizzo@...gle.com>, Marc Zyngier <maz@...nel.org>, Luigi
 Rizzo <rizzo.unipi@...il.com>, Paolo Abeni <pabeni@...hat.com>, Andrew
 Morton <akpm@...ux-foundation.org>, Sean Christopherson
 <seanjc@...gle.com>, Jacob Pan <jacob.jun.pan@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org, Bjorn Helgaas
 <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>, Luigi Rizzo
 <lrizzo@...gle.com>
Subject: Re: [PATCH v2 4/8] genirq: soft_moderation: implement adaptive
 moderation

On Sun, Nov 16 2025 at 18:28, Luigi Rizzo wrote:
> Add two control parameters (target_irq_rate and hardirq_percent)
> to indicate the desired maximum values for these two metrics.
>
> Every update_ms the hook in handle_irq_event() recomputes the total and
> local interrupt rate and the amount of time spent in hardirq, compares
> the values with the targets, and adjusts the moderation delay up or down.
>
> The interrupt rate is computed in a scalable way by counting interrupts
> per-CPU, and aggregating the value in a global variable only every
> update_ms. Only CPUs that actively process interrupts are actually
> accessing the shared variable, so the system scales well even on very
> large servers.

You still fail to explain why this is required, why per-CPU moderation
is not sufficient, and what the benefits of the approach are.

I'm happy to refer you to Documentation once more. It's well explained
there.

> +/* Measure and assess time spent in hardirq. */
> +static inline bool hardirq_high(struct irq_mod_state *ms, u64 delta_time, u32 hardirq_percent)
> +{
> +	bool above_threshold = false;
> +
> +	if (IS_ENABLED(CONFIG_IRQ_TIME_ACCOUNTING)) {
> +		u64 irqtime, cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];
> +
> +		irqtime = cur - ms->last_irqtime;
> +		ms->last_irqtime = cur;
> +
> +		above_threshold = irqtime * 100 > delta_time * hardirq_percent;
> +		ms->hardirq_high += above_threshold;
> +	}
> +	return above_threshold;
> +}
> +
> +/* Measure and assess total and per-CPU interrupt rates. */
> +static inline bool irqrate_high(struct irq_mod_state *ms, u64 delta_time,
> +				u32 target_rate, u32 steps)
> +{
> +	u64 irq_rate, my_irq_rate, tmp, delta_irqs, active_cpus;
> +	bool my_rate_high, global_rate_high;
> +
> +	my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
> +	/* Accumulate global counter and compute global irq rate. */
> +	tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
> +	ms->irq_count = 1;
> +	delta_irqs = tmp - ms->last_total_irqs;
> +	ms->last_total_irqs = tmp;
> +	irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
> +
> +	/*
> +	 * Count how many CPUs handled interrupts in the last interval, needed
> +	 * to determine the per-CPU target (target_rate / active_cpus).
> +	 * Each active CPU increments the global counter approximately every
> +	 * update_ns. Scale the value by (update_ns / delta_time) to get the
> +	 * correct value. Also apply rounding and make sure active_cpus > 0.
> +	 */
> +	tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
> +	active_cpus = tmp - ms->last_total_cpus;
> +	ms->last_total_cpus = tmp;
> +	active_cpus = (active_cpus * ms->update_ns + delta_time / 2) / delta_time;
> +	if (active_cpus < 1)
> +		active_cpus = 1;
> +
> +	/* Compare with global and per-CPU targets. */
> +	global_rate_high = irq_rate > target_rate;
> +	my_rate_high = my_irq_rate * active_cpus * irq_mod_info.scale_cpus > target_rate * 100;
> +
> +	/* Statistics. */
> +	smooth_avg(&ms->irq_rate, irq_rate, steps);
> +	smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
> +	smooth_avg(&ms->scaled_cpu_count, active_cpus * 256, steps);
> +	ms->my_irq_high += my_rate_high;
> +	ms->irq_high += global_rate_high;
> +
> +	/* Moderate on this CPU only if both global and local rates are high. */

Because it's desired that CPUs can be starved by interrupts when enough
other CPUs have only a very low rate? I'm failing to understand that
logic, and the comprehensive rationale in the change log does not help
either.
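
To put numbers on that (illustrative only, and assuming scale_cpus is
100 so it cancels against the * 100): 64 CPUs, target_irq_rate set to
640000. 32 CPUs take 1000 interrupts/s each and one CPU takes 200000/s.
The global rate is ~232000, well below the target, so global_rate_high
stays false and the "&&" below never moderates the hot CPU, even though
its local check (200000 * 33 > 640000) fires. The low rates on the
other CPUs effectively veto moderation of the overloaded one.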

> +	return global_rate_high && my_rate_high;
> +}
> +
> +/* Adjust the moderation delay, called at most every update_ns. */
> +void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time)
> +{
> +	u32 hardirq_percent = READ_ONCE(irq_mod_info.hardirq_percent);
> +	u32 target_rate = READ_ONCE(irq_mod_info.target_irq_rate);
> +	bool below_target = true;
> +	u32 steps;
> +
> +	if (target_rate == 0 && hardirq_percent == 0) {
> +		/* Use fixed moderation delay. */
> +		ms->mod_ns = ms->delay_ns;
> +		ms->irq_rate = 0;
> +		ms->my_irq_rate = 0;
> +		ms->scaled_cpu_count = 0;
> +		return;
> +	}
> +
> +	/* Compute decay steps based on elapsed time, bound to a reasonable value. */
> +	steps = delta_time > 10 * ms->update_ns ? 10 : 1 + (delta_time / ms->update_ns);
> +
> +	if (target_rate > 0 && irqrate_high(ms, delta_time, target_rate, steps))
> +		below_target = false;
> +
> +	if (hardirq_percent > 0 && hardirq_high(ms, delta_time, hardirq_percent))
> +		below_target = false;

So this rate-limits a per-CPU overload, but only when IRQTIME accounting
is enabled. Oh well...

> +	/* Controller: adjust delay with exponential decay/grow. */
> +	if (below_target) {
> +		ms->mod_ns -= ms->mod_ns * steps / (steps + irq_mod_info.decay_factor);
> +		if (ms->mod_ns < 100)
> +			ms->mod_ns = 0;

These random numbers used to limit things are truly interesting and
easy to understand - NOT.

> +	} else {
> +		/* Exponential grow does not restart if value is too small. */
> +		if (ms->mod_ns < 500)
> +			ms->mod_ns = 500;
> +		ms->mod_ns += ms->mod_ns * steps / (steps + irq_mod_info.grow_factor);
> +		if (ms->mod_ns > ms->delay_ns)
> +			ms->mod_ns = ms->delay_ns;
> +	}

Why does this need separate grow and decay factors? Just because more
knobs are better?
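
For reference, with steps = 1 the code above scales the delay by
factor / (factor + 1) on the way down and by (factor + 2) / (factor + 1)
on the way up, so e.g. a factor of 4 means -20% respectively +20% per
update interval. A single factor already gives a symmetric controller.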

> +}
> +
>  /* Moderation timer handler. */
>  static enum hrtimer_restart timer_cb(struct hrtimer *timer)
>  {
> @@ -169,8 +289,10 @@ static inline int set_moderation_mode(struct irq_desc *desc, bool enable)
>  /* Print statistics */
>  static int moderation_show(struct seq_file *p, void *v)
>  {
> +	ulong irq_rate = 0, irq_high = 0, my_irq_high = 0, hardirq_high = 0;
>  	uint delay_us = irq_mod_info.delay_us;
> -	int j;
> +	u64 now = ktime_get_ns();
> +	int j, active_cpus = 0;
>  
>  #define HEAD_FMT "%5s  %8s  %8s  %4s  %4s  %8s  %11s  %11s  %11s  %11s  %11s  %11s  %11s  %9s\n"
>  #define BODY_FMT "%5u  %8u  %8u  %4u  %4u  %8u  %11u  %11u  %11u  %11u  %11u  %11u  %11u  %9u\n"
> @@ -182,6 +304,23 @@ static int moderation_show(struct seq_file *p, void *v)
>  
>  	for_each_possible_cpu(j) {
>  		struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
> +		u64 age_ms = min((now - ms->last_ns) / NSEC_PER_MSEC, (u64)999999);

Not new in this patch: All of those accesses to remote CPU state are
racy and need to be annotated. This clearly never saw any testing with
KCSAN enabled. I'm happy to point to Documentation once again.

What's new is that this one is obviously broken on 32-bit because read
and write of a 64-bit value are split.
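
Something along these lines is what I mean - a sketch only, not a
tested fix, and the seqcount ("seq") is a hypothetical field that would
have to be added to struct irq_mod_state:

	for_each_possible_cpu(j) {
		struct irq_mod_state *ms = per_cpu_ptr(&irq_mod_state, j);
		unsigned int seq;
		u64 last_ns;

		/*
		 * 64-bit timestamp: a plain load tears on 32-bit, so
		 * the owner has to publish it under something like a
		 * seqcount for remote readers.
		 */
		do {
			seq = read_seqcount_begin(&ms->seq);
			last_ns = ms->last_ns;
		} while (read_seqcount_retry(&ms->seq, seq));

		/* Cross-CPU reads and writes want annotations at least */
		irq_rate += READ_ONCE(ms->irq_rate);
		WRITE_ONCE(ms->mod_ns, 0);
	}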

> +		if (age_ms < 10000) {
> +			/* Average irq_rate over recently active CPUs. */
> +			active_cpus++;
> +			irq_rate += ms->irq_rate;
> +		} else {
> +			ms->irq_rate = 0;
> +			ms->my_irq_rate = 0;
> +			ms->scaled_cpu_count = 64;
> +			ms->scaled_src_count = 64;
> +			ms->mod_ns = 0;
> +		}
> +
> +		irq_high += ms->irq_high;
> +		my_irq_high += ms->my_irq_high;
> +		hardirq_high += ms->hardirq_high;
>  
>  		seq_printf(p, BODY_FMT,
>  			   j, ms->irq_rate, ms->my_irq_rate,
> @@ -195,9 +334,34 @@ static int moderation_show(struct seq_file *p, void *v)
>  	seq_printf(p, "\n"
>  		   "enabled              %s\n"
>  		   "delay_us             %u\n"
> -		   "timer_rounds         %u\n",
> +		   "timer_rounds         %u\n"
> +		   "target_irq_rate      %u\n"
> +		   "hardirq_percent      %u\n"
> +		   "update_ms            %u\n"
> +		   "scale_cpus           %u\n"
> +		   "count_timer_calls    %s\n"
> +		   "count_msi_calls      %s\n"
> +		   "decay_factor         %u\n"
> +		   "grow_factor          %u\n",
>  		   str_yes_no(delay_us > 0),
> -		   delay_us, irq_mod_info.timer_rounds);
> +		   delay_us, irq_mod_info.timer_rounds,
> +		   irq_mod_info.target_irq_rate, irq_mod_info.hardirq_percent,
> +		   irq_mod_info.update_ms, irq_mod_info.scale_cpus,
> +		   str_yes_no(irq_mod_info.count_timer_calls),
> +		   str_yes_no(irq_mod_info.count_msi_calls),
> +		   irq_mod_info.decay_factor, irq_mod_info.grow_factor);
> +
> +	seq_printf(p,
> +		   "irq_rate             %lu\n"
> +		   "irq_high             %lu\n"
> +		   "my_irq_high          %lu\n"
> +		   "hardirq_percent_high %lu\n"
> +		   "total_interrupts     %lu\n"
> +		   "total_cpus           %lu\n",
> +		   active_cpus ? irq_rate / active_cpus : 0,
> +		   irq_high, my_irq_high, hardirq_high,
> +		   READ_ONCE(*((ulong *)&irq_mod_info.total_intrs)),
> +		   READ_ONCE(*((ulong *)&irq_mod_info.total_cpus)));

The more I look at this insane amount of telemetry, the more I am
convinced that this is overly complex just for complexity's sake.

Thanks,

        tglx
