linux-kernel - Re: [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87o6p6ck7q.ffs@tglx>
Date: Thu, 13 Nov 2025 11:15:05 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Luigi Rizzo <lrizzo@...gle.com>, Marc Zyngier <maz@...nel.org>, Luigi
 Rizzo <rizzo.unipi@...il.com>, Paolo Abeni <pabeni@...hat.com>, Andrew
 Morton <akpm@...ux-foundation.org>, Sean Christopherson
 <seanjc@...gle.com>, Jacob Pan <jacob.jun.pan@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org, Bjorn Helgaas
 <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>, Luigi Rizzo
 <lrizzo@...gle.com>
Subject: Re: [PATCH 4/6] genirq: soft_moderation: implement adaptive moderation

On Wed, Nov 12 2025 at 19:24, Luigi Rizzo wrote:
> Add two control parameters (target_irq_rate and hardirq_percent)
> to indicate the desired maximum values for these two metrics.
>
> Every update_ms the hook in handle_irq_event() recomputes the total and
> local interrupt rate and the amount of time spent in hardirq, compares
> the values with the targets, and adjusts the moderation delay up or down.
>
> The interrupt rate is computed in a scalable way by counting interrupts
> per-CPU, and aggregating the value in a global variable only every
> update_ms. Only CPUs that actively process interrupts are actually
> accessing the shared variable, so the system scales well even on very
> large servers.

This explains the what. But it does not provide any rationale for it.

> +/* Adjust the moderation delay, called at most every update_ns */
> +void __irq_moderation_adjust_delay(struct irq_mod_state *ms, u64 delta_time, u64 update_ns)
> +{
> +	/* Fetch configuration */
> +	u32 target_rate = clamp(irq_mod_info.target_irq_rate, 0u, 50000000u);
> +	u32 hardirq_percent = clamp(irq_mod_info.hardirq_percent, 0u, 100u);

This is all racy against a concurrent write, so that requires
READ_ONCE(), no?

> +	bool below_target = true;
> +	/* Compute decay steps based on elapsed time */
> +	u32 steps = delta_time > 10 * update_ns ? 10 : 1 + (delta_time / update_ns);
> +
> +	if (target_rate == 0 && hardirq_percent == 0) {	/* use fixed delay */
> +		ms->mod_ns = ms->delay_ns;
> +		ms->irq_rate = 0;
> +		ms->my_irq_rate = 0;
> +		ms->cpu_count = 0;
> +		return;
> +	}
> +
> +	if (target_rate > 0) {	/* control total and individual CPU rate */
> +		u64 irq_rate, my_irq_rate, tmp, delta_irqs, num_cpus;
> +		bool my_rate_ok, global_rate_ok;
> +
> +		/* Update global number of interrupts */
> +		if (ms->irq_count < 1)	/* make sure it is always > 0 */

And how does it become < 1?

> +			ms->irq_count = 1;
> +		tmp = atomic_long_add_return(ms->irq_count, &irq_mod_info.total_intrs);
> +		delta_irqs = tmp - ms->last_total_irqs;
> +
> +		/* Compute global rate, check if we are ok */
> +		irq_rate = (delta_irqs * NSEC_PER_SEC) / delta_time;
> +		global_rate_ok = irq_rate < target_rate;
> +
> +		ms->last_total_irqs = tmp;

I really had to find this update. Can you please just keep that stuff
together?

        tmp = ...;
        delta = tmp - ms->xxxx;
        ms->xxxx = tmp;

> +		/*
> +		 * num_cpus is the number of CPUs actively handling interrupts
> +		 * in the last interval. CPUs handling less than the fair share
> +		 * target_rate / num_cpus do not need to be throttled.
> +		 */
> +		tmp = atomic_long_add_return(1, &irq_mod_info.total_cpus);
> +		num_cpus = tmp - ms->last_total_cpus;
> +		/* scale proportionally to time, reduce errors if we are idle for too long */
> +		num_cpus = 1 + (num_cpus * update_ns + delta_time / 2) / delta_time;

I still fail to see why this global scale is required and how it is
correct. This can starve out a particular CPU which handles the bulk of
interrupts, no?

> +		/* Short intervals may underestimate sources. Apply a scale factor */
> +		num_cpus = num_cpus * get_scale_cpus() / 100;
> +
> +		/* Compute our rate, check if we are ok */
> +		my_irq_rate = (ms->irq_count * NSEC_PER_SEC) / delta_time;
> +		my_rate_ok = my_irq_rate * num_cpus < target_rate;
> +
> +		ms->irq_count = 1;	/* reset for next cycle */
> +		ms->last_total_cpus = tmp;
> +
> +		/* Use instantaneous rates to react. */
> +		below_target = global_rate_ok || my_rate_ok;
> +
> +		/* Statistics (rates are smoothed averages) */
> +		smooth_avg(&ms->irq_rate, irq_rate, steps);
> +		smooth_avg(&ms->my_irq_rate, my_irq_rate, steps);
> +		smooth_avg(&ms->cpu_count, num_cpus * 256, steps); /* scaled */
> +		ms->my_irq_high += !my_rate_ok;
> +		ms->irq_high += !global_rate_ok;
> +	}
> +
> +	if (hardirq_percent > 0) {		/* control time spent in hardirq */
> +		u64 cur = kcpustat_this_cpu->cpustat[CPUTIME_IRQ];

Depends on CONFIG_IRQ_TIME_ACCOUNTING=y, no?

> +		u64 irqtime = cur - ms->last_irqtime;
> +		bool hardirq_ok = irqtime * 100 < delta_time * hardirq_percent;
> +
> +		below_target &= hardirq_ok;
> +		ms->last_irqtime = cur;
> +		ms->hardirq_high += !hardirq_ok;	/* statistics */
> +	}
> +
> +	/* Controller: move mod_ns up/down if we are above/below target */
> +	if (below_target) {
> +		ms->mod_ns -= ms->mod_ns * steps / (steps + get_decay_factor());
> +		if (ms->mod_ns < 100)
> +			ms->mod_ns = 0;
> +	} else if (ms->mod_ns < 500) {
> +		ms->mod_ns = 500;

Random numbers pulled out of thin air. That's all over the place.

> +	} else {
> +		ms->mod_ns += ms->mod_ns * steps / (steps + get_grow_factor());
> +		if (ms->mod_ns > ms->delay_ns)
> +			ms->mod_ns = ms->delay_ns;	/* cap to delay_ns */
> +	}
> +}

Thanks,

        tglx