linux-kernel - Re: [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt Moderation (GSIM)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87sec6368h.ffs@tglx>
Date: Thu, 15 Jan 2026 22:39:58 +0100
From: Thomas Gleixner <tglx@...nel.org>
To: Luigi Rizzo <lrizzo@...gle.com>, Marc Zyngier <maz@...nel.org>, Luigi
 Rizzo <rizzo.unipi@...il.com>, Paolo Abeni <pabeni@...hat.com>, Andrew
 Morton <akpm@...ux-foundation.org>, Sean Christopherson
 <seanjc@...gle.com>, Jacob Pan <jacob.jun.pan@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org, Bjorn Helgaas
 <bhelgaas@...gle.com>, Willem de Bruijn <willemb@...gle.com>, Luigi Rizzo
 <lrizzo@...gle.com>
Subject: Re: [PATCH v4 2/3] genirq: Fixed-delay Global Software Interrupt
 Moderation (GSIM)

On Thu, Jan 15 2026 at 15:59, Luigi Rizzo wrote:
> Some platforms show reduced I/O performance when the total device
> interrupt rate across the entire platform becomes too high.
>
> Interrupt moderation can be used to rate-limit individual interrupts,
> and as a consequence also the total rate. Not all devices implement
> moderation in hardware, or they may have impractically coarse granularity
> (e.g. NVME specifies 100us).
>
> GSIM implements interrupt moderation in software, allowing control of
> interrupt rates even without hardware support. It als provides a building
> block for more sophisticated, adaptive mechanisms.
>
> Moderation is enabled/disabled per interrupt source with
>
>   echo 1 > /proc/irq/NN/soft_moderation # use 0 to disable
>
> For interrupts with moderation enabled, the delay is fixed, and equal
> for all.  It is configured via procfs (range 0-500, 0 means disabled):
>
>   echo delay_us=XX > /proc/irq/soft_moderation

Interesting that you can successfully write into a directory...

> Per CPU statistics on how often moderation is used are available via
>
>   cat /proc/irq/soft_moderation/stats
>
> +/**
> + * struct irq_desc_mod - interrupt moderation information
> + * @ms_node:		per-CPU list of moderated irq_desc

How is that node a per-CPU list?

> + */
> +struct irq_desc_mod {
> +#ifdef CONFIG_IRQ_SW_MODERATION
> +	struct list_head	ms_node;
> +#endif
> +};

> +/* Actually start moderation. */
> +bool irq_moderation_do_start(struct irq_desc *desc, struct irq_mod_state *m)
> +{
> +	lockdep_assert_irqs_disabled();

You rather assert that desc::lock is held, which implies interrupts disabled.

> +	if (!hrtimer_is_queued(&m->timer)) {
> +		const u32 min_delay_ns = 10000;
> +		const u64 slack_ns = 2000;
> +
> +		/* Accumulate sleep time, no moderation if too small. */
> +		m->sleep_ns += m->mod_ns;
> +		if (m->sleep_ns < min_delay_ns)
> +			return false;
> +		/* We need moderation, start the timer. */
> +		m->timer_set++;
> +		hrtimer_start_range_ns(&m->timer, ns_to_ktime(m->sleep_ns),
> +				       slack_ns, HRTIMER_MODE_REL_PINNED_HARD);
> +	}
> +
> +	/*
> +	 * Add to the timer list and __disable_irq() to prevent serving subsequent

__disable_irq() is an interesting new verb.

> +	 * interrupts.
> +	 */
> +	if (!list_empty(&desc->mod_state.ms_node)) {
> +		/* Very unlikely, stray interrupt while moderated. */
> +		m->stray_irq++;

How does that happen? handle_..._irq() returns early when the moderated
flag is set, no? If at all then this needs a WARN_ON_ONCE() because then
the underlying logic is broken.

> +	} else {
> +		m->enqueue++;
> +		list_add(&desc->mod_state.ms_node, &m->descs);
> +		__disable_irq(desc);
> +	}
> +	irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS | IRQD_IRQ_MODERATED);
> +	return true;
> +}
> +
> +/* Initialize moderation state, used in desc_set_defaults() */
> +void irq_moderation_init_fields(struct irq_desc_mod *mod_state)
> +{
> +	INIT_LIST_HEAD(&mod_state->ms_node);
> +}

Why does that have to be a function and not a trivial inline?

> +
> +static int set_mode(struct irq_desc *desc, bool enable)
> +{
> +	struct irq_data *irqd = &desc->irq_data;
> +	struct irq_chip *chip = irqd->chip;
> +
> +	lockdep_assert_held(&desc->lock);
> +
> +	if (!enable) {
> +		irq_settings_clr_and_set(desc, _IRQ_SW_MODERATION, 0);
> +		return 0;
> +	}
> +
> +	/* Moderation is only supported in specific cases. */
> +	enable &= !irqd_is_level_type(irqd);
> +	enable &= irqd_is_single_target(irqd);
> +	enable &= !chip->irq_bus_lock && !chip->irq_bus_sync_unlock;
> +	enable &= chip->irq_mask && chip->irq_unmask;
> +	enable &= desc->handle_irq == handle_edge_irq || desc->handle_irq == handle_fasteoi_irq;
> +	if (!enable)
> +		return -EOPNOTSUPP;
> +
> +	irq_settings_clr_and_set(desc, 0, _IRQ_SW_MODERATION);
> +	return 0;
> +}
> +
> +/* Helpers to set and clamp values from procfs or at init. */
> +struct swmod_param {
> +	const char	*name;
> +	int		(*wr)(struct swmod_param *n, const char __user *s, size_t count);
> +	void		(*rd)(struct seq_file *p);
> +	void		*val;

Why is this a void pointer if all parameters are u32? They better
are because the min/max values are too. But then you would not have to
type cast @val in the read and write functions, which is superior over
proper data types, right?

Aside of that all of this should be unsigned int and not u32 because non
of this is hardware related.

> +	u32		min;
> +	u32		max;
> +};

I'm really not convinced that any of this is actually an improvement
over properly separated write functions.

> +
> +static int swmod_wr_u32(struct swmod_param *n, const char __user *s, size_t count)
> +{
> +	u32 res;
> +	int ret = kstrtouint_from_user(s, count, 0, &res);
> +
> +	if (!ret) {
> +		WRITE_ONCE(*(u32 *)(n->val), clamp(res, n->min, n->max));

This is just wrong. If the value is outside of the valid region, then
this needs to be rejected and not silently clamped.

> +		ret = count;
> +	}
> +	return ret;
> +}
> +
> +static void swmod_rd_u32(struct seq_file *p)
> +{
> +	struct swmod_param *n = p->private;
> +
> +	seq_printf(p, "%u\n", *(u32 *)(n->val));
> +}
> +
> +static int swmod_wr_delay(struct swmod_param *n, const char __user *s, size_t count)
> +{
> +	int ret = swmod_wr_u32(n, s, count);
> +
> +	if (ret >= 0)
> +		update_enable_key();
> +	return ret;
> +}
> +
> +#define HEAD_FMT "%5s  %8s  %11s  %11s  %9s\n"
> +#define BODY_FMT "%5u  %8u  %11u  %11u  %9u\n"
> +
> +#pragma clang diagnostic error "-Wformat"

I told you before that this is bogus:

kernel/irq/irq_moderation.c:217: warning: ignoring ‘#pragma clang diagnostic’ [-Wunknown-pragmas]
  217 | #pragma clang diagnostic error "-Wformat"

Not everyone uses clang.

> +/* Print statistics */
> +static void rd_stats(struct seq_file *p)
> +{
> +	uint delay_us = irq_mod_info.delay_us;
> +	int cpu;
> +
> +	seq_printf(p, HEAD_FMT,
> +		   "# CPU", "delay_ns", "timer_set", "enqueue", "stray_irq");
> +
> +	for_each_possible_cpu(cpu) {
> +		struct irq_mod_state cur;
> +
> +		/* Copy statistics, will only use some 32bit values, races ok. */
> +		data_race(cur = *per_cpu_ptr(&irq_mod_state, cpu));
> +		seq_printf(p, BODY_FMT,
> +			   cpu, cur.mod_ns, cur.timer_set, cur.enqueue, cur.stray_irq);
> +	}
> +
> +	seq_printf(p, "\n"
> +		   "enabled              %s\n"
> +		   "delay_us             %u\n",
> +		   str_yes_no(delay_us > 0),
> +		   delay_us);

Again: Why is this printing anything when it's disabled. There is zero
value for that.

> +}
> +
> +static int moderation_show(struct seq_file *p, void *v)
> +{
> +	struct swmod_param *n = p->private;
> +
> +	if (!n || !n->rd)
> +		return -EINVAL;

How can either of these conditions be true?

> +	n->rd(p);
> +	return 0;
> +}
> +
> +static int moderation_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, moderation_show, pde_data(inode));
> +}
> +
> +static struct swmod_param param_names[] = {
> +	{ "delay_us", swmod_wr_delay, swmod_rd_u32, &irq_mod_info.delay_us, 0, 500 },
> +	{ "stats", NULL, rd_stats},

C89 struct initializers are broken. Either use proper C99 initializers
or use a macro which hides them away.

> +};
> +
> +static ssize_t moderation_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct swmod_param *n = (struct swmod_param *)pde_data(file_inode(f));
> +
> +	return n && n->wr ? n->wr(n, buf, count) : -EINVAL;

How is n = NULL?

> +}
> +
> +static const struct proc_ops proc_ops = {
> +	.proc_open	= moderation_open,
> +	.proc_read	= seq_read,
> +	.proc_lseek	= seq_lseek,
> +	.proc_release	= single_release,
> +	.proc_write	= moderation_write,
> +};
> +
> +/* Handlers for /proc/irq/NN/soft_moderation */
> +static int mode_show(struct seq_file *p, void *v)
> +{
> +	struct irq_desc *desc = p->private;
> +
> +	seq_puts(p, irq_settings_moderation_allowed(desc) ? "on\n" : "off\n");
> +	return 0;
> +}
> +
> +static ssize_t mode_write(struct file *f, const char __user *buf, size_t count, loff_t *ppos)

mode_show() and mode_write() are really as undescriptive as it gets.

> +{
> +	struct irq_desc *desc = (struct irq_desc *)pde_data(file_inode(f));
> +	bool enable;
> +	int ret = kstrtobool_from_user(buf, count, &enable);
> +
> +	if (!ret) {
> +		guard(raw_spinlock_irqsave)(&desc->lock);

raw_spinlock_irq - there is nothing to save. This is thread context.

> +		ret = set_mode(desc, enable);

Why is this set_mode() function 100 lines above instead of being just
next to the usage site? Makes following the code easier, right?

> +	}
> +	return ret ? : count;
> +}


> +/* Used on timer expiration or CPU shutdown. */
> +static void drain_desc_list(struct irq_mod_state *m)
> +{
> +	struct irq_desc *desc, *next;
> +
> +	/* Remove from list and enable interrupts back. */
> +	list_for_each_entry_safe(desc, next, &m->descs, mod_state.ms_node) {
> +		guard(raw_spinlock)(&desc->lock);
> +		list_del_init(&desc->mod_state.ms_node);
> +		irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS | IRQD_IRQ_MODERATED);
> +		/* Protect against competing free_irq(). */

It's not competing. It's concurrent. Can you please use precise wording?

> +		if (desc->action)
> +			__enable_irq(desc);
> +	}
> +}
> +
> +static enum hrtimer_restart timer_callback(struct hrtimer *timer)
> +{
> +	struct irq_mod_state *m = this_cpu_ptr(&irq_mod_state);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	drain_desc_list(m);
> +	/* Prepare to accumulate next moderation delay. */
> +	m->sleep_ns = 0;
> +	return HRTIMER_NORESTART;
> +}
> +
> +/* Hotplug callback for setup. */
> +static int cpu_setup_cb(uint cpu)

unsigned int

> +{
> +	struct irq_mod_state *m = this_cpu_ptr(&irq_mod_state);
> +
> +	hrtimer_setup(&m->timer, timer_callback,
> +		      CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
> +	INIT_LIST_HEAD(&m->descs);
> +	m->moderation_allowed = true;
> +	return 0;
> +}
> +
> +/*
> + * Hotplug callback for shutdown.
> + * Mark the CPU as offline for moderation, and drain the list of masked
> + * interrupts. Any subsequent interrupt on this CPU will not be
> + * moderated, but they will be on the new target.
> + */
> +static int cpu_remove_cb(uint cpu)

Ditto

> +{
> +	struct irq_mod_state *m = this_cpu_ptr(&irq_mod_state);
> +
> +	guard(irqsave)();

Lacks a comment explanaining why this needs to be irqsave. The hotplug
callback does not require that.

> +	m->moderation_allowed = false;
> +	drain_desc_list(m);
> +	hrtimer_cancel(&m->timer);
> +	return 0;
> +}
> +
> +static void(mod_pm_prepare_cb)(void *arg)

What are those brackets around the function name for aside of making the
code unreadable?

> +{
> +	cpu_remove_cb(smp_processor_id());
> +}
> +
> +static void(mod_pm_resume_cb)(void *arg)

Ditto.

> +{
> +	cpu_setup_cb(smp_processor_id());
> +}
> +

> +static void __init clamp_parameter(u32 *dst, u32 val)
> +{
> +	struct swmod_param *n = param_names;
> +
> +	for (int i = 0; i < ARRAY_SIZE(param_names); i++, n++) {
> +		if (dst == n->val) {
> +			WRITE_ONCE(*dst, clamp(val, n->min, n->max));
> +			return;
> +		}
> +	}
> +}

> +static int __init init_irq_moderation(void)
> +{
> +	struct proc_dir_entry *dir;
> +	struct swmod_param *n;
> +	int i;
> +
> +	/* Clamp all initial values to the allowed range. */
> +	for (uint *cur = &irq_mod_info.delay_us; cur < irq_mod_info.params_end; cur++)
> +		clamp_parameter(cur, *cur);

This is required because someone might have put an invalid value into
the struct initializer, right?

> +	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "irq_moderation", cpu_setup_cb, cpu_remove_cb);

Can fail....

> +	register_pm_notifier(&mod_nb);

Ditto.

> +	update_enable_key();

This is required to make sure it's disabled as it might have enabled
itself magically?

> +	dir = proc_mkdir("irq/soft_moderation", NULL);
> +	if (!dir)
> +		return 0;
> +	for (i = 0, n = param_names; i < ARRAY_SIZE(param_names); i++, n++)
> +		proc_create_data(n->name, n->wr ? 0644 : 0444, dir, &proc_ops, n);

Can fail too.

> +	return 0;
> +}
> +

Pointless new line.

> +device_initcall(init_irq_moderation);

> +/**
> + * struct irq_mod_info - global configuration parameters and state
> + * @delay_us:		maximum delay
> + * @update_ms:		how often to update delay (epoch duration)
> + */
> +struct irq_mod_info {
> +	u32		delay_us;
> +	u32		update_ms;
> +	u32		params_end[];

What is this params_end[] for? A made up substitute for ARRAY_SIZE()?

> +};
> +
> +extern struct irq_mod_info irq_mod_info;

> +static inline void check_epoch(struct irq_mod_state *m)
> +{
> +	const u64 now = ktime_get_ns(), epoch_ns = now - atomic64_read(&m->epoch_start_ns);
> +	const u32 slack_ns = 5000;
> +
> +	/* Run approximately every update_ns, a little bit early is ok. */
> +	if (epoch_ns < m->update_ns - slack_ns)
> +		return;
> +	/* Fetch updated parameters. */
> +	m->update_ns = READ_ONCE(irq_mod_info.update_ms) * NSEC_PER_MSEC;
> +	m->mod_ns = READ_ONCE(irq_mod_info.delay_us) * NSEC_PER_USEC;
> +}

TBH, this is the most convoluted way to handle global parameters which I
have seen in a long time.

First you need the intr_count & 0xF check, then you read time and
subtract the start and have another comparison to avoid reading and
multiplying the global parameters on every invocation.

Just to make it more interresting:

     epoch_start_ns is never set, so this time based rate limit does not
     work at all.

I don't care whether you "fix" this in the next patch or not. Patches
have to be self contained, functional and comprehensible on their own.

That's just completely bonkers. These variables are updated by user
space every now and then, if at all.

So either you read directly from the global struct (you need obviously
nano seconds converted struct members for that) or you update the per
CPU instances from the write functions or you use a sequence counter as
everyone else does.

Thanks,

        tglx