[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <25e02043-257d-be82-b562-89ab896ab555@redhat.com>
Date: Fri, 30 Sep 2022 09:05:16 +0200
From: Hans de Goede <hdegoede@...hat.com>
To: Zhang Xincheng <zhangxincheng@...ontech.com>, tglx@...utronix.de
Cc: linux-kernel@...r.kernel.org, maz@...nel.org,
oleksandr@...alenko.name, bigeasy@...utronix.de,
mark.rutland@....com, michael@...le.cc
Subject: Re: [PATCH] interrupt: discover and disable very frequent interrupts
Hi,
On 9/30/22 08:40, Zhang Xincheng wrote:
> From: zhangxincheng <zhangxincheng@...ontech.com>
>
> In some cases, a peripheral's interrupt will be triggered frequently,
> which will keep the CPU processing the interrupt and eventually cause
> the RCU to report rcu_sched self-detected stall on the CPU.
>
> [ 838.131628] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 838.137189] rcu: 0-....: (194839 ticks this GP) idle=f02/1/0x4000000000000004
> softirq=9993/9993 fqs=97428
> [ 838.146912] rcu: (t=195015 jiffies g=6773 q=0)
> [ 838.151516] Task dump for CPU 0:
> [ 838.154730] systemd-sleep R running task 0 3445 1 0x0000000a
>
> Signed-off-by: zhangxincheng <zhangxincheng@...ontech.com>
> Change-Id: I9c92146f2772eae383c16c8c10de028b91e07150
> Signed-off-by: zhangxincheng <zhangxincheng@...ontech.com>
> ---
> include/linux/irqdesc.h | 2 ++
> kernel/irq/spurious.c | 52 +++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 54 insertions(+)
>
> diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
> index 1cd4e36890fb..a3bd521c3557 100644
> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -102,6 +102,8 @@ struct irq_desc {
> int parent_irq;
> struct module *owner;
> const char *name;
> + u32 gap_count;
> + u64 gap_time;
> } ____cacheline_internodealigned_in_smp;
>
> #ifdef CONFIG_SPARSE_IRQ
> diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
> index 02b2daf07441..b7162a10d92c 100644
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
> @@ -222,6 +222,38 @@ static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret)
> raw_spin_unlock_irqrestore(&desc->lock, flags);
> }
>
> +/*
> + * Some bad hardware will trigger interrupts very frequently, which will
> + * cause the CPU to process hardware interrupts all the time. So when
> + * we find this out, the interrupt should be disabled.
> + */
> +static void __report_so_frequently_irq(struct irq_desc *desc, irqreturn_t action_ret)
> +{
> + unsigned int irq = irq_desc_get_irq(desc);
> + struct irqaction *action;
> + unsigned long flags;
> +
> + printk(KERN_ERR "irq %d: triggered too frequently\n",irq);
> + dump_stack();
> + printk(KERN_ERR "handlers:\n");
> +
> + /*
> + * We need to take desc->lock here. note_interrupt() is called
> + * w/o desc->lock held, but IRQ_PROGRESS set. We might race
> + * with something else removing an action. It's ok to take
> + * desc->lock here. See synchronize_irq().
> + */
> + raw_spin_lock_irqsave(&desc->lock, flags);
> + for_each_action_of_desc(desc, action) {
> + printk(KERN_ERR "[<%p>] %pf", action->handler, action->handler);
> + if (action->thread_fn)
> + printk(KERN_CONT " threaded [<%p>] %pf",
> + action->thread_fn, action->thread_fn);
> + printk(KERN_CONT "\n");
> + }
> + raw_spin_unlock_irqrestore(&desc->lock, flags);
> +}
> +
This is basically a copy of __report_bad_irq() please instead just give
__report_bad_irq() a "const char *msg" argument and drop this copy.
Note __report_bad_irq() is currently called twice, once from
report_bad_irq() and in that case "if (bad_action_ret(action_ret))" is
always true, so inside report_bad_irq() you will now want to use:
__report_bad_irq(desc, action_ret, KERN_ERR "irq event %d: bogus return value %x\n");
And then there is a second call in note_interrupt() in which case
"if (bad_action_ret(action_ret))" is always false so there you
will want to use:
__report_bad_irq(desc, action_ret, KERN_ERR "irq %d: nobody cared (try booting with the \"irqpoll\" option)\n");
And then in __report_bad_irq() itself replace the if with 2 printk() calls with:
if (bad_action_ret(action_ret))
printk(msg, irq, action_ret);
else
printk(msg, irq);
Note the if is still necessary so that if any of this gets inlined
(I sure hope not ...) that the argument count then is correct and
we don't get the compiler warning about the number of arguments
not matching the format-string.
And then replace your __report_so_frequently_irq() call with:
__report_bad_irq(desc, action_ret, KERN_ERR "irq %d: triggered too frequently\n");
> static void report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret)
> {
> static int count = 100;
> @@ -273,6 +305,26 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
> {
> unsigned int irq;
>
> + if((desc->gap_count & 0xffff0000) == 0)
> + desc->gap_time = get_jiffies_64();
> +
> + desc->gap_count ++;
> +
> + if((desc->gap_count & 0x0000ffff) >= 2000) {
> + if((get_jiffies_64() - desc->gap_time) < HZ) {
> + desc->gap_count += 0x00010000;
> + desc->gap_count &= 0xffff0000;
> + } else {
> + desc->gap_count = 0;
> + }
> +
> + if((desc->gap_count >> 16) > 30) {
> + __report_so_frequently_irq(desc, action_ret);
> + irq_disable(desc);
> + }
> + }
> +
> +
I believe this should be moved to below this block:
if (bad_action_ret(action_ret)) {
report_bad_irq(desc, action_ret);
return;
}
Since we don't want this check when polling and we want to keep
using / prefer the existing handling for bad interrupt return
values.
> if (desc->istate & IRQS_POLL_INPROGRESS ||
> irq_settings_is_polled(desc))
> return;
Please note I'm in no way an expert on the interrupt subsystem, so
I have no idea if this patch is a good idea in general. I just noticed
the code duplication which is why I did a partial review of this patch.
Regards,
Hans
Powered by blists - more mailing lists