Message-ID: <868rm16tbu.wl-maz@kernel.org>
Date: Fri, 30 Sep 2022 11:37:57 +0100
From: Marc Zyngier <maz@...nel.org>
To: "Zhang Xincheng" <zhangxincheng@...ontech.com>
Cc: "tglx" <tglx@...utronix.de>,
"linux-kernel" <linux-kernel@...r.kernel.org>,
"oleksandr" <oleksandr@...alenko.name>,
"Hans de Goede" <hdegoede@...hat.com>,
"bigeasy" <bigeasy@...utronix.de>,
"mark.rutland" <mark.rutland@....com>,
"michael" <michael@...le.cc>
Subject: Re: [PATCH] interrupt: discover and disable very frequent interrupts
On Fri, 30 Sep 2022 10:57:17 +0100,
"=?utf-8?B?WmhhbmcgWGluY2hlbmc=?=" <zhangxincheng@...ontech.com> wrote:
>
> > Irrespective of the patch itself, I would really like to understand
> > why you consider that it is a better course of action to kill a device
> > (and potentially the whole machine) than to let the storm eventually
> > calm down? A frequent interrupt is not necessarily the sign of
> > something going wrong. It is the sign of a busy system. I prefer my
> > systems busy rather than dead.
>
> Because I found that in some cases a peripheral will send interrupts
> to the CPU so frequently that, even though every interrupt is handled
> correctly, the CPU ends up doing nothing but servicing interrupts. At
> the same time, the RCU subsystem reports logs like the following:
>
> [ 838.131628] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 838.137189] rcu: 0-....: (194839 ticks this GP) idle=f02/1/0x4000000000000004 softirq=9993/9993 fqs=97428
> [ 838.146912] rcu: (t=195015 jiffies g=6773 q=0)
> [ 838.151516] Task dump for CPU 0:
> [ 838.154730] systemd-sleep R running task 0 3445 1 0x0000000a
> [ 838.161764] Call trace:
> [ 838.164198] dump_backtrace+0x0/0x190
> [ 838.167846] show_stack+0x14/0x20
> [ 838.171148] sched_show_task+0x134/0x160
> [ 838.175057] dump_cpu_task+0x40/0x4c
> [ 838.178618] rcu_dump_cpu_stacks+0xc4/0x108
> [ 838.182788] rcu_check_callbacks+0x6e4/0x898
> [ 838.187044] update_process_times+0x2c/0x88
> [ 838.191214] tick_sched_handle.isra.5+0x3c/0x50
> [ 838.195730] tick_sched_timer+0x48/0x98
> [ 838.199552] __hrtimer_run_queues+0xec/0x2f8
> [ 838.203808] hrtimer_interrupt+0x10c/0x298
> [ 838.207891] arch_timer_handler_phys+0x2c/0x38
> [ 838.212321] handle_percpu_devid_irq+0x88/0x228
> [ 838.216837] generic_handle_irq+0x2c/0x40
> [ 838.220833] __handle_domain_irq+0x60/0xb8
> [ 838.224915] gic_handle_irq+0x7c/0x178
> [ 838.228650] el1_irq+0xb0/0x140
> [ 838.231778] __do_softirq+0x84/0x2e8
> [ 838.235340] irq_exit+0x9c/0xb8
> [ 838.238468] __handle_domain_irq+0x64/0xb8
> [ 838.242550] gic_handle_irq+0x7c/0x178
> [ 838.246285] el1_irq+0xb0/0x140
> [ 838.249413] resume_irqs+0xfc/0x148
> [ 838.252888] resume_device_irqs+0x10/0x18
> [ 838.256883] dpm_resume_noirq+0x10/0x20
> [ 838.260706] suspend_devices_and_enter+0x170/0x788
> [ 838.265483] pm_suspend+0x41c/0x4cc
> [ 838.268958] state_store+0xbc/0x160
> [ 838.272433] kobj_attr_store+0x14/0x28
> [ 838.276168] sysfs_kf_write+0x40/0x50
> [ 838.279817] kernfs_fop_write+0xcc/0x1e0
> [ 838.283726] __vfs_write+0x18/0x140
> [ 838.287201] vfs_write+0xa4/0x1b0
> [ 838.290503] ksys_write+0x4c/0xb8
> [ 838.293804] __arm64_sys_write+0x18/0x20
> [ 838.297713] el0_svc_common+0x90/0x178
> [ 838.301449] el0_svc_handler+0x9c/0xa8
> [ 838.305184] el0_svc+0x8/0xc
>
> The log above comes from waking a suspended machine. I left the
> machine in this state overnight and it did eventually wake up, and
> /proc/interrupts then showed that a single GPIO interrupt had fired
> more than 1.3 billion times:
>
> 29: 1368200001 0 0 0 0 0 0 0 phytium_gpio6 Edge ACPI:Event
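
Below is a minimal, self-contained userspace sketch of the kind of
rate-based threshold check a patch like this might perform. The window
length, threshold value and every identifier here are invented for
illustration; they are not taken from the patch under discussion or
from the kernel's own spurious-interrupt handling.

/*
 * Hypothetical illustration only: count events per fixed time window
 * and flag a "storm" once the count in one window exceeds a threshold.
 */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define STORM_WINDOW_NS   1000000000ULL  /* 1 s window (made-up value) */
#define STORM_THRESHOLD   100000         /* irqs per window (made-up value) */

struct irq_rate_state {
	uint64_t window_start_ns;  /* start of the current counting window */
	uint64_t count;            /* interrupts seen in the current window */
	bool storm;                /* latched once the threshold is crossed */
};

/* Called once per (simulated) interrupt; returns true once a storm is seen. */
static bool note_irq(struct irq_rate_state *s, uint64_t now_ns)
{
	if (now_ns - s->window_start_ns >= STORM_WINDOW_NS) {
		/* New window: restart the counter. */
		s->window_start_ns = now_ns;
		s->count = 0;
	}

	if (++s->count > STORM_THRESHOLD)
		s->storm = true;

	return s->storm;
}

int main(void)
{
	struct irq_rate_state s = { 0 };
	uint64_t t = 0;

	/* Simulate 200000 interrupts arriving 2 us apart (500 kHz). */
	for (int i = 0; i < 200000; i++) {
		t += 2000;  /* 2 us in ns */
		if (note_irq(&s, t)) {
			printf("storm detected after %d interrupts at t=%llu ns\n",
			       i + 1, (unsigned long long)t);
			break;
		}
	}
	return 0;
}

Whether crossing such a threshold should disable the interrupt at all
is exactly the point being contested in this thread.
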
Again: what makes you think that it is better to kill the interrupt
than to suffer an RCU stall? Yes, that's a lot of interrupts. But
killing it and risking the whole system isn't an acceptable outcome.
M.
--
Without deviation from the norm, progress is not possible.