linux-kernel - Re: [PATCHv2 1/2] watchdog/softlockup: low-overhead detection of interrupt storm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <42d0e3f4-39e0-4404-87d9-7df89f8b8c0c@linux.alibaba.com>
Date: Wed, 31 Jan 2024 19:06:07 +0800
From: Bitao Hu <yaoma@...ux.alibaba.com>
To: Liu Song <liusong@...ux.alibaba.com>, dianders@...omium.org,
 akpm@...ux-foundation.org, pmladek@...e.com, kernelfans@...il.com
Cc: linux-kernel@...r.kernel.org
Subject: Re: [PATCHv2 1/2] watchdog/softlockup: low-overhead detection of
 interrupt storm



On 2024/1/31 09:19, Liu Song wrote:
> 
> 在 2024/1/30 15:47, Bitao Hu 写道:
>> The following softlockup is caused by interrupt storm, but it cannot be
>> identified from the call tree. Because the call tree is just a snapshot
>> and doesn't fully capture the behavior of the CPU during the soft lockup.
>>    watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>>    ...
>>    Call trace:
>>      __do_softirq+0xa0/0x37c
>>      __irq_exit_rcu+0x108/0x140
>>      irq_exit+0x14/0x20
>>      __handle_domain_irq+0x84/0xe0
>>      gic_handle_irq+0x80/0x108
>>      el0_irq_naked+0x50/0x58
>>
>> Therefore，I think it is necessary to report CPU utilization during the
>> softlockup_thresh period (report once every sample_period, for a total
>> of 5 reportings), like this:
>>    watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>>    CPU#28 Utilization every 4s during lockup:
>>      #1: 0% system, 0% softirq, 100% hardirq, 0% idle
>>      #2: 0% system, 0% softirq, 100% hardirq, 0% idle
>>      #3: 0% system, 0% softirq, 100% hardirq, 0% idle
>>      #4: 0% system, 0% softirq, 100% hardirq, 0% idle
>>      #5: 0% system, 0% softirq, 100% hardirq, 0% idle
>>    ...
>>
>> This would be helpful in determining whether an interrupt storm has
>> occurred or in identifying the cause of the softlockup. The criteria for
>> determination are as follows:
>>    a. If the hardirq utilization is high, then interrupt storm should be
>>    considered and the root cause cannot be determined from the call tree.
>>    b. If the softirq utilization is high, then we could analyze the call
>>    tree but it may cannot reflect the root cause.
>>    c. If the system utilization is high, then we could analyze the root
>>    cause from the call tree.
>>
>> Signed-off-by: Bitao Hu <yaoma@...ux.alibaba.com>
>> ---
>>   kernel/watchdog.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 72 insertions(+)
>>
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index 81a8862295d6..0efe9604c3c2 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -23,6 +23,8 @@
>>   #include <linux/sched/debug.h>
>>   #include <linux/sched/isolation.h>
>>   #include <linux/stop_machine.h>
>> +#include <linux/kernel_stat.h>
>> +#include <linux/math64.h>
>>   #include <asm/irq_regs.h>
>>   #include <linux/kvm_para.h>
>> @@ -441,6 +443,73 @@ static int is_softlockup(unsigned long touch_ts,
>>       return 0;
>>   }
>> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
>> +#define NUM_STATS_GROUPS    5
>> +#define STATS_SYSTEM        0
>> +#define STATS_SOFTIRQ        1
>> +#define STATS_HARDIRQ        2
>> +#define STATS_IDLE        3
>> +#define NUM_STATS_PER_GROUP    4
> This is a set of related numbers; wouldn't it be better to use an enum?
Agree.
>> +static DEFINE_PER_CPU(u16, cpustat_old[NUM_STATS_PER_GROUP]);
>> +static DEFINE_PER_CPU(u8, 
>> cpustat_utilization[NUM_STATS_GROUPS][NUM_STATS_PER_GROUP]);
>> +static DEFINE_PER_CPU(u8, cpustat_tail);
>> +static enum cpu_usage_stat idx_to_stat[NUM_STATS_PER_GROUP] = {
>> +    CPUTIME_SYSTEM, CPUTIME_SOFTIRQ, CPUTIME_IRQ, CPUTIME_IDLE
>> +};
> To be honest, I'm not particularly fond of the name 'idx_to_stat' as the 
> concept of an
> index is already implied by the nature of an array, so adding 'idx' is 
> redundant.
> I suggest shortening the name.
OK, the name 'stats' is clear enough here.
> 
>> +
>> +static void update_cpustat(void)
>> +{
>> +    u8 i;
>> +    u16 *old = this_cpu_ptr(cpustat_old);
>> +    u8 (*utilization)[NUM_STATS_PER_GROUP] = 
>> this_cpu_ptr(cpustat_utilization);
>> +    u8 tail = this_cpu_read(cpustat_tail);
>> +    struct kernel_cpustat kcpustat;
>> +    u64 *cpustat = kcpustat.cpustat;
>> +    u16 sample_period_ms = sample_period >> 24LL; /* 2^24ns ~= 16.8ms */
> 
> There are two instances where right shift operations are used; it is 
> suggested to employ a helper macro for a more comfortable look.
OK.
> 
> 
>> +
>> +    kcpustat_cpu_fetch(&kcpustat, smp_processor_id());
>> +    for (i = STATS_SYSTEM; i < NUM_STATS_PER_GROUP; i++) {
>> +        /*
>> +         * We don't need nanosecond resolution. A granularity of 16ms is
>> +         * sufficient for our precision, allowing us to use u16 to store
>> +         * cpustats, which will roll over roughly every ~1000 seconds.
>> +         * 2^24 ~= 16 * 10^6
>> +         */
>> +        cpustat[idx_to_stat[i]] = 
>> lower_16_bits(cpustat[idx_to_stat[i]] >> 24LL);
>> +        utilization[tail][i] = 100 * (u16)(cpustat[idx_to_stat[i]] - 
>> old[i])
>> +                    / sample_period_ms;
>> +        old[i] = cpustat[idx_to_stat[i]];
>> +    }
>> +    this_cpu_write(cpustat_tail, (tail + 1) % NUM_STATS_GROUPS);
>> +}
>> +
>> +static void print_cpustat(void)
>> +{
>> +    u8 i, j;
>> +    u8 (*utilization)[NUM_STATS_PER_GROUP] = 
>> this_cpu_ptr(cpustat_utilization);
>> +    u8 tail = this_cpu_read(cpustat_tail);
>> +    u64 sample_period_second = sample_period;
>> +
>> +    do_div(sample_period_second, NSEC_PER_SEC);
>> +    /*
>> +     * We do not want the "watchdog: " prefix on every line,
>> +     * hence we use "printk" instead of "pr_crit".
>> +     */
>> +    printk(KERN_CRIT "CPU#%d Utilization every %llus during lockup:\n",
>> +        smp_processor_id(), sample_period_second);
>> +    for (j = STATS_SYSTEM, i = tail; j < NUM_STATS_GROUPS;
>> +        j++, i = (i + 1) % NUM_STATS_GROUPS) {
>> +        printk(KERN_CRIT "\t#%d: %3u%% system,\t%3u%% softirq,\t"
>> +            "%3u%% hardirq,\t%3u%% idle\n", j+1,
>> +            utilization[i][STATS_SYSTEM], utilization[i][STATS_SOFTIRQ],
>> +            utilization[i][STATS_HARDIRQ], utilization[i][STATS_IDLE]);
>> +    }
>> +}
>> +#else
>> +static inline void update_cpustat(void) { }
>> +static inline void print_cpustat(void) { }
>> +#endif
>> +
>>   /* watchdog detector functions */
>>   static DEFINE_PER_CPU(struct completion, softlockup_completion);
>>   static DEFINE_PER_CPU(struct cpu_stop_work, softlockup_stop_work);
>> @@ -504,6 +573,8 @@ static enum hrtimer_restart 
>> watchdog_timer_fn(struct hrtimer *hrtimer)
>>        */
>>       period_ts = READ_ONCE(*this_cpu_ptr(&watchdog_report_ts));
>> +    update_cpustat();
>> +
>>       /* Reset the interval when touched by known problematic code. */
>>       if (period_ts == SOFTLOCKUP_DELAY_REPORT) {
>>           if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
>> @@ -539,6 +610,7 @@ static enum hrtimer_restart 
>> watchdog_timer_fn(struct hrtimer *hrtimer)
>>           pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
>>               smp_processor_id(), duration,
>>               current->comm, task_pid_nr(current));
>> +        print_cpustat();
>>           print_modules();
>>           print_irqtrace_events(current);
>>           if (regs)